Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Deep Learning in Natural Language Processing
Ashutosh VyasAssistant Manager, Mphasis NEXTlabs
2
AbstractDeep learning has emerged as a highly efficient technique in
machine learning to perform text analytics, which includes
text classification, text sequence learning, sentiment
analysis, question-answer engine, etc.
This paper presents the concept and steps of using deep
learning in NLP, ways to pull out the features from the text,
ways to prepare data for deep learning and popular methods
that are being used for NLP, including RNN, CNN, LSTM.
We also present the results of a working example based on
a movie review shared by IMDB.
Feature RepresentationWhen feeding textual data to Artificial Neural
Networks (ANN), we need to consider few points
about the way we create the data. We input the data
vector x in-dimensions and generate the output in
out-dimensions. Instead of creating an individual dimension
for each and every feature, we need to embed each feature
into a D-Dimensional space and represent as a dense vector
in the space.
The main advantage of dense vector representation is that
if there exists a similarity sequence which we can learn, that
can be captured best in dense vector representation.
In the example presented, we used continuous-bag-of-
words (CBOW) approach to build our dense vectors.
Feature ExtractionLet us now look at some of the methods used to extract
features from the given text:
CBOW: Continuous bag of words is a dense vector based
feature extraction approach. Neural networks assume a
fixed input. Thus for training the NN we need to extract
features from textual data. CBOW approach enables feature
extraction by expressing each word in the training data set
as unique feature. This creates a vector representation
for each sentence to be used in training NN. For specific
importance association we provide a weighted vector as an
input to NN. This weight association is done majorly using
two approaches – summing or averaging of embedded
features and weighted averaging of embedded features. An
example is tf-idf of collection of bag of words and giving
weight to different dimensions based upon importance.
Note: We can adopt other methods also as discussed in
section titled 'Word Embedding'.
Distance and Position Features: The distance can
be calculated by subtracting the feature vectors of the
corresponding words, and this information can be used to
train the Neural Networks.
Dimensionality: Decision needs to be made on the
dimensionality of the embedded vectors. It is a trade-off
between processing speed and task accuracy.
In our example, we used the context window of ten words.
IntroductionIn recent years multi-layer neural networks are gaining back their importance in the field of Natural Language Processing (NLP), as they are becoming capable of generating state-of-the-art results with a relatively decent speed due to utilization of Graphical Processing Units. The introduction of different methods in multi-layer neural network, such as Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Longest State Time Memory Neural Networks (LSTM) has proved to be very efficient in generating better results.
In this paper, we will discuss different methods that can be adopted to perform NLP using deep learning methodologies. We will start our discussion with representation methods available to present the data, then we move to feed forward NN, RNN, CNN and see how to use these methods for NLP.
3
Training FunctionsGiven below are the types of functions that are used in
modeling deep learning neural networks. The selection of
these functions depends majorly on the problem statement.
Non-linear Functions
Neural networks are universal approximators and thus can
approximate any non-zero linear or non-linear function.
The two basic things that should be kept in mind before
selecting a non-linear function are:
1) It should cover the entire input range, from negative
infinity to positive infinity, as its domain.
2) Its output should be bounded in a set range.
The popular non-linear functions used are: Sigmoid, Hard
Tanh, Hyperbolic tangent, Rectifier, Relu.
Output Transformation Functions
In many cases we use a transformation function in the outer
most layer of neural network.
The most popular function used here is softmax, to represent
multi-class classification.
Loss Function
When training the neural network, we require a loss function
which acts as an indicator to identify how far is the network
trained output from the actual output.
Selection of lost function depends upon the type,
requirement and extent of tolerance of loss function.
Most popular loss functions are: Hinge-binary,
Hinge-multiclass, log-loss, categorical cross entropy loss,
ranking loss.
Word Embedding Let us now look at the methodology to represent each of
the features into a low dimensionality space. This technique
is popularly used for sequential pattern training of the
sentences in the documents.
Continuous-Bag-of-Words
This approach is used to capture the sequential information
in the sentences. The process is as follows:
Total number of unique words are collected and all the words
are given a vector representation, this is called one hot
vector representation. Depending upon the window size,
words that are sequential part of a sentence are collected
and a training model is developed. E.g.:
A quick brown fox jumps right over a lazy dog.
In the above sentence, suppose we represent a word
with one word to its left and one to its right, then
the training data would look like:
[(a, brown),quick],[(quick,fox),brown] etc.
Then a neural network is taken with input layer's
size = C(Context)*V(vocabulary size) and a hidden layer's
size = N(heuristic) and output layer's
size = V(vocabulary size).
An optimization technique like gradient descent, is used to
train the model.
Figure 1 represents single word context CBOW.
X1
h1
WV*N={wki} WN*V={wy}
h2
h2
hN
y1
y2
y3
yj
yv
X2
X3
Xy
Xv
Input Layer Hidden Layer Output Layer
Figure 1
4
Recurrent Neural NetworksA Recurrent Neural Network (RNN) is a type of neural
network where connections between units form a directed
cycle. This allows NN to create an internal state of the
network which allows it to exhibit dynamic temporal
behavior. RNNs can use their internal memory to process
arbitrary sequences of inputs. This makes them applicable
to tasks such as unregimented connected handwriting
recognition or speech recognition.
RNNs are neural nets that can deal with sequences of
variable length. They are able to perform this by defining a
recurrence relationship over timestamps which is typically
the following formula:
Sk=f (Sk−1xWrec+Xk xWx)
Above equation forms the basic structure of RNN. The
previous state result is used to calculate the current
state output.
Figure 2 represents a fully connected recurrent neural net.
In our example, we used LSTM, a version of RNN due to the
presence of vanishing gradient problem in RNN.
Convolutional Neural NetworksConvolutional neural networks are a type of artificial neural
networks that use convolution methodology to extract the
features from the input data to increase the number of
features.
The basic steps are as follows: Considering the input
data set, we decide and select N number of convolution
functions, also known as filters, where N = maximum
number of features we want. After the selection of filter
functions we convolute the input with the filters. The output
from the convolution is a D-Dimensional vector, which is
then sent for pooling.
Pooling is a way of selecting most relevant features from
the set of given features. Mostly for text analytics we select
the features with max or average values.
Once the best features are selected, the normal architecture
of ANN is followed.
Convolutional NN are majorly used for image classification,
text classification, etc.
Figure 3 demonstrates a fully connected, multilayered
filtering CNN.
CNN are good performers for text classification, but poor
in learning the sequential information from the text. Hence,
LSTM was used in the given example.
x
sV
U
W W
V V V
U U U
W W W
o
x1-1
o1-1o1
o1+1
s1-1 s1+1s1
x1 x1+1Unfold
Figure 2
84 x 8420 x 20
9 x 9
Input Layer
Convolu�on Fully connected
1st Hidden
4 Frames
Nodes:
8 x 8 x 4 x 16 4 x 4 x 16 x 32 9 x 9 x 32 x 256 256 x 4
84 x 84 x 8 20 x 20 x 16 9 x 9 x 32 256 4
Weights:
15 Fillers 32 Fillers
2nd Hidden 3st Hidden {256 fully connected}
Output {ac�on}8 x 8 4 x 4
4 x 4
4 x 4
8 x 8
Figure 3
5
LSTMLong-short-term-memory (LSTM) is to deal with the
concept introduced for Recurrent Neural Networks because
of the problem of vanishing gradient in RNN.
Vanishing gradient problem is faced in machine learning
while training the neural nets using the back propagation
algorithm. In back propagation, each of the weights receives
an update proportional to gradient of the error function. As
the number of layers increase, the gradients of the error
functions multiply. This slows down the training process of
NN terribly.
IN RNN, this problem exacerbates when the feedback
component is also added to the training of weights.
To handle this, LSTM was introduced by Hochreiter and
Schmidhuber (1997).
Figure 4 represents the blog diagram of LSTM.
LSTM selects the parameters and decides when to and
when not to update the memory based on the log-logistic
function, working on the input vector and the latest
learnt weight.
Forget gate helps LSTM to forget the learnt weight if the
effect of carrying it forward is negligible, thus reducing the
long term dependencies.
Working ExampleWe performed sentiment analysis of movie review data
taken from IMDB.
The data has 5000 total sentences including positive and
negative sentiments.
We used RNN – LSTM as we need to train the sequential
information to our neural nets.
Data was split into 75% training and 25% testing data by
random sampling.
Deep Learning module was constructed as mentioned
below:
Word embedding: skip-gram (similar to CBOW)
RNN LSTM layer = 1
Number of LSTM nodes = 100
Neural Nodes = 5
Output Node = 1 (as there exist only 2 classes to classify)
Cost function: binary cross entropy
Optimizer: RMS prop
Non-linearity = tanh
After creating the neural network using Python as the
programming language, following result was obtained:
Validation Accuracy: 82.30%
This accuracy signifies that out of 25% of the entire data
(test set), having 0 - 1 tags , 82.30% was correctly classified
as positive or negative sentiment.
References[1]. Goldberg, Yoav. "A primer on neural network
models for natural language processing." arXiv preprint
arXiv:1510.00726 (2015).
[2]. Rong, Xin. "word2vec parameter learning explained."
arXiv preprint arXiv:1411.2738 (2014).
Input Gate
Memory Cell Input
Output Gate
Memory CellOutput
Forget GateSelf-recurrent
Connec�on
Figure 4
1
Ashutosh Vyas Assistant Manager, Mphasis NEXTlabs
Mr. Ashutosh Vyas is an Assistant Manager at Mphasis NEXTlabs. He
completed his B.Tech from SKIT, Jaipur Rajasthan Technical University,
followed by M.Tech from IIITB Bangalore in Information Technology.
He pursued his interest in analytics and machine learning by joining
NEXTLabs as a Sr. Analyst, where he worked on different analytical and
technical aspects and designed an agent based simulation model for
Mphasis proprietary product – InfraGraf. He designed a machine learning
based information retrieval algorithm from documents and sentiment
analysis module using deep learning. Ashutosh also published a paper
on Life Event Detection in InnovEx Asia 2016 conference.
His technical interest resides in machine learning, agent based
modelling, optimization theory, graph theory, stochastic based analysis,
and time series analytics.
About MphasisMphasis is a global technology services and solutions company specializing in the areas of Digital, Governance, Risk &
Compliance. Our solution focus and superior human capital propels our partnership with large enterprise customers in their
digital transformation journeys. We partner with global financial institutions in the execution of their risk and compliance
strategies. We focus on next generation technologies for differentiated solutions delivering optimized operations for clients.
6
For more information, contact: [email protected]