Background

I used to wonder how, using deep learning / artificial intelligence, Google could correctly autocomplete text in Gmail and in search queries. Part of this is possible with classical Natural Language Processing (e.g. sentiment analysis), but where exactly does deep learning come into the picture?

Many a time the sequence of words is important; let’s look at the following example illustrating this point:

“this movie is no good, not recommended” could be classified as expressing positive sentiment if all words were taken independently, whereas a model that remembers words across the sequence would understand that the negation appearing before “good” changes the meaning of the sentence.

As per my understanding, Deep Learning (DL) works only with numbers, so how are relationships within text established using DL?

This mystery cleared up for me when I explored RNNs (Recurrent Neural Networks), a category of neural networks.

RNNs consist of recurrent neurons, i.e. neurons which have memory: a recurrent neuron feeds its output back as an input. This makes them great at learning from sequential data, as the sketch below illustrates.
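Here is a minimal sketch of a single recurrent layer in NumPy; the sizes and random weights are illustrative assumptions, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 3, 4                     # illustrative sizes
Wx = rng.normal(size=(input_size, hidden_size))    # weights for the current input
Wh = rng.normal(size=(hidden_size, hidden_size))   # weights for the previous output (the "memory")
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: mix the new input with the output of the previous step."""
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

h = np.zeros(hidden_size)                          # no memory before the first step
for x_t in rng.normal(size=(5, input_size)):       # a toy sequence of 5 time steps
    h = rnn_step(x_t, h)                           # the output loops back as an input
print(h)
```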

This is very different from plain feed-forward Artificial Neural Networks, which are typically used for regression and classification on independent samples.

Examples of time series data include sensor logs, where you receive a stream of readings from Internet of Things devices, or historical stock trading information used to predict stock behavior.

Past is Important & Text is Powerful

  • Assess stock prices over the last week, month and year: will stocks be up or down in the coming week?
  • Track a car that has been moving in a parabolic arc for the last 15 seconds: is it headed for a collision?
  • Autocomplete sentences, i.e. predict the next word in a sequence: “The 2012 Olympics were held in …”, “The tallest building in the world is …”
  • Translate between languages: “how are you” -> “Comment allez-vous”

Text is Sequential.

Feed-forward networks cannot learn from the past. Backpropagation allows the weights and biases of the neurons to converge to their final values; Backpropagation Through Time (BPTT) is a tweaked version of backpropagation used for RNN training, in which the network is unrolled across time steps before the gradients are computed.
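In practice you rarely implement BPTT by hand; a framework such as Keras unrolls the recurrent layer and backpropagates through time for you. A minimal sketch, with illustrative layer sizes and data shapes:

```python
import tensorflow as tf

# Keras unrolls the recurrent layer over the time dimension and applies
# Backpropagation Through Time automatically inside fit().
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=(None, 1)),  # (time steps, features)
    tf.keras.layers.Dense(1)
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, y_train, epochs=5)  # hypothetical data of shape (samples, time steps, 1)
```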

Representing text as numbers in a meaningful manner for use in an RNN is a challenge.

The Big Question: “How best can words be represented as numeric data?”

There are three options which come to the rescue and solve this problem, each with its own pros and cons. Word embeddings are the most powerful and are used most often.

  1. One Hot Encoding (Used in ML as well)

For example, when analyzing reviews, create a set of all words across the corpus, then express each review as a vector of 1/0 elements. Reviews can then be compared by measuring distances with simple geometry, as in the sketch below.
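A short sketch with scikit-learn; the reviews are hypothetical, and `binary=True` gives the 1/0 vectors described above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reviews = ["this movie is good", "this movie is no good, not recommended"]

vectorizer = CountVectorizer(binary=True)   # 1/0 per word of the corpus vocabulary
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())   # the vocabulary built across the corpus
print(X.toarray())                          # each review as a vector of 1s and 0s
print(cosine_similarity(X))                 # "simple geometry" to compare reviews
```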

  2. TF-IDF (TF = Term Frequency; IDF = Inverse Document Frequency):

Represent each word as numeric data and aggregate it into a tensor: tokenise the document and feed it into the RNN. Each word is represented by the product of two terms, tf and idf. Word i gets a high weight in document j if it occurs a lot in that document but rarely elsewhere. Context is not captured in this scheme. A small sketch follows.
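Again a minimal scikit-learn sketch on hypothetical reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["this movie is good", "this movie is no good, not recommended", "great movie"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

# Words that appear in almost every review (e.g. "movie") get a low weight;
# words frequent in one review but rare elsewhere get a high weight.
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```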

  3. Word Embedding (captures meaning)

Word embeddings are the most powerful of the three and are generated using an unsupervised machine learning model. The approach relies on a huge corpus of data that is fed in during learning.

Semantics and the relationships between words are maintained.

Once this machine learning model has been fed this huge corpus of data, it is ready to accept new words and spit out embeddings for them. It will accept a word such as Delhi and generate a numeric embedding for it (illustrated here as “345698000”).

This numeric encoding is special because it also captures the meaning of the word, learned from the large corpus of data that was fed in.

The underlying model is able to understand that Delhi references and represents a city.

So, if you feed in words such as Mumbai (345698234), London (345698100) or Paris (345698300), all major cities, the encodings generated for these cities will be very close to the encoding of Delhi.

The success of this approach depends on the large corpus of unlabeled data fed in during the training process.

Word embeddings capture meaning

“Queen” ~ “King” + “Woman” – “Man”

“Delhi” ~ “London” + “India” – “England”

There are two popular algorithms used here.

  1. Word2Vec
  2. GloVe (does not use neural networks; it is trained on global word–word co-occurrence statistics). A usage sketch with pretrained vectors follows.
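A sketch using gensim’s downloader with pretrained GloVe vectors (the dataset name is one of gensim’s published models; the first call downloads it, so an internet connection is assumed):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # pretrained 100-dimensional GloVe vectors

# "Queen" ~ "King" + "Woman" - "Man"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Cities cluster together: Delhi's nearest neighbours are other cities.
print(vectors.most_similar("delhi", topn=5))
```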

Final Processing

Once the numeric encoding of the text is done, the system is ready to be fed into LSTM (Long Short-Term Memory) cells to infer decisions.

Sentiment Analysis is one of them! A minimal sketch of such a pipeline is shown below.
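A minimal Keras sketch of the whole pipeline: an embedding layer feeding an LSTM that outputs a sentiment probability. The vocabulary size, embedding size and preprocessed input data are illustrative assumptions:

```python
import tensorflow as tf

vocab_size, embed_dim = 10000, 64      # illustrative sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),   # word id -> embedding vector
    tf.keras.layers.LSTM(32),                           # remembers context across the sequence
    tf.keras.layers.Dense(1, activation="sigmoid")      # positive vs. negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_sequences, labels, epochs=3)  # hypothetical tokenised, padded reviews
```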

References

The book “Hands-On Machine Learning with Scikit-Learn & TensorFlow” by Aurélien Géron is a good reference. It explains RNNs with illustrative Python code.