Natural Language Understanding Using Deep Learning

LivePerson Tech Blog
May 20, 2017 · 10 min read


The field of Natural Language Processing (NLP), or by its more aspirational name, Natural Language Understanding (NLU), is one of the older fields in computer science.

Since the 1950s, scientists have been looking for ways to automatically process human language. The algorithms developed up until the 1980s were mostly based on sets of handcrafted rules; later approaches were based on Machine Learning algorithms.

Over the years, significant achievements have been accomplished: sentence parsing, entity extraction, part-of-speech tagging, topic modeling, text classification and categorization, machine translation, text-to-speech, speech-to-text, sentiment analysis, automatic summarization, and more.

It seems that analyzing Shakespeare's writing, with its impeccable syntax, is a doable task.

Unfortunately, humans started producing far messier text than anything the English genius ever wrote, text that traditional NLP algorithms have failed to cope with.

Apparently, processing truly natural language is not a trivial task at all. Existing NLP parsers are fragile and often fail on free-form text that does not follow the rigid rules of a language.

Deep Learning to the rescue

Deep Learning is an old-new field in Computer Science, or to be more precise, a rebranding of some very old algorithms from the Neural Network family.

At the core of this methodology are multiple (hence "deep") layers of neurons, connected by weights that are learned during the network's training phase. I will not go further into the technical aspects of DL, as it is not the main topic of this post and there is plenty of content available online.

In recent years the field has made tremendous progress, thanks to some groundbreaking achievements in image and speech recognition, for example:

  • Image recognition: at the end of 2014, Google published a post about automatically describing an image and the objects in it.
  • Speech recognition: Microsoft published a post at the end of 2016 claiming they had achieved human parity in conversational speech recognition.

Recently, Deep Learning has also been mentioned frequently with respect to NLP, due to great interest in analyzing and processing human language as it is actually written: free text, emojis, typos, grammatical errors, acronyms, and other Generation Y writing. Add to that the buzz around chatbots and you get a complete frenzy.

What has changed?

From the data angle

It is a well-known fact that having data is a precondition for developing learning systems.

In 2010, a project named ImageNet was started (mostly by Stanford and Princeton researchers), in which over 10 million images were manually labeled. This dataset is now available to everyone, and its existence jump-started research into Deep Learning algorithms, especially Convolutional Neural Networks, which are very well suited to image recognition problems.

Textual data for research has also become more accessible than ever before: SNAP from Stanford, labeled movie reviews and ratings from IMDB, the entire Wikipedia available for download, and more.

From the hardware angle

Neural Network algorithms "like" to run on graphics cards (GPUs). These cards are characterized by their ability to run massive numbers of small, simple calculations in parallel. Market benchmarks show 50x to 100x improvements in training time compared to CPUs. As a result, companies such as Asus, AMD, Intel and NVIDIA have started developing graphics cards that are optimized for Deep Learning workloads (and not just for first-person-shooter games).

From the academic angle

The field of Neural Networks has known ups and downs over the past 70(!) years. In recent years, a tremendous amount of academic research has been conducted by top universities (led by Stanford) and by research departments at Google, Microsoft and IBM.

Specifically in the NLP world, there has been significant progress in how text and words are represented. More on that later in this article.

And from the software angle

In my opinion, what placed Deep Learning at center stage was Google's release, at the end of 2015, of TensorFlow, an open-source library for developing Deep Learning networks, which became extremely popular with over 44K stars on GitHub.*

* Correlation does not imply causation, so the explanation could just as well run the other way around.

Automatic learning of features

Beyond the great fuss around Deep Learning, the field is genuinely leading a change in how learning systems are developed.

Classic learning systems usually require a data pre-processing phase called feature extraction or feature engineering. In this process, the researcher tries to find attributes (features) in the data whose presence, absence, or co-occurrence with other features might explain a trend in the prediction. In many cases this requires solid domain knowledge and a significant statistical background, so the work tends to be very manual, Sisyphean, and time consuming.

In NLP specifically, the researcher is expected to understand the grammar, morphology, formal structure, pronunciation, syntax, and so forth of every language they are dealing with, in order to craft the relevant features for the learning system.

For example:

  • The prefix "un" at the beginning of a word changes its meaning, for example: uninterested.
  • Understanding sequences, for example: pun intended.
  • Relying on syntactically and semantically annotated texts, for example a treebank.
  • Relying on an external lexical database to identify parts of speech, synonyms, and antonyms, for example WordNet (see the short sketch just after this list).
  • Relying on external lists of names, locations, products, and so on.
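
As a tiny illustration of this kind of lexicon-driven feature work, here is a minimal sketch using NLTK's WordNet interface (NLTK is my choice for the example, not a tool named in the post); it assumes the WordNet corpus has already been downloaded:

```python
# Sketch: lexicon-based feature lookup with NLTK's WordNet.
# Assumes: pip install nltk, plus nltk.download('wordnet') has been run.
from nltk.corpus import wordnet as wn

def lexical_features(word):
    """Collect synonyms and antonyms for `word` from WordNet."""
    synonyms, antonyms = set(), set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    return synonyms, antonyms

# Features like these were traditionally fed into classic NLP classifiers.
print(lexical_features("interested"))
```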

Ultimately, even professionally written newspaper text can confuse a human reader.

It would probably be an exaggeration to say that feature extraction will disappear thanks to DL algorithms, but a change of course is clearly visible. Instead of having the researcher extract features, we "throw" the data at the network, and it finds the relevant features automatically and weights them correctly.

The change of course, then, is this: instead of manually extracting features, the main task becomes representing the data correctly, so that the system can identify the features automatically.

Data representation

Classic NLP algorithms usually try to represent words/sentences/documents as a vector or a matrix of numbers. There are quite a few popular methods for doing so: one-hot, BOW, CBOW, TF-IDF, co-occurrence matrices, n-grams, skip-grams, and more.
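
Most of these classic representations can be produced with a few lines of scikit-learn (my own choice of library here, not one named in the post); for instance, bag-of-words and TF-IDF matrices:

```python
# Sketch: classic bag-of-words and TF-IDF representations with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the plane flies in the sky", "the car drives on the ground"]

bow = CountVectorizer().fit_transform(docs)      # raw word counts
tfidf = TfidfVectorizer().fit_transform(docs)    # counts re-weighted by rarity

print(bow.shape)      # (num_documents, vocabulary_size): one column per word
print(tfidf.toarray())
```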

For example, in the simplest one-hot representation:

  • The word ‘plane’ might be represented by a vector: [0, 0, 1, 0, 0, 0, …. , 0, 0]
  • And the word ‘airplane’ might be represented by a vector: [0, 0, 0, 0, 1, 0, …. , 0, 0]

The size of the vector equals the number of distinct words in the text corpus. This type of representation creates two main problems:

  1. Sparsity: the number of dimensions (the vector length) needed to represent a single word equals the total number of words in the corpus, which can easily reach tens of thousands or more. Clearly this representation is inefficient and requires significant computational resources to feed into a learning system.
  2. Term relationships: the words 'plane' and 'airplane' are synonymous and interchangeable, yet these representation methods miss that information, which is crucial for understanding the text.

What is needed is a way to represent text efficiently, in a low-dimensional vector space.
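
To make the sparsity problem concrete, here is a minimal sketch in plain Python over a hypothetical toy corpus; every new word in the corpus adds another dimension to every vector:

```python
# Minimal one-hot encoding sketch over a hypothetical toy corpus.
corpus = ["the plane lands", "the airplane is in the sky", "the car is on the ground"]

# Build the vocabulary: one dimension per distinct word.
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector that is all zeros except at the word's index."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

print(len(vocabulary))       # vector length == vocabulary size
print(one_hot("plane"))
print(one_hot("airplane"))   # orthogonal to 'plane': no notion of similarity
```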

Word representation (word embedding)

In 2013, researchers from Google published a paper describing how to represent words in a vector space, a paper which deeply influenced the NLP world and the use of neural networks for this purpose. Google also released the code behind the paper under the name Word2Vec (based on a neural network, of course). The algorithm takes a large corpus of text and creates, for each word, a vector representation of a chosen size (usually 50–200 dimensions).

What is really interesting about the output of this algorithm is that the distances and angles between word vectors carry linear, consistent meaning.

In this example (in a two-dimensional space), the words 'plane' and 'airplane' are very close to each other. Additionally, the distance and angle between 'plane' and 'sky' are very similar to the distance and angle between 'car' and 'ground'.

Another fascinating aspect of the algorithm is that you can run it on any text, in any language, without manually crafting features in advance, and still get a data structure with these linear characteristics.
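
As a rough illustration (using Gensim, the library recommended at the end of this post, rather than the original Google implementation), here is a minimal sketch that trains word2vec on a hypothetical toy corpus and queries for nearest neighbors; a real run would need far more text:

```python
# Minimal Word2Vec sketch with Gensim; corpus and parameters are placeholders.
from gensim.models import Word2Vec

# Gensim expects a list of tokenized sentences.
sentences = [
    ["the", "plane", "flies", "in", "the", "sky"],
    ["the", "airplane", "takes", "off"],
    ["the", "car", "drives", "on", "the", "ground"],
]

# 100-dimensional vectors (in Gensim 4.x the argument is vector_size, not size).
model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=2)

# Nearest neighbors in the embedding space.
print(model.wv.most_similar("plane", topn=5))
```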

In 2015, Instagram published a post on their engineering blog: an interesting piece of research on emoji usage with different tools, including word2vec.

It’s amazing to see which words are closest to each emoji:

😂 ⇒ lolol, lmao, lololol, lolz, lmfao, lmaoo, lolololol, lol, ahahah, ahahha, loll, ahaha, ahah, lmfaoo, ahha, lmaooo, lolll, lollll, ahahaha, ahhaha, lml, lmfaooo

😍 ⇒ beautifull, gawgeous, gorgeous, perfff, georgous, gorgous, hottt, goregous, cuteeee, beautifullll, georgeous, baeeeee, hotttt, babeee, sexyyyy, perffff, hawttt

Many applications in the field of NLU now use low-dimensional word representations from word2vec (or GloVe, which works statistically but produces a very similar result) as the input to a Deep Learning network, a process called pre-training.
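
One common way to wire this up is to copy pre-trained vectors into a frozen Keras Embedding layer. The sketch below is a minimal illustration; the tiny vocabulary and the randomly generated `pretrained_vectors` dictionary are stand-ins for vectors you would actually load from a word2vec or GloVe file:

```python
# Sketch: initializing a Keras Embedding layer from pre-trained word vectors.
import numpy as np
from keras.layers import Embedding

vocab = ["plane", "airplane", "sky", "car", "ground"]   # toy vocabulary
embedding_dim = 100

# Hypothetical pre-trained vectors; in practice, load these from word2vec/GloVe.
pretrained_vectors = {w: np.random.rand(embedding_dim) for w in vocab}

embedding_matrix = np.zeros((len(vocab), embedding_dim))
for i, word in enumerate(vocab):
    vector = pretrained_vectors.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

embedding_layer = Embedding(
    input_dim=len(vocab),
    output_dim=embedding_dim,
    weights=[embedding_matrix],   # start from the pre-trained vectors
    trainable=False,              # freeze them, or set True to fine-tune
)
```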

Understanding sequences — Recurrent Neural Network

To predict the value of a real-estate property, you can look at its location, size, year built, and so on. In fact, the order of the features does not really matter, only their presence.

In text, on the other hand, words do not stand on their own. A word's meaning can change according to the words that come before or after it.

Traditional learning systems cannot natively handle sequences of features that depend on time and order. Thus, already in the 1980s, a class of algorithms from the neural network family was developed to address this limitation: Recurrent Neural Networks (RNNs).

RNNs are very similar to regular neural networks, with one major difference: the output of each layer is also fed back as input to the same layer at the next step. This feedback-loop architecture lets the network "remember" information from the previous step (and, in effect, everything accumulated so far), and thus represent sequences.

This architecture turns out to work really well. In recent years, researchers have managed to solve problems in speech recognition, translation, sentiment analysis and more using variations of RNNs (mostly LSTM and GRU). RNNs have become the dominant way to model human language, and most recent research uses them.
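
To give a feel for how compact such a model can be, here is a minimal sketch of an LSTM sentiment classifier in Keras; the vocabulary size, sequence length, and layer sizes are placeholder assumptions rather than a tuned setup:

```python
# Sketch: a tiny LSTM sentiment classifier in Keras; sizes are placeholders.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 20000     # assumed vocabulary size
max_length = 100       # assumed (padded) sentence length

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=128, input_length=max_length))
model.add(LSTM(64))                        # reads the word sequence in order
model.add(Dense(1, activation="sigmoid"))  # positive / negative sentiment
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# x_train: integer word indices of shape (num_samples, max_length)
# y_train: 0/1 sentiment labels
# model.fit(x_train, y_train, batch_size=32, epochs=3)
```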

The future — ask me anything

Is Deep Learning the solution to all NLP problems? It surely seems like this is where things are heading. Using smart word embedding techniques together with variations of RNNs, researchers have outperformed almost every classic algorithm on the NLP problems that have been tested.

Is this the end of the story? It seems like it’s only the beginning.

At this stage, what learning systems can do boils down to crunching data and using it for a very specific task (e.g., predicting the price of a property). Can we build Artificial Intelligence models that can answer any question? It is still too early to say.

The ultimate goal in the field of AI is called Artificial General Intelligence: an intelligence that can handle tasks at the level of a human being, including judgment, intuition, logic, self-awareness, the ability to communicate, the ability to learn, and more.

It will probably take a while to achieve full AGI, but recent papers and articles sketch a roadmap for getting there:

  • In February 2016, researchers from Facebook published a paper about a roadmap towards developing intelligent machines. In it they focus on two main capabilities: the ability to communicate and the ability to learn. The assumption is that essentially all of human knowledge has been digitized and is readily available; all we need is a machine that can read it and learn from it.
  • Researchers from Salesforce published a paper in March 2016 presenting a Dynamic Memory Network model able to handle free-form questions and answers within a dialog.

It is hard to predict the exact future, but it is safe to say that the pace of development is wild. The time it takes from an academic paper being published to open-source code appearing on GitHub is ridiculously short. Significant developments are bound to appear in the coming years. Hold tight.

Wrapping up — I’d like to learn more

The field of Deep Learning creates great opportunities for developers who do not come from a Data Science background, for a few reasons:

  • It is a new technology and a new paradigm: everyone has to ramp up, whether they come from a nearby or a distant field. The pace of development is so rapid that it is sometimes more important to follow the trends and changes than to focus on one specific algorithm or another.
  • The reduced need for feature engineering.
  • The development of wrappers (e.g. keras.io) that simplify working with the raw libraries (e.g. TensorFlow).
  • Support for many programming languages: Python, Java, Lua, and even some interesting libraries in JavaScript.
  • And specifically for NLP: it is very easy to get text to train on.

These two links [1, 2] contain curated lists of resources related to NLP and DL.

If I had to recommend software libraries for developing NLP + DL applications, these would be the two:

  • Gensim: especially for its easy-to-use word2vec implementation (Python), but also for other topic modeling algorithms such as LDA, LSI, etc.
  • Keras.io: probably the easiest-to-use DL library (Python), running on top of TensorFlow or Theano. Quite recently, in January this year, Google announced that they plan to make Keras the default high-level API for TensorFlow.

Happy Deep Learning!

— Haggai Shachar
