Probably Approximately a Scientific Blog: Representing Words

Sunday, January 3, 2016

Representing Words

We're already after 6 posts on the topic of natural language processing, and I can't believe I haven't discussed this basic topic yet. So today I'm going to discuss words; More accurately - I will discuss how words are represented in natural language processing.

A word is the most basic unit in semantic tasks. While other, low-level tasks, such as part-of-speech tagging (detecting the part of speech of each word in a sentence) and lemmatization (finding the lemma - a word's basic form) might be interested in the characters (mostly affixes) of the word, semantic tasks are concerned with meaning. A word is the most basic unit that conveys meaning.

While a word is basically stored in the computer as a string - a sequence of characters, this representation says nothing about the word's meaning. It allows detecting a certain similarity, in very specific cases. For example, morphological derivations (e.g. sing-singer) and inflections (e.g. cat-cats, listen-listened) modify words, creating related words (with a different part-of-speech, plurality form, etc). Such words are therefore similar both in meanings and in their string representations, since they share common characters. This similarity could be detected using Levenshtein distance.

Needless to say, most of the words that are similar by meaning do not share many common characters. Synonyms such as elevator and lift and related words such as food and eat, are not similar at the character-level.

In addition, performing operations on strings is highly inefficient. Computers can handle numbers much better than strings. Better representations are needed also for faster computation. I assume that I've convinced you why words need better representations. Now I can move on to telling you how words could be better represented as vectors (arrays).

one-hot vectors

As I mentioned, working with strings is computationally inefficient. A simple solution to this inefficiency is to convert every string to a number. First, we need to define the vocabulary: this is the set of all distinct words in the language (or at least those that we observed in a very large corpus). Unlike a dictionary, in which the entries are word basic forms, a vocabulary contains different entries for inflections and derivations of the same basic form (e.g., cat and cats).
When a semantic application processes a text, it can now replace every word in the text with its index in the vocabulary, e.g. cat is 12424, cats is 12431, dog is 15879, etc.

This is the most compact representation for a word - all it takes to store a word is one number. So what do vectors have to do with this? Another way to look at this representation is as a vector of dimension |V| (with V denoting the vocabulary), with zeros in all entries, and one set bit at the word's index (see figure 1).

Figure 1: an illustration of one-hot vectors of cat, cats and dog. The only non-zero entry in each vector is the index of the word in the vocabulary.

While one-hot vectors could be stored efficiently, their main problem is that they don't capture any information about the similarity between words. They even lose the information about string-similarity, that sometimes indicates semantic similarity (when words share the same lemma / basic form). Since we're interested in representing words for semantic tasks, a better solution is needed.

Distributional vectors
One important characteristic of a word is the company it keeps. According to the distributional hypothesis ^[1] , words that occur in similar contexts (with the same neighboring words), tend to have similar meanings (e.g. elevator and lift will both appear next to down, up, building, floor, and stairs); simply put, "tell me who your friends are and I will tell you who you are" - the words version.

This idea was leveraged to represent words by their contexts. Suppose that each word is again represented by a |V|-dimensional vector. Each entry in the vector corresponds to an index in the vocabulary. Now, instead of marking a word with its own index, we mark the occurrences of other words next to it. Given a large corpus, we search for all the occurrences of a certain word (e.g. elevator). We take a window of a pre-defined size k around elevator, and every word in this window is considered as a neighbor of elevator. For instance, if the corpus contains the sentence "The left elevator goes down to the second floor", with a window of size 5, we get the following neighbors: the, left, goes, down. A larger k will also include characteristic words such as floor.

We then update elevator's vector by increasing the number of occurrences in the indices of the, left, goes, and down. At the end of this process, each word vector contains the frequencies of all its neighbors. We can normalize the vector to get a probability distribution (how likely each word is to appear as a neighbor of the target word). There are some more complex metrics, but this is the main idea. These methods are also referred to as "counting methods".

The main advantage of distributional vectors is that they capture similarity between words: similar words => similar neighbors => similar vectors. Measuring similarity between vectors is possible, using measures such as cosine similarity, for example. We get a simple method of comparing similarities between words - we can expect elevator and lift to yield a higher similarity score than elevator and, say, cat (and I wouldn't miss the chance to watch a cat in an elevator video). These simple vector similarities are highly effective in recognizing word similarity in semantic tasks (e.g., to overcome lexical variability).

Figure 2: an illustration of distributional vectors of food, eat and laptop. Notice that the vectors of the similar words food and eat are similar, while different from the vector of the dissimilar word laptop.

Yet, distributional vectors pose computational obstacles -- the vocabulary size is typically very high (at least a few hundred thousand words). Storing each of these |V| words in a |V|-dimensional vector results in a |V|²cells matrix - this is quite large, and performing operations on all words is computationally heavy.

Word embeddings
Word embeddings to the rescue. The basic idea is to store the same contextual information in a low-dimensional vector; each word is now represented by a D-dimensional vector, where D is a relatively small number (typically between 50 and 1000). This approach was first presented in 2003 by Bengio et al. ^[2] , but gained extreme popularity with word2vec ^[3] in 2013. There are also some other variants of word embeddings, like GloVe ^[4].

Instead of counting occurrences of neighboring words, the vectors are now predicted (learned). Without getting too technical, this is what these algorithms basically do: they start from a random vector for each word in the vocabulary. Then they go over a large corpus and at every step, observe a target word and its context (neighbors within a window). The target word's vector and the context words' vectors are then updated to bring them close together in the vector space (and therefore increase their similarity score). Other vectors (all of them, or a sample of them) are updated to become less close to the target word. After observing many such windows, the vectors become meaningful, yielding similar vectors to similar words.

What are the advantages of word embeddings? First of all, despite the low dimensions, the information regarding word similarity is kept. Similar words still have similar vectors - this is what the algorithms aim to do while learning these vectors. Second, they are compact, so any operation on these vectors (e.g. computing similarities) is efficient and fast.

Moreover, people have done some amazing things with these vectors. Some of these things (if not all) are possible with high-dimensional vectors as well, but are computationally difficult.

Most similar words - we can easily find the most similar words to a certain word, by finding the most similar vectors. For instance, the 5 most similar words to July are June, January, October, November, and February. We can also find a word which is similar to a set of other words: for example, the most similar word to Israel + Lebanon + Syria + Jordan is Iraq, while the most similar word to France + Germany + Italy + Switzerland is Austria! Isn't this cool? If that's not impressive enough, using techniques for projecting high-dimensional vectors to 2-dimensional space (t-SNE, PCA), one can visualize word embeddings, and get some really nice insights about what the vectors capture:

Figure 3: a t-SNE visualization of a tiny sample of GloVe word embeddings. It seems to have noticed that bird and fly are related, as well as bird and cat, but love, marriage and wedding are not, for some reason.

Figure 4: two-dimensional projection of word2vec vectors of countries and their capitals, from the paper. The lines between countries and their capitals are approximately parallel, indicating that the vectors capture this relation.

Analogies - Following the above figure, the authors of this paper presented some nice results regarding the vectors' ability to solve analogy questions: a is to b as c is to d; given a, b, and c, find the missing word d. Turns out that this can be solved pretty accurately by selecting a vector which is similar to b and c, but not to a. This is done by addition and subtraction, and the most famous example is the "king + woman - man = queen", demonstrated in the following figure:

Figure 5: an illustration of the analogical male-female relation between word pairs such as (king, queen), (man, woman) and (uncle, aunt), from the paper.

While these results are very impressive (I'm still impressed and I've known about them for a long time now...), it still isn't clear how useful these vectors are. Turns out they are amazingly useful. Almost any semantic task has been re-implemented recently with the help of word embeddings. Many of them benefit from a performance boost, and it's mostly just very easy to use them. In future posts, I'll give some examples of tasks that rely on word embeddings.

Suggested Reading:
Deep Learning, NLP, and Representations, from Christopher Olah's blog (more technical and advanced).

References:

[1] Harris, Zellig S. "Distributional structure." Word. 1954. ^↩

[2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin, "A neural probabilistic language model", The Journal of Machine Learning Research, 2003. ^↩
[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space." CoRR, 2013.. ^↩
[4] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." EMNLP 2014. ^↩

28 comments:

UnknownSeptember 2, 2016 at 9:59 AM
Great post!
Thanks!
ReplyDelete
Replies
UnknownSeptember 2, 2016 at 10:00 AM
Great post!
Thanks!
ReplyDelete
Replies
Vered ShwartzSeptember 2, 2016 at 12:46 PM
Following a mail comment: how is the low dimensionality of word embeddings achieved?

The main idea is that these vectors are predefined at a low dimension, e.g. 300, and they are initialized randomly. Then, the algorithm reads the corpus, and for each word, it updates its vector so that its similarity scores with the neighbours are high and its similarity scores with the rest of the word vectors (or with a sample of non-neighbours) are low. After reading a very large corpus, words that occur with similar neighbours will be similar due to the training objective.

Another point that I didn't mention in the post is that prior to word embeddings, it was common to apply dimensionality reduction to the word co-occurrence matrix (count-based vectors) to get low-dimensional word vectors. They perform similarly or slightly worse than word embeddings (it depends who you ask... [1], [2]).

[1] Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. Marco Baroni, Georgiana Dinu, and Germán Kruszewski. ACL 2014.
[2] Improving distributional similarity with lessons learned from word embeddings. Omer Levy, Yoav Goldberg, and Ido Dagan. TACL 2015.
ReplyDelete
Replies
SaadJanuary 8, 2017 at 7:31 AM
Thank you for this excellent excellent post on the topic. Very easy to understanddd!! :)

Thank You!! :)
ReplyDelete
Replies
TechSachFebruary 14, 2017 at 6:27 AM
Thank you for this well articulated post!
ReplyDelete
Replies
UnknownApril 9, 2017 at 11:56 AM
Excellent post.Thanks a lot
ReplyDelete
Replies
ikibApril 12, 2017 at 9:25 AM
Thank you. Very well written. Can you please consider writing on why word embeddings generalize. Maybe another excellent blog on how word embeddings and neural networks work together, with some technical details as well. Thanks for considering the request.
ReplyDelete
Replies
UnknownApril 13, 2017 at 3:54 PM
thank you ! this is great !
ReplyDelete
Replies
AnonymousMay 5, 2017 at 3:57 PM
So helpful. Thanks a lot.
ReplyDelete
Replies
Gyan GarciaMay 16, 2017 at 9:32 PM
Thanks a lot for your post. Well written an easy to understand to newbies (like me) on the nlp field.

Saludos
ReplyDelete
Replies
UnknownJune 13, 2017 at 10:26 AM
I read a dozen articles trying to understand this concept, i came across your quora answer on the same which directed me here. I am glad I read this, crystal clear explanation. Kudos to you!
ReplyDelete
Replies
AnonymousJune 14, 2017 at 3:44 AM
Hey, Vered!

As one of the many people who are setting their foot into the world of NLP, I have always had second thoughts about keeping myself up-to-date about recent advances that happen in its area of research without knowing my fundamentals about the concepts involved. After having searched through a lot of online resources to understand about word embeddings better, I reached your blog post through Quora. I loved reading this article of yours as it helped me gain an immensely interesting glimpse on word embeddings in the first shot as a commoner! Thank you so much for sharing this article. And, I would love to read more such articles written by you.

I shall be eagerly waiting for more to read and catch up on! Thanks again :)
ReplyDelete
Replies
Vered ShwartzJune 14, 2017 at 7:24 AM
Thank you so much, it's great to get such feedback! :)
ReplyDelete
Replies
UnknownJuly 5, 2017 at 3:20 PM
Thanks for awesome explanation, I loved it.
ReplyDelete
Replies
UnknownNovember 1, 2017 at 6:51 PM
Thanks for great explanations, it was awesome. I still have a question. like we have several measurement formula to get the similarity between strings, is it correct to say that all them works only if we find the similarity of them based on the context?
I mean for example the index of elevator is 1424, and the index of lift is 1550, so finding the similarity without having a context seems stupid?

sorry for my naive question, I am kind of newbie
ReplyDelete
Replies
AnonymousFebruary 26, 2018 at 4:39 PM
Great contribution!

Could you tell me what the source for figure 2 is?
Thank you!
ReplyDelete
Replies
UnknownAugust 22, 2018 at 3:41 PM
Thank you for the wonderful explanation of words in the vector space.

Few doubts:
1. On last method (Word embeddings), transforming higher dimensions to lower dimension will cause the loss of some features?
2. How to decide the number of neurons in the hidden layer as this will be act as transformed lower dimension. (above we say 300/500).

Waiting !!! :)
ReplyDelete
Replies
Dorra El MekkiOctober 23, 2018 at 5:27 PM
I really like this post, thank you for your help
ReplyDelete
Replies

Add comment