Saturday, July 4, 2015

Natural Language Processing

I'm afraid I'm pretty lousy at explaining to people what I do. I think my parents have learned to memorize the key words "Natural Language Processing" so that they can tell their friends about my occupation. Another relative of mine is under the illusion that my current research is about to replace Google search, just as soon as I'm done (I swear I never told her anything like that!). When I try to simplify it, I sometimes tell people that it is a subfield of Artificial Intelligence. Then again, I think that makes some people imagine me talking with a robot as my everyday routine.

In this post I would like to tell you a little bit about what Natural Language Processing is and why I find it such an interesting field of research. In the following post, I will elaborate on what I actually (try to) do in this field.

Natural Language Processing (NLP, not to be confused with the other NLP) is mainly about bridging the gap between how humans communicate (in natural languages such as English) and what computers understand (machine language). When this task is fully solved, you will be able to communicate with your computer (or your tablet, cell phone, smart refrigerator, or car) just as you do with another human being.

"Computers are incredibly fast, accurate and stupid; humans are incredibly slow, inaccurate and brilliant; together they are powerful beyond imagination." (Albert Einstein)1

Computers are, as the quote above points out, completely stupid. When you are engaged in a conversation with a person, each of you understands the meaning of what the other is saying. Computers understand only machine language, plus the very specific instructions they are programmed to accept on top of it. Human language is much more complex than that: you can say one thing in multiple ways, for example "where is the nearest sushi restaurant?" and "can you please give me addresses of sushi places nearby?" -- this is called language variability. Sometimes you say something that can have several meanings, like "time flies like an arrow" -- this is called language ambiguity. A human being usually understands the correct meaning from the context of the conversation. A computer... doesn't really.

However, human knowledge is limited, while today, in the big data era, computers have access to an almost unlimited amount of it. So what if we taught computers to understand us? We could have the answers to all the questions in the universe!

Of course, some of these applications already exist. If you have an Android phone, you can say "Ok Google" and then ask a question that Google will (with some success) answer. The same goes for Siri on Apple devices. However, this is ongoing research, and none of these applications is perfect yet.

In addition to understanding human language, the field is also concerned with teaching computers to generate it, so that they can fool you into thinking you are talking with an actual human being. I'm sure you have encountered virtual assistants:

I just want to talk. Is that a problem?
Such applications require both understanding and generating human language. I'm sure that with this example I've convinced you that there's still plenty of work left in this field. It is quite fun to challenge these virtual agents with complex language and topics they weren't trained to answer. I recommend it as a game :)

So how can NLP help us in our everyday life? In many ways. Here is a small subset of NLP tasks you may have encountered in applications:

  • Speech to text / text to speech - translating spoken words into written text and vice versa. These are the first and last steps of applications in which you speak with a device; the internal processing is done over written text. NLP is actually a very small part of this task, which also involves electrical engineering, machine learning and other fields.
  • Machine translation - in two words: Google Translate.
  • Language model - determines how likely a certain sentence is in the language. For instance, the sentence "I'm reading this post now" is more likely to be said than "This post now reading I'm", even though both sentences contain only correct English words; and the sentence "I called my mother on the" is more likely to end with "phone" than with "banana". Language models are used in many applications, for example the auto-suggest feature in your phone. Though it sometimes makes funny suggestions, it can be very helpful.

    Mmm... what was that offer again?
  • Automatic summarization - you know you don't have the patience to read long news or entertainment articles on the web, restaurant reviews on TripAdvisor, not to mention texts you need to read for work or school. This application takes long texts and provides you with a concise version of them.
  • Information Retrieval - supports search engines and improves search results by understanding what users really mean in their queries. For instance, you may have noticed the special search results you get on Google when searching for things such as the time, the weather, or flight details:

    If you ever wondered how Google is smart enough to understand you, you may now have part of the answer.
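To make the language-model bullet above a bit more concrete, here is a toy sketch of how a bigram model could score sentences. This is only an illustration, not how real systems are built: the tiny corpus, the `score` function, and the smoothing constants are all invented for the example. The idea is to multiply, word by word, the probability of each word given the word before it:

```python
from collections import Counter

# Tiny made-up corpus; a real language model is trained on vastly more text.
corpus = [
    "i called my mother on the phone",
    "i am reading this post now",
    "my mother is on the phone",
]

# Count bigrams (adjacent word pairs) and unigrams, with a start marker <s>.
bigrams = Counter()
unigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words)
    for prev, word in zip(words, words[1:]):
        bigrams[(prev, word)] += 1

def score(sentence, alpha=0.1, vocab_size=1000):
    """Probability of a sentence under the bigram model, with add-alpha
    smoothing so unseen word pairs still get a small non-zero probability."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return prob

# The fluent word order gets a higher probability than the scrambled one.
print(score("i called my mother on the phone") > score("phone the on mother my called i"))  # True
```

Even this toy version captures the post's example: the fluent sentence is made of word pairs the model has seen, while the scrambled one is penalized for every unseen pair.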

And here is a cool glimpse into the future (some of it is already implemented, though definitely not common): when computers can generate human language, your refrigerator will be able to tell you "hey, you're running out of milk - I added it to your grocery list". That would also require help from other fields such as computer vision (to scan the bar codes of the milk and other products inside the fridge). I think it's a cool example, though.

So now you see that you've actually encountered applications of NLP many times before - you just couldn't name them. I hope I managed to excite you about NLP, and hopefully I will also succeed with other topics in the next posts.

Small survey question: when you search something in Google (or any other search engine of your preference), is your query:
(1) a full question, such as "What is the height of Mount Everest?"
(2) composed of key words, such as "height Everest"

The results will be published when there are enough readers to draw a meaningful statistical conclusion (probably never).

1 05/07/2015: Thanks to Yuval, who doubted the authenticity of this quote; it turns out that it probably wasn't Einstein who said it, though it is not clear who did.


  1. My answer to the survey: I'm really not sure. I think it is composed of keywords most of the time, and sometimes I do just write a question and count on Google to ignore all the unimportant words. If you really need me to choose one, then I guess the second option is right for me.

  2. Thanks Vered! I may refer people to this post when trying to explain NLP :-)

    The pipeline of a natural-language human-machine interface is quite complex. Each phase deserves its own field of study, and in fact some of the phases (text-to-speech, for instance) are extremely hard.

    Regarding the survey, it depends on whether I'm typing or using voice search. When typing, I'm lazy and I know exact keywords will probably get me better results. When speaking, part of the request is also a test of the speech-to-text abilities of the search engine, so I try to give it a full sentence. But mostly it's just keywords.

    Oh, and about our friend Albert up there... as the famous quote goes:
    "The problem with quotes on the internet is that you can never be sure they are authentic" - Karl Marx

    1. Thanks Yuval! I might drill down into some of these phases in the following posts, as they are complex enough to have their own posts.

      Regarding the survey, it's an interesting observation. I actually do the same, intuitively, but I never thought about it...

      I really liked the "Karl Marx" quote :)
      You are right that it probably wasn't Einstein who said it, but the page I found didn't really determine who did. I will add a footnote to mention that.

  3. Who's the relative who thinks you're going to replace google? :-)
    And regarding automatic summarization - so you will revolutionize the world of tl;dr...
    p.s. my answer to the survey is the second (key words).

    1. It's Revital... they are less skeptical on that side of the family... :)

      You actually gave me a great idea to generate these tl;dr automatically based on previous human-generated tl;drs! I should find the time to try it someday, though it's not my research focus ;)

  4. I never thought about it, but the use of keywords in search queries is common, and a standard language model will not work well there: though "what is the problem with quotes on the internet" is a probable English sentence, "problem quotes internet" is not. That makes it even harder, because many more possibilities have a high probability. It might even be as if there's almost no language model at all (maybe just use bigrams or unigrams together somehow), since you can use any word combination you want. I won't be surprised if Google takes this into consideration when the voice search feature is used, or for the suggestions (maybe their corpus is 'search queries'), but this might not work well with other providers of speech-to-text and auto-suggestions.

    Things that make you go "hmm" (not the Hidden Markov Model 'hmm').

    1. I think the people at Yahoo! are working on answering query questions with answers from Yahoo! Answers, and one of the reasons it's difficult is that queries are usually not grammatically correct questions.

      I think Google's algorithm is a bit more complex than you suggested.

      Hmm... I should steal your joke in case I need to mention HMM in one of the posts :)

    2. Hi Vered, welcome to the blogosphere!
      We are indeed deep into research on the grammar of search queries, especially those that lead searchers to question answering sites. You can check out the paper our team presented two years ago at WWW, essentially about mapping the second query form in your survey to the first (I use keywords, by the way):
      Oh, and you can drop the exclamation point from our name :)

      - Yuval

    3. Hi Yuval, thanks!

      I've actually heard Idan talking about it a few weeks ago at IBM's ML seminar, so now I know more about it, and it sounds very interesting. Thanks for the paper link. I've added it to my reading list and I'll read it soon.

      Although not many people replied to the survey, my impression is that anyone with a basic understanding of technology searches by keywords. I can imagine only people from my parents' generation expecting a search engine to understand a grammatically correct question. But maybe one day, when semantic tools are more developed, this will become reality.

      Good to know about the exclamation mark. It is weird to have an exclamation mark in a company's name, and it can be very confusing for NER :)

    4. Well, there once was Ask Jeeves, which claimed to answer queries in question form (today it's a CQA site), but these days the tide is turning, and question queries are being treated better and better by all engines.
      No kidding about the NER thing - only this week I found out that yet another tool we're trying to use thinks the string "Yahoo" is one of many things, including the Gulliver's Travels creature, but not the company...

  5. Answer to the survey: I usually use keywords for my Google search. Great post by the way, you explained it all in a simple way that I could easily understand. Thanks!