Wednesday, March 1, 2017

Women in STEM*

This is a special post ahead of International Women's Day (March 8th). Every year I find myself enthusiastically conveying my thoughts about the topic to the people around me, so I thought I might as well share them with a broader audience. As always, this post presents my very limited knowledge and interpretation of a broadly discussed and studied topic. It may be a bit off topic for this blog, so if you're only interested in computational stuff, you can focus on section 3.

1. The Problem
Even though we are half of the population, women are quite poorly represented in STEM:

USA: the percentage of computing occupations held by women has been declining since 1991, when it reached a high of 36%. The current rate is 25%. [2016, here]

OECD member countries: While women account for more than half of university graduates in scientific fields in several OECD countries, they account for only 25% to 35% of researchers in most OECD countries. [2006, here]

2. The Causes (and possible solutions)

2.1 Cognitive Differences
There is a common conception that female abilities in math are biologically inferior to those of males. Many highly cited psychology papers show otherwise, for example:

"Stereotypes that girls and women lack mathematical ability persist, despite mounting evidence of gender similarities in math achievement." [1].

"...provides evidence that mathematical and scientific reasoning develop from a set of biologically based cognitive capacities that males and females share. These capacities lead men and women to develop equal talent for mathematics and science." [2]


    In addition, if cognitive differences were so prominent, there wouldn't be so many women graduating in scientific fields. It seems that the problem lies in occupational gender segregation, which may be explained by any one of the following:

    2.2 Family Life
    Here are some references from studies conducted about occupational gender segregation:

    "In some math-intensive fields, women with children are penalized in promotion rates." [3]
      "[...] despite the women's movement and more efforts in society to open occupational doors to traditional male-jobs for women, concerns about balancing career and family, together with lower value for science-related domains, continue to steer young women away from occupations in traditionally male-dominated fields, where their abilities and ambitions may lie." [4]

      "women may “prefer” those [jobs] with flexible hours in order to allow time for childcare, and may also “prefer” occupations which are relatively easy to interrupt for a period of time to bear or rear children." [5] (the quotation marks are later explained, indicating that this is not a personal preference but rather influenced by learned cultural and social values).

      I'd now like to focus the discussion on my local point of view of the situation in Israel, since I suspect that family life is the most prominent cause of the problem here. I would be very interested in getting comments about what it is like in other countries.


      According to the Central Bureau of Statistics, in 2014, 48.9% of the workers in Israel were women (and 51.1% were men). The average salary was 7,439 NIS for women and 11,114 for men. Wait, what?... let me introduce another (crucial) factor.

      While the fertility rate has decreased in all other OECD member countries, in Israel it remained stable for the last decade, with an average of 3.7 children per family. On a personal note, as a married woman without children, I can tell you that it is definitely an issue, and "when are you planning to have children already?" is considered a perfectly valid question here, even from strangers (and my friends with 1 or 2 children often get "when do you plan to have the 2nd/3rd child?").

      Paid maternity leave is 14 weeks with a possibility (used by anyone who can afford it) to extend it to 3 more unpaid months. Officially, any one of the parents can take maternity leave, but in practice, since this law was introduced in 1998, only roughly 0.4% of the parents who took maternity leave were fathers. 

      Here is the number connecting the dots, and explaining the salary gap: in 2014, the average number of work hours per week was 45.2 for men and 36.7 for women. The culture in Israel is torn between the traditional family roles (mother as the main parent) and the modern opportunities for women. Most women I know have a career in the morning, and a second job in the afternoon with the kids. With a hard constraint of leaving work before 16:00 to pick up the kids, in a demanding market like the one in Israel, it is much harder for a woman to get promoted. This makes the high-tech industry, in which the working hours are known to be long, a male-dominated environment. Indeed, in 2015, only 36.2% of the high-tech workers in Israel were women.

      This situation is doubly troubling: on the one hand, it is difficult for women who do choose demanding careers. They have to juggle home and work in a way that men are never required to. On the other hand, we are steered from childhood toward feminine occupations that are less demanding in working hours.

      Don't get me wrong, I'm not here to judge. Being a feminist doesn't entail that the woman must have a career while the man has to stay at home with the children. Each couple can decide on their division of labor as they wish. It's the social expectations and cultural bias that I'm against. I've seen this happening time after time: the man and the woman both study and build up their careers, they live in equality, and then the birth of their first child, and specifically maternity leave, is the slippery slope after which equality is a fantasy. 

      To make a long story short, I think it is not women the market is against, but mothers. When I say "against" I include allegedly good ideas such as allowing a mother to leave work at 16:00. While I'm not against leaving work at 16:00 (modern slavery is a topic for another discussion...), I don't see why this "privilege" should be reserved only for mothers. In my humble opinion, it would benefit mothers, fathers, children and the market if men and women could each get 3 days a week to leave work as "early" as 16:00. It wouldn't hurt if both men and women had the right to take parental leave together, developing their parenthood as a shared job. This situation will never change unless the market overcomes ancient social rules and stops treating parenthood as a job for women.

      2.3 Male-dominated Working Environments 
      Following from the previous section, tech workplaces (everywhere) are dominated by men, so that even women who choose to work in this industry might feel uncomfortable in their workplaces. Luckily for me, I can't attest to this from my own experience: I've never been treated differently as a woman, and have never felt threatened or uncomfortable in situations in which I was the only woman. This article exemplifies some of the things that other women experienced:

      "Many [women] will say that their voice is not heard, they are interrupted or ignored in meetings; that much work takes place on the golf course, at football matches and other male-dominated events; that progress is not based on merit and women have to do better than men to succeed, and that questions are raised in selection processes about whether a woman “is tough enough”."

        I've only become aware of these problems recently, so I guess it is both a good sign (that it might not be too common, or at least that not all women experience that), but also a bad sign (that many women still suffer from it and there's not enough awareness). This interesting essay written by Margaret Mitchell suggests some practical steps to make women feel more comfortable in their workplaces.

        Of course, things get much worse when you consider sexual harassment in workplaces. I know the awareness of the subject is very high today, an employer's duty to prevent sexual harassment is statutory in many countries, and many big companies require new employees to undergo sexual harassment prevention training. While this surely mitigates the problem, it is still too common, with a disturbing story just from the last week (and many other stories untold). As with every other law, there will always be people breaking it, but it is the employers' duty to investigate any reported case and handle it even at the cost of losing a valuable worker.

        2.4 Gender Stereotypes 
        Simply because it's so difficult to change reality; even if some of the reasons why women were previously less likely to work in these industries are no longer relevant, girls will still be less oriented to working in these fields since they are considered unsuitable for them.


        An interesting illustration was provided in this work, where 26 girls (around 4 years old) were shown different Barbie dolls and asked whether they believed women could do masculine jobs. When the Barbie dolls were dressed in "regular" outfits, many of them replied negatively, but after being shown a Barbie dressed up in a masculine outfit (firefighter, astronaut, etc.), the girls believed that they too could do non-stereotypical jobs.

        This is the vicious circle that people are trying to break by encouraging young girls to study scientific subjects and supporting women already working in these fields. Specifically, by organizing women-only conferences, offering scholarships for women, and making sure that there is a female representative in any professional group (e.g. panel, committee, etc). While I understand the rationale behind changing the gender distribution, I often feel uncomfortable with these solutions. I'll give an example.

        Let's say I submitted a paper to the main conference in my field, and that paper was rejected. Then somebody tells me "there's a women-only workshop, why don't you submit your paper there?". If I submit my paper there and it gets accepted, how can I overcome the feeling of "my paper wasn't good enough for a men's conference, but for a woman's paper it was sufficient"?

        For the same reason, I'm uncomfortable with affirmative action. If I'm a woman applying for a job somewhere and I find out that they prefer women, I might assume that there was a man who was more talented/adequate than me but they settled for me because I was a woman. If that's true, it is also unfair for that man. In general, I want my work to be judged solely based on its quality, preferably without taking gender into consideration, for better and for worse.

        I know I'm presenting a naive approach and that in practice, gender plays a role, even if subconsciously. I also don't really have a better solution for that, but I do hope that if we take care of all the other reasons I discussed, this distribution will eventually change naturally. 

        3. Statistics and Bias
        Last year there was an interesting paper [6], followed by a lengthy discussion, about gender stereotypes in word embeddings. Word embeddings are trained with the objective of capturing meaning through co-occurrence statistics. In other words, words that often occur next to the same neighboring words in a text corpus are optimized to be close together in the vector space. Word embeddings have proved to be extremely useful for many downstream NLP applications.

        The problem that this paper presented was that these word embeddings also capture "bad" statistics, for example gender stereotypes with regard to professions. For instance, word embeddings have a nice property of capturing analogies like "man:king :: woman:queen", but these analogies also contain gender stereotypes like "father:doctor :: mother:nurse", "man:computer programmer :: woman:homemaker", and "he:she :: pilot:flight attendant".

        Why this is happening is pretty obvious - word embeddings are not trained to capture "truth" but only statistics. If most nurses are women, they would occur in the corpus next to words that are more likely to occur with feminine words than with masculine words, resulting in higher similarity between nurse and woman than nurse and man. In other words, if the input corpus reflects stereotypes and biases of society, so will the word embeddings.
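To make the "statistics, not truth" point concrete, here is a minimal sketch. It is not the method from [6] and it does not train real embeddings: it assumes raw sentence-level co-occurrence counts as a stand-in for word vectors, and the six-sentence corpus is invented purely for illustration. The function names (`cooccurrence_vectors`, `cosine`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences):
    """Represent each word by counts of the other words in its sentences."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j, neighbor in enumerate(tokens):
                if i != j:
                    vectors[word][neighbor] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented toy corpus that mirrors a stereotyped distribution (an assumption,
# not real data): "nurse" only ever appears in sentences with "she".
toy_corpus = [
    "she is a nurse at the hospital",
    "she works as a nurse",
    "he is a programmer at the company",
    "he works as a programmer",
    "the woman said she is tired",
    "the man said he is tired",
]

vectors = cooccurrence_vectors(toy_corpus)
sim_woman = cosine(vectors["nurse"], vectors["woman"])
sim_man = cosine(vectors["nurse"], vectors["man"])
print(sim_woman > sim_man)
```

Nothing in the code knows anything about gender; "nurse" ends up closer to "woman" only because both share the neighbor "she" in this corpus.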

        So why is this a problem, anyway? Don't we want word embeddings to capture the statistics of the real world, even the kind of statistics we don't like? If something should be bothering us, it is the bias in society, rather than the bias these word embeddings merely capture. Or in other words:

        I like this tweet because I was wondering just the same when I first heard about this work. The key concern about bias in word embeddings is that these vectors are commonly used in applications, and this might inadvertently amplify unwanted stereotypes. The example in the paper mentions web search aided by word embeddings. The scenario described is of an employer looking for an intern in computer science by searching for terms related to computer science, and the authors suggest that a LinkedIn page of a male researcher might be ranked higher in the results than that of a female researcher, since computer science terms are closer in the vector space to male names than to female names (because of the current bias). In this scenario, and in many other possible scenarios, the word embeddings are not just passively recording the gender bias, but might actively contribute to it!

        Hal Daumé III wrote a blog post called Language Bias and Black Sheep about the topic, and suggested that the problem goes even deeper, since corpus co-occurrences don't always capture real-world co-occurrences, but rather statistics of things that are talked about more often:

        "Which leads us to the "black sheep problem." We like to think that language is a reflection of underlying truth, and so if a word embedding (or whatever) is extracted from language, then it reflects some underlying truth about the world. The problem is that even in the simplest cases, this is super false."

        Prior to reading this paper (and the discussion and blog posts that followed it), I never realized that we are more than just passive observers of data; the work we do can actually help mitigate biases or inadvertently contribute to them. I think we should all keep this in mind and try to see in our next work whether it can have any positive or negative effect on that matter -- just like we try to avoid overfitting, cherry-picking, and annoying reviewer 2.

        [1] Cross-national patterns of gender differences in mathematics: A meta-analysis. Else-Quest, Nicole M.; Hyde, Janet Shibley; Linn, Marcia C. Psychological Bulletin, Vol 136(1), Jan 2010, 103-127.
        [2] Sex Differences in Intrinsic Aptitude for Mathematics and Science?: A Critical Review. Spelke, Elizabeth S. American Psychologist, Vol 60(9), Dec 2005, 950-958.
        [3] Women's underrepresentation in science: Sociocultural and biological considerations. Ceci, Stephen J.; Williams, Wendy M.; Barnett, Susan M. Psychological Bulletin, Vol 135(2), Mar 2009, 218-261. 
        [4] Why don't they want a male-dominated job? An investigation of young women who changed their occupational aspirations. Pamela M. Frome, Corinne J. Alfeld, Jacquelynne S. Eccles, and Bonnie L. Barber. Educational Research and Evaluation, Vol. 12, Iss. 4, 2006.
        [5] Women, Gender and Work: What Is Equality and How Do We Get There? Loutfi, Martha Fetherolf. International Labour Office, 1828 L. Street, NW, Washington, DC 20036, 2001.
        [6] Quantifying and Reducing Stereotypes in Word Embeddings. Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016 ICML Workshop on #Data4Good: Machine Learning in Social Good Applications.

        *STEM = science, technology, engineering and mathematics

        Wednesday, November 23, 2016


        In the Seinfeld episode "The Opposite", George says that his life is the opposite of everything he wanted it to be, and that every instinct he has is wrong. He decides to go against his instincts and do the opposite of everything. When the waitress asks him whether to bring him his usual order, "tuna on toast, coleslaw, and a cup of coffee", he decides to have the opposite: "Chicken salad, on rye, untoasted. With a side of potato salad. And a cup of tea!". Jerry argues with him about what the opposite of tuna is, which, according to him, is salmon. So which one of them is right? If you ask me, neither salmon nor chicken salad is the opposite of tuna. There is no opposite of tuna. But this funny video demonstrates one of the biggest problems in the task of automatically detecting antonyms: even we humans are terrible at that!

        It's a Bird, It's a Plane, It's Superman (not antonyms)
        Many people would categorize a pair of words as opposites if they represent two mutually exclusive options/entities in the world, like male and female, black and white, and tuna and salmon. The intuition is clear when these two words x and y represent the only two options in the world. In set theory, it means that y is the negation/complement of x. In other words, everything in the world which is not x must be y (figure 1).

        Figure 1: x and y are the only options in the world U

        In this sense, tuna and salmon are not antonyms - they are actually more accurately defined as co-hyponyms: two words that share a common hypernym (fish). They are indeed mutually exclusive, as one cannot be both a tuna and a salmon. However, if you are not a tuna, you are not necessarily a salmon. You can be another type of fish (mackerel, cod...) or something else which is not a fish at all (e.g. person). See figure 2 for a set theory illustration.

        Figure 2: salmon and tuna are mutually exclusive, but not the only options in the world

        Similarly, George probably had in mind that tuna and chicken salad are mutually exclusive options for sandwich fillings. He was probably right; a tuna-chicken salad sandwich sounds awful. But since there are other options for sandwich fillings (peanut butter, jelly, peanut butter and jelly...), these two can hardly be considered as antonyms, even if we define antonyms as complements within a restricted set of entities in the world (e.g. fish, sandwich fillings). I suggest the "it's a bird, it's a plane, it's superman" binary test for antonymy: if you have more than two options, it's not antonymy!
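The distinction between figure 1 and figure 2 can be sketched with Python sets. This is only an illustration: each word is assumed to denote a set of entities in some world U, and the entity names and helper functions (`are_mutually_exclusive`, `are_complements`) are made up.

```python
def are_mutually_exclusive(x, y):
    """No entity belongs to both x and y (figures 1 and 2)."""
    return x.isdisjoint(y)

def are_complements(x, y, universe):
    """x and y are mutually exclusive AND together cover the whole world (figure 1 only)."""
    return x.isdisjoint(y) and (x | y) == universe

# Invented world of individual fish; each word denotes a set of them.
fish = {"tuna_1", "tuna_2", "salmon_1", "mackerel_1", "cod_1"}
tuna = {"tuna_1", "tuna_2"}
salmon = {"salmon_1"}

# A world with only two options, as in figure 1.
two_option_world = {"a", "b", "c"}
x = {"a"}
y = {"b", "c"}

print(are_mutually_exclusive(tuna, salmon))    # mutually exclusive
print(are_complements(tuna, salmon, fish))     # but not the only options
print(are_complements(x, y, two_option_world))
```

Under this reading, the binary test amounts to requiring `are_complements` rather than just `are_mutually_exclusive`, within whatever restricted world (fish, sandwich fillings) is chosen.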

        Wanted Dead or Alive (complementary antonyms)
        What about black and white? These are two colors out of a wide range of colors in the world, failing the bird-plane-Superman test. However, if we narrow our world down to people's skin colors, these two may be considered as antonyms.

        Other examples of complementary antonyms are day and night, republicans and democrats, dead and alive, true and false, stay and go. As you may have noticed, they can be of different parts of speech (noun, adjective, verb), but the two words within each pair share the same part of speech (comment if you can think of a negative example!).

        Figure 3: Should I stay or should I go now?

        So are we cool with complementary antonyms? Well, not quite. If you say that female and male are complementary antonyms, people might tell you that gender is not binary, but a spectrum. Some of these antonyms actually have other, uncommon or hidden options, like in a coma for the dead and alive pair, or libertarians in addition to republicans and democrats. Still, these pairs are commonly considered antonyms, since there are two main options.

        So what have we learned about complementary antonyms? That they are borderline, they depend on the context in which they occur, and they might be offensive to minorities. Use them with caution.

        The Good, the Bad [and the Ugly?] (graded antonyms)
        Even the strictest definition of antonymy includes pairs of gradable adjectives representing the two ends of a scale. Some examples are hot and cold, fat and skinny, young and old, tall and short, happy and sad. Set theory and my binary test aren't suitable for these types of antonyms.

        Set theory isn't adequate because a gradable adjective can't be represented as a set, e.g. "the set of all tall people in the world". The definition of a graded adjective changes depending on the context and is very subjective. For example, I'm relatively short, so everyone looks tall to me, while my husband is much taller than me, so he is more likely to say someone is short. The set of tall people in the world changes according to the person who defines it.

        In addition, by definition, testing for binarism fails. A cup of coffee can be more than just hot or cold. It can be boiling, very hot, hot, warm, cool, cold or freezing. And we can add more and more discrete options to the scale of coffee temperature.

        What makes specific pairs of gradable adjectives into antonyms? While the definition requires that they be at the ends of the scale, intuitively I would say that they should only be symmetric on the scale, e.g. hot and cold, boiling and freezing, warm and cool, but not hot and freezing.

        Antonymy in NLP
        While there is a vast linguistics literature about antonyms, I'm less familiar with it, and I'm going to focus on some observations and interesting points about antonymy that appear in NLP papers that I read.

        The natural logic formulation ([1]) makes a distinction between "alternation" - words that are mutually exclusive, and "negation" - words that are both mutually exclusive and cover all the options in the world. While I basically claimed in this post that the former is not antonymy, we've seen that in some cases, if the two words represent the two main options, they may be considered as antonyms.

        However, people tend to disagree on these borderline word pairs, so sometimes it's easier to conflate them under a looser definition. For example, [2] had an annotation task in which they asked crowdsourcing workers to choose the semantic relation that holds for a pair of terms. They followed the natural logic relations, but decided to merge "alternation" and "negation" into a weaker notion of "antonyms".

        More interesting observations about antonyms, and references to linguistic papers, can be found in [3], [4], and [5].

        Now that we've established that humans find it difficult to decide whether two words are antonyms, you must be wondering whether automatic methods can do reasonably well on this task. There has been a lot of work on antonymy identification (see the papers in the references, and their related work sections). I will focus on my own limited experience with antonyms. We've just published a new paper ([6]) in which we analyze the roles of two main information sources used for automatic identification of semantic relations. The task is defined as follows: given a pair of words (x, y), determine the semantic relation that holds between them, if any (e.g. synonymy, hypernymy, antonymy, etc). As in this post, we've used information from x and y's joint occurrences in a large text corpus, as well as information about the separate occurrences of each word x and y. We found that among all the semantic relations we tested, antonymy was almost the hardest to identify (only synonymy was worse).

        The use of information about separate occurrences of x and y is based on the distributional hypothesis, which I've mentioned several times in this blog. Basically, if we look at the distribution of neighboring words of a word x, it may tell us something about the meaning of x. If we'd like to know what's the relation between x and y, we can compute something on top of the neighbor distributions of each word. For example, we can expect the distributions of x and y to be similar if x and y are antonyms, since one of the properties of antonyms is that they are interchangeable (a word can be replaced with its antonym and the sentence will remain grammatical and meaningful). Think about replacing tall with short, day with night, etc. The problem is that this is similarly true for synonyms - you can expect high and tall to also appear with similar neighboring words. So basing the classification on distributional information may lead to confusing antonyms with synonyms.

        The joint occurrences may help identify the relation that holds between the words in a pair, as some patterns indicate a certain semantic relation - for instance, "x is a type of y" may indicate that y is a hypernym of x. The problem is that patterns that are indicative of antonymy, such as "either x or y" (either cold or hot) and "x and y" (day and night), may also be indicative of co-hyponymy (either tuna or chicken salad). In any case, this seems far less bad than confusing antonyms with synonyms; in some applications it may suffice to know that x and y are mutually exclusive, with no importance to whether they are antonyms or co-hyponyms. For instance, when you query a search engine, you'd like it to retrieve results including synonyms of your search query (e.g. returning New York City subway map when you search for NYC subway map), but you wouldn't want it to include mutually exclusive words (e.g. Tokyo subway map).
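As a rough illustration of the pattern idea (and not the path-based method of [6], which is more involved), the sketch below checks a word pair against three hand-written surface patterns taken from the examples above. The pattern list, the `matched_patterns` helper, and the three-sentence corpus are all assumptions made for this example.

```python
import re
from collections import Counter

# Hand-written surface patterns; {x} and {y} are slots for the word pair.
PATTERNS = ["either {x} or {y}", "{x} and {y}", "{x} is a type of {y}"]

def matched_patterns(x, y, sentences):
    """Count, per pattern, the sentences in which x and y occur in that pattern."""
    counts = Counter()
    for template in PATTERNS:
        phrase = template.format(x=re.escape(x), y=re.escape(y))
        regex = re.compile(rf"\b{phrase}\b")
        for sentence in sentences:
            if regex.search(sentence.lower()):
                counts[template] += 1
    return counts

# Invented toy corpus.
corpus = [
    "The coffee is either cold or hot.",
    "He ordered either tuna or chicken salad.",
    "Tuna is a type of fish.",
]

print(matched_patterns("cold", "hot", corpus))
print(matched_patterns("tuna", "chicken salad", corpus))
print(matched_patterns("tuna", "fish", corpus))
```

The antonym pair (cold, hot) and the co-hyponym pair (tuna, chicken salad) trigger exactly the same pattern, which is the confusion described above, while the hypernymy pattern cleanly separates (tuna, fish).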

        One last thing to remember is that these automatic methods are trained and tested on data collected from humans. If we can't agree on what's considered antonymy, we can't expect these automatic methods to succeed in this any better than we do.


        [1] Natural Logic for Textual Inference. Bill MacCartney and Christopher D. Manning. RTE 2007.
        [2] Adding Semantics to Data-Driven Paraphrasing. Ellie Pavlick, Johan Bos, Malvina Nissim, Charley Beller, Benjamin Van Durme, and Chris Callison-Burch. ACL 2015.
        [3] Computing Word-Pair Antonymy. Saif Mohammad, Bonnie Dorr and Graeme Hirst. EMNLP 2008.
        [4] Computing Lexical Contrast. Saif Mohammad, Bonnie Dorr, Graeme Hirst, and Peter Turney. CL 2013.
        [5] Taking Antonymy Mask off in Vector Space. Enrico Santus, Qin Lu, Alessandro Lenci, Chu-Ren Huang. PACLIC 2014.
        [6] Path-based vs. Distributional Information in Recognizing Lexical Semantic Relations. Vered Shwartz and Ido Dagan. CogALex 2016.

        Saturday, November 12, 2016

        Question Answering

        In my introductory post about NLP I introduced the following survey question: when you search something in Google (or any other search engine of your preference), is your query:
        (1) a full question, such as "What is the height of Mount Everest?"
        (2) composed of keywords, such as "height Everest"

        I never published the results since, as I suspected, there were too few answers to the survey, and they were probably not representative of the entire population. However, my intuition back then was that only older people are likely to search with a grammatical question, while people with some knowledge in technology would use keywords. Since then, my intuition was somewhat supported by (a) this lovely grandma that added "please" and "thank you" to her search queries, and (b) this paper from Yahoo Research that showed that search queries with question intent do not form fully syntactic sentences, but are made of segments (e.g. [height] [Mount Everest]). 

        Having said that, searching the web to get an answer to a question is not quite the same as actually asking the question and getting a precise answer:

        Here's the weird thing about search engines. It was like striking oil in a world that hadn't invented internal combustion. Too much raw material. Nobody knew what to do with it. 
        Ex Machina

        It's not enough to formulate your question in a way that the search engine will have any chance of retrieving relevant results. Now you need to process the returned documents and search for the answer. 

        Getting an answer to a question by querying a search engine is not trivial; I guess this is the reason so many people ask questions in social networks, and some other people insult them with Let me Google that for you.

        The good news is that there are question answering systems, designed to do exactly that: automatically answer a question given as input; the bad news is that like most semantic applications in NLP, it is an extremely difficult task, with limited success. 

        Question answering systems have been around since the 1960s. Originally, they were developed to support natural language queries to databases, before web search was available. Later, question answering systems were able to find and extract answers from free text.

        A successful example of a question answering system is IBM Watson. Today Watson is described by IBM as "a cognitive technology that can think like a human", and is used in many of IBM's projects, not just for question answering. Originally, it was trained to answer natural logic questions -- or more precisely, to form the correct question to a given answer, as in the television game show Jeopardy. In February 2011, Watson competed in Jeopardy against former winners of the show, and won! It had access to millions of web pages, including Wikipedia, which were processed and saved before the game. During the game, it wasn't connected to the internet (so it couldn't use a search engine, for example). The Jeopardy video is pretty cool, but if you have no patience for watching it all (I understand you...), here's a highlight:

        HOST: This trusted friend was the first non-dairy powdered creamer. Watson?
        WATSON: What is milk?
        HOST: No! That wasn’t wrong, that was really wrong, Watson.

        Another example is the personal assistants: Apple's Siri, Amazon's Alexa, Microsoft's Cortana, and Google Assistant. They are capable of answering an impressively wide range of questions, but it seems they are often manually designed to answer specific questions.

        So how does question answering work? I assume that each question answering system employs a somewhat different architecture, and some of the successful ones are proprietary. I'd like to present two approaches. The first is a general architecture for question answering from the web, and the second is question answering from knowledge bases.

        Question answering from the web

        I'm following a project report I submitted to a course 3 years ago, in which I exemplified this process on the question "When was Mozart born?". This example was originally taken from some other paper, which is hard to trace now. Apparently, it is a popular example in this field.

        The system performs the following steps:

        A possible architecture for a question answering system. 
        • Question analysis - parse the natural language question, and extract some properties:

          • Question type - mostly, QA systems support factoid questions (a question whose answer is a fact, as in the given example). Other types of questions, e.g. opinion questions, will be discarded at this point.

          • Answer type - what is the type of the expected answer, e.g. person, location, date (as in the given example), etc. This can be inferred with simple heuristics using the WH-question word, for example who => person, where => location, when => date. 

          • Question subject and object - can be extracted easily by using a dependency parser. These can be used in the next step of building the query. In this example, the subject is Mozart.

        • Search - prepare the search query, and retrieve documents from the search engine. The query can be an expected answer template (which is obtained by applying some transformation to the question), e.g. "Mozart was born in *". Alternatively, or in case the answer template retrieves no results, the query can consist of keywords (e.g. Mozart, born).

          Upon retrieving documents (web pages) that answer the query, the system focuses on certain passages that are more likely to contain the answer ("candidate passages"). These are usually ranked according to the number of query words they contain, their word similarity to the query/question, etc.

        • Answer extraction - try to extract candidate answers from the candidate passages. This can be done by using named entity recognition (NER) that identifies in the text mentions of people, locations, organizations, dates, etc. Every mention whose entity type corresponds to the expected answer type is a candidate answer. In the given example, any entity recognized as DATE in each candidate passage will be marked as a candidate answer, including "27 January 1756" (the correct answer) and "5 December 1791" (Mozart's death date).

          The system may also keep some lists that can be used to answer closed-domain questions, such as "which city [...]" or "which color [...]" that can be answered using a list of cities and a list of colors, respectively. If the system identified that the answer type is color, for example, it will search the candidate passage for items contained in the list of colors. In addition, for "how much" and "how many" questions, regular expressions identifying numbers and measures can be used.

        • Ranking - assign some score for each candidate answer, rank the candidate answers in descending order according to their scores, and return a list of ranked answers. This phase differs between systems. The simple approach would be to represent an answer by some characteristics (e.g. surrounding words) and learn a supervised classifier to rank the answers.

          An alternative approach is to try to "prove" the answer logically. In the first phase, the system creates an expected answer template. In our example it would be "Mozart was born in *". By assigning the candidate answer "27 January 1756" to the expected answer template, we get the hypothesis "Mozart was born in 27 January 1756", which we would like to prove from the candidate passage. Suppose that the candidate passage was "[...] Wolfgang Amadeus Mozart was born in Salzburg, Austria, in January 27, 1756. [...]". A person would know that, given the candidate passage, the hypothesis is true; therefore, this candidate answer should be ranked high.

          To do this automatically, Harabagiu and Hickl ([1]) used a textual entailment system: the system receives two texts and determines whether the truth of the first one (the text) implies that the second one (the hypothesis) is also true. Some of these systems return a number indicating to what extent this holds. This number can be used for ranking answers.

          While this is a pretty cool idea, the unfortunate truth is that textual entailment systems do not perform better than question answering systems, nor do they perform very well in general. So reducing the question answering problem to that of recognizing textual entailment doesn't really solve question answering.
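The steps above can be sketched in a few lines of Python. This is only a toy illustration, not the architecture of any real system: the WH-word mapping, the regular expression that stands in for a real NER system, and the keyword-overlap score are all simplifying assumptions, and the function names are made up for this example.

```python
import re
from collections import Counter

# Assumed mapping from WH-question word to expected answer type.
ANSWER_TYPES = {"who": "PERSON", "where": "LOCATION", "when": "DATE"}

# A toy DATE recognizer standing in for a real NER system.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_PATTERN = re.compile(
    rf"\b(?:\d{{1,2}} (?:{MONTHS}) \d{{4}}|(?:{MONTHS}) \d{{1,2}}, \d{{4}})\b"
)


def answer_type(question):
    """Question analysis: infer the answer type, or None if unsupported."""
    first_word = question.strip().split()[0].lower()
    return ANSWER_TYPES.get(first_word)


def extract_candidates(passage, expected_type):
    """Answer extraction: mentions whose type matches the expected type."""
    if expected_type == "DATE":
        return DATE_PATTERN.findall(passage)
    return []  # other answer types would need a real NER system


def rank_candidates(passages, expected_type, keywords):
    """Ranking: score a candidate by the query keywords in its passage."""
    scores = Counter()
    for passage in passages:
        overlap = sum(1 for k in keywords if k.lower() in passage.lower())
        for candidate in extract_candidates(passage, expected_type):
            scores[candidate] += overlap
    return [candidate for candidate, _ in scores.most_common()]


passages = [
    "Wolfgang Amadeus Mozart was born on 27 January 1756 in Salzburg.",
    "Mozart died on 5 December 1791 in Vienna.",
]
expected = answer_type("When was Mozart born?")  # "DATE"
print(rank_candidates(passages, expected, ["Mozart", "born"]))
# ['27 January 1756', '5 December 1791']
```

Both dates are extracted as candidates, and the birth date is ranked first only because its passage contains both query keywords. A real system would use a much richer scoring function.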

        Question answering from knowledge bases

        A knowledge base, such as Freebase/Wikidata and DBpedia, is a large-scale set of facts about the world in a machine-readable format. Entities are related to each other via relations, creating triplets like (Donald Trump, spouse, Melania Trump) and (idiocracy, instance of, film) (no association between the two facts whatsoever ;)). Entities can be people, books and movies, countries, etc. Example relations are birth place, spouse, occupation, instance of, etc. While these facts are saved in a format which is easy for a machine to read, I never heard of a human who searches for information in knowledge bases. Which is too bad, since they contain an abundance of information.

        So some researchers (e.g. [2], following [3]) came up with the great idea of letting people ask a question in natural language (e.g. "When was Mozart born?"), parsing the question automatically to relate it to a fact in the knowledge base, and answering accordingly.
        This reduces the question answering task to understanding the natural language question, whereas querying for the answer from a knowledge base requires no text processing. The task is called executable semantic parsing. The natural language question is mapped into some logic representation, e.g. Lambda calculus. For example, the example question would be parsed to something like λx.DateOfBirth(Mozart, x). The logical form is then executed against a knowledge base; for instance, it would search for a fact such as (Mozart, DateOfBirth, x) and return x. 
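To make the "execution" part concrete, here is a minimal sketch with a toy knowledge base of triples. The relation names and the single-pattern `parse` function are hypothetical stand-ins. A real semantic parser is learned from data and covers far more than one question pattern.

```python
# A toy knowledge base: a set of (subject, relation, object) triples.
KNOWLEDGE_BASE = {
    ("Mozart", "DateOfBirth", "27 January 1756"),
    ("Mozart", "DateOfDeath", "5 December 1791"),
    ("Mozart", "PlaceOfBirth", "Salzburg"),
}


def parse(question):
    """Hypothetical 'parser' covering only the pattern 'When was X born?'.

    Returns (relation, subject), i.e. the logical form λx.relation(subject, x).
    """
    words = question.rstrip("?").split()
    if (len(words) == 4 and words[0].lower() == "when"
            and words[1] == "was" and words[3] == "born"):
        return ("DateOfBirth", words[2])
    return None  # the hard part of the task is everything not covered here


def execute(logical_form, kb=KNOWLEDGE_BASE):
    """Return every x such that (subject, relation, x) is a fact in the KB."""
    relation, subject = logical_form
    return sorted(obj for subj, rel, obj in kb
                  if subj == subject and rel == relation)


print(execute(parse("When was Mozart born?")))  # ['27 January 1756']
```

Note how trivial `execute` is compared to what a real `parse` would have to handle.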

        Despite having the answer appear in a structured format rather than in free text, this task is still considered hard, because parsing a natural language utterance into a logical form is difficult.* 

        By the way, simply asking Google "When was Mozart born?" seems to take away my argument that "searching the web to get an answer to a question is not quite the same as actually asking the question and getting a precise answer":

        Google understands the question and answers precisely.

        Only that it doesn't. Google added this feature to its search engine in 2012, in which it presents information boxes above the regular search results, for some queries and questions. They parse the natural language query and try to retrieve results from their huge knowledge base, known as Google knowledge graph. Well, I don't know exactly how they do it, but I guess that similarly to the previous paragraph, their main effort is in parsing and understanding the query, which can then be matched against facts in the graph.

        [1] Methods for Using Textual Entailment in Open-Domain Question Answering. Sanda Harabagiu and Andrew Hickl. In ACL and COLING 2006.
        [2] Semantic Parsing on Freebase from Question-Answer Pairs. Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. In EMNLP 2013.
        [3] Learning to parse database queries using inductive logic programming. John M. Zelle and Raymond J. Mooney. In AAAI 1996.

        * If you're interested in more details, I recommend going over the materials from the very interesting ESSLLI 2016 course on executable semantic parsing, which was given by Jonathan Berant.

        Sunday, August 28, 2016

        Crowdsourcing (for NLP)

        Developing new methods to solve scientific tasks is cool, but they usually require data. We researchers often find ourselves collecting data rather than trying to solve new problems. I've collected data for most of my papers, but never thought of it as an interesting blog post topic. Recently, I attended Chris Biemann's excellent crowdsourcing course at ESSLLI 2016 (the 28th European Summer School in Logic, Language and Information), and was inspired to write about the topic. This blog post will be much less technical and much more high-level than the course, as my posts usually are. Nevertheless, credit for many interesting insights on the topic goes to Chris Biemann.1  

        Who needs data anyway?

        So let's start from the beginning: what is this data and why do we need it? Suppose that I'm working on automatic methods to recognize the semantic relation between words, e.g. I want my model to know that cat is a type of animal, and that wheel is a part of a car.

        At the very basic level, if I already developed such a method, I will want to check how well it does compared to humans. Evaluation of my method requires annotated data, i.e. a set of word pairs and their corresponding true semantic relations, annotated by humans. This will be the "test set"; the human annotations are considered as "gold/true labels". My model will try to predict the semantic relation between each word-pair (without accessing the true labels). Then, I will use some evaluation metric (e.g. precision, recall, F1 or accuracy) to see how well my model predicted the human annotations. For instance, my model would have 80% accuracy if for 80% of the word-pairs it predicted the same relation as the human annotators.
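As a small illustration of these metrics, here is how accuracy, and precision/recall/F1 for a single relation, could be computed. The five word-pair labels are made up for the example.

```python
def accuracy(gold, predicted):
    """Fraction of items for which the prediction equals the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def precision_recall_f1(gold, predicted, label):
    """Precision, recall and F1 of the predictions for one specific label."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == p == label)
    predicted_pos = sum(1 for p in predicted if p == label)
    gold_pos = sum(1 for g in gold if g == label)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up gold labels and model predictions for five word-pairs.
gold = ["type of", "part of", "type of", "unrelated", "part of"]
predicted = ["type of", "part of", "unrelated", "unrelated", "part of"]

print(accuracy(gold, predicted))  # 0.8
print(precision_recall_f1(gold, predicted, "type of"))
```

Here the model agrees with the annotators on 4 of 5 pairs (80% accuracy). For the relation "type of", precision is 1.0 (the single prediction was right) while recall is only 0.5 (one of the two gold pairs was missed).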

        Figure 1: an example of dataset entries for recognizing the semantic relation between words.
        If that was the only data I needed, I would have been lucky. You don't need that many examples to test your method. Therefore, I could select some word-pairs (randomly or using some heuristics), and annotate them myself, or bribe my colleagues with cookies (as I successfully did twice). The problem starts when you need training data, i.e., when you want your model to learn to predict something based on labelled examples. That usually requires many more examples, and annotating data is a very tiring and Sisyphean work.

        What should we do, then? Outsource the annotation process -- i.e., pay with real money, not cookies!

        What is crowdsourcing?

        The word crowdsourcing is a blend word composed of crowd (intelligence) + (out-)sourcing [1]. The idea is to take a task that can be performed by experts (e.g. translating a document from English to Spanish), and outsource it to a large crowd of non-experts (workers) that can perform it.

        The requester defines the task, and the workers work on it. The requester then decides whether to accept/reject the work and pays the workers (in case of acceptance).

        The benefits of using "regular" people rather than experts are:
        1. You pay them much less than experts - typically a few cents per question (/task). (e.g., [2] found that in translation tasks, the crowd reached the same quality as the professionals, with less than 12% of the costs).
        2. They are more easily available via crowdsourcing platforms (see below).
        3. By letting multiple people work on the task rather than a single/few experts, the task could be completed in a shorter time. 
        The obvious observation is that the quality of a worker is not as good as the expert; in crowdsourcing, it is not a single worker that replaces the expert, but the crowd. Rather than trusting a single worker, you assign each task to a certain number of workers, and combine their results. A common practice is to use majority voting. For instance, let's say that I ask 5 workers what is the semantic relation between cat and dog, giving them several options. 3 of them say that cat and dog are mutually exclusive words (e.g. one cannot be both a cat and a dog), one of them says that they are opposites, and one says that cat is a type of dog. The majority has voted in favor of mutually exclusive, and this is what I will consider as the correct answer.2
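Majority voting itself takes only a few lines. The sketch below returns `None` when there is a tie for first place. That is just one possible policy (an assumption of this example), and you could instead ask additional workers.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer, or None if empty or tied."""
    ranked = Counter(answers).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most common answers
    return ranked[0][0]


# The cat/dog example: five workers, three possible relations.
answers = ["mutually exclusive", "opposites", "mutually exclusive",
           "type of", "mutually exclusive"]
print(majority_vote(answers))  # mutually exclusive
```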

        The main crowdsourcing platforms (out of many others) are Amazon Mechanical Turk and CrowdFlower. In this blog post I will not discuss the technical details of these platforms. If you are interested in a comparison between the two, refer to these slides from the NAACL 2015 crowdsourcing tutorial.

        Figure 2: An example of a question in Amazon Mechanical Turk, from my project.

        What can be crowdsourced?

        Not all the data we need can be collected via crowdsourcing; some data may require expert annotation, e.g. if we need to annotate the syntactic trees of sentences in natural language, it's probably a bad idea to ask non-experts to do so.

        The rules of thumb for crowdsourcability are:
        • The task is easy to explain, and you as a requester indeed explain it simply. The key idea is to keep it simple. The instructions should be short, i.e. do not expect workers to read a 50-page manual. They don't get paid enough for that. The instructions should include examples.
        • People can easily agree on the "correct" answer, e.g. "is there a cat in this image?" is good, "what is the meaning of life?" is really bad. Everything else is borderline :) One thing to consider is the possible number of correct answers. For instance, if the worker should reply with a sentence (e.g. "describe the following image"), they can do so in so many ways. Always aim for one possible answer per question.
        • Each question is relatively small.
        • Bonus: the task is fun. Workers will do better if they enjoy the task. If you can think of a way to gamify your task, do so!
        Figure 3: Is there a cat in this image?

        Some tasks are borderline and may become suitable for crowdsourcing if presented in the right way to the workers. If the task at hand seems too complicated to be crowdsourced, ask yourself: can I break it into smaller tasks that can each be crowdsourced? For example, let workers write a sentence that describes an image, and accept all answers; then let other workers validate the sentences (ask them: does this sentence really describe this image?).

        Some examples for (mostly language-related) data collected with crowdsourcing
        (references omitted, but are available in the course slides in the link above).
        • Checking whether a sentence is grammatical or not.
        • Alignment of dictionary definitions - for instance, if a word has multiple meanings, and hence has multiple definitions in each dictionary - the task was to align the definitions corresponding to the same meaning in different dictionaries.
        • Translation.
        • Paraphrase collection - get multiple sentences with the same meaning. These were obtained by asking multiple workers to describe the same short video.
        • Duolingo started as a crowdsourcing project!
        • And so did reCAPTCHA!
        How to control for the quality of data?

        OK, so we collected a lot of data. How do we even know if it's good? Can I trust my workers to do well on the task? Could they be as good as experts? And what if they just want my money and are cheating on the task just to get easy money?

        There are many ways to control for the quality of workers:
        1. The crowdsourcing platforms provide some information about the workers, such as the number of tasks they completed in the past, their approval rate (% of their tasks that were approved), location, etc. You can define your requirements from the workers based on this information.
        2. Don't trust a single worker -- define that your task should be answered by a certain number of workers (typically 5) and aggregate their answers (e.g. by majority voting).
        3. Create control questions - a few questions for which you know the correct answer. These questions are displayed to the worker just like any other questions. If a worker fails to answer too many control questions, the worker is either not good or trying to cheat you. Don't use this worker's answers (and don't let the worker participate in the task anymore; either by rejecting their work or by blocking them).3
        4. Create a qualification test - a few questions for which you know the correct answer. You can require that any worker who wants to work on your task must take the test and pass it. As opposed to the control questions, the test questions don't have to be identical in format to the task itself, but should predict the worker's ability to perform the task well.
        5. Second-pass reviewing - create another task in which workers validate previous workers' answers. 
        6. Bonus the good workers - they will want to keep working for you.
        7. Watch out for spammers! Some workers are only after your money, and they don't take your task seriously, e.g. they will click on the same answer for all questions. There is no correlation between the number of questions workers answer and their quality; however, it is worth looking at the most productive workers: some of them may be very good (and you might want to give them bonuses), while some of them may be spammers.
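Control questions (item 3 above) are easy to apply once the answers are collected. The following is a minimal sketch: the data layout (a dictionary of answers per worker) and the 80% threshold are assumptions made for this example, not something the crowdsourcing platforms dictate.

```python
def control_accuracy(worker_answers, control_answers):
    """Fraction of the control questions the worker saw and got right."""
    seen = [q for q in control_answers if q in worker_answers]
    if not seen:
        return None  # this worker answered no control questions
    correct = sum(1 for q in seen if worker_answers[q] == control_answers[q])
    return correct / len(seen)


def trusted_workers(all_answers, control_answers, min_accuracy=0.8):
    """Workers whose accuracy on control questions reaches the threshold."""
    trusted = set()
    for worker, answers in all_answers.items():
        score = control_accuracy(answers, control_answers)
        if score is not None and score >= min_accuracy:
            trusted.add(worker)
    return trusted


# Made-up example: q1 and q2 are control questions, q3 is a real question.
control_answers = {"q1": "yes", "q2": "no"}
all_answers = {
    "worker_a": {"q1": "yes", "q2": "no", "q3": "yes"},
    "worker_b": {"q1": "yes", "q2": "yes", "q3": "yes"},  # always "yes"
}
print(trusted_workers(all_answers, control_answers))  # {'worker_a'}
```

Only the answers of the trusted workers would then be passed on to the aggregation step (e.g. majority voting).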
        Ethical issues in crowdsourcing

        As a requester, you need to make sure you treat your workers properly. Always remember that workers are first of all people. When you consider how much to pay or whether to reject a worker's work, think of the following:

        • Many workers rely on crowdsourcing as their main income. 
        • They have no job security.
        • Rejection in some cases is unfair - even if the worker was bad in the task, they still spent time working (unless you are sure that they are cheating).
        • New workers do lower-paid work to build up their reputation, but underpaying is not fair and not ethical.
        • Are you sure you explained the task well? Maybe it is your fault if all the workers performed badly?
        The good news is that, from my little experience, paying well pays off for the requester too. If you pay enough (but not too much!), you get good workers that want to do the task well. When you underpay, the good workers don't want to work on your task - they can get better paying tasks. The time to complete the task will be longer. And if you are like me, the thought of underpaying your workers will keep you awake at night. So pay well :)4

        Important take-outs for successful crowdsourcing:
        • Work in small batches. If you have 10,000 questions, don't publish all at once. Try some, learn from your mistakes, correct them and publish another batch. Mistakes are bound to happen, and they might cost you good money!
        • Use worker errors to improve instructions (remember: it might be your fault).
        • KEEP. IT. SIMPLE.
        • Use quality control mechanisms.
        • Don't underpay!
        • Always expect workers to be sloppy. Repeat guidelines and questions and don't expect workers to remember them.
        • If your questions are automatically generated, use random order and try to balance the number of questions with each expected answer, otherwise workers will exploit this bias (e.g. if most word-pairs are unrelated, they will mark all of them as unrelated without looking twice).
        • Make workers' lives easier, and they will perform better. For instance, if you have multiple questions regarding the same word, group them together.
        • If you find a way to make your task more fun, do so!
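Two of the tips above, working in small batches and balancing/shuffling automatically generated questions, can be combined in one helper. This is a rough sketch under the assumption that each generated question comes with the answer you expect for it. The function name and the data layout are made up for illustration.

```python
import random
from collections import defaultdict


def balanced_shuffled_batch(questions, batch_size, seed=0):
    """Sample a batch with equally many questions per expected answer.

    `questions` is a list of (question, expected_answer) pairs. The batch
    is returned in random order, so workers cannot exploit the ordering.
    """
    if not questions:
        return []
    rng = random.Random(seed)
    by_answer = defaultdict(list)
    for question, expected in questions:
        by_answer[expected].append(question)
    per_answer = batch_size // len(by_answer)
    batch = []
    for expected, group in by_answer.items():
        rng.shuffle(group)
        batch.extend((q, expected) for q in group[:per_answer])
    rng.shuffle(batch)
    return batch


# Made-up pool in which most word-pairs are expected to be unrelated.
pool = [(f"pair_{i}", "unrelated") for i in range(8)]
pool += [("cat/animal", "related"), ("wheel/car", "related")]
batch = balanced_shuffled_batch(pool, batch_size=4)
print(len(batch))  # 4: two "unrelated" and two "related" questions
```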

        [1] Howe, Jeff. The rise of crowdsourcing. Wired magazine 14.6 (2006).
        [2] Omar F. Zaidan and Chris Callison-Burch. Crowdsourcing translation: professional quality from non-professionals. In ACL 2011.

        1 And I would also like to mention another wonderful crowdsourcing tutorial that I attended last year at NAACL 2015, which was given by Chris Callison-Burch, Lyle Ungar, and Ellie Pavlick. Unfortunately, at that time I had no personal experience with crowdsourcing, nor did I believe that my university would ever have a budget for that, so I made no effort to remember the technical details; I was completely wrong. A year later I published a paper about a dataset collected with crowdsourcing, on which I even got a best paper award  :) 
        2 For more sophisticated aggregation methods that assign weights to workers based on their quality, see MACE. 
        3 Blocking a worker means that they can't work on your tasks anymore. Rejecting a worker means that they are not paid for the work they have already done. As far as I know, it is not recommended to reject a worker, because then they write bad things about you in Turker Nation and nobody wants to work for you anymore. In addition, you should always give workers the benefit of the doubt; maybe you didn't explain the task well enough.
        4 So how much should you pay? First of all, not less than 2 cents. Second, try to estimate how long a single question takes and aim for an hourly pay of around 6 USD. For example, in this paper I paid 5 cents per question, which I've been told is the higher bound for such tasks.
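The arithmetic in footnote 4 can be written down as a tiny helper. The 30 seconds per question used below is only an assumed estimate for the sake of the example. Measure your own task before setting a price.

```python
def pay_per_question(seconds_per_question, hourly_rate=6.0, minimum=0.02):
    """Pay per question (in USD) that yields the target hourly rate."""
    pay = hourly_rate * seconds_per_question / 3600
    return max(minimum, pay)  # never go below the 2-cent minimum


# If a question takes about 30 seconds, 6 USD per hour means 5 cents each.
print(round(pay_per_question(30), 2))  # 0.05
```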

        Monday, June 20, 2016

        Linguistic Analysis of Texts

        Not long ago, Google released their new parser, oddly named Parsey McParseface. For a couple of days, popular media was swamped with announcements about Google solving all AI problems with their new magical software that understands language [e.g. 1, 2].

        Well, that's not quite what it does. In this post, I will explain the different steps applied in analyzing sentence structure. These are usually used as a preprocessing step for higher-level tasks that try to understand the meaning of sentences, e.g. intelligent personal assistants like Siri or Google Now.

        The following tools are traditionally used one after the other (also known as the "linguistic annotation/processing pipeline"). Generally speaking, the accuracy of available tools for the tasks in this list is in decreasing order. Some low-level tasks are considered practically solved, while others still have room for improvement.
        1. Sentence splitting - as simple as it sounds: receives a text document/paragraph and returns its partition to sentences. While it sounds like a trivial task -- cut the text on every occurrence of a period -- it is a bit trickier than that; sentences can end with an exclamation / question mark, and periods are also used in acronyms and abbreviations in the middle of the sentence. The simple period rule will fail on this text, for example. Still, sentence splitting is practically considered a solved task, using predefined rules and some learning algorithms. See this for more details.

        2. Tokenization - a tokenizer receives a sentence and splits it to tokens. Tokens are mostly words, but words that are short forms of negation or auxiliaries are split to two tokens, e.g. I'm => I 'm, aren't => are n't.

        3. Stemming / Lemmatization - Words appear in natural language in many forms, for instance, verbs have different tense suffixes (-ing, -ed, -s), nouns have plurality suffixes (s), and adding suffixes to words can sometimes change their grammatical categories, as in nation (noun) => national (adjective) => nationalize (verb).
          The goal of both stemmers and lemmatizers is to "normalize" words to their common base form, such as "cats" => "cat", "eating" => "eat". This is useful for many text-processing applications, e.g. if you want to count how many times the word cat appears in the text, you may also want to count the occurrences of cats.
          The difference between these two tools is that stemming removes the affixes of a word, to get its stem (root), which is not necessarily a word on its own, as in driving => driv. Lemmatization, on the other hand, analyzes the word morphologically and returns its lemma. A lemma is the form in which a word appears in the dictionary (e.g. singular for nouns as in cats => cat, infinitive for verbs as in driving => drive).
          Using a lemmatizer is always preferred, unless there is no accurate lemmatizer for that language, in which case a stemmer is better than nothing.

        4. Part of speech tagging - receives a sentence, and tags each word (token) with its part of speech (POS): noun, verb, adjective, adverb, preposition, etc. For instance, the following sentence: I'm using a part of speech tagger is tagged in Stanford Parser as:
          I/PRP 'm/VBP using/VBG a/DT part/NN of/IN speech/NN tagger/NN ./. Which means that I is a personal pronoun, 'm (am) is a verb, non-3rd person singular present, and if you're interested, here's the list to interpret the rest of the tags.
          (POS taggers achieve around 97% accuracy).

        5. Syntactic parsing - analyzes the syntactic structure of a sentence, outputting one of two types of parse trees: constituency-based or dependency-based.

          Constituency - segments the sentence into syntactic phrases: for instance, in the sentence the brown dog ate dog food, [the brown dog] is a noun phrase, [ate dog food] is a verb phrase, and [dog food] is also a noun phrase.
          An example of constituency parse tree, parsed manually by me and visualized using syntax tree generator.
          Dependency - connects words in the sentence according to their relationship (subject, modifier, object, etc.). For example, in the sentence the brown dog ate dog food, the word brown is a modifier of the word dog, which is the subject of the sentence. I've mentioned dependency trees in the previous post: I used them to represent the relation that holds between two words, which is a common use.
          (Parsey McParseface is a dependency parser. Best dependency parsers achieve around 94% accuracy).

        An example of dependency parser output, using Stanford Core NLP.
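To see why even the first two steps of the pipeline are not entirely trivial, here is a toy sketch in plain Python. The abbreviation list and the tokenization pattern are simplistic assumptions made for illustration. Real sentence splitters and tokenizers use far more elaborate rules or learned models.

```python
import re

# Assumed, far-from-complete list of abbreviations that end with a period.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "Mr.", "Mrs."}


def naive_split(text):
    """The simple period rule: cut after every period followed by a space."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]


def rule_based_split(text):
    """Cut after '.', '!' or '?', unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word.endswith((".", "!", "?")) and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Split into tokens, separating contractions and punctuation marks."""
    return re.findall(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]", sentence)


text = "Tokens are mostly words, e.g. nouns and verbs. Is it trivial? Not quite."
print(naive_split(text))       # wrongly cuts after "e.g." and misses the "?"
print(rule_based_split(text))  # three sentences, as a human would split
print(tokenize("I'm using a part of speech tagger."))
# ['I', "'m", 'using', 'a', 'part', 'of', 'speech', 'tagger', '.']
```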

        Other tools, which are less basic, but are often used, include:
        • Named entity recognition (NER) - receives a text and marks certain words or multi-word expressions in the text with named entity tags, such as PERSON, LOCATION, ORGANIZATION, etc. 
          An example of NER from Stanford Core NLP.

        • Coreference resolution - receives a text and connects words that refer to the same entity (called "mentions"). This includes, but is not limited to:
          pronouns (he, she, I, they, etc.) - I just read Lullaby. It is a great book.
          different names / abbreviations - e.g., the beginning of the text mentions Barack Obama, which is later referred to as Obama.
          semantic relatedness - e.g. the beginning of the text mentions Apple which is later referred to as the company.

          This is actually a tougher task than the previous ones, and accordingly, it achieves less accurate results. In particular, sometimes it is difficult to determine which entity a certain mention refers to (while it's easy for a human to tell): e.g. I told John I don't want Bob to join dinner, because I don't like him. Who does him refer to?
          Another thing is that it is very sensitive to context, e.g. in one context apple can be co-referent with the company, while in another, that discusses the fruit, it is not true.

        • Word sense disambiguation (WSD) - receives a text and decides on the correct sense of each word in the given context. For instance, if we return to the apple example, in the sentence Apple released the new iPhone, the correct sense of apple is the company, while in I ate an apple after lunch the correct sense is the fruit. Most WSD systems use WordNet for the sense inventory.

        • Entity linking - and in particular, Wikification: receives a text and links entities in the text to the corresponding Wikipedia articles. For instance, in the sentence 1984 is the best book I've ever read, the word 1984 should be linked to the article about the book (rather than to the articles discussing the films / TV shows).
          Entity linking can complement word sense disambiguation, since most proper names (as Apple or 1984) are not present in WordNet.

        • Semantic role labeling (SRL) - receives a sentence and detects the predicates and arguments in the sentence. A predicate is usually a verb, and each verb may have several arguments, such as agent / subject (the person who does the action), theme (the person or thing that undergoes the action), instrument (what was used for doing the action), etc. For instance, in the sentence John baked a cake for Mary, the predicate is bake, and the arguments are agent:John, theme:cake, and goal:Mary. This is not just the final task in my list: it is the task which is the closest to understanding the semantics of a sentence.

        Here is an example for (a partial) analysis of the sentence: The brown dog ate dog food, and now he is going to sleep, using Stanford Core NLP:

        Analysis of The brown dog ate dog food, and now he is going to sleep, using Stanford Core NLP.
        All this effort, and we are not even yet talking about deep understanding of the sentence meaning, but rather analyzing the sentence structure, perhaps as a step toward understanding its meaning. As my previous posts show, it's hard as it is to understand a single word's meaning. In one of my next posts I will describe methods that deal with the semantics of a sentence.

        By the way, if you are potential users of these tools, and you are looking for a parser, Google's parser is not the only one available. BIST is more accurate and faster than Parsey McParseface, and spaCy is slightly less accurate, but much faster than both.