Neural Networks

take the SAT

by Sameer Rai Bhatnagar

Sameer Rai Bhatnagar is a PhD candidate at Polytechnique School of Montreal in Software engineering

Sameer Rai Bhatnagar is a PhD candidate at Polytechnique School of Montreal in Software engineering

American High School students must take the dreaded Scholastic Aptitude Test (SAT). Their grade on this nationaly standardized test will determine which univeristies they get admission to. There are three parts to this test: Math, Writing, and Reading. In the Reading section there are analogy questions such "King is to Queen, as Man is to 'x' ".

One could argue that such analogy tasks represent a true understanding of human language. While there are different definitions and technical requirements for true Artificial Intelligence (AI), if a computer system could reason out such analogies, it would definitely be a step in the right direction (towards AI). But how could a computer learn to reason like this?

Distributional Semantics is a sub-field of Natural Language Processing, where the meaning of words, are repesented by vectors which capture their distribution in some corpus of text. For example, we start with the British National Corpus, a collection 100 million words of spoken and written english from various sources. (The goal of such a corpus is to represent the state of the english language in the 20th century.)

Next, we build a very large matrix where each word is a column, and each row represents a particular context (i.e. a sentence, a document, a sequence of three words, etc.). The entries of the matrix represent how many times the word of this column, appeared in the context of that row. One would expect such a matrix to be quite sparse (i.e with many zeros), since there are many infrequent words, which only appear in certain contexts.

Up until 2012, a classic approach to reducing the dimension of such a matrix to a tractable (and useful) size, would be using a mathematical operation called Singular value Decomposition, a factorization that decomposes the variance into latent dimensions, and (hopefully) filters out the noise. This approach has been useful for many applications, most importantly information retrieval in web search engines.

In 2012, Tomas Mikolov, from the University of Brno in the Czech Republic, used a two-layer neural network model to derive the 600-dimensional word vectors (or "word embeddings"). Given that each of these vectors is such an abstracted representation (just a sequence of numbers), he was curious to put these vectors to the test: could they take the SAT?

If we represent words as vectors, the analogy task described above, can be formulated as a mathematical operation. When we ask "King is to Queen, as Man is to'x'  ", one could conceptualize this as "King + Queen = Man + 'x'". Doing some algebra, our unknown should be the result of " 'x'= King - Man + Queen". As humans, this makes intuitive sense: take your concept of King, remove all notion of being Male, while keeping the royal nature of Queens, and this is your result.

The surprising and fascinating result? When taking these vectors, these abstract sequences of numbers, and doing these algebraic operations (i.e subtracting the vector of 'Man' from that of 'King', and adding back 'Queen'), the resulting vector was the one that had already independantly been derived from the model, for the word 'Woman'! His word embeddings were also able to answer similar analogy questions, that required geographical knowdge, such as "Paris is to France as 'x' is to Italy" (x=Rome)!

The implications of this work were felt all around the research and industrial community. Here we had a system that could reason about a certain facet of language, the way humans do! (at least like American High School students). His system is now readily available, and is known as word2vec.  Mikolov has since since been hired at Google Research.