What do computers see when we talk to them?

At Josh.ai, we’re building a product that lets you talk to your home. In an earlier blog post, I talked from a philosophical point of view about what it means for us to understand language. Computers are not things that can simply “learn like humans.” At least, not yet. That said, how is it possible that systems such as Google, Facebook’s M, and many others all appear to actually understand us?

To put it simply, we store the meanings of words and phrases as numbers. There are famous works, such as word2vec by Tomas Mikolov (now at Facebook), that use a shallow artificial neural network to calculate what we in the field like to call word embeddings. There are even techniques, such as Memory Networks, that can learn from a body of text and, in a limited capacity, answer questions about it. In later blog posts we’ll look at how some of these techniques work in more detail, but before we get there we need a solid understanding of the basics.

An example of a t-SNE word-embedding plot (photo credit).

In the last few years, image recognition and digital signal processing have seen explosive progress and fantastic results thanks to advances in artificial neural network algorithms. Only in the last few months have we started to see comparable leaps in the field of natural language processing. Why does understanding natural language seem to be so much harder?

Natural language processing is fundamentally a different problem with a different set of constraints.

Studying natural language is difficult for more reasons than we can count, whether because language is forever changing or because there are so many different languages. No matter what your view is, the biggest reason natural language is difficult to work with is that we don’t have enough of it.

Wait a minute, Aaron, what are you trying to tell me? We have text all over the place! What about books, blogs, even social media?!?

Calm down, reader. Take a breath and give me a chance to explain. You’re right, we have lots of sources of text, but even so, when we’re given a piece of text to analyze, it’s relatively little data. Look at the image below:

Image taken from: https://www.tensorflow.org/versions/master/tutorials/word2vec/index.html

When we analyze audio or image data, we actually have much more data than people seem to realize. For example, take the Canon 5D Mark III camera. Photographers will usually tell you this is a respectable camera, and at max resolution it takes pictures of 5760 x 3840 pixels. Each pixel holds 4 values, because each color pixel is represented with red, green, blue, and alpha values (RGBA), at one byte per channel. Stored uncompressed, that single image comes to 5760 x 3840 x 4 bytes, roughly 88.47 MB of data. In comparison, the complete works of Shakespeare take up only about 4.4 MB, since one character of text takes up just 1 byte.
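
As a rough back-of-the-envelope check in Python (the 4-bytes-per-pixel and 4.4 MB figures are the assumptions from above, not exact file sizes):

```python
# Compare one uncompressed photo against the complete works of Shakespeare.
width, height, bytes_per_pixel = 5760, 3840, 4   # Canon 5D Mark III at max resolution, RGBA
image_bytes = width * height * bytes_per_pixel
print(image_bytes / 1_000_000)                   # ~88.47 MB of pixel data

shakespeare_bytes = 4.4 * 1_000_000              # ~4.4 MB of plain text, 1 byte per character
print(image_bytes / shakespeare_bytes)           # roughly 20x more raw bytes in one photo
```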

This means two things:
1. When parsing text, each token (the NLP term for, roughly, a word) has a much larger effect on the body of information being analyzed.
2. When generating text, each token we choose has a much larger effect on the output.

In other words, a single character or token usually makes up a much larger fraction of a piece of text than a single pixel does of an image. Text is almost always considered sparse information.

So, how do we even start working with text? We have to represent it somehow, and we have to represent relationships between pieces of text somehow. Hard-coding a check for “Turn on the lights” makes for a very rigid parsing system; if someone rephrased that command as “Power on the lights,” most such systems would instantly break. The answer is that we need to represent text in a numerical format. Cue vector representations. For now we’re going to stick with very simple vector representations; in the future we’ll look at the more complicated ones. You know, the ones people actually use.

Let’s look at the two phrases we used above and add a few more:

  1. Turn on the lights
  2. Power on the lights
  3. What time is it?
  4. What is the current time?

Each human has their own lexicon based on what they’ve learned (photo credit).

What we want to do is create what’s called a lexicon. In other words, we want to build a vocabulary from the unique tokens in our phrases above. We’ll do some implicit pre-processing where we ignore casing and remove punctuation. Lastly, we’ll assign each unique word its own index in what becomes our lexicon (there’s a code sketch of this step right after the list below).

0 turn
1 on
2 the
3 lights
4 power
5 what
6 time
7 is
8 it
9 current
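
Here’s a minimal sketch of that lexicon-building step in Python; the phrases list and the tokenize helper are illustrative names, not part of any particular library:

```python
import string

phrases = [
    "Turn on the lights",
    "Power on the lights",
    "What time is it?",
    "What is the current time?",
]

def tokenize(phrase):
    """Lowercase, strip punctuation, and split on whitespace."""
    cleaned = phrase.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Assign each unique token its own index, in order of first appearance.
lexicon = {}
for phrase in phrases:
    for token in tokenize(phrase):
        if token not in lexicon:
            lexicon[token] = len(lexicon)

print(lexicon)
# {'turn': 0, 'on': 1, 'the': 2, 'lights': 3, 'power': 4,
#  'what': 5, 'time': 6, 'is': 7, 'it': 8, 'current': 9}
```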

Since each word maps to an index, we can use a reverse hash (a word-to-index dictionary) to look up the index of any word, and then represent each phrase as a series of booleans indicating whether or not each lexicon word appears in it. For example, our sample sentences would be represented as shown below (with a code sketch right after the list):

  1. { 1, 1, 1, 1, 0, 0, 0, 0, 0, 0 }
  2. { 0, 1, 1, 1, 1, 0, 0, 0, 0, 0 }
  3. { 0, 0, 0, 0, 0, 1, 1, 1, 1, 0 }
  4. { 0, 0, 1, 0, 0, 1, 1, 1, 0, 1 }
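
Continuing the sketch above, one way to produce these boolean vectors (the lexicon dictionary already acts as the reverse hash from word to index):

```python
def to_bow_vector(phrase, lexicon):
    """Return a 0/1 flag for each lexicon word: 1 if it appears in the phrase."""
    vector = [0] * len(lexicon)
    for token in tokenize(phrase):
        vector[lexicon[token]] = 1
    return vector

vectors = [to_bow_vector(p, lexicon) for p in phrases]
# [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
# [0, 0, 1, 0, 0, 1, 1, 1, 0, 1]
```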

Now that we have these vectors, we can get a crude similarity measure using a vector dot product. If you don’t remember what a dot product is, it’s simply the sum of the products of the corresponding values in two vectors.

Image taken from: https://en.wikipedia.org/wiki/Dot_product

The dot product between our first two samples would be:

1 * 0 = 0
1 * 1 = 1
1 * 1 = 1
1 * 1 = 1
0 * 1 = 0
0 * 0 = 0
0 * 0 = 0
0 * 0 = 0
0 * 0 = 0
0 * 0 = 0

When we sum these values, we get a total of 3. If we do the same thing but with our first and third examples we get:

1 * 0 = 0
1 * 0 = 0
1 * 0 = 0
1 * 0 = 0
0 * 0 = 0
0 * 1 = 0
0 * 1 = 0
0 * 1 = 0
0 * 1 = 0
0 * 0 = 0

This time we get a sum of 0, which means that phrases 1 and 3 (“Turn on the lights” and “What time is it?”) share no words at all; by this measure they aren’t similar in the slightest.
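
In code, the same crude similarity check boils down to a one-liner over the vectors from the earlier sketch:

```python
def dot(a, b):
    """Sum of the products of corresponding entries in two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

print(dot(vectors[0], vectors[1]))  # 3  ("Turn on the lights" vs. "Power on the lights")
print(dot(vectors[0], vectors[2]))  # 0  ("Turn on the lights" vs. "What time is it?")
```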

That about wraps it up for now. You’ll notice that our current lexicon is very small, at only 10 words. In practice, lexicons can number in the hundreds of thousands of words, which means that for a short 4-word phrase over a 100,000-word lexicon you’d have 4 ones and 99,996 zeros; hence the whole sparse vectors concept. Not only is this space-inefficient, but vector and matrix calculations at such sizes take a large amount of time to compute. In the future, when we explore semantic word embeddings, we’ll see that we can compress these vectors into something far smaller and use surrounding words to get even better context.
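
To make the sparsity point concrete, here’s a small illustration (the lexicon size and word indices are made up): instead of keeping the full dense vector, you could store only the positions of the ones.

```python
LEXICON_SIZE = 100_000

# Dense representation: 100,000 entries for a 4-word phrase.
dense = [0] * LEXICON_SIZE
for index in (17, 4_021, 58_333, 99_874):   # hypothetical indices of the phrase's 4 words
    dense[index] = 1

# Sparse representation: just the nonzero positions.
sparse = {i for i, value in enumerate(dense) if value}
print(len(dense), len(sparse))               # 100000 vs. 4
```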

This post was written by Aaron at Josh.ai. Previously, Aaron worked at Northrop Grumman before joining the Josh team, where he works on natural language processing (NLP) and artificial intelligence (AI). Aaron is a skilled YoYo expert, loves video games and music, has been programming since middle school, and recently turned 21.

Josh.ai is an AI agent for your home. If you’re interested in following Josh and getting early access to the beta, enter your email at https://josh.ai.

Like Josh on Facebook — http://facebook.com/joshdotai

Follow Josh on Twitter — http://twitter.com/joshdotai
