Using Explicit Semantic Analysis to discover meaningful relatedness

In 1965 the world changed forever. Engelbart and English created a pointing device at the Stanford Research Institute, attached a wire to its rear end, and decided to call it a mouse since it looked like one. The mouse went on to revolutionize the computer world. Perhaps less obviously, it also had a huge effect on the word mouse itself, because its semantic nature changed.

 

Semantics deals with the meaningful relationships between words or concepts. Those relationships can and often do change – sometimes overnight (as when the mouse was invented), but more often through a slow process. Speakers are often unaware of that process since it may span multiple generations. For example, the word ‘cute’ used to mean ‘keenly perceptive’. Somewhat less obviously, many of the words we use to describe a temporal situation (‘after’, ‘before’, etc.) evolved from a directional meaning.

 

One active area of research here at Oversee.net focuses on the mapping between keywords and domain names. We want to enrich the user experience by displaying links that are meaningful to our visitors. To that end, the question we are trying to answer is:

 

Given a domain name, which keywords are meaningfully related to it?

 

With respect to our initial example, prior to 1965 we would have liked to map mouse to (fictional) domains like rodentsunited.net or disneyland.ie. But today we also want to make sure to capture domains like computerequipment.net. One of the perhaps lesser-known algorithms for performing this kind of semantic mapping is Explicit Semantic Analysis.

 

Explicit Semantic Analysis

Explicit Semantic Analysis (ESA) is a data-based approach to finding similarities between two text documents. The basic idea is that two documents are similar if the most important words in document A are strongly semantically related to the most important words in document B. This definition brings up a couple of questions.

Q: What makes a word important?

A: Suppose you have a corpus of 1,000 documents. A word is important for a given document if it occurs relatively often in that document while the same cannot be said for its occurrence in the remaining 999 documents. For example, the word ‘and’ may occur very often in a document. But it also occurs quite often in the majority of the other documents in the corpus, so ‘and’ is not an important word for any single document. On the other hand, if the word ‘motorbike’ occurs a couple of times in, say, ten documents, then that word is important for those ten documents.

The importance of a word for a document in a given corpus can be calculated using tf-idf (term frequency – inverse document frequency). It scores the importance of words based on the concept I just described. A normalized tf-idf value ranges from 0 (meaning the word is completely useless for describing its document’s content) to 1 (meaning the word is highly useful for describing its document’s content).
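
To make this more concrete, here is a minimal Python sketch of one common way to compute tf-idf for a small corpus. The function name, the log-based idf, and the normalization step are illustrative choices of mine, not details taken from the article this post is based on.

import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of documents, each a list of word tokens.
    Returns one {word: tf-idf} dict per document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for doc in corpus for word in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        raw = {w: (count / len(doc)) * math.log(n_docs / df[w])
               for w, count in tf.items()}
        # Normalize so the scores fall between 0 and 1, as described above
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        scores.append({w: v / norm for w, v in raw.items()})
    return scores

docs = [["motorbike", "racing", "and", "track"],
        ["history", "and", "culture"],
        ["motorbike", "and", "history"]]
print(tf_idf(docs)[0])  # 'and' appears everywhere, so its score is 0
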

Q: What does it mean for two words to have a strong semantic relation?

A: It basically means that those two words are similar in meaning. For example, motorbike and car are related in meaning, as are motorbike and racing (the second example goes to show that a noun can be semantically related to a verb or any other part of speech, for that matter). You can think of words as having features such as ‘involves or enables movement’. The more features two words share, the more similar they are in meaning.

Note that when you check two documents for semantic relatedness, in theory they don’t have to share a single word. One document may describe motorbikes while the other describes the history of Formula One car racing. While they use different vocabulary, in terms of their positions in semantic space they are pretty close to one another.

But how does one actually calculate semantic relatedness between two words? There are many different ways to do it, but most strategies involve mapping the words into some kind of knowledge representation (also often referred to as knowledge space or feature space) and then calculating the distance between the words in that representation.

Using Wikipedia as knowledge representation

Most of the details described in this blog are based on this article. The following figure illustrates how ESA works:

ESA overview

In the beginning there was Wikipedia, an expansive and free semi-structured representation of world knowledge. It consists of a set of articles, and each article represents a concept of that knowledge representation, i.e. a node in a network of information. It’s a network because concepts are linked – by the categories they belong to, by the links that point from one concept to another, and – most importantly for our purposes – by the use of similar words.

Concepts are described using words and sentences. Words tend to be re-used to describe similar concepts. For example, a number of concepts related to racing may contain the words motorbike and racing. We can say motorbike and racing are semantically related because their patterns of occurrence in our knowledge representation are similar. Now instead of comparing just two words, we can also compare two documents – a document is simply a collection of words. The advantage of comparing documents rather than single words is that we can be more confident in our mapping. We say documents A and B are related in meaning if the distribution of active concepts in our knowledge network for document A is similar to the distribution of active concepts for document B.

Calculating semantic relatedness using cosine similarity

Let’s walk through the figure above. For each concept (i.e. Wikipedia article) we calculate the tf-idf measure for any given word. At the end of this process our knowledge representation looks something like this:

CONCEPT_ID WORD TF-IDF
4234234 bike 0.523
4234234 history 0.012
6846534 track 0.263
6846534 bike 0.123

 

In this example, the word bike occurs in multiple concepts with different tf-idf values.
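
If we were to sketch that knowledge representation in code, it could simply be a nested mapping from concept IDs to word scores. The IDs and values below are the ones from the table; the variable name is just for illustration.

# tf-idf scores per Wikipedia concept, keyed by concept ID (values from the table above)
concept_word_tfidf = {
    4234234: {"bike": 0.523, "history": 0.012},
    6846534: {"track": 0.263, "bike": 0.123},
}
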

In order to speed things up, ESA uses an inverted index. Instead of concepts pointing to words, ESA employs a table where a word points to all its concept/value pairs. tf-idf values below a certain threshold (in this example, say x < 0.1) are filtered out:

WORD CONCEPT -> VALUE VECTOR
bike { ( 4234234 -> 0.523 ), ( 6846534 -> 0.123 ), … }
track { ( 6846534 -> 0.263 ), … }
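
Continuing the sketch, the inverted index could be built from the concept_word_tfidf mapping above by flipping the nesting and dropping low values. The 0.1 cutoff is just the example threshold mentioned above, and the helper name is mine.

from collections import defaultdict

def build_inverted_index(concept_word_tfidf, threshold=0.1):
    """Turn concept -> {word: tf-idf} into word -> {concept: tf-idf},
    dropping entries whose tf-idf falls below the threshold."""
    index = defaultdict(dict)
    for concept_id, word_scores in concept_word_tfidf.items():
        for word, score in word_scores.items():
            if score >= threshold:
                index[word][concept_id] = score
    return dict(index)

# With the mapping above this yields:
# {'bike': {4234234: 0.523, 6846534: 0.123}, 'track': {6846534: 0.263}}
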

 

ESA is now ready to compare documents. To that end, pick two documents A and B from a corpus of documents you want to compare. Create a tf-idf representation of each document. For example:

WORD DOCUMENT ID TF-IDF VALUE
bike A 0.021
bike B 0.321
track B 0.044

 

The last step is to represent documents A and B through the knowledge representation. This is done by creating an interpretation vector for each document. If a word w occurs in document A and also in the inverted index, then all concepts associated with w are added to the vector. The value of each vector element is the product of the document’s tf-idf value for w and the inverted index’s value for w in that concept; if several words of the document activate the same concept, their products are summed.

CONCEPT ID                          4234234                   6846534
INVERTED INDEX VALUE (bike)         0.523                     0.123
DOCUMENT A INTERPRETATION VECTOR    0.523 * 0.021 = 0.011     0.123 * 0.021 = 0.002
DOCUMENT B INTERPRETATION VECTOR    0.523 * 0.321 = 0.168     0.123 * 0.321 + 0.263 * 0.044 = 0.051

 

This gives us two vectors, one for each document. Each vector quantifies the importance of the concepts from our knowledge representation for its document. In a final step we can calculate the cosine similarity between those two interpretation vectors. It expresses the semantic relatedness between the two documents as a value between 0 (not related at all) and 1 (identical).
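
Putting the pieces together, here is a minimal sketch of these last two steps, assuming the small inverted index and the document tf-idf values from the tables above. The function and variable names are illustrative only.

import math
from collections import defaultdict

# Inverted index and document tf-idf values from the tables above
inverted_index = {
    "bike": {4234234: 0.523, 6846534: 0.123},
    "track": {6846534: 0.263},
}
doc_a = {"bike": 0.021}
doc_b = {"bike": 0.321, "track": 0.044}

def interpretation_vector(doc_tfidf, index):
    """Map a document's {word: tf-idf} onto the concept space: each concept
    accumulates the product of the document tf-idf and the index value."""
    vec = defaultdict(float)
    for word, doc_score in doc_tfidf.items():
        for concept_id, concept_score in index.get(word, {}).items():
            vec[concept_id] += doc_score * concept_score
    return dict(vec)

def cosine_similarity(a, b):
    """Semantic relatedness of two interpretation vectors, between 0 and 1."""
    dot = sum(a.get(c, 0.0) * b.get(c, 0.0) for c in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vec_a = interpretation_vector(doc_a, inverted_index)  # matches the DOCUMENT A row above
vec_b = interpretation_vector(doc_b, inverted_index)  # matches the DOCUMENT B row above
print(cosine_similarity(vec_a, vec_b))
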

Downsizing: from documents to phrases, words, and domain names

In its original form, ESA can be employed to calculate the relatedness of documents. Our goal at Oversee.net is different in that we attempt to calculate the relatedness between keywords, phrases, and domain names. To alleviate the issue of extreme data sparseness, we are researching a number of potential remedies, such as

  • finding content-rich domains whose names are similar to those in our set of domains, and using that content to apply ESA
  • ranking keywords based on their usefulness and grouping them, thereby creating a document-like structure
  • categorizing domains and applying ESA on a per-category basis

 
