Semantic Search

Search engines have become a vital and lucrative part of the Web, and they are now commonplace tools for virtually every user of the Internet. Companies such as Google and Yahoo! have become household names.

As semantic technologies become more widely embraced, it is reasonable to ask for better search capabilities that can truly respond to detailed requests while reducing the number of irrelevant results. A semantic search engine seeks to find documents that have similar ‘concepts,’ not just similar ‘words.’ For a semantic search to be effective, however, the semantic network must contain a great deal of relevant information. At the same time, a large network requires the processing of many paths to find a solution. As a result, most semantic-based search engines suffer performance problems arising from the scale of a large semantic network.

Semantic search methods augment and improve traditional search results by using not just words, but meaningful concepts, and several major companies are seriously addressing the issue. There are two approaches to improving search results through semantic methods: (1) Latent Semantic Indexing (LSI) and (2) Semantic Web documents.

Latent Semantic Index Search

So far we have reviewed search technology in general and identified today’s search limitations. Now we will explore future technologies based upon semantics.

First, we will discuss Latent Semantic Indexing, which may improve today’s search capabilities without the extreme limitations of searching large semantic networks. Building on the criteria of precision, ranking, and recall requires more than brute force. Assigning descriptors and classifiers to a text provides an important advantage: it can return relevant documents that don't necessarily contain a verbatim match to the search query. Fully described data sets can also provide an image of the scope and distribution of the document collection as a whole. This can be accomplished by examining the structure of categories and subcategories, called a taxonomy.

A serious drawback to this approach to categorizing data is a problem inherent in any kind of taxonomy: the world sometimes resists categorization. For example, is a tomato a fruit or a vegetable? And what happens when we combine two document collections indexed in different ways? Solutions to these problems are called ontology taxonomies. Regular keyword searches take an all-or-nothing view of a document collection: either a document contains a given word or it doesn't.

Latent semantic indexing (LSI), first developed at Bellcore in the late 1980s, adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. Although the LSI algorithm doesn't understand anything about what the words mean, it notices the patterns. When you search an LSI-indexed database, the search engine looks at the similarity values it has calculated for every content word and returns the documents that it thinks best fit the query. Because two documents may be semantically very close even if they do not share a particular keyword, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don't contain the keyword at all.
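
As a rough illustration of the idea, the sketch below (Python, using only numpy, with a toy four-document corpus and two latent dimensions chosen purely for illustration) builds a term-document count matrix, factors it with a truncated singular value decomposition, and compares documents by cosine similarity in the reduced 'concept' space.

```python
import numpy as np

docs = [
    "the cat sat on the mat",
    "a cat chased a mouse",
    "stock prices rose on the market",
    "the market closed with prices up",
]

# Build a simple term-document count matrix (rows = terms, columns = documents).
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1

# Factor the matrix and keep only k latent "concept" dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document in concept space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two market-related documents land close together in concept space,
# even though they are not word-for-word matches.
print(cosine(doc_vecs[2], doc_vecs[3]))
print(cosine(doc_vecs[0], doc_vecs[2]))
```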

Searching for Content

A semantic search engine is a remarkably useful solution. It can discover that two documents are similar even if they have no specific words in common, and it can reject documents that share only uninteresting words. Latent semantic indexing looks at patterns of words within a set of documents. Natural language is full of redundancies, and not every word that appears in a document carries semantic meaning. Frequently used words in English often don't carry content: examples include function words, conjunctions, prepositions, and auxiliary verbs.

The first step in doing LSI, therefore, is culling these extraneous words from a document and reducing the remaining words to their root forms, a process called stemming.

Stemming

Some of the preparatory work needed to get documents ready for indexing, such as stemming, is very language-specific. For English documents, we use an algorithm called the Porter stemmer to remove common endings from words, leaving behind an invariant root form.
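
As a minimal sketch of that step, assuming the NLTK library (which provides an implementation of the Porter stemmer) is installed:

```python
from nltk.stem import PorterStemmer   # assumes: pip install nltk

stemmer = PorterStemmer()
for word in ["indexing", "indexed", "indexes", "semantically"]:
    print(word, "->", stemmer.stem(word))
# "indexing", "indexed" and "indexes" all reduce to the root "index";
# "semantically" reduces to "semant".
```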

To obtain semantic content from a document, we first make a complete list of all the words that appear in the collection and then filter and stem them as follows (a sketch of this filtering step appears after the list):

  • Discard articles, prepositions, and conjunctions
  • Discard common verbs (know, see, do, be)
  • Discard pronouns
  • Discard common adjectives (big, late, high)
  • Discard frilly words (therefore, thus, however, albeit, etc.)
  • Discard any words that appear in every document
  • Discard any words that appear in only one document
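A sketch of these discard rules in plain Python follows; the stop word list and the simple tokenizer are illustrative assumptions, and the stemmer argument stands in for whatever stemmer is in use (such as the Porter stemmer above).

```python
import re
from collections import Counter

# A small, illustrative stop list; a real system would carry fuller lists of
# articles, prepositions, conjunctions, pronouns, common verbs, common
# adjectives, and "frilly" words.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "to",
    "is", "be", "do", "know", "see", "it", "he", "she", "they", "we",
    "big", "late", "high", "therefore", "thus", "however", "albeit",
}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents, stem=lambda w: w):
    """Return the content words kept after applying the discard rules."""
    doc_words = [{stem(w) for w in tokenize(d) if w not in STOP_WORDS}
                 for d in documents]
    counts = Counter(w for words in doc_words for w in words)
    n_docs = len(documents)
    # Drop words that appear in every document or in only one document.
    return {w for w, c in counts.items() if 1 < c < n_docs}

print(build_vocabulary([
    "the cat sat on the mat",
    "a cat chased the mouse",
    "the mouse ran up the clock",
]))   # keeps 'cat' and 'mouse': shared by some, but not all or only one, documents
```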
After stemming is complete we modify the results according to algorithmic insights. The first of these insights applies to individual documents, and we refer to it as local weighting: words that appear multiple times in a document are given a greater local weight than words that appear once. In broad strokes, we present an algorithm that forms a web of documents and words, connecting all documents to all words. Given such a model of words and documents, one can then establish values based on the distance of documents from each other. The 'value' of any document to any other document might be designated as a function of the number of connections that must be traversed to establish a connection between the documents. If two documents are connected by multiple routes, then those documents might have a high degree of correlation.
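
For instance, the local weighting insight above can be as simple as counting occurrences; the sketch below uses the raw count as the weight, which is an illustrative choice (log-scaled counts are a common alternative).

```python
from collections import Counter

def local_weights(tokens):
    """Weight each content word by how often it appears in this document."""
    return Counter(tokens)

print(local_weights(["semantic", "search", "semantic", "index"]))
# Counter({'semantic': 2, 'search': 1, 'index': 1})
```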

Term weighting is a formalization of two common-sense insights: content words that appear several times in a document are probably more meaningful than content words that appear just once; and infrequently used words are likely to be more interesting than common words. The implementation algorithm is:

For each document:
  1. Stem all of the words and throw away any common 'noise' words.
  2. For each of the remaining words, visit and remember each document that has a direct relationship to this word. Score each document based on a distance function from the original document and the relative scarcity of the word in common.
  3. For each of the as-yet-unvisited related documents now being tracked, recursively perform the same operation as above.
The particular weighting algorithm used here is:
  1. For each increase in distance, divide a baseline score by two.
  2. The score of each document is equal to the baseline divided by the square root of the popularity of the word.
Overall, this algorithm delivers a cheap semantic lookup based on walking through a document and creating a word graph. The specification shown here is the simplest case, and it could be improved in a variety of ways. There are many other scoring algorithms that could be used, and a thesaurus could be applied to help bridge semantic gaps.
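
A compact sketch of this walk in Python is shown below. The toy documents, the tokenizer, and the stop list are illustrative assumptions, and a maximum depth is added to keep the recursion bounded; the scoring itself follows the two rules above, halving the baseline at each hop and dividing by the square root of the word's popularity.

```python
import math
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is", "by"}

def tokenize(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def build_index(docs):
    """Inverted index: word -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def score_related(start_id, docs, index, max_depth=2):
    """Walk the document-word graph outward from one document, scoring neighbours."""
    scores = defaultdict(float)
    visited = {start_id}

    def walk(doc_id, baseline, depth):
        if depth > max_depth:
            return
        newly_reached = []
        for word in set(tokenize(docs[doc_id])):
            popularity = len(index[word])            # documents sharing this word
            for other in index[word]:
                if other == doc_id or other == start_id:
                    continue
                # Rule 2: baseline divided by the square root of the word's popularity.
                scores[other] += baseline / math.sqrt(popularity)
                if other not in visited:
                    visited.add(other)
                    newly_reached.append(other)
        for other in newly_reached:
            # Rule 1: halve the baseline for each increase in distance.
            walk(other, baseline / 2.0, depth + 1)

    walk(start_id, 1.0, 0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = [
    "latent semantic indexing finds related documents",
    "semantic search finds documents by concept",
    "stock prices rose on the market today",
]
index = build_index(docs)
# Document 1 shares several content words with document 0 and is scored;
# document 2 shares none and never enters the walk.
print(score_related(0, docs, index))
```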

One interesting challenge would be to make the algorithm work 'on the fly,' so that new documents would score themselves as they were added. Another challenge would be to find a way to distribute the algorithm over multiple machines for scalability. The word stemming process gets rid of common words such as 'the' and 'and' and feeds its output into the semantic algorithm, which scores the stemmed words and sorts the results into a new rank order reflecting the semantic analysis.

Semantic Search Engine Application

We have developed a Semantic Search Engine as a Windows application that uses the Google APIs for an initial search, which serves as input into our LSI stemming and ranking methods.
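
As a rough sketch of that re-ranking stage, the snippet below re-orders a list of (title, snippet) search results against a query; the example results, the stop list, and the use of a simple shared-content-word count in place of the full semantic score are all illustrative assumptions.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is", "by"}

def content_words(text):
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}

def rerank(query, results):
    """Re-order (title, snippet) results by content-word overlap with the query."""
    q = content_words(query)
    scored = [(len(q & content_words(title + " " + snippet)), title)
              for title, snippet in results]
    return [title for score, title in sorted(scored, reverse=True)]

results = [
    ("Market report", "stock prices rose on the market today"),
    ("Concept search", "semantic search finds documents by concept"),
]
print(rerank("semantic indexing of documents", results))
# ['Concept search', 'Market report']
```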
