Semantic Search
Search engines have become a vital and lucrative part of the Web, and they are
now commonplace tools for virtually every Internet user. Companies such as
Google and Yahoo! have become household names.
As semantic technologies become more widely embraced, it is reasonable to expect
better search capabilities that can truly respond to detailed requests and reduce
the number of irrelevant results. A semantic search engine seeks documents
that share similar 'concepts,' not just similar 'words.' But for a semantic
search to be effective, the semantic network must contain a great deal of relevant
information. At the same time, a large network requires processing many paths to
find a solution. As a result, most semantic-based search engines suffer performance
problems that grow with the scale of the semantic network.
Semantic search methods augment
and improve traditional search results by using not just words, but meaningful concepts.
Several major companies are seriously addressing the issue of semantic search. There
are two approaches to improving search results through semantic methods: (1) Latent
Semantic Indexing (LSI) and (2) Semantic Web documents.
Latent Semantic Index Search
So far we have reviewed search technology in general and identified the limitations
of today's search. Now we will explore future technologies based upon semantics.
First, we will discuss implementing Latent Semantic Indexing, which may improve
today's search capabilities without the extreme limitations of searching large
semantic networks. Building on the criteria of precision, ranking, and recall
requires more than brute force. Assigning descriptors and classifiers to a text
provides an important advantage by returning relevant documents that don't
necessarily contain a verbatim match to our search query. Fully described data sets
can also provide a picture of the scope and distribution of the document collection
as a whole. This can be accomplished by examining the structure of categories and
sub-categories called a taxonomy.
A serious drawback to this approach to categorizing data is the problem inherent
in any kind of taxonomy: the world sometimes resists categorization. For example,
is a tomato a fruit or a vegetable? And what happens when we combine two document
collections indexed in different ways? Proposed solutions are called ontology
taxonomies. Regular keyword searches treat a document collection in a binary way:
either a document contains a given word or it doesn't.
Latent semantic indexing (LSI) adds
an important step to the document indexing process. In addition to recording which
keywords a document contains, the method examines the document collection as a whole,
to see which other documents contain some of those same words. LSI was first developed
at Bellcore in the late 1980s. LSI considers documents that have many words in
common to be semantically close, and ones with few words in common to be semantically
distant. Although the LSI algorithm doesn't understand anything
about what the words
mean, it notices the patterns. When you search an LSI-indexed database, the search
engine looks at similarity values it has calculated for every content word, and
returns the documents that it thinks best fit the query. Because two documents may
be semantically very close even if they do not share a particular keyword, LSI does
not require an exact match to return useful results. Where a plain keyword search
will fail if there is no exact match, LSI will often return relevant documents that
don't contain the keyword at all.
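To make this concrete, here is a minimal sketch of LSI using a term-document matrix and a truncated singular value decomposition. The tiny corpus, the stop-word list, the crude suffix stripping, and the choice of two latent dimensions are all assumptions made for this illustration, not details from the original description.

```python
import numpy as np

# Tiny illustrative corpus (an assumption for this sketch).
docs = [
    "cats and dogs are popular pets",
    "a dog is a loyal pet",
    "stock markets fell sharply today",
    "investors watched the stock market rally",
]
stop_words = {"a", "and", "are", "is", "the", "today"}

def tokens(text):
    # Lowercase, drop stop words, and strip a trailing 's' as a crude
    # stand-in for a real stemmer such as the Porter stemmer.
    return [w.rstrip("s") for w in text.lower().split() if w not in stop_words]

# Term-document count matrix: one row per term, one column per document.
vocab = sorted({w for d in docs for w in tokens(d)})
A = np.array([[tokens(d).count(w) for d in docs] for w in vocab], dtype=float)

# Truncated SVD keeps k latent "concept" dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print(cosine(doc_vectors[0], doc_vectors[1]))  # pet documents: close in concept space
print(cosine(doc_vectors[0], doc_vectors[2]))  # pets vs. markets: far apart
```

Because similarity is computed in the reduced concept space rather than over raw keywords, documents about the same topic score as close even when their exact vocabulary differs.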
Searching for Content
A semantic search engine is a remarkably
useful solution. It can discover that two documents are similar even if they have
no specific words in common, and it can reject documents that share only uninteresting
words. Latent semantic indexing looks at patterns of words within a set
of documents. Natural language is full of redundancies, and not every word that
appears in a document carries semantic meaning. Frequently used words in English
often don't carry content: examples include function words such as conjunctions,
prepositions, and auxiliary verbs.
The first step in doing LSI, therefore,
is culling these extraneous words from a document. This is called stemming.
Stemming
Some of the preparatory work needed to get documents ready for indexing, such as
stemming, is very language-specific. For English documents, we use an algorithm
called the Porter stemmer to remove common endings from words, leaving behind an
invariant root form.
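As a brief illustration, the NLTK library provides a Porter stemmer implementation (the choice of library and the sample word list are assumptions; the original text does not name a specific tool):

```python
from nltk.stem import PorterStemmer  # requires: pip install nltk

stemmer = PorterStemmer()
words = ["connections", "connected", "connecting", "searches", "searching"]
# Common endings are removed, leaving the invariant roots 'connect' and 'search'.
print([stemmer.stem(w) for w in words])
```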
To obtain semantic content from
a document, we first make a complete list of all the words that appear in the collection
and then stem them as follows (a sketch follows the list):
- Discard articles, prepositions, and conjunctions
- Discard common verbs (know, see, do, be)
- Discard pronouns
- Discard common adjectives (big, late, high)
- Discard frilly words (therefore, thus, however, albeit, etc.)
- Discard any words that appear in every document
- Discard any words that appear in only one document
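The sketch below applies a crude version of these discard rules. The small stop-word list, the whitespace tokenizer, and the sample documents are assumptions made for illustration; a real implementation would use a much larger list and a proper stemmer.

```python
from collections import Counter

# Crude stop-word list covering articles, prepositions, conjunctions,
# pronouns, common verbs/adjectives, and "frilly" words (illustrative only).
STOP_WORDS = {
    "a", "an", "the", "of", "in", "on", "and", "or", "but",
    "know", "see", "do", "be", "is", "are",
    "i", "you", "he", "she", "it", "we", "they",
    "big", "late", "high", "therefore", "thus", "however", "albeit",
}

def build_vocabulary(docs):
    """Return the index terms that survive the discard rules above."""
    doc_word_sets = [
        {w for w in doc.lower().split() if w not in STOP_WORDS}
        for doc in docs
    ]
    # Count in how many documents each remaining word appears.
    doc_freq = Counter(w for words in doc_word_sets for w in words)
    n_docs = len(docs)
    # Discard words that appear in every document or in only one document.
    return {w for w, df in doc_freq.items() if 1 < df < n_docs}

docs = [
    "the cat sat on the mat",
    "a cat chased the dog",
    "the dog chased a ball",
]
print(sorted(build_vocabulary(docs)))  # -> ['cat', 'chased', 'dog']
```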
After stemming is complete we modify
the results according to algorithmic insights. The first of these insights
applies to individual documents, and we refer to it as local weighting. Words that appear
multiple times in a document are given a greater local weight than words that appear
once. In broad strokes, we present an algorithm that forms a web of documents and
words, connecting all documents to all words. Given such a model of words and documents,
one can then establish values based on the distance of documents from each other.
The 'value' of any document to any other document might be designated as a function
of the number of links that must be traversed to connect the two documents.
If two documents are connected by multiple routes, then those documents
might have a high degree of correlation.
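One way to picture such a web is as an inverted index linking words to documents and documents to words. The mini-corpus below is an assumption made purely for illustration.

```python
from collections import defaultdict

# Illustrative mini-corpus (an assumption for this sketch).
docs = {
    "d1": "semantic search finds related concepts",
    "d2": "keyword search matches exact words",
    "d3": "related documents share concepts",
}

# Local weighting: how often each word appears in each document.
doc_to_words = {d: defaultdict(int) for d in docs}
# Inverted index: which documents contain each word.
word_to_docs = defaultdict(set)

for doc_id, text in docs.items():
    for word in text.lower().split():
        doc_to_words[doc_id][word] += 1
        word_to_docs[word].add(doc_id)

# Two documents are linked whenever they share a word; the more
# connecting routes, the stronger the correlation between them.
def shared_words(a, b):
    return set(doc_to_words[a]) & set(doc_to_words[b])

print(shared_words("d1", "d3"))  # {'related', 'concepts'}: two connecting routes
print(shared_words("d2", "d3"))  # empty set: no direct link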
Term weighting is a formalization
of two common-sense insights: content words that appear several times in a document
are probably more meaningful than content words that appear just once; and infrequently
used words are likely to be more interesting than common words. The implementation
algorithm is:
For each document:
- Stem all of the words and throw away any common 'noise' words.
- For each of the remaining words, visit and remember each document that has a direct
relationship to this word. Score each document based on a distance function from
the original document and the relative scarcity of the word in common.
- For each of the as-yet-unvisited related documents now being tracked, recursively
perform the same operation as above.

The particular weighting algorithm used is this (a sketch implementing both steps
follows the list):
- For each increase in distance, divide a baseline score by two.
- The score of each document is equal to the baseline divided by the square root of the
popularity of the word.
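The following sketch implements that walk under these two weighting rules. The sample corpus, the baseline score of 1.0, the small stop-word list, and the choice to sum contributions over shared words are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Illustrative mini-corpus and stop words (assumptions for this sketch).
docs = {
    "d1": "semantic search finds related concepts",
    "d2": "keyword search matches exact words",
    "d3": "related documents share concepts and ideas",
}
STOP_WORDS = {"and", "the", "a"}

# Inverted index: word -> documents containing it.
word_to_docs = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        if word not in STOP_WORDS:
            word_to_docs[word].add(doc_id)

def semantic_scores(start_doc, baseline=1.0):
    """Walk outward from start_doc through shared words, scoring related documents."""
    scores = defaultdict(float)
    visited = {start_doc}
    frontier = {start_doc}
    while frontier:
        next_frontier = set()
        for doc_id in frontier:
            for word in docs[doc_id].lower().split():
                if word in STOP_WORDS:
                    continue
                popularity = len(word_to_docs[word])        # documents sharing this word
                for other in word_to_docs[word] - visited:  # as-yet-unvisited related documents
                    # Scarcer words contribute more: baseline / sqrt(popularity).
                    scores[other] += baseline / math.sqrt(popularity)
                    next_frontier.add(other)
        visited |= next_frontier
        baseline /= 2.0          # each increase in distance halves the baseline
        frontier = next_frontier
    return dict(scores)

print(semantic_scores("d1"))  # d3 scores higher than d2: it shares more words with d1
```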
Overall this algorithm delivers
a cheap semantic lookup based on walking through a document collection and creating a word
graph. The specification shown here is the simplest case, and it could be improved
in a variety of ways. There are many other scoring algorithms that could be used.
Additionally, a thesaurus could be applied to help bridge semantic gaps such as synonyms.
One interesting challenge would
be to make the algorithm work 'on the fly,' so that new documents score themselves as
they are added. Another challenge would be to find a way to distribute the algorithm
over multiple machines for scalability. The word stemming process gets rid of common
words such as 'the' and 'and,' and feeds its output into the semantic algorithm, which
stems the words appropriately, scores them, and sorts the results into a new rank
order reflecting the semantic analysis.
Semantic Search Engine Application
We have developed a Semantic Search Engine as a Windows application that uses the
Google APIs for an initial search, which serves as input to our LSI stemming and
ranking methods.
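A rough picture of that re-ranking pipeline is sketched below. The hard-coded result list stands in for whatever the search API actually returns (no Google API calls are shown), and the scoring function is a simplified term-overlap weighting in the spirit of the algorithm above, not the application's actual method.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "for", "to", "in"}

def terms(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def semantic_rank(query, results):
    """Re-rank initial search results by weighted term overlap with the query."""
    doc_freq = Counter(w for r in results for w in set(terms(r["snippet"])))
    query_terms = set(terms(query))

    def score(result):
        # Scarcer shared terms count for more, echoing the weighting above.
        return sum(
            1.0 / math.sqrt(doc_freq[w])
            for w in set(terms(result["snippet"])) & query_terms
        )

    return sorted(results, key=score, reverse=True)

# Hypothetical stand-in for results returned by a web search API.
initial_results = [
    {"title": "Keyword search basics", "snippet": "matching exact words in the index"},
    {"title": "Semantic search overview", "snippet": "finding related concepts not just words"},
]
for r in semantic_rank("semantic search for related concepts", initial_results):
    print(r["title"])  # the semantically closer result is printed first
```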