Searching for documents and other items on the Web or on a computer is often tedious and time-consuming. Time is money: highly paid professionals spend hours, days, and even longer searching for information. Most search today relies on keyword and phrase matching, often combined with various ranking schemes for the results. Occasionally more advanced methods are used, such as logical queries (e.g. search for “rocket scientist” and NOT “space”) and regular expressions. All of these methods have significant limitations and often require lengthy human review and further manual searching of the results.
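As a rough illustration of the logical-query and regular-expression methods mentioned above, here is a minimal Ruby sketch. The three sample documents and the helper names `boolean_search` and `regex_search` are made up for this example and do not come from the article.

```ruby
# Hypothetical mini-corpus, invented for illustration only.
docs = {
  "a" => "The rocket scientist designed a new engine for space travel.",
  "b" => "Our rocket scientist also models financial markets.",
  "c" => "The scientist studied rockets as a hobby."
}

# Logical query: keep documents that contain every phrase in `all`
# and none of the phrases in `none` (case-insensitive substring match).
def boolean_search(docs, all:, none: [])
  docs.select do |_id, text|
    t = text.downcase
    all.all? { |p| t.include?(p.downcase) } &&
      none.none? { |p| t.include?(p.downcase) }
  end.keys
end

# Regular-expression query: keep documents whose text matches the pattern.
def regex_search(docs, pattern)
  docs.select { |_id, text| text.match?(pattern) }.keys
end

# "rocket scientist" and NOT "space"
boolean_hits = boolean_search(docs, all: ["rocket scientist"], none: ["space"])

# "rocket" or "rockets" as a whole word
regex_hits = regex_search(docs, /\brockets?\b/i)

puts boolean_hits.inspect
puts regex_hits.inspect
```

Note that the logical query misses document "c" even though it is arguably on topic, which is the kind of limitation that pushes the review work back onto the human reader.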
The dream search engine would search by topic, by the detailed content of the items searched, ideally finding the desired information immediately. Actual understanding of text remains an unfulfilled promise of artificial intelligence. Statistical language processing, however, can achieve a degree of searching by topic. This article introduces the basic concepts and mathematics of statistical language processing and its applications to search, and gives a brief overview of more advanced techniques in the field as applied to search. It also includes sample Ruby code illustrating some simple statistical language processing methods.
Read the rest of Faster, Better, Cheaper Search Engines on Scribd, where you’ll be able to download it in several formats, including PDF.
Source code: trigram.zip
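The archive itself is not reproduced here. As an independent sketch of the kind of simple statistical method the abstract describes, the following Ruby code builds character-trigram frequency profiles and compares them. Whether the article's code uses character or word trigrams is an assumption on our part, and the function names `trigram_counts`, `trigram_profile`, and `profile_overlap` are hypothetical.

```ruby
# Count overlapping character trigrams in a string.
# Text is lowercased and anything other than a-z or space becomes a space.
def trigram_counts(text)
  cleaned = text.downcase.gsub(/[^a-z ]/, " ").squeeze(" ").strip
  counts = Hash.new(0)
  (0..cleaned.length - 3).each { |i| counts[cleaned[i, 3]] += 1 }
  counts
end

# Turn raw counts into relative frequencies that sum to 1.
def trigram_profile(text)
  counts = trigram_counts(text)
  total = counts.values.sum.to_f
  return {} if total.zero?
  counts.transform_values { |c| c / total }
end

# Histogram intersection of two profiles: 0.0 means no trigram in common,
# 1.0 means identical profiles.
def profile_overlap(a, b)
  a.sum { |gram, freq| [freq, b.fetch(gram, 0.0)].min }
end

query     = trigram_profile("search engine ranking")
related   = trigram_profile("Search engines rank pages by keyword matching")
unrelated = trigram_profile("The cat sat quietly on the warm windowsill")

puts profile_overlap(query, related)
puts profile_overlap(query, unrelated)
```

The overlap score is only a crude proxy for topic: it rewards shared letter sequences, so "engines" still matches "engine" without any stemming, but it knows nothing about meaning.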
It is easy to say these things. In fact, the modern search engine is extremely sophisticated in what it is trying to do.
Most search engines these days use LSI (Latent Semantic Indexing), which represents each web page as a vector. It turns up some remarkable associations between websites that have similar vectors;
https://chris.ikit.org/ksv2.pdf
is very impressive. It should be pointed out that LSI scanning is CPU-intensive, although once a page has been processed it does not need to be done again.
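To make the "page as a vector" idea concrete, here is a minimal Ruby sketch that compares pages by the cosine of the angle between their term-count vectors. This is plain vector-space matching, not LSI itself: LSI additionally applies a singular value decomposition to the term-document matrix, which is where most of the CPU cost comes from. The sample page texts and the names `term_vector` and `cosine_similarity` are invented for illustration.

```ruby
# Represent a text as a sparse vector: word => number of occurrences.
def term_vector(text)
  text.downcase.scan(/[a-z]+/).each_with_object(Hash.new(0)) { |w, v| v[w] += 1 }
end

# Cosine similarity between two sparse vectors.
# 1.0 means the same direction, 0.0 means no terms in common.
def cosine_similarity(a, b)
  dot = a.sum { |term, count| count * b.fetch(term, 0) }
  norm_a = Math.sqrt(a.values.sum { |c| c * c })
  norm_b = Math.sqrt(b.values.sum { |c| c * c })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

# Hypothetical pages, made up for this example.
page_a = term_vector("ruby search engine code")
page_b = term_vector("search engine ranking code")
page_c = term_vector("chocolate cake recipe")

sim_ab = cosine_similarity(page_a, page_b)
sim_ac = cosine_similarity(page_a, page_c)

puts sim_ab
puts sim_ac
```

Each page's vector only has to be computed once and can then be stored, which matches the point above that a page, once done, stays done.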
One thing I would like Google to do is to use vectors when searching from within a document you are writing. It does not appear to do this.
Why do people put their PDFs on websites where you can’t download them without Flash and without signing up?
*sigh*
Adam, I added a PDF version for you to download.
Antonio: Excellent, thank you very much!