Searching for documents and other items on the Web or on computers is often tedious and time-consuming. Time is money. Highly paid professionals spend hours, days, and even longer searching for information on the Web or on computers. Most search today is done by matching keywords and phrases, often combined with various ranking schemes for the search results. Occasionally more advanced methods are used, such as logical queries (e.g. search for “rocket scientist” and NOT “space”) and regular expressions. All of these methods have significant limitations, and the search results often require lengthy human review and further manual searching.
The dream search engine would search by topic and by the detailed content of the items searched, ideally finding the desired information immediately. Actual understanding of text remains an unfulfilled promise of artificial intelligence. Statistical language processing, however, can achieve a degree of searching by topic. This article introduces the basic concepts and mathematics of statistical language processing and its applications to search. It gives a brief overview of more advanced techniques in statistical language processing as applied to search. It also includes sample Ruby code illustrating some simple statistical language processing methods.
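To give a flavor of what a simple statistical method can look like, here is a minimal Ruby sketch. It is not the code from the article or from trigram.zip. It assumes that "trigram" means three consecutive words (character trigrams are another common choice), and the helper names `tokenize`, `trigram_counts`, `trigram_frequencies`, and `cosine_similarity` are hypothetical, chosen only for this illustration. The idea is to turn each text into a table of trigram frequencies and then score how much two tables overlap.

```ruby
# Illustrative sketch only: assumes word trigrams; all helper names are hypothetical.

# Split text into lowercase word tokens.
def tokenize(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Count every run of three consecutive words.
def trigram_counts(text)
  counts = Hash.new(0)
  tokenize(text).each_cons(3) { |words| counts[words.join(" ")] += 1 }
  counts
end

# Convert raw counts into relative frequencies.
def trigram_frequencies(text)
  counts = trigram_counts(text)
  total = counts.values.sum.to_f
  return {} if total.zero?
  counts.transform_values { |n| n / total }
end

# Cosine similarity between two sparse frequency tables (0.0 when either is empty).
def cosine_similarity(a, b)
  dot = a.sum { |key, value| value * b.fetch(key, 0.0) }
  norm = ->(h) { Math.sqrt(h.values.sum { |v| v * v }) }
  denom = norm.call(a) * norm.call(b)
  denom.zero? ? 0.0 : dot / denom
end

# Usage: score two documents against a query text.
query = trigram_frequencies("the rocket scientist designed the rocket engine")
doc_a = trigram_frequencies("the rocket scientist designed the rocket engine for the launch")
doc_b = trigram_frequencies("the space station orbits the earth every ninety minutes")
puts cosine_similarity(query, doc_a)
puts cosine_similarity(query, doc_b)
```

A score near 1 means the two texts share most of their word triples, and a score of 0 means they share none. Unlike exact keyword matching, a score like this degrades gradually as texts diverge, which is one reason frequency-based methods are a natural first step toward searching by topic.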
Read the rest of Faster, Better, Cheaper Search Engines on Scribd, where you can download it in several formats, including PDF.
Source code: trigram.zip