A Simplified Version of Google's Spell Checker

Peter Norvig (from Google) explains in a detailed article how to write in 20 lines of Python code a spell checker almost as good as the one used by Google to show the famous "did you mean" corrections. Well, at least for one-word corrections.
We will read a text file, holmes.txt (that I happened to have on my laptop) which is a collection of Sherlock Holmes stories (from Project Gutenberg) consisting of about 100,000 words. We then extract the individual words from the file (using the function words, which converts everything to lowercase, so that "the" and "The" will be the same). Next we train a probability model, which is a fancy way of saying we count how many times each word occurs. (...)

Now let's look at the problem of enumerating the possible corrections c of a given word w. It is common to talk of the edit distance between two words: the number of edits it would take to turn one into the other. An edit can be a deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). (...)

The literature on spelling correction claims that 80 to 95% of spelling errors are an edit distance of 1 from the target.

A simple way to define the error model was to say that "all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0". From all the candidates for the correction, you can choose the most frequent word.

In Peter Norvig's tests, this simple algorithm returned correct answers in more than 80% of the cases. Of course that Google has more data than the holmes.txt file (it crawls the web, right?) and has access to a huge list of queries and refinements that could improve the algorithm, but this is an example of a simple yet powerful program.

Labels

Web Search Gmail Google Docs Mobile YouTube Google Maps Google Chrome User interface Tips iGoogle Social Google Reader Traffic Making Devices cpp programming Ads Image Search Google Calendar tips dan trik Google Video Google Translate web programming Picasa Web Albums Blogger Google News Google Earth Yahoo Android Google Talk Google Plus Greasemonkey Security software download info Firefox extensions Google Toolbar Software OneBox Google Apps Google Suggest SEO Traffic tips Book Search API Acquisitions InOut Visualization Web Design Method for Getting Ultimate Traffic Webmasters Google Desktop How to Blogging Music Nostalgia orkut Google Chrome OS Google Contacts Google Notebook SQL programming Google Local Make Money Windows Live GDrive Google Gears April Fools Day Google Analytics Google Co-op visual basic Knowledge java programming Google Checkout Google Instant Google Bookmarks Google Phone Google Trends Web History mp3 download Easter Egg Google Profiles Blog Search Google Buzz Google Services Site Map for Ur Site game download games trick Google Pack Spam cerita hidup Picasa Product's Marketing Universal Search FeedBurner Google Groups Month in review Twitter Traffic AJAX Search Google Dictionary Google Sites Google Update Page Creator Game Google Finance Google Goggles Google Music file download Annoyances Froogle Google Base Google Latitude Google Voice Google Wave Google Health Google Scholar PlusBox SearchMash teknologi unik video download windows Facebook Traffic Social Media Marketing Yahoo Pipes Google Play Google Promos Google TV SketchUp WEB Domain WWW World Wide Service chord Improve Adsence Earning jurnalistik sistem operasi AdWords Traffic App Designing Tips and Tricks WEB Hosting linux How to Get Hosting Linux Kernel WEB Errors Writing Content award business communication ubuntu unik