Using Google's N-Gram Corpus

Two years ago, Google released a collection of n-grams extracted from web pages and made it available on the Linguistic Data Consortium's website. In Google's words: "We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times." Here are some examples of 3-grams, each followed by its frequency:

ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
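
Entries in this form are easy to read programmatically: the last field is the count and everything before it is the n-gram. Here is a minimal Python sketch, assuming the fields are separated by whitespace as in the examples above (the files distributed by the LDC may use a different delimiter). The helper name `parse_ngram_line` is made up for illustration.

```python
from typing import Tuple


def parse_ngram_line(line: str) -> Tuple[Tuple[str, ...], int]:
    """Split a line such as 'ceramics collected by 52' into (words, count)."""
    # Assumption: whitespace-separated fields, with the count as the last field.
    *words, count = line.split()
    if not words:
        raise ValueError(f"no n-gram before the count: {line!r}")
    return tuple(words), int(count)


# The three sample entries quoted in this post.
sample = [
    "ceramics collectables collectibles 55",
    "ceramics collectables fine 130",
    "ceramics collected by 52",
]

counts = dict(parse_ngram_line(line) for line in sample)

# Pick the entry with the highest frequency.
most_frequent = max(counts, key=counts.get)
print(" ".join(most_frequent), counts[most_frequent])
```

Keeping each n-gram as a tuple of words makes it straightforward to group entries by their first word or first two words later on.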

While this huge corpus is useful for building language models, there are other ways to use it. Chris Harrison created some visualizations for bigrams and trigrams that start with pronouns. "These visual comparisons allow us to see differences in how the two subjects are used - both where they are similar and diverge. For example, among the top 120 trigrams, 'He' and 'She' have many common second words. However, they differ on some interesting ones, for example, only 'he' connects to 'argues', while only 'she' connects to 'love'."


Chris DiBona from Google works on IsolWrite, a word processing program that will include a text prediction option. "I gotta get my greasy hands on an open version of our published n-gram data (which is ranked) and incorporate that, if it makes sense."
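
To give a rough idea of how ranked n-gram counts could drive a text prediction feature, here is a small Python sketch. It is not how IsolWrite works (no details of its implementation are public in this post). It simply looks up the last two typed words and suggests the most frequent third words. The function names `build_index` and `suggest` are hypothetical, and the data is limited to the three sample trigrams quoted earlier in this post.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Trigram = Tuple[str, str, str]
Index = Dict[Tuple[str, str], List[Tuple[str, int]]]


def build_index(trigrams: Dict[Trigram, int]) -> Index:
    """Map each two-word prefix to its possible next words, most frequent first."""
    index = defaultdict(list)
    for (w1, w2, w3), count in trigrams.items():
        index[(w1, w2)].append((w3, count))
    for followers in index.values():
        followers.sort(key=lambda pair: pair[1], reverse=True)
    return dict(index)


def suggest(index: Index, previous_words: Sequence[str], limit: int = 3) -> List[str]:
    """Return up to `limit` likely next words for the last two words typed."""
    key = tuple(previous_words[-2:])
    return [word for word, _ in index.get(key, [])[:limit]]


# Sample counts taken from the 3-gram examples in this post.
trigram_counts = {
    ("ceramics", "collectables", "collectibles"): 55,
    ("ceramics", "collectables", "fine"): 130,
    ("ceramics", "collected", "by"): 52,
}

index = build_index(trigram_counts)
print(suggest(index, ["ceramics", "collectables"]))
```

A real predictor would also need to fall back to bigrams or single-word frequencies when the two-word prefix has never been seen, which this sketch leaves out.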

{ via information aesthetics }
