Detecting Spam Pages Using Statistics

Spam web pages that are machine generated tend to differ in a number of ways from most other web pages, and can possibly be identified through statistical analysis. A paper from Microsoft Research titled "Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages" (pdf) [by Dennis Fetterly, Mark Manasse, Marc Najork] looks at some ways of finding those pages. Spam web pages tend to have these characteristics:

* Host names with many characters, dots, dashes, and digits are likely to be spam web sites.

* "One piece of folklore among the SEO community is that search engines (and Google in particular), given a query q, will rank a result URL u higher if u’s host component contains q. SEOs try to exploit this by populating pages with URLs whose host components contain popular queries that are relevant to their business, and by setting up a DNS server that resolves those host names. The latter is quite easy, since DNS servers can be configured with wildcard records that will resolve any host name within a domain to the same IP address. For example, at the time of this writing, any host within the domain highriskmortgage.com resolves to the IP address 65.83.94.42."

* Linkage properties: looking at the number of links embedded on a page compared to the number of links pointing to those pages. Are they similar to what is seen on other pages on other sites?

* Content properties: A large number of automatically generated pages contain the exact same number of words, though individual words will differ from page to page.

* "Overall, the web evolves slowly, 65% of all pages will not change at all from one week to the next, and only about 0.8% of all pages will change completely." Spam pages tend to fall in the last category, because many of them are generated at each request.

Related:
Detecting near-duplicate documents

Labels

Web Search Gmail Google Docs Mobile YouTube Google Maps Google Chrome User interface Tips iGoogle Social Google Reader Traffic Making Devices cpp programming Ads Image Search Google Calendar tips dan trik Google Video Google Translate web programming Picasa Web Albums Blogger Google News Google Earth Yahoo Android Google Talk Google Plus Greasemonkey Security software download info Firefox extensions Google Toolbar Software OneBox Google Apps Google Suggest SEO Traffic tips Book Search API Acquisitions InOut Visualization Web Design Method for Getting Ultimate Traffic Webmasters Google Desktop How to Blogging Music Nostalgia orkut Google Chrome OS Google Contacts Google Notebook SQL programming Google Local Make Money Windows Live GDrive Google Gears April Fools Day Google Analytics Google Co-op visual basic Knowledge java programming Google Checkout Google Instant Google Bookmarks Google Phone Google Trends Web History mp3 download Easter Egg Google Profiles Blog Search Google Buzz Google Services Site Map for Ur Site game download games trick Google Pack Spam cerita hidup Picasa Product's Marketing Universal Search FeedBurner Google Groups Month in review Twitter Traffic AJAX Search Google Dictionary Google Sites Google Update Page Creator Game Google Finance Google Goggles Google Music file download Annoyances Froogle Google Base Google Latitude Google Voice Google Wave Google Health Google Scholar PlusBox SearchMash teknologi unik video download windows Facebook Traffic Social Media Marketing Yahoo Pipes Google Play Google Promos Google TV SketchUp WEB Domain WWW World Wide Service chord Improve Adsence Earning jurnalistik sistem operasi AdWords Traffic App Designing Tips and Tricks WEB Hosting linux How to Get Hosting Linux Kernel WEB Errors Writing Content award business communication ubuntu unik