Machine-generated spam web pages tend to differ from most other web pages in a number of ways, so statistical analysis may be able to identify them. A Microsoft Research paper titled "Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages" (pdf), by Dennis Fetterly, Mark Manasse, and Marc Najork, looks at some ways of finding those pages. Spam web pages tend to have these characteristics:
* Host names with many characters, dots, dashes, or digits are likely to belong to spam web sites.
* "One piece of folklore among the SEO community is that search engines (and Google in particular), given a query q, will rank a result URL u higher if u’s host component contains q. SEOs try to exploit this by populating pages with URLs whose host components contain popular queries that are relevant to their business, and by setting up a DNS server that resolves those host names. The latter is quite easy, since DNS servers can be configured with wildcard records that will resolve any host name within a domain to the same IP address. For example, at the time of this writing, any host within the domain highriskmortgage.com resolves to the IP address 65.83.94.42."
* Linkage properties: compare the number of links embedded on a page with the number of links pointing to that page. Are these counts similar to what is seen on other pages on other sites?
* Content properties: A large number of automatically generated pages contain the exact same number of words, though individual words will differ from page to page.
* "Overall, the web evolves slowly, 65% of all pages will not change at all from one week to the next, and only about 0.8% of all pages will change completely." Spam pages tend to fall in the last category, because many of them are generated at each request.
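As a rough illustration of the host name and content checks in the list above, here is a minimal Python sketch. It is not the paper's implementation: the function names (`host_features`, `looks_suspicious`, `repeated_word_counts`) and all threshold values are assumptions chosen only for the example, and a real system would derive its cutoffs from the measured distributions.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_features(url):
    """Count characters, dots, dashes, and digits in a URL's host name."""
    # hostname is None when the URL has no scheme/netloc, so fall back to "".
    host = urlsplit(url).hostname or ""
    return {
        "length": len(host),
        "dots": host.count("."),
        "dashes": host.count("-"),
        "digits": sum(ch.isdigit() for ch in host),
    }


def looks_suspicious(features, max_length=45, max_dots=6,
                     max_dashes=5, max_digits=10):
    """Flag a host whose counts exceed any threshold.

    The default thresholds are illustrative assumptions, not tuned values.
    """
    return (
        features["length"] > max_length
        or features["dots"] > max_dots
        or features["dashes"] > max_dashes
        or features["digits"] > max_digits
    )


def repeated_word_counts(pages, min_pages=3):
    """Return word counts shared by at least `min_pages` pages.

    `pages` maps a URL (or any key) to the page's text. Many pages with
    exactly the same number of words hint at template-generated content.
    """
    counts = Counter(len(text.split()) for text in pages.values())
    return {n for n, c in counts.items() if c >= min_pages}


if __name__ == "__main__":
    # Hypothetical URL, used only to exercise the functions.
    url = "http://buy-cheap-pills-online-24-7.www.best-deals-4-you.example.com/x"
    feats = host_features(url)
    print(feats, looks_suspicious(feats))
```

A flag from a sketch like this is only a signal to investigate, since legitimate hosts can also have long or digit-heavy names.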
Related:
Detecting near-duplicate documents