Google’s original algorithm doesn’t work like it once did, and the company is now caught up in controversy over web spam
I think we all understand how web spam works to a degree. Purveyors want to sell some ads, so they put up content that the search engines will like. Perhaps they run a network of sites and link those sites to each other’s content to improve their chances and game the search engines.
But the original beauty of Google’s search engine, which set it apart from the crowded field in the 90s, was the PageRank algorithm. The idea, as far as most people understand it, was that web pages are ranked by the quality and quantity of the links that point to them. PageRank, drawn, is where the colorful Google balls/logo come from (pictured at right).
Google describes PageRank:
PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results. PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value. We have always taken a pragmatic approach to help improve search quality and create useful products, and our technology uses the collective intelligence of the web to determine a page’s importance.
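To make the "votes" idea concrete, here is a minimal sketch in Python of the simplified, publicly described form of PageRank (power iteration with a damping factor). It is not Google's production system, which by the description above weighs far more signals. The function name `pagerank`, the damping value of 0.85, the iteration count, and the toy page names are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its "vote" evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "a" and "b" each receive exactly one link,
# but "a" is linked from a page that is itself well linked.
toy_links = {
    "fan1": ["hub"],
    "fan2": ["hub"],
    "fan3": ["hub"],
    "hub": ["a"],
    "loner": ["b"],
}
ranks = pagerank(toy_links)
print(f"a: {ranks['a']:.3f}  b: {ranks['b']:.3f}")
```

In this toy graph, "a" ends up with a higher score than "b" even though each has a single inbound link, because the vote for "a" comes from a page that other pages vote for.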
So how does all of this spam get through? No one links to spam…
Google largely knows which sites are reputable and have been around for years, and which have just been built to house spam. Google also knows that spam servers, for the most part, do not share an IP address with reputable companies, and so on.
But even if spammers and scrapers can get by those types of filters, PageRank should kick in and rate real, original content over the stuff that copies it.
People aren’t linking to web spam from their blogs and Twitter (or from Facebook, but Google doesn’t index Facebook). News organizations aren’t linking to web spam. Corporate websites don’t link to spam. Only spam links to spam.
Therefore the PageRank algorithm should be able to kill spam before it even starts. How does a page that just scrapes popular articles rise above the actual articles that it scrapes? Certainly people are tweeting about the popular article and posting links to it on their blogs. No one is posting a link to the spam scraper.
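The argument above can be sketched with a toy link graph. The page names, the link structure, and the `rank_pages` helper below are all hypothetical, and the ranking is the simplified textbook form of PageRank, not whatever Google actually runs. The original article is linked from two blogs and a news site; the scraper is linked only from two spam pages in its own network.

```python
def rank_pages(links, damping=0.85, iterations=100):
    """Compact simplified PageRank.

    Assumes every page appears as a key in `links` and has at least one
    outbound link, which holds for the toy graph below.
    """
    pages = list(links)
    n = len(pages)
    rank = dict.fromkeys(pages, 1.0 / n)
    for _ in range(iterations):
        # A page's new score is a small baseline plus a share of the score
        # of every page that links to it.
        rank = {
            page: (1.0 - damping) / n
            + damping * sum(
                rank[src] / len(out) for src, out in links.items() if page in out
            )
            for page in pages
        }
    return rank


# Hypothetical web: real sites link to the original, only spam links to spam.
links = {
    "original": ["news_site"],
    "news_site": ["original"],
    "blog_a": ["original"],
    "blog_b": ["original"],
    "scraper": ["spam_1", "spam_2"],
    "spam_1": ["scraper"],
    "spam_2": ["scraper"],
}
scores = rank_pages(links)
for page, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{page:10s} {score:.3f}")
```

With these particular links, the original article scores above the scraper. The result depends on the link counts chosen here: in this simplified model, a large enough network of spam pages linking to each other can still pile up score, so the sketch illustrates the argument as stated and nothing more.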
But we often see AdSense-laden scraper/copier sites ranked way above the original content.
This is exactly what the original PageRank was built to avoid.