Evaluating Google Search Results

Google uses many testers to evaluate the quality of search results. This post gives some details about their work and speculates that Google will allow everyone to be a tester.

{ The information in this post is based on Henk van Ess' Search Bistro. }

Google has great search results because it has good algorithms, great data centers and (secret) evaluation tests. Many people are paid (reportedly about $20 per hour) just to test the relevance of the search results. Google doesn't manually adjust search results; instead, it tries to find the problem that produced the low-quality results and tweaks its algorithms.

So what qualifications does a tester need? Here's a job ad from 2005:

"You would work at your own pace, and the time and length of any particular work session would be up to you. Candidates will evaluate search results and rate their relevance. Thus, all candidates must be web-savvy and analytical, have excellent web research skills and a broad range of interests. Specific areas of expertise are highly desirable. We are looking for smart people who read voraciously and have a wide variety of interests.

Raters should have all the following qualifications:

* Native-level fluency in Dutch, Italian, Spanish, or French

* In-depth, up-to-date familiarity with the web culture of at least one predominantly Dutch, Italian, Spanish, or French-speaking country.

* Excellent web research skills and analytical abilities.

* A high-speed internet connection.

* Perfect English is not necessary; however, you must be able to read and write English well enough to use software with an English interface, understand fairly complicated instructions written in English, and make yourself understood in informal written communication.

* The job involves frequent written communication with fellow Quality Raters."

Search Bistro found out more about this evaluation process. Google selects a number of random queries and sends the list to a group of quality raters. The raters evaluate the results in a CommQuest Evaluation Interface.

"During random-query evaluation, each result URL for every randomly selected query is rated independently by a group of raters using the options given in a pull-down menu on the Quest interface. The rating results are subsequently analyzed. That’s where CommQuest comes in. When you – the raters – disagree with each other by a wide margin, the result URL will be presented to you again in the uniquely interactive CommQuest interface until a certain level of agreement among you is reached. CommQuest allows you to share your comments on queries and/or URLs with each other, explain the reasoning behind your initial ratings, and revise the ratings based on what you’re learning from each other."

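The disagreement check described in the quote can be sketched roughly as follows. This is only an illustration: the numeric 0 to 3 scale, the `max_spread` threshold, and the function name `needs_discussion` are assumptions made for this example, since the source does not say how Google measures a "wide margin".

```python
# A minimal sketch, not Google's actual logic.
# Assumption: each rater's judgment has been mapped to a number from
# 0 (worst) to 3 (best); the source does not describe such a scale.

def needs_discussion(scores, max_spread=1):
    """Return True when raters disagree by more than max_spread steps.

    scores: numeric ratings given by different raters to one result URL.
    max_spread: hypothetical threshold for a "wide margin" of disagreement.
    """
    if len(scores) < 2:
        return False  # a single rating cannot disagree with itself
    return max(scores) - min(scores) > max_spread


# Three raters who mostly agree: no second round needed.
print(needs_discussion([3, 2, 2]))
# One rater is far from the others: the URL would be shown again.
print(needs_discussion([3, 0, 2]))
```

In this sketch, a flagged URL would go back to the raters for discussion, and the check would be repeated on the revised ratings until it no longer fires.
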
It's hard to define relevant results, but Google evaluates results "based on relevance not to a specific person who actually posed the query, but to an imaginary rational mind behind the query. Oftentimes, a query may have more than one meaning, or interpretation. In such cases we will have to look at the hypothetical set of rational search engine users behind an ambiguous query, and deduce, or roughly estimate, the make-up of that set; for instance, we will consider the relative presence of zoology enthusiasts and car shoppers in a hypothetical representative sample of the users who could have queried [jaguar]."
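
One way to picture the "hypothetical set of rational users" is as a weighted average over interpretations. The sketch below is an illustration only: the user shares and the usefulness scores are invented numbers, and `expected_relevance` is a hypothetical helper, not something taken from the rater guidelines.

```python
# A minimal sketch of weighting a result by the estimated mix of intents.
# All numbers below are made up for illustration.

def expected_relevance(intent_share, usefulness):
    """Average a result's usefulness over the estimated mix of users.

    intent_share: estimated share of users behind each interpretation.
    usefulness: how well the result serves each interpretation (0.0 to 1.0).
    """
    total = sum(intent_share.values())
    return sum(
        share / total * usefulness.get(intent, 0.0)
        for intent, share in intent_share.items()
    )


# Hypothetical make-up of the users who could have queried [jaguar].
jaguar_users = {"car shoppers": 0.6, "zoology enthusiasts": 0.4}

# A car manufacturer's site serves one group fully and the other not at all.
car_site = {"car shoppers": 1.0, "zoology enthusiasts": 0.0}

print(expected_relevance(jaguar_users, car_site))
```

A page that serves only the smaller group would score lower under the same estimate, which matches the idea that ratings follow the likely mix of users rather than one person's intent.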

Google thinks there are three types of queries:

* navigational queries, which have a single intended result (like "BMW" or "MSN")

* informational queries, with more than one possible result (like "renaissance paintings", "what is a shark")

* transactional queries, where the user wants to acquire something ("download text editor", "buy blackberry")

Each result receives one of nine ratings: Vital, Useful, Relevant, Off Topic, Offensive, Erroneous, Didn’t Load, Foreign Language, Unrated.
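
The query types and ratings above can be modeled as a small data structure. This is a sketch under assumptions: the class names, the `Judgment` record, and the example URL are hypothetical, and only the type names and rating labels come from the post. A flag enum is used for the query type because a query can be a mixture of two or three types.

```python
from dataclasses import dataclass
from enum import Enum, Flag, auto


class QueryType(Flag):
    """Query types from the post; Flag lets a query combine several."""
    NAVIGATIONAL = auto()
    INFORMATIONAL = auto()
    TRANSACTIONAL = auto()


class Rating(Enum):
    """The nine rating labels listed in the post."""
    VITAL = "Vital"
    USEFUL = "Useful"
    RELEVANT = "Relevant"
    OFF_TOPIC = "Off Topic"
    OFFENSIVE = "Offensive"
    ERRONEOUS = "Erroneous"
    DIDNT_LOAD = "Didn't Load"
    FOREIGN_LANGUAGE = "Foreign Language"
    UNRATED = "Unrated"


@dataclass(frozen=True)
class Judgment:
    """One rater's verdict on one result (hypothetical record layout)."""
    query: str
    query_type: QueryType
    url: str
    rating: Rating


# Example: a query treated as both transactional and informational.
judgment = Judgment(
    query="buy blackberry",
    query_type=QueryType.TRANSACTIONAL | QueryType.INFORMATIONAL,
    url="https://example.com/phones",  # placeholder URL
    rating=Rating.USEFUL,
)

print(QueryType.TRANSACTIONAL in judgment.query_type)
print(judgment.rating.value)
```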

Each tester should perform the following tasks:

* Understanding the meaning of the query and its type – is it navigational, informational, transactional, or a mixture of two or three?

* If you come to the realization that the query could have been posted by different users with different intentions, crudely assigning possibilities for each interpretation and/or intent

* Researching the query coverage on the web using search engines other than Google, directories, specialized databases, and other sites, or offline resources

* Examining each result for attributes that would call for assigning an applicable special category rather than a merit-based assessment, and, in the absence of those attributes

* Determining the merit rating in light of the query coverage and considering various utility dimensions, as well as taking into account evidence of deceitful web design where appropriate.

So, as you can see, evaluating search results is not an easy task, and this work influences a lot of what you see in Google search today.

You may be wondering what the point of this post is. Gary Price from Resource Shelf found that Google has recently registered some interesting domain names:

* indexbench.com (and .org, .info, .net) and similar domains
* Google-testing.com (and .net, .org) and similar domains

After the experience with Google Image Labeler, I think Google will try to recruit more quality testers, but this time for free. If the process is fun and the system is good enough to deal with spam and low-quality raters, the whole world could rate search results. This is just speculation, but it wouldn't be the first time that users actively influence the order of search results (if you click directly on the third result of a search, Google will know the first two weren't relevant).
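
The click example in the parenthesis can be sketched as a simple inference over a ranked list. This only illustrates the post's speculation: the function name `skipped_results` and the example URLs are hypothetical, and nothing here describes how Google actually processes clicks.

```python
# A minimal sketch of the "skipped results" idea, assuming the only
# available signal is which position the user clicked.

def skipped_results(ranked_urls, clicked_index):
    """Return the results ranked above the one the user clicked.

    ranked_urls: result URLs in the order they were shown.
    clicked_index: zero-based position of the clicked result.
    """
    if not 0 <= clicked_index < len(ranked_urls):
        raise IndexError("clicked_index is outside the result list")
    return ranked_urls[:clicked_index]


results = [
    "https://example.com/first",
    "https://example.com/second",
    "https://example.com/third",
]

# The user clicks the third result, so the first two were passed over.
print(skipped_results(results, 2))
```

A real system would need far more than this, since a single click says little on its own; the sketch only shows the shape of the signal.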
