Google's quest for the perfect links

TED partner Google has for the first time allowed a journalist (Saul Hansell of the NYT) to spend a day with engineer Amit Singhal and his "search-quality team" — the people responsible for the very secret mathematical formulas that decide which web pages best answer each user’s query. It’s a delicate act, a mix of science and artistry: half a dozen major or minor changes are introduced into Google’s search engine every week, and each change can affect the ranking of many sites — although most are barely noticed by the average user. Hansell’s story is a rare glimpse behind the world’s largest search engine, which indexes billions of webpages in over a hundred languages and handles hundreds of millions of queries a day. It’s a long article (3,200 words), but since "it’s becoming impossible not to visit with Google daily," as Swiss technophilosopher René Berger once said, it’s worth knowing a thing or two about the way your host runs his house. Excerpts:

Google’s servers basically make a copy of the entire Web, page by page, every few days, storing it in their huge data centers:

As Google compiles its index, it calculates a number it calls PageRank for each page it finds.
[ BG: the picture at right shows the original PageRank algorithm, from a PowerPoint presentation Larry Page gave at Stanford in 1998] This was the key invention of Google’s founders, Larry Page and Sergey Brin. PageRank tallies how many times other sites link to a given page. Sites that are more popular, especially with sites that have high PageRanks themselves, are considered likely to be of higher quality.
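The core idea — a page's rank depends on the ranks of the pages linking to it — can be sketched as a simple iteration. This is a minimal toy version of the published PageRank formula, not Google's actual implementation; the damping factor and iteration count are the conventional textbook values:

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # every page keeps a small base rank (1 - d) / n ...
        new_rank = {p: (1 - d) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue
            # ... and passes a damped share of its rank along each outgoing link
            share = d * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

# Toy web: "a" links to "b" and "c"; both link back to "a".
ranks = pagerank({"a": ["b", "c"], "b": ["a"], "c": ["a"]})
```

In this toy web, "a" receives links from both other pages and so ends up with the highest rank — the "popular pages linked by popular pages" effect the excerpt describes.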

Mr. Singhal has developed a far more elaborate system for ranking pages, which involves more than 200 types of information, or what Google calls “signals.” PageRank is but one signal. Some signals are on Web pages — like words, links, images and so on. Some are drawn from the history of how pages have changed over time. Some signals are data patterns uncovered in the trillions of searches that Google has handled over the years. (…)

Once Google corrals its myriad signals, it feeds them into formulas it calls classifiers that try to infer useful information about the type of search, in order to send the user to the most helpful pages. Classifiers can tell, for example, whether someone is searching for a product to buy, or for information about a place, a company or a person. Google recently developed a new classifier to identify names of people who aren’t famous. Another identifies brand names.
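To make the idea of a classifier concrete: Google's real classifiers are statistical models trained on enormous query logs, but the concept — inferring the *type* of search from signals in the query itself — can be illustrated with a deliberately crude rule-based sketch. Every rule and keyword here is invented for illustration:

```python
def classify_query(query):
    """Guess the intent of a search query (toy rules, not Google's)."""
    q = query.lower()
    # shopping vocabulary suggests the user wants a product to buy
    if any(w in q for w in ("buy", "price", "cheap")):
        return "product"
    # location vocabulary suggests a place search
    if any(w in q for w in ("map of", "near me", "directions to")):
        return "place"
    # a short capitalized phrase often looks like a person's name
    if query.istitle() and len(query.split()) == 2:
        return "person"
    return "general"
```

A real classifier would weigh many such signals probabilistically rather than firing on the first matching rule, but the output is the same kind of label the excerpt describes: product, place, or person.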

These signals and classifiers calculate several key measures of a page’s relevance, including one it calls “topicality” — a measure of how the topic of a page relates to the broad category of the user’s query. (…) Google combines all these measures into a final relevancy score. The sites with the 10 highest scores win the coveted spots on the first search page, unless a final check shows that there is not enough “diversity” in the results. (…) If this wasn’t excruciating enough, Google’s engineers must compensate for users who are not only fickle, but are also vague about what they want; often, they type in ambiguous phrases or misspelled words.
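The combining step above — many signal scores merged into one relevancy score, the top ten winning unless the results lack "diversity" — can be sketched as follows. The weights, signal names, and the particular diversity rule (capping results per site) are all assumptions made for illustration; the article does not describe how Google actually implements the final check:

```python
def relevancy(signals, weights):
    """Collapse per-signal scores into one relevancy score (weighted sum)."""
    return sum(weights[name] * value for name, value in signals.items())

def top_results(pages, weights, k=10, max_per_site=2):
    """Rank pages by relevancy, then enforce a toy diversity rule."""
    scored = sorted(pages, key=lambda p: relevancy(p["signals"], weights),
                    reverse=True)
    results, per_site = [], {}
    for page in scored:
        site = page["site"]
        if per_site.get(site, 0) >= max_per_site:
            continue  # diversity check: cap how many results one site gets
        per_site[site] = per_site.get(site, 0) + 1
        results.append(page)
        if len(results) == k:
            break
    return results
```

With a cap of two results per site, a third page from the same site is skipped in favor of the next-best page from elsewhere — one plausible reading of the "diversity" pass the excerpt mentions.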

And they must of course also keep out the millions of fake webpages created by hucksters who try to hijack searches to lure users to their porn or scam pages. Hansell’s article also details the constant debate inside Google (and other search companies) about "freshness": is it better to provide new information or to display pages that have stood the test of time and are more likely to be of higher quality? Until recently, Google had preferred the latter. But last year, when the company introduced its new stock quotation service, a search for “Google Finance” couldn’t find it, and that pointed to a broader problem that was solved by developing a new mathematical model that tries to determine when users want new information and when they don’t. The solution

revolves around determining whether a topic is “hot.” If news sites or blog posts are actively writing about a topic, the model figures that it is one for which users are more likely to want current information. The model also examines Google’s own stream of billions of search queries, which Mr. Singhal believes is an even better monitor of global enthusiasm about a particular subject. As an example, he points out what happens when cities suffer power failures. “When there is a blackout in New York, the first articles appear in 15 minutes; we get queries in two seconds,” he says.
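Singhal's blackout example suggests a simple way to think about the model: compare a topic's current query rate against its recent baseline, and treat a large spike as a sign that users want fresh results. The sketch below is only an illustration of that idea — the window size and spike threshold are invented, and Google's actual model is certainly far more elaborate:

```python
def is_hot(query_counts, spike_factor=5.0):
    """query_counts: per-hour query counts for a topic, oldest first;
    the last entry is the current hour. A large spike over the recent
    baseline marks the topic as 'hot' (users likely want fresh pages)."""
    *history, current = query_counts
    baseline = sum(history) / len(history)
    # guard against a zero baseline for brand-new topics
    return current > spike_factor * max(baseline, 1.0)
```

For a topic humming along at ten-ish queries an hour, a sudden jump to hundreds — a blackout, say — trips the threshold within the current window, matching the "queries in two seconds" intuition in the excerpt.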