TEDBlog


19 June 2007

Google's quest for the perfect links

For the first time, TED partner Google has allowed a journalist (Saul Hansell of the NYT) to spend a day with engineer Amit Singhal and his "search-quality team" -- the people responsible for the closely guarded mathematical formulas that decide which web pages best answer each user's query. It's a delicate act, a mix of science and artistry: half a dozen major or minor changes are introduced in Google's search engine every week, and each change can affect the ranking of many sites -- although most are barely noticed by the average user. Hansell's story is a rare glimpse behind the scenes of the world's largest search engine, which indexes billions of webpages in over a hundred languages and handles hundreds of millions of queries a day. It's a long article (3200 words) but since "it's becoming impossible not to visit with Google daily", as Swiss technophilosopher René Berger once said, it's worth knowing a thing or two about the way your host runs his house. Excerpts:

Google's servers basically make a copy of the entire Web, page by page, every few days, storing it in their huge data centers:

As Google compiles its index, it calculates a number it calls PageRank for each page it finds. [BG: the picture at right shows the original PageRank algorithm, from a PowerPoint presentation Larry Page gave at Stanford in 1998] This was the key invention of Google’s founders, Larry Page and Sergey Brin. PageRank tallies how many times other sites link to a given page. Sites that are more popular, especially with sites that have high PageRanks themselves, are considered likely to be of higher quality.
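To make the idea concrete, here is a minimal sketch of the textbook PageRank iteration in Python. It is not Google's production code: the damping factor of 0.85, the fixed number of iterations, the handling of pages with no outgoing links, and the four-page toy web are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a small link graph.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults, not Google's values.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "c" is linked from three pages, "d" from none.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
best_page = max(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming links ends up with the highest rank, and a link from a highly ranked page counts for more than a link from an obscure one, which is the behaviour the excerpt describes.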

Mr. Singhal has developed a far more elaborate system for ranking pages, which involves more than 200 types of information, or what Google calls “signals.” PageRank is but one signal. Some signals are on Web pages — like words, links, images and so on. Some are drawn from the history of how pages have changed over time. Some signals are data patterns uncovered in the trillions of searches that Google has handled over the years. (...)

Once Google corrals its myriad signals, it feeds them into formulas it calls classifiers that try to infer useful information about the type of search, in order to send the user to the most helpful pages. Classifiers can tell, for example, whether someone is searching for a product to buy, or for information about a place, a company or a person. Google recently developed a new classifier to identify names of people who aren’t famous. Another identifies brand names.
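As a rough illustration of what a query classifier does, the toy sketch below sorts queries into a few types using keyword rules. The word lists, the category names, and the function `classify_query` are hypothetical; the article does not say how Google's classifiers are built, and real ones are certainly far more sophisticated than this.

```python
# Hypothetical keyword lists, made up for this example.
PRODUCT_WORDS = {"buy", "price", "cheap", "review"}
PLACE_WORDS = {"hotel", "map", "restaurants", "weather"}


def classify_query(query):
    """Guess the type of a search query from the words it contains."""
    words = set(query.lower().split())
    if words & PRODUCT_WORDS:
        return "product"
    if words & PLACE_WORDS:
        return "place"
    return "general"


labels = [
    classify_query("buy digital camera"),
    classify_query("weather in Paris"),
    classify_query("string quartet"),
]
```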

These signals and classifiers calculate several key measures of a page’s relevance, including one it calls “topicality” — a measure of how the topic of a page relates to the broad category of the user’s query. (...) Google combines all these measures into a final relevancy score. The sites with the 10 highest scores win the coveted spots on the first search page, unless a final check shows that there is not enough “diversity” in the results. (...) If this wasn’t excruciating enough, Google’s engineers must compensate for users who are not only fickle, but are also vague about what they want; often, they type in ambiguous phrases or misspelled words.
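The combination step can be pictured as a weighted sum followed by a filter. The sketch below is an assumption-laden illustration, not Google's formula: the measure names, the weights, and the reading of "diversity" as a cap on results per site are all invented for the example.

```python
from collections import Counter

# Hypothetical weights; the real measures and their weights are not public.
WEIGHTS = {"topicality": 0.6, "pagerank": 0.4}


def relevancy(measures, weights=WEIGHTS):
    """Combine several per-page measures into one score (a plain weighted sum)."""
    return sum(weight * measures.get(name, 0.0) for name, weight in weights.items())


def first_page(candidates, size=10, max_per_site=2):
    """Pick the top-scoring pages, skipping a site once it has max_per_site results.

    The per-site cap is only one possible interpretation of a diversity check.
    """
    ranked = sorted(candidates, key=lambda c: relevancy(c["measures"]), reverse=True)
    per_site = Counter()
    page = []
    for candidate in ranked:
        if per_site[candidate["site"]] >= max_per_site:
            continue  # this site already holds enough spots
        page.append(candidate)
        per_site[candidate["site"]] += 1
        if len(page) == size:
            break
    return page


# Hypothetical candidates for one query.
candidates = [
    {"url": "a.example/1", "site": "a.example", "measures": {"topicality": 0.9, "pagerank": 0.8}},
    {"url": "a.example/2", "site": "a.example", "measures": {"topicality": 0.8, "pagerank": 0.8}},
    {"url": "a.example/3", "site": "a.example", "measures": {"topicality": 0.7, "pagerank": 0.8}},
    {"url": "b.example/1", "site": "b.example", "measures": {"topicality": 0.5, "pagerank": 0.3}},
]
urls = [c["url"] for c in first_page(candidates, size=3)]
```

Here the third page from a.example scores higher than the page from b.example, but the cap lets the second site onto the results page anyway.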

Google's engineers must of course also keep out the millions of fake webpages created by hucksters who try to hijack searches to lure users to their porn or scam pages. Hansell's article also details the constant debate inside Google (and other search companies) about "freshness": is it better to provide new information or to display pages that have stood the test of time and are more likely to be of higher quality? Until recently, Google had preferred the latter. But last year, when the company introduced its new stock quotation service, a search for “Google Finance” couldn’t find it. That pointed to a broader problem, which was solved by developing a new mathematical model that tries to determine when users want new information and when they don't. The solution

revolves around determining whether a topic is “hot.” If news sites or blog posts are actively writing about a topic, the model figures that it is one for which users are more likely to want current information. The model also examines Google’s own stream of billions of search queries, which Mr. Singhal believes is an even better monitor of global enthusiasm about a particular subject. As an example, he points out what happens when cities suffer power failures. “When there is a blackout in New York, the first articles appear in 15 minutes; we get queries in two seconds,” he says.
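One simple way to picture a "hot topic" test is to compare the current query volume for a topic with its usual level. The sketch below assumes hourly query counts and made-up thresholds (`ratio`, `min_count`); it illustrates the general idea only and is not the model described in the article.

```python
def is_hot(recent_count, baseline_counts, ratio=3.0, min_count=20):
    """Flag a topic as hot when recent queries far exceed the usual volume.

    recent_count: queries seen in the latest time window.
    baseline_counts: queries seen in earlier windows of the same length.
    ratio and min_count are illustrative thresholds, not known values.
    """
    if recent_count < min_count:
        return False  # too few queries to call it a spike
    if baseline_counts:
        baseline = sum(baseline_counts) / len(baseline_counts)
    else:
        baseline = 0.0
    # Guard against a near-zero baseline so a brand-new topic can still qualify.
    return recent_count >= ratio * max(baseline, 1.0)


# Hypothetical hourly counts for a query such as "new york blackout".
usual_hours = [40, 35, 50, 45]
spike = is_hot(500, usual_hours)
normal = is_hot(48, usual_hours)
```

A ranking system could then give fresher pages more weight for queries flagged this way and keep favouring established pages otherwise.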


