Google numbers — why don’t they add up?

Fascinating article by Simson Garfinkel about Google’s secretiveness. Here’s a quote:

“Farach-Colton was giving a public lecture about his two-year sabbatical working at Google. The number that he was disparaging was in the middle of his PowerPoint slide:

150 million queries/day

The next slide had a few more numbers:

1,000 queries/sec (peak)
10,000+ servers
More than 4 tera-ops/sec at daily peak
Index: 3 billion Web pages
4 billion total docs
4+ petabytes disk storage

A few people in the audience started to giggle: the Google figures didn’t add up.

I started running the numbers myself. Let’s see: “4 tera-ops/sec” means 4,000 billion operations per second; a top-of-the-line server can do perhaps two billion operations per second, so that translates to perhaps 2,000 servers — not 10,000. Four petabytes is 4×10¹⁵ bytes of storage; spread that over 10,000 servers and you’d have 400 gigabytes per server, which again seems wrong, since Farach-Colton had previously said that Google puts two 80-gigabyte hard drives into each server.

And then there is that issue of 150 million queries per day. If the system is handling a peak load of 1,000 queries per second, that translates to a peak rate of 86.4 million queries per day — or perhaps 40 million queries per day if you assume that the system spends only half its time at peak capacity. No matter how you crank the math, Google’s statistics are not self-consistent.

“These numbers are all crazily low,” Farach-Colton continued. “Google always reports much, much lower numbers than are true.”

Whenever somebody from Google puts together a new presentation, he explained, the PR department vets the talk and hacks down the numbers. Originally, he said, the slide with the numbers said that 1,000 queries/sec was the “minimum” rate, not the peak. “We have 10,000-plus servers. That’s plus a lot.”

Just as Google’s search engine comes back instantly and seemingly effortlessly with a response to any query that you throw it, hiding the true difficulty of the task from users, the company also wants its competitors kept in the dark about the difficulty of the problem. After all, if Google publicized how many pages it has indexed and how many computers it has in its data centers around the world, search competitors like Yahoo!, Teoma, and Mooter would know how much capital they had to raise in order to have a hope of displacing the king at the top of the hill.

Google has at times had a hard time keeping its story straight. When vice president of engineering Urs Hoelzle gave a talk about Google’s Linux clusters at the University of Washington in November of 2002, he repeated that figure of 1,000 queries per second — but he said that the measure was made at 2:00 a.m. on December 25, 2001. His point, obvious to everybody in the room, is that even by November 2002, Google was doing a lot more than 1,000 queries per second — just how many more, though, was anybody’s guess.

The facts may be seeping out. Last Thanksgiving, the New York Times reported that Google had crossed the 100,000-server mark. If true, that means Google is operating perhaps the largest grid of computers on the planet. “The simple fact that they can build and operate data centers of that size is astounding,” says Peter Christy, co-founder of the NetsEdge Research Group, a market research and strategy firm in Silicon Valley. Christy, who has worked in the industry for more than 30 years, is astounded by the scale of Google’s systems and the company’s competence in operating them. “I don’t think that there is anyone close.”

It’s this ability to build and operate incredibly dense clusters that is as much as anything else the secret of Google’s success. And the reason, explains Marissa Mayer, the company’s director of consumer Web products, has to do with the way that Google started at Stanford…”.
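For anyone who wants to rerun Garfinkel's back-of-the-envelope checks, here is a rough sketch in Python. It uses only the figures quoted above: the slide numbers, his estimate of two billion operations per second per server, and the two 80-gigabyte drives per server. The variable names are my own, and decimal units (1 GB = 10^9 bytes) are an assumption.

```python
# Figures from the slide, as quoted in the article.
QUERIES_PER_DAY_CLAIMED = 150_000_000
PEAK_QUERIES_PER_SEC = 1_000
SERVERS_CLAIMED = 10_000
PEAK_OPS_PER_SEC = 4 * 10**12        # "4 tera-ops/sec"
DISK_BYTES_CLAIMED = 4 * 10**15      # "4+ petabytes"

# Garfinkel's own estimates from the article (not Google figures).
OPS_PER_SERVER = 2 * 10**9           # "perhaps two billion operations per second"
DRIVE_BYTES_PER_SERVER = 2 * 80 * 10**9  # two 80-gigabyte drives, decimal GB assumed

SECONDS_PER_DAY = 24 * 60 * 60

# Servers needed to reach the stated peak throughput.
servers_implied = PEAK_OPS_PER_SEC / OPS_PER_SERVER            # 2,000

# Disk per server if the stated storage is spread over the stated servers.
gb_per_server = DISK_BYTES_CLAIMED / SERVERS_CLAIMED / 10**9   # 400 GB

# Total disk if every claimed server really holds two 80 GB drives.
pb_from_drives = SERVERS_CLAIMED * DRIVE_BYTES_PER_SERVER / 10**15  # 1.6 PB

# Upper bound on daily queries if the system ran at peak all day.
max_queries_per_day = PEAK_QUERIES_PER_SEC * SECONDS_PER_DAY   # 86.4 million

# Same bound if the system is at peak only half the time
# (the article rounds this to "perhaps 40 million").
half_peak_queries_per_day = max_queries_per_day / 2            # 43.2 million

print(f"servers implied by 4 tera-ops/sec: {servers_implied:,.0f}")
print(f"disk per server at 10,000 servers: {gb_per_server:,.0f} GB")
print(f"disk from 2 x 80 GB drives:        {pb_from_drives:.1f} PB")
print(f"max queries/day at constant peak:  {max_queries_per_day:,}")
print(f"claimed queries/day:               {QUERIES_PER_DAY_CLAIMED:,}")
```

Even at constant peak, 1,000 queries per second gives 86.4 million queries a day, well short of the 150 million on the first slide, which is the inconsistency the audience was giggling at.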