How big is the web?

Nobody really knows, but here is an interesting post on the Official Google Blog…

We’ve known it for a long time: the web is big. The first Google index in 1998 already had 26 million pages, and by 2000 the Google index reached the one billion mark. Over the last eight years, we’ve seen a lot of big numbers about how much content is really out there. Recently, even our search engineers stopped in awe about just how big the web is these days — when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!

How do we find all those pages? We start at a set of well-connected initial pages and follow each of their links to new pages. Then we follow the links on those new pages to even more pages and so on, until we have a huge list of links. In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates, we saw a trillion unique URLs, and the number of individual web pages out there is growing by several billion pages per day.
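For the curious, the procedure described above (start from some seed pages, follow links outwards, then discard exact duplicates) can be sketched in a few lines of Python. This is a toy illustration only, not Google's actual system: the function name `crawl`, the in-memory `links` and `content` dictionaries, and the use of a SHA-256 hash to spot identical pages are all assumptions made for the example.

```python
from collections import deque
import hashlib


def crawl(seeds, links, content):
    """Breadth-first walk over a toy, in-memory link graph.

    links   -- hypothetical dict mapping a URL to the URLs it links to
    content -- hypothetical dict mapping a URL to its page text
    """
    start = list(dict.fromkeys(seeds))  # drop repeated seeds, keep order
    seen_urls = set(start)              # every unique URL discovered so far
    queue = deque(start)                # URLs still waiting to be visited
    links_found = 0                     # every individual link, repeats included
    page_hashes = set()                 # one hash per distinct page body

    while queue:
        url = queue.popleft()
        body = content.get(url, "")
        page_hashes.add(hashlib.sha256(body.encode("utf-8")).hexdigest())
        for target in links.get(url, []):
            links_found += 1
            if target not in seen_urls:
                seen_urls.add(target)
                queue.append(target)

    return {
        "links": links_found,
        "unique_urls": len(seen_urls),
        "unique_pages": len(page_hashes),
    }


# Four URLs, two of which ("b" and "c") serve exactly the same content.
toy_links = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
toy_content = {"a": "home", "b": "news", "c": "news", "d": "about"}
result = crawl(["a"], toy_links, toy_content)
print(result)
```

Even on this tiny graph the three numbers differ: more links than URLs, and more URLs than distinct pages, which is the distinction the Google post is drawing.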

So how many unique pages does the web really contain? We don’t know; we don’t have time to look at them all! :-) Strictly speaking, the number of pages out there is infinite — for example, web calendars may have a “next day” link, and we could follow that link forever, each time finding a “new” page. We’re not doing that, obviously, since there would be little benefit to you. But this example shows that the size of the web really depends on your definition of what’s a useful page, and there is no exact answer…

Say ‘Cheese!’ for Google

This morning’s Observer column — about Google Street View…

In a way the issue is not whether this Google innovation is permitted or not, but the general direction we’re headed and the role Google might play in our collective future. Last week I wrote about the legal ruling which compelled Google to hand over to Viacom its computer logs of every single viewing of a YouTube video, including those by UK residents. The privacy implications of that ruling have since been mitigated by agreement that the data can be ‘anonymised’ by Google before handover. But, again, the direction is towards a world in which everything we do is monitored and logged – mostly by one company.

Google’s mission, according to its corporate website, is ‘to organise the world’s information and make it universally accessible and useful’. What we perhaps haven’t fully realised is that these guys really mean it. Their ambition is at least as megalomaniacal as Bill Gates’s vision of a computer on every desk running Microsoft software. So it’s time we started thinking about what a world dominated by Google would be like. As it happens, some people have – and they’ve been publishing the results on YouTube. Have a look — and then pour yourself a stiff drink.

The word on the street

In his Manitoba lecture, Mike Wesch mentioned a survey which suggested that 88% of the material on YouTube was original, not the copyrighted stuff the mainstream media (and Viacom) obsesses about. Here’s a great example of creative use of the platform. It’s the second of a series of four short movies about the creepier implications of Google Street View.

Thanks to Tony Hirst for spotting it.

Viacom ‘backs off’?

Well, maybe.

Viacom has “backed off” from demands to divulge the viewing habits of every user who has ever watched a video on YouTube, the website has claimed.

Google had been ordered to provide personal details of millions of YouTube users to help Viacom prepare its case on alleged copyright infringement…

En passant, I think I heard Mike Wesch say in his Manitoba lecture that a survey he and his students did found that 88% of the stuff on YouTube is original material, i.e. not copyright-infringing.

Who’s watching what?

This morning’s Observer column

On 2 July, a US district judge, Louis L Stanton, lobbed a grenade into the cosy world of social networking, user-generated content and so-called ‘cloud’ computing. He ordered Google to turn over to Viacom all of its logs relating to viewing of YouTube video clips since the search engine giant acquired the video hosting site in November 2006.

That amounts to 12 terabytes (or more than 12 million megabytes) of data: each log entry records the user name and IP (machine) address of the user who viewed the video, plus a timestamp and a code identifying the clip. What the judgment means is that if you have watched a YouTube clip at any time since November 2006, a record of that will be passed to Viacom’s lawyers…
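To make the shape of that data concrete, here is a rough Python sketch of one log entry with the four fields mentioned above, together with the terabytes-to-megabytes arithmetic. The class and field names are invented for illustration, and the `pseudonymise` helper shows just one possible way to mask the identifying fields (a keyed hash); nothing here is claimed to be the format or the method Google actually uses.

```python
from dataclasses import dataclass, replace
import hashlib
import hmac


@dataclass(frozen=True)
class ViewLogEntry:
    """One viewing record; field names are hypothetical."""
    user_name: str
    ip_address: str
    timestamp: str
    clip_id: str


def pseudonymise(entry, secret_key):
    """Swap the identifying fields for keyed-hash tokens (an assumed approach)."""
    def token(value):
        digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:16]

    return replace(
        entry,
        user_name=token(entry.user_name),
        ip_address=token(entry.ip_address),
    )


# 12 terabytes expressed in megabytes, in binary and decimal units.
TERABYTES = 12
megabytes_binary = TERABYTES * 1024 * 1024    # 12,582,912
megabytes_decimal = TERABYTES * 1000 * 1000   # 12,000,000

# Made-up sample values; 192.0.2.x is a documentation-only address range.
entry = ViewLogEntry("some_user", "192.0.2.17", "2008-07-02T09:30:00Z", "clip123")
masked = pseudonymise(entry, b"example-key")
print(masked)
```

The same user always maps to the same token, so viewing patterns survive the masking. That is why experts argue that this kind of 'anonymised' data can sometimes still be linked back to individuals.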

UPDATE: This from CNET:

Viacom wants to know which videos YouTube employees have watched and uploaded to the site, and Google is refusing to provide that information, CNET News has learned.

This dispute is the reason the two companies, and lawyers representing a group of other copyright holders suing Google, have failed to reach a final agreement on anonymizing personal information belonging to YouTube users, according to two sources close to the situation.

Reality dawns in the Googleplex

From this morning’s NYTimes

Two months ago, Google held a series of secret focus groups with employees who have children in Google’s day care facilities. The purpose was to gauge their reaction to the company’s plan to raise the amount it charged for in-house day care by 75 percent.

Parents who had been paying $1,425 a month for infant care would see their costs rise to nearly $2,500 — well above the market rate. For parents with toddlers and preschoolers, who were charged less, the price increases were equally eye-popping. Under the new plan, parents with two kids in Google day care would most likely see their annual day care bill grow to more than $57,000 from around $33,000.

At the first of the three focus groups, parents wept openly. As word leaked out about the company’s plan, the Google parents began to fight back. They came up with ideas to save money, used the company’s T.G.I.F. sessions — a weekly meeting for anyone who wanted to ask questions of Google’s top executives — to plead their case, and conducted surveys showing that most parents with children in Google day care would have to leave Google’s facilities and find less expensive child care.

Now we know how this story ends, don’t we? Google famously doesn’t do evil. But guess what?

Although Google is rolling back its price increase slightly and is phasing in the higher price over five quarters, the outline of the original decision remains largely unchanged. At a T.G.I.F. in June, the Google co-founder Sergey Brin said he had no sympathy for the parents, and that he was tired of “Googlers” who felt entitled to perks like “bottled water and M&Ms,” according to several people in the meeting. (A Google spokesman denies that Mr. Brin made that comment.) On Monday, Google began the first phase of its new day care plan, letting go of the outside day care firm it had been using.

Another straw in the wind. Google may be extraordinary in some ways, but basically it’s a public company, not a campaigning, do-gooding non-profit. That’s why it caved in to the Chinese over censoring search results. That’s why it’s handing over all those YouTube access details to Viacom. And that’s why it’s beginning to pare back employee perks.

Now Viacom knows where you are

This is truly (as Marc Rotenberg, executive director of the Electronic Privacy Information Center, put it) one of those “I told you so” moments.

For every video on YouTube, the judge required Google to turn over to Viacom the login name of every user who had watched it, and the address of their computer, known as an I.P. or Internet protocol address.

Both companies have argued that I.P. addresses alone cannot be used to unmask the identities of individuals with certainty. But in many cases, technology experts and others have been able to link I.P. addresses to individuals using other records of their online activities.

The amount of data covered by the order is staggering, as it includes every video watched on YouTube since its founding in 2005. In April alone, 82 million people in the United States watched 4.1 billion clips there, according to comScore. Some experts say virtually every Internet user has visited YouTube.

Of course Viacom swears blind that the only people who will have access to this information are its lawyers (who are working on its $1 billion copyright infringement suit against Google). But it brings one up sharply against the implications of cloud computing.

What Google does right — and wrong

Here’s an interesting phenomenon — a guy who has left Google to work for Microsoft. In his blog he explains why. First the good news:

There are many things that Google does really well, and I plan to advocate that some of these things be adopted at Microsoft.

Among them is the peer-based review model where one’s performance is determined largely based on peer comments, and much less so based on the observations of the manager. The idea that a manager is far easier to fool than the co-workers are is sound and largely works. A very important side-effect that this model produces is an increased amount of cooperation between the people, and generally better relationships within the team.

The wide employee participation in corporate governance through a concept called “Intergrouplets” is a good one and merits emulation. Unlike most other companies where internal life is regulated largely by management, a lot of aspects of Google are ruled by committees of employees who are passionate about an issue, and are willing to allocate some of their time to have this issue resolved. Many things, such as quality of code base, testing practices, internal engineering documentation, and even food service are decided by intergrouplets. Of course, this is where 20% time (a practice where any Googler can spend one day a week working on whatever he or she wants) plugs in well, for without available time there would have been nothing to allocate.

Doing many things by committee. Hiring, resource allocations at Google are done by consensus of many players. If you are to achieve anything at Google, you must learn how to build this consensus, or at least how to not obstruct it. This skill comes in very handy for every other aspect of work.

Free food. More than just a benefit, it is a tool for increasing communications within the team, because it’s so much easier to have team lunches. I don’t think making Redmond cafeterias suddenly free would work (maybe I am wrong), but giving out free lunch coupons for teams of more than 3 people from more than one discipline to have lunch together – and at the same time have an opportunity to communicate – I think, has a fair chance of success.

There are other things that I would want at Microsoft, but which will probably not happen simply because there is far too much legacy. I will miss the things like one code base with uniform style guides and coding standards – there’s too much existing code at Microsoft to try and turn this ship around.

So why did he leave?

Several reasons. Firstly, it seems that he prefers writing software for users who are willing to pay real money for it.

Secondly, he doesn’t like the way Google approaches software engineering. Its orientation towards cool, but not necessarily useful or essential software, he writes,

really affects the way the software engineering is done. Everything is pretty much run by the engineering – PMs and testers are conspicuously absent from the process. While they do exist in theory, there are too few of them to matter.

[…]

On the other hand, I was using Google software – a lot of it – in the last year, and slick as it is, there’s just too much of it that is regularly broken. It seems like every week 10% of all the features are broken in one or the other browser. And it’s a different 10% every week – the old bugs are getting fixed, the new ones introduced. This across Blogger, Gmail, Google Docs, Maps, and more.

This is probably fine for free software, but I always laugh when people tell me that Google Docs is viable competition to Microsoft Office. If it is, that is only true for the occasional users who would not buy Office anyway. Google as an organization is not geared – culturally – to delivering enterprise class reliability to its user applications.

As I say, it’s an interesting perspective. And he’s probably done himself no harm with his new bosses at Redmond by writing about it. Or is that too cynical a view?