Wolfram Alpha vs Google

At last, some data. David Talbot got a login id from Wolfram and ran some comparative tests. For example:

SEARCH TERM: 10 pounds kilograms

WOLFRAM ALPHA: The site informed me that it interpreted my search term as an effort to multiply "10 pounds" by "1 kilogram" and gave me this result: 4.536 kg2 (kilograms squared) or 22.05 lb2 (pounds squared).

GOOGLE: Google gave me links to various metric conversion sites.
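For what it's worth, the arithmetic behind that odd answer is easy to reproduce. Here's a small Python sketch using the pint units library (purely my illustration; neither engine works this way) showing the conversion Google's links would give you alongside the multiplication Wolfram actually performed:

```python
# Two readings of "10 pounds kilograms", reproduced with the pint units
# library. Illustrative only: this is not how either search engine works.
import pint

ureg = pint.UnitRegistry()

# Reading 1: convert 10 pounds into kilograms.
print((10 * ureg.pound).to(ureg.kilogram))        # ~4.536 kilogram

# Reading 2 (Wolfram Alpha's): multiply 10 pounds by 1 kilogram.
product = (10 * ureg.pound) * (1 * ureg.kilogram)
print(product.to(ureg.kilogram ** 2))             # ~4.536 kilogram ** 2
print(product.to(ureg.pound ** 2))                # ~22.05 pound ** 2
```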

Tentative conclusion: the semantic web is still a long way off. The problem of search is only about 5% solved. Google accounts for 3% of that. Mr Talbot’s experiments suggest that Wolfram isn’t going to move it much beyond 6%. Still, it’s progress. And Google could do with some competition.

Google, the Collective Unconscious and PEAR

Hmmm… this Google Trends idea gets more intriguing by the minute. Rex Hughes read my column and pointed me at the Princeton Engineering Anomalies Research Program, which I guess was funded by You Know Who at the Pentagon. The project has closed, but here’s the blurb:

The Princeton Engineering Anomalies Research (PEAR) program, which flourished for nearly three decades under the aegis of Princeton University’s School of Engineering and Applied Science, has completed its experimental agenda of studying the interaction of human consciousness with sensitive physical devices, systems, and processes, and developing complementary theoretical models to enable better understanding of the role of consciousness in the establishment of physical reality. It has now incorporated its present and future operations into the broader venue of the International Consciousness Research Laboratories (ICRL), a 501(c)(3) organization chartered in the State of New Jersey. In this new locus and era, PEAR plans to expand its archiving, outreach, education, and entrepreneurial activities into broader technical and cultural context, maintaining its heritage of commitment to intellectual rigor and integrity, pragmatic and beneficial relevance of its techniques and insights, and sophistication of its spiritual implications.

It lives on here.

Google as a predictor

Following on from this post and today’s Observer column, I’ve had some feedback asking how one might go about using Google queries as forward indicators of economic developments. The answer is that I don’t know, but it probably hinges on finding the right search queries to map. Here are some experiments I’ve done.

This suggests that people weren’t really concerned about bank deposit guarantees until August 2008. I don’t believe this, so perhaps this is the wrong search term to be tracking.

This suggests that interest in house prices peaked in 2005 and is now in gentle decline. This might indicate that the search term is a token for general curiosity (“wonder what our house is worth at the moment?”) rather than alarm. It’s interesting to compare this with the chart for ‘negative equity’ below.

This has a regular annual cycle but is now clearly on the rise. Again, it’s not clear that it has much predictive power.

Speaks for itself, I think.

Ditto.

As I say, the trick would be to identify search terms which would give an indication of what people know or suspect about their organisational future.

There’s nothing terribly systematic about this — I was just trying to think of search terms that might reveal what people were thinking as they realise that they face an uncertain future.
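For anyone who wants to repeat the exercise less haphazardly, here's a rough Python sketch. It assumes you've exported weekly interest figures for a few candidate terms from Google Trends as CSV files; the file and column names ('week', 'interest') are my guesses at a layout, not anything Google specifies:

```python
# Rough sketch: load exported Google Trends data for several candidate search
# terms and line them up as weekly time series so their movements can be
# compared. File and column names are assumptions about the export format.
import pandas as pd

TERMS = ["deposit guarantee", "house prices", "negative equity"]

def load_term(term: str) -> pd.Series:
    """Read one exported CSV and return weekly interest indexed by date."""
    df = pd.read_csv(f"{term}.csv", parse_dates=["week"])
    return df.set_index("week")["interest"].rename(term)

# One column per search term, one row per week.
trends = pd.concat([load_term(t) for t in TERMS], axis=1)

# A first look at which terms are moving, and how fast: recent
# week-on-week percentage changes.
print(trends.pct_change().tail(12))
```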

Tony Hirst, who is the nearest thing to a wizard with Google data that I know, has some interesting things to say about the topic. He’s also knowledgeable about the limitations of the Google data.

The predictive power of search engine queries

This morning’s Observer column:

The most interesting aspect of the Google data, however, was revealed in a chart which compared flu queries with ‘objective’ data on incidence of the disease compiled by public health authorities. The chart suggests that the search data accurately reflects incidence – but is current rather than lagged. (The official statistics take about two weeks to collate.)

This suggests other possibilities – for example in macroeconomic management. Everyone I know in business has known for months that the UK is in recession, but it’s only lately that the authorities have been in a position to confirm that – because the official data always lag the current reality. So policymakers are in the situation of someone trying to drive a car which has a blacked-out windscreen. The driver’s only view of the road is via a TV monitor showing what was happening 10 seconds ago. How long would you give the driver before he hits a wall? We need to raise our game, and maybe intelligent use of the net offers us a way of doing it.
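One way of making the blacked-out windscreen point concrete is to ask how far ahead of the official figures the search data runs. A minimal sketch, assuming you already have a weekly search-volume series and a weekly official indicator as pandas Series on a shared date index (both are placeholders here, not real data sources):

```python
# Sketch: estimate how many weeks a search-volume series leads an official
# indicator by finding the shift at which the two correlate most strongly.
# `searches` and `official` are assumed to be weekly pandas Series sharing a
# DatetimeIndex; they are placeholders, not real data sources.
import pandas as pd

def best_lead(searches: pd.Series, official: pd.Series, max_weeks: int = 8):
    """Return (lead_in_weeks, correlation) for the best-aligned shift."""
    scores = {}
    for lead in range(max_weeks + 1):
        # Pair this week's searches with the official figure `lead` weeks later.
        aligned = pd.concat([searches, official.shift(-lead)], axis=1).dropna()
        scores[lead] = aligned.corr().iloc[0, 1]
    best = max(scores, key=scores.get)
    return best, scores[best]

# Usage: lead, corr = best_lead(searches, official)
```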

Dr Google

This is interesting — Google Flu Trends…

We have found a close relationship between how many people search for flu-related topics and how many people actually have flu symptoms. Of course, not every person who searches for “flu” is actually sick, but a pattern emerges when all the flu-related search queries from each state and region are added together. We compared our query counts with data from a surveillance system managed by the U.S. Centers for Disease Control and Prevention (CDC) and discovered that some search queries tend to be popular exactly when flu season is happening. By counting how often we see these search queries, we can estimate how much flu is circulating in various regions of the United States.

There’s a nice animation on the site showing how official health data lags Google searches.

The NYT has a report on this today.

Excerpt:

Tests of the new Web tool from Google.org, the company’s philanthropic unit, suggest that it may be able to detect regional outbreaks of the flu a week to 10 days before they are reported by the Centers for Disease Control and Prevention.

In early February, for example, the C.D.C. reported that the flu cases had recently spiked in the mid-Atlantic states. But Google says its search data show a spike in queries about flu symptoms two weeks before that report was released. Its new service at google.org/flutrends analyzes those searches as they come in, creating graphs and maps of the country that, ideally, will show where the flu is spreading.

The C.D.C. reports are slower because they rely on data collected and compiled from thousands of health care providers, labs and other sources. Some public health experts say the Google data could help accelerate the response of doctors, hospitals and public health officials to a nasty flu season, reducing the spread of the disease and, potentially, saving lives.

“The earlier the warning, the earlier prevention and control measures can be put in place, and this could prevent cases of influenza,” said Dr. Lyn Finelli, lead for surveillance at the influenza division of the C.D.C. From 5 to 20 percent of the nation’s population contracts the flu each year, she said, leading to roughly 36,000 deaths on average.
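The statistical idea underneath this is disarmingly simple: relate the share of flu-related queries in a region to the CDC's measured illness rates for past weeks, then read off an estimate from this week's queries. A toy sketch with made-up numbers (this is emphatically not Google's actual model):

```python
# Toy version of the Flu Trends idea: fit a simple linear relationship between
# flu-related query share and CDC-reported illness rates, then estimate current
# incidence from query data alone. Numbers are invented for illustration.
import numpy as np

# Weekly flu-related query share (%) and CDC influenza-like-illness rate (%)
# for a run of past weeks.
query_share = np.array([0.8, 1.1, 1.6, 2.4, 3.0, 2.2, 1.4])
cdc_ili_rate = np.array([1.0, 1.3, 1.9, 2.9, 3.6, 2.7, 1.7])

# Fit cdc_ili_rate as roughly a * query_share + b on the historical weeks.
a, b = np.polyfit(query_share, cdc_ili_rate, 1)

# Estimate this week's incidence from queries, a week or two before the
# corresponding official figure would be published.
this_week_queries = 2.8
print(f"estimated ILI rate: {a * this_week_queries + b:.2f}%")
```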

When ignorance is bliss

This morning’s Observer column:

Sometimes, ignorance is bliss. We saw two examples of this last week. The first came when a new search engine – Cuil (www.cuil.com) – was unveiled. The launch was an old-style PR operation. Some influential bloggers and mainstream reporters had been briefed in advance, and whispers were circulating in cyberspace that this would be Something Big. Cuil would be the ‘Google Killer’ everyone had been waiting for.

Evidence for this hypothesis was freely cited. The venture was the brainchild of ‘former Google employees’: nudge, nudge. At least one of them had been at Stanford, the university that nurtured the founders of both Yahoo and Google: wink, wink. It had indexed no fewer than 121 billion web pages, compared with Google’s measly 40 billion: Wow! Cuil had already received $33m in venture funding! Cue trumpets.

So many people were taken in by this that when cuil.com finally opened for business the site was swamped…

How big is the web?

Nobody really knows, but here is an interesting post on the Official Google Blog…

We’ve known it for a long time: the web is big. The first Google index in 1998 already had 26 million pages, and by 2000 the Google index reached the one billion mark. Over the last eight years, we’ve seen a lot of big numbers about how much content is really out there. Recently, even our search engineers stopped in awe about just how big the web is these days — when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!

How do we find all those pages? We start at a set of well-connected initial pages and follow each of their links to new pages. Then we follow the links on those new pages to even more pages and so on, until we have a huge list of links. In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates, we saw a trillion unique URLs, and the number of individual web pages out there is growing by several billion pages per day.

So how many unique pages does the web really contain? We don’t know; we don’t have time to look at them all! :-) Strictly speaking, the number of pages out there is infinite — for example, web calendars may have a “next day” link, and we could follow that link forever, each time finding a “new” page. We’re not doing that, obviously, since there would be little benefit to you. But this example shows that the size of the web really depends on your definition of what’s a useful page, and there is no exact answer…
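What Google describes there is, in essence, breadth-first traversal of the web's link graph plus de-duplication of URLs that turn out to point at the same content. A toy sketch of that process (the seed URL and page cap are arbitrary; a real crawler needs politeness rules, robots.txt handling and much sturdier parsing):

```python
# Toy crawler illustrating the process described above: start from seed pages,
# follow their links breadth-first, and skip URLs whose content has already
# been seen. Seed URL and page cap are arbitrary placeholders.
import hashlib
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, max_pages=50):
    queue = deque(seeds)
    seen_urls = set(seeds)
    seen_content = set()       # hashes of page bodies, to drop exact duplicates
    unique_pages = []

    while queue and len(unique_pages) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read()
        except Exception:
            continue
        digest = hashlib.sha1(html).hexdigest()
        if digest in seen_content:   # same content under a different URL
            continue
        seen_content.add(digest)
        unique_pages.append(url)
        # Follow every link on the page to discover new URLs.
        for href in re.findall(rb'href="([^"]+)"', html):
            link = urljoin(url, href.decode("utf-8", "ignore"))
            if link.startswith("http") and link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return unique_pages

print(crawl(["https://example.com/"]))
```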

First European Privacy Seal awarded

Here’s an interesting development — a search engine that really takes privacy seriously.

The first European privacy seal was presented today to search engine ixquick.com by the European Data Protection Supervisor Peter Hustinx on the occasion of the 30th anniversary of data protection legislation in Schleswig-Holstein.

According to the citation:

Ixquick is a meta-search engine which forwards search requests of its users to several search engines, gathers and combines their results and presents the results to the requesting users. Privacy is ensured by using several data-minimization techniques: personal data like IP addresses are deleted within 48 hours, after which they are no longer needed to prevent possible abuse of the servers. The remaining (non-personal) data are deleted within 14 days. Ixquick serves as a proxy, i.e. IP addresses of users are not disclosed to other search engines.
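In outline, the scheme the citation describes looks something like the following sketch: act as a proxy so the user's IP address never reaches the upstream engines, then apply the quoted retention rules (personal data within 48 hours, the rest within 14 days). Hypothetical code, not Ixquick's:

```python
# Hypothetical sketch of the data-minimisation scheme described above: proxy
# the query so upstream engines never see the user's IP, delete personal data
# (IP addresses) within 48 hours, and remaining log data within 14 days.
import time

IP_RETENTION = 48 * 3600          # personal data: 48 hours
LOG_RETENTION = 14 * 24 * 3600    # remaining non-personal data: 14 days

log = []   # each entry: {"ts", "ip", "query"}

def handle_search(client_ip, query, upstream_engines):
    """Forward the query as a proxy; upstream engines never see client_ip."""
    log.append({"ts": time.time(), "ip": client_ip, "query": query})
    results = []
    for search in upstream_engines:     # callables wrapping other engines
        results.extend(search(query))   # request goes out from our IP, not the user's
    return results

def purge(now=None):
    """Apply the retention rules: strip IPs after 48h, drop entries after 14 days."""
    global log
    now = now or time.time()
    log = [entry for entry in log if now - entry["ts"] < LOG_RETENTION]
    for entry in log:
        if entry["ip"] is not None and now - entry["ts"] >= IP_RETENTION:
            entry["ip"] = None          # personal data deleted, query stats kept
    return log
```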

Hmmm… Bet that won’t appeal to the British Home Office.

Thanks to Gerard for the link.