Coping with the data tsunami

Interesting article in today’s NYT about the challenges posed by the coming avalanche of experimental data.

The next generation of computer scientists has to think in terms of what could be described as Internet scale. Facebook, for example, uses more than 1 petabyte of storage space to manage its users’ 40 billion photos. (A petabyte is about 1,000 times as large as a terabyte, and could store about 500 billion pages of text.)

It was not long ago that the notion of one company having anything close to 40 billion photos would have seemed tough to fathom. Google, meanwhile, churns through 20 times that amount of information every single day just running data analysis jobs. In short order, DNA sequencing systems too will generate many petabytes of information a year.
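A quick back-of-envelope check helps make these figures concrete. The sketch below uses only the numbers quoted above and assumes decimal units (1 petabyte = 10^15 bytes). It also assumes that "20 times that amount" refers to the 1 petabyte figure, which the article does not spell out.

```python
# Back-of-envelope check of the quoted figures (decimal units assumed).
PETABYTE = 10**15  # bytes; "about 1,000 times as large as a terabyte"

facebook_storage_bytes = 1 * PETABYTE  # "more than 1 petabyte", so a lower bound
facebook_photos = 40 * 10**9           # 40 billion photos

# Implied average storage per photo (a lower bound, given "more than").
bytes_per_photo = facebook_storage_bytes / facebook_photos

# Implied size of one page of text, from "500 billion pages" per petabyte.
bytes_per_page = PETABYTE / (500 * 10**9)

# Assumption: "20 times that amount" means 20 times the 1 PB figure.
google_daily_petabytes = 20 * facebook_storage_bytes / PETABYTE

print(f"Implied storage per photo: at least {bytes_per_photo / 1000:.0f} KB")
print(f"Implied size of a page of text: {bytes_per_page / 1000:.0f} KB")
print(f"Google daily processing: about {google_daily_petabytes:.0f} PB")
```

On these assumptions the quoted numbers imply roughly 25 KB per photo and 2 KB per page of text. The per-photo figure is only a floor, since the article says "more than" 1 petabyte.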

The article makes the good point that most of today's university students will be imprinted on the rather feeble personal computer technology they currently use, and so are not attuned to the kit that will be required to do even routine science in a few years. It cites some of the usual scare stories, for example from astronomy:

The largest public database of such images available today comes from the Sloan Digital Sky Survey, which has about 80 terabytes of data, according to Mr. Connolly. A new system called the Large Synoptic Survey Telescope is set to take more detailed images of larger chunks of the sky and produce about 30 terabytes of data each night. Mr. Connolly’s graduate students have been set to work trying to figure out ways of coping with this much information.
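To put the astronomy numbers side by side, here is a rough sketch of the arithmetic. The two data volumes are the ones quoted above. The 365 observing nights per year is my own assumption and an upper bound, since no telescope observes every night.

```python
import math

SLOAN_TOTAL_TB = 80      # Sloan Digital Sky Survey archive, as quoted
LSST_TB_PER_NIGHT = 30   # projected LSST output per night, as quoted
NIGHTS_PER_YEAR = 365    # assumption: observing every night (upper bound)

# Whole nights needed for the LSST to exceed the entire Sloan archive.
nights_to_exceed_sloan = math.ceil(SLOAN_TOTAL_TB / LSST_TB_PER_NIGHT)

# Yearly output in petabytes, using 1 PB = 1,000 TB.
lsst_pb_per_year = LSST_TB_PER_NIGHT * NIGHTS_PER_YEAR / 1000

print(f"Nights for LSST to exceed Sloan: {nights_to_exceed_sloan}")
print(f"LSST output per year: up to about {lsst_pb_per_year:.1f} PB")
```

In other words, the new telescope would pass the whole of today's largest public image archive within three nights, and could produce on the order of ten petabytes a year.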