This morning’s Observer column.
The growth in computing power, networking and sensor technology now means that even routine scientific research requires practitioners to make sense of a torrent of data. Take, for example, what goes on in particle physics. Experiments at Cern's Large Hadron Collider regularly record around 23 petabytes of data a year. To put that in context, a petabyte is a million gigabytes, the equivalent of roughly 13.3 years of HDTV content. In molecular biology, a single DNA-sequencing machine can spew out 9,000 gigabytes of data annually, which a librarian friend of mine equates to 20 Libraries of Congress in a year.
In an increasing number of fields, research involves analysing these torrents of data, looking for patterns or unique events that may be significant. This kind of analysis lies way beyond the capacity of humans, so it has to be done by software, much of which has to be written by the researchers themselves. But when scientists in these fields come to publish their results, both the data and the programs on which they are based are generally hidden from view, which means that a fundamental principle of scientific research – that findings should be independently replicable – is being breached. If you can’t access the data and check the analytical software for bugs, how can you be sure that a particular result is valid?