A systems view of digital preservation

The longer I’ve been around, the more concerned I become about long-term data loss — in the archival sense. What are the chances that the digital record of our current period will still be accessible in 300 years’ time? The honest answer is that we don’t know. And my guess is that it definitely won’t be available unless we take pretty rigorous steps to ensure it. Otherwise it’s posterity be damned.

It’s a big mistake to think about this as a technical problem — to regard it as a matter of bit-rot, digital media and formats. If anything, the technical aspects are the trivial aspects of the problem. The really hard questions are institutional: how can we ensure that there are organisations in place in 300 years that will be capable of taking responsibility for keeping the archive intact, safe and accessible?

Aaron Swartz has written a really thoughtful blog post about this in which he addresses both the technical and institutional aspects. About the latter, he has this to say:

Recall that we have at least three sites in three political jurisdictions. Each site should be operated by an independent organization in that political jurisdiction. Each board should be governed by respected community members with an interest in preservation. Each board should have at least five seats and move quickly to fill any vacancies. An engineer would supervise the systems, an executive director would supervise the engineer, the board would supervise the executive director, and the public would supervise the board.

There are some basic fixed costs for operating such a system. One should calculate the high-end estimate for such costs along with high-end estimates of their growth rate and low-end estimates of the riskless interest rate and set up an endowment in that amount. The endowment would be distributed evenly to each board who would invest it in riskless securities (probably in banks whose deposits are insured by their political systems).

Whenever someone wants to add something to the collection, you use the same procedure to figure out what to charge them, calculating the high-end cost of maintaining that much more data, and add that fee to the endowments (split evenly as before).

What would the rough cost of such a system be? Perhaps the board and other basic administrative functions would cost $100,000 a year, and the same for an executive director and an engineer. That would be $300,000 a year. Assuming a riskless real interest rate of 1%, a perpetuity for that amount would cost $30 million. Thus the cost for three such institutions would be around $100 million. Expensive, but not unmanageable. (For comparison, the Internet Archive has an annual budget of $10-15M, so this whole project could be funded until the end of time for about what 6-10 years of the Archive costs.)

Storage costs are trickier because the cost of storage and so on falls so rapidly, but a very conservative estimate would be around $2000 a gigabyte. Again, expensive but not unmanageable. For the price of a laptop, you could have a gigabyte of data preserved for perpetuity.

These are both very high-end estimates. I imagine that were someone to try operating such a system it would quickly become apparent that it could be done for much less. Indeed, I suspect a Mad Archivist could set up such a system using only hobbyist levels of money. You can recruit board members in your free time, setting up the paperwork would be a little annoying but not too expensive, and to get started you’d just need three servers. (I’ll volunteer to write the Python code.) You could then build up the endowment through the interest money left over after your lower-than-expected annual costs. (If annual interest payments ever got truly excessive, the money could go to reducing the accession costs for new material.)

Any Mad Archivists around?

Worth reading in full.
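
The endowment arithmetic in the quoted passage is easy to check. Here is a rough sketch of it in Python, using the standard perpetuity formula (endowment = annual cost / real interest rate). The figures are the ones quoted above; the function and variable names are my own, and the implied yearly storage cost is something I have derived from the $2000-a-gigabyte figure by assuming the same 1% rate, not a number from the post.

```python
from fractions import Fraction


def perpetuity_endowment(annual_cost, real_rate):
    """Endowment whose interest at real_rate covers annual_cost indefinitely."""
    if real_rate <= 0:
        raise ValueError("real_rate must be positive")
    # Fractions keep the arithmetic exact (no floating-point rounding).
    return Fraction(annual_cost) / Fraction(real_rate)


# Figures taken from the quoted passage.
REAL_RATE = Fraction(1, 100)          # riskless real interest rate of 1%
ANNUAL_COST_PER_SITE = 3 * 100_000    # board/admin + executive director + engineer
SITES = 3                             # three sites in three jurisdictions

per_site = perpetuity_endowment(ANNUAL_COST_PER_SITE, REAL_RATE)
total = SITES * per_site

# Assumption (mine): if $2000 endows one gigabyte forever at the same rate,
# the implied budget is this many dollars per gigabyte per year.
implied_storage_cost_per_gb_year = 2000 * REAL_RATE

print(f"Endowment per site: ${int(per_site):,}")
print(f"Endowment for {SITES} sites: ${int(total):,}")
print(f"Implied storage budget: ${int(implied_storage_cost_per_gb_year)} per GB per year")
```

The three-site total comes out a little under the "around $100 million" in the post, which is consistent with rounding up. The result is very sensitive to the interest rate: halving the rate doubles the endowment required.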

LATER: Dan Gillmor has been attending a symposium at the Library of Congress about preserving user-generated content, and has written a thoughtful piece on Salon.com about it.

The reason for libraries and archives like the Library of Congress is simple: We need a record of who we are and what we’ve said in the public sphere. We build on what we’ve learned; without understanding the past we can’t help but screw up our future.

It was easier for these archiving institutions when media consisted of a relatively small number of publications and, more recently, broadcasts. They’ve always had to make choices, but the volume of digital material is now so enormous, and expanding at a staggering rate, that it won’t be feasible, if it ever really was, for institutions like this to find, much less collect, all the relevant data.

Meanwhile, those of us creating our own media are wondering what will happen to it. We already know we can’t fully rely on technology companies to preserve our data when we create it on their sites. Just keeping backups of what we create can be difficult enough. Ensuring that it’ll remain in the public sphere — assuming we want it to remain there — is practically impossible.

Dan links to another thoughtful piece, this time by Dave Winer. Like Aaron Swartz, Dave is concerned not just with the technological aspects of the problem, but also with the institutional side. Here are his bullet points:

1. I want my content to be just like most of the rest of the content on the net. That way any tools created to preserve other people’s stuff will apply to mine.

2. We need long-lived organizations to take part in a system we create to allow people to future-safe their content. Examples include major universities, the US government, insurance companies. The last place we should turn is the tech industry, where entities are decidedly not long-lived. This is probably not a domain for entrepreneurship.

3. If you can afford to pay to future-safe your content, you should. An endowment is the result, which generates annuities, that keeps the archive running.

4. Rather than converting content, it would be better if it was initially created in future-safe form. That way the professor’s archive would already be preserved, from the moment he or she presses Save.

5. The format must be factored for simplicity. Our descendants are going to have to understand it. Let’s not embarrass ourselves, or cause them to give up.

6. The format should probably be static HTML.

7. ??