I have been asked for advice about long-term storage of documents. I decided to blog about it because my thoughts may be useful to others, and because if I get something wrong then surely people will correct me. ;)
Many organizations are looking at using computers for storing all documents. This saves the cost of storing paper, so the promise of the paperless office is being fulfilled. There are some potential issues about whether a signature in a PDF file that’s scanned from paper is valid, but I guess it’s the same as a signature on a fax, and everyone seems to accept faxed contracts.
The technical problem is how to store data reliably over the long term, because every modern storage medium degrades over time. Anything less than engraving a message in stone, gold, or platinum and burying it will suffer some data loss eventually.
If the documents that need to be archived have no special requirements, and if you have a good backup system in place (testing of backup media, off-site storage in case of disaster, multiple sets of hardware that can read the backup media in case of hardware failure, etc.), then you might be able to just store the documents on a server and include them in the backup plan. The regular backups should take care of replacing media over the long term. If, however, there is a significant amount of data, or the data has confidentiality requirements that preclude having it all online all the time, then you need a separate infrastructure for such storage.
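Testing backup media is easier if there is something to test against. One common approach (my suggestion here, not a feature of any particular backup product) is to record a checksum for every archived file and re-check the checksums whenever media is tested or replaced. Below is a minimal sketch in Python. The function names and the idea of keeping the manifest as a simple dictionary are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose content differs."""
    problems = []
    for rel_path, expected in manifest.items():
        candidate = root / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(rel_path)
    return problems
```

The manifest would be built once when documents enter the archive and stored with every backup copy. Running `verify` against a restored copy then reports which files, if any, have degraded or gone missing. It only detects damage; repairing it still needs a second good copy.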
Regular backup systems have to deal with files being deleted from storage and files that have their contents changed. In a document archiving system no file will ever be changed once it has been created, and no file will be deleted. This allows some simplifications to the backup strategy. For example, if you have multiple terabytes of documents backed up on tape and stored off-site, you could use CD-ROMs or other media for storing recently added documents. It would be very easy for an employee to grab a couple of CDs before rushing out of a burning building, but grabbing a set of tapes (or the correct tape from a large set) may not be possible.
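Because nothing is ever changed or deleted, working out what belongs on the small "recent" media is just a set difference between the archive as of the last full tape backup and the archive now. Here is a rough Python sketch of that idea. It assumes each state is available as a dictionary mapping a relative path to a content checksum; the function name, file names, and digests are made up for illustration.

```python
def plan_incremental(previous: dict, current: dict):
    """Split an archive's state into new files and integrity violations.

    Both arguments map relative path -> content digest.
    """
    # Files added since the last full backup: these go on the small media.
    new_files = sorted(path for path in current if path not in previous)
    # In a write-once archive these two lists should always be empty.
    missing = sorted(path for path in previous if path not in current)
    modified = sorted(
        path for path in previous
        if path in current and current[path] != previous[path]
    )
    return new_files, missing, modified


# Hypothetical example data.
last_full_backup = {
    "2008/contract-001.pdf": "aa11",
    "2008/invoice-007.pdf": "bb22",
}
archive_now = {
    "2008/contract-001.pdf": "aa11",
    "2008/invoice-007.pdf": "bb22",
    "2008/invoice-008.pdf": "cc33",
}

new_files, missing, modified = plan_incremental(last_full_backup, archive_now)
print(new_files, missing, modified)
# prints: ['2008/invoice-008.pdf'] [] []
```

A non-empty `missing` or `modified` list is not something to back up; in an archive where documents are meant to be immutable it is a warning that something has gone wrong.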
It would be possible to use a tape library system as the primary storage for documents. For a large organization implementing this a few years ago, that might have been a good option. Nowadays storage is increasingly large and cheap: terabytes are available in desktop PCs and hundreds of terabytes are available for server storage. So having the primary document store on a server with a decent amount of space, and then making tape backups for storage in secure locations, seems viable.
One thing to note about such document storage is that having everything on a server allows a much larger amount of data to be accessed and copied more easily than on paper. Sorting through a billion paper documents and copying the thousand most useful ones would be a difficult task for someone who was involved in industrial espionage. Finding the most useful files when they are indexed on a server should be quite easy and copying a few thousand is also easy (one thousand scanned documents of medium size should fit on a USB memory stick – much smaller than a few reams of copied documents).
Finally, documents have to be archived in publicly documented file formats that can be easily read in the future. The PDF specification is well known and there are multiple programs that can display such files; another good option for scanned documents is JPEG. Proprietary formats such as MS-Word should never be used: you never know whether you will be able to read them in four years, let alone the seven years for which many documents must be retained or the 20-30 years required for some.
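A policy like this can be partly enforced at the point where files enter the archive. The sketch below, in Python, looks at the first bytes of a file: PDF files begin with `%PDF-` and JPEG files begin with the bytes `FF D8 FF`. The function name and the choice to reject everything else are assumptions for illustration, and this is only a quick sanity check, not proof that the file is a valid document.

```python
from pathlib import Path
from typing import Optional

# Leading bytes ("magic numbers") of the two formats named above.
SIGNATURES = {
    "pdf": b"%PDF-",
    "jpeg": b"\xff\xd8\xff",
}


def detect_format(path: Path) -> Optional[str]:
    """Return "pdf" or "jpeg" based on the file's first bytes, else None."""
    with path.open("rb") as handle:
        head = handle.read(8)
    for name, magic in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None


def acceptable_for_archive(path: Path) -> bool:
    """Accept only files that look like one of the open formats."""
    return detect_format(path) is not None
```

Checking content rather than the file extension means a word processor file renamed to `.pdf` is still caught before it ends up in storage for decades.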