Sunday, January 13, 2013

On the backup of data

The rule of thumb is that you need two independent backups: in two different places and on two different media types. Different places protect your backup from thieves, fires, floods, etc., and different media types (external hard disks, cloud-based backup, DVDs, tapes, etc.) protect you from a faulty series of disks or flawed tape archiving software.

So what data requires backup in Bioinformatics? In my opinion you should only back up the original data, the scripts, and the documentation, so that you keep a minimal backup from which you can reconstruct all steps from the original data to the final analysis results. The idea is to protect the work that consumes expensive man-hours and down-prioritize the work that consumes cheap CPU-hours - with occasional exceptions.

My guess is that most professionals in Bioinformatics fail to heed this rule of thumb and instead rely on redundant copies of important scripts and data sets sitting on different servers. One reason is that the amount of data is huge, which makes it very expensive to keep a full backup - especially of bulky sequencing data. In fact, it is very soon going to be cheaper to keep the biological samples and re-sequence in case of data loss. Another reason is that most colleagues of mine (and I) keep data, results, and documentation in strict hierarchies using the UNIX file tree, which is a logical arrangement that allows a degree of self-documentation and efficient navigation of your projects. However, this may lead to a file tree for a project where data, documentation, and scripts are mixed together.

To keep backups efficient, I wrote a program, bagop, that flags specific files and directories in the directory tree. All flagged files and directories are synchronized to a local or remote backup directory using rsync, from where you can use standard backup software to execute and rotate the backup. The basic functionality in bagop mimics Subversion: you first add files and directories to the backup, and then you commit them. As with Subversion, you should work in cycles of completing a single task and then committing your work. You can keep track of the backup status to see whether files are OK, have changed, or have gone missing.
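To make the status tracking idea concrete, here is a minimal Python sketch. It is not how bagop is implemented, and the names `commit`, `status`, and `demo` are hypothetical, chosen only for illustration. It assumes that a "commit" records a checksum for each flagged file and that the status check compares the files on disk against those records.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large files do not fill memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def commit(flagged):
    """Record a checksum for every flagged file (hypothetical 'commit' step)."""
    return {path: checksum(path) for path in flagged}


def status(manifest):
    """Classify each committed file as 'ok', 'changed', or 'missing'."""
    report = {}
    for path, recorded in manifest.items():
        if not path.exists():
            report[path] = "missing"
        elif checksum(path) != recorded:
            report[path] = "changed"
        else:
            report[path] = "ok"
    return report


def demo():
    """Flag three files, commit, then edit one and delete another."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        files = {name: root / name for name in ("reads.fq", "align.sh", "README")}
        for name, path in files.items():
            path.write_text(f"content of {name}\n")
        manifest = commit(list(files.values()))
        files["align.sh"].write_text("edited after commit\n")
        files["README"].unlink()
        return {path.name: state for path, state in status(manifest).items()}


if __name__ == "__main__":
    for name, state in demo().items():
        print(f"{state:8s} {name}")
```

In this sketch the untouched file is reported as ok, the edited script as changed, and the deleted file as missing. The actual copying to a backup directory is left out, since the post delegates that to rsync.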
 
I managed to convince our sysadmin, Ole Tange, that this project was a good idea. The result is now freely available on GitHub and is going to be rolled out to the users at the Danish National High-throughput DNA Sequencing Center.

Of course, using bagop for backup does not necessarily lead to backup at two locations using two media types.


