Tuesday, November 24, 2015


Friday I announced BioDSL (pronounced biodiesel) on Biostars and Seqanswers. BioDSL was originally meant as a reimplementation of Biopieces, which was started back in 2007. However, where Biopieces is a set of command line tools that have proved excellent and are widely used, BioDSL is a Domain Specific Language for bioinformatic analysis (as opposed to a General Purpose Language like Perl or Python). With BioDSL it is easy to create workflows consisting of several pipelines, each made up of several steps that perform one particular task:

#!/usr/bin/env ruby

require 'BioDSL'

p1 = BD.new.read_fasta(input: "test.fna")
p2 = BD.new.grab(keys: :SEQ, select: "ATCG").
     plot_histogram(key: :SEQ_LEN, terminal: :png, output: "select.png")
p3 = BD.new.grab(keys: :SEQ, reject: "ATCG").
     plot_histogram(key: :SEQ_LEN, terminal: :png, output: "reject.png")
p4 = p1 + p3

(p1 + p2).write_fasta(output: "select.fna").run
p4.write_fasta(output: "reject.fna").run
The above example does essentially the same as the two Biopieces commands below: read in a bunch of FASTA entries, grab the entries whose sequences contain (or, in the second command, do not contain) the ATCG pattern, plot a histogram of the sequence length distribution to a file, and save the sequences to an output FASTA file.

$ read_fasta -i test.fna |
  grab -k SEQ -p ATCG |
  plot_histogram -k SEQ_LEN -t png -o select.png |
  write_fasta -o select.fna -x

$ read_fasta -i test.fna |
  grab -i -k SEQ -p ATCG |
  plot_histogram -k SEQ_LEN -t png -o reject.png |
  write_fasta -o reject.fna
BioDSL pipelines are executed in a two-step manner: first the pipeline is assembled, and only then is it run. Thus it is possible to check command options prior to the execution of the pipeline - a feature sorely missing in Biopieces, where you risked waiting an hour for a pipeline to finish only to learn that some file was missing.
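The assemble-then-run pattern can be illustrated with a toy Ruby class (this is a sketch of the general idea, not BioDSL's actual internals - the class and its checks are hypothetical):

```ruby
# A toy illustration of the assemble-then-run pattern: options are checked
# when a step is added, so a bad option fails immediately instead of an
# hour into a run.
class ToyPipeline
  Step = Struct.new(:name, :options, :block)

  def initialize
    @steps = []
  end

  # Adding a step validates its options right away and fails fast.
  def add(name, options = {}, &block)
    if options.key?(:input) && !File.exist?(options[:input])
      raise ArgumentError, "#{name}: no such file: #{options[:input]}"
    end
    @steps << Step.new(name, options, block)
    self # return self to allow method chaining, as in BioDSL
  end

  # Only #run actually executes the assembled steps.
  def run(data = nil)
    @steps.reduce(data) { |acc, step| step.block.call(acc, step.options) }
  end
end

pipeline = ToyPipeline.new
pipeline.add(:upcase) { |s, _| s.upcase }.add(:reverse) { |s, _| s.reverse }
pipeline.run("atcg") # => "GCTA"
```

The point is that `add(:read, input: "missing.fna")` raises at assembly time, before any data has flowed through the pipeline.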

BioDSL also offers the option to parallelize the execution of pipelines so that several samples can be analyzed in parallel. This is in contrast to Biopieces, where parallelism came only from each step in a pipeline running as its own process.

BioDSL is written in Ruby (with speed-critical parts written in inline C), so the code base is much cleaner than that of Biopieces. The code adheres to Ruby's unofficial style guide, is patrolled by RuboCop, and is well covered by unit tests.

Additional features:

  • Detailed logfile with run statistics.
  • History file allowing simple rerunning of any pipeline.
  • Progress output during execution.
  • Generation of HTML reports with stats and plots.
  • Option to run on the command line.
  • Option to run in a Ruby interactive shell.

Friday, January 25, 2013

On the cover of Science

I got the cover of Science this week.

We ventured to Indooroopilly Island back in January 2008, where there is a huge colony of fruit bats counting several hundred thousand individuals at this time of year. You can wade across to the island at low tide - which was at 5:32 AM - from the grounds of the Indooroopilly Golf Club; however, the going was tough with black, stinking mangrove mud, dense mangroves, heaps of spider webs, and a plethora of mosquitoes. We finally made it there, but it was hard to move around without disturbing the bats. Well, one good thing about disturbing the bats was the opportunity to catch a few shots of them in flight.

The original photo is here.

Australia, Queensland, Brisbane, Indooroopilly Island.

Sunday, January 13, 2013

On the backup of data

The rule of thumb is that you need two independent backups: two different places and two different media types. Different places protect your backup from thieves, fires, floods, etc., and different media types - external hard disks, cloud-based backup, DVDs, tapes, etc. - protect you from a faulty series of disks or flawed tape-archiving software.

So what data requires backup in bioinformatics? In my opinion you should only back up the original data, the scripts, and the documentation, so that you keep a minimal backup from which you can reconstruct all steps from the original data to the final analysis results. The idea is to protect the work that consumes expensive man-hours and down-prioritize the work that consumes cheap CPU-hours - with occasional exceptions.

My guess is that most professionals in bioinformatics fail to heed this rule of thumb and instead rely on redundant copies of important scripts and data sets sitting on different servers. One reason is that the amount of data is huge, which makes it very expensive to keep a full backup - especially of bulky sequencing data. In fact, it will very soon be cheaper to keep the biological samples and re-sequence in case of data loss. Another reason is that most colleagues of mine (and I myself) keep data, results, and documentation in strict hierarchies using the UNIX file tree, which is a logical approach that allows a degree of self-documentation and efficient navigation of your projects. However, this may lead to a project file tree containing a mixture of data, documentation, and scripts.

In order to keep an efficient backup I wrote a program, bagop, to flag specific files and directories in the directory tree. All flagged files and directories are synchronized to a local or remote backup directory using rsync, from where you can use standard backup software to execute and rotate the backup. The basic functionality of bagop mimics Subversion: you first add files and directories to the backup, and then you commit them. As with Subversion, you should work in cycles of completing a single task and then committing your work. You can keep track of the backup status to see if files are OK, have changed, or have gone missing.
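The add/commit/status cycle can be sketched in a few lines of Ruby. This is a minimal illustration of the idea, not bagop's actual implementation - the class name and JSON manifest format are assumptions for the example:

```ruby
require 'digest'
require 'json'

# A minimal sketch of svn-style backup status tracking: a manifest maps
# each flagged file to a checksum recorded at commit time, so a later
# status check can report files as OK, changed, or missing.
class BackupManifest
  def initialize(manifest_path)
    @path = manifest_path
    @sums = File.exist?(@path) ? JSON.parse(File.read(@path)) : {}
  end

  # The "add" step: flag a file for backup (no checksum recorded yet).
  def add(file)
    @sums[file] ||= nil
  end

  # The "commit" step: record current checksums for all flagged files.
  def commit
    @sums.each_key { |f| @sums[f] = Digest::MD5.file(f).hexdigest if File.exist?(f) }
    File.write(@path, JSON.generate(@sums))
  end

  # Report the status of every flagged file.
  def status
    @sums.map do |file, sum|
      state = if !File.exist?(file)
                :missing
              elsif Digest::MD5.file(file).hexdigest == sum
                :ok
              else
                :changed
              end
      [file, state]
    end.to_h
  end
end
```

A real tool would then hand the list of committed files to rsync for the actual synchronization.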
I managed to convince our sysadmin +Ole Tange that this project was a good idea, and the result is now freely available on GitHub and is going to be rolled out to the users at the Danish National High-throughput DNA Sequencing Center.

Of course, using bagop for backup does not necessarily lead to backup at two locations using two media types.

Wednesday, December 12, 2012

Taxonomic Tree Viewer

I wanted an overview of taxonomic data from multiple 16S rRNA amplicon sequencing experiments. I took this as an opportunity to brush up on my web development skills. Web development is truly a multi-discipline undertaking involving client-side scripting with JavaScript/jQuery, layout using HTML and CSS, data transfer with AJAX and JSON, and running all this on an appropriate web server using Rails or a Rails-based CMS.

Here I present the client-side part of my Taxonomic Tree Viewer, loaded with the entire Greengenes v12_10 taxonomy (>1M entries). Loading, folding/unfolding, and searching are reasonably fast. Currently the data is hardcoded in the tree; the next step will be to serve the data along with metadata using AJAX.
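The underlying data structure is simple: a tree built from lineage strings. A minimal Ruby sketch (the function names are hypothetical; the semicolon-delimited lineage format follows the Greengenes convention):

```ruby
# Build a taxonomy tree as nested hashes from Greengenes-style lineage
# strings such as "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria".
def build_taxonomy(lineages)
  tree = {}
  lineages.each do |lineage|
    node = tree
    lineage.split(';').map(&:strip).each do |taxon|
      node = (node[taxon] ||= {}) # descend, creating nodes as needed
    end
  end
  tree
end

# Depth-first search for a taxon name anywhere in the tree; returns the
# subtree rooted at that taxon, or nil if absent.
def find_taxon(tree, name)
  tree.each do |taxon, children|
    return children if taxon == name
    hit = find_taxon(children, name)
    return hit if hit
  end
  nil
end
```

Serving such a structure as JSON per node is what makes lazy AJAX loading of subtrees straightforward.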

You can play with the taxonomic tree here.

Tuesday, October 9, 2012

Finding similar sequences - FindSim

Finding sequences similar to a set of query sequences in a database is a common task for which many tools, such as BLAST and BLAT, have been developed. For taxonomic classification it is useful to have a search tool that locates the top hits efficiently, and in the case of searching a 16S ribosomal DNA database, sensitivity can be sacrificed for speed while still locating the relevant sequences. The SimRank tool (by Niels Larsen) breaks down a query sequence into overlapping k-mers (7-mers) - "oligofying" the sequence - which are used to search a database of indexed sequences to locate the sequences that share the most unique k-mers. A specified number of hits per query sequence is output according to an identity score, calculated as the number of unique shared k-mers over the minimum number of unique k-mers in either the query sequence or the database sequence. The resulting identity score is a rough under-estimate of the per-base identity, so that an identity of 70% based on k-mers corresponds to roughly 90% based on nucleotides.
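The identity score described above can be expressed in a few lines of Ruby. This is a sketch of the scoring formula only (SimRank uses k = 7; the tiny example uses a shorter k so short strings have k-mers at all):

```ruby
require 'set'

# Collect the set of unique k-mers in a sequence.
def kmers(seq, k)
  (0..seq.length - k).map { |i| seq[i, k] }.to_set
end

# SimRank-style identity: unique shared k-mers over the smaller unique
# k-mer count of the two sequences, as a percentage.
def kmer_identity(query, subject, k: 7)
  q = kmers(query, k)
  s = kmers(subject, k)
  shared = (q & s).size
  100.0 * shared / [q.size, s.size].min
end
```

Identical sequences score 100%, and because a single mismatch destroys up to k overlapping k-mers, the score drops much faster than the per-base identity - hence the under-estimate noted above.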

I have implemented an upgraded version of the SimRank algorithm that allows multiple query sequences to be processed at one time, which dramatically improves performance: 1,000 full-length rRNA sequences searched against 400,000 sequences completes in 3.5 minutes on a laptop using less than 1 GB of memory. Moreover, the algorithm scales almost linearly. The FindSim algorithm works by first pre-processing all query sequences into a query index: each query sequence is oligofied, and for every oligo the query sequences containing that particular oligo are recorded. Holding this query index in memory, the database of reference sequences is searched by oligofying each database sequence, determining the shared unique k-mers, and recording the scores. Finally, the scores are post-processed so that for each query sequence either a given number of database hits is output, best first, or a variable number of database hits is output within an identity range.
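The query index is what makes multi-query search fast: scanning one database sequence scores all queries in a single pass. A sketch of the idea (function names are illustrative, not FindSim's actual code):

```ruby
require 'set'

# Pre-processing: an inverted index mapping each k-mer (oligo) to the set
# of query sequences containing it.
def build_query_index(queries, k)
  index = Hash.new { |h, kmer| h[kmer] = Set.new }
  queries.each do |name, seq|
    (0..seq.length - k).each { |i| index[seq[i, k]] << name }
  end
  index
end

# Search: count the unique k-mers each query shares with one database
# sequence, updating all query scores in a single scan.
def score_against(index, db_seq, k)
  scores = Hash.new(0)
  seen = Set.new
  (0..db_seq.length - k).each do |i|
    kmer = db_seq[i, k]
    next unless seen.add?(kmer) # count each unique k-mer only once
    index[kmer].each { |q| scores[q] += 1 } if index.key?(kmer)
  end
  scores
end
```

Because each database sequence is streamed through once regardless of the number of queries, the total work grows roughly linearly with database size, matching the near-linear scaling mentioned above.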

FindSim schematic
The FindSim algorithm was implemented as a Biopiece: findsim_seq.

Monday, October 8, 2012

Denoising Amplicons

Determining bacterial diversity can be done by amplifying and sequencing 16S ribosomal DNA genes, yielding amplicons. In recent years pyrosequencing has been the workhorse of amplicon sequencing, because the read length of this deep sequencing technology can span one or more of the variable regions in the 16S ribosomal DNA gene, which is necessary to get the resolution to identify bacteria at the species level. Unfortunately, pyrosequencing is prone to homopolymer errors, where runs of identical nucleotides are miscalled. Several computer programs have been developed to rectify the problem: AmpliconNoise applies an approximate likelihood method using empirically derived error distributions to remove noise. Denoiser is a faster algorithm that uses frequency-based heuristics rather than statistical modeling to cluster reads. Recently, Acacia has been released, which is much less demanding of computer resources than AmpliconNoise and Denoiser. Acacia avoids all-against-all alignments by aligning each read to dynamically updated cluster consensus sequences, and the alignment algorithm is made more efficient using heuristics that only consider homopolymer over- and under-calls.

For some reason none of these programs utilize the quality scores of the sequences. I have devised an algorithm where all sequences are clustered based on sequence similarity, and for each cluster a consensus sequence is determined from the sequence distribution while taking quality scores into account. Thus, a column in the cluster alignment showing identical residues, but with sub-quality scores, may still be included in the consensus, while in a column where only a fraction of the residues have sub-quality scores those residues may be filtered out, and the consensus determined from the remaining residues. A number of different filters can be used with different parameters.

Consensus alignment
The consensus sequence is the second-to-last line. The last line illustrates the mean quality score of each column: '0' = 75-100%, 'o' = 50-75%, '.' = 25-50%, ' ' = 0-25%. Lowercase residues have failed the filter rules and are omitted from the consensus.

Test results show that sequence clusters can indeed be improved upon, leading to better consensus sequences. It is possible to iterate the denoising so that the first round uses clustering at 100% similarity and loose filtering, followed by a second round using clustering at 97% similarity and stricter filtering. In general, the consensus sequences are shorter than the sequences in the cluster, because any detected errors in the alignment result in gaps in the consensus, but identifying reference sequences in a reference database is not a problem using tools such as BLAST or findsim_seq.
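The quality-aware consensus idea can be sketched in Ruby. This is an illustration of one possible filter, not the actual denoise_seq implementation - the quality cutoff and the unanimity rule are assumptions for the example:

```ruby
# For each alignment column: if the column is unanimous, keep all residues
# even at low quality; otherwise drop residues below the quality cutoff.
# The consensus is the most frequent remaining residue, or a gap if none
# survive the filter.
def consensus(reads, quals, min_qual: 20)
  length = reads.map(&:length).max
  (0...length).map do |col|
    residues = reads.each_index
                    .map { |r| [reads[r][col], quals[r][col]] }
                    .reject { |base, _| base.nil? } # reads may differ in length
    unanimous = residues.map(&:first).uniq.size == 1
    kept = unanimous ? residues : residues.select { |_, q| q >= min_qual }
    next '-' if kept.empty?
    kept.group_by(&:first).max_by { |_, group| group.size }.first
  end.join
end
```

With three reads "ATCG", "ATCG", "ATTG", where the lone T in column three has quality 10, the low-quality disagreeing residue is filtered out and the consensus is "ATCG".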

This algorithm has been implemented as a Biopiece: denoise_seq.

Wednesday, March 24, 2010

Genome sequencing of Bacteria

I have shifted from eukaryotic to prokaryotic research, and I am now working in the Molecular Microbiology Ecology Group at the University of Copenhagen. The MME group is part of The Copenhagen High-Throughput Sequencing Center and houses 2 Roche FLX instruments and 2 Illumina Genome Analyzers. We will be sequencing bacterial genomes and amplicons (part of the 16S ribosomal DNA) for community analysis, but also metagenomes. For now I have been working on the assembly and annotation of the sequenced bacterial genomes.

Assembly is the process of piecing together sequence reads into longer contiguous regions (contigs). A number of assembly software packages exist, but several of them are left-overs from the shotgun sequencing days (such as the Celera Assembler, Cap3, Arachne, Phrap, etc.) and do not cope well with Next Generation Sequencing data. A number of new assemblers look very promising, but there are currently no open-source assemblers that can deal with mixed data types, i.e. Solexa and 454 reads. While Mira can theoretically handle mixed data types, in reality the memory usage becomes enormous and the result suffers if the coverage is too high. The alternative is to shred reads into Solexa sizes and use Velvet. However, it appears quite illogical to reduce sequence sizes before increasing them! Also, none of the assemblers use comparative genomics, which should be one obvious path to guided assembly.

So, no matter which assembler you use, the result is a set of contigs and not a single sequence, because of repetitive elements in the genome. The gaps between contigs can be closed in the laboratory with tedious PCR (a number of different protocols for gap closing exist), or, if you have a closely related genome sequence, it can be used for scaffolding - placing the contigs in the correct order and orientation.

After assembly the next task is to annotate the genome, and luckily this process has been made easy with the RAST annotation service. The resulting annotation is of high quality and can be viewed in the RAST annotation viewer. However, if you want to view the sequence data along with the annotation, you want a real genome browser like the UCSC Genome Browser, Ensembl, or GBrowse. Unfortunately, these browsers are not optimal for browsing bacterial draft genomes, and it is very difficult to upload custom genomes and annotation tracks while at the same time controlling which users are allowed access to what data. Therefore I decided to write my own genome browser, which I call the Biopieces Genome Browser. It is described in more detail here, but a screen shot can be seen below: