Tuesday, November 24, 2015

BioDSL

On Friday I announced BioDSL (pronounced biodiesel) on Biostars and Seqanswers. BioDSL was originally meant to be a reimplementation of Biopieces that was started back in 2007. However, where Biopieces is a set of command line tools that have proved excellent and are widely used, BioDSL is a Domain Specific Language for bioinformatic analysis (as opposed to a General Purpose Language like Perl or Python). With BioDSL it is easy to create workflows that consist of several pipelines, each made up of several steps that each do one particular task:

#!/usr/bin/env ruby

require 'BioDSL'

p1 = BD.new.read_fasta(input: "test.fna")
p2 = BD.new.grab(keys: :SEQ, select: "ATCG").
     plot_histogram(key: :SEQ_LEN, terminal: :png, output: "select.png")
p3 = BD.new.grab(keys: :SEQ, reject: "ATCG").
     plot_histogram(key: :SEQ_LEN, terminal: :png, output: "reject.png")
p4 = p1 + p3

(p1 + p2).write_fasta(output: "select.fna").run
p4.write_fasta(output: "reject.fna").run

The example above does essentially the same as the two Biopieces commands below: it reads a set of FASTA entries, grabs those whose sequences contain the ATCG pattern (or, in the second command, do not contain it), plots a histogram of the sequence length distribution to a file, and saves the sequences to an output FASTA file.

$ read_fasta -i test.fna |
  grab -k SEQ -p ATCG |
  plot_histogram -k SEQ_LEN -t png -o select.png |
  write_fasta -o select.fna -x

$ read_fasta -i test.fna |
  grab -i -k SEQ -p ATCG |
  plot_histogram -k SEQ_LEN -t png -o reject.png |
  write_fasta -o reject.fna

BioDSL pipelines are executed in two steps: first the pipeline is assembled, and only then is it run. This makes it possible to check command options before the pipeline executes, a feature sorely missing in Biopieces, where you risked waiting an hour for a pipeline to finish only to learn that some file was missing.
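
To illustrate the assemble-then-run idea, here is a minimal sketch in plain Ruby. It is not BioDSL's implementation: the TinyPipeline class and its read_lines and grab methods are hypothetical names made up for this example. Each method checks its options right away and only stores the work to be done; nothing is read or processed until run is called.

```ruby
# Toy illustration only: TinyPipeline is a hypothetical class, not BioDSL's API.
class TinyPipeline
  def initialize
    @steps = []
  end

  # Assembly phase: validate the option now, store the work for later.
  def read_lines(input:)
    raise ArgumentError, "no such file: #{input}" unless File.file?(input)

    @steps << ->(_records) { File.readlines(input, chomp: true) }
    self
  end

  # Assembly phase: validate the option now, store the work for later.
  def grab(pattern:)
    raise ArgumentError, 'pattern must be a String' unless pattern.is_a?(String)

    @steps << ->(records) { records.select { |r| r.include?(pattern) } }
    self
  end

  # Run phase: only now is any data read or processed.
  def run
    @steps.reduce([]) { |records, step| step.call(records) }
  end
end

# A missing input file is rejected while the pipeline is being assembled,
# before any step has run.
begin
  TinyPipeline.new.read_lines(input: 'missing.fna').grab(pattern: 'ATCG')
rescue ArgumentError => e
  warn "rejected before running: #{e.message}"
end
```

Because every method returns self, the calls chain in the same way as in the BioDSL example further up, and a bad option fails at assembly time rather than an hour into the run.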

BioDSL also offers the option to parallelize the execution of pipelines so that several samples can be analyzed in parallel. This is in contrast to Biopieces, where each step of a pipeline ran as its own process.
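
The per-sample idea can be sketched in plain Ruby as follows. This is only an assumption about the general shape, not how BioDSL does it: the analyze method and the sample data are invented for the example, and threads are used for brevity (under MRI Ruby, CPU-bound work would need separate processes to actually run in parallel).

```ruby
# Hypothetical stand-in for one sample's pipeline: count the sequences
# that contain a pattern. Not BioDSL code.
def analyze(sample_name, sequences, pattern)
  [sample_name, sequences.count { |seq| seq.include?(pattern) }]
end

# Made-up sample data for illustration.
samples = {
  'sample_a' => %w[ATCGAA GGGG TTATCG],
  'sample_b' => %w[CCCC ATCG]
}

# One thread per sample; each job is independent of the others.
threads = samples.map do |name, seqs|
  Thread.new { analyze(name, seqs, 'ATCG') }
end

# Thread#value waits for the thread and returns the block's result.
counts = threads.map(&:value).to_h
puts counts.inspect
```

The unit of parallelism here is the whole sample, not the individual step, which is the contrast with Biopieces drawn above.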

BioDSL is written entirely in Ruby (with speed-critical parts written in inline C), so the code base is much cleaner than that of Biopieces. The code adheres to Ruby's unofficial style guide, is patrolled by Rubocop, and is well covered by unit tests.

Additional features:

  • Detailed logfile with run statistics.
  • History file allowing simple rerunning of any pipeline.
  • Progress output during execution.
  • Generation of HTML reports with stats and plots.
  • Option to run on the command line.
  • Option to run in a Ruby interactive shell.