Thursday, July 31, 2008

Biopieces: bioinformatic tools easily used and created

Biopieces is a project I have been working on as a side project for almost exactly one year. The idea is to write a set of tools that can be piped together on the command line in a Unix environment, so that the output of one tool serves as input to the next. This is by no means a new idea; it is in fact native to the Unix environment, where tools such as cat, cut, grep and sort work in this way. However, these tools are generic and not optimal for working with biological data, sequence data in particular. One can of course use sed, awk, Perl or some such tool, but those require quite an effort from the uninitiated user. Other systems built on the same idea as Biopieces also exist, e.g. the Boulder system created by Lincoln Stein. However, Biopieces focuses on usability: getting something working fast and simply. Biopieces uses only a simple text-based structure to exchange data, where the other systems use XML, YAML, or other complex interchange formats. Also, since the exchange format is text based, Biopieces are independent of the programming language, and because the format is so simple, it is very easy for any developer to write new Biopieces.
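To give a feel for how little a pipeable tool of this kind needs, here is a minimal sketch in Python. It is not taken from the Biopieces code base. The record layout (one "KEY: VALUE" pair per line, records separated by a line containing "---"), the key names SEQ_NAME, SEQ and SEQ_LEN, and the function names are all assumptions made for illustration.

```python
import io
import sys
from typing import Dict, Iterable, Iterator, TextIO

RECORD_DELIMITER = "---"  # assumed record separator, chosen for this sketch


def read_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per record, parsed from 'KEY: VALUE' lines."""
    record: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if line == RECORD_DELIMITER:
            if record:
                yield record
            record = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            record[key] = value
    if record:  # tolerate a missing final delimiter
        yield record


def write_record(record: Dict[str, str]) -> str:
    """Serialize a record back into the same text format."""
    body = "".join(f"{key}: {value}\n" for key, value in record.items())
    return body + RECORD_DELIMITER + "\n"


def add_seq_len(record: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical tool: add SEQ_LEN to records that carry a SEQ key."""
    if "SEQ" in record:
        return dict(record, SEQ_LEN=str(len(record["SEQ"])))
    return record  # records without a sequence pass through untouched


def run(in_stream: TextIO, out_stream: TextIO) -> None:
    """Read records from one stream and write annotated records to another."""
    for record in read_records(in_stream):
        out_stream.write(write_record(add_seq_len(record)))


if __name__ == "__main__":
    # Small in-memory demo; a real filter would call run(sys.stdin, sys.stdout).
    demo_input = io.StringIO("SEQ_NAME: test\nSEQ: ATCG\n---\n")
    run(demo_input, sys.stdout)
```

Because every tool in such a scheme reads and writes the same plain-text records, and passes along records and keys it does not care about, a tool like this could sit anywhere in a pipeline, and the same few lines of parsing could be rewritten in any other language.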

I began work on the Biopieces when approached for the nth time by a colleague requesting assistance with some quite trivial problems - trivial to me, that is, because they could be quickly solved with the built-in Unix tools and a Perl one-liner. After discussing whether or not he should undertake the task of learning Perl, I decided to supply him with a set of tools, each performing a very specific function, but flexible enough to be connected in all sorts of ways to do both simple and complicated tasks. I wrote a very basic set of tools, and then expanded the toolset over the following months as needed. This turned out to be a great success. After some minor initial trouble my colleague was self-propelled - and the Biopieces were born.

Today there are 90 different Biopieces, all set up in a neat framework that allows different developers to add Biopieces written in their favorite programming language. The Biopieces live at their own website, and there is also a discussion group.

The next step towards the full success of the Biopieces is to make their use more widespread, and to see the first Biopieces contributed by other developers - which is possible now that the Biopieces are online. And finally, it would be great if the UCSC Genome Browser tools and the EMBOSS tools were converted to use the Biopieces framework.
