Thursday, July 31, 2008

Biopieces: bioinformatic tools easily used and created

Biopieces is a project I have been working with as a side project for one year almost to the date. The idea is that you write a set of tools that you can pipe together using the command line in a Unix environment in such a way that the output of one tool can be used as input to the next tool. This is by no means a new idea, I should say it is in fact native to the Unix environment where tools such as cat, cut, grep and sort work in this way. However, these tools are generic and not optimal for working with biological data, sequence data in particular. One can of cause use sed, awk, Perl or some such tool, but those requires quite an effort for the uninitiated user. Also, other systems with the same idea as the Biopieces exists, e.g. the Boulder system created by Lincoln Stein. However, Biopieces focusses on usability; getting something working fast and simple. Biopieces uses only a simple text based structure to exchange data, where the other systems uses XML, YAML, or other complex interchange formats. Also, since the exchange format used is text based, Biopieces are independant of the programming language and because the format is so simple, it is very easy for any developer to write new Biopieces.

I began work on the Biopieces when approached for the n'th time by a colleague requesting assistance to deal with some quite trivial problems - that is trivial to me because they could be quickly solved with the onboard Unix tools and a Perl one-liner. After discussing whether or not he should undertake the task of learning Perl I decided to supply him with a set of tools, each doing a very specific function, but flexible so they could be connected in all sorts of ways to do both simple and complicated tasks. I wrote a very basic set of tools, and then expanded the toolset over the next period as needed. This turned out to be a great success. After some minor start trouble my colleague was self-propelled - and the Biopieces were born.

Today there are 90 different Biopieces all setup in a neat framework allowing different developers to add Biopieces written in their favorite programming language. The Biopieces live at their own website and there is also the discussion group.

The next step towards the full success of the Biopieces is to make their use more widespread, and to see the first Biopieces contributed by other developers - which is now possible since the Biopieces went online. And finally it would be great if the UCSC Genome Browser tools and EMBOSS tools were converted to use the Biopiece framework.

Wednesday, July 30, 2008

Non-coding RNA

Currently I am working as a Postdoctoral Fellow at Professor John Matticks group at the Institute of Molecular Bioscience, University of Queensland where I am once again engaged in the world of RNA. And the world of RNA is getting bigger and bigger for each day as it turns out that the eukaryotic cell (as well as the prokaryotic) is not a protein machine, but an RNA machine. It is becoming evident that introns and untranslated regions of mRNA harbor non-coding RNA. Moreover, the so-called "junk DNA" in intergenic regions are full of RNA encoding regions, repeats all of sudden express rasiRNA and piRNAs, annotated mRNAs turn out to be wrong and are in fact long non-coding RNAs, and some mRNAs have a non-coding functional state prior to translation.

non-coding RNA discovery

Addressing this novel world of RNA is tricky because many function as digital interfaces where function arises from basepairing and not from a catalytic three dimensional structure of the RNA moity. Such digital RNAs may be short as the miRNAs and their basepairing interaction may be imperfect, which renders analysis based on primary structure difficult. Not only is it difficult to identify the encoding loci in the genome, the targets are just as elusive. Prediction of secondary structure is unreliable and only assists in identifying stem containing RNAs such as pre-miRNA while e.g. piRNA do not show any stem forming potential. So in order to discover novel non-coding RNAs or novel classes of non-coding RNAs, it is necessary to create a platform that allows integration with comparative genomics, motif searches, expression data and deep sequencing data, etc. The UCSC genome browser can be used as such a platform, and the Institute of Molecular biology hosts a mirror which we have direct access to for fast generation of custom tracks and tweaking of the source code. I have used the comparative PhastCons data to locate and cluster distinct conservation profiles, where some are specific to miRNA and others to snoRNA while some are putative non-coding RNAs. Next step will be to evaluate these putative RNAs with expression data from deep sequencing. Because our group is part of the RIKEN consortium, we have access to very large amount of sequence data along with the public available data that is incoming at an impresive rate. I use suffix array software for fast sequence mapping (I mapped 60 mio tags to the human genome in one hour) which covers the conserved blocks with tiled tags that can be scored to give an idea of uniqueness and expression. Finally, probes can be designed for DNA arrays.

Below is an image from the UCSC genome browser showing a miRNA loci. From this it is evident that not only the miRNA is expressed from the stem indicated by the RNAfold track, but also the star sequences.

Are Alu repeats the storage of the Human memory?

The Human genome contains more the one million Alu repeats - a type of repeat specific to human. Alu repeats have been shown to be transcribed in the brain and in fact all Alu repreats have the potential to encode one or two small RNAs. The potential regulatory network is staggering, and the task of this network could possible be important to human memory. Since trafficing of small RNAs from the synapse to the neuron nucleus has been shown, we set out to test if Alu RNAs direct genomic recoding leading to genomically encoded memory - an idea supported by the finding that disrupted ADAR editing, which similarly recodes the immune system, lead to mental disorder. We sequenced 20 Alu loci and using 454 deep sequencing obtaining ~300,000 reads - which we hoped was deep enough to reveal transitions and transversions from possible ADAR editing. Unfortunately, it turned out that the error rate of 454 sequencing was too high for this type of analysis, and we are waiting for improved sequencing methods. In the meantime we are planning an experiment that theoretically will allow us to track the expression of specific Alu RNAs using DNA array with 30-mer probes. Eventhough Alu repeats are repetitive, they actually all have very slight sequence variation, which can be used to track individual Alus with five probes per Alu.

In order to asses the repetitiveness of the Alu repeats I wrote a piece of code that for all positions in the Human genome determines the genome-wide occurrence of that 15-mer. This is done by sliding a 15 nucleotide wide window over the genomic sequence in two passes. In both passes two bits, corresponding to the next nucleotide in the sequence, were pushed onto a 32 bit integer. In the first pass the full 32 bit integer was then used as an index to increment the count in an integer array. In the second pass the full 32 bit integer was used to output the count at each window position. The principle is illustrated in pseudo-code below.

// first pass

foreach nucleotide in sequence
if nucleotide is 'A'
shift 00 onto 32bits block
else if nucletide is 'T'
shift 11 onto 32bits block
else if nucletide is 'C'
shift 01 onto 32bits block
else if nucletide is 'G'
shift 10 onto 32bits block
reset block

if sliding window is full
count_array[ block ]++

// second pass

foreach nucleotide in sequence
if nucleotide is 'A'
shift 00 onto 32bits block
else if nucletide is 'T'
shift 11 onto 32bits block
else if nucletide is 'C'
shift 01 onto 32bits block
else if nucletide is 'G'
shift 10 onto 32bits block
reset block

if sliding window is full
count = count_array[ block]
print position and count

Now, in the real implementation I simultaneously counted the anti sense strand in the same go by pushing the reverse complemented bits onto another 32 bit block that was used to increment the count index as well. The result was a piece of code written in c that counted trough the entire Human genome in minutes using only 8Gb of RAM. The results were compiled as a wiggle track in the UCSC Genome Browser and the figure below shows two Alu repeats with the count track. Each peak in the count track corresponds the number (the height of the peak) of 15-mer beginning at that position. The conclusion was that the repeatedness of the Alu repeats is mainly in the promoter region and that there in each Alu repeat exists a couple of almost unique 15-mers.

Declining fertility of Danish women

While at the Department of Growth and Reproduction, Rigshospitalet, I was approached by Professor Tina Kold Jensen, who asked me if I could assist in a register based study of Danish womens reproductive health. We obtained access to a considerable amount of data, since Denmark have a long tradition of compiling high quality registers of birth, death, immigration, induced abortions, and hospitalization data. I undertook the task of cleaning and ordering this data allowing analysis of the impact of assisted reproduction on the fertility rate in Denmark with respect to native Danish women. Denmark is one of the few countries in the Western World with a steady fertility rate, however, we were able to show that the fertility rate was upheld only be the use of artificial reproduction. It has been argued that the falling fertility rate observed in younger cohorts is due to postponement of child bearing, however, we also observe a cohort effect showing a decline of the induced abortion rate among Danish teenage girls - is this because contraception use has improoved, or is this linked the fact that 20% or young Danish men have a sperm concentration below 20 mio sperms/milliliter - the WHO threshold of impaired reproductive function?

The below figure shows the accumulated Total conception rate (birth rate + abortion rate) with and without the contribution of assisted reproduction.

Gene families and spermatogenesis

I completed my PhD project at the Department of Growth and Reproduction, Righospitalet, which was lead by Professor Niels Erik Skakkebaek. Senior Scientist Henrik Leffers was my everyday supervisor, however, since Henrik was a molecular biologist, computational sparring was introduced in the form of Niels Larsen, who at the time was Software Development Leader at Integrated Genomics in Chicago. I visited Niels in Chicago for three months, where I learned to write production quality code. For computational sparring back in Denmark, Professor Soeren Brunak from Denmarks Technical University was affilated as my co-supervisor.

The title of my Ph.D. thesis was "Bioinformatic Analysis of Gene Families in Mouse and Human Spermatogenesis". Initially we determined gene expression profiles during the development of spermatogenesis in newborn mice. I wrote a data integration system to incorporate data from differnt sources including Differential Display, DNA array, and in situ hybridization data. We were able to show that the first wave of spermatogenesis was constituted of three major clusters of expression originating from Sertoli cells, Pachytene germ cells, and spermatids - and that all genes expressed could be associated to one or more of these clusters. This made it possible to determine the germ cell composition of the growing testis from total RNA and work is still ongoing to utilize this to located disrupted germ cell populations that may be caused by hormone like agents such as phthalates, parabenes, and pesticides - leading to impaired testicular function.

During this process we came across a novel gene family, where the Human orthologs showed a distinct genomic location pattern: the members of the gene family was positioned in the arms of direct and inverted segmental duplications. A sensitive psiBLAST-like approach revealed that these genes via their promoter sequence were in fact related to four known human gene families with similar genomic location patterns. Thus, all these genes could be involved in infertility since inverted segmental duplications are subject to non-homologous recombination with consequent gene loss - as described for the well studied AZF gene family. Also, the complex genomic patterns might lead to overexpression of these genes resulting in cancer. Below is a dotplot and a line plot showing a 150Kbp inverted segmental duplications - or palindrome - that harbors the VCY genes.

Sunday, July 6, 2008

Recursive pair-wise alignments

During my PhD years I used a lot of time and energy exploring large inverted segmental duplications - or palindromes - on the Human and Mouse sex chromosomes. These palindromes showed a remarkable level of arm-to-arm sequence conservation and harbored many testis and cancer specific genes we were interested in. Locating these palindromes were quite a daunting task because of their sizes in the multi-megabasepair range. Niels Larsen and I looked into how this could be done computationally, unfortunately, David Page et al beat us to it, and they published a couple of very nice papers on this phenomenon. So this work never made it as a chapter in my PhD thesis, but what remains are the thoughts about biological meaningful alignments.


Biological alignments of sequences constitutes one of the major data analysis tasks in modern biology. While computer scientists have focused on algorithms working with limited, generic datasets, biologists have failed to grasp the internals of computer analysis and this has resulted in a lack of computer programs that efficiently will handle large datasets to give biologically meaningful results.

The biologically meaningful result is an important point which implies that aligning should be terminated when there is any doubt about the result instead of sprinkling indels throughout the alignment. A biologically meaningful alignment algorithm would leave the troublesome parts unaligned, and somewhere annotate which sequence parts were aligned and which were not.

The following describes an alignment algorithm idea of unknown origin (at least to me) which can be modified to create biological meaningful pair-wise alignments with proper adjustment of the match scoring method.


The algorithm utilizes a recursive top-down approach where the two sequences are used to cast a search space and all matches longer than a certain word-size shared between the sequences are located (Fig. 1A). Next, all matches longer than a certain size (e.g. 50-100nt) are assumed to be beyond doubt since these should not occur by chance - and these long matches are included in the alignment chain without further considerations (matches a, b, and c in Fig. 1B). Matches longer than the minimum word-size, but shorter than the long match size are retained and evaluated in the next alignment round if these matches falls within the new search spaces cast between the long matches and the boundaries of the previous search space (search space alpha in Fig. 1B). If a search space does not include any retained matches, new matches are found using a word-size calculated based on the search space dimensions. All retained and new matches within a search space is scored according to three different criteria (explained later) and the best scoring match is included in the alignment chain and used to cast two new search spaces (Fig. 1C). When no more matches can be found and no more search spaces can be cast, the alignment chain is complete (Fig. 1D).

Match Scoring

The matches found in a search space may all be valid for inclusion in the alignment chain, but it is also possible that all the matches are incorrect, and moreover, the matches may be a collection of both correct and incorrect matches. Therefore it is necessary to be able to distinguish between matches that should be:
  • included in the alignment chain
  • retained for further alignment rounds
  • discarded
This can be achieved by scoring all matches and including in the alignment chain the best scoring match, while retaining and discarding the remaining matches depending on a certain threshold score. Conflicting matches, that arise as a result of repetitive sequence, should also be scored in such a way, that the most correct match is included in the alignment chain. Match scoring should be done in a generic and robust way independent of the search space dimension including matches that result in an obvious correct alignment. Scoring takes place according to a sum of these criteria listed in order of importance:
  1. match length
  2. distance to closest diagonal
  3. distance to the narrow end of the search space
Distance to the closest corner of the search space

As stated above, this criteria is actually invalid, but it is mentioned here anyway to illustrate how this renders the scoring independent of the search space dimensions. Scoring according to corner distance in such a way that the closer to the corner, the better the match, is wrong as illustrated in Fig. 2A, where the match with the highest corner distance results in the correct alignment in a match conflict where the match lengths and diagonal and narrow end distances are equal.

Distance to narrow end of the search space

This criteria is only relevant for search spaces with uneven dimensions (i.e. uneven sequence lengths) where it is used to discriminate in match conflicts where match lengths and diagonal distances are equal. Thus, the match most distant to the narrow end of the search space is choosen (Fig. 2B). Also, since this criteria only is invoked in the special match conflicts described here, it can be considered independent of the search space dimensions. This parameter is dependent only on the primary sequence and hence the narrow distance score is linear.

Distance to closest diagonal

In a match conflict where two matches have equal lengths and the search space has even dimensions (neglecting the use of the narrow distance) will have to be resolved using distance to closest diagonal (Fig. 2C). Any distance between a match and the closest diagonal corresponds linearly to the amount of indels that will be inserted when the alignment chain is converted to the finished alignment output.

Match length

The length of a match is the most dominant parameter for scoring. A search space of uneven dimensions may contain conflicting matches with equal distance to closest diagonal, narrow end of search space, and closest corner thus illustrating that the lenght of the match is the determining factor (Fig. 2D). The longer the match, the higher the probability it did not occur by chance and should be included in the alignment chain. This implies that the match length score should be exponentially derived from the match length.


Here is a pseudocode version of the recursive function:

func align_two_seq
filter all matches outside search space
if no matches
else if very long matches
include longest match
order remaining matches
include all non-crossing matches
call align_two_seq on all inter-match search spaces
score all matches
filter matches scoring below threshold
order matches and include the highest scoring
call align_two_seq on the match defined search spaces

Wednesday, July 2, 2008

Modifications in ribosomal RNA

This, my Cand. Scient. (MS) project, was completed at the Department of Biological Chemistry, Institute of Molecular Biology, University of Copenhagen. My supervisor was Associate Professor Birte Vester who was part of the RNA group led by Professor Roger Garrett. Here I did three years of lab-work investigating posttranscriptional modifications of ribosomal RNA using standard lab methods combined with mass spectrometry. I was able to isolate the part of the 23S ribosomal RNA that constitutes the peptidyl transferase centre using site-directed RNaseH digestion followed by isolation with PAGE. The isolated fragments were analysed by MALDI-MS, where I was so fortunate to collaborate with Associate Professor Finn Kirpekar (University of Southern Denmark) who analyzed the fragments - and taught me to perform this type of analyzis on equipment in Copenhagen. I also screened these fragments for the mass-silent pseudouridines using chemical modification and detection with RT-PCR and visualization on sequencing gels. At the same time the first high resolution crystal structure of the ribosome was published, and I was able to use this data to visualize the spatial position of my modifications from five different bacteria and archaea, and showing that they were all positioned in the peptidyl transferase pocket facing outwards, away from the ribosomal RNA, and towards the amino-acetylated tRNAs. These findings suggest that the modifications may be important to the ribosome function, however, the positions of the modifications varied between species and some modifications were non-ubiquitous - so the precise function and importance remains unknown today.

During these years I learned that if you can work with RNA you can work with everything. And especially highly structured RNA like ribosomal RNA has a tendency of wanting to ball up in a tight structure expelling whatever probes you are trying to hybridize - no matter if these probes are of DNA or RNA or spiked with modified nucletotides such as LNA. The solution for me was to move probes to regions of less stability by trial and error. Also, I was introduced to Linux and have been running Debian ever since. This Linux platform was optimal for the visualization work I did using VMD. One image can be seen below, where the ribosome structure is shown in crown and side view with methylations shown in blue and psedouridines in red.