Thursday, June 25, 2009

Reconstructing non-coding RNAs

We observed that sequence tags (from deep sequencing data sets) in the microRNA size range also cover other types of non-coding RNAs such as tRNA and snoRNA. Using publicly available data sets, we set out to reconstruct full-length non-coding RNAs from Drosophila. We started with 12 GEO data sets comprising 90 experiments and 56M sequence tags, of which 11M were unique; 6M could be mapped perfectly to the genome, yielding 68M hits. These mapped tags were assembled into tag contigs (TCs), disregarding coding and repetitive regions. This yielded 0.5M TCs, each consisting of tags on the same strand (Fig. 1). For each TC, the Tag depth was determined as the maximum number of overlapping sequence tags, which indicates the expression level.

Figure 1. Assembly of sequence tags into Tag Contigs.
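
The assembly step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the names Tag, TagContig and assemble_tag_contigs are made up for this example, coordinates are assumed to be half-open intervals, and "overlap" is assumed to mean sharing at least one base on the same strand. Filtering of coding and repetitive regions is left out.

```python
from collections import namedtuple

# Hypothetical record types for this sketch; coordinates are half-open [start, end).
Tag = namedtuple("Tag", "chrom strand start end")
TagContig = namedtuple("TagContig", "chrom strand start end n_tags depth")


def _finish(chrom, strand, tags):
    """Build one contig and compute its Tag depth with a sweep over tag ends."""
    events = []
    for t in tags:
        events.append((t.start, 1))   # a tag opens
        events.append((t.end, -1))    # a tag closes
    # At equal positions -1 sorts before +1, so abutting tags do not overlap.
    events.sort()
    depth = best = 0
    for _, delta in events:
        depth += delta
        best = max(best, depth)
    return TagContig(chrom, strand, tags[0].start,
                     max(t.end for t in tags), len(tags), best)


def assemble_tag_contigs(tags):
    """Merge overlapping tags on the same chromosome and strand into contigs."""
    groups = {}
    for t in tags:
        groups.setdefault((t.chrom, t.strand), []).append(t)

    contigs = []
    for (chrom, strand), group in sorted(groups.items()):
        group.sort(key=lambda t: (t.start, t.end))
        current, current_end = [], None
        for t in group:
            if current and t.start < current_end:
                # Tag overlaps the growing contig: extend it.
                current.append(t)
                current_end = max(current_end, t.end)
            else:
                # Gap found: close the previous contig and start a new one.
                if current:
                    contigs.append(_finish(chrom, strand, current))
                current, current_end = [t], t.end
        if current:
            contigs.append(_finish(chrom, strand, current))
    return contigs


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    example = [
        Tag("2L", "+", 100, 122),
        Tag("2L", "+", 110, 132),
        Tag("2L", "+", 115, 137),
        Tag("2L", "+", 200, 221),
        Tag("2L", "-", 105, 127),
    ]
    for tc in assemble_tag_contigs(example):
        print(tc)
```

In this sketch every mapped hit counts towards the depth, so a tag sequenced many times should be passed in once per occurrence if the depth is meant to reflect expression level.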

Inspection of TCs overlapping with annotated non-coding RNAs revealed that it was indeed possible to reconstruct tRNAs as well as both box H/ACA and box C/D snoRNAs. Moreover, plotting the TC length against Tag depth revealed that these non-coding RNAs form well-defined clusters (Fig. 2). Testing un-annotated TCs from these clusters by Northern blotting, we validated the existence of transcripts of the expected length from 8 and 26 previously unrecognized box H/ACA and box C/D snoRNAs, respectively. However, as indicated in grey in Figure 2, the large number of un-annotated TCs suggests the existence of many more non-coding RNAs.

Figure 2. Tag Contig length plotted against Tag depth.
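
One simple way to pick un-annotated TCs "from these clusters" is sketched below. This is a rough stand-in, not the procedure we actually used: it assumes a cluster can be approximated by the bounding box of the annotated TCs in (length, log10 Tag depth) space. The function names cluster_box and candidates and all numbers are invented for illustration.

```python
import math


def cluster_box(annotated):
    """Bounding box of annotated TCs given as (length, tag_depth) pairs.

    Depth is compared on a log10 scale (an assumption of this sketch).
    """
    lengths = [length for length, _ in annotated]
    log_depths = [math.log10(depth) for _, depth in annotated]
    return min(lengths), max(lengths), min(log_depths), max(log_depths)


def candidates(unannotated, box):
    """Return the un-annotated TCs that fall inside the bounding box."""
    min_len, max_len, min_log, max_log = box
    return [
        (length, depth)
        for length, depth in unannotated
        if min_len <= length <= max_len
        and min_log <= math.log10(depth) <= max_log
    ]


if __name__ == "__main__":
    # Made-up (length, Tag depth) values, for illustration only.
    annotated_tcs = [(70, 100), (95, 1000), (80, 500)]
    unannotated_tcs = [(85, 300), (22, 5000), (150, 200), (90, 10)]
    print(candidates(unannotated_tcs, cluster_box(annotated_tcs)))
```

A bounding box is the crudest possible cluster definition. Any candidate picked this way still needs experimental validation, for example by Northern blotting as described above.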

Manuscript abstract is here.


Paul said...

Hej Martin,
That sounds fascinating! I know others have observed ncRNA degradation products in their RNA-seq data, but it sounds like you've tied this into a very nice approach for detecting novel ncRNAs. Feel free to send your more confident examples to Rfam, or even better, publish them in the RNA families track at RNA Biology.

maasha said...


Thanks for your comment. The technology is fast catching up with our approach, which will be made obsolete by longer sequence reads.

Also, a number of recent papers show that mRNA and snoRNA appear to be processed in a non-random manner.

We also have a large deep sequencing data set that shows strongly biased processing/degradation of rRNA. We have no idea what is going on, or what to do with this observation.

RNA degradation is poorly understood. Perhaps RNA is carefully processed instead of being degraded?

maasha said...

The paper is now out:

BMC Genomics. 2010 Feb 1;11(1):77.
Identification of novel non-coding RNAs using profiles of short sequence reads from next generation sequencing data.

Jung CH, Hansen MA, Makunin IV, Korbie DJ, Mattick JS.

Institute for Molecular Bioscience, University of Queensland, St Lucia QLD 4072, Australia.

ABSTRACT: BACKGROUND: The increasing interest in small non-coding RNAs (ncRNAs) such as microRNAs (miRNAs), small interfering RNAs (siRNAs) and Piwi-interacting RNAs (piRNAs) and recent advances in sequencing technology have yielded large numbers of short (18-32 nt) RNA sequences from different organisms, some of which are derived from small nucleolar RNAs (snoRNAs) and transfer RNAs (tRNAs). We observed that these short ncRNAs frequently cover the entire length of annotated snoRNAs or tRNAs, which suggests that other loci specifying similar ncRNAs can be identified by clusters of short RNA sequences. RESULTS: We combined publicly available datasets of tens of millions of short RNA sequence tags from Drosophila melanogaster, and mapped them to the Drosophila genome. Approximately 6 million perfectly mapping sequence tags were then assembled into 521,302 tag-contigs (TCs) based on tag overlap. Most transposon-derived sequences, exons and annotated miRNAs, tRNAs and snoRNAs are detected by TCs, which show distinct patterns of length and tag-depth for different categories. The typical length and tag-depth of snoRNA-derived TCs was used to predict 7 previously unrecognized box H/ACA and 26 box C/D snoRNA candidates. We also identified one snRNA candidate and 86 loci with a high number of tags that are yet to be annotated, 7 of which have a particular 18mer motif and are located in introns of genes involved in development. A subset of new snoRNA candidates and putative ncRNA candidates was verified by Northern blot. CONCLUSIONS: In this study, we have introduced a new approach to identify new members of known classes of ncRNAs based on the features of TCs corresponding to known ncRNAs. A large number of the identified TCs are yet to be examined experimentally suggesting that many more novel ncRNAs remain to be discovered.

PMID: 20113528 [PubMed - in process]