Tuesday, October 9, 2012

Finding similar sequences - FindSim

Finding similar sequences between query sequences and a database is a common task for which many different tools like BLAST and BLAT have been developed. For taxonomic classification it is useful to have a search tool that locates the top hits efficiently and in the case of searching a 16S ribosomal DNA database, sensitivity can be sacrificed for speed while still locating the relevant sequences. The SimRank tool (by Niels Larsen) breaks down a query sequence into overlapping k-mers (7-mers) - or oligofying the sequence - which are used to search a database of indexed sequences to locate the sequences that share the most unique k-mers. A specified number of hits per query sequence is output according to an identity score that is calculated as the number of unique shared k-mers over the minimum unique k-mers in either the query sequence or the database sequence. The resulting identity score is a rough under-estimate of the per-base identity so that a identity of 70% based on k-mer corresponds to 90% based on nucleotides.

I have implemented an upgraded logic of the SimRank algorithm now allowing for multiple query sequences to be processed at one time, which dramatically improves performance: 1000 full length RNA sequenced against 400.000 sequences is completed in 3.5 minutes on a laptop using less than 1Gb of memory. Moreover, the algorithm scales almost linearly. The FindSim algorithm functions by initially pre-processing all query sequences into a query index by oligofying all query sequences and for every oligo record the sequences containing that particular oligo. Holding this query index in memory the database of reference sequences is searched by oligofying each database sequence and determine shared unique k-mers and record the score. Finally, the scores are post-processed so that for each query sequence a given number of database hits is output best first, or a variable number of  database hits is output within an identity range.

FindSim schematics
FindSim schematic
The FindSim algorithm was implemented as a Biopiece: findsim_seq.

Monday, October 8, 2012

Denoising Amplicons

Determining bacterial diversity can be done by amplifying and sequencing 16S ribosomal DNA genes yielding amplicons. In recent years pyrosequencing has been the workhorse in amplicon sequencing, because the read length of this deep sequencing can span one or more of the variable regions in the 16S ribosomal DNA gene, which is necessary to get the resolution to identify bacteria at the species level. Unfortunately, pyrosequencing is prone to homopolymeric errors, where runs of identical nucleotides are miscalled. Several computer programs have been developed to rectify the problem: AmpliconNoise applies an approximate likelihood using empirically derived error distributions to remove noise. Denoiser is a less slow algorithm that uses frequency-based heuristics rather than statistical modeling to cluster reads. Recently, Acacia has been released, which is much less computer resource demanding that AmpliconNoise and Denoiser. Acacia avoids all-against-all alignments by aligning each read to dynamically updated cluster consensus sequences, and the alignment algorithm is made more efficient using heuristics that only consider homopolymer over- and under-calls.

For some reason none of these programs utilize the quality scores of the sequences. I have devised an algorithm were all sequences are clustered based on sequence similarity and for each cluster a consensus sequence is determined from the sequence distribution while taking into account quality scores. Thus, a column in the cluster alignment showing identical residues, but with subquality scores may be included in the consensus, while a column with a percentage of subqualty scores may be filtered for those residues, and the consensus will be determined from the remaining residues. A number of different filters can be used with different parameters.

consensus aligment
Consensus alignment
The consensus sequence is the second last line. The last line illustrates the mean quality scores of the column by identity '0' = 75-100%, 'o' = 50-75%, '.' = 25-50%, ' ' = 0-25%. Lowercase residues have failed the rules by the filters and are omitted from the consensus.

Testing results show that sequence clusters indeed can be improved upon and lead to better consensus sequences. It is possible to iterate the denoising so that the first round uses clustering at 100% similarity and loose filtering followed be a second round of denoising using clustering at 97% similarity and more strict filtering. In general, the consensus sequences are shorter than the sequences in the cluster, because any detected errors in the alignment results in gaps in the consensus, but identifying reference sequences from a reference database is not a problem using such as BLAST or findsim_seq

This algorithm have been implemented as a Biopiece: denoise_seq.