Tuesday, October 9, 2012

Finding similar sequences - FindSim

Finding similar sequences between query sequences and a database is a common task for which many different tools like BLAST and BLAT have been developed. For taxonomic classification it is useful to have a search tool that locates the top hits efficiently and in the case of searching a 16S ribosomal DNA database, sensitivity can be sacrificed for speed while still locating the relevant sequences. The SimRank tool (by Niels Larsen) breaks down a query sequence into overlapping k-mers (7-mers) - or oligofying the sequence - which are used to search a database of indexed sequences to locate the sequences that share the most unique k-mers. A specified number of hits per query sequence is output according to an identity score that is calculated as the number of unique shared k-mers over the minimum unique k-mers in either the query sequence or the database sequence. The resulting identity score is a rough under-estimate of the per-base identity so that a identity of 70% based on k-mer corresponds to 90% based on nucleotides.

I have implemented an upgraded logic of the SimRank algorithm now allowing for multiple query sequences to be processed at one time, which dramatically improves performance: 1000 full length RNA sequenced against 400.000 sequences is completed in 3.5 minutes on a laptop using less than 1Gb of memory. Moreover, the algorithm scales almost linearly. The FindSim algorithm functions by initially pre-processing all query sequences into a query index by oligofying all query sequences and for every oligo record the sequences containing that particular oligo. Holding this query index in memory the database of reference sequences is searched by oligofying each database sequence and determine shared unique k-mers and record the score. Finally, the scores are post-processed so that for each query sequence a given number of database hits is output best first, or a variable number of  database hits is output within an identity range.

FindSim schematics
FindSim schematic
The FindSim algorithm was implemented as a Biopiece: findsim_seq.

No comments: