Monday, October 8, 2012

Denoising Amplicons

Determining bacterial diversity can be done by amplifying and sequencing 16S ribosomal DNA genes yielding amplicons. In recent years pyrosequencing has been the workhorse in amplicon sequencing, because the read length of this deep sequencing can span one or more of the variable regions in the 16S ribosomal DNA gene, which is necessary to get the resolution to identify bacteria at the species level. Unfortunately, pyrosequencing is prone to homopolymeric errors, where runs of identical nucleotides are miscalled. Several computer programs have been developed to rectify the problem: AmpliconNoise applies an approximate likelihood using empirically derived error distributions to remove noise. Denoiser is a less slow algorithm that uses frequency-based heuristics rather than statistical modeling to cluster reads. Recently, Acacia has been released, which is much less computer resource demanding that AmpliconNoise and Denoiser. Acacia avoids all-against-all alignments by aligning each read to dynamically updated cluster consensus sequences, and the alignment algorithm is made more efficient using heuristics that only consider homopolymer over- and under-calls.

For some reason none of these programs utilize the quality scores of the sequences. I have devised an algorithm were all sequences are clustered based on sequence similarity and for each cluster a consensus sequence is determined from the sequence distribution while taking into account quality scores. Thus, a column in the cluster alignment showing identical residues, but with subquality scores may be included in the consensus, while a column with a percentage of subqualty scores may be filtered for those residues, and the consensus will be determined from the remaining residues. A number of different filters can be used with different parameters.

consensus aligment
Consensus alignment
The consensus sequence is the second last line. The last line illustrates the mean quality scores of the column by identity '0' = 75-100%, 'o' = 50-75%, '.' = 25-50%, ' ' = 0-25%. Lowercase residues have failed the rules by the filters and are omitted from the consensus.

Testing results show that sequence clusters indeed can be improved upon and lead to better consensus sequences. It is possible to iterate the denoising so that the first round uses clustering at 100% similarity and loose filtering followed be a second round of denoising using clustering at 97% similarity and more strict filtering. In general, the consensus sequences are shorter than the sequences in the cluster, because any detected errors in the alignment results in gaps in the consensus, but identifying reference sequences from a reference database is not a problem using such as BLAST or findsim_seq

This algorithm have been implemented as a Biopiece: denoise_seq.

No comments: