Sunday, July 6, 2008

Recursive pair-wise alignments

During my PhD years I used a lot of time and energy exploring large inverted segmental duplications - or palindromes - on the Human and Mouse sex chromosomes. These palindromes showed a remarkable level of arm-to-arm sequence conservation and harbored many testis and cancer specific genes we were interested in. Locating these palindromes were quite a daunting task because of their sizes in the multi-megabasepair range. Niels Larsen and I looked into how this could be done computationally, unfortunately, David Page et al beat us to it, and they published a couple of very nice papers on this phenomenon. So this work never made it as a chapter in my PhD thesis, but what remains are the thoughts about biological meaningful alignments.

Background


Biological alignments of sequences constitutes one of the major data analysis tasks in modern biology. While computer scientists have focused on algorithms working with limited, generic datasets, biologists have failed to grasp the internals of computer analysis and this has resulted in a lack of computer programs that efficiently will handle large datasets to give biologically meaningful results.

The biologically meaningful result is an important point which implies that aligning should be terminated when there is any doubt about the result instead of sprinkling indels throughout the alignment. A biologically meaningful alignment algorithm would leave the troublesome parts unaligned, and somewhere annotate which sequence parts were aligned and which were not.

The following describes an alignment algorithm idea of unknown origin (at least to me) which can be modified to create biological meaningful pair-wise alignments with proper adjustment of the match scoring method.

Principle

The algorithm utilizes a recursive top-down approach where the two sequences are used to cast a search space and all matches longer than a certain word-size shared between the sequences are located (Fig. 1A). Next, all matches longer than a certain size (e.g. 50-100nt) are assumed to be beyond doubt since these should not occur by chance - and these long matches are included in the alignment chain without further considerations (matches a, b, and c in Fig. 1B). Matches longer than the minimum word-size, but shorter than the long match size are retained and evaluated in the next alignment round if these matches falls within the new search spaces cast between the long matches and the boundaries of the previous search space (search space alpha in Fig. 1B). If a search space does not include any retained matches, new matches are found using a word-size calculated based on the search space dimensions. All retained and new matches within a search space is scored according to three different criteria (explained later) and the best scoring match is included in the alignment chain and used to cast two new search spaces (Fig. 1C). When no more matches can be found and no more search spaces can be cast, the alignment chain is complete (Fig. 1D).




Match Scoring

The matches found in a search space may all be valid for inclusion in the alignment chain, but it is also possible that all the matches are incorrect, and moreover, the matches may be a collection of both correct and incorrect matches. Therefore it is necessary to be able to distinguish between matches that should be:
  • included in the alignment chain
  • retained for further alignment rounds
  • discarded
This can be achieved by scoring all matches and including in the alignment chain the best scoring match, while retaining and discarding the remaining matches depending on a certain threshold score. Conflicting matches, that arise as a result of repetitive sequence, should also be scored in such a way, that the most correct match is included in the alignment chain. Match scoring should be done in a generic and robust way independent of the search space dimension including matches that result in an obvious correct alignment. Scoring takes place according to a sum of these criteria listed in order of importance:
  1. match length
  2. distance to closest diagonal
  3. distance to the narrow end of the search space
Distance to the closest corner of the search space

As stated above, this criteria is actually invalid, but it is mentioned here anyway to illustrate how this renders the scoring independent of the search space dimensions. Scoring according to corner distance in such a way that the closer to the corner, the better the match, is wrong as illustrated in Fig. 2A, where the match with the highest corner distance results in the correct alignment in a match conflict where the match lengths and diagonal and narrow end distances are equal.

Distance to narrow end of the search space

This criteria is only relevant for search spaces with uneven dimensions (i.e. uneven sequence lengths) where it is used to discriminate in match conflicts where match lengths and diagonal distances are equal. Thus, the match most distant to the narrow end of the search space is choosen (Fig. 2B). Also, since this criteria only is invoked in the special match conflicts described here, it can be considered independent of the search space dimensions. This parameter is dependent only on the primary sequence and hence the narrow distance score is linear.

Distance to closest diagonal

In a match conflict where two matches have equal lengths and the search space has even dimensions (neglecting the use of the narrow distance) will have to be resolved using distance to closest diagonal (Fig. 2C). Any distance between a match and the closest diagonal corresponds linearly to the amount of indels that will be inserted when the alignment chain is converted to the finished alignment output.

Match length

The length of a match is the most dominant parameter for scoring. A search space of uneven dimensions may contain conflicting matches with equal distance to closest diagonal, narrow end of search space, and closest corner thus illustrating that the lenght of the match is the determining factor (Fig. 2D). The longer the match, the higher the probability it did not occur by chance and should be included in the alignment chain. This implies that the match length score should be exponentially derived from the match length.



Implementation


Here is a pseudocode version of the recursive function:

func align_two_seq
filter all matches outside search space
if no matches
find_wordsize
find_matches
else if very long matches
include longest match
order remaining matches
include all non-crossing matches
call align_two_seq on all inter-match search spaces
else
score all matches
filter matches scoring below threshold
order matches and include the highest scoring
call align_two_seq on the match defined search spaces

No comments: