Wednesday, July 30, 2008

Are Alu repeats the storage of the Human memory?

The human genome contains more than one million Alu repeats, a type of repeat specific to primates. Alu repeats have been shown to be transcribed in the brain, and in fact all Alu repeats have the potential to encode one or two small RNAs. The potential regulatory network is staggering, and this network could possibly be important to human memory. Since trafficking of small RNAs from the synapse to the neuron nucleus has been shown, we set out to test whether Alu RNAs direct genomic recoding leading to genomically encoded memory, an idea supported by the finding that disrupted ADAR editing, which similarly recodes the immune system, leads to mental disorder. We sequenced 20 Alu loci using 454 deep sequencing and obtained ~300,000 reads, which we hoped was deep enough to reveal transitions and transversions from possible ADAR editing. Unfortunately, it turned out that the error rate of 454 sequencing was too high for this type of analysis, and we are waiting for improved sequencing methods. In the meantime we are planning an experiment that in theory will allow us to track the expression of specific Alu RNAs using a DNA array with 30-mer probes. Even though Alu repeats are repetitive, they all have very slight sequence variations, which can be used to track individual Alus with five probes per Alu.

In order to assess the repetitiveness of the Alu repeats I wrote a piece of code that, for every position in the human genome, determines the genome-wide occurrence of the 15-mer starting at that position. This is done by sliding a 15 nucleotide wide window over the genomic sequence in two passes. In both passes two bits, corresponding to the next nucleotide in the sequence, are pushed onto a 32-bit integer (a 15-mer occupies 30 of the 32 bits). In the first pass the integer is used as an index to increment the count in an integer array. In the second pass the integer is used to look up and output the count at each window position. The principle is illustrated in pseudo-code below.

// first pass

foreach nucleotide in sequence
    if nucleotide is 'A'
        shift 00 onto 32bits block
    else if nucleotide is 'T'
        shift 11 onto 32bits block
    else if nucleotide is 'C'
        shift 01 onto 32bits block
    else if nucleotide is 'G'
        shift 10 onto 32bits block
    else
        reset block and sliding window

    if sliding window is full
        count_array[ block ]++

// second pass

foreach nucleotide in sequence
    if nucleotide is 'A'
        shift 00 onto 32bits block
    else if nucleotide is 'T'
        shift 11 onto 32bits block
    else if nucleotide is 'C'
        shift 01 onto 32bits block
    else if nucleotide is 'G'
        shift 10 onto 32bits block
    else
        reset block and sliding window

    if sliding window is full
        count = count_array[ block ]
        print position and count
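The two passes can be sketched in C roughly as follows. This is a minimal illustration and not the original program: the names encode_base, scan and kmer_occurrences are made up for this example, the window width k is a parameter (1 to 15) so the sketch can be tried on short strings, lowercase bases are accepted as an assumption, and the counts are returned in an array instead of being printed. For k = 15 the count table has 4^15 = 2^30 entries.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Map a nucleotide to its 2-bit code (A=00, C=01, G=10, T=11).
   Returns -1 for anything else (N, gaps, ...).
   Accepting lowercase is an assumption of this sketch. */
static int encode_base(char c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return -1;
    }
}

/* Slide a k-wide window over seq.
   First pass  (out == NULL): increment counts[block] for every full window.
   Second pass (out != NULL): store the count of the k-mer that starts
   at position p in out[p]. */
static void scan(const char *seq, size_t len, int k,
                 uint32_t *counts, uint32_t *out)
{
    const uint32_t mask = (1u << (2 * k)) - 1u; /* keep the low 2k bits */
    uint32_t block = 0;
    int filled = 0;                             /* valid bases in window */

    for (size_t i = 0; i < len; i++) {
        int code = encode_base(seq[i]);
        if (code < 0) {                         /* reset on non-ACGT */
            block = 0;
            filled = 0;
            continue;
        }
        block = ((block << 2) | (uint32_t)code) & mask;
        if (filled < k)
            filled++;
        if (filled == k) {
            if (out == NULL)
                counts[block]++;
            else
                out[i + 1 - (size_t)k] = counts[block];
        }
    }
}

/* Returns an array of len counts (0 where no valid k-mer starts).
   The caller frees the result. Returns NULL if allocation fails. */
uint32_t *kmer_occurrences(const char *seq, size_t len, int k)
{
    assert(k >= 1 && k <= 15);
    size_t table_size = (size_t)1 << (2 * k);   /* 4^k possible k-mers */
    uint32_t *counts = calloc(table_size, sizeof *counts);
    uint32_t *out = calloc(len ? len : 1, sizeof *out);

    if (counts == NULL || out == NULL) {
        free(counts);
        free(out);
        return NULL;
    }
    scan(seq, len, k, counts, NULL);            /* first pass: count */
    scan(seq, len, k, counts, out);             /* second pass: look up */
    free(counts);
    return out;
}
```

Calling kmer_occurrences(seq, strlen(seq), 15) on a chromosome held in memory gives one count per position, which could then be printed as position and count pairs like in the pseudo-code. Masking the block after each shift is what keeps the index inside the table once more than k bases have been pushed.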

Now, in the real implementation I counted the antisense strand in the same pass by pushing the reverse-complemented bits onto a second 32-bit block, which was used as an index to increment the count array as well. The result was a piece of code written in C that counted through the entire human genome in minutes using only 8 GB of RAM. The results were compiled as a wiggle track in the UCSC Genome Browser, and the figure below shows two Alu repeats with the count track. The height of each peak in the count track gives the number of genome-wide occurrences of the 15-mer beginning at that position. The conclusion was that the repetitiveness of the Alu repeats lies mainly in the promoter region, and that each Alu repeat contains a couple of almost unique 15-mers.
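One way to maintain the antisense block alongside the sense block is sketched below in C. It is an assumption about how this could be done, not the original code: kmer_window, window_push and encode_kmer are hypothetical names, and k is a parameter (1 to 15). With the encoding A=00, C=01, G=10, T=11 the complement of a 2-bit code is 3 minus the code, so the new complemented base is placed at the top of the reverse block while the oldest one falls off the bottom.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 2-bit code for a nucleotide (A=00, C=01, G=10, T=11), -1 otherwise. */
static int base_code(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Sense and antisense blocks for the current window position. */
typedef struct {
    uint32_t fwd;    /* k-mer on the sense strand */
    uint32_t rev;    /* its reverse complement */
    int      filled; /* number of valid bases currently in the window */
} kmer_window;

/* Push one nucleotide. Returns 1 when both blocks hold a full k-mer. */
static int window_push(kmer_window *w, char c, int k)
{
    assert(k >= 1 && k <= 15);
    const uint32_t mask = (1u << (2 * k)) - 1u;
    int code = base_code(c);

    if (code < 0) {                       /* reset on non-ACGT */
        w->fwd = 0;
        w->rev = 0;
        w->filled = 0;
        return 0;
    }
    /* sense strand: new base enters at the low end */
    w->fwd = ((w->fwd << 2) | (uint32_t)code) & mask;
    /* antisense strand: complemented base enters at the high end */
    w->rev = (w->rev >> 2) | ((uint32_t)(3 - code) << (2 * (k - 1)));
    if (w->filled < k)
        w->filled++;
    return w->filled == k;
}

/* Encode a k-mer string directly; only used to check the rolling update. */
static uint32_t encode_kmer(const char *kmer, int k)
{
    uint32_t block = 0;
    for (int i = 0; i < k; i++)
        block = (block << 2) | (uint32_t)base_code(kmer[i]);
    return block;
}
```

In a counting pass, a full window would then increment both count_array[w.fwd] and count_array[w.rev]. Since 15 is odd, a 15-mer can never equal its own reverse complement, so the two indices always differ.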

