Wednesday, March 24, 2010

Genome sequencing of Bacteria

I have shifted from eukaryotic to prokaryotic research, and I am now working in the Molecular Microbiology Ecology Group at the University of Copenhagen. The MME-group is part of The Copenhagen High-Throughput Sequencing Center and houses 2 Roche FLX instruments and 2 Illumina Genome Analyzers. We will be sequencing bacterial genomes and amplicons (part of the 16S ribosomal DNA) for community analysis, as well as metagenomes. So far I have been working on the assembly and annotation of the sequenced bacterial genomes.

Assembly is the process of piecing together sequence reads into longer contiguous regions (contigs). A number of assembly software packages exist, but several of them are leftovers from the shotgun sequencing days (such as the Celera Assembler, Cap3, Arachne, Phrap, etc.) and do not cope well with Next Generation Sequencing data. A number of new assemblers look very promising, but there are currently no open-source assemblers that can deal with mixed data types, i.e. Solexa and 454 reads. While Mira can theoretically handle mixed data types, in reality the memory usage gets enormous, and the results suffer if the coverage is too high. The alternative is to shred the longer reads into Solexa-sized pieces and use Velvet. However, it seems quite illogical to reduce sequence sizes before increasing them! Also, none of the assemblers use comparative genomics, which should be one obvious path to guided assembly.

No matter what assembler you use, the result is a set of contigs and not a single sequence, because of repetitive elements in the genome. The gaps between contigs can be closed in the laboratory with tedious PCR (a number of different protocols for gap closing exist). Alternatively, if you have a closely related genome sequence, it can be used for scaffolding, that is, placing the contigs in the correct order and orientation.
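To make the read shredding mentioned above concrete, here is a minimal Python sketch of cutting one long read into overlapping, fixed-size fragments. The function name `shred_read`, the fragment size of 36 bases, and the step of 10 bases are all assumptions chosen for illustration; they are not the settings of any particular tool or of our pipeline.

```python
def shred_read(seq, size=36, step=10):
    """Cut one long read into overlapping fragments of a fixed size.

    size and step are illustrative assumptions, not recommended values.
    """
    if len(seq) <= size:
        # Read is already short enough; keep it as a single fragment.
        return [seq]
    starts = list(range(0, len(seq) - size + 1, step))
    # Add a final fragment so the tail of the read is always covered.
    last = len(seq) - size
    if starts[-1] != last:
        starts.append(last)
    return [seq[i:i + size] for i in starts]


read = "ACGT" * 25  # a made-up 100-base read standing in for a 454 read
fragments = shred_read(read, size=36, step=10)
print(len(fragments), "fragments of length", len(fragments[0]))
```

A smaller step gives more overlap between fragments (and more redundant data), while a larger step gives less. Either way, the long-range information in the original read is thrown away, which is exactly why the approach feels backwards.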

After assembly, the next task is to annotate the genome, and luckily this process has been made easy by the RAST annotation service. The resulting annotation is of high quality and can be viewed in the RAST annotation viewer. However, if you want to view the sequence data along with the annotation, you want a real genome browser like the UCSC Genome Browser, Ensembl, or GBrowse. Unfortunately, these browsers are not optimal for browsing bacterial draft genomes, and it is very difficult to upload custom genomes and annotation tracks while at the same time controlling which users are allowed access to what data. Therefore I decided to write my own genome browser, which I call the Biopieces Genome Browser. It is described in more detail here, but a screenshot can be seen below: