Shorty Assembler - Improvements


Running Shorty with Test Data (fixed configuration issues in the assembler)

The previous version of Shorty had many configuration issues and did not execute properly; running it on the sample data produced many errors. We tracked down the errors and changed the relevant header file descriptions, paths, and some problems in the code so that it can be built and run with the latest C++ compilers. In addition, the Read Me file in the assembler did not explain how to run geographer properly. Using the MUMmer software and the scripts file that was mentioned, we fixed that as well. We wrote a new Read Me file to go with the small code changes and put both on the website, so that these issues do not arise when others use Shorty in the future. Following the new Read Me document, Shorty works properly on the test data we collected from Charles. Running it this way and using MUMmer, we obtained the outputs in graph format. Download the latest Shorty code (with the newly updated Read Me file) here.


Comparison with Velvet made easy. Download here

Velvet is another short-read genome assembler, which uses de Bruijn graphs to assemble sequencing data. We downloaded the Velvet assembler from its website. The Velvet manual explains how to run it on various kinds of sequencing data; we followed it and ran Velvet on the test data. However, all of the results were in .txt format and hard to interpret, so we again used MUMmer for Velvet, in the same way as for Shorty. The process worked, and the results are represented as graphs. This makes comparing Shorty and Velvet much easier; we used these results to develop a comparison chart.
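For readers unfamiliar with de Bruijn graphs, the following minimal Python sketch illustrates the basic idea: every k-mer in a read becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is an illustration only and not Velvet's implementation; the function name build_de_bruijn and the toy reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a dict mapping each (k-1)-mer to the set of (k-1)-mers that follow it.

    Every k-mer found in a read contributes one edge: prefix -> suffix.
    Reads shorter than k contribute nothing.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (hypothetical data).
toy_graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
for node in sorted(toy_graph):
    print(node, "->", sorted(toy_graph[node]))
```

Overlapping reads share nodes, so a path through the graph spells out a longer sequence than any single read. A real assembler additionally tracks edge multiplicity, reverse complements, and sequencing errors, none of which are shown here.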


Data Collection

We collected various types of data (Solexa/Illumina, Helicos, ABI SOLiD). Although many kinds of sequencing data are available, the data we needed, such as Solexa/Illumina paired-end read data and Helicos BioSciences data smaller than 5 GB, is rarely available. We searched for these kinds of data and found some. See the references section for more details.


Conversion between data formats

After collecting the data, we converted the different formats into FASTA files, since Shorty works on FASTA files only. ABI SOLiD data is available only as csfasta files; similarly, Solexa data is available in FASTQ format. To run either on Shorty, it has to be converted to FASTA format. We found some scripts online and modified them for this purpose. The scripts convert between the sequencing data formats and can be executed using the BioPerl tools. Using these scripts, we converted all of the downloaded data to FASTA format. Click here to download these scripts.
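The conversions described above can be sketched in a few lines of Python. This is not the BioPerl scripts used in the project, only a minimal illustration under stated assumptions: FASTQ records are assumed to occupy exactly four lines each, and a colour-space read is assumed to start with a primer base followed by the digits 0 to 3. The function names fastq_to_fasta and decode_colorspace are hypothetical.

```python
import io

# SOLiD two-base encoding: _NEXT_BASE[current_base][colour] gives the next base.
_NEXT_BASE = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Write each 4-line FASTQ record as a 2-line FASTA record; return the record count."""
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"malformed FASTQ header: {header!r}")
        sequence = fastq_handle.readline().strip()
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, dropped because FASTA has no qualities
        fasta_handle.write(">" + header[1:].strip() + "\n" + sequence + "\n")
        count += 1
    return count


def decode_colorspace(read):
    """Translate one colour-space read (primer base + colours) into bases.

    The primer base itself is not part of the decoded sequence.
    """
    if not read or read[0] not in _NEXT_BASE:
        raise ValueError(f"read must start with a primer base: {read!r}")
    current = read[0]
    bases = []
    for colour in read[1:]:
        if colour not in "0123":
            raise ValueError(f"unsupported colour {colour!r} in {read!r}")
        current = _NEXT_BASE[current][int(colour)]
        bases.append(current)
    return "".join(bases)


# Toy input (hypothetical data).
fastq = io.StringIO("@read1\nACGT\n+\nIIII\n")
fasta = io.StringIO()
fastq_to_fasta(fastq, fasta)
print(fasta.getvalue(), end="")
print(decode_colorspace("T32110"))
```

Because each decoded base depends on the previous one, a single miscalled colour changes every base after it, so reads decoded this way should be treated with care.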


ABI SOLiD on Shorty and Velvet - Comparison

Solexa data works with Velvet in either FASTQ or FASTA format. The Velvet assembler uses a different configuration for the Solexa and ABI data types, while Shorty has limitations regarding Solexa data. After the FASTQ files are converted to FASTA format, a seed file is still needed to run Solexa data on Shorty. Also, Shorty works only on paired-end reads, which are difficult to find in Solexa data. Click here to view the results.


Solexa on Shorty and Velvet - Comparison

Solexa data works with Velvet in either FASTQ or FASTA format. The Velvet assembler uses a different configuration for the Solexa and ABI data types, while Shorty has limitations regarding Solexa data. After the FASTQ files are converted to FASTA format, a seed file is still needed to run Solexa data on Shorty. Also, Shorty works only on paired-end reads, which are difficult to find in Solexa data. Click here to view the results.


Performance Improvements

We used gprof, the open-source GNU profiler, to profile the Shorty assembler and find bottlenecks in the code. The bottlenecks identified from the gmon output are shown below.

Click here to view the performance results output.

We optimized the performance of the above methods by inlining the functions they call. One such example is the function that returns the base for the corresponding alphabet. Other places where the program lags are printing the contigs to standard output and writing the additional information produced during assembly, such as the bambus-contig file. We added debug statements to the code so that these values are printed only when the debug option is set to on. This value can be configured in the config file.
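The debug-gated output can be illustrated with a short Python sketch. Shorty itself is written in C++, and its actual config syntax is not reproduced here; the "debug = on" line format and the names parse_debug_flag and write_contigs are assumptions made purely for illustration.

```python
import io
import sys


def parse_debug_flag(config_text):
    """Return True if the config text contains a line of the (assumed) form 'debug = on'."""
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "debug":
            return value.strip().lower() == "on"
    return False  # debug output is off unless explicitly enabled


def write_contigs(contigs, debug, out=sys.stdout):
    """Write contigs in FASTA-like form only when debug is enabled."""
    if not debug:
        return  # skip the expensive output entirely in normal runs
    for index, contig in enumerate(contigs):
        out.write(f">contig_{index}\n{contig}\n")


# Toy usage (hypothetical config and contigs).
debug = parse_debug_flag("debug = on\n")
buffer = io.StringIO()
write_contigs(["ACGT", "GG"], debug, buffer)
print(buffer.getvalue(), end="")
```

Checking the flag once and returning early keeps the normal code path free of any formatting or I/O cost, which is the point of gating the output.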

In addition to the above optimizations, there is scope for more. We observed the following:
1. Most of the time is spent writing the contig output and the geography information of the contig.
2. The code currently takes only one seed; it could be improved further by providing more seeds to the assembler.


Documentation

As mentioned earlier, we created a new Read Me file. The document is up to date and clearly demonstrates how to run Shorty on both the test data and the collected data. The file, along with the modified Shorty project, can be downloaded from the website. Click here to download the new Read Me file.