Efficient Cache-Oblivious String Algorithms for Bioinformatics

Rezaul Alam Chowdhury, Hai-Son Le, and Vijaya Ramachandran

Technical Report TR-07-03
The University of Texas at Austin, Department of Computer Sciences, Feb. 2007, 26 pages


We present theoretical and experimental results on cache-efficient and parallel algorithms for some well-studied string problems in bioinformatics:

1. Pairwise alignment. Optimal pairwise global sequence alignment using affine gap penalty;

2. Median. Optimal alignment of three sequences using affine gap penalty;

3. RNA secondary structure prediction. Maximizing number of base pairs in RNA secondary structure with simple pseudoknots.

For each of these three problems we present cache-oblivious algorithms that match the best-known time complexity, match or improve the best-known space complexity, and improve significantly over the cache-efficiency of earlier algorithms. We also show that these algorithms are easily parallelizable, and we analyze their parallel performance.

We present experimental results that show that all three cache-oblivious algorithms run faster on current desktop machines than the best previous algorithm for the problem. For the first two problems we compare our code to publicly available software written by others, and for the last problem our comparison is to our implementation of Akutsu's algorithm. We also include empirical results showing good performance for the parallel implementations of our algorithms for the first two problems.

Our methods are applicable to several other dynamic programs for string problems in bioinformatics including local alignment, generalized global alignment with intermittent similarities, multiple sequence alignment under several scoring functions such as ‘sum-of-pairs’ objective function and RNA secondary structure prediction with simple pseudoknots using energy functions based on adjacent base pairs.

Download: PSPDF