
1. Computational Biology

Combinatorial chemistry is one of the most exciting recent developments in drug discovery, the foundations of which rest on the systematic synthesis and screening of large libraries of small molecules. In the standard one-bead, one-compound approach to combinatorial chemistry, a large number of distinct small molecules are fabricated on distinct resin beads. These beads are reacted against a target to establish which beads contain biologically active agents. The compound associated with a positively-reacting bead can then be identified via sequencing or indirect coding methods.

We propose a new, algorithmically intensive technology for fabricating libraries for combinatorial chemistry--optimized split-synthesis--and evaluate its potential through extensive simulation. Our method holds promise for synthesizing libraries which are (1) too large to reasonably employ sequential synthesizers (say, greater than 50 distinct molecules), (2) too long or irregular for conventional split-synthesis generation techniques, and (3) not needed in production quantities, making array-based synthesis techniques economically unviable.
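For contrast, the conventional split-and-pool process that our optimized variant improves upon can be simulated in a few lines. This is a sketch of the standard technique only (not of optimized split-synthesis); the function name and parameters are illustrative, and residue choices are made uniformly at random:

```python
import random

def split_and_pool(n_beads, residues, rounds, rng=random):
    """Simulate standard one-bead one-compound split-and-pool synthesis:
    each round, beads are split at random among reaction vessels, one
    residue is coupled per vessel, and the beads are pooled again.
    Returns the compound (residue string) carried by each bead."""
    beads = [""] * n_beads
    for _ in range(rounds):
        # splitting at random and coupling one residue per vessel is
        # equivalent to appending a uniformly random residue to each bead
        beads = [b + rng.choice(residues) for b in beads]
    return beads
```

After r rounds over c residues the library spans c^r possible compounds, while each bead carries exactly one of them; the optimized variant would instead choose the splits algorithmically.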

Identifying gene regulatory networks from experimental data is an area of extremely active research. New experimental technologies in molecular biology make it possible to quickly obtain vast amounts of data on gene expression in a particular organism under particular conditions. Assigning functions to genes on the basis of this flood of data is an important and challenging problem.

We propose a new computational methodology for making sense of large, multiple time-series data sets arising in expression analysis, and evaluate it both theoretically and through a case study. First, we build a graph representing all putative activation/inhibition relationships by analyzing the expression profiles for all pairs of genes. Second, we prune this graph by solving a combinatorial optimization problem to identify a small set of interesting candidate regulatory elements. We do not assert that we identify ``the'' regulatory network as a result of this computation. However, our approach quickly enables biologists to identify and visualize interesting features from raw expression array data sets.
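The first stage above can be sketched concretely. The function below builds putative activation/inhibition edges by testing lagged correlation between all pairs of expression profiles; the function name, the use of plain lagged correlation, and the threshold are illustrative assumptions, and the combinatorial pruning stage is not shown:

```python
import numpy as np

def putative_edges(expr, lag=1, threshold=0.95):
    """Build a putative activation/inhibition graph from time-series
    expression data (a sketch of the graph-building stage only).
    expr: dict mapping gene name -> expression levels over time.
    Returns edges (a, b, sign) where b's profile tracks a's, shifted
    forward by `lag` time steps."""
    edges = []
    genes = list(expr)
    for a in genes:
        for b in genes:
            if a == b:
                continue
            x = np.asarray(expr[a][:-lag], dtype=float)
            y = np.asarray(expr[b][lag:], dtype=float)
            r = np.corrcoef(x, y)[0, 1]
            if abs(r) >= threshold:
                edges.append((a, b, "activates" if r > 0 else "inhibits"))
    return edges
```

The resulting graph is typically dense, which is exactly why the second, pruning stage is needed to surface a small set of candidate regulatory elements.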

We have implemented our methodology and applied it to the analysis of Saccharomyces cerevisiae data sets, in collaboration with yeast biologists. A natural objective criterion suggests an interesting combinatorial problem, the maximum gene regulation problem, for which we present several algorithmic and complexity results.

Sequencing by hybridization (SBH) is a new and promising approach to DNA sequencing which offers the potential of reduced cost and higher throughput compared to traditional gel-based approaches. In SBH, the target sequence is reconstructed by identifying which oligonucleotides are and are not substrings of the target.
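Classical (non-interactive) SBH reconstruction can be phrased as finding an Eulerian path in a de Bruijn graph built from the hybridization spectrum. A minimal sketch, assuming a complete, error-free spectrum of distinct k-mers (the interactive protocol itself is not shown):

```python
from collections import defaultdict

def sbh_reconstruct(spectrum):
    """Reconstruct a sequence from its full k-mer spectrum by finding
    an Eulerian path in the de Bruijn graph on (k-1)-mers.
    Assumes a complete, error-free spectrum of distinct k-mers."""
    graph = defaultdict(list)   # (k-1)-mer -> list of successor (k-1)-mers
    indeg = defaultdict(int)
    for kmer in spectrum:
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        indeg[v] += 1
    # path start: a node whose out-degree exceeds its in-degree,
    # falling back to an arbitrary node for a circular target
    start = next((u for u in list(graph) if len(graph[u]) > indeg[u]),
                 next(iter(graph)))
    # Hierholzer's algorithm for the Eulerian path
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(v[-1] for v in path[1:])
```

When repeated (k-1)-mers make the Eulerian path non-unique, any valid path is a sequence consistent with the spectrum; resolving such ambiguities is one motivation for interaction.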

We have developed a new approach to sequencing by hybridization which uses interaction to dramatically reduce the number of oligonucleotides needed for de novo sequencing of large DNA fragments, while preserving the parallelism that is the primary advantage of SBH. We have recently begun a collaboration with biochemists at Oxford University to perform the first laboratory experiments on interactive SBH, using the Southern Array Maker apparatus for constructing oligonucleotide arrays. Constructing dense arrays of oligonucleotides is essential for practical ISBH, and requires the solution of an interesting combinatorial problem on fabricating strings via a sequence of row and column operations.

We worked with Dr. William Studier's group at Brookhaven National Laboratory on their project to sequence the one-megabase genome of Borrelia burgdorferi, the bacterium which causes Lyme disease. Recent improvements by the Brookhaven group in producing primer oligonucleotides make it practical to sequence large stretches directly via primer walking. This improved strategy uses incremental sequencing, i.e., primer walking of clones deemed important to fragment assembly following an initial shotgun stage.

We have developed a fast sequence assembler, STROLL, based on exact-match overlap detection, which proved capable of assembling the first 3,800 clones of the Borrelia burgdorferi project within ten minutes on a Sparc1000. By comparison, the Brookhaven group's previous assembler program took approximately fifteen hours on the same data. Using our assembler, the Brookhaven group completed its sequencing of Borrelia in August 1997.
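The exact-match overlap criterion at the heart of such an assembler can be sketched in a few lines. This is illustrative only: STROLL's actual indexing, which makes the test fast enough for thousands of clones, is not shown, and the function name and `min_len` cutoff are assumptions:

```python
def exact_overlap(a, b, min_len=10):
    """Length of the longest suffix of read a that exactly matches a
    prefix of read b, or 0 if no overlap of at least min_len exists.
    Naive scan; a production assembler would use an index instead."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0
```

Requiring exact matches (rather than scored alignments) is what makes the overlap phase so fast, at the cost of sensitivity to sequencing errors.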

Borrelia was the first large genome simultaneously sequenced by two independent groups. We took advantage of this unique event to study the performance of three different fragment assemblers on data obtained using two different sequencing strategies on the same organism.

We consider the problem of determining the three-dimensional folding of a protein given its one-dimensional amino acid sequence. We use the HP model of protein folding proposed by Dill, which models a protein as a chain of amino acid residues that are either hydrophobic or polar, with hydrophobic interactions as the dominant initial driving force of folding. We examine the choice of lattice by considering its algorithmic and geometric implications, and argue that the triangular lattice is a more reasonable choice. We present a set of folding rules for the triangular lattice and analyze the approximation ratio they achieve. After describing the biological foundation for this generalization, we show that in the new model we are able to achieve constant-factor approximation guarantees on the triangular lattice similar to those achieved in the standard HP model.
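The objective our folding rules approximate can be made concrete: in the HP model, the quality of a fold is the number of hydrophobic contacts it creates. A minimal scoring sketch on the triangular lattice, using axial coordinates with six neighbor directions (the folding rules themselves are not shown; the representation is an illustrative choice):

```python
# axial coordinates: the six neighbor directions of the triangular lattice
DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]

def hp_energy(sequence, fold):
    """Score an HP-model fold on the triangular lattice.
    sequence: string over {'H','P'}; fold: one lattice coordinate
    per residue, forming a self-avoiding walk.
    Energy = -(number of H-H contacts between non-consecutive residues)."""
    assert len(sequence) == len(fold)
    pos = {p: i for i, p in enumerate(fold)}
    assert len(pos) == len(fold), "fold must be self-avoiding"
    contacts = 0
    for i, (x, y) in enumerate(fold):
        if sequence[i] != 'H':
            continue
        for dx, dy in DIRS:
            j = pos.get((x + dx, y + dy))
            # count each contact once, skipping chain-adjacent residues
            if j is not None and j > i + 1 and sequence[j] == 'H':
                contacts += 1
    return -contacts
```

The six-neighbor geometry is what makes the triangular lattice attractive: it avoids the parity problem of the square lattice, where residues at even and odd chain positions can never be adjacent.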

Due to applications in reconstructing the evolutionary history of the genome, there has been considerable recent interest in problems of sorting permutations with reversals. We study the problem of sorting permutations and circular permutations using as few fixed-length reversals as possible. For all values of n and k, we characterize the number of connected components of the underlying group. To bound the diameter of the group, we give an algorithm to sort all sortable circular permutations in $O(n^{2}/k+nk)$ reversals, and show that $\Omega(n^{2}/k^{2}+n)$ reversals are necessary.
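The component structure is easy to observe experimentally for small cases. The sketch below applies length-k reversals to (linear) permutations and enumerates everything reachable from the identity by breadth-first search; the function names are illustrative, and this brute force stands in for, rather than reproduces, our characterization:

```python
def reverse_k(perm, i, k):
    """Reverse the length-k window of perm starting at position i."""
    return perm[:i] + perm[i:i + k][::-1] + perm[i + k:]

def reachable(n, k):
    """All permutations of 0..n-1 reachable from the identity using
    only length-k reversals (BFS over the generated group)."""
    start = tuple(range(n))
    seen = {start}
    frontier = [start]
    while frontier:
        nxt = []
        for p in frontier:
            for i in range(n - k + 1):
                q = reverse_k(p, i, k)
                if q not in seen:
                    seen.add(q)
                    nxt.append(q)
        frontier = nxt
    return seen
```

For example, length-2 reversals (adjacent transpositions) reach all n! permutations, while a length-3 reversal swaps positions i and i+2, so for n=4 only 4 of the 24 permutations are sortable.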

We present a new, practical algorithm to resolve the experimental data in restriction site analysis, which is a common technique for mapping DNA. Specifically, we assert that multiple digestions with a single restriction enzyme can provide sufficient information to identify the positions of the restriction sites with high probability. The motivation for the new approach comes from combinatorial results on the number of mutually homometric sets in one dimension, where two sets of n points are homometric if the multiset of n(n-1)/2 distances they determine is the same. Since experimental data contains error, we propose algorithms for reconstructing sets from noisy interpoint distances, including the possibility of missing fragments. We analyze the performance of these algorithms under a reasonable probability distribution, establishing a relative error limit of $r = \Theta( 1/ n^2)$ beyond which our technique becomes infeasible. Through simulations, we establish that our technique is robust enough to reconstruct data with relative errors of up to 7.0% in the measured fragment lengths for typical problems, which appears sufficient for certain biological applications.
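The exact-data version of this reconstruction task is the classical turnpike problem, which admits a simple backtracking solution. The sketch below handles error-free distances only; our noise-tolerant algorithms extend this idea with error intervals and missing-fragment handling, which are not shown:

```python
from collections import Counter

def turnpike(distances):
    """Reconstruct a one-dimensional point set from the multiset of its
    pairwise distances (classical turnpike backtracking, exact data)."""
    D = Counter(distances)
    width = max(D)
    _take(D, width)                 # the largest distance fixes the endpoints
    result = _place(D, [0, width], width)
    return sorted(result) if result else None

def _take(D, d):
    D[d] -= 1
    if D[d] == 0:
        del D[d]

def _fits(D, dists):
    need = Counter(dists)
    return all(D[d] >= need[d] for d in need)

def _place(D, points, width):
    if not D:
        return list(points)
    d = max(D)
    # the largest remaining distance must reach from 0 or from width
    for candidate in (d, width - d):
        dists = [abs(candidate - p) for p in points]
        if _fits(D, dists):
            for x in dists:
                _take(D, x)
            points.append(candidate)
            result = _place(D, points, width)
            if result:
                return result
            points.pop()            # backtrack: restore the distances
            for x in dists:
                D[x] += 1
    return None
```

Any homometric mate of the true point set (e.g. its mirror image) is an equally valid answer, which is exactly the ambiguity the homometric-set results quantify.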


Steve Skiena
1999-12-04