Deconvolving Sequence Variation in Mixed DNA Populations notes: Steven Skiena sessions: April 13 and 20, 2001 ---------------------------------------------------------------------- DNA sequencing machines (including the model under development in the EE department at Stony Brook) plot curves registering the amount of each of the four nucleotide bases as a function of sequence position. If the DNA sample is homogeneous, consisting of amplified copies of a single DNA sequence, then the largest peaks at each position completely define the sequence. Such machines can also be used to analyze sequence from heterogenious populations. Possible applications include p53 analysis of tumor tissues, HIV populations, and mitochondrial DNA analysis. If the population is comprised of two sequences which differ only in substituting one base position (a single nucleotide polymorphism or SNP), then reconstructing the sequences is easy: look for the single base position with two peaks. But how can we deconvolve more complicated mutation populations, with insertion and deletion operations? Input: A wildtype sequence $S$, a set $V$ of allowable variations on $S$, and an experimental profile, consisting of an ordered sequence of $n$ subsets on alphabet $\Sigma$. Output: The smallest subset $V' \in V$ such that the character position subsets of $V'$ and $S$ yield exactly the input profile. In a more general problem, we might be given the fraction of each base at each position and have to reconstuct the frequency of each population. We consider the problem for increasingly general sets of allowable variations: Single character substititions ------------------------------ There are $(\alpha-1) n$ such variation sequences. If only single character substititions are allowed, there is only one possible way to create each additional profile character. Thus the algorithmic problem is trivial. Single k-string substitutions ----------------------------- Each variation differs from $S$ by replacing a substring of length $k$ with a different substring (although not all $k$ characters have to change). This can be solved optimally using a left to right greedy algorithm. Start with the leftmost uncovered profile character, and select the string which covers them and as large a set of other uncovered profile characters. Since all $k$-string replacements are available, no decision precludes any other covering option. Single character deletion ------------------------- Each variation differs from $S$ in the deletion of one character position, thus there are $n$ possible variation sequences. Any realizable profile can be covered using $S$ and at most one other sequence. Observe that any two varients are identical to the left of the first deletion and to the right of the second deletion. Also observe that $S$ is identical to both sequences to the left of their deletion. 01234567 ------- 1234567 0234567 0134567 0124567 0123567 0123457 0123456 Thus to cover the rightmost uncovered character of the profile, we can select the variant which has the leftmost deletion which is consistant with the rest of the profile. Note that if this does not completely cover the profile, then the profile cannot be covered, as all other consistant profiles share a prefix with $S$. Single range deletion --------------------- Each variation differs from $S$ in the deletion of a contigious subsequence, thus there are $\\binomial{n}{2}$ possible variation sequences. This is the most interesting of the deconvolution problems to date, and is still open. We have some observations, however. First, a generalization of the argument for single character deletion yields: Claim: Let $V'$ be a minimum size solution. Then $V'$ does not contain two strings of length $k$ for any $1 \leq k \leq n$. Proof: As in the example before, any two such strings must be missing character blocks of the same length. Thus they are both identical to the right of the rightmost deletion. Further, the prefix of $S$ is identical to the left of the rightmost deletion. Thus if both are consistant with the profile, the string with the leftmost deletion provides strictly increased covering power. 012345678 --------- 045678 012378 Selecting the proper subset from these candidates is not trivial, consider the following profile AAAAAAAAA A A A A A A A CCCCCCC TTTTTTTT T T T T T T T T GGGGGGGG ------------------------ *****ACGATATATATATATATAT Matching the righmost extra A in the profile can be done using a large number of strings, so a decision is needed. Note that the greedy heuristic gives a log factor approximation: Identify all elements of $V$ consistant with the profile Partition the survivors by length, and select the ones with leftmost deletion in each group. While there are uncovered characters in the profile select the survivor which covers the most remaining profile characters.