Deconvolving Sequence Variation in Mixed DNA Populations

notes: Steven Skiena
sessions: April 13 and 20, 2001

----------------------------------------------------------------------

DNA sequencing machines (including the model under development in the
EE department at Stony Brook) plot curves registering the amount of
each of the four nucleotide bases as a function of sequence position.
If the DNA sample is homogeneous, consisting of amplified copies of a
single DNA sequence, then the largest peaks at each position completely
define the sequence.

Such machines can also be used to analyze sequence from heterogenious
populations.   Possible applications include p53 analysis of tumor tissues,
HIV populations, and mitochondrial DNA analysis.

If the population is comprised of two sequences which differ only
in substituting one base position (a single nucleotide polymorphism
or SNP), then reconstructing the sequences is easy: look for the
single base position with two peaks.  But how can we deconvolve
more complicated mutation populations, with insertion and deletion
operations?


Input: A wildtype sequence $S$, a set $V$ of allowable variations
on $S$, and an experimental profile, consisting of
an ordered sequence of $n$ subsets on alphabet $\Sigma$.

Output: The smallest subset $V' \in V$ such that the character position
subsets of $V'$ and $S$ yield exactly the input profile.

In a more general problem, we might be given the fraction of each base
at each position and have to reconstuct the frequency of each population.

We consider the problem for increasingly general sets of allowable
variations:


Single character substititions
------------------------------
There are $(\alpha-1) n$ such variation sequences.
If only single character substititions are allowed, there is 
only one possible way to create each additional profile character.
Thus the algorithmic problem is trivial.

Single k-string substitutions
-----------------------------
Each variation differs from $S$ by replacing a substring of length
$k$ with a different substring (although not all $k$ characters
have to change).

This can be solved optimally using a left to right greedy algorithm.
Start with the leftmost uncovered profile character, and select
the string which covers them and as large a set of other uncovered
profile characters.  Since all $k$-string replacements are available,
no decision precludes any other covering option.

Single character deletion
-------------------------
Each variation differs from $S$ in the deletion of one character
position, thus there are $n$ possible variation sequences.

Any realizable profile can be covered using $S$ and at most one
other sequence.   Observe that any two varients are identical to the
left of the first deletion and to the right of the second deletion.
Also observe that $S$ is identical to both sequences to the left
of their deletion.

		01234567
		-------
		1234567
		0234567
		0134567
		0124567
		0123567
		0123457
		0123456

Thus to cover the rightmost uncovered character of the profile, we
can select the variant which has the leftmost deletion which
is consistant with the rest of the profile.  Note that if this
does not completely cover the profile, then the profile cannot be
covered, as all other consistant profiles share a prefix with $S$.


Single range deletion
---------------------
Each variation differs from $S$ in the deletion of a contigious
subsequence, thus there are $\\binomial{n}{2}$ possible variation
sequences.

This is the most interesting of the deconvolution problems to date,
and is still open.  We have some observations, however.

First, a generalization of the argument for single character deletion
yields:

Claim: Let $V'$ be a minimum size solution.  Then $V'$ does not
contain two strings of length $k$ for any $1 \leq k \leq n$.

Proof: As in the example before, any two such strings must be missing
character blocks of the same length.  Thus they are both identical to the
right of the rightmost deletion.  Further, the prefix of $S$ is
identical to the left of the rightmost deletion.  Thus if both are
consistant with the profile, the string with the leftmost deletion
provides strictly increased covering power.

			012345678
			---------
			045678
			012378

Selecting the proper subset from these candidates is not trivial,
consider the following profile

AAAAAAAAA A A A A A A A
CCCCCCC
TTTTTTTT T T T T T T T T
GGGGGGGG
------------------------
*****ACGATATATATATATATAT

Matching the righmost extra A in the profile can be done using a large
number of strings, so a decision is needed.

Note that the greedy heuristic gives a log factor approximation:

	Identify all elements of $V$ consistant with the profile
	Partition the survivors by length, and select the ones
		with leftmost deletion in each group.
	While there are uncovered characters in the profile
		select the survivor which covers the most remaining
			profile characters.