Identifying Gene Regulatory Networks
	------------------------------------

I start below with the biological motivation -- stick with it because
at the end is an abstract problem which seems quite pretty and natural.

New experimental techologies in molecular biology (particularly oligonucleotide
arrays) are now making it possible to quickly obtain vast amounts of data
on gene expression in a particular organism under particular conditions. 
Several groups are experimenting with an Affymetrix array containing 6000 
different genes for yeast, measuring how strongly each gene is expressed
as a function of time.  Making sense of this huge amount of data is an
important computational problem.

Pat Brown at Stanford has made public a time series dataset of 7 samples
for each gene at http://cmgm.stanford.edu/pbrown/explore/array.txt
More general information is available at
http://cmgm.stanford.edu/pbrown/explore/index.html
Ron Davis apparently has a time series dataset of 17 samples prepared
for eventual release.

The primary interest in analyzing this data is to discover "gene regulatory
networks".  Certain genes "turn on" other genes (activators), while certain
genes "turn off" other genes (inhibitors).  Only a small number of genes
function as activators or inhibitors, but identifying them is a very important
and complicated problem.

By analyzing the expression data, we can determine whether gene A is a
candidate activator or inhibitor of gene B.  If the leading edge of A's
growth curve occurs before the leading edge of B's growth curve, A is
a candidate activator of B.  If the leading edge of A's growth curve
occurs before the trailing edge of B's growth curve, A is a candidate 
inhibitor of B.  Most of these relations are spurious -- so we seek to
identify (or at least suggest) the real regulatory network from the data.

------------------- The Problem ---------------------------------------

Suppose that we are given a directed graph, where the vertices are genes,
and each edge is labeled as a candidate activator or inhibitor (or both).
We seek to delete excess edges from the graph, such that

(1) every vertex has at least one activator and one inhibitor edge into it.

(2) the remaining edges from every vertex are either all activators or
all inhibitors.  (implying that no single gene functions in both roles)

Questions:
----------

(1) What is the complexity of this problem for general graphs and labelings?
(big hint: I can show it is NP-complete with a reduction from 3-SAT)

(2) What is the complexity of this problem for directed acyclic graphs?

(3) What is the complexity of this problem if each edge is exclusively
either an inhibitor or an activator?

(4) What other formulations are interesting either computationally or
biologically, or both?

(5) What might be an interesting heuristic algorithm, which we can use to
actually experiment with the data?

I got this problem in discussions with Ting Chen (tchen@salt2.med.harvard.edu)
at RECOMB 98.

------------------------------------------------------------------

	NOTES FROM 3/27/98

Saurabh observed that the following gadget for the boolean variable
in my SAT reduction suffices to prove hardness for DAGs with exclusive
edges (i.e. solving (2) and (3) above).  Each variable V will be represented
by 4 vertices (a,b,c,d) with the following edges:

	(a,c), (b,c) -- both activators
	(b,d), (c,d) -- both inhibitors

Further c has an inhibtor input, while d has an activator input.  The only
way to satisfy c and d is to make a and b of different types, corresponding
to true and false.

This now motivates the following approximation algorithm problem:

(6) Given a directed graph with (A/I) labeled edges, assign each vertex
an A or I label so as to maximize the number of vertices with both input
A and I labels, after deleting all whose label differs from its parent
vertex.

Perhaps a randomized strategy can guarantee a constant factor approximation?


------------------------------------------------------------------
 
        NOTES FROM 4/3/98

We considered approximation algorithms to maximize the number of genes
with both A and I edges (problem 6 above).  

We observed that assigning the label of each vertex to be A or I uniformly
at random yields a 1/4 approximation.  Each vertex that can be expressed
must have at least one A and I edge into it, each of which will be labeled
appropriately with probability 1/2.  By linearity of expectation, this
means we expected to satisfy at least 1/4 of optimal.

Note that this strategy does not even look at the outgoing edges of each
vertex to whether there are in fact any A or I edges!  A seemingly
smarter idea to weight the probability of labeling a vertex A or I according
to the fraction of labeled outgoing edges doesn't work.  Consider a
a complete directed graph on $n$ vertices, with a directed cycle of $n$ I
edges, and the rest of the edges labeled A.  This randomized algorith will
pick roughly 1 vertex to be labeled I, and only activate one gene.  However,
the optimal strategy would be to select all but one vertex to be I, thus
activating $n-1$ genes.

The following weighting scheme was proposed.  Each vertex $x$ will be given
a weight 1 to be assigned to the vertices $v$ with edges into $x$, ie. $(v,x)$.
This weight will be equally divided between the A and I ancestors of $x$.
Specifically, each incoming edge will get weight $1 / (2 d_w)$, where $d_w$ is
the indegree of the appropriate label (A or I).  Thus a vertex with three
incoming edges (two As, one I) will pass back a weight of 1/4 for each of
the A edges and 1/2 for the I edge.  This seems to give us an expected
1/2 approximation on two pathological examples we considered.  Hence:

(7) Does this scheme give us a 1/2 approximation?  Note that it cannot
be better than a 1/2 approximation because of the example of two oppositely
directed $n$-cycles (one A and one I), where a random selection yields
$n/4$ expressed genes while labeling them AAIIAAII.. gives $n/2$.

(8) What about deterministic strategies?  In particular, consider the
case where each vertex has at most two inputs and two outputs.  Can
this be solved optimally in polynomial-time?


------------------------------------------------------------------
 
        NOTES FROM 4/10/98
 

We spent most of our time on problem (8) above, trying for a deterministic
strategy for the two input, two output case.  It seems hopeful that we
can solve this in polynomial time, by taking advantage of the following
constraints:

(a) any vertex with both outputs of identical labels can be set to that label.
(b) any vertex with both inputs of identical labels can be have both input
	edges deleted, since we cannot possibly satisfy it.
(c) any vertex with one output label can be set to that label.
(d) any vertex with one input label can have that input edge deleted.

From here, there may exist vertices with one adjacent labeled vertex
activating the edge, and one unactivated edge.  Question -- without
loss of generality, can we activiate the other edge and still have an
optimal solution?

Another direction of interest was mapping this problem to "min 2-SAT",
finding the truth assignment which minimizes (instead of maximizes) the
number of satisfied clauses.  This problem remains hard -- see:
Kohli & Krishnamurti & Mirchandani, The Minimum Satisfiability Problem,
SIAM J. Discrete Math, v. 7 1994.

The mapping is as follows.  Needing both A and I into a vertex can
be written as the boolean formula (A AND I) = NOT (NOT A OR NOT I).
Thus maximizing the number of AND terms equals minimizing the number of
OR terms.

 
------------------------------------------------------------------
 
        NOTES FROM 4/17/98

The following algorithm very neatly solves the case of indegree/outdegree
at most two networks to optimality.  Construct the following n-vertex graph.
Create an edge (x,y) with label z iff the two input edges to z are (x,z)
and (y,z) and they have opposite A/I labels.  I claim that a maximum matching
in this graph defines an optimal labeling.  Further, because of the restricted
degree conditions, the connected components of this graph can only be paths
and cycles, so the matching can be found in linear time!

This leads to two new questions:

(9) Does there exist a polynomial algorithm for graphs of indegree at most 2?

(10) Does there exist a polynomial algorithm for graphs of outdegree at most 2?