Identifying Gene Regulatory Networks ------------------------------------ I start below with the biological motivation -- stick with it because at the end is an abstract problem which seems quite pretty and natural. New experimental techologies in molecular biology (particularly oligonucleotide arrays) are now making it possible to quickly obtain vast amounts of data on gene expression in a particular organism under particular conditions. Several groups are experimenting with an Affymetrix array containing 6000 different genes for yeast, measuring how strongly each gene is expressed as a function of time. Making sense of this huge amount of data is an important computational problem. Pat Brown at Stanford has made public a time series dataset of 7 samples for each gene at http://cmgm.stanford.edu/pbrown/explore/array.txt More general information is available at http://cmgm.stanford.edu/pbrown/explore/index.html Ron Davis apparently has a time series dataset of 17 samples prepared for eventual release. The primary interest in analyzing this data is to discover "gene regulatory networks". Certain genes "turn on" other genes (activators), while certain genes "turn off" other genes (inhibitors). Only a small number of genes function as activators or inhibitors, but identifying them is a very important and complicated problem. By analyzing the expression data, we can determine whether gene A is a candidate activator or inhibitor of gene B. If the leading edge of A's growth curve occurs before the leading edge of B's growth curve, A is a candidate activator of B. If the leading edge of A's growth curve occurs before the trailing edge of B's growth curve, A is a candidate inhibitor of B. Most of these relations are spurious -- so we seek to identify (or at least suggest) the real regulatory network from the data. ------------------- The Problem --------------------------------------- Suppose that we are given a directed graph, where the vertices are genes, and each edge is labeled as a candidate activator or inhibitor (or both). We seek to delete excess edges from the graph, such that (1) every vertex has at least one activator and one inhibitor edge into it. (2) the remaining edges from every vertex are either all activators or all inhibitors. (implying that no single gene functions in both roles) Questions: ---------- (1) What is the complexity of this problem for general graphs and labelings? (big hint: I can show it is NP-complete with a reduction from 3-SAT) (2) What is the complexity of this problem for directed acyclic graphs? (3) What is the complexity of this problem if each edge is exclusively either an inhibitor or an activator? (4) What other formulations are interesting either computationally or biologically, or both? (5) What might be an interesting heuristic algorithm, which we can use to actually experiment with the data? I got this problem in discussions with Ting Chen (tchen@salt2.med.harvard.edu) at RECOMB 98. ------------------------------------------------------------------ NOTES FROM 3/27/98 Saurabh observed that the following gadget for the boolean variable in my SAT reduction suffices to prove hardness for DAGs with exclusive edges (i.e. solving (2) and (3) above). Each variable V will be represented by 4 vertices (a,b,c,d) with the following edges: (a,c), (b,c) -- both activators (b,d), (c,d) -- both inhibitors Further c has an inhibtor input, while d has an activator input. The only way to satisfy c and d is to make a and b of different types, corresponding to true and false. This now motivates the following approximation algorithm problem: (6) Given a directed graph with (A/I) labeled edges, assign each vertex an A or I label so as to maximize the number of vertices with both input A and I labels, after deleting all whose label differs from its parent vertex. Perhaps a randomized strategy can guarantee a constant factor approximation? ------------------------------------------------------------------ NOTES FROM 4/3/98 We considered approximation algorithms to maximize the number of genes with both A and I edges (problem 6 above). We observed that assigning the label of each vertex to be A or I uniformly at random yields a 1/4 approximation. Each vertex that can be expressed must have at least one A and I edge into it, each of which will be labeled appropriately with probability 1/2. By linearity of expectation, this means we expected to satisfy at least 1/4 of optimal. Note that this strategy does not even look at the outgoing edges of each vertex to whether there are in fact any A or I edges! A seemingly smarter idea to weight the probability of labeling a vertex A or I according to the fraction of labeled outgoing edges doesn't work. Consider a a complete directed graph on $n$ vertices, with a directed cycle of $n$ I edges, and the rest of the edges labeled A. This randomized algorith will pick roughly 1 vertex to be labeled I, and only activate one gene. However, the optimal strategy would be to select all but one vertex to be I, thus activating $n-1$ genes. The following weighting scheme was proposed. Each vertex $x$ will be given a weight 1 to be assigned to the vertices $v$ with edges into $x$, ie. $(v,x)$. This weight will be equally divided between the A and I ancestors of $x$. Specifically, each incoming edge will get weight $1 / (2 d_w)$, where $d_w$ is the indegree of the appropriate label (A or I). Thus a vertex with three incoming edges (two As, one I) will pass back a weight of 1/4 for each of the A edges and 1/2 for the I edge. This seems to give us an expected 1/2 approximation on two pathological examples we considered. Hence: (7) Does this scheme give us a 1/2 approximation? Note that it cannot be better than a 1/2 approximation because of the example of two oppositely directed $n$-cycles (one A and one I), where a random selection yields $n/4$ expressed genes while labeling them AAIIAAII.. gives $n/2$. (8) What about deterministic strategies? In particular, consider the case where each vertex has at most two inputs and two outputs. Can this be solved optimally in polynomial-time? ------------------------------------------------------------------ NOTES FROM 4/10/98 We spent most of our time on problem (8) above, trying for a deterministic strategy for the two input, two output case. It seems hopeful that we can solve this in polynomial time, by taking advantage of the following constraints: (a) any vertex with both outputs of identical labels can be set to that label. (b) any vertex with both inputs of identical labels can be have both input edges deleted, since we cannot possibly satisfy it. (c) any vertex with one output label can be set to that label. (d) any vertex with one input label can have that input edge deleted. From here, there may exist vertices with one adjacent labeled vertex activating the edge, and one unactivated edge. Question -- without loss of generality, can we activiate the other edge and still have an optimal solution? Another direction of interest was mapping this problem to "min 2-SAT", finding the truth assignment which minimizes (instead of maximizes) the number of satisfied clauses. This problem remains hard -- see: Kohli & Krishnamurti & Mirchandani, The Minimum Satisfiability Problem, SIAM J. Discrete Math, v. 7 1994. The mapping is as follows. Needing both A and I into a vertex can be written as the boolean formula (A AND I) = NOT (NOT A OR NOT I). Thus maximizing the number of AND terms equals minimizing the number of OR terms. ------------------------------------------------------------------ NOTES FROM 4/17/98 The following algorithm very neatly solves the case of indegree/outdegree at most two networks to optimality. Construct the following n-vertex graph. Create an edge (x,y) with label z iff the two input edges to z are (x,z) and (y,z) and they have opposite A/I labels. I claim that a maximum matching in this graph defines an optimal labeling. Further, because of the restricted degree conditions, the connected components of this graph can only be paths and cycles, so the matching can be found in linear time! This leads to two new questions: (9) Does there exist a polynomial algorithm for graphs of indegree at most 2? (10) Does there exist a polynomial algorithm for graphs of outdegree at most 2?