Although biologists understand the basic mechanisms through which DNA produces biochemical behavior, they have not yet determined most of the regulatory networks that control the degree to which each gene is expressed. In collaboration with researchers from the Center for Computational Astrobiology at NASA Ames Research Center and the Department of Plant Biology at the Carnegie Institution of Washington, we are developing computational tools to assist the scientist with the formulation, evaluation, refinement, and discovery of causal theories for genomic and proteomic regulatory networks. Our approach combines ideas from qualitative physics, theory revision, and statistical analysis.
We are developing an interactive environment that lets biologists state qualitative models, encode background knowledge, and evaluate models in the context of this knowledge and expression data, and that suggests model revisions that are biologically plausible. We believe that biologists will find such an interactive environment more attractive than more automated methods that leave no room for input from the scientist. As part of this framework, we are developing tools that will let users visualize their models and how expression data relate to those models.
A key aspect of our work is a focus on causal relationships. While many conventional data analysis techniques, such as clustering and classical hypothesis testing, can reveal associational patterns between genes or proteins, cause-effect relationships are often the primary interest of the microbiologist. Unlike associations alone, causal relationships predict the effects of hypothetical interventions. For example, if gene A regulates gene B, their expression levels are likely to be associated; however, while directly interfering with gene A would be expected to change gene B's expression level, interfering with gene B would not be expected to (directly) change gene A's expression level. A causal description is generally necessary to capture such phenomena. To address the data-analysis needs of the biologist more directly, we are creating tools to assist with the evaluation of evidence for causal relationships. For the microbiologist, such tools are directly relevant to the discovery and refinement of biochemical, regulatory, and signaling pathways.
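The asymmetry in the gene A and gene B example can be illustrated with a small simulation. This is only a sketch under stated assumptions: the linear form, the coefficient 2.0, the noise levels, and the names `simulate`, `do_a`, and `do_b` are all hypothetical and are not taken from our tools.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n, do_a=None, do_b=None, seed=0):
    """Sample n cells from a toy model in which gene A up-regulates gene B.

    Passing do_a or do_b clamps that gene to a fixed level, which stands in
    for a hypothetical intervention such as a knockout (level 0.0).
    """
    rng = random.Random(seed)
    a_levels, b_levels = [], []
    for _ in range(n):
        # A has no parents in this toy model (assumed distribution).
        a = rng.gauss(1.0, 0.5) if do_a is None else do_a
        # B depends on A unless it is clamped by an intervention.
        b = (2.0 * a + rng.gauss(0.0, 0.5)) if do_b is None else do_b
        a_levels.append(a)
        b_levels.append(b)
    return a_levels, b_levels


# Observational data: A and B are strongly associated.
a_obs, b_obs = simulate(5000)
corr_obs = pearson(a_obs, b_obs)

# Knocking out A shifts B, because B is causally downstream of A.
_, b_after_a_ko = simulate(5000, do_a=0.0)
mean_b_obs = statistics.fmean(b_obs)
mean_b_after_a_ko = statistics.fmean(b_after_a_ko)

# Knocking out B leaves A's distribution unchanged.
a_after_b_ko, _ = simulate(5000, do_b=0.0)
mean_a_obs = statistics.fmean(a_obs)
mean_a_after_b_ko = statistics.fmean(a_after_b_ko)

print(f"corr(A, B) observed:        {corr_obs:.2f}")
print(f"mean B: observed vs A-KO:   {mean_b_obs:.2f} vs {mean_b_after_a_ko:.2f}")
print(f"mean A: observed vs B-KO:   {mean_a_obs:.2f} vs {mean_a_after_b_ko:.2f}")
```

The observed correlation alone is symmetric in A and B; only the two intervention runs distinguish "A regulates B" from "B regulates A".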
Several types of information are relevant for detecting evidence of causation in experimental data. These include conditional independence or partial correlation measures, time precedence, comparison of response under different experimental interventions, comparison of response by mutant organisms, and domain-specific background knowledge. Bringing these together in a principled way is a primary challenge and a key contribution of our work.
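As one illustration of the partial correlation measures listed above, the following sketch simulates a hypothetical chain A -> B -> C and compares the plain correlation of A and C with their first-order partial correlation given B. The chain, the unit noise levels, and the function names are assumptions made for the example; the formula is the standard one for a single conditioning variable.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Toy causal chain A -> B -> C with unit-variance Gaussian noise (assumed).
rng = random.Random(1)
a = [rng.gauss(0.0, 1.0) for _ in range(5000)]
b = [ai + rng.gauss(0.0, 1.0) for ai in a]
c = [bi + rng.gauss(0.0, 1.0) for bi in b]

corr_ac = pearson(a, c)
pcorr_ac_given_b = partial_corr(a, c, b)

print(f"corr(A, C)     = {corr_ac:.2f}")           # clearly non-zero
print(f"corr(A, C | B) = {pcorr_ac_given_b:.2f}")  # close to zero
```

A and C are associated, but the association vanishes once B is accounted for, which is the pattern expected when B mediates the influence of A on C. The same pattern also arises when B is a common cause of A and C, which is one reason this kind of evidence is combined with the other sources listed above.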
Another central aspect of our work involves using representations of biological systems that make contact with microbiologists' ways of thinking, while still being interpretable by computers. First, our computational models are structured around the conceptual framework, taxonomies and jargon used by the biologists. For example, genes are regulated by transcription factor proteins, and so a biologist might expect to see the transcription factor activation level as a parent of the gene expression level, even though the transcription factor activation level might not be measured in the biologist's microarray experiments. In many cases, these considerations have a major impact on the structure of a computational model and can introduce additional challenges such as handling latent variables. Second, existing biological knowledge tends to be more qualitative than quantitative; thus, we have focused on qualitative causal models, in which only the direction and sign of a causal influence are specified, but not the functional form or the values of numeric parameters.
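A qualitative causal model of the kind described above can be pictured as a signed directed graph. The sketch below is a minimal illustration, not the representation used in our tools: the `Influence` class, the `path_signs` function, and the variable names are hypothetical, and the graph is assumed to be acyclic. It stores only direction and sign, marks the unmeasured transcription factor activity as latent, and derives the qualitative effect of one variable on another by multiplying signs along each directed path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Influence:
    """A qualitative causal link: only direction and sign are specified."""
    cause: str
    effect: str
    sign: int  # +1: cause increases effect, -1: cause decreases effect


def path_signs(influences, source, target):
    """Return the set of net signs over all directed paths source -> target.

    An empty set means the model predicts no causal effect. A set containing
    both +1 and -1 means the qualitative model leaves the net effect
    ambiguous. Assumes the graph has no cycles.
    """
    if source == target:
        return {+1}
    signs = set()
    for inf in influences:
        if inf.cause == source:
            for downstream in path_signs(influences, inf.effect, target):
                signs.add(inf.sign * downstream)
    return signs


# Hypothetical pathway: a chemical activates a receptor, which activates a
# transcription factor, which represses gene X.
model = [
    Influence("chemical", "receptor", +1),
    Influence("receptor", "tf_activity", +1),
    Influence("tf_activity", "gene_x", -1),
]
# Variables that appear in the model but are not measured by the microarray.
latent = {"tf_activity"}

print(path_signs(model, "chemical", "gene_x"))  # {-1}: exposure lowers gene X
print(path_signs(model, "gene_x", "chemical"))  # set(): no effect upstream
```

Because the latent transcription factor sits on the only path, the model still predicts the sign of the observable chemical-to-gene relationship even though the intermediate variable is never measured.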
Conventional experimental methodologies often rely on experimental design to elucidate causal influences, by comparing the reactions of a system under study to a variety of different interventions. For example, the difference between the reaction of a cell culture exposed to a certain chemical and that of an unexposed control culture reveals responses that are causally downstream of the exposure. By suppressing or knocking out intermediate steps, the causal ordering of intervening genes or proteins is revealed. Experimental data obtained under different experimental interventions are, by definition, sampled from different distributions, and as a result many common statistical analysis methods are not applicable to the data taken together. Our work has allowed information about different interventions to be incorporated into the data analysis process. The combined support for interventions and latent variables represents an interesting technical contribution of our work.
The standard challenge for data analysis is to differentiate real effects from statistical fluctuation given only limited amounts of data. In biology, small sample sizes are a particularly salient problem, since each experiment can be quite costly and dimensionality can be extremely high. The situation is aggravated even further by the need to evaluate possible causal influences, which are in general more difficult to detect than associational relations. We believe that the incorporation of domain-specific background knowledge and other previous results into the analysis of a data set is essential for addressing these challenges. We like to say that additional background knowledge increases statistical power. Said another way, a particular data set may actually contain statistically significant support for a hypothesis if the data is interpreted in the context of background knowledge, while not containing any significant evidence for that hypothesis if analyzed alone without the background knowledge. The incorporation of diverse forms of background knowledge is a defining feature of our research program. Examples of relevant background knowledge that we have looked at include the taxonomic role of substances (such as whether a substance is an external chemical, membrane receptor, kinase, transcription factor, gene expression level, etc.); established indirect causal relationships such as "A up-regulates B", "exposure to C results in phenotype D", or "E mediates B's response to A"; and constraints on which specific causal influences are possible or impossible.
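One simple way to see how background knowledge can increase statistical power is through multiple-testing correction. The sketch below uses a Bonferroni correction purely as an illustration; it is not the method used in our work, and the p-value and the hypothesis counts are made-up numbers. Constraints that rule out impossible causal influences shrink the set of hypotheses being tested, so the same evidence can clear a less severe threshold.

```python
def bonferroni_significant(p_value, n_hypotheses, alpha=0.05):
    """True if p_value survives a Bonferroni correction over n_hypotheses."""
    return p_value <= alpha / n_hypotheses


# Hypothetical evidence for one candidate influence "A regulates B".
p_value = 0.001

# Without background knowledge, every candidate influence is tested.
all_candidates = 1000
# With background knowledge (for example, only transcription factors may be
# parents of gene expression levels), most candidates are ruled out up front.
plausible_candidates = 20

without_knowledge = bonferroni_significant(p_value, all_candidates)
with_knowledge = bonferroni_significant(p_value, plausible_candidates)

print(without_knowledge)  # False: 0.001 > 0.05 / 1000
print(with_knowledge)     # True:  0.001 <= 0.05 / 20
```

The data and the p-value are identical in both cases; only the hypothesis space, as restricted by background knowledge, differs.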
Noise and uncertainty pervade biological knowledge and experimental results. As such, we utilize probabilistic formalisms and explicitly encode and reason about uncertainty. We have made use of Tetrad-style representations based on structural equation models (e.g., Bay et al., 2002), as well as Bayesian formulations (e.g., Chrisman et al., 2003).
In our ongoing research, we are extending our representations to cover dynamic processes, to enhance support for interactive refinement of hypothetical models and management of large corpora of background knowledge and experimental results, and to apply these tools to real problems and data in conjunction with working biologists.
This work has been supported by Grants NCC 2-1202, NCC 2-5471, and NCC 2-1335 from NASA Ames Research Center, and by NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation.