Seminar on Computational Learning and Adaptation


  Analysing Stochastic Language Models

Rune Lyngso
Baskin School of Engineering
University of California at Santa Cruz
rlyngsoe@cse.ucsc.edu
http://www.cse.ucsc.edu/~rlyngsoe

Hidden Markov models (HMMs) are stochastic models of regular languages, while stochastic context-free grammars (SCFGs) are stochastic models of context-free languages. These formalisms were originally developed for natural language analysis, but they have also found their way into bioinformatics over the last decade. By and large, they are viewed as tools for annotating and classifying sequences. However, since these approaches define probability distributions over all finite sequences, we can alternatively view them as succinct probabilistic representations of sequence families. I will talk about comparing the probability distributions of HMMs and SCFGs. The core of the positive results presented are methods for computing the collision probability between the distribution of an HMM and the distribution of either an HMM or an SCFG, i.e. the probability of drawing identical elements from the two distributions. Collision probabilities allows us to easily compute some measures, e.g. the L_2-norm and Hellinger distance. I will also present negative results on the hardness of computing the L_1- and L_infinity-norms. The results are mainly of theoretical interest, as experimental results are not overly convincing; one should certainly not attempt to use our methods to try to detect remote sequence family homologies. However, as our methods for computing collision probabilities are generalizations of the algorithms for computing the probabilities of a sequence in an HMM or an SCFG, we immediately get similar generalizations from sequences to models of various methods using these algorithms as constituent parts.



Date: Thursday, December 6

Time: 4:15-5:30PM

Place: Ventura 17


Return to the seminar schedule