Seminar on Computational Learning and Adaptation



 
Applying the Minimum Description Length Principle
to General Model Classes and Error Measures

Peter Grünwald
Robotics Laboratory
Stanford University
Stanford, CA 94305-9010
grunwald@cs.stanford.edu

The Minimum Description Length (MDL) Principle is a general principle for inductive inference. It is based on the idea that the fewer bits one needs to describe (code) data, the more one has learnt about the data. In the simplest version of MDL, data is encoded by first encoding a model and then encoding the data with the help of the model. One then chooses the model (within some prespecified model class M) for which  this total two-stage description length of the data is minimized. In this way, one hopes to choose a model that optimizes the trade-off between model complexity and goodness-of-fit. To make 'coding data with the help of a model' meaningful, each model in M  must be associated with a code.

In case M is probabilistic, there is a straightforward way of doing this: one generally uses the Shannon-Fano code, coding data D using -log P(D|M) bits. This choice has some optimality properties; also, it enables one to re-interpret (parts of) MDL as a form of Bayesian statistics. In case M is not probabilistic (for example, a class of functions together with some error measure) it is not so clear how to associate codes with models.

In this talk, we present a simple formal recipe for associating codes with models. Our method works for arbitrary model classes (they can but need not be probabilistic) and error measures, thereby substantially extending the applicability of MDL. The recipe we give satisfies several optimality properties.

We present some theoretical results concerning our recipe, which seem promising for practical applications. First of all, it removes a certain arbitrariness that has been inherent in previous proposals. Second, it allows one to optimize the behaviour of MDL against (almost) any error function one wishes. For example, if one uses probabilistic model classes (like Naive Bayes) for classification, where the (non-probabilistic) zero/one-loss function is used, our recipe gives somewhat different codelengths than 'classical' MDL. Our theorems imply that, under fairly general circumstances (even if the data generating machinery is quite unrelated to any of the models under consideration), our recipe should improve on 'classical' MDL. Third, if one intends to use the model one inferred from data to predict future data against a different  error functions than the one one used during learning (this often happens in practice!), then using our recipe will still often lead to reasonable predictions.
 
In this talk, we present our recipe, show how it works and discuss its practical and theoretical aspects. We also discuss connections to Maximum Entropy. Finally, we show that our recipe has several other applications outside of optimizing the trade-off between complexity and goodness-of-fit. The talk assumes no prior knowledge of information theory.


Date: Thurs., Feb. 4 
Time: 4:15-5:30PM
Place: Cordura 100

Return to the seminar schedule