Seminar on Computational Learning and Adaptation



 A New Sketch Algorithm for Estimating Associations

Ping Li
Department of Statistics
Stanford University


Associations (set intersections) are fundamental in many applications, including databases and information retrieval. Computing associations exactly can be impractical for large corpora like the Web, but approximate answers often suffice for applications such as database query planning and Web clustering. Broder's sketch is a well-established randomized  algorithm for estimating associations. We propose a new sketch algorithm that starts with Broder's sketch but that is more flexible and efficient. Our method converts the sketching task into a statistical inference problem. The algorithm is provably better, reducing the mean squared error of Broder's sketch roughly by a factor of two.  We extend our method to estimate multi-way associations, which have practical importance because database and Web queries often involve more than two words. In contrast, Broder's sketch does not have such a straightforward extension. Sampling methods become more important with larger collections. Using our method, sampling rates as low as 10^{-4} may suffice at the Web scale.

This talk reports joint work with Kenneth Church at Microsoft Research.


Date: Wed., Nov. 16

Time: 4:15-5:30PM 

Place: Cordura 100


Return to the seminar schedule