A New Sketch Algorithm for
Estimating Associations
Ping Li
Department of Statistics
Stanford University
Associations (set intersections) are fundamental in many applications,
including databases and information retrieval. Computing associations
exactly can be impractical for large corpora like the Web, but approximate
answers often suffice for applications such as database query planning and
Web clustering. Broder's sketch is a well-established randomized
algorithm for estimating associations. We propose a new sketch algorithm
that starts with Broder's sketch but that is more flexible and efficient.
Our method converts the sketching task into a statistical inference problem.
The algorithm is provably better, reducing the mean squared error of
Broder's sketch roughly by a factor of two. We extend our method to
estimate multi-way associations, which have practical importance because
database and Web queries often involve more than two words. In contrast,
Broder's sketch does not have such a straightforward extension. Sampling
methods become more important with larger collections. Using our method,
sampling rates as low as 10^{-4} may suffice at the Web scale.
This talk reports joint work with Kenneth Church at Microsoft Research.
|
Date: Wed., Nov. 16 |
Time: 4:15-5:30PM |
Place: Cordura 100 |
Return to the seminar schedule