Large Conjunctive Clusters and Bicliques
Nina Mishra
Stanford University / HP Labs
We propose a new formulation of the clustering problem that differs from previous work in several ways. First, the goal is to explicitly output a collection of simple and meaningful conjunction of attributes that define the clusters. Second, the clusters might overlap. Third, the clusters might not cover all points. Finally, a point may be assigned to a cluster description even if it only satisfies most, and not necessarily all, of the attributes in the conjunction.
We give a graph-theoretic formulation of our clustering problem where the input
is a bipartite graph and the goal is to find a collection of large bicliques in
the graph. Identifying one largest conjunctive cluster is equivalent to
finding a maximum edge biclique. Since this problem is NP-hard [Peeters00] and
there is evidence that it is difficult to approximate [Feige02], we solve a
relaxed version where the objective is to find a large subgraph that is close
to being a biclique. We give a randomized algorithm that finds a relaxed
biclique with almost as many edges as the maximum biclique. We then extend
this algorithm to identify a good collection of large relaxed bicliques. A key
property of these algorithms is that they require a sample of data points of
size independent of both the number of data points and the number of
attributes. The running time of the algorithm is linear in the number of
attributes.
Date: Wednesday, February 4 |
Time: 4:15-5:30PM |
Place: Cordura 100 |
Return to the seminar schedule