A Discriminative Framework for Clustering via Similarity Functions
Maria-Florina Balcan, Carnegie Mellon University
Joint work with Avrim Blum and Santosh Vempala
Brief Overview of the Talk
Supervised learning: learning from labeled data.
• Good theoretical models: PAC, SLT.
• Kernels & similarity functions.
Clustering: learning from unlabeled data.
• Lack of good unified models: vague, difficult to reason about at a general technical level.
Our work: fix the problem with a PAC-style framework.
Clustering: Learning from Unlabeled Data
[Figure: documents clustered by topic, e.g. sports vs. fashion.]
• S: a set of n objects (e.g., documents).
• ∃ a ground-truth clustering: each x has a label l(x) in {1,…,t} (e.g., its topic).
• Goal: a hypothesis h of low error, where err(h) = min_σ Pr_{x∼S}[σ(h(x)) ≠ l(x)], the minimum taken over permutations σ of the cluster labels.
• Problem: unlabeled data only! But we have a similarity function!
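The error measure err(h) can be made concrete with a small brute-force sketch (function and variable names hypothetical, not from the talk): try every permutation σ of the t cluster ids and keep the best match.

```python
# Minimal sketch of err(h) = min over permutations sigma of
# Pr_{x ~ S}[sigma(h(x)) != l(x)]; brute force, so only for small t.
from itertools import permutations

def clustering_error(h, l, t):
    """h, l: lists of cluster ids in {0, ..., t-1}."""
    n = len(h)
    best = n
    for sigma in permutations(range(t)):
        mistakes = sum(1 for i in range(n) if sigma[h[i]] != l[i])
        best = min(best, mistakes)
    return best / n

# h uses swapped ids, but relabeling 0 <-> 1 matches l exactly:
print(clustering_error([1, 1, 0, 0], [0, 0, 1, 1], 2))  # -> 0.0
```

The permutation minimum captures that cluster ids are arbitrary: only the induced partition matters.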
Clustering: Learning from Unlabeled Data
Protocol
• Input: S and a similarity function K.
• Output: a clustering of small error.
• ∃ a ground-truth clustering for S, i.e., each x in S has a label l(x) in {1,…,t}.
• The similarity function K has to be related to the ground truth.
Clustering: Learning from Unlabeled Data
Fundamental Question
What natural properties on a similarity function would be sufficient to allow one to cluster well?
Contrast with Standard Approaches
Clustering: theoretical frameworks.
Mixture models
• Input: embedding into R^d.
• Score algorithms based on error rate.
• Strong probabilistic assumptions.
Approximation algorithms
• Input: graph or embedding into R^d.
• Analyze algorithms that optimize various criteria over edges.
• Score algorithms based on approximation ratios.
Our approach
• Input: graph or similarity information.
• Score algorithms based on error rate.
• No strong probabilistic assumptions: discriminative, not generative.
• Much better when the input graph/similarity is based on heuristics, e.g., clustering documents by topic or web search results by category.
What natural properties on a similarity function would be sufficient to allow one to cluster well?
A condition that trivially works:
• K(x,y) > 0 for all x, y with l(x) = l(y).
• K(x,y) < 0 for all x, y with l(x) ≠ l(y).
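Under this trivially-sufficient condition, clustering reduces to taking connected components of the graph with an edge wherever K(x, y) > 0. A minimal sketch (names hypothetical):

```python
# If K is positive exactly within clusters and negative across them,
# the target clusters are the connected components of the K > 0 graph.
def cluster_by_sign(points, K):
    remaining, clusters = set(points), []
    while remaining:
        stack = [remaining.pop()]          # arbitrary seed point
        comp = set(stack)
        while stack:                       # grow one component
            x = stack.pop()
            for y in list(remaining):
                if K(x, y) > 0:
                    remaining.remove(y)
                    comp.add(y)
                    stack.append(y)
        clusters.append(frozenset(comp))
    return clusters

# Toy check: points 0,1 and points 2,3 form two clusters.
K = lambda x, y: 1.0 if (x < 2) == (y < 2) else -1.0
print(sorted(sorted(c) for c in cluster_by_sign([0, 1, 2, 3], K)))
# -> [[0, 1], [2, 3]]
```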
What natural properties on a similarity function would be sufficient to allow one to cluster well?
Strict separation: all x are more similar to all y in their own cluster than to any z in any other cluster.
Problem: the same K can satisfy this for two very different, equally natural clusterings of the same data!
[Figure: K(x,x′) = 1 within a subtopic (soccer, tennis, Lacoste, Gucci), 0.5 within a topic (sports, fashion), 0 across topics. Both the 2-clustering {sports, fashion} and the 4-clustering {soccer, tennis, Lacoste, Gucci} satisfy strict separation.]
Relax Our Goals
1. Produce a hierarchical clustering s.t. the correct answer is approximately some pruning of it.
[Figure: tree with root "All topics", children "sports" and "fashion", and leaves soccer, tennis, Lacoste, Gucci.]
2. Produce a list of clusterings s.t. at least one has low error.
Tradeoff: the strength of the assumption vs. the size of the list. Obtain a rich, general model.
Strict Separation Property
All x are more similar to all y in their own cluster than to any z in any other cluster.
[Figure: sports/fashion example with K(x,x′) = 1 within a subtopic, 0.5 within a topic, 0 across topics.]
Algorithm: Single linkage
• Merge the "parts" whose maximum similarity is highest.
• Sufficient for hierarchical clustering (if K is symmetric).
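The single-linkage step can be sketched as follows (a minimal hypothetical version; practical implementations use priority queues): start from singletons and repeatedly merge the two parts with the highest maximum similarity, recording merges to form the tree.

```python
# Single linkage: repeatedly merge the two current "parts" whose MAX
# pairwise similarity is highest; the merge sequence defines a tree.
def single_linkage_tree(points, K):
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        a, b = max(((a, b) for i, a in enumerate(clusters)
                    for b in clusters[i + 1:]),
                   key=lambda ab: max(K(x, y) for x in ab[0] for y in ab[1]))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b))
    return merges

# Toy data: similarity 0.5 within a topic, 0 across topics.
topic = {"soccer": "s", "tennis": "s", "lacoste": "f", "gucci": "f"}
K = lambda x, y: 0.5 if topic[x] == topic[y] else 0.0
merges = single_linkage_tree(list(topic), K)
# The last merge joins the sports subtree with the fashion subtree,
# so the ground-truth 2-clustering is a pruning of the tree.
```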
Strict Separation Property
All x are more similar to all y in their own cluster than to any z in any other cluster.
Theorem
Using single linkage, we can construct a tree s.t. the ground-truth clustering is a pruning of the tree.
Incorporating approximation assumptions in our model
If one uses a c-approximation algorithm for an objective f (e.g., k-median or k-means) to minimize the error rate, the implicit assumption is: clusterings within a factor c of optimal are ε-close to the target. Then most points (a 1 − O(ε) fraction) satisfy strict separation, and we can still cluster well in the tree model.
Stability Property
For all clusters C, C′ and all A ⊂ C, A′ ⊆ C′: K(A, C∖A) > K(A, A′), where K(A, A′) denotes the average attraction between A and A′.
Equivalently: neither A nor A′ is more attracted to the other than to the rest of its own cluster.
Algorithm: Average linkage
• Merge the "parts" whose average similarity is highest.
• Sufficient for hierarchical clustering: single linkage fails, but average linkage works.
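Average linkage follows the same skeleton as single linkage, with the average rather than the maximum similarity deciding merges; a minimal sketch (names hypothetical):

```python
# Average linkage: repeatedly merge the two current "parts" whose AVERAGE
# pairwise similarity (the "average attraction") is highest.
def average_linkage_tree(points, K):
    def avg(a, b):
        return sum(K(x, y) for x in a for y in b) / (len(a) * len(b))
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        a, b = max(((a, b) for i, a in enumerate(clusters)
                    for b in clusters[i + 1:]), key=lambda ab: avg(*ab))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b))
    return merges

# 0.5 within a topic, 0 across topics: the last merge joins the two
# topic subtrees, recovering the ground-truth 2-clustering.
topic = {"soccer": "s", "tennis": "s", "lacoste": "f", "gucci": "f"}
K = lambda x, y: 0.5 if topic[x] == topic[y] else 0.0
a, b = average_linkage_tree(list(topic), K)[-1]
print(sorted(map(sorted, (a, b))))
# -> [['gucci', 'lacoste'], ['soccer', 'tennis']]
```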
Stability Property
For all C, C′, all A ⊂ C, A′ ⊆ C′: K(A, C∖A) > K(A, A′) (K(A, A′) = average attraction between A and A′).
Theorem
Using average linkage, we can construct a tree s.t. the ground-truth clustering is a pruning of the tree.
Analysis: all "parts" stay laminar w.r.t. the target clustering.
• Failure iff we merge P1, P2 s.t. P1 ⊂ C and P2 ∩ C = ∅.
• But there must exist P3 ⊂ C s.t. K(P1, P3) ≥ K(P1, C∖P1) and K(P1, C∖P1) > K(P1, P2). Contradiction.
Stability Property
For all C, C′, all A ⊂ C, A′ ⊆ C′: K(A, C∖A) > K(A, A′).
Average linkage breaks down if K is not symmetric (e.g., K(x,y) = 0.5 but K(y,x) = 0.25). Instead, run a "Boruvka-inspired" algorithm:
• Each current cluster Ci points to argmax_{Cj} K(Ci, Cj).
• Merge directed cycles.
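One round of the "Boruvka-inspired" rule can be sketched as follows (helper names hypothetical): each cluster points at the cluster it is most attracted to, and each directed cycle of pointers is merged into a single cluster.

```python
# One Boruvka-style round for (possibly asymmetric) K: cluster i points to
# argmax_j avg-attraction(C_i -> C_j); pointer cycles are merged.
def boruvka_round(clusters, K):
    n = len(clusters)
    def avg(a, b):  # average attraction FROM a TO b (may be asymmetric)
        return sum(K(x, y) for x in a for y in b) / (len(a) * len(b))
    ptr = [max((j for j in range(n) if j != i),
               key=lambda j: avg(clusters[i], clusters[j]))
           for i in range(n)]
    def cycle_from(i):  # walk pointers from i; return the cycle reached
        seen, j = [], i
        while j not in seen:
            seen.append(j)
            j = ptr[j]
        return frozenset(seen[seen.index(j):])
    out, done = [], set()
    for i in range(n):
        if i in done:
            continue
        cyc = cycle_from(i)
        if i in cyc:  # i lies on a cycle: merge the whole cycle
            out.append(frozenset().union(*(clusters[k] for k in cyc)))
            done.update(cyc)
        else:         # i only leads into a cycle: keep it this round
            out.append(clusters[i])
            done.add(i)
    return out

# Asymmetric toy similarities: a <-> b and c <-> d form pointer cycles.
sim = {("a", "b"): 1.0, ("b", "a"): 0.9, ("c", "d"): 1.0, ("d", "c"): 0.8}
K = lambda x, y: sim.get((x, y), 0.1)
out = boruvka_round([frozenset([p]) for p in "abcd"], K)
print(sorted(map(sorted, out)))  # -> [['a', 'b'], ['c', 'd']]
```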
Unified Model for Clustering
[Diagram: properties P1, …, Pi, …, Pn of the similarity function w.r.t. the ground-truth clustering, matched to algorithms A1, A2, …, Am.]
Question 1: Given a property of the similarity function w.r.t. the ground-truth clustering, what is a good algorithm?
Unified Model for Clustering
[Diagram: properties P1, …, Pi, …, Pn of the similarity function w.r.t. the ground-truth clustering, matched to algorithms A1, A2, …, Am.]
Question 2: Given the algorithm, what property of the similarity function w.r.t. the ground-truth clustering should the expert aim for?
Other Examples of Properties and Algorithms
Average Attraction Property
E_{x′∈C(x)}[K(x,x′)] > E_{x′∈C′}[K(x,x′)] + γ for all C′ ≠ C(x).
• Not sufficient for hierarchical clustering.
• Can produce a small list of clusterings (sampling-based algorithm).
• Upper bound on the list size: t^{O((t/γ²) log(t/ε))}; lower bound: t^{Ω(1/γ)}.
Stability of Large Subsets Property
For all clusters C, C′, for all A ⊆ C, A′ ⊆ C′ with |A| + |A′| ≥ sn: neither A nor A′ is more attracted to the other than to the rest of its own cluster.
• Sufficient for hierarchical clustering.
• Find the hierarchy using a multi-stage learning-based algorithm.
Stability of Large Subsets Property
For all C, C′, all A ⊂ C, A′ ⊆ C′ with |A| + |A′| ≥ sn: K(A, C∖A) > K(A, A′).
Algorithm
1) Generate a list L of candidate clusters (average-attraction algorithm); ensure that any ground-truth cluster is f-close to one in L.
2) For every pair (C, C′) in L s.t. all three parts C ∩ C′, C∖C′, C′∖C are large:
   if K(C ∩ C′, C∖C′) ≥ K(C ∩ C′, C′∖C), then throw out C′; else throw out C.
3) Clean and hook up the surviving clusters into a tree.
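Step 2 can be sketched as follows (a minimal hypothetical version; `min_size` stands in for the "all three parts are large" threshold):

```python
# Step-2 sketch: for overlapping candidates C, C2 whose three parts
# (C & C2, C - C2, C2 - C) are all large, keep the side that the
# intersection is more attracted to and throw out the other candidate.
def prune_candidates(L, K, min_size):
    def avg(a, b):  # average attraction between parts a and b
        return sum(K(x, y) for x in a for y in b) / (len(a) * len(b))
    survivors = list(L)
    changed = True
    while changed:
        changed = False
        for C in list(survivors):
            for C2 in list(survivors):
                if C == C2 or C not in survivors or C2 not in survivors:
                    continue
                inter, c_only, c2_only = C & C2, C - C2, C2 - C
                if min(len(inter), len(c_only), len(c2_only)) < min_size:
                    continue
                if avg(inter, c_only) >= avg(inter, c2_only):
                    survivors.remove(C2)  # intersection prefers C's side
                else:
                    survivors.remove(C)
                changed = True
    return survivors

# {c, d} is much more attracted to {a, b} than to {e, f}, so the
# candidate "cdef" is thrown out and only "abcd" survives.
K = lambda x, y: 1.0 if x in "abcd" and y in "abcd" else 0.1
print(prune_candidates([frozenset("abcd"), frozenset("cdef")], K, 2))
```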
Stability of Large Subsets
For all C, C′, all A ⊂ C, A′ ⊆ C′ with |A| + |A′| ≥ sn: K(A, C∖A) > K(A, A′) + γ.
Theorem
If s = O(ε²/k²) and f = O(ε²/k²), then the algorithm produces a tree s.t. the ground truth is ε-close to a pruning.
The Inductive Setting
[Figure: instance space X with a sample S drawn from it.]
• Draw a sample S and cluster S (in the list or tree model).
• Insert new points as they arrive.
Many of our algorithms extend naturally to this setting. To get polynomial time for stability of all subsets, one needs to argue that sampling preserves stability [AFKK].
Similarity Functions for Clustering: Summary
Main Conceptual Contributions
• Natural conditions on K that suffice to cluster well.
• A general model that parallels PAC, SLT, and learning with kernels and similarity functions in supervised classification.
• For a robust theory, relax the objective: hierarchy, list.
Technically Most Difficult Aspects
• Algorithms for stability of large subsets and ν-strict separation.
• Algorithms and analysis for the inductive setting.