
Data-driven Clustering via Parameterized Lloyd's Families

Travis Dick
Joint work with Maria-Florina Balcan and Colin White

Carnegie Mellon University
NeurIPS 2018


Data-driven Clustering

• Clustering aims to divide a dataset into self-similar clusters.
• Goal: find some unknown natural clustering.
• However, most clustering algorithms minimize a clustering cost function.
• Hope: low-cost clusterings recover the natural clusters.
• There are many algorithms and many objectives.

How do we choose the best algorithm for a specific application? Can we automate this process?


Learning Model

• An unknown distribution 𝒟 over clustering instances.
• Given a sample S₁, …, S_N ∼ 𝒟, each annotated with its target clustering.
• Find an algorithm A that produces clusterings similar to the target clusterings.
• Want A to also work well for new instances from 𝒟!
• In this work:
  1. Introduce a large parametric family of clustering algorithms, (α, β)-Lloyds.
  2. Efficient procedures for finding the best parameters on a sample.
  3. Generalization: optimal parameters on the sample are nearly optimal on 𝒟.
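To make the learning objective concrete, here is a minimal sketch of how one might score an algorithm's output against a target clustering on a sampled instance. The slides only say the produced clustering should be "similar" to the target; the simple majority-label matching loss below and the helper name clustering_agreement are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np
from collections import Counter

def clustering_agreement(pred_labels, target_labels):
    """Fraction of points whose predicted cluster's majority target label
    matches their own target label. A simple stand-in for the distance
    between clusterings used in the paper."""
    pred_labels = np.asarray(pred_labels)
    target_labels = np.asarray(target_labels)
    correct = 0
    for c in np.unique(pred_labels):
        members = target_labels[pred_labels == c]
        # Credit every point in this cluster that shares the majority target label.
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(target_labels)
```

Tuning then means choosing the parameters whose produced clusterings score highest on average over the sampled, annotated instances.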


Lloyd's Method

• Maintains k centers c₁, …, c_k that define the clusters.
• Performs local search to improve the k-means cost of the centers:
  1. Assign each point to its nearest center.
  2. Update each center to be the mean of its assigned points.
  3. Repeat until convergence.
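A minimal sketch of the local search loop described above, assuming Euclidean data stored as a NumPy array; the function name lloyds_method, the iteration cap, and the convergence tolerance are illustrative choices rather than anything specified on the slides.

```python
import numpy as np

def lloyds_method(X, centers, max_iters=100, tol=1e-8):
    """Run Lloyd's local search from the given initial centers.

    X: (n, d) array of points; centers: (k, d) array of initial centers.
    Returns the final centers and each point's cluster assignment.
    """
    centers = centers.copy()
    for _ in range(max_iters):
        # 1. Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # 2. Update each center to be the mean of its assigned points.
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(len(centers))
        ])
        # 3. Repeat until convergence (centers stop moving).
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, assign
```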


Initial Centers are Important!

• Lloyd's method can get stuck if the initial centers are chosen poorly.
• Initialization is a well-studied problem with many proposed procedures (e.g., k-means++).
• The best method depends on the properties of the clustering instances.

The (", $)-Lloyds Family

The (", $)-Lloyds FamilyInitialization: Parameter "

The (", $)-Lloyds FamilyInitialization: Parameter "• Use &'-sampling (generalizing &(-sampling of )-means++)

The (", $)-Lloyds FamilyInitialization: Parameter "• Use &'-sampling (generalizing &(-sampling of )-means++)• Choose initial centers from dataset * randomly.

The (", $)-Lloyds FamilyInitialization: Parameter "• Use &'-sampling (generalizing &(-sampling of )-means++)• Choose initial centers from dataset * randomly.• Probability that point + ∈ * is center -. is proportional to & +, -/, … , -.1/ '.

The (", $)-Lloyds FamilyInitialization: Parameter "• Use &'-sampling (generalizing &(-sampling of )-means++)

" = 0: random initialization

• Choose initial centers from dataset , randomly.• Probability that point - ∈ , is center /0 is proportional to & -, /1, … , /031 '.

The (", $)-Lloyds FamilyInitialization: Parameter "• Use &'-sampling (generalizing &(-sampling of )-means++)

" = 0: random initialization " = 2: )-means++

• Choose initial centers from dataset - randomly.• Probability that point . ∈ - is center 01 is proportional to & ., 02, … , 0142 '.

The (", $)-Lloyds FamilyInitialization: Parameter "• Use &'-sampling (generalizing &(-sampling of )-means++)

" = 0: random initialization " = 2: )-means++ " = ∞: farthest first

• Choose initial centers from dataset . randomly.• Probability that point / ∈ . is center 12 is proportional to & /, 13, … , 1253 '.

The (", $)-Lloyds FamilyInitialization: Parameter "• Use &'-sampling (generalizing &(-sampling of )-means++)

" = 0: random initialization " = 2: )-means++ " = ∞: farthest first

Local search: Second parameter $ tweaks the local search. Details in paper.

• Choose initial centers from dataset . randomly.• Probability that point / ∈ . is center 12 is proportional to & /, 13, … , 1253 '.

The (", $)-Lloyds FamilyInitialization: Parameter "• Use &'-sampling (generalizing &(-sampling of )-means++)

" = 0: random initialization " = 2: )-means++ " = ∞: farthest first

Local search: Second parameter $ tweaks the local search. Details in paper.

Question: For a distribution . over tasks, what parameters give best performance?

• Choose initial centers from dataset / randomly.• Probability that point 0 ∈ / is center 23 is proportional to & 0, 24, … , 2364 '.
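Here is a minimal sketch of d^α-sampling as described above, again assuming Euclidean NumPy data; the function name d_alpha_seeding is an illustrative choice. With alpha=0 every point is equally likely (random initialization), alpha=2 recovers k-means++ seeding, and very large alpha approaches farthest-first traversal.

```python
import numpy as np

def d_alpha_seeding(X, k, alpha, rng=None):
    """Choose k initial centers from X by d^alpha-sampling.

    Each new center is a data point sampled with probability proportional
    to (distance to its nearest already-chosen center) ** alpha.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    # First center: uniformly at random.
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        dists = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2),
            axis=1,
        )
        weights = dists ** alpha
        if weights.sum() == 0:  # every point coincides with a chosen center
            probs = np.full(n, 1.0 / n)
        else:
            probs = weights / weights.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

Chaining this seeding with the lloyds_method sketch above gives one member of the family for a fixed α (the β-modified local search step is not shown here).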


Results

Efficient Tuning on a Sample:
• An efficient algorithm for finding the parameters with the best agreement to the targets on the sample.
• "It is algorithmically feasible to tune the parameters on a sample."

Generalization Guarantee:
• Analyze the intrinsic complexity of the (α, β)-Lloyds family.
• Show that roughly Õ(k log n / ε²) clustering instances suffice to ensure that the empirical cost of every parameter setting is within ε of its expected cost.
• "Parameters tuned on the sample will work well for new instances!"

Experiments: Evaluate the (α, β)-Lloyds family on real and synthetic data: CIFAR-10, MNIST, Mixture of Gaussians, CNAE-9.
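The paper gives an efficient procedure for finding the empirically best parameters; as a rough illustration only, the sketch below tunes α by brute-force search over a grid, scoring each candidate by its average agreement with the target clusterings on the sample (reusing the hypothetical clustering_agreement, d_alpha_seeding, and lloyds_method helpers sketched above). The grid, the fixed β, and the averaging over random seedings are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def tune_alpha(instances, k, alpha_grid, runs=5, seed=0):
    """Pick the alpha in alpha_grid with the best average agreement to the
    target clusterings over a sample of (points, target_labels) instances."""
    rng = np.random.default_rng(seed)
    best_alpha, best_score = None, -np.inf
    for alpha in alpha_grid:
        scores = []
        for X, target in instances:
            for _ in range(runs):  # average over the randomness of the seeding
                centers = d_alpha_seeding(X, k, alpha, rng)
                _, labels = lloyds_method(X, centers)
                scores.append(clustering_agreement(labels, target))
        score = float(np.mean(scores))
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score

# Example call on a sample of annotated instances (hypothetical data):
# best_alpha, _ = tune_alpha(sample, k=10, alpha_grid=np.linspace(0, 10, 21))
```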
