Ranked Recall: Efficient Classification by Learning Indices That Rank

Omid Madani with Michael Connor (UIUC)

Page 1: Ranked Recall: Efficient Classification by Learning Indices That Rank

ResearchResearch

Ranked Recall: Efficient Classification by Learning

Indices That Rank

Omid Madani

with Michael Connor (UIUC)

Page 2: Ranked Recall: Efficient Classification by Learning Indices That Rank

Many Category Learning (e.g. Y! Directory)

[Figure: part of the Yahoo! directory tree, with nodes including Arts&Humanities, Business&Economy, Recreation&Sports, Photography, History, Magazines, Contests, Education, Sports, Amateur, college, basketball]

• Over 100,000 categories in the Yahoo! directory

• Given a page, quickly categorize…

• Larger for vision, text prediction, ... (millions and beyond)

Page 3: Ranked Recall: Efficient Classification by Learning Indices That Rank

Supervised Learning

• Often two phases:

  • Training

  • Execution/Testing

• A learnt classifier f (categorizer)

• f(unseen instance) → class prediction(s)

Class   x1  x2  x3
1       1   0   3
5       0   0   1
2       1   1   0
2       0   0   0
?       0   0   1

For an instance x with features x1, x2, x3: f(x) → Y

• Often learn binary classifiers

Page 4: Ranked Recall: Efficient Classification by Learning Indices That Rank

Massive Learning

• Lots of ...

  • Instances (millions, unbounded..)

  • Dimensions (1000s and beyond)

  • Categories (1000s and beyond)

• Two questions:

  1. How to quickly categorize?

  2. How to efficiently learn to categorize efficiently?

Page 5: Ranked Recall: Efficient Classification by Learning Indices That Rank

Efficiency

1. Two phases (combined when online):

  1. Learning

  2. Classification time/deployment

2. Resource requirements:

  1. Memory

  2. Time

  3. Sample efficiency

Page 6: Ranked Recall: Efficient Classification by Learning Indices That Rank

Idea

• Cues in input may quickly narrow down possibilities => “index” categories

• Like search engine, but learn a good index

• Goal: learn to strike a good balance between accuracy and efficiency

Page 7: Ranked Recall: Efficient Classification by Learning Indices That Rank

Summary Findings

• Very fast:

  • Train time: minutes versus hours/days (compared against one-versus-rest and top-down)

  • Classification time: O(|x|)?

• Memory efficient

• Simple to use (runs on laptop..)

• Competitive accuracy!

Page 8: Ranked Recall: Efficient Classification by Learning Indices That Rank

Problem Formulation

Page 9: Ranked Recall: Efficient Classification by Learning Indices That Rank

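Input-Output Summary

Input: tripartite graph (features, instances, categories)

learn

Output: an index = sparse weighted directed bipartite graph (sparse matrix) from features to categories

The edge from feature $f_i$ to category $c_j$ carries weight $w_{ij}$; for example, $w_{21}$ is the weight on the edge from $f_2$ to $c_1$.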
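
One way to hold such an index in memory, as a minimal sketch: a dictionary from each feature to its outgoing weighted edges, which stores only the non-zero entries of each row of the sparse matrix. The helper names (`add_edge`, `out_degree`) and the weight 0.3 are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

# index[f][c] = w_fc: weight on the directed edge from feature f to category c.
# Only non-zero entries are stored, so each feature's row stays sparse.
index = defaultdict(dict)

def add_edge(index, feature, category, weight):
    """Add or overwrite the edge feature -> category."""
    if weight < 0:
        raise ValueError("this sketch assumes non-negative weights")
    index[feature][category] = weight

def out_degree(index, feature):
    """Number of categories the feature currently points to."""
    return len(index.get(feature, {}))

add_edge(index, "f2", "c1", 0.3)   # w_21: hypothetical weight from f2 to c1
print(out_degree(index, "f2"))     # 1
```

Keeping the edges grouped by feature makes it cheap to check or cap a feature's out-degree, which is the quantity the efficiency constraint is stated in.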

Page 10: Ranked Recall: Efficient Classification by Learning Indices That Rank

Scheme

• Learn a weighted bipartite graph

• Rank categories retrieved

• For category assignment, could use rank, or define thresholds, or map scores to probabilities, etc.

Page 11: Ranked Recall: Efficient Classification by Learning Indices That Rank

Three Parts to the Online Solution

• How to use the index?

• How to update (learn) it?

• When to update it?

Page 12: Ranked Recall: Efficient Classification by Learning Indices That Rank

Retrieval (Ranked Recall)

Example instance: $x = \{f_2, f_3\}$

1. Features are “activated”

2. Edges are activated

3. Receiving categories are activated

4. Categories sorted/ranked

[Figure: bipartite graph from features f1 to f4 to categories c1 to c5; the activated edges carry weights 0.4, 0.3, 0.2, 0.1 and 0.1]

Sorted list: $(c_4, 0.5), (c_3, 0.4), (c_5, 0.1), (c_1, 0.1)$

1. Like use of inverted indices

2. Sparse dot products
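
The four retrieval steps can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `retrieve` is hypothetical, and the assignment of edges to features is an assumption chosen so that the slide's edge weights (0.4, 0.3, 0.2, 0.1, 0.1) reproduce the slide's sorted list.

```python
from collections import defaultdict

# Hypothetical index: the edge weights and the resulting scores follow the
# slide, but which feature owns which edge is assumed for illustration.
index = {
    "f2": {"c3": 0.4, "c4": 0.3, "c5": 0.1},
    "f3": {"c4": 0.2, "c1": 0.1},
}

def retrieve(index, x):
    """Rank categories for an instance x given as {feature: value}."""
    scores = defaultdict(float)
    for feature, value in x.items():                  # 1. features are activated
        edges = index.get(feature, {})
        for category, weight in edges.items():        # 2. edges are activated
            scores[category] += value * weight        # 3. categories receive activation
    # 4. Sort by score, highest first (Python's sort is stable, so ties keep
    #    the order in which categories were first reached).
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = retrieve(index, {"f2": 1.0, "f3": 1.0})
print([(c, round(s, 2)) for c, s in ranking])
# [('c4', 0.5), ('c3', 0.4), ('c5', 0.1), ('c1', 0.1)]
```

Only the edges of active features are touched, so with a capped out-degree d the scoring loop visits at most |x|·d edges, much like a lookup in an inverted index followed by a sparse dot product.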

Page 13: Ranked Recall: Efficient Classification by Learning Indices That Rank

Computing the Index

• Efficiency: Impose a constraint on every feature’s maximum out-degree

• Accuracy: Connect and compute weights so that some measure of accuracy is maximized..

Page 14: Ranked Recall: Efficient Classification by Learning Indices That Rank

Measure of Accuracy: Recall

• Measure average performance per instance

• Recall: The proportion of instances for which the right category ended up in top k

• Recall at k = 1 (R1), 5 (R5), 10, …

• R1 = “Accuracy” when “multiclass”
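
Recall at k can be computed directly from the ranked lists. The sketch below assumes one true category per instance (the multiclass case); the function name `recall_at_k` and the toy rankings are made up for illustration.

```python
def recall_at_k(rankings, true_categories, k):
    """Fraction of instances whose true category is among the top k ranked ones."""
    hits = sum(
        1
        for ranked, truth in zip(rankings, true_categories)
        if truth in ranked[:k]
    )
    return hits / len(rankings)

# Toy data: one ranked category list per instance, plus the true category.
rankings = [["c4", "c3", "c5"], ["c1", "c2"], ["c2", "c7", "c3"]]
truths = ["c4", "c2", "c3"]

print(recall_at_k(rankings, truths, 1))  # R1: only the first instance is a hit
print(recall_at_k(rankings, truths, 5))  # R5: all three true categories are in the top 5
```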

Page 15: Ranked Recall: Efficient Classification by Learning Indices That Rank

Computational Complexity

• NP-Hard!

• The problem: given a finite set of instances (Boolean features), exactly one category per instance, is there an index with max out-degree 1, such that R1 on training set is greater than a threshold t?

• Reduction from set cover

• Approximation? (not known)

Page 16: Ranked Recall: Efficient Classification by Learning Indices That Rank

How About Practice?

• Devised two main learning algorithms:

  • IND treats features independently.

  • Feature Normalize (FN) doesn’t make an independence assumption; it’s online.

• Only non-negative weights are learned.

Page 17: Ranked Recall: Efficient Classification by Learning Indices That Rank

Feature Normalize (FN) Algorithm

• Begin with an empty index

• Repeat

  • Input instance (features + categories), and retrieve and rank candidate categories

  • If margin is not met, update index

Page 18: Ranked Recall: Efficient Classification by Learning Indices That Rank

Three Parts (Online Setting)

• How to use the index?

• How to update it?

• When to update it?

Page 19: Ranked Recall: Efficient Classification by Learning Indices That Rank

Index Updating

• For each active feature:

  • Strengthen weights between active feature and true category

  • Weaken the other connections to the feature

• Strengthening = Increase weight by addition or multiplication

Page 20: Ranked Recall: Efficient Classification by Learning Indices That Rank

Updating

[Figure: bipartite graph from features f1 to f4 to categories c1 to c5, highlighting the edge from active feature $f_2 \in x$ to true category $c_3 \in C_x$]

1. Identify connection

2. Increase weight

3. Normalize/weaken other weights

4. Drop small weights
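
One possible reading of these four steps, as a sketch only: the slides do not give the exact update rule, so the additive boost, the parameter name `rate`, and renormalizing each feature's weights to sum to 1 are all assumptions. The defaults for `min_weight` and `max_out_degree` mirror the experimental settings reported later in the deck (0.01 and 25).

```python
def update_feature(edges, true_category, rate=0.2,
                   min_weight=0.01, max_out_degree=25):
    """Update the outgoing edges {category: weight} of ONE active feature, in place.

    Assumes rate > 0 so that the total weight is never zero.
    """
    # 1-2. Identify the connection to the true category and increase its
    #      weight (additive boost; an assumed choice).
    edges[true_category] = edges.get(true_category, 0.0) + rate
    # 3. Normalize from the feature's side: rescaling to sum 1 weakens every
    #    other connection without any explicit penalty.
    total = sum(edges.values())
    for category in edges:
        edges[category] /= total
    # 4. Drop small weights and keep at most max_out_degree edges.
    strongest = sorted(edges.items(), key=lambda item: item[1], reverse=True)
    kept = {c: w for c, w in strongest[:max_out_degree] if w >= min_weight}
    edges.clear()
    edges.update(kept)

def update_index(index, active_features, true_category, **params):
    """Apply the per-feature update to every active feature of an instance."""
    for feature in active_features:
        update_feature(index.setdefault(feature, {}), true_category, **params)

index = {"f2": {"c4": 0.5, "c3": 0.5}}
update_index(index, ["f2", "f3"], "c3", rate=0.5)
print(index)
```

In this toy run the true category c3 overtakes c4 for feature f2, and the previously unseen feature f3 gets a single new edge to c3. The reordering comes from the normalization alone.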

Page 21: Ranked Recall: Efficient Classification by Learning Indices That Rank

Three Parts

• How to use an index?

• How to update it?

• When to update it?

Page 22: Ranked Recall: Efficient Classification by Learning Indices That Rank

A Tradeoff

1. To achieve stability (helps accuracy), we need to keep updating (think single feature scenario)

2. To “fit” more instances, we need to stop updates on instances that we get “right”

Use of margin threshold strikes a balance.

Page 23: Ranked Recall: Efficient Classification by Learning Indices That Rank

Margin Definition

• Margin = score of the true positive category MINUS score of highest ranked negative category

• Choice of margin threshold:

  • Fixed, e.g. 0, 0.1, 0.5, …

  • Online average (e.g. average of the last 10000 margins + 0.1)
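
The margin and the online-average threshold can be sketched as below. The names `margin`, `OnlineAverageThreshold`, and `should_update` are hypothetical, and two details are assumptions: a category that was not retrieved scores 0, and an update is triggered when the margin is strictly below the threshold.

```python
from collections import deque

def margin(scores, true_category):
    """Score of the true category minus the best score among the other categories."""
    positive = scores.get(true_category, 0.0)  # assumed 0 if not retrieved
    negatives = [s for c, s in scores.items() if c != true_category]
    return positive - max(negatives, default=0.0)

class OnlineAverageThreshold:
    """Threshold = average of the last `window` margins + `offset`."""

    def __init__(self, window=10000, offset=0.1):
        self.recent = deque(maxlen=window)  # oldest margins fall out automatically
        self.offset = offset

    def value(self):
        # Recomputed on each call for clarity; a running sum would be faster.
        average = sum(self.recent) / len(self.recent) if self.recent else 0.0
        return average + self.offset

    def should_update(self, m):
        """Return True if margin m misses the current threshold, then record m."""
        decision = m < self.value()
        self.recent.append(m)
        return decision

scores = {"c4": 0.5, "c3": 0.4, "c5": 0.1}
print(round(margin(scores, "c4"), 2))   # 0.1  (true category ranked first)
print(round(margin(scores, "c3"), 2))   # -0.1 (true category ranked second)
```

A fixed threshold is the special case of comparing `margin(...)` against a constant such as 0, 0.1, or 0.5.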

Page 24: Ranked Recall: Efficient Classification by Learning Indices That Rank

Salient Aspects of FN

• “Differentially” updates, attempts to improve retrieved ranking (in “context”)

• Normalizes, but from “feature’s side”

• No explicit weight demotion/punishment! (normalization/weakening achieves demotion/reordering ..)

• Memory/Efficiency conscious design from the outset

• Very dynamic/adaptive:

  • Edges added and dropped

  • Weights adjusted, categories reordered

• Extensions/variations exist (e.g. each feature’s out-degree may dynamically adjust)

Page 25: Ranked Recall: Efficient Classification by Learning Indices That Rank

Domain statistics

Domain         # of Instances   # of features   |C|     Avg vector length (L)   Avg labels per x (Cavg)
Reuters RCV1   23k              47k             414     76                      2.08
Ads            369k             301k            12.6k   27                      1.4
Web            70k              685k            14k     210                     1
Jane Austen    749k             299k            17.4k   15.1                    1

Further row labels on the slide: Reuters 21578, 20 News grp, industry.

• Experiments are average of 10 runs, each run is a single pass, with 90% for training, 10% held out

• |C| is the number of classes, L is avg vector length, Cavg is average number of categories per instance

Page 26: Ranked Recall: Efficient Classification by Learning Indices That Rank

Smaller Domains

[Figure: 10 categories, 10k instances]

Keerthi and DeCoste, 06 (fast linear SVM)

• Max out-degree = 25, min allowed weight = 0.01, tested with margins 0, 0.1, and 0.5 and up to 10 passes

• 90-10 random splits

Page 27: Ranked Recall: Efficient Classification by Learning Indices That Rank

Three Smaller Domains

20 categories, 20k instances

Page 28: Ranked Recall: Efficient Classification by Learning Indices That Rank

Three Smaller Domains

104 categories, 10k instances

Page 29: Ranked Recall: Efficient Classification by Learning Indices That Rank

3 Large Data Sets (top-down comparisons)

~500 categories, 20k instances

~12.6k categories, ~370k instances

~14k categories, ~70k instances

Page 30: Ranked Recall: Efficient Classification by Learning Indices That Rank

Accuracy vs. Max Out-Degree

[Figure: accuracy vs. max out-degree allowed, for Web page categorization, Ads, and RCV1]

Page 31: Ranked Recall: Efficient Classification by Learning Indices That Rank

Accuracy vs. Passes and Margin

[Figure: accuracy vs. # passes]

Page 32: Ranked Recall: Efficient Classification by Learning Indices That Rank

Related Work and Discussion

• Multiclass learning/categorization algorithms (top-down, nearest neighbors, perceptron, Naïve Bayes, MaxEnt, SVMs, online methods, ..)

• Speed up methods (trees, indices, …)

• Feature selection/reduction

• Evaluation criteria

• Fast categorization in the natural world

• Prediction games! (see poster)

Page 33: Ranked Recall: Efficient Classification by Learning Indices That Rank

Summary

• A scalable supervised learning method for huge class sets (and instances, ..)

• Idea: learn an index (a sparse weighted bipartite graph, mapping features to categories)

• Online time/memory efficient algorithms

• Current/future: more algorithms, theory, other domains/applications, ..