Multi-core Structural SVM Training
Kai-Wei Chang, Department of Computer Science, University of Illinois at Urbana-Champaign
Joint Work With Vivek Srikumar and Dan Roth


Page 1: Multi-core Structural SVM Training

Multi-core Structural SVM Training

Kai-Wei Chang, Department of Computer Science, University of Illinois at Urbana-Champaign

Joint Work With Vivek Srikumar and Dan Roth

Page 2: Multi-core Structural SVM Training


Decisions are structured in many applications: global decisions in which several local decisions play a role, but there are mutual dependencies on their outcomes. It is essential to make coherent decisions in a way that takes the interdependencies into account.

E.g., part-of-speech tagging (sequential labeling):

Input: a sequence of words. Output: part-of-speech tags {NN, VBZ, ...}

"A cat chases a mouse" => "DT NN VBZ DT NN". The assignment to one tag can depend on both of its neighboring tags. The feature vector is defined on both input and output variables: features can be word-tag pairs (e.g., NN: "cat") or tag bigrams (e.g., VBZ-VBZ).

Motivation
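The joint feature vector for the tagging example above can be sketched as a sparse map of feature counts; the emission and transition feature names here are illustrative, not the exact features used in the talk.

```python
from collections import Counter

def joint_features(words, tags):
    # Sparse phi(x, y): emission features (word, tag) and
    # transition features (previous tag, current tag).
    phi = Counter()
    for word, tag in zip(words, tags):
        phi[("emit", word, tag)] += 1
    for prev, curr in zip(tags, tags[1:]):
        phi[("trans", prev, curr)] += 1
    return phi

words = "A cat chases a mouse".split()
tags = ["DT", "NN", "VBZ", "DT", "NN"]
phi = joint_features(words, tags)
print(phi[("trans", "DT", "NN")])  # 2: the DT->NN transition fires twice
```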

Page 3: Multi-core Structural SVM Training


Inference with General Constraint Structure [Roth & Yih '04, '07]: Recognizing Entities and Relations

"Dole's wife, Elizabeth, is a native of N.C."

[Figure: candidate entities E1-E3, each scored over {other, per, loc} (e.g., other 0.05, per 0.85, loc 0.10), and candidate relations R12, R23, each scored over {irrelevant, spouse_of, born_in} (e.g., irrelevant 0.10, spouse_of 0.05, born_in 0.85).]

Improvement over no inference: 2-5%

Page 4: Multi-core Structural SVM Training


Structured prediction: predicting a structured output variable y based on the input variable x. The output variables form a structure: sequences, clusters, trees, or arbitrary graphs. Structure comes from interactions between the output variables through mutual correlations and constraints.

TODAY: how to efficiently learn models that are used to make global decisions. We focus on training a structural SVM model. Various approaches have been proposed [Joachims et al. 09, Chang and Yih 13, Lacoste-Julien et al. 13], but they are single-threaded. DEMI-DCD: a multi-threaded algorithm for training structural SVM.

Structured Learning and Inference

Page 5: Multi-core Structural SVM Training

Structural SVM: Inference and Learning
DEMI-DCD for Structural SVM
Related Work
Experiments
Conclusions

Outline

Page 6: Multi-core Structural SVM Training

Structural SVM: Inference and Learning
DEMI-DCD for Structural SVM
Related Work
Experiments
Conclusions

Outline

Page 7: Multi-core Structural SVM Training

Inference constitutes predicting the best scoring structure:

y* = argmax_{y in Y(x)} w^T phi(x, y)

where Y(x) is the set of allowed structures (often specified by constraints), w holds the weight parameters (to be estimated during learning), and phi(x, y) is the feature vector defined on the input-output pair.

Efficient inference algorithms have been proposed for some specific structures. An integer linear programming (ILP) solver can deal with general structures.

Structured Prediction: Inference
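As a minimal illustration of the argmax, the sketch below enumerates every tag sequence and scores it with w . phi(x, y); this brute force is exponential in sentence length and only stands in for the Viterbi or ILP inference the talk refers to. The toy weights are assumptions for the example.

```python
from itertools import product

def joint_features(words, tags):
    # Sparse phi(x, y): emission (word, tag) and transition (tag, tag) counts.
    phi = {}
    for word, tag in zip(words, tags):
        phi[("emit", word, tag)] = phi.get(("emit", word, tag), 0) + 1
    for prev, curr in zip(tags, tags[1:]):
        phi[("trans", prev, curr)] = phi.get(("trans", prev, curr), 0) + 1
    return phi

def predict(w, words, tagset):
    # Brute-force argmax_y w . phi(x, y) over all tag sequences.
    best, best_score = None, float("-inf")
    for tags in product(tagset, repeat=len(words)):
        phi = joint_features(words, list(tags))
        s = sum(w.get(f, 0.0) * v for f, v in phi.items())
        if s > best_score:
            best, best_score = list(tags), s
    return best

w = {("emit", "a", "DT"): 1.0, ("emit", "cat", "NN"): 1.0,
     ("trans", "DT", "NN"): 0.5}
print(predict(w, ["a", "cat"], ["DT", "NN", "VBZ"]))  # ['DT', 'NN']
```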

Page 8: Multi-core Structural SVM Training

Given a set of training examples {(x_i, y_i)}, solve the following optimization problem to learn w:

min_{w, xi} 1/2 w^T w + C sum_i ell(xi_i)
s.t. w^T phi(x_i, y_i) - w^T phi(x_i, y) >= Delta(y, y_i) - xi_i,  for all examples i and all feasible structures y

Here w^T phi(x_i, y_i) is the score of the gold structure, w^T phi(x_i, y) is the score of a predicted structure, Delta(y, y_i) is the loss function, and xi_i is a slack variable. w^T phi(x, y) is the scoring function used in inference. We use ell(xi) = xi^2 (L2-loss structural SVM).

Structural SVM
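A minimal sketch of evaluating this L2-loss objective on a toy problem, where the slack xi_i is the violation of the most violated constraint for example i. The indicator features and 0/1 loss are assumptions made for the example, not the talk's actual features.

```python
def dot(w, phi):
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def primal_objective(w, examples, C, phi, loss, structures):
    # 0.5 ||w||^2 + C * sum_i xi_i^2, where xi_i is the margin violation of
    # the most violated constraint for example i (L2-loss structural SVM).
    obj = 0.5 * sum(v * v for v in w.values())
    for x, y_gold in examples:
        gold = dot(w, phi(x, y_gold))
        worst = max(dot(w, phi(x, y)) + loss(y, y_gold) for y in structures(x))
        xi = max(0.0, worst - gold)
        obj += C * xi * xi
    return obj

# Toy instance: two possible structures "A" and "B", indicator features.
phi = lambda x, y: {y: 1.0}
loss = lambda y, y_gold: 0.0 if y == y_gold else 1.0
structures = lambda x: ["A", "B"]
w = {"A": 1.0}
print(primal_objective(w, [(None, "A")], 1.0, phi, loss, structures))  # 0.5
```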

Page 9: Multi-core Structural SVM Training

The dual is a quadratic program with bounded (non-negativity) constraints:

min_{alpha >= 0} 1/2 ||w(alpha)||^2 + 1/(4C) sum_i (sum_y alpha_{i,y})^2 - sum_{i,y} Delta(y, y_i) alpha_{i,y}

where w(alpha) = sum_{i,y} alpha_{i,y} (phi(x_i, y_i) - phi(x_i, y)). Each dual variable alpha_{i,y} corresponds to a different structure y, so the number of variables can be exponentially large. For a linear model, we maintain the relationship between w and alpha throughout the learning process [Hsieh et al. 08].

Dual Problem of Structural SVM

Page 10: Multi-core Structural SVM Training

Maintain an active set A_i of dual variables for each example: identify the alpha_{i,y} that will be non-zero at the end of the optimization process. In a single-thread implementation, training consists of two phases:

Select and maintain A_i (active set selection step): requires solving a loss-augmented inference problem for each example. Solving loss-augmented inferences is usually the bottleneck.
Update the values of alpha (learning step): requires (approximately) solving a sub-problem.

This approach is related to cutting-plane methods.

Active Set

Page 11: Multi-core Structural SVM Training

Structural SVM: Inference and Learning
DEMI-DCD for Structural SVM
Related Work
Experiments
Conclusions

Outline

Page 12: Multi-core Structural SVM Training

DEMI-DCD: Decouple Model-update and Inference with Dual Coordinate Descent. Let T be the number of threads, and split the training data into T - 1 parts.

Active set selection (inference) threads: select and maintain the active set A_i for each example in their data block.
Learning thread: loop over all examples and update the model w.

w and the active sets A_i are shared between threads using shared memory buffers.

Overview of DEMI-DCD
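The decoupling can be pictured with a schematic like the one below, where inference threads grow per-example active sets while a separate learning step reads them; the placeholder updates stand in for actual inference and coordinate descent, and the data layout is an assumption.

```python
import threading

# Schematic of DEMI-DCD's decoupling (not the full algorithm): inference
# threads grow per-example active sets while a learning thread reads them
# and publishes an updated model through a shared buffer.
w_lock, a_lock = threading.Lock(), threading.Lock()
shared_w = {}                            # shared model buffer
active = {i: set() for i in range(4)}    # per-example active sets

def inference_worker(block):
    for i in block:
        with a_lock:
            # placeholder for solving loss-augmented inference on example i
            active[i].add(("y_star", i))

def learning_step():
    with a_lock:
        n_constraints = sum(len(s) for s in active.values())
    with w_lock:
        # placeholder for dual coordinate descent over the active sets
        shared_w["n_active"] = n_constraints

workers = [threading.Thread(target=inference_worker, args=(b,))
           for b in ([0, 1], [2, 3])]
for t in workers:
    t.start()
for t in workers:
    t.join()
learning_step()
print(shared_w["n_active"])  # 4
```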

Page 13: Multi-core Structural SVM Training

Sequentially visit each instance i and each structure y in its active set A_i, and update alpha_{i,y}. To update alpha_{i,y}, solve a one-variable sub-problem; then update alpha_{i,y} and w accordingly. Shrinking heuristic: remove y from A_i if alpha_{i,y} = 0 and the corresponding constraint is satisfied.

Learning Thread
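A minimal sketch of the one-variable sub-problem, assuming the standard projected-Newton coordinate step for the L2-loss dual (in the style of [Hsieh et al. 08]); the exact formula is reconstructed, not taken verbatim from the talk.

```python
def dot(u, v):
    return sum(u.get(f, 0.0) * val for f, val in v.items())

def coordinate_update(w, alpha, alpha_sum, phi_diff, delta, C):
    # One-variable sub-problem for alpha_{i,y} in the L2-loss dual:
    # a Newton step projected onto the bound alpha_{i,y} >= 0.
    # phi_diff = phi(x_i, y_i) - phi(x_i, y); alpha_sum = sum_y alpha_{i,y}.
    grad = dot(w, phi_diff) + alpha_sum / (2.0 * C) - delta
    hess = dot(phi_diff, phi_diff) + 1.0 / (2.0 * C)
    d = max(-alpha, -grad / hess)
    for f, v in phi_diff.items():     # maintain w = sum alpha * phi_diff
        w[f] = w.get(f, 0.0) + d * v
    return alpha + d, alpha_sum + d

w = {}
alpha, alpha_sum = coordinate_update(w, 0.0, 0.0, {"f": 1.0}, 1.0, 0.5)
print(alpha, w["f"])  # 0.5 0.5
```

Updating w incrementally alongside alpha is what lets the learning thread avoid recomputing w(alpha) from scratch.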

Page 14: Multi-core Structural SVM Training

DEMI-DCD requires little synchronization. Only the learning thread can write w, and only one active set selection thread can modify each A_i. Each thread maintains a local copy of w and copies it to/from a shared buffer after a fixed number of iterations.

Synchronization
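The local-copy scheme can be sketched as below: a lock-guarded buffer that threads publish to or fetch from only every `period` iterations, so a reader may see a slightly stale w. The period and the toy update are assumptions for illustration.

```python
import threading

class SharedModel:
    # Shared-memory buffer for w, guarded by a lock.  Threads keep a local
    # copy and touch the buffer only every `period` iterations.
    def __init__(self):
        self.lock = threading.Lock()
        self.buf = {}

    def publish(self, w_local):
        with self.lock:
            self.buf = dict(w_local)

    def fetch(self):
        with self.lock:
            return dict(self.buf)

shared = SharedModel()
w_local, period = {}, 10
for it in range(25):
    w_local["f"] = w_local.get("f", 0.0) + 0.1   # stand-in for a model update
    if (it + 1) % period == 0:
        shared.publish(w_local)                   # learning thread publishes w
snapshot = shared.fetch()                         # an inference thread reads w
```

The snapshot lags the local copy (2.0 vs. 2.5 here), which is the price of the infrequent synchronization.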

Page 15: Multi-core Structural SVM Training

Structural SVM: Inference and Learning
DEMI-DCD for Structural SVM
Related Work
Experiments
Conclusions

Outline

Page 16: Multi-core Structural SVM Training

A Master-Slave architecture (MS-DCD): given p processors, split the data into p parts. At each iteration:

1. The master sends the current model w to the slave threads.
2. Each slave thread solves the loss-augmented inference problems associated with its data block and updates the active set.
3. After all slave threads finish, the master thread updates the model according to the active set.

Implemented in JLIS.

A Parallel Dual Coordinate Descent Algorithm

Page 17: Multi-core Structural SVM Training

Structured Perceptron [Collins 02]: at each iteration, pick an example (x_i, y_i) and find the best structured output y' according to the current model w; then update w with a learning rate eta:

w <- w + eta (phi(x_i, y_i) - phi(x_i, y'))

SP-IPM [McDonald et al. 10]:
1. Split the data into p parts.
2. Train a structured perceptron on each data block in parallel.
3. Mix the models using a linear combination.
4. Repeat Step 2, using the mixed model as the initial model.

Structured Perceptron and its Parallel Version
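The perceptron update and the uniform mixing step can be sketched as follows; the indicator features, label set, and learning rate of 1 are assumptions made for the toy example.

```python
labels = ["A", "B"]
phi = lambda x, y: {(x, y): 1.0}          # toy indicator features

def predict(w, x):
    return max(labels, key=lambda y: w.get((x, y), 0.0))

def perceptron_epoch(w, block):
    # One structured-perceptron pass: w <- w + phi(x, y_gold) - phi(x, y'),
    # using learning rate eta = 1 for simplicity.
    for x, y_gold in block:
        y_pred = predict(w, x)
        if y_pred != y_gold:
            for f, v in phi(x, y_gold).items():
                w[f] = w.get(f, 0.0) + v
            for f, v in phi(x, y_pred).items():
                w[f] = w.get(f, 0.0) - v
    return w

def mix(models):
    # Iterative parameter mixing: uniform linear combination of block models.
    mixed = {}
    for w in models:
        for f, v in w.items():
            mixed[f] = mixed.get(f, 0.0) + v / len(models)
    return mixed

w = perceptron_epoch({}, [("x1", "B")])
print(predict(w, "x1"))                     # B
print(mix([{"f": 1.0}, {"f": 3.0}])["f"])   # 2.0
```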

Page 18: Multi-core Structural SVM Training

Structural SVM: Inference and Learning
DEMI-DCD for Structural SVM
Related Work
Experiments
Conclusions

Outline

Page 19: Multi-core Structural SVM Training

Experiment Settings

POS tagging (POS-WSJ): assign a POS label to each word in a sentence. We use the standard Penn Treebank Wall Street Journal corpus with 39,832 sentences.

Entity and Relation Recognition (Entity-Relation): assign entity types to mentions and identify relations among them; 5,925 training samples. Inference is solved by an ILP solver.

We compare the following methods:
DEMI-DCD: the proposed method.
MS-DCD: a master-slave style parallel implementation of DCD.
SP-IPM: parallel structured perceptron.

Page 20: Multi-core Structural SVM Training

Convergence on Primal Function Value

Relative primal function value difference along training time (log scale), on POS-WSJ and Entity-Relation.

Page 21: Multi-core Structural SVM Training

Test Performance

Test performance along training time on POS-WSJ. SP-IPM converges to a different model.

Page 22: Multi-core Structural SVM Training

Test Performance

Test performance along training time on the Entity-Relation task: Entity F1 and Relation F1.

Page 23: Multi-core Structural SVM Training

Moving Average of CPU Usage

CPU usage along training time on POS-WSJ and Entity-Relation. DEMI-DCD fully utilizes CPU power; for the baselines, CPU usage drops because of synchronization.

Page 24: Multi-core Structural SVM Training

Structural SVM: Inference and Learning
DEMI-DCD for Structural SVM
Related Work
Experiments
Conclusions

Outline

Page 25: Multi-core Structural SVM Training


Conclusion

We proposed DEMI-DCD for training structural SVMs on multi-core machines.

The proposed method decouples the model update and inference phases of learning.

As a result, it can fully utilize all available processors to speed up learning.

Software will be available at: http://cogcomp.cs.illinois.edu/page/software

Thank you.