Soft-Supervised Learning for Text Classification, January 23rd, 2011


Page 1

Soft-Supervised Learning for Text

Classification

January 23rd, 2011

Page 2

Document classification task

We are interested in solving the task of text classification, i.e. automatically assigning a given document to one or more of a fixed set of semantic categories.

In general, text classification is a multi-class problem, where each document may belong to one, many, or none of the categories.

Many algorithms have been proposed to handle this task; some of them are trained in a semi-supervised manner.

Page 3

Semi-supervised learning (SSL)

Semi-supervised learning (SSL) employs a small amount of labeled data together with a relatively large amount of unlabeled data to train classifiers.

Graph-based SSL algorithms are an important class of SSL techniques, where the data (both labeled and unlabeled) is embedded within a low-dimensional manifold expressed by a graph.

Page 4

New approach for graph-based semi-supervised text classification

Previous work on semi-supervised text classification has relied primarily on one-vs-rest approaches to overcome the multi-class nature of the text classification problem. Such an approach may be sub-optimal, since it disregards overlaps between categories.

In order to address this drawback, a new framework is proposed, based on optimizing a loss function composed of Kullback-Leibler divergences between probability distributions defined over each graph vertex.

The use of probability distributions, rather than fixed integer labels, leads to straightforward multi-class generalization.

Page 5

Learning problem definition

D = D_l ∪ D_u is a training set.

D_l = {(x_i, y_i)}, i = 1..l, is the set of labeled samples.

D_u = {x_i}, i = l+1..n, is the set of unlabeled samples.

n is the total number of samples in the training set.

Given D, the graph-based SSL algorithm uses an undirected weighted graph G = (V, E), where V is the set of data points (vertices) and E is the set of undirected edges between vertices.

w_ij denotes the weight of the edge between vertex i and vertex j.

The weights are formed as w_ij = sim(x_i, x_j) · δ(j ∈ K(i)), where K(i) is the set of x_i's k-nearest neighbors (k-NN), and sim(x_i, x_j) is a measure of similarity between x_i and x_j.
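The graph construction described above can be sketched in plain Python. This is a minimal illustration, assuming cosine similarity as the similarity measure sim(·,·); the names cosine_sim and knn_weights are illustrative, not from the paper. Note the resulting weight matrix need not be symmetric, since j may be among i's nearest neighbors but not vice versa.

```python
import math

def cosine_sim(u, v):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_weights(X, k):
    # w[i][j] = sim(x_i, x_j) if j is among the k nearest neighbors of i, else 0.
    n = len(X)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = [(cosine_sim(X[i], X[j]), j) for j in range(n) if j != i]
        sims.sort(reverse=True)          # most similar first
        for s, j in sims[:k]:
            w[i][j] = s
    return w
```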

Page 6

Multi-class SSL optimization procedure

Let's define two sets of probability distributions over the elements of Y: p_i and r_i, where the latter is fixed throughout the training. We set r_i to be uniform over only the possible categories and zero otherwise:

r_i(y) = (1/k_i) · δ(y ∈ Ŷ_i)

where Ŷ_i = {ŷ_i^(1), ..., ŷ_i^(k_i)} ⊆ Y is the set of possible outputs for input x_i.

The goal of the learning is to find the best set of distributions by minimizing the following objective function:

C_1(p) = Σ_{i=1..l} D_KL(r_i || p_i) + μ · Σ_{i=1..n} Σ_{j=1..n} w_ij · D_KL(p_i || p_j) - ν · Σ_{i=1..n} H(p_i)

where P = (p_1, ..., p_n) denotes the entire set of distributions to be learned, H(p_i) = -Σ_y p_i(y) · log p_i(y) is the standard entropy function of p_i, D_KL(p_i || q_j) is the KL-divergence between p_i and q_j, and μ, ν are positive parameters.

The final labels for D_u are then given by ŷ_i = argmax_{y ∈ Y} p_i(y).
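The objective combines a labeled-data KL term, a graph-smoothness KL term, and an entropy regularizer, and can be evaluated directly. A minimal Python sketch, assuming dense list-of-lists inputs; the names kl, entropy, and objective are illustrative, and the eps term guarding log(0) is an implementation detail not in the slides:

```python
import math

def kl(p, q, eps=1e-12):
    # KL divergence D(p || q) between two distributions over a shared label set.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def entropy(p, eps=1e-12):
    # Standard Shannon entropy H(p) = -sum_y p(y) log p(y).
    return -sum(pi * math.log(pi + eps) for pi in p)

def objective(p, r, w, l, mu, nu):
    # C_1(p) = sum_{i<=l} KL(r_i || p_i)
    #        + mu * sum_{i,j} w_ij * KL(p_i || p_j)
    #        - nu * sum_i H(p_i)
    n = len(p)
    labeled = sum(kl(r[i], p[i]) for i in range(l))
    smooth = sum(w[i][j] * kl(p[i], p[j]) for i in range(n) for j in range(n))
    ent = sum(entropy(p[i]) for i in range(n))
    return labeled + mu * smooth - nu * ent
```

With a single labeled point whose one-hot reference is compared against a uniform p_i, the labeled term alone contributes log 2, matching the KL definition.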

Page 7

Analysis of the objective function

The first term penalizes the solution when the p_i of the labeled samples are far from the labeled training data r_i, but it does not insist that p_i = r_i, allowing for deviations from r_i, which can help especially with noisy labels.

The second term penalizes a lack of consistency with the geometry of the data: if w_ij is large, we prefer a solution in which p_i and p_j are close in the KL-divergence sense.

The third term encourages each p_i to be close to the uniform distribution, if this is not contradicted by the first two terms.

C_1(p) = Σ_{i=1..l} D_KL(r_i || p_i) + μ · Σ_{i=1..n} Σ_{j=1..n} w_ij · D_KL(p_i || p_j) - ν · Σ_{i=1..n} H(p_i)

Page 8

Learning with alternating minimization

The problem of minimizing C_1 over the space of collections of probability distributions constitutes a convex programming problem. Thus, our problem has a unique global optimum.

One possible method for reaching this global optimum is to set the derivative of the objective to zero, using Lagrange multipliers to ensure that we stay within the space of probability distributions.

Unfortunately, our problem does not yield a closed-form solution, because the gradient of C_1 with respect to p_i(y) is of the form

a · log p_i(y) + b / p_i(y) + c

where a, b, c are constants with respect to p_i(y), and setting such an expression to zero cannot be solved in closed form.

Thus, we adopt a different strategy based on alternating minimization. This approach has a single additional optimization parameter. It admits a closed-form solution for each iteration and yields guaranteed convergence to the global optimum.

Page 9

Learning with alternating minimization

We define a new objective function to be optimized:

q = (q_1, ..., q_n) is a newly defined set of distributions, one for each of the training samples.

W' = W + α · I_n is a new weight matrix of the same size as the original one, where α ≥ 0. Thus, w'_ij = w_ij + α · δ(i = j).

In the original objective C_1, the self-weights w_ii were irrelevant, since D_KL(p_i || p_i) = 0 for all i; but now we have two distributions p_i and q_i for each training point, which are encouraged (through w'_ii) to approach each other.

C_2(p, q) = Σ_{i=1..l} D_KL(r_i || q_i) + μ · Σ_{i=1..n} Σ_{j=1..n} w'_ij · D_KL(p_i || q_j) - ν · Σ_{i=1..n} H(p_i)

Page 10

Learning with alternating minimization

A key advantage of this relaxed objective is that it is amenable to alternating minimization, i.e. a method that produces a sequence of sets of distributions (p^(0), q^(0)), (p^(1), q^(1)), ... that converges to the minimum of C_2:

p^(n) = argmin_p C_2(p, q^(n-1)),    q^(n) = argmin_q C_2(p^(n), q)

Page 11

The update equations

The update equations are then given by:

p_i^(n)(y) = (1/Z_i) · exp{ β_i^(n-1)(y) / γ_i }

q_i^(n)(y) = [ r_i(y) · δ(i ≤ l) + μ · Σ_j w'_ji · p_j^(n)(y) ] / [ δ(i ≤ l) + μ · Σ_j w'_ji ]

where

γ_i = ν + μ · Σ_j w'_ij,    β_i^(n-1)(y) = μ · Σ_j w'_ij · (log q_j^(n-1)(y) - 1)

and Z_i is a normalization constant that ensures p_i^(n) is a valid probability distribution.

Note that each iteration has a closed-form solution and is relatively simple to implement, even for very large graphs.
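The two alternating updates can be sketched in plain Python. A minimal illustration, assuming dense list-of-lists distributions and weights; the names q_update and p_update are illustrative, and the eps term guarding log(0) is an implementation detail not in the slides:

```python
import math

def q_update(p, r, wp, l, mu):
    # q_i(y) = [ r_i(y)*1(i<=l) + mu * sum_j w'_ji * p_j(y) ] /
    #          [ 1(i<=l) + mu * sum_j w'_ji ]
    n, m = len(p), len(p[0])
    q = []
    for i in range(n):
        lab = 1.0 if i < l else 0.0
        denom = lab + mu * sum(wp[j][i] for j in range(n))
        q.append([(lab * (r[i][y] if i < l else 0.0)
                   + mu * sum(wp[j][i] * p[j][y] for j in range(n))) / denom
                  for y in range(m)])
    return q

def p_update(q, wp, mu, nu, eps=1e-12):
    # p_i(y) proportional to exp{ beta_i(y) / gamma_i }, where
    # gamma_i = nu + mu * sum_j w'_ij and
    # beta_i(y) = mu * sum_j w'_ij * (log q_j(y) - 1).
    n, m = len(q), len(q[0])
    p = []
    for i in range(n):
        gamma = nu + mu * sum(wp[i])
        beta = [mu * sum(wp[i][j] * (math.log(q[j][y] + eps) - 1)
                         for j in range(n)) for y in range(m)]
        unnorm = [math.exp(b / gamma) for b in beta]
        z = sum(unnorm)          # the normalization constant Z_i
        p.append([u / z for u in unnorm])
    return p
```

Iterating q_update and p_update from a uniform initialization of p converges to the global optimum of the relaxed objective; each step only needs sums over a vertex's neighbors, which is why it scales to very large graphs.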

Page 12

Conclusions

The paper introduces a new graph-based semi-supervised learning approach for document classification. This approach uses Kullback-Leibler divergences between probability distributions in the objective function instead of Euclidean distances, thus allowing each document to be a member of multiple classes.

Since the derivative of the objective does not yield a closed-form solution, an alternating minimization approach was proposed. This approach admits a closed-form and relatively simple solution for each iteration.

Page 13

Any questions?