CS583 Partially Supervised Learning


  • 8/2/2019 CS583 Partially Supervised Learning


    Chapter 5: Partially-Supervised Learning


    CS583, Bing Liu, UIC 2

    Outline

    Fully supervised learning (traditional classification)

    Partially (semi-) supervised learning (or classification):

    Learning with a small set of labeled examples and a large set of unlabeled examples (LU learning)

    Learning with positive and unlabeled examples, i.e., no labeled negative examples (PU learning)


    Learning from a small labeled set and a large unlabeled set (LU learning)


    Unlabeled Data

    One of the bottlenecks of classification is the labeling of a large set of examples (data records or text documents).

    Labeling is often done manually and is time consuming.

    Can we label only a small number of examples and make use of a large number of unlabeled examples to learn?

    This is possible in many cases.


    Why are unlabeled data useful?

    Unlabeled data are usually plentiful; labeled data are expensive.

    Unlabeled data provide information about the joint probability distribution over words and collocations (in texts).

    We will use text classification to study this problem.


    [Figure: labeled documents DocNo: k, n, m (ClassLabel: Positive), each containing the word "homework"; unlabeled documents DocNo: x, z, y, each containing both "homework" and "lecture".]

    Documents containing "homework" tend to belong to the positive class.


    Incorporating Unlabeled Data with EM (Nigam et al., 2000)

    Basic EM

    Augmented EM with weighted unlabeled data

    Augmented EM with multiple mixture components per class


    Algorithm Outline

    1. Train a classifier with only the labeled

    documents.

    2. Use it to probabilistically classify the

    unlabeled documents.

    3. Use ALL the documents to train a new

    classifier.

    4. Iterate steps 2 and 3 to convergence.
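The four steps above can be sketched with a naive Bayes base learner. This is a minimal illustration, not the exact implementation of Nigam et al. (2000): the function name `nb_em` and the use of scikit-learn's MultinomialNB with per-example sample weights are my own choices.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def nb_em(X_l, y_l, X_u, n_iter=10):
    """EM with labeled (X_l, y_l) and unlabeled (X_u) word-count vectors."""
    clf = MultinomialNB()
    clf.fit(X_l, y_l)                              # 1. train on labeled docs only
    for _ in range(n_iter):
        p_u = clf.predict_proba(X_u)               # 2. probabilistic labels (E step)
        # 3. retrain on ALL docs (M step): each unlabeled doc contributes
        #    to every class, weighted by its posterior probability
        X_all = np.vstack([X_l] + [X_u] * len(clf.classes_))
        y_all = np.concatenate(
            [y_l] + [np.full(len(X_u), c) for c in clf.classes_])
        w_all = np.concatenate(
            [np.ones(len(X_l))] + [p_u[:, j] for j in range(len(clf.classes_))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf                                     # 4. iterate toward convergence
```

Duplicating each unlabeled document once per class with its posterior as the sample weight is one standard way to realize fractional (soft) counts with a hard-label API.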


    Basic Algorithm


    Basic EM: E Step & M Step

    E Step (posterior of class cj given document di, where wdi,k is the k-th word of di):

    Pr(cj | di) = Pr(cj) Prod_k Pr(wdi,k | cj) / Sum_r Pr(cr) Prod_k Pr(wdi,k | cr)

    M Step (re-estimate the parameters; Nti is the number of times word wt occurs in di):

    Pr(wt | cj) = (1 + Sum_i Nti Pr(cj | di)) / (|V| + Sum_s Sum_i Nsi Pr(cj | di))

    Pr(cj) = Sum_i Pr(cj | di) / |D|


    The problem

    It has been shown that the EM algorithm in Fig. 5.1 works well if the two mixture model assumptions hold for a particular data set.

    The two mixture model assumptions, however, can cause major problems when they do not hold. In many real-life situations, they may be violated.

    It is often the case that a class (or topic) contains a number of sub-classes (or sub-topics). For example, the class Sports may contain documents about different sub-classes of sports: Baseball, Basketball, Tennis, and Softball.

    Some methods to deal with the problem follow.


    Weighting the influence of unlabeled examples by a factor

    New M step: each count contributed by an unlabeled document is multiplied by a weight between 0 and 1 before it enters the parameter estimates, so unlabeled documents influence the model less than labeled ones.

    The prior probability also needs to be weighted.


    Experimental Evaluation

    Newsgroup postings: 20 newsgroups, 1000 postings per group.

    Web page classification: student, faculty, course, project; 4,199 web pages.

    Reuters newswire articles: 12,902 articles, 10 main topic categories.


    20 Newsgroups



    Another approach: Co-training

    Again, learning with a small labeled set and a large unlabeled set.

    The attributes describing each example or instance can be partitioned into two subsets, each of which is sufficient for learning the target function. E.g., hyperlinks and page contents in Web page classification.

    Two classifiers can thus be learned from the same data.


    Co-training Algorithm [Blum and Mitchell, 1998]

    Given: labeled data L, unlabeled data U

    Loop:
      Train h1 (e.g., hyperlink classifier) using L
      Train h2 (e.g., page classifier) using L
      Allow h1 to label p positive and n negative examples from U
      Allow h2 to label p positive and n negative examples from U
      Add these most confident self-labeled examples to L
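The loop above can be sketched compactly. Two assumptions to flag: scikit-learn's MultinomialNB stands in for the hyperlink and page classifiers, and when both classifiers pick the same unlabeled example, the first proposal's label wins.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(L1, L2, y, U1, U2, p=1, n=1, rounds=5):
    """Co-training: L1/L2 are the two views of the labeled data,
    U1/U2 the two views of the unlabeled data; labels y are 0/1."""
    L1, L2, y = list(L1), list(L2), list(y)
    U1, U2 = list(U1), list(U2)
    for _ in range(rounds):
        if not U1:
            break
        h1 = MultinomialNB().fit(L1, y)   # e.g., hyperlink classifier
        h2 = MultinomialNB().fit(L2, y)   # e.g., page classifier
        # each classifier proposes its p most confident positives and
        # n most confident negatives from U (first proposal wins on ties)
        proposals = {}
        for h, view in ((h1, U1), (h2, U2)):
            by_conf = np.argsort(h.predict_proba(view)[:, 1])
            for i in by_conf[-p:]:
                proposals.setdefault(int(i), 1)
            for i in by_conf[:n]:
                proposals.setdefault(int(i), 0)
        # move the self-labeled examples (both views) into L,
        # popping from the back so earlier indices stay valid
        for i in sorted(proposals, reverse=True):
            L1.append(U1.pop(i)); L2.append(U2.pop(i)); y.append(proposals[i])
    return MultinomialNB().fit(L1, y), MultinomialNB().fit(L2, y)
```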


    Co-training: Experimental Results

    Begin with 12 labeled web pages (academic course); provide 1,000 additional unlabeled web pages.

    Average error learning from labeled data only: 11.1%; average error with co-training: 5.0%.

                          Page-based    Link-based    Combined
                          classifier    classifier    classifier
    Supervised training      12.9          12.4          11.1
    Co-training               6.2          11.6           5.0


    When the generative model is not suitable

    Multiple Mixture Components per Class (M-EM): e.g., a class may consist of a number of sub-topics or clusters.

    Results of an example using 20 newsgroup data (40 labeled, 2360 unlabeled, 1600 test):
      Accuracy: NB 68%; EM 59.6%.

    Solutions:
      M-EM (Nigam et al., 2000): cross-validation on the training data to determine the number of components.
      Partitioned-EM (Cong et al., 2004): uses hierarchical clustering. It does significantly better than M-EM.


    Summary

    Using unlabeled data can improve the accuracy of a classifier when the data fit the generative model.

    Partitioned-EM and the EM classifier based on the multiple mixture components model (M-EM) are more suitable for real data when multiple mixture components are in one class.

    Co-training is another effective technique when redundantly sufficient features are available.


    Learning from Positive and Unlabeled Examples (PU learning)


    Learning from Positive & Unlabeled data

    Positive examples: one has a set of examples of a class P.

    Unlabeled set: one also has a set U of unlabeled (or mixed) examples, with instances from P and also not from P (negative examples).

    Build a classifier: build a classifier to classify the examples in U and/or future (test) data.

    Key feature of the problem: no labeled negative training data.

    We call this problem PU-learning.


    Applications of the problem

    With the growing volume of online texts available through the Web and digital libraries, one often wants to find those documents that are related to one's work or one's interest.

    For example, given an ICML proceedings, find all machine learning papers from AAAI, IJCAI, and KDD. There is no labeling of negative examples from each of these collections.

    Similarly, given one's bookmarks (positive documents), identify those documents that are of interest to him/her from Web sources.


    Direct Marketing

    A company has a database with details of its customers: positive examples, but no information on those who are not its customers, i.e., no negative examples.

    It wants to find people who are similar to its customers for marketing.

    It can buy a database consisting of details of people, some of whom may be potential customers: unlabeled examples.


    Are Unlabeled Examples Helpful?

    The target function is known to be either x1 < 0 or x2 > 0. Which one is it?

    [Figure: positive examples (+) and unlabeled examples (u) plotted against the two candidate boundaries x1 < 0 and x2 > 0.]

    Not learnable with only positive examples. However, the addition of unlabeled examples makes it learnable.


    Theoretical foundations

    (X, Y): X is the input vector, Y in {1, -1} is the class label. f is the classification function.

    We rewrite the probability of error:

    Pr[f(X) != Y] = Pr[f(X) = 1 and Y = -1] + Pr[f(X) = -1 and Y = 1]   (1)

    We have:

    Pr[f(X) = 1 and Y = -1]
      = Pr[f(X) = 1] - Pr[f(X) = 1 and Y = 1]
      = Pr[f(X) = 1] - (Pr[Y = 1] - Pr[f(X) = -1 and Y = 1]).

    Plugging this into (1), we obtain:

    Pr[f(X) != Y] = Pr[f(X) = 1] - Pr[Y = 1] + 2 Pr[f(X) = -1 | Y = 1] Pr[Y = 1]   (2)


    Theoretical foundations (cont.)

    Pr[f(X) != Y] = Pr[f(X) = 1] - Pr[Y = 1] + 2 Pr[f(X) = -1 | Y = 1] Pr[Y = 1]   (2)

    Note that Pr[Y = 1] is constant.

    If we can hold Pr[f(X) = -1 | Y = 1] small, then learning is approximately the same as minimizing Pr[f(X) = 1].

    Holding Pr[f(X) = -1 | Y = 1] small while minimizing Pr[f(X) = 1] is approximately the same as minimizing PrU[f(X) = 1] while holding PrP[f(X) = 1] >= r (where r is the recall Pr[f(X) = 1 | Y = 1]), which is the same as holding PrP[f(X) = -1] <= 1 - r, if the set of positive examples P and the set of unlabeled examples U are large enough.

    Theorems 1 and 2 in [Liu et al., 2002] state these formally in the noiseless case and in the noisy case.


    Put it simply

    This is a constrained optimization problem. A reasonably good generalization (learning) result can be achieved if the algorithm tries to minimize the number of unlabeled examples labeled as positive, subject to the constraint that the fraction of errors on the positive examples is no more than 1 - r.


    An illustration

    Assume a linear classifier. Line 3 is the best solution.


    Existing 2-step strategy

    Step 1: Identify a set of reliable negative documents from the unlabeled set.
      S-EM [Liu et al., 2002] uses a Spy technique,
      PEBL [Yu et al., 2002] uses a 1-DNF technique,
      Roc-SVM [Li & Liu, 2003] uses the Rocchio algorithm.

    Step 2: Build a sequence of classifiers by iteratively applying a classification algorithm and then selecting a good classifier.
      S-EM uses the Expectation Maximization (EM) algorithm, with an error-based classifier selection mechanism.
      PEBL uses SVM, and gives the classifier at convergence, i.e., no classifier selection.
      Roc-SVM uses SVM with a heuristic method for selecting the final classifier.


    [Diagram: Step 1 uses the positive set P to split the unlabeled set U into a set of reliable negatives RN and the remainder Q = U - RN. Step 2 then either uses P, RN and Q to build the final classifier iteratively, or uses only P and RN to build a classifier.]


    Step 1: The Spy technique

    Sample a certain percentage of positive examples and put them into the unlabeled set to act as spies.

    Run a classification algorithm assuming all unlabeled examples are negative; through the spies, we then know how the actual positive examples in the unlabeled set behave.

    We can then extract reliable negative examples from the unlabeled set more accurately.
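A sketch of this spy step over word-count vectors. Hedged assumptions: the 15% default spy rate, MultinomialNB as the classifier, and taking the minimum spy posterior as the threshold are illustrative choices (S-EM itself uses a noise-tolerant threshold).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def spy_reliable_negatives(P, U, spy_frac=0.15, rng=None):
    """Return the rows of U taken as reliable negatives via the spy trick."""
    rng = np.random.default_rng(rng)
    n_spy = max(1, int(len(P) * spy_frac))
    spy_idx = rng.choice(len(P), size=n_spy, replace=False)
    spies = P[spy_idx]                                # positives sent into U
    P_rest = np.delete(P, spy_idx, axis=0)
    # train treating everything in U (including the spies) as negative
    X = np.vstack([P_rest, U, spies])
    y = np.concatenate([np.ones(len(P_rest)), np.zeros(len(U) + n_spy)])
    clf = MultinomialNB().fit(X, y)
    # the spies reveal how true positives hidden in U behave; unlabeled
    # docs scoring below every spy are taken as reliable negatives
    threshold = clf.predict_proba(spies)[:, 1].min()
    return U[clf.predict_proba(U)[:, 1] < threshold]
```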


    Step 1: Other methods

    The 1-DNF method:
      Find the set of words W that occur in the positive documents more frequently than in the unlabeled set.
      Extract those documents from the unlabeled set that do not contain any word in W. These documents form the reliable negative documents.

    The Rocchio method from information retrieval.

    The Naïve Bayesian method.
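The 1-DNF idea fits in a few lines over tokenized documents. One assumption to note: "occurs more frequently in the positive documents" is read here as a higher document frequency, which is one simple interpretation.

```python
from collections import Counter

def one_dnf_reliable_negatives(pos_docs, unl_docs):
    """1-DNF: build W = words with higher document frequency in the
    positive set than in the unlabeled set, then keep the unlabeled
    docs containing no word of W as reliable negatives."""
    def doc_freq(docs):
        counts = Counter(w for d in docs for w in set(d))
        return {w: c / len(docs) for w, c in counts.items()}
    fp, fu = doc_freq(pos_docs), doc_freq(unl_docs)
    W = {w for w, f in fp.items() if f > fu.get(w, 0.0)}
    return [d for d in unl_docs if not W.intersection(d)]
```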


    Step 2: Running EM or SVM iteratively

    (1) Run a classification algorithm iteratively:
      Run EM using P, RN and Q until it converges, or
      Run SVM iteratively using P, RN and Q until no document from Q can be classified as negative; RN and Q are updated in each iteration.

    (2) Classifier selection.
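The iterative-SVM variant of Step 2 can be sketched as follows; the function name is mine and LinearSVC stands in for whatever SVM implementation is used, without the classifier-selection part.

```python
import numpy as np
from sklearn.svm import LinearSVC

def step2_svm_iterative(P, RN, Q, max_rounds=20):
    """Train SVM on P (positive) vs RN (negative); after each round, move
    the docs in Q that the classifier labels negative into RN; stop when
    no document from Q is classified as negative."""
    P, RN, Q = (np.asarray(a, dtype=float) for a in (P, RN, Q))
    clf = None
    for _ in range(max_rounds):
        X = np.vstack([P, RN])
        y = np.concatenate([np.ones(len(P)), -np.ones(len(RN))])
        clf = LinearSVC().fit(X, y)
        if len(Q) == 0:
            break
        neg = clf.predict(Q) == -1
        if not neg.any():          # nothing in Q classified negative: converged
            break
        RN = np.vstack([RN, Q[neg]])
        Q = Q[~neg]
    return clf
```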


    Do they follow the theory?

    Yes, as heuristic methods, because:
      Step 1 tries to find some initial reliable negative examples from the unlabeled set.
      Step 2 tries to identify more and more negative examples iteratively.

    The two steps together form an iterative strategy of increasing the number of unlabeled examples that are classified as negative while keeping the positive examples correctly classified.


    Can SVM be applied directly?

    Can we use SVM to deal with the problem of learning with positive and unlabeled examples directly, without using two steps?

    Yes, with a little re-formulation.


    Support Vector Machines

    Support vector machines (SVM) are linear functions of the form f(x) = wTx + b, where w is the weight vector and x is the input vector.

    Let the set of training examples be {(x1, y1), (x2, y2), ..., (xn, yn)}, where xi is an input vector and yi is its class label, yi in {1, -1}.

    To find the linear function:

    Minimize:   (1/2) wTw
    Subject to: yi(wTxi + b) >= 1,  i = 1, 2, ..., n


    Soft margin SVM

    To deal with cases where there may be no separating hyperplane due to noisy labels of both positive and negative training examples, the soft margin SVM is proposed:

    Minimize:   (1/2) wTw + C Sum_{i=1..n} xi_i
    Subject to: yi(wTxi + b) >= 1 - xi_i,  xi_i >= 0,  i = 1, 2, ..., n

    where C >= 0 is a parameter that controls the amount of training errors allowed.


    Biased SVM (noiseless case)

    Assume that the first k-1 examples are positive examples (labeled 1), while the rest are unlabeled examples, which we label negative (-1).

    Minimize:   (1/2) wTw + C Sum_{i=k..n} xi_i
    Subject to: wTxi + b >= 1,             i = 1, 2, ..., k-1
                -(wTxi + b) >= 1 - xi_i,   i = k, k+1, ..., n
                xi_i >= 0,                 i = k, k+1, ..., n


    Biased SVM (noisy case)

    If we also allow the positive set to have some noisy negative examples, then we have:

    Minimize:   (1/2) wTw + C+ Sum_{i=1..k-1} xi_i + C- Sum_{i=k..n} xi_i
    Subject to: yi(wTxi + b) >= 1 - xi_i,  xi_i >= 0,  i = 1, 2, ..., n

    This turns out to be the same as the asymmetric cost SVM for dealing with unbalanced data. Of course, we have a different motivation.
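Because the noisy-case Biased SVM coincides with the asymmetric-cost (class-weighted) soft-margin SVM, it can be sketched with an off-the-shelf class-weighted SVM. The C+ and C- defaults below are arbitrary illustrations, not tuned values, and LinearSVC is a stand-in for the actual solver used.

```python
import numpy as np
from sklearn.svm import LinearSVC

def biased_svm(P, U, c_pos=10.0, c_neg=0.5):
    """Biased SVM: treat every unlabeled example as negative, but make
    errors on the positives (C+) far more expensive than errors on the
    unlabeled examples (C-)."""
    X = np.vstack([P, U]).astype(float)
    y = np.concatenate([np.ones(len(P)), -np.ones(len(U))])
    # class_weight multiplies C per class, giving the asymmetric-cost
    # objective: C+ on the positives, C- on the unlabeled "negatives"
    clf = LinearSVC(C=1.0, class_weight={1: c_pos, -1: c_neg})
    return clf.fit(X, y)
```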


    Estimating performance

    We need to estimate the performance in order to select the parameters.

    Since learning from positive and unlabeled examples often arises in retrieval situations, we use the F score as the classification performance measure: F = 2pr / (p + r) (p: precision, r: recall).

    To get a high F score, both precision and recall have to be high.

    However, without labeled negative examples, we do not know how to estimate the F score.


    A performance criterion

    Performance criterion pr/Pr[Y=1]: it can be estimated directly from the validation set as r2/Pr[f(X) = 1].
      Recall r = Pr[f(X) = 1 | Y = 1]
      Precision p = Pr[Y = 1 | f(X) = 1]

    To see this:

    Pr[f(X) = 1 | Y = 1] Pr[Y = 1] = Pr[Y = 1 | f(X) = 1] Pr[f(X) = 1],

    i.e., r Pr[Y = 1] = p Pr[f(X) = 1]. Multiplying both sides by r and dividing by Pr[Y = 1] Pr[f(X) = 1] gives

    pr / Pr[Y = 1] = r2 / Pr[f(X) = 1].

    Its behavior is similar to that of the F-score (= 2pr / (p + r)).


    A performance criterion (cont.)

    In r2/Pr[f(X) = 1]:
      r can be estimated from the positive examples in the validation set.
      Pr[f(X) = 1] can be obtained using the full validation set.

    This criterion actually reflects the theory very well.
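The criterion is straightforward to compute; a small sketch (the function name is mine), given hard +1/-1 predictions on the validation positives and on the whole validation set:

```python
import numpy as np

def pu_criterion(pred_on_val_pos, pred_on_val_all):
    """Estimate pr/Pr[Y=1] as r^2 / Pr[f(X)=1]: r from the positive
    validation examples, Pr[f(X)=1] from the whole validation set.
    Predictions are +1 / -1 labels."""
    r = np.mean(np.asarray(pred_on_val_pos) == 1)     # recall estimate
    pf1 = np.mean(np.asarray(pred_on_val_all) == 1)   # Pr[f(X)=1] estimate
    return 0.0 if pf1 == 0 else r * r / pf1
```

A larger value indicates a better parameter setting, mirroring how a higher F score would be read if negatives were available.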


    Empirical Evaluation

    Two-step strategy: we implemented a benchmark system, called LPU, which is available at http://www.cs.uic.edu/~liub/LPU/LPU-download.html

    Step 1: Spy, 1-DNF, Rocchio, Naïve Bayesian (NB).

    Step 2:
      EM with classifier selection.
      SVM: run SVM once.
      SVM-I: run SVM iteratively and give the converged classifier.
      SVM-IS: run SVM iteratively with classifier selection.

    Biased-SVM (we used the SVMlight package).


    [Table 1: Average F scores on the Reuters collection. Table 2: Average F scores on the 20 Newsgroup collection. Each table crosses the Step-1 methods (1-DNF, Spy, Rocchio, NB) with the Step-2 methods (EM, SVM, SVM-I, SVM-IS), alongside S-EM, PEBL and Roc-SVM; rows range over positive-class proportions 0.1 to 0.9.]


    Results of Biased SVM

    Table 3: Average F scores on the two collections

                          Biased-SVM    Previous best F score
    Reuters        0.3       0.785           0.780
                   0.7       0.856           0.845
    20 Newsgroup   0.3       0.742           0.689
                   0.7       0.805           0.774


    Summary

    Gave an overview of the theory on learning with positive and unlabeled examples.

    Described the existing two-step strategy for learning.

    Presented a more principled approach to solve the problem, based on a biased SVM formulation.

    Presented a performance measure pr/Pr(Y=1) that can be estimated from data.

    Experimental results using text classification show the superior classification power of Biased-SVM.

    Some more experimental work is being performed to compare Biased-SVM with the weighted logistic regression method [Lee & Liu, 2003].