Hands-on Classification with Learning Based Java
Gourab Kundu
Adapted from a talk by Vivek Srikumar
Goals of this tutorial
At the end of these lectures, you will be able to
1. Get started with Learning Based Java
2. Use a generic, black box text classifier for different applications
…and write your own text classifier, if needed
3. Understand how features can impact the classifier performance
… and add features to improve your application
4. Build a badge classifier based on character features
A Quick Recap
Given: Examples (x,f(x)) of some unknown function f
Find: A good approximation of f
x provides some representation of the input
The process of mapping a domain element into a representation is called Feature Extraction. (Hard; ill-understood; important)
x ∈ {0,1}^n or x ∈ R^n
The target function (label)
f(x) ∈ {-1,+1}: Binary classification
f(x) ∈ {1, 2, 3, ..., k-1}: Multi-class classification
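As a minimal sketch of feature extraction into x ∈ {0,1}^n, the Java class below maps a text to a binary vector over a fixed vocabulary. The class name `BinaryFeatureExtractor`, the whitespace tokenization, and the three-word vocabulary are illustrative assumptions, not part of the lecture material.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: map a text to a binary feature vector x in {0,1}^n. */
final class BinaryFeatureExtractor {
    // Hypothetical vocabulary, fixed ahead of time; n = vocabulary.size().
    private final List<String> vocabulary;

    BinaryFeatureExtractor(List<String> vocabulary) {
        this.vocabulary = vocabulary;
    }

    /** x[i] = 1 if the i-th vocabulary word occurs in the text, else 0. */
    int[] extract(String text) {
        // Assumption: lowercase and split on whitespace; real tokenizers do more.
        List<String> tokens = Arrays.asList(text.toLowerCase().split("\\s+"));
        int[] x = new int[vocabulary.size()];
        for (int i = 0; i < x.length; i++) {
            x[i] = tokens.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return x;
    }

    public static void main(String[] args) {
        BinaryFeatureExtractor extractor =
            new BinaryFeatureExtractor(Arrays.asList("save", "software", "meeting"));
        int[] x = extractor.extract("Save over 70 % on name brand software");
        System.out.println(Arrays.toString(x)); // prints [1, 1, 0]
    }
}
```

Replacing the 0/1 entries with word counts would give a vector in R^n instead.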
What is text classification?
[Figure: a document is fed to a classifier (black box), which marks each of some candidate labels with ✓ or ✗]
Several applications fit this framework
Spam detection
Sentiment classification
What else can you do, if you had such a black box system that can classify text?
Try to spend 30 seconds brainstorming
Outline of this session
Getting started with LBJ
Writing our first classifier: Spam/Ham
Playing with features
Looking inside the black box classifier for feature weights
LEARNING BASED JAVA
Writing classifiers
What is Learning Based Java?
A modeling language for learning and inference
Supports
  Programming using learned models
  High level specification of features and constraints between classifiers
  Inference with constraints
  Different learning algorithms
The learning operator
  Classifiers are functions defined in terms of data
  Learning happens at compile time
What does LBJ do for you?
Abstracts away the feature representation, learning and inference
Allows you to write learning based programs
Application developers can reason about the application at hand
Demo
A learning based program
First, we will write an application that assumes the existence of a black box classifier
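The idea of programming against a black box can be sketched in plain Java. This is not LBJ code: `InboxFilter`, `keepHam`, and the label strings "spam" and "ham" are hypothetical names chosen for illustration, and the lambda in `main` is only a stand-in for a learned classifier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Sketch: an application that only assumes some text -> label function exists. */
final class InboxFilter {
    /** Keeps the messages that the black box does not label "spam". */
    static List<String> keepHam(List<String> messages, Function<String, String> classifier) {
        List<String> kept = new ArrayList<>();
        for (String message : messages) {
            if (!"spam".equals(classifier.apply(message))) {
                kept.add(message);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in black box (an assumption); a learned classifier would go here.
        Function<String, String> blackBox =
            m -> m.contains("software") ? "spam" : "ham";
        List<String> inbox = Arrays.asList(
            "save over 70 % on name brand software",
            "please keep in touch");
        System.out.println(keepHam(inbox, blackBox)); // prints [please keep in touch]
    }
}
```

The application code never looks at features or weights, so the classifier can be swapped without touching it.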
SPAM DETECTION
Spam detection
Which of these (if any) are email spam?
Subject: save over 70 % on name brand software
ppharmacy devote fink tungstate brown lexicon pawnshop crescent railroad distaff cytosine barium cain application elegy donnelly hydrochloride common embargo shakespearean bassett trustee nucleolus chicano narbonne telltale tagging swirly lank delphinus bragging bravery cornea asiatic susanne
Subject: please keep in touch
just like to say that it has been great meeting and working with you all . i will be leaving enron effective july 5 th to do investment banking in hong kong . i will initially be based in new york and will be moving to hong kong after a few months . do contact me when you are in the vicinity .
How do you know?
What do we need to build a classifier?
1. Annotated documents*
2. A feature representation of the documents
3. A learning algorithm
* Here we are dealing with supervised learning
Our first LBJ program
/** A learned text classifier; its definition comes from data. */
discrete TextClassifier(Document d) <-
learn TextLabel
  using WordFeatures
  from new DocumentReader("data/spam/train") 5 rounds
  with SparseAveragedPerceptron {
    learningRate = 0.1;
    thickness = 3.5;
  }
  testFrom new DocumentReader("data/spam/test")
end
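Defines a classifier: discrete TextClassifier
The object being classified: Document d
The function being learned: TextLabel
The feature representation: WordFeatures
The source of the training data: new DocumentReader("data/spam/train")
The learning algorithm: SparseAveragedPerceptron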
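To give a feel for what the `learningRate` and `thickness` parameters might control, here is a small, self-contained Java sketch of an averaged perceptron with a thick separator. It is an illustration written for this tutorial, not LBJ's SparseAveragedPerceptron implementation; the class name, method names, and the choice to treat features as active strings are all assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of an averaged perceptron with a thick separator (not LBJ's code). */
final class ThickAveragedPerceptron {
    private final double learningRate;
    private final double thickness;
    private final Map<String, Double> weights = new HashMap<>(); // current weights
    private final Map<String, Double> summed = new HashMap<>();  // sum of weights over steps
    private long steps = 0;

    ThickAveragedPerceptron(double learningRate, double thickness) {
        this.learningRate = learningRate;
        this.thickness = thickness;
    }

    private static double score(Map<String, Double> w, List<String> features) {
        double s = 0.0;
        for (String f : features) {
            s += w.getOrDefault(f, 0.0);
        }
        return s;
    }

    /** One pass over the data; each label is +1 or -1. */
    void trainOneRound(List<List<String>> examples, int[] labels) {
        for (int i = 0; i < examples.size(); i++) {
            List<String> x = examples.get(i);
            int y = labels[i];
            // Update on mistakes and on correct predictions inside the margin.
            if (y * score(weights, x) <= thickness) {
                for (String f : x) {
                    weights.merge(f, learningRate * y, Double::sum);
                }
            }
            // Accumulate the current weights so they can be averaged later.
            for (Map.Entry<String, Double> e : weights.entrySet()) {
                summed.merge(e.getKey(), e.getValue(), Double::sum);
            }
            steps++;
        }
    }

    /** Averaged weight of one feature: a peek inside the black box. */
    double averagedWeight(String feature) {
        return steps == 0 ? 0.0 : summed.getOrDefault(feature, 0.0) / steps;
    }

    /** Predicts +1 or -1; dividing by steps would not change the sign. */
    int predict(List<String> features) {
        return score(summed, features) >= 0 ? 1 : -1;
    }
}
```

Calling `trainOneRound` five times corresponds to the `5 rounds` setting in the LBJ program above, under the assumption that a round is one pass over the training data.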
Demo
Let’s build a spam detector
How to train?
How do different learning algorithms perform? Does this choice matter much?
Features
Our current spam detector uses words as features
Can we do better?
Let’s try it out
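One candidate improvement over single words is to add pairs of adjacent words (bigrams). The Java sketch below shows the idea; the class name, the `w=` and `bg=` feature prefixes, and the whitespace tokenization are assumptions made for illustration, and whether bigrams actually help on the spam data is something to check empirically.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: word features plus adjacent-word (bigram) features. */
final class WordAndBigramFeatures {
    static List<String> extract(String text) {
        List<String> features = new ArrayList<>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return features; // no tokens, no features
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            features.add("w=" + tokens[i]); // single-word feature
            if (i + 1 < tokens.length) {
                // Bigram feature: this word together with the next one.
                features.add("bg=" + tokens[i] + "_" + tokens[i + 1]);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(extract("please keep in touch"));
    }
}
```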
MORE TEXT CLASSIFICATION
Sentiment classification
Which of these product reviews is positive?
I recently made the switch from PC to Mac, and I can say that I'm not sure why I waited so long. Considering that I have only had my computer a few weeks I can't say much about the durability and longevity of the hardware, but I can say that the operating system (mine shipped with Lion) and software is top notch.
I've been an Apple user for a long time, but my most recent MacBook Pro purchase has convinced me to reconsider. I've had several hardware issues, including a failed keyboard, battery failure, and a bad DVD drive. Now, the backlight on the display fails to turn on when waking from sleep
How do you know?
Classifying news groups
Which mailing list should this message be posted to?
I am looking for Quick C or Microsoft C code for image decoding from file for VGA viewing and saving images from/to GIF, TIFF, PCX, or JPEG format. I have scoured the Internet, but its like trying to find a Dr. Seuss spell checker TSR. It must be out there, and there's no need to reinvent the wheel.
alt.atheism
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
misc.forsale
rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
sci.crypt, sci.electronics, sci.med, sci.space
soc.religion.christian
talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc
How do you know?
Demo
Converting our spam classifier into
  A sentiment classifier
  A newsgroup classifier
Note: How different are these at the implementation level?
Most of the engineering lies in the features
[Figure: a document, some labels, and a classifier (black box), as in "What is text classification?"]
Summary
What is LBJ? How do we use it?
Writing a simple spam detector
Playing with features
How much do we need to change to move to a different application?
Assignment before Next Class (Not Graded)
Download the code & data (http://l2r.cs.uiuc.edu/~danr/Teaching/CS446-12/handsonclassification.html) for this class and play with it
Try to solve the Badges game puzzle with LBJ
  Think about what features are needed
  Write a parser for reading the data
  Write a classifier for solving the puzzle
Next Class
We will solve the Badges Game puzzle by Machine Learning
We will look at more text classification examples
We will think about a famous people classifier
Questions
Badge Classifier
Brainstorm the possible features
  Characters in the entire name
  Two consecutive characters
  Character as vowel, character as consonant
  …
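The brainstormed character features could be extracted along the following lines. This is a sketch under assumptions: the class name `BadgeFeatures`, the feature name prefixes, and the decisions to lowercase the name, count positions over letters only, and not form pairs across spaces are illustrative choices, not a solution to the puzzle.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: character-level features of a name for the Badges game. */
final class BadgeFeatures {
    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    static List<String> extract(String name) {
        List<String> features = new ArrayList<>();
        int position = 0;   // index among the letters of the name
        char previous = 0;  // previous letter within the same word, 0 if none
        for (char c : name.toLowerCase().toCharArray()) {
            if (!Character.isLetter(c)) {
                previous = 0; // do not form pairs across spaces or punctuation
                continue;
            }
            features.add("char=" + c); // character occurs in the name
            if (previous != 0) {
                features.add("pair=" + previous + c); // two consecutive characters
            }
            // Vowel or consonant at this letter position.
            features.add("pos" + position + "=" + (isVowel(c) ? "vowel" : "consonant"));
            previous = c;
            position++;
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(extract("Al B"));
    }
}
```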
Feature engineering is important (especially when labeled data is scarce)
What is the baseline? 70 +, 24 -
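Assuming the counts are 70 positive and 24 negative labeled examples, a natural baseline is to always predict the majority label, which is correct on 70 of 94 examples (about 74.5%). The small Java sketch below computes this; the class and method names are hypothetical.

```java
/** Sketch: accuracy of always predicting the majority label. */
final class MajorityBaseline {
    static double accuracy(int positives, int negatives) {
        int total = positives + negatives;
        if (total == 0) {
            throw new IllegalArgumentException("no examples");
        }
        // The majority class is whichever label has more examples.
        return (double) Math.max(positives, negatives) / total;
    }

    public static void main(String[] args) {
        // 70 positive and 24 negative examples: 70 / 94.
        System.out.println(accuracy(70, 24));
    }
}
```

A learned badge classifier is only interesting if it beats this number.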
THE FAMOUS PEOPLE CLASSIFIER
The Famous People Classifier
f([photo]) = Politician
f([photo]) = Athlete
f([photo]) = Corporate Mogul
The NLP version of the fame classifier
Each person is represented by text:
All sentences in the news in which the string "Barack Obama" occurs
All sentences in the news in which the string "Roger Federer" occurs
All sentences in the news in which the string "Bill Gates" occurs
Our goal
Find famous athletes, corporate moguls and politicians
Athlete
• Michael Schumacher
• Michael Jordan
• …
Politician
• Bill Clinton
• George W. Bush
• …
Corporate Mogul
• Warren Buffett
• Larry Ellison
• …
Let’s brainstorm
How do we build a fame classifier?
Remember, we start off with just raw text from a news website
One solution
Let us label entities using features defined on mentions
Identify mentions using the named entity recognizer
Define features based on the words, parts of speech and dependency trees
Train a classifier
All sentences in the news in which the string "Barack Obama" occurs
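A heavily simplified version of this pipeline can be sketched in Java. Plain string matching stands in for the named entity recognizer, and only word features are produced (no parts of speech or dependency trees). The class name, the tokenization, and the example sentences in `main` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Sketch: represent a person by words from the sentences that mention them. */
final class MentionContextFeatures {
    /** Counts lowercase context words, skipping the words of the name itself. */
    static Map<String, Integer> extract(String name, List<String> sentences) {
        Set<String> nameTokens =
            new HashSet<>(Arrays.asList(name.toLowerCase().split("\\s+")));
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            // Assumption: exact string match instead of a real NER system.
            if (!sentence.contains(name)) {
                continue;
            }
            for (String token : sentence.toLowerCase().split("\\W+")) {
                if (token.isEmpty() || nameTokens.contains(token)) {
                    continue;
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sentences for illustration only.
        List<String> news = Arrays.asList(
            "Roger Federer won the final .",
            "The senate met today .",
            "Fans cheered as Roger Federer won again .");
        System.out.println(extract("Roger Federer", news));
    }
}
```

The resulting word counts would then be the features on which a classifier with labels such as Athlete, Politician, and Corporate Mogul is trained.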
Summary
1. Get started with Learning Based Java
2. Use a generic, black box text classifier for different applications
…and write your own text classifier, if needed
3. Understand how features can impact the classifier performance
… and add features to improve your application
4. Build a badge classifier based on character features
Questions