
An Introduction to Logic Regression


Page 1: An Introduction to Logic Regression

An Introduction to Logic Regression

John [email protected]@johnsarealtwitSSN: 249-543-0833BMI: 20.9

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyyA8wePstPC69PeuHFtOwyTecByonsHFAjHbVnZ+h0dpomvLZxUtbknNj3+c7MPYKqKBOx9gUKV/diR/mIDqsb405MlrI1kmNR9zbFGYAAwIH/Gxt0Lv5ffwaqsz7cECHBbMojQGEz3IH3twEvDfF6cu5p00QfP0MSmEi/eB+W+h30NGdqLJCziLDlp409jAfXbQm/4Yx7apLvEmkaYSrb5f/pfvYv1FEV1tS8/J7DgdHUAWo6gyGUUSZJgsyHcuJT7v9Tf0xwiFWOWL9WsWXa9fCKqTeYnYJhHlqfinZRnT/+jkz0OZ7YmXo6j4Hyms3RCOqenIX1W6gnIn+eQIkw== This is the key's comment

DC Data Science Meetup
October 25, 2011

Page 2: An Introduction to Logic Regression

Other than getting drunk on Marck's beer, what should I get out of tonight?

• The thought process behind technique selection.

• What are the questions you ask yourself when deciding upon an algorithm/technique?

• I’ll share some quick thoughts and then I hope to spark a more open discussion.

• An introduction to Logic Regression and LogicForest
• Basic intro to CART and RandomForest
• What is a Logic Tree?
• Simulated Annealing
• A short R demo – because nothing is as exciting as watching an algorithm train

Page 3: An Introduction to Logic Regression

The Supervised Classification Problem

• Supervised vs. Unsupervised
• Labeled vs. Unlabeled Data
• Supervised Learning:
  • Use a set of pre-labeled/categorized data to train a classifier that can predict the 'label' of previously unseen observations.

• Examples:
  • Spam Filtering: Given a trove of emails previously classified as Spam vs. Non-Spam, train an algorithm to predict the 'label' of newly received emails.
  • Heart Attack Prediction: Given a set of health indicators and historical epidemiology records, predict the chance of a heart attack in new patients.

• Terms:
  • "Dependent" variable, response, or outcome – what you are trying to predict
  • "Predictor" or "independent" variables – what you use to predict
  • Test/Training Set – randomly subsetted data used to validate and test for over-fitting (see the quick sketch below)
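A minimal sketch of building that split in base R, using the built-in iris data (the 70/30 fraction is just illustrative, not from the talk):

set.seed(42)                                       # reproducible split
test.idx <- sample(nrow(iris), size = 0.3 * nrow(iris))
train <- iris[-test.idx, ]                         # fit the model here
test  <- iris[ test.idx, ]                         # validate / check over-fitting here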

Page 4: An Introduction to Logic Regression

Considerations with Technique Selection

• How is the model going to be used?
  • Production vs. Exploration
• Who is going to be exposed to it?
  • Personal Analysis vs. Public/Internal Consumption
  • Business/Users vs. Nerds
• White Box vs. Black Box
  • E.g. CART vs. SVM
  • The importance of interpretability
  • Interpretability vs. Accuracy – ideally a false dichotomy
• What fits the data?
  • High Dimensionality – how the F' do you deal with so many possible predictors?
• Would recommend Breiman's wonderful article "Statistical Modeling: The Two Cultures"

Page 5: An Introduction to Logic Regression

Where are these vexing questions of morality, ontology and efficient algorithm design answered?

Page 6: An Introduction to Logic Regression

Noob Armed with a Computer

• The double-edged sword of R?
  • library(caret)
    • Classification and Regression Training – WONDERFUL
    • Fits.of.Numbing.Variety <- train(TrainData, Response, method = "XXXX", ...)
• If you can't understand it, you probably shouldn't be using it.
• If you can't explain it, you probably don't understand it.
• Fetishism of complexity
• Thoughts?
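A hedged sketch of what that one-liner looks like in practice, with the built-in iris data standing in for TrainData and "rpart" standing in for method = "XXXX":

library(caret)

# 5-fold cross-validation; "rpart" fits a CART model (any caret method slots in here)
fit <- train(x = iris[, 1:4], y = iris$Species,
             method = "rpart",
             trControl = trainControl(method = "cv", number = 5))
print(fit)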

Page 7: An Introduction to Logic Regression

Logic Regression

• Not Logistic Regression
  • Logistic regression is a GLM (generalized linear model) predicting the probability of an outcome, but the two are used in many of the same problems.
• Main Paper – Ruczinski I, Kooperberg C, LeBlanc ML (2003): Logic Regression, Journal of Computational and Graphical Statistics, 12 (3), pp 475-511.
  • Very Readable
• Published in R
  • library(LogicReg)
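A minimal sketch of fitting a logic regression with LogicReg; the simulated data and the size parameters (ntrees, nleaves) are my own choices, not from the talk:

library(LogicReg)

set.seed(1)
n <- 500
X <- matrix(rbinom(n * 10, 1, 0.5), ncol = 10)   # 10 binary predictors
y <- as.numeric(X[, 1] == 1 & X[, 2] == 0)       # response driven by (X1 AND NOT X2)

# type = 1 requests classification; select = 1 fits a single model
fit <- logreg(resp = y, bin = X, type = 1, select = 1,
              ntrees = 1, nleaves = 8)
print(fit)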

Page 8: An Introduction to Logic Regression

Logic Reg. Cont.• “Logic Regression is an adaptive regression methodology

that attempts to construct predictors as Boolean combinations of binary covariates.”

• The most important contribution that Logic Regression to the field is the focus on the interactions of dependent binary variables. • This is a major difference to rival techniques. The effect of each

predictor upon the response is measured independently. When these interactions are considered it is only in a 2 or 3 way manner.

• Again, this focus is purely application specific. It may not be important to you. However this can be very interesting where the binary predictors are highly correlated.

Page 9: An Introduction to Logic Regression

Example Logic Tree

The not-so-English English translation of the tree's Boolean expression:

(One and not Two) and either [(Three and Four) or (Five and either (Six or not Three))]

Image Credit: Ruczinski et al.
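For the literal-minded, the same tree written as an R logical expression over placeholder binary predictors X1..X6:

# (One AND NOT Two) AND ((Three AND Four) OR (Five AND (Six OR NOT Three)))
logic_tree <- function(X1, X2, X3, X4, X5, X6) {
  (X1 & !X2) & ((X3 & X4) | (X5 & (X6 | !X3)))
}
logic_tree(1, 0, 1, 1, 0, 0)   # TRUE: One on, Two off, Three and Four on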

Page 10: An Introduction to Logic Regression

CART:

The Highly Unscientific Explanation: a decision tree with a touch of Monte Carlo magic – simulated annealing.

CART algorithms implement a "greedy" impurity-reduction strategy, using Gini, entropy (information gain), or misclassification rate.

How?

1. Search all attributes – calculate the potential reduction in impurity.
2. Split on the attribute with the greatest gain (in the binary world, T or F).
3. Repeat with the remaining attributes until there is no further reduction in impurity or the preset maximum size is met.
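A hedged illustration of that greedy procedure via rpart, R's standard CART implementation (the kyphosis data ships with the package; Gini is the default impurity measure for classification):

library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(maxdepth = 4))   # preset maximum size
print(fit)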

Page 11: An Introduction to Logic Regression

Simulated Annealing

• A search technique for locating a good approximation of the global optimum in a large search space.
• The inspiration comes from metalwork.
• An adaptation of the Monte Carlo method.
  • Monte Carlo methods use repeated random sampling in order to calculate results.
  • Personally found this application to be a useful introduction to the much larger field of Monte Carlo methods.
• The Metropolis-Hastings algorithm is an application of simulated annealing where the temperature is kept constant.

Page 12: An Introduction to Logic Regression

Basic Components of Simulated Annealing

General Definitions:
• State Space ("search space") – all different solutions to a problem
• Neighborhood System – how all possible states are related to each other
  • Neighbors of a State – possible moves from the current state that alter the current solution, i.e. states connected by a "move"
• Temperature – a globally time-sensitive parameter that controls the algorithm's acceptance probability; depending on the point in the annealing chain, this probability gets increasingly strict.

Application-Specific Definitions:
Two solutions are said to be adjacent when they are within one move of each other. A move can be:
1. Alternate Leaf
2. Alternate Operator – change just one
3. Prune Branch / Delete Leaf
4. Split Leaf
5. Grow Branch

Page 13: An Introduction to Logic Regression

Image Credit: Ruczinski et al.

Page 14: An Introduction to Logic Regression

Simulated Annealing - Application

A subtle point: adjacent states are compared and either accepted or rejected, but an accepted "move" is not always an improvement over the originating state – it can be worse. A move is accepted if it is an improvement or if it falls within a certain acceptance probability (i.e. the temperature). This is the key difference between a greedy search strategy and SA. The ability of the chain to move through locally sub-optimal solutions allows it to eventually reach a global optimum. If run time is unbounded, this algorithm will find the global optimum.

Steps:

1. Given a certain state, move to an adjacent state in the search space.
2. If the new state is an improvement (by misclassification rate), accept it; otherwise test whether the score difference is within the acceptance probability.
   • The annealing chain is given temperature bounds (a high starting spot and an end) to dictate the cooling period. Because the temperature falls over time (the cooling period), fewer and fewer non-improving moves are accepted.
3. Repeat until the chain reaches a predetermined number of iterations or hits a "breakout point" (where repeated iterations lead to no moves).
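Those steps fit in a dozen lines of R. A toy sketch – score() and neighbor() are placeholders you would define for your own problem, and the acceptance rule is the function shown on the next slide:

anneal <- function(state, score, neighbor,
                   start = 1, end = -2, iter = 50000) {
  temps <- 10 ^ seq(start, end, length.out = iter)   # log10 cooling schedule
  current <- score(state)
  for (temp in temps) {
    candidate  <- neighbor(state)                    # move to an adjacent state
    cand.score <- score(candidate)
    # improvements always pass; worse moves pass with prob min(1, exp(-diff/temp))
    if (runif(1) < min(1, exp(-(cand.score - current) / temp))) {
      state   <- candidate
      current <- cand.score
    }
  }
  state
}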

Page 15: An Introduction to Logic Regression

Acceptance Rate Over Scores

Acceptance function = min(1, exp(-diff(scores)/temp))

Page 16: An Introduction to Logic Regression

Visualization of the Metropolis-Hastings Algorithm (Image Credit: Wikipedia)

Figure labels: Est. of Global Optimum; The "Burn" – early discarded simulations

Page 17: An Introduction to Logic Regression

Parameters and Acceptance

• logreg.anneal.control(start = 1, end = -2, iter = 50000)
  • start and end are on a log10 scale.
• A start temperature that is too high can lead to wasted time, with the chain wandering and accepting every possible move (essentially a drug-addled unemployed graduate on his gap year in India).
• The chain should end accepting no more than 5% of moves, otherwise it will not converge properly. This can be monitored with the update parameter.
• If the end temperature is too low, the chain spends most of its time at the end rejecting the last moves and wasting time. Not the end of the world: the creators of the package have implemented an optional early-exit parameter, so if a certain number of iterations go by without a large number of acceptances, the chain terminates.
• Like all Markov chain applications – trial and error is the name of the game.
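Putting the knobs together (reusing y and X from the earlier LogicReg sketch; the update value is an arbitrary choice that just controls how often acceptance summaries are printed):

myanneal <- logreg.anneal.control(start = 1, end = -2,
                                  iter = 50000, update = 5000)

# watch the printed acceptance rates: they should fall well under 5% by the end
fit <- logreg(resp = y, bin = X, type = 1, select = 1,
              ntrees = 1, anneal.control = myanneal)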

Page 18: An Introduction to Logic Regression

LogicForest

• Ensemble technique using LR as component classifiers
• Described in the paper:
  • "Logic Forest: an ensemble classifier for discovering logical combinations of binary markers." Bioinformatics. 2010 Sep 1;26(17):2183-9. Epub 2010 Jul 13.

• Less Readable

• Implemented in R in the package: LogicForest

• Introduces some powerful features and improves upon the basic LR model.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025651/pdf/btq354.pdf

Page 19: An Introduction to Logic Regression

Basics of RandomForest

• Invented (and trademarked) by Leo Breiman and Adele Cutler. It uses both Breiman's "bagging" and controlled variation (stochastic discrimination) to create an incredibly powerful classifier.

http://www.springerlink.com/content/u0p06167n6173512/fulltext.pdf

Overly Simple Explanation of the Random Forest Algorithm
1. Select a random subset of attributes and observations from the training set.
2. Grow a tree using CART methodology to maximum size and do not prune.
3. Repeat n times.
4. Have each of these unpruned trees predict the label of a new observation, and take the majority vote of all the trees.

• One of the most interesting aspects of RandomForest: because the attributes are iteratively subsetted (or "masked"), one can calculate the importance each attribute has on the outcome of the model (sketch below).
• This randomization also allows for a measure of independence between the models, and is the heart of the ensemble technique.
• A great intuitive discussion of this is in Day 11 of Statistical Aspects of Data Mining (YouTube).
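The recipe above, via the randomForest package (iris again as stand-in data; importance = TRUE turns on the attribute-masking importance measure):

library(randomForest)

fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500, importance = TRUE)
importance(fit)   # per-variable importance scores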

Page 20: An Introduction to Logic Regression

LogicForest

• Unlike Random Forest, LogicForest does not build each component LR using a random subset of variables.
  • The biggest difference from a classic RF is that the maximum size of the LR tree is randomized – not the search space.
  • The inherent randomness of simulated annealing plus the randomized tree size introduces the model independence that lies at the heart of ALL ensemble techniques.
• Also implements a RandomForest-style variable importance calculation
  • Iterative variable masking, leading to variable and interaction importance.
  • It also provides the proportion of trees that voted for a particular classification, giving a very useful ranking of predictive certainty.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025651/pdf/btq354.pdf
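A sketch of fitting one, reusing y and X from the LogicReg example. The argument names here (resp, Xs, nBS) reflect my reading of the LogicForest docs and should be checked against them:

library(LogicForest)

lf <- logforest(resp = y, Xs = X, nBS = 50)   # nBS: number of bootstrap trees
print(lf)   # reports variable/interaction importance and voting proportions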

Page 21: An Introduction to Logic Regression

Conclusion: Why LogicForest is the Bee's Knees

• Variable Importance
  • Feature Selection
  • Interaction Importance
• High level of accessibility. Almost a clear box.
• Classification Confidence
  • Provides an ordinal ranking of confidence
• The use of simulated annealing to escape early split traps and the looming specter of the local optimum
• An interesting approach to binary data that is understandable.
  • It's still just trees.
• Makes for a good introduction to more complex and advanced techniques (i.e. Markov chains).
• Free (let the angels sing on high for the R-core team and package contributors)

Page 22: An Introduction to Logic Regression

Fire up the Servers… It's Demo Time