Su lab meeting March 30, 2012
Benjamin Good
HUMAN GUIDED FORESTS
1. Motivation
2. Idea
3. Game
AGENDA
Need to build biological class predictors that:
1. Have high accuracy
2. Use relatively few variables
To do this we have to use datasets that:
3. Are very noisy
4. Contain enormous numbers of variables
CHALLENGE
van ’t Veer et al. 2002, Nature
98 breast cancer samples:
34 developed metastases within 5 years, 44 did not
18 had BRCA1 mutations, 2 had BRCA2 mutations
expression levels of 25,000 genes measured
5,000 genes “significantly regulated across the sample groups”
EXAMPLE: BREAST CANCER PROGNOSIS
[Figure: 5,000 genes × 98 tumors; tumor groups labeled “70% bad” and “30% bad”; gene groups labeled “genes that coregulate with ER” and “co-regulated genes indicating lymphocytic infiltrate”]
231 genes were found to be significantly associated with disease outcome
Using leave-one-out cross-validation, they empirically selected the 70 best individual genes to build their predictor
Of the 78 samples in the training set, the predictor correctly classified 65 (83%)
→ MammaPrint test from Agendia
Still in clinical trials (MINDACT), 10 years after the original study
METASTASIS PREDICTOR
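As a rough illustration of leave-one-out cross-validation combined with gene selection, the sketch below holds out each sample in turn, ranks genes on the remaining samples, and classifies the held-out sample. It is not the procedure of the original study: the ranking score (difference of class means), the nearest-centroid classifier, the function names, and the synthetic data are all assumptions made for illustration.

```python
import random

def top_k_genes(X, y, k):
    """Rank genes by |mean(class 1) - mean(class 0)| and return the k best indices."""
    def class_mean(gene, label):
        vals = [row[gene] for row, lab in zip(X, y) if lab == label]
        return sum(vals) / len(vals)
    n_genes = len(X[0])
    scores = [abs(class_mean(g, 1) - class_mean(g, 0)) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]

def nearest_centroid(X, y, genes, sample):
    """Predict the class whose mean profile over `genes` is closest to `sample`."""
    def sq_dist(label):
        rows = [row for row, lab in zip(X, y) if lab == label]
        return sum((sample[g] - sum(r[g] for r in rows) / len(rows)) ** 2
                   for g in genes)
    return min((0, 1), key=sq_dist)

def loocv_accuracy(X, y, k):
    """Hold out each sample once; select genes and fit on the other samples only."""
    correct = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_k_genes(train_X, train_y, k)
        correct += nearest_centroid(train_X, train_y, genes, X[i]) == y[i]
    return correct / len(X)

# Synthetic stand-in data (assumption): 20 samples, 50 genes, 3 informative genes.
rng = random.Random(0)
y = [i % 2 for i in range(20)]
X = [[rng.gauss(3.0 * lab if g < 3 else 0.0, 1.0) for g in range(50)] for lab in y]
print(f"LOOCV accuracy with 3 genes: {loocv_accuracy(X, y, 3):.2f}")
```

In this sketch the gene ranking is repeated inside every fold, so the held-out sample never influences which genes are chosen.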
This signature does not take advantage of:
• interactions between genes
(together, two variables may be much more predictive than either one alone)
• biological knowledge
(this signature leaves out several known cancer predictors and does not make use of biological knowledge in any way)
WE CAN DO BETTER
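A toy example may make the interaction point concrete. In the sketch below the outcome is the exclusive-or of two binarized genes, so neither gene alone does better than chance while the pair predicts perfectly. The data and function names are invented for illustration and do not come from the breast cancer study.

```python
from itertools import product

# Hypothetical toy data: two genes measured as low (0) or high (1).
samples = list(product([0, 1], repeat=2)) * 5   # 20 samples
labels = [a ^ b for a, b in samples]            # outcome depends on the pair

def best_single_gene_accuracy(samples, labels, gene):
    """Best accuracy of any rule that looks at one gene only."""
    rules = ({0: 0, 1: 1}, {0: 1, 1: 0}, {0: 0, 1: 0}, {0: 1, 1: 1})
    return max(
        sum(rule[s[gene]] == y for s, y in zip(samples, labels)) / len(labels)
        for rule in rules
    )

def pair_accuracy(samples, labels):
    """Accuracy of a lookup table keyed on both genes (majority label per cell)."""
    cells = {}
    for s, y in zip(samples, labels):
        cells.setdefault(s, []).append(y)
    rule = {s: round(sum(ys) / len(ys)) for s, ys in cells.items()}
    return sum(rule[s] == y for s, y in zip(samples, labels)) / len(labels)

print("gene 0 alone:", best_single_gene_accuracy(samples, labels, 0))
print("gene 1 alone:", best_single_gene_accuracy(samples, labels, 1))
print("both genes:  ", pair_accuracy(samples, labels))
```

On this toy data each single gene reaches 50% accuracy and the pair reaches 100%, which is the extreme case of two variables being more predictive together than apart.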
THERE ARE MANY MANY CHALLENGES LIKE THIS IN BIOLOGY
1. Motivation
2. Idea
3. Game
AGENDA
The standard signature does not take advantage of:
• interactions between genes
machine learning algorithms can find and use these but can have problems when faced with large feature spaces
• biological knowledge
can be used to guide the machine learning process towards meaningful features in the data and thus reduce chances of overfitting
WE CAN DO BETTER BY INTEGRATING MACHINE LEARNING WITH BIOLOGICAL EXPERTISE
MACHINE LEARNING ALGORITHM OF THE MOMENT: RANDOM FOREST
• In each of many iterations, a small subset of features is chosen randomly and used to build one decision tree
• Decision trees are stored and classifications are made based on the majority vote of all of the trees.
• Good classifier!
• But you get different forests every time you run it and it faces the same challenges of generalizability as any other learning algorithm.
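The bullets above can be sketched in a few dozen lines. This is a simplified illustration of the description on this slide, not a full random forest: standard implementations also resample the training set for each tree and usually re-draw the feature subset at every split. All function names, the depth limit, and the Gini split criterion are assumptions.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, features, depth=0, max_depth=3):
    """Grow a small tree that may split only on `features`.
    A leaf is a class label; an internal node is (feature, threshold, left, right)."""
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between two observed values
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # no feature takes two distinct values here
        return majority(labels)
    _, f, t, left, right = best

    def grow(idx):
        return build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                          features, depth + 1, max_depth)

    return (f, t, grow(left), grow(right))

def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def build_forest(rows, labels, n_trees, n_features, seed=None):
    """One tree per iteration, each built from a random subset of features."""
    rng = random.Random(seed)
    n_total = len(rows[0])
    return [build_tree(rows, labels, rng.sample(range(n_total), n_features))
            for _ in range(n_trees)]

def predict_forest(forest, row):
    """Classify by majority vote over all stored trees."""
    return majority([predict_tree(tree, row) for tree in forest])
```

With `seed=None` each call to `build_forest` draws different feature subsets, which is the run-to-run variability noted in the last bullet; passing a fixed seed makes a run repeatable.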
NETWORK GUIDED FOREST (NGF)
Same algorithm except each tree is constructed from a particular area of a relevant protein-protein interaction network.
1) Pick a gene randomly
2) Walk out along the network to get the other N genes to use to build that tree
3) repeat
The premise is that biologically coherent modules will give better signal than individual genes randomly grouped together
Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
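The gene-selection steps can be sketched as follows. The slide does not say how the walk is performed, so breadth-first expansion from the seed gene is an assumption here, as are the function names and the toy network with placeholder gene names. This is not the published NGF implementation.

```python
import random
from collections import deque

def network_module(network, seed_gene, n_genes):
    """Collect up to n_genes by breadth-first expansion from seed_gene."""
    module, seen, queue = [], {seed_gene}, deque([seed_gene])
    while queue and len(module) < n_genes:
        gene = queue.popleft()
        module.append(gene)
        for neighbor in sorted(network.get(gene, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return module

def ngf_feature_sets(network, n_trees, n_genes, seed=None):
    """One gene module per tree: random seed gene, then walk out along the network."""
    rng = random.Random(seed)
    genes = sorted(network)
    return [network_module(network, rng.choice(genes), n_genes)
            for _ in range(n_trees)]

# Hypothetical interaction network (adjacency sets); names are placeholders.
toy_network = {
    "g1": {"g2", "g3"},
    "g2": {"g1", "g4"},
    "g3": {"g1"},
    "g4": {"g2", "g5"},
    "g5": {"g4"},
    "g6": set(),
}
for module in ngf_feature_sets(toy_network, n_trees=3, n_genes=3, seed=0):
    print(module)
```

Each module would then replace the randomly chosen feature subset when building one tree. A gene with few network neighbors, such as `g6` here, yields a module smaller than N.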
NGF RESULTS
A) Identical performance to random forest and random network guided forest, as assessed by 5-fold cross-validation repeated 100 times.
B) More known breast cancer genes show up in the forest
C) Similar genes selected for forests in two different training sets (different patient cohorts)
HUMAN GUIDED RANDOM FOREST (HGF)
Same algorithm again, except each tree is constructed from a manually selected subset of genes (or other features).
1) Find a person
2) Let them select what they think is an optimal feature set
3) back to step one, N times
4) aggregate
The premise is that biological knowledge can produce better-than-random decision modules and that not all biological knowledge is captured in interaction networks
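A minimal sketch of the aggregation step, assuming each person's selection arrives as a list of gene names. To keep it short, each "tree" is a one-split decision tree (a stump) restricted to that person's genes, which is a stand-in for a full decision tree. The function names, gene names, and data are hypothetical.

```python
from collections import Counter

def fit_stump(rows, labels, genes):
    """Fit a one-split tree using only `genes`.
    Returns (gene, threshold, label_if_low, label_if_high)."""
    majority = Counter(labels).most_common(1)[0][0]
    # Fallback if no split is possible: always predict the majority label.
    best = (len(labels) + 1, genes[0], float("inf"), majority, majority)
    for g in genes:
        values = sorted({r[g] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            low = [y for r, y in zip(rows, labels) if r[g] <= t]
            high = [y for r, y in zip(rows, labels) if r[g] > t]
            if not low or not high:
                continue
            low_lab = Counter(low).most_common(1)[0][0]
            high_lab = Counter(high).most_common(1)[0][0]
            errors = (sum(y != low_lab for y in low)
                      + sum(y != high_lab for y in high))
            if errors < best[0]:
                best = (errors, g, t, low_lab, high_lab)
    return best[1:]

def predict_stump(stump, row):
    gene, threshold, low_lab, high_lab = stump
    return low_lab if row[gene] <= threshold else high_lab

def build_human_guided_forest(rows, labels, selections):
    """One tree per person: each is built only from that person's chosen genes."""
    return [fit_stump(rows, labels, genes) for genes in selections]

def predict_forest(forest, row):
    """Aggregate by majority vote over all trees."""
    votes = [predict_stump(stump, row) for stump in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical expression data and three people's selections.
rows = [
    {"gene_a": 0.1, "gene_b": 0.9, "gene_c": 0.4},
    {"gene_a": 0.2, "gene_b": 0.1, "gene_c": 0.8},
    {"gene_a": 0.8, "gene_b": 0.7, "gene_c": 0.3},
    {"gene_a": 0.9, "gene_b": 0.2, "gene_c": 0.6},
]
labels = [0, 0, 1, 1]
selections = [["gene_a", "gene_b"], ["gene_a", "gene_c"], ["gene_b", "gene_c"]]
forest = build_human_guided_forest(rows, labels, selections)
print([predict_forest(forest, r) for r in rows])
```

The only difference from the random and network-guided variants is where the feature subsets come from; tree building and voting are unchanged.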
1) Find a person
2) Let them select what they think is an optimal feature set
3) back to step one, N times
• N may be large (e.g. 1,000)
• Need many knowledgeable people to work hard... for free
HGF CHALLENGES
1. Motivation
2. Idea
3. Game
AGENDA
COMBO! COMPONENTS
Java server
• hosts training data
• executes decision tree algorithm
• performs cross-validation tests
• logs data generated by game
Game interface(s)
• manages game events: users, moves, etc.
• two implementations so far
  • server side (JSP) card game
  • client side (JavaScript) table game
server provides features for games (e.g. gene cards)
client sends groups of features (‘hands’) to the server for scoring.
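To make the scoring step concrete, the sketch below shows one way a submitted hand could be turned into a score: the cross-validated accuracy of a classifier restricted to the genes in that hand. It is written in Python for brevity and is not the COMBO server's Java code. The function names, the one-split tree used as a stand-in for the decision tree algorithm, the fold assignment, and the data are all assumptions.

```python
from collections import Counter

def fit_stump(rows, labels, genes):
    """One-split tree over `genes`: (gene, threshold, label_if_low, label_if_high)."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (len(labels) + 1, genes[0], float("inf"), majority, majority)
    for g in genes:
        values = sorted({r[g] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            low = [y for r, y in zip(rows, labels) if r[g] <= t]
            high = [y for r, y in zip(rows, labels) if r[g] > t]
            if not low or not high:
                continue
            low_lab = Counter(low).most_common(1)[0][0]
            high_lab = Counter(high).most_common(1)[0][0]
            errors = (sum(y != low_lab for y in low)
                      + sum(y != high_lab for y in high))
            if errors < best[0]:
                best = (errors, g, t, low_lab, high_lab)
    return best[1:]

def predict_stump(stump, row):
    gene, threshold, low_lab, high_lab = stump
    return low_lab if row[gene] <= threshold else high_lab

def score_hand(rows, labels, hand, n_folds=5):
    """Score a hand (list of gene names) by k-fold cross-validated accuracy."""
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(rows)) if i % n_folds == fold]
        train_idx = [i for i in range(len(rows)) if i % n_folds != fold]
        stump = fit_stump([rows[i] for i in train_idx],
                          [labels[i] for i in train_idx], hand)
        correct += sum(predict_stump(stump, rows[i]) == labels[i]
                       for i in test_idx)
    return correct / len(rows)

# Hypothetical training data held by the server: gene_a is informative, gene_b is not.
labels = [i % 2 for i in range(10)]
rows = [{"gene_a": 0.1 * (i // 2) + 0.6 * (i % 2), "gene_b": (i * 3 % 10) / 10}
        for i in range(10)]
for hand in (["gene_a"], ["gene_b"], ["gene_a", "gene_b"]):
    print(hand, score_hand(rows, labels, hand))
```

Scoring on held-out folds rather than on the training data rewards hands that generalize, so a player cannot gain points by picking genes that merely fit noise.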
COMBO CODE AND DEMO
Next steps
1. Better preprocessing of training data
• map contigs to genes where possible, filter out clearly useless genes
• identify individually predictive genes
2. Game
• build domain-specific boards, let players pick their knowledge area
• real two-player feel with a robot partner
• Special cards: robber card, any-gene selector card
• High scores
• ??????????????