Human Guided Forests (HGF)

Su lab meeting March 30, 2012

Benjamin Good

HUMAN GUIDED FORESTS

1. Motivation

2. Idea

3. Game

AGENDA

Need to build biological class predictors that:

1. Have high accuracy

2. Use relatively few variables

To do this we have to use datasets that:

3. Are very noisy

4. Contain enormous numbers of variables

CHALLENGE

van 't Veer et al. (2002) Nature

98 breast cancer samples:

34 developed metastases within 5 years, 44 did not

18 had BRCA1 mutations, 2 had BRCA2 mutations

expression levels of 25,000 genes measured

5,000 genes “significantly regulated across the sample groups”

EXAMPLE: BREAST CANCER PROGNOSIS

[Figure: clustered heat map of 5,000 genes × 98 tumors. Labels: two tumor clusters ("70% bad", "30% bad"); genes that co-regulate with ER; co-regulated genes indicating lymphocytic infiltrate]

231 genes were found to be significantly associated with disease outcome

Using leave-one-out cross-validation, they empirically selected the 70 best individual genes to build their predictor

Of the 78 samples in the training set, the predictor correctly classified 65 (83%)

->MammaPrint test from Agendia

Still in clinical trials (10 years since original study) (MINDACT)

METASTASIS PREDICTOR

This signature does not take advantage of:

• interactions between genes

(together two variables may be much more predictive than either one alone)

• biological knowledge

(this signature leaves out several known cancer predictors and does not make use of biological knowledge in any way)
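A toy illustration of the interaction point, using made-up data rather than anything from the study: two hypothetical binary genes whose combination determines the outcome even though neither gene is informative alone.

```python
from collections import Counter, defaultdict

# Made-up toy data: each row is (gene_a_high, gene_b_high, outcome).
# The outcome is 1 only when exactly one of the two genes is high (XOR).
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)] * 5


def best_rule_accuracy(rows, feature_idx):
    """Accuracy of the best possible lookup rule that sees only feature_idx."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[i] for i in feature_idx)
        groups[key][row[-1]] += 1
    # Within each group of identical feature values, the best guess
    # is the majority outcome of that group.
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(rows)


acc_a = best_rule_accuracy(rows, [0])        # gene A alone
acc_b = best_rule_accuracy(rows, [1])        # gene B alone
acc_both = best_rule_accuracy(rows, [0, 1])  # both genes together
print(acc_a, acc_b, acc_both)  # 0.5 0.5 1.0
```

A signature built by ranking genes one at a time would discard both of these genes, since each looks like a coin flip on its own.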

WE CAN DO BETTER

THERE ARE MANY MANY CHALLENGES LIKE THIS IN BIOLOGY

1. Motivation

2. Idea

3. Game

AGENDA

The standard signature does not take advantage of:

• interactions between genes

machine learning algorithms can find and use these but can have problems when faced with large feature spaces

• biological knowledge

can be used to guide the machine learning process towards meaningful features in the data and thus reduce chances of overfitting

WE CAN DO BETTER BY INTEGRATING MACHINE LEARNING WITH BIOLOGICAL EXPERTISE

MACHINE LEARNING ALGORITHM OF THE MOMENT

• In each of many iterations, a small subset of features is chosen randomly and used to build one decision tree.

• Decision trees are stored and classifications are made based on the majority vote of all of the trees.

• Good classifier!

• But you get different forests every time you run it and it faces the same challenges of generalizability as any other learning algorithm.
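A minimal sketch of the procedure described in these bullets, written in plain Python on made-up toy data. It follows the slide's description (one random feature subset per tree, then a majority vote); a standard random forest additionally bootstraps the samples and redraws candidate features at every split. All function names and parameters here are illustrative assumptions.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(X, y, features, depth=0, max_depth=3):
    """Grow a small decision tree that may only split on `features`."""
    if depth == max_depth or len(set(y)) == 1:
        return Counter(y).most_common(1)[0][0]  # leaf: majority label
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # no usable split on these features
        return Counter(y).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left],
                       features, depth + 1, max_depth),
            build_tree([X[i] for i in right], [y[i] for i in right],
                       features, depth + 1, max_depth))


def predict_tree(tree, row):
    while isinstance(tree, tuple):  # internal node: (feature, threshold, left, right)
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def train_forest(X, y, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)  # fixing the seed makes the forest reproducible
    forest = []
    for _ in range(n_trees):
        features = rng.sample(range(len(X[0])), n_features)  # random feature subset
        forest.append(build_tree(X, y, features))
    return forest


def predict_forest(forest, row):
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]  # majority vote of all trees


# Made-up toy data: 4 samples x 4 features, labels 0/1.
X = [[0.1, 0.2, 0.0, 0.3], [0.2, 0.1, 0.3, 0.0],
     [0.9, 0.8, 1.0, 0.7], [0.8, 0.9, 0.7, 1.0]]
y = [0, 0, 1, 1]
forest = train_forest(X, y)
low_pred = predict_forest(forest, [0.15, 0.15, 0.1, 0.1])
high_pred = predict_forest(forest, [0.85, 0.9, 0.9, 0.8])
```

Changing `seed` changes which feature subsets are drawn, which is the run-to-run variation noted in the last bullet.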

NETWORK GUIDED FOREST (NGF)

Same algorithm except each tree is constructed from a particular area of a relevant protein-protein interaction network.

1) Pick a gene randomly

2) Walk out along the network to get the other N genes to use to build that tree

3) repeat

The premise is that biologically coherent modules will give better signal than individual genes randomly grouped together

Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
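One way to picture the per-tree feature selection in NGF is a breadth-first walk from a random seed gene. This is only a sketch: the slide does not say how the walk is done, so breadth-first order is an assumption, and the network and gene names (`G1` to `G7`) are invented.

```python
import random
from collections import deque


def sample_module(network, n_genes, rng):
    """Pick a random seed gene, then walk outward until n_genes are collected."""
    seed_gene = rng.choice(sorted(network))
    module, queue, seen = [], deque([seed_gene]), {seed_gene}
    while queue and len(module) < n_genes:
        gene = queue.popleft()
        module.append(gene)
        for neighbor in sorted(network[gene]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return module  # may be shorter than n_genes in a small component


# Hypothetical undirected interaction network (adjacency sets).
toy_network = {
    "G1": {"G2", "G3"},
    "G2": {"G1", "G4"},
    "G3": {"G1"},
    "G4": {"G2", "G5"},
    "G5": {"G4"},
    "G6": {"G7"},
    "G7": {"G6"},
}

rng = random.Random(0)
modules = [sample_module(toy_network, 3, rng) for _ in range(10)]
```

Each module would then supply the feature set for one tree, in place of the random subset used by a plain random forest.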

NGF RESULTS

A) Identical performance to random forest and random network guided forest as assessed by 5-fold cross-validation repeated 100 times.

B) More known breast cancer genes show up in the forest

C) Similar genes selected for forests in two different training sets (different patient cohorts)

HUMAN GUIDED RANDOM FOREST (HGF)

Same algorithm again except each tree is constructed from a manually selected subset of genes (or other features).

1) Find a person

2) Let them select what they think is an optimal feature set

3) back to step one, N times

4) aggregate

The premise is that biological knowledge can produce better than random decision modules and that not all biological knowledge is captured in interaction networks
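The aggregation step can be sketched as a majority vote over one model per human-selected feature set. To keep the example short, a nearest-centroid classifier stands in for the decision tree; that substitution, the player names, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def centroid_classifier(X, y, features):
    """Stand-in for a decision tree: nearest class centroid using only `features`."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += row[f]
    centroids = {label: [s / counts[label] for s in acc]
                 for label, acc in sums.items()}

    def predict(row):
        def dist(label):
            return sum((row[f] - c) ** 2
                       for f, c in zip(features, centroids[label]))
        return min(sorted(centroids), key=dist)
    return predict


def human_guided_forest(X, y, hands):
    """Build one model per human-selected feature set and aggregate by vote."""
    models = [centroid_classifier(X, y, features) for features in hands.values()]

    def predict(row):
        votes = Counter(model(row) for model in models)
        return votes.most_common(1)[0][0]
    return predict


# Made-up toy data: 4 samples x 4 features, labels 0/1.
X = [[0.1, 0.2, 0.0, 0.3], [0.2, 0.1, 0.3, 0.0],
     [0.9, 0.8, 1.0, 0.7], [0.8, 0.9, 0.7, 1.0]]
y = [0, 0, 1, 1]

# Hypothetical selections: each person picks the feature indices they trust.
hands = {"player_1": [0, 2], "player_2": [1], "player_3": [0, 3]}

hgf_predict = human_guided_forest(X, y, hands)
low_pred = hgf_predict([0.15, 0.15, 0.1, 0.1])
high_pred = hgf_predict([0.85, 0.9, 0.9, 0.8])
```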

1) Find a person

2) Let them select what they think is an optimal feature set

3) back to step one, N times

• N may be large (e.g. 1,000)

• Need many knowledgeable people to work hard... for free

HGF CHALLENGES

1. Motivation

2. Idea

3. Game

AGENDA

COMBO! COMPONENTS

Java server

• hosts training data

• executes decision tree algorithm

• performs cross-validation tests

• logs data generated by game

Game interface(s)

• manages game events: users, moves, etc.

• two implementations so far

• server side (JSP) card game

• client side (JavaScript) table game

server provides features for games (e.g. gene cards)

client sends groups of features (‘hands’) to the server for scoring.
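A rough idea of what scoring a hand by cross-validation could look like. This is not the COMBO server code (which is Java); it is a Python sketch in which a 1-nearest-neighbor classifier stands in for the decision tree, and the function names, fold assignment, and data are assumptions.

```python
def nearest_neighbor_predict(train_X, train_y, row, features):
    """Stand-in classifier: label of the closest training sample on `features`."""
    def dist(i):
        return sum((train_X[i][f] - row[f]) ** 2 for f in features)
    return train_y[min(range(len(train_X)), key=dist)]


def score_hand(X, y, hand, k=5):
    """k-fold cross-validated accuracy using only the features in `hand`."""
    correct = 0
    for fold in range(k):
        # Sample i is held out in fold i % k and used for training otherwise.
        test_idx = [i for i in range(len(X)) if i % k == fold]
        train_idx = [i for i in range(len(X)) if i % k != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            if nearest_neighbor_predict(train_X, train_y, X[i], hand) == y[i]:
                correct += 1
    return correct / len(X)


# Made-up toy data: feature 0 tracks the label, feature 1 is constant noise.
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
X = [[label + 0.01 * i, 0.5] for i, label in enumerate(y)]

good_hand_score = score_hand(X, y, hand=[0])
poor_hand_score = score_hand(X, y, hand=[1])
print(good_hand_score, poor_hand_score)  # 1.0 0.5
```

Scoring on held-out samples rather than on the training data keeps a hand from being rewarded for merely memorizing the samples.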

COMBO CODE AND DEMO

Next steps

1. Better preprocessing of training data

• map contigs to genes where possible, filter out clearly useless genes

• identify individually predictive genes

2. Game

• build domain-specific boards, let players pick their knowledge area

• Real two player feeling with robot partner

• Special cards: robber card, any-gene selector card

• High scores

• ??????????????

Education
