Answering biological questions using large genomic data collections. Curtis Huttenhower, 10-05-09. Harvard School of Public Health, Department of Biostatistics.


Page 1: Answering biological questions using large genomic data collections

Answering biological questions using large genomic data collections

Curtis Huttenhower

10-05-09

Harvard School of Public Health

Department of Biostatistics

Page 2: Answering biological questions using large genomic data collections

A Definition of Computational Functional Genomics

Inputs: genomic data, prior knowledge

Data → Function

Function → Function

Gene → Gene

Gene → Function

Page 3: Answering biological questions using large genomic data collections

MEFIT: A Framework for Functional Genomics

Related Gene Pairs
BRCA1 BRCA2 0.9
BRCA1 RAD51 0.8
RAD51 TP53 0.85
…

[Figure: MEFIT; frequency versus correlation (low to high) for related gene pairs]

Page 4: Answering biological questions using large genomic data collections

MEFIT: A Framework for Functional Genomics

Related Gene Pairs
BRCA1 BRCA2 0.9
BRCA1 RAD51 0.8
RAD51 TP53 0.85
…

Unrelated Gene Pairs
BRCA2 SOX2 0.1
RAD51 FOXP2 0.2
ACTR1 H6PD 0.15
…

[Figure: MEFIT; frequency versus correlation (low to high) for related and unrelated gene pairs]
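
One way to read the two histograms: within a single dataset, the fraction of related and of unrelated gene pairs that fall into each correlation bin gives a per-bin likelihood, and Bayes' rule then turns a new pair's correlation into a probability of functional relationship. The sketch below is a minimal illustration under that assumption. The function names, bin edges, prior, and pseudocount are hypothetical and do not come from the slides; the pair scores are the example values shown on the slide.

```python
from bisect import bisect_right

# Example pairs from the slide: (gene A, gene B, correlation).
RELATED = [("BRCA1", "BRCA2", 0.9), ("BRCA1", "RAD51", 0.8), ("RAD51", "TP53", 0.85)]
UNRELATED = [("BRCA2", "SOX2", 0.1), ("RAD51", "FOXP2", 0.2), ("ACTR1", "H6PD", 0.15)]

# Hypothetical bin edges: low (< 0.33), medium, high (>= 0.66) correlation.
BIN_EDGES = [0.33, 0.66]


def bin_index(correlation):
    """Map a correlation value to a bin index in 0..len(BIN_EDGES)."""
    return bisect_right(BIN_EDGES, correlation)


def bin_likelihoods(pairs, pseudocount=1.0):
    """Estimate P(bin | class) from example pairs, with additive smoothing."""
    counts = [pseudocount] * (len(BIN_EDGES) + 1)
    for _, _, corr in pairs:
        counts[bin_index(corr)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def posterior_related(correlation, prior=0.5):
    """P(related | correlation bin) via Bayes' rule for one dataset."""
    b = bin_index(correlation)
    p_rel = bin_likelihoods(RELATED)[b] * prior
    p_unrel = bin_likelihoods(UNRELATED)[b] * (1.0 - prior)
    return p_rel / (p_rel + p_unrel)
```

With these toy inputs, a highly correlated pair receives a posterior above the prior and a weakly correlated pair receives one below it; a realistic prior for functional relatedness would be much smaller than 0.5.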

Page 5: Answering biological questions using large genomic data collections

MEFIT: A Framework for Functional Genomics

Golub 1999

Butte 2000

Whitfield 2002

Hansen 1998

Functional Relationship

Page 6: Answering biological questions using large genomic data collections

MEFIT: A Framework for Functional Genomics

Golub 1999

Butte 2000

Whitfield 2002

Hansen 1998

Functional Relationship

Biological Context
  • Functional area
  • Tissue
  • Disease
  • …
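
The slide shows several datasets contributing to a single functional relationship, optionally within a biological context. A minimal sketch of one way to combine them, assuming the datasets are conditionally independent (a naive Bayes assumption that the slides do not state), is to multiply per-dataset likelihood ratios into the prior odds. The function name, the prior, and the ratio values are invented for illustration; a context-specific variant would use ratios learned only from gene pairs in that context.

```python
import math


def combine_evidence(likelihood_ratios, prior=0.1):
    """Posterior P(related) from per-dataset likelihood ratios.

    Each ratio is P(observation | related) / P(observation | unrelated)
    for one dataset; datasets are assumed conditionally independent.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios.values():
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical ratios for one gene pair in each dataset named on the slide.
example = {
    "Golub 1999": 2.0,
    "Butte 2000": 1.5,
    "Whitfield 2002": 3.0,
    "Hansen 1998": 1.0,  # a ratio of 1.0 means the dataset is uninformative
}
posterior = combine_evidence(example)
```

Working in log odds keeps the product numerically stable when hundreds or thousands of datasets are combined.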

Page 7: Answering biological questions using large genomic data collections

Functional Interaction Networks

MEFIT

  • Global interaction network
  • Autophagy network
  • Vacuolar transport network
  • Translation network

Currently have data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases

Page 8: Answering biological questions using large genomic data collections

Predicting Gene Function

Cell cycle genes

Predicted relationships between genes (high confidence to low confidence)

Page 9: Answering biological questions using large genomic data collections

Predicting Gene Function

Predicted relationships between genes (high confidence to low confidence)

Cell cycle genes

Page 10: Answering biological questions using large genomic data collections

Predicting Gene Function

Cell cycle genes

Predicted relationships between genes (high confidence to low confidence)

These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
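
As a minimal sketch of that idea, a gene's score for a process can be computed as the mean confidence of its predicted edges to the genes already known to be in the process. The network, the placeholder gene YFG1, and all weights below are invented for illustration, and a missing edge is assumed to carry weight 0; the scoring used in the actual system may differ.

```python
def edge_weight(network, a, b):
    """Look up an undirected edge weight; missing edges count as 0.0."""
    return network.get(frozenset((a, b)), 0.0)


def process_score(network, gene, process_genes):
    """Mean edge confidence between `gene` and the other process genes."""
    others = [g for g in process_genes if g != gene]
    if not others:
        return 0.0
    return sum(edge_weight(network, gene, g) for g in others) / len(others)


# Hypothetical predicted relationships (confidence in [0, 1]).
network = {
    frozenset(("CDC28", "CLN2")): 0.9,
    frozenset(("CDC28", "CLB2")): 0.8,
    frozenset(("YFG1", "CDC28")): 0.7,
    frozenset(("YFG1", "CLN2")): 0.5,
}
cell_cycle = ["CDC28", "CLN2", "CLB2"]
score = process_score(network, "YFG1", cell_cycle)
```

Comparing this score with the same gene's mean edge weight to all genes in the genome would be one way to capture the "specifically" part of the statement.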

Page 11: Answering biological questions using large genomic data collections

Functional Associations Between Contexts

Predicted relationships between genes (high confidence to low confidence)

The average strength of these relationships indicates how cohesive a process is.

Cell cycle genes
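
A minimal sketch of cohesiveness as described here: the mean predicted edge weight over all pairs of genes within one process. The edge confidences are invented for illustration, and missing edges are assumed to count as 0.

```python
from itertools import combinations


def cohesiveness(network, genes):
    """Mean predicted edge weight over all gene pairs within one process."""
    pairs = list(combinations(sorted(set(genes)), 2))
    if not pairs:
        return 0.0
    return sum(network.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


# Hypothetical edge confidences among three cell cycle genes.
network = {
    frozenset(("CDC28", "CLN2")): 0.9,
    frozenset(("CDC28", "CLB2")): 0.8,
    frozenset(("CLB2", "CLN2")): 0.7,
}
value = cohesiveness(network, ["CDC28", "CLN2", "CLB2"])
```

Dividing this value by the mean edge weight over the whole genome would express cohesiveness relative to a background baseline.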

Page 12: Answering biological questions using large genomic data collections

Functional Associations Between Contexts

Predicted relationships between genes (high confidence to low confidence)

Cell cycle genes

Page 13: Answering biological questions using large genomic data collections

Functional Associations Between Contexts

DNA replication genes

The average strength of these relationships indicates how associated two processes are.

Predicted relationships between genes (high confidence to low confidence)

Cell cycle genes
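
The between-process version can be sketched the same way: average the predicted edge weights over gene pairs that span the two gene sets. The gene sets and weights are invented for illustration, missing edges are assumed to count as 0, and a gene shared by both sets is not paired with itself.

```python
from itertools import product


def association(network, genes_a, genes_b):
    """Mean predicted edge weight over gene pairs spanning two processes."""
    pairs = [(a, b) for a, b in product(set(genes_a), set(genes_b)) if a != b]
    if not pairs:
        return 0.0
    return sum(network.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


# Hypothetical edge confidences between cell cycle and DNA replication genes.
network = {
    frozenset(("CDC28", "POL1")): 0.6,
    frozenset(("CDC28", "MCM2")): 0.4,
    frozenset(("CLB2", "POL1")): 0.2,
    # CLB2-MCM2 has no predicted edge and contributes 0.0.
}
cell_cycle = ["CDC28", "CLB2"]
dna_replication = ["POL1", "MCM2"]
value = association(network, cell_cycle, dna_replication)
```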

Page 14: Answering biological questions using large genomic data collections

Functional Associations Between Processes

Edges: associations between processes (very strong to moderately strong)

  • Hydrogen Transport
  • Electron Transport
  • Cellular Respiration
  • Protein Processing
  • Peptide Metabolism
  • Cell Redox Homeostasis
  • Aldehyde Metabolism
  • Energy Reserve Metabolism
  • Vacuolar Protein Catabolism
  • Negative Regulation of Protein Metabolism
  • Organelle Fusion
  • Protein Depolymerization
  • Organelle Inheritance

Page 15: Answering biological questions using large genomic data collections

Functional Associations Between Processes

Edges: associations between processes (very strong to moderately strong)

Borders: data coverage of processes (well covered to sparsely covered)

  • Hydrogen Transport
  • Electron Transport
  • Cellular Respiration
  • Protein Processing
  • Peptide Metabolism
  • Cell Redox Homeostasis
  • Aldehyde Metabolism
  • Energy Reserve Metabolism
  • Vacuolar Protein Catabolism
  • Negative Regulation of Protein Metabolism
  • Organelle Fusion
  • Protein Depolymerization
  • Organelle Inheritance

Page 16: Answering biological questions using large genomic data collections

Functional Associations Between Processes

Edges: associations between processes (very strong to moderately strong)

Nodes: cohesiveness of processes (below baseline; baseline (genomic background); very cohesive)

Borders: data coverage of processes (well covered to sparsely covered)

  • Hydrogen Transport
  • Electron Transport
  • Cellular Respiration
  • Protein Processing
  • Peptide Metabolism
  • Cell Redox Homeostasis
  • Aldehyde Metabolism
  • Energy Reserve Metabolism
  • Vacuolar Protein Catabolism
  • Negative Regulation of Protein Metabolism
  • Organelle Fusion
  • Protein Depolymerization
  • Organelle Inheritance

AHP1, DOT5, GRX1, GRX2, …

APE3, LAP4, PAI3, PEP4, …

Page 17: Answering biological questions using large genomic data collections


HEFalMp: Predicting human gene function

HEFalMp

Page 18: Answering biological questions using large genomic data collections

HEFalMp: Predicting human genetic interactions

HEFalMp

Page 19: Answering biological questions using large genomic data collections


HEFalMp: Analyzing human genomic data

HEFalMp

Page 20: Answering biological questions using large genomic data collections


HEFalMp: Understanding human disease

HEFalMp

Page 21: Answering biological questions using large genomic data collections

Validating Human Predictions

Autophagy

[Figure: Luciferase (negative control), ATG5 (positive control), LAMP2, and RAB11A, each shown not starved and starved (autophagic)]

Predicted novel autophagy proteins

5½ of 7 predictions currently confirmed

With Erin Haley, Hilary Coller

Page 22: Answering biological questions using large genomic data collections

Comprehensive Validation of Computational Predictions

Genomic data

Prior knowledge

Computational Predictions of Gene Function
  • MEFIT
  • SPELL (Hibbs et al 2007)
  • bioPIXIE (Myers et al 2005)

Genes predicted to function in mitochondrion organization and biogenesis

Laboratory Experiments
  • Petite frequency
  • Growth curves
  • Confocal microscopy

New known functions for correctly predicted genes

Retraining

With David Hess, Amy Caudy

Page 23: Answering biological questions using large genomic data collections

Evaluating the Performance of Computational Predictions

Genes involved in mitochondrion organization and biogenesis

  • 106 Original GO Annotations
  • 135 Under-annotations
  • 82 Novel Confirmations, First Iteration
  • 17 Novel Confirmations, Second Iteration

340 total: >3x previously known genes in ~5 person-months
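
The total and the ">3x" figure follow directly from the counts on this slide, as a short check shows. The dictionary keys are just labels for this illustration.

```python
# Counts as listed on the slide.
counts = {
    "original GO annotations": 106,
    "under-annotations": 135,
    "novel confirmations, first iteration": 82,
    "novel confirmations, second iteration": 17,
}

total = sum(counts.values())
fold_increase = total / counts["original GO annotations"]

print(total)                    # 340
print(round(fold_increase, 2))  # 3.21
```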

Page 24: Answering biological questions using large genomic data collections

Evaluating the Performance of Computational Predictions

Genes involved in mitochondrion organization and biogenesis

  • 106 Original GO Annotations
  • 95 Under-annotations
  • 40 Confirmed Under-annotations
  • 80 Novel Confirmations, First Iteration
  • 17 Novel Confirmations, Second Iteration

340 total: >3x previously known genes in ~5 person-months

Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Page 25: Answering biological questions using large genomic data collections

Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Page 26: Answering biological questions using large genomic data collections

Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

How can a researcher take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?

Functional mapping
  • Very large collections of genomic data
  • Specific predicted molecular interactions
  • Pathway, process, or disease associations
  • Underlying experimental results and functional activities in data

Page 27: Answering biological questions using large genomic data collections

Thanks!

NIGMS

http://function.princeton.edu/hefalmp

Interested? I’m accepting students and postdocs!

Hilary Coller, Erin Haley, Tsheko Mutungu

Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Edo Airoldi, Florian Markowetz

Shuji Ogino, Charlie Fuchs

http://www.huttenhower.org

Page 28: Answering biological questions using large genomic data collections
Page 29: Answering biological questions using large genomic data collections

Next Steps: Microbial Communities

  • Data integration is off to a great start in humans
    – Complex communities of distinct cell types
    – Very sparse prior knowledge
      • Concentrated in a few specific areas
    – Variation across populations
    – Critical to understand mechanisms of disease

Page 30: Answering biological questions using large genomic data collections

Next Steps: Microbial Communities

  • What about microbial communities?
    – Complex communities of distinct species/strains
    – Very sparse prior knowledge
      • Concentrated in a few specific species/strains
    – Variation across populations
    – Critical to understand mechanisms of disease

Page 31: Answering biological questions using large genomic data collections

Next Steps: Microbial Communities

[Figure: two network panels over the same genes: PKH1, PKH3, PKH2, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA]

~120 available expression datasets

~70 species

Weskamp et al 2004

Flannick et al 2006

Kanehisa et al 2008

Tatusov et al 1997

  • Data integration works just as well in microbes as it does in humans
  • We know an awful lot about some microorganisms and almost nothing about others
  • Purely sequence-based and purely network-based tools for function transfer both fall short
  • We need data integration to take advantage of both and mine out useful biology!

Page 32: Answering biological questions using large genomic data collections

Next Steps: Functional Metagenomics

  • Metagenomics: data analysis from environmental samples
    – Microflora: environment includes us!
  • Another data integration problem
    – Must include datasets from multiple organisms
  • Another context-specificity problem
    – Now “context” can also mean “species”
  • What questions can we answer?
    – How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
    – What’s shared within community X? What’s different? What’s unique?
    – What’s perturbed in disease state Y? One organism, or many? Host interactions?
    – Current methods annotate ~50% of synthetic data, <5% of environmental data

[Figure: network over the genes PKH1, PKH3, PKH2, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA]