Applications of network theory to human population genetics: from pathways to genotype networks Giovanni Marco Dall'Olio Pompeu Fabra University, Barcelona Advisors: Jaume Bertranpetit and Hafid Laayouni

Thesis defence of Dall'Olio Giovanni Marco. Applications of network theory to human population genetics: from pathways to genotype networks


This is the presentation of my PhD thesis defence. It describes two applications of network theory that improve the methods used to study genetic adaptation in the human genome.


Applications of network theory to human population genetics:

from pathways to genotype networks

Giovanni Marco Dall'Olio

Pompeu Fabra University, Barcelona

Advisors: Jaume Bertranpetit and Hafid Laayouni


2

Acknowledgments

● I would like to thank:

– My PhD supervisors, Jaume Bertranpetit and Hafid Laayouni

– My committee: Dr. Mauro Santos, Dr. Ricard Solé, Prof. Guido Barbujani, Dr. Ferran Casals, Dr. Yolanda Espinosa

– The Evolutionary Systems Biology group at UPF

– The Institut de Biologia Evolutiva


3

Topics

● Context and motivations

● My research:

– Annotating the N-Glycosylation pathway

– Pathway approach on the N-Glycosylation pathway

– The Genotype Network Approach

– The Human Selection Browser and Biostar

● Conclusions


4

Context of the thesis

● The first anatomically modern humans appeared about 200,000 years ago

● How can we understand the signals of genetic adaptation that have accumulated in our genome since then?


5

Factors that influenced recent human evolution

Agriculture

Diseases

New climates


6

The opportunity

● We have access to large datasets of human sequences

● Better annotations on gene function and role


7

Contributions

● Find applications of network theory to understand genetic adaptation in the human species


8

Applications of network theory

● The Pathway approach

● The Genotype Network approach


9

Topics

● Context and motivations

● My research:

– Annotating the N-Glycosylation pathway

– Pathway approach on the N-Glycosylation pathway

– The Genotype Network Approach

– The Human Selection Browser and Biostar

● Conclusions


10

The Pathway approach

● Genes are organized in pathways

● Any selective constraint will be distributed among all the genes of a pathway


11

Distribution of selection forces in a pathway

● Some positions of the pathway will be more likely to have stronger signals of selection


12

Pathway Approach - outline

● Build a Network representation of a pathway

● Execute a test for positive selection on each gene

● Determine how the signals of selection are distributed on the network
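The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis code: the gene names (G1 to G5), the edge list and the per-gene scores are invented placeholders, and Pearson correlation is only one simple way to relate network position to signal strength.

```python
from collections import defaultdict
from math import sqrt

# Step 1: a toy pathway as a list of edges (hypothetical gene names).
edges = [("G1", "G2"), ("G2", "G3"), ("G2", "G4"), ("G4", "G5")]

def node_degrees(edge_list):
    """Count how many edges touch each node (undirected degree)."""
    degree = defaultdict(int)
    for a, b in edge_list:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)

# Step 2: placeholder per-gene selection scores (invented numbers; in practice
# these would come from a test such as FST or iHS).
scores = {"G1": 0.1, "G2": 0.8, "G3": 0.2, "G4": 0.5, "G5": 0.1}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Step 3: relate the signal to the position of each gene in the network.
degrees = node_degrees(edges)
genes = sorted(degrees)
r = pearson([degrees[g] for g in genes], [scores[g] for g in genes])
print(degrees, round(r, 3))
```

In this toy input the most connected gene also carries the highest score, so the correlation is strongly positive; real data would need a significance test as well.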


13

Pathway approach on the N-Glycosylation pathway

● Asparagine N-Glycosylation is a metabolic pathway for a type of protein modification

● The structure of this pathway is easy to represent as a network


14

N-glycosylation - upstream part

● Produces a single sugar called “N-Glycan precursor”

● This sugar is required for the proper folding of most membrane proteins

Adapted from Stanley, P., Schachter, H., & Taniguchi, N. (2009). N-Glycans. Essentials of Glycobiology.


15

N-Glycosylation and protein folding

● The product of the upstream part of N-glycosylation is used as a signal to distinguish folded and unfolded proteins

Folded protein Un-Folded protein


16

N-glycosylation - downstream part

● Complex pathway composed of thousands of reactions

● Produces multiple glycans, important for cell-to-cell interactions

Hossler, P., Mulukutla, B. C., & Hu, W.-S. (2007). Systems analysis of N-glycan processing in mammalian cells. PloS one, 2(1), e713. doi:10.1371/journal.pone.0000713


17

Glycans on the cell surface

● The surface of a cell is similar to a forest of glycosylated proteins

● Each organism and cell has a specific repertoire of glycans

A. Doerr, Glycoproteomics. Nature Methods, 2011. doi:10.1038/nmeth.1821


18

Annotating the N-Glycosylation pathway

● In order to build a correct network model for the N-Glycosylation pathway, we annotated it first in the Reactome database


19

The N-Glycosylation pathway in Reactome


20

The KEGG entry for N-Glycosylation is incomplete

Downstream N-Glycosylation in KEGG

Real representation of downstream N-Glycosylation


21

Another error for N-Glycosylation in KEGG


22

Erroneous annotation in String

● There are two genes with the symbol ALG2:

– ALG2 (Asparagine Linked Glycosylation 2)

– ALG-2 (Apoptosis Linked Gene – 2)

● In String, these two were confused


23

Ambiguous interpretation of the term N-Glycosylation in GO

N-Glycosylated protein

Merged

N-Glycosylated pathway


24

Annotating the N-Glycosylation pathway

● Annotated ~100 reactions in Reactome

● Fixed ~50 Gene Ontology terms

● Fixed key errors in String and KEGG


25

Network structure of N-Glycosylation pathway


26

Dataset used

● The CEPH-HGDP 650,000 Illumina chip dataset

● 940 individuals, from 50 human populations


27

Methods used

● The FST index → measure of population differentiation

● The iHS test → identification of signals of recent positive selection


28

FST – Population differentiation

● FST is a measure of population differentiation

● If the FST between two populations is 1, the two populations are fixed for different alleles
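As a worked illustration of the statement above, the sketch below computes FST for a single biallelic SNP from the allele frequencies of two populations, using the heterozygosity-based definition FST = (HT - HS) / HT. It assumes two equally sized populations and is not necessarily the estimator used in the thesis; the function name is hypothetical.

```python
def fst_two_populations(p1, p2):
    """FST for one biallelic SNP, given the frequency of the same allele
    in two equally sized populations (simplifying assumption)."""
    p_bar = (p1 + p2) / 2
    # Expected heterozygosity if the two populations were pooled.
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # the SNP is monomorphic overall: no differentiation
    # Mean expected heterozygosity within each population.
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total

print(fst_two_populations(1.0, 0.0))  # fixed for different alleles
print(fst_two_populations(0.5, 0.5))  # identical frequencies
```

When the populations are fixed for different alleles the within-population heterozygosity is zero, so FST reaches 1; identical frequencies give 0.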


29

Signatures of population differentiation in the N-Glycosylation pathway

FST signals are concentrated in the downstream part, and in the substrates biosynthesis


30

Population differentiation and network position

● Node degree correlates with the distribution of FST signals

● Genes with high FST are generally more connected


31

iHS and long-range haplotypes

● A selective sweep may cause the appearance of long homozygous haplotypes at a high frequency

● Example: a long homozygous haplotype present in the LCT gene in North-European populations

Vitti et al, Trends in genetics, 2012


32

iHS and long-range haplotypes:

● iHS compares the decay of Extended Haplotype Homozygosity (EHH) between the ancestral and the derived allele

Voight et al., PLoS Biology 2006
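To make the EHH comparison concrete, here is a simplified sketch. It assumes haplotypes are strings of 0/1 (0 = ancestral, 1 = derived), measures distance in number of SNPs rather than genetic distance, and sums EHH over distances instead of integrating; the published iHS additionally truncates the EHH curve and standardizes the score by allele frequency, which is omitted here. Function names and the toy haplotypes are illustrative only.

```python
from collections import Counter
from math import comb, log

def ehh(haplotypes, core, distance):
    """Probability that two randomly chosen haplotypes are identical
    from `distance` SNPs left of the core to `distance` SNPs right of it."""
    n = len(haplotypes)
    if n < 2:
        return 0.0
    lo, hi = max(0, core - distance), core + distance + 1
    counts = Counter(h[lo:hi] for h in haplotypes)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)

def unstandardized_ihs(haplotypes, core, max_distance):
    """ln(iHH_ancestral / iHH_derived), with iHH approximated by a plain sum.
    Needs at least two haplotypes carrying each allele at the core SNP."""
    ancestral = [h for h in haplotypes if h[core] == "0"]
    derived = [h for h in haplotypes if h[core] == "1"]
    ihh_a = sum(ehh(ancestral, core, d) for d in range(max_distance + 1))
    ihh_d = sum(ehh(derived, core, d) for d in range(max_distance + 1))
    return log(ihh_a / ihh_d)

# Toy data: derived haplotypes are identical, ancestral ones are diverse.
haplotypes = ["00100", "00100", "00100", "10001", "01010", "00000"]
score = unstandardized_ihs(haplotypes, core=2, max_distance=2)
print(round(score, 3))
```

A negative score means that homozygosity decays more slowly around the derived allele, the pattern expected when a selective sweep has carried the derived allele to high frequency.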


33

Signatures of selection in the N-Glycosylation pathway

No difference in the distribution of iHS signals between upstream and downstream


34

Signatures of selection in the N-Glycosylation pathway

GCS1: redirects to protein folding quality control

MAN2A1: redirects to Complex Glycans

MGAT3: redirects to Hybrid Glycans


35

Pathway approach on N-Glycosylation

● There is a difference in the patterns of population differentiation between the two parts of the N-Glycosylation pathway

● Signals of positive selection are more likely on key genes

● One of the few works applying the pathway approach to human genetics


36

Topics

● Context and motivations

● My research:

– Annotating the N-Glycosylation pathway

– Pathway approach on the N-Glycosylation pathway

– The Genotype Network Approach

– The Human Selection Browser and Biostar

● Conclusions


37

The Genotype Network approach

● Genotype Networks have been used to study the “innovability” and evolvability of a genetic system


38

The Genotype Network approach

● Genotype Networks have been used to study the “innovability” and evolvability of a genetic system

● Never applied to population genetics data, because they require too much data!


39

Genotype Networks - theory

● John Maynard Smith: the concept of a Protein Space, which is explored by populations


40

Genotype Networks - theory

● John Maynard Smith: the concept of a Protein Space, which is explored by populations

“if evolution by natural selection is to occur, functional proteins [or DNA sequences] must form a continuous network which can be traversed by unit mutational steps without passing through non-functional intermediates”


41

Neutralism and Selectionism

● Neutralism: most mutations are neutral or deleterious

● Selectionism: positive mutations drive evolution


42

Genotype Networks help reconcile Neutralism and Selectionism

● Cycles of neutral evolution alternate with cycles of selection

● Even neutral or negative mutations can be beneficial in the long run, because they allow populations to explore the genotype space


43

The Genotype Network - definitions

● The Genotype Space of a region of 5 SNPs can be represented as a network

● Each node is a possible genotype, and edges connect nodes that differ at only one position
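The definition above can be made concrete with a short sketch. It assumes each SNP is biallelic and encoded as 0 or 1, so a region of 5 SNPs has 2^5 = 32 possible genotypes; the function name is hypothetical and this is not the VCF2Space implementation.

```python
from itertools import product

def genotype_space(n_snps):
    """Return (nodes, edges) of the genotype space for `n_snps` biallelic SNPs.
    Nodes are 0/1 strings; an edge joins two strings that differ at one position."""
    nodes = ["".join(bits) for bits in product("01", repeat=n_snps)]
    edges = set()
    for node in nodes:
        for i, allele in enumerate(node):
            flipped = node[:i] + ("1" if allele == "0" else "0") + node[i + 1:]
            # Sort each pair so that an undirected edge is stored only once.
            edges.add(tuple(sorted((node, flipped))))
    return nodes, edges

nodes, edges = genotype_space(5)
print(len(nodes), len(edges))
```

Every node has exactly 5 neighbours, one per SNP, which gives 32 × 5 / 2 = 80 edges.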


44

The Genotype Network - definitions

● Green nodes are sequences observed in a population

● This is the Genotype Network of a population


45

Average Path Length of a Genotype Network

● This figure represents two populations

● The yellow one has a higher Average Path Length than the blue one


46

Average Degree

● This population has a high Average Degree

● It is more robust to mutations

● This population has a low Average Degree

● Mutations are more likely to fall outside the Genotype Network
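The two properties described in the last slides can be computed as in the sketch below. It is an illustration under stated assumptions, not the thesis code: genotypes are 0/1 strings, the genotype network is the part of the genotype space restricted to observed genotypes, and the average path length is taken over pairs of nodes that are connected to each other. All function names and the two toy populations are invented.

```python
from collections import deque

def neighbours(genotype):
    """Yield every genotype that differs from `genotype` at exactly one SNP."""
    for i, allele in enumerate(genotype):
        yield genotype[:i] + ("1" if allele == "0" else "0") + genotype[i + 1:]

def genotype_network(observed):
    """Adjacency lists restricted to the genotypes observed in a population."""
    observed = set(observed)
    return {g: [n for n in neighbours(g) if n in observed] for g in observed}

def average_degree(network):
    return sum(len(adj) for adj in network.values()) / len(network)

def average_path_length(network):
    """Mean shortest-path length over connected pairs (breadth-first search)."""
    total, pairs = 0, 0
    for source in network:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in network[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# Two toy populations with four observed genotypes each.
spread_out = genotype_network(["00000", "00001", "00011", "00111"])
compact = genotype_network(["00000", "00001", "00010", "00011"])
print(average_degree(spread_out), average_path_length(spread_out))
print(average_degree(compact), average_path_length(compact))
```

The spread-out population forms a chain, so it has the higher average path length and the lower average degree; the compact one forms a square in which every genotype has two observed neighbours.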


47

Dataset analyzed

● 1000 Genomes data, phase 1

● 850 individuals genotyped, grouped into three continental groups (AFR, EUR and ASN)


48

The VCF2Space library

● Suite of Python scripts to calculate Genotype Networks from a VCF file

● ~400,000 lines of code

● ~350 unit tests


49

Splitting the genome into windows of 11 SNPs

● Fewer than 11 SNPs -> networks are too small and condensed

● More than 11 SNPs -> networks are too large and sparse

Small network Large network
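A minimal sketch of the windowing step follows. Whether the thesis uses overlapping or non-overlapping windows is not stated on the slide; the version below assumes consecutive, non-overlapping windows and drops a trailing window that is shorter than 11 SNPs. The function name is hypothetical.

```python
def snp_windows(snps, size=11):
    """Yield consecutive, non-overlapping windows of `size` SNPs.
    A final incomplete window is discarded (an assumption of this sketch)."""
    for start in range(0, len(snps) - size + 1, size):
        yield snps[start:start + size]

# Stand-in for 25 SNP positions along a chromosome.
positions = list(range(25))
windows = list(snp_windows(positions))
print(len(windows), [len(w) for w in windows])
```

With biallelic SNPs, each window of 11 SNPs spans a genotype space of 2^11 = 2048 possible genotypes.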


50

Why windows of 11 SNPs?


51

Genotype Network properties of the human genome

http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&hubUrl=http://bioevo.upf.edu/~gdallolio/genotype_space/hub.txt


52

Coding & Non-Coding regions

● Coding regions have higher average path length and degree than non-coding regions


53

Genotype Networks and Selection (simulated data)

Selection

Neutral


54

● Coding networks: high average path length and degree

● Non-coding networks: low average path length and degree

● Recent selection: lower average path length and degree


55

Genotype Network: currently under review


56

Topics

● Context and motivations

● My research:

– Annotating the N-Glycosylation pathway

– Pathway approach on the N-Glycosylation pathway

– The Genotype Network Approach

– The Human Selection Browser and Biostar

● Conclusions


57

Other works: The Human Selection Browser

● We applied 21 tests for positive selection to the 1000 Genomes dataset

– FST, CLR, iHS, etc.

● This dataset will be published and made freely available as a genome browser


58

Other works: Biostar

● An online forum for bioinformatics

● About 150,000 visits per month

● Helped thousands of bioinformaticians!


59

Topics

● Context and motivations

● My research:

– Annotating the N-Glycosylation pathway

– Pathway approach on the N-Glycosylation pathway

– The Genotype Network Approach

– The Human Selection Browser and Biostar

● Conclusions


60

Conclusions (I)

● We developed two applications of network theory to the study of human population genetics.

● We produced a network model of the N-Glycosylation pathway, contributing it to the Reactome database and improving the annotations in other databases.

● We showed that the downstream part of the N-Glycosylation pathway shows more signatures of genetic differentiation than the upstream part. This is compatible with the role and structure of this part of the pathway.

● We showed that key genes of the N-Glycosylation pathway, such as GCS1, MGAT3 and MAN2A1, show signatures of recent positive selection in human populations.


61

Conclusions (II)

● We produced a suite of Python scripts, called VCF2Space, to apply the concept of Genotype Networks to Single Nucleotide Polymorphism data

● Our genome-wide application of Genotype Networks showed that coding regions tend to have networks with higher average degree and path length than non-coding regions

● We contributed positively to the bioinformatics community, providing resources such as the 1000 Genomes Selection Browser and Biostar



64

Figures credits

● Slide 5:

– humans: http://blogs.ancestry.com/ancestry/

– star trek: http://en.wikipedia.org/wiki/Star_Trek:_The_Original_Series

● Slide 6:

– Malaria: http://science.psu.edu/news-and-events/2012-news/Read7-2012

– Climates: http://www.ancienteco.com/2012/03/climate-change-drives-human-evolution.html

– Agriculture: http://en.wikipedia.org/wiki/History_of_agriculture

● Slide 7:

– 1000 Genomes, CEPH-HGDP panel, UK10K, Hapmap websites

● Slide 14:

– Cover of Science, 23 March 2001

● Slide 15:

– Adapted from Stanley, P., Schachter, H., & Taniguchi, N. (2009). N-Glycans. Essentials of Glycobiology.

● Slide 17:

– Glycosylation, downstream: Hossler, P., Mulukutla, B. C., & Hu, W.-S. (2007). Systems analysis of N-glycan processing in mammalian cells. PloS one, 2(1), e713. doi:10.1371/journal.pone.0000713


65

Figures credits

● Slide 27:

– http://www.cephb.fr/en/hgdp/diversity.php/

● Slide 29:

– http://www.rationalskepticism.org

● Slide 32:

– Adapted from Vitti et al, 2012

● Slide 42:

– Wikipedia


66

The Pathway approach

Stronger Selection on Genes with high connectivity or upstream of a pathway


67

N-glycosylation – how does it work

● All the N-glycans are generated from a single sugar with a very conserved structure, called N-glycan precursor

N-glycan precursor

Signal for folded proteins

Millions of different glycans


68

The FST test

Almost all the highest signals of FST are in genes of the downstream part


69

The iHS test

GCS1 in EUR

MAN2A1 in SSAFR and EASIA

MGAT3 in EASIA


70

Combining p-values

● Fisher's combination test

● ZF follows a χ²(2K) distribution

● SNPs from the same gene may violate the assumption of independence, but the method is still robust to errors

From Peng et al, Eur J Hum Genet. 2010
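For reference, the standard form of Fisher's combination statistic is ZF = -2 Σ ln(p_i) over the K p-values being combined, which follows a χ² distribution with 2K degrees of freedom when the p-values are independent. The sketch below computes the statistic and its p-value using only the standard library; it relies on the closed-form survival function of a χ² distribution with an even number of degrees of freedom. The function name is hypothetical and the input p-values are arbitrary examples.

```python
from math import exp, log

def fisher_combined_pvalue(pvalues):
    """Combine K independent p-values with Fisher's method.
    Returns (Z_F, combined p-value), where Z_F = -2 * sum(ln p_i)."""
    k = len(pvalues)
    z_f = -2 * sum(log(p) for p in pvalues)
    # Survival function of chi-square with 2K degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{K-1} (x/2)^i / i!
    half = z_f / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return z_f, exp(-half) * total

z_f, combined = fisher_combined_pvalue([0.05, 0.05])
print(round(z_f, 3), round(combined, 4))
```

Two SNPs with p = 0.05 each combine to a p-value of roughly 0.017, showing how several moderate signals in the same gene reinforce each other.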


71

Comparing upstream and downstream N-Glycosylation

● χ² test comparing the number of events observed in each part of the pathway against the number expected if there were no pathway structure
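One way to set up such a test is sketched below. It assumes that, without pathway structure, events would be spread across the two parts in proportion to the number of units tested in each (genes or SNPs); that expectation, the function name and the counts in the example are assumptions for illustration, not figures from the thesis.

```python
from math import erfc, sqrt

def chi2_upstream_vs_downstream(obs_up, obs_down, tested_up, tested_down):
    """Chi-square goodness-of-fit test with 1 degree of freedom.
    Expected counts are proportional to the number of units tested per part."""
    total_obs = obs_up + obs_down
    total_tested = tested_up + tested_down
    exp_up = total_obs * tested_up / total_tested
    exp_down = total_obs * tested_down / total_tested
    stat = (obs_up - exp_up) ** 2 / exp_up + (obs_down - exp_down) ** 2 / exp_down
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value

# Invented counts: 2 events upstream, 18 downstream, 100 units tested in each.
stat, p_value = chi2_upstream_vs_downstream(2, 18, 100, 100)
print(round(stat, 2), p_value)
```

With equal numbers tested, 10 events are expected in each part, so observing 2 and 18 gives a statistic of 12.8 and a small p-value.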


72

How to convert genotypes to networks

● Two haplotypes per individual

● Reference allele → 0; Alternative allele → 1

Individual 1 AC AC AA GG TT TG CA TG

haplotype a 0 0 0 0 0 0 0 0

haplotype b 1 1 0 0 0 1 1 1

Ancestral alleles: A A A G T T C T
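The table above can be reproduced with the short sketch below. It assumes that genotypes are phased, with the first letter of each pair belonging to haplotype a and the second to haplotype b, and it follows the table in using the ancestral allele as the 0 state. The function name is hypothetical.

```python
def encode_haplotypes(genotypes, zero_alleles):
    """Split phased two-letter genotypes into two 0/1 haplotypes.
    An allele equal to the corresponding entry of `zero_alleles` is coded 0,
    any other allele is coded 1."""
    hap_a = [0 if g[0] == z else 1 for g, z in zip(genotypes, zero_alleles)]
    hap_b = [0 if g[1] == z else 1 for g, z in zip(genotypes, zero_alleles)]
    return hap_a, hap_b

genotypes = "AC AC AA GG TT TG CA TG".split()
ancestral = "A A A G T T C T".split()
hap_a, hap_b = encode_haplotypes(genotypes, ancestral)
print(hap_a)
print(hap_b)
```

Each encoded haplotype is one node in the genotype space, so an individual contributes up to two observed nodes to the genotype network of its population.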