SNPs: the HapMap and 1000 Genomes Projects Joseph Replogle Cavalcanti Lab Group 5/25/2012

Page 1: SNPs Presentation Cavalcanti Lab

SNPs: the HapMap and 1000 Genomes Projects

Joseph Replogle, Cavalcanti Lab Group

5/25/2012

Page 2: SNPs Presentation Cavalcanti Lab

Understanding Human Genetic Variation Within and Among Populations

Page 3: SNPs Presentation Cavalcanti Lab

Types of Human Genetic Variation

• Individual: de novo and rare variations
• Population: variations which have become established within a population
  – Single Nucleotide Polymorphisms (SNPs): base pair substitutions
    • Transition: purine <-> purine (A<->G), pyrimidine <-> pyrimidine (C<->T)
    • Transversion: purine <-> pyrimidine
    • Common SNPs: minor allele frequency (MAF) of at least ~1-5% in major populations
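The transition/transversion split and the MAF cutoff can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names `classify_substitution` and `minor_allele_frequency` are hypothetical, and MAF is taken here as the frequency of the second most common allele in a sample of chromosomes.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Label a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Transition: both bases are purines, or both are pyrimidines.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def minor_allele_frequency(alleles) -> float:
    """MAF of one site, given the allele observed on each sampled chromosome."""
    counts = Counter(a.upper() for a in alleles)
    if len(counts) < 2:
        return 0.0  # monomorphic in this sample
    # Frequency of the second most common allele (assumed definition).
    second_most_common = counts.most_common(2)[1][1]
    return second_most_common / sum(counts.values())
```

For example, a site observed as 95 copies of A and 5 copies of G is a transition with a MAF of 0.05, which sits at the upper end of the "common" cutoff range quoted above.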

Page 4: SNPs Presentation Cavalcanti Lab

Types of Human Genetic Variation (cont.)

– Copy-Number Variations (CNVs):
  • insertions, deletions, duplications of DNA segments (>1kb)
– Other Variations:
  • Structural: inversions
  • Repeats: microsatellites (STRs), minisatellites (VNTRs)
  • Frameshift mutations

Page 5: SNPs Presentation Cavalcanti Lab

SNP Distribution throughout the Genome

• Genetic variability throughout the genome reflects function (among other factors)

HLA!

Sachidanandam et al. 2001

Page 6: SNPs Presentation Cavalcanti Lab

Factors Affecting SNP Distribution

• Intrinsic, Structural: mutation clusters due to recombination events and sequence context-specific effects [3,4]
  – a) Time to Most Recent Common Ancestor of genes in population influences SNPs (older genes -> more SNPs in population)
  – b) base composition, local recombination, gene density, chromatin structure, nucleosome position, replication timing

Lercher and Hurst 2002

Page 7: SNPs Presentation Cavalcanti Lab

Factors Affecting SNP Distribution (cont.)

• Functional: mutation clusters due to natural selection (examples include immunoglobulin genes)
  a) balancing selection increases diversity
  b) purifying and directional selection decrease diversity
  c) transcriptional activity
• Ascertainment bias: better characterization of SNPs around genes of interest [5]

Page 8: SNPs Presentation Cavalcanti Lab

Effects of Genetic Variation

• Pathogenic and non-pathogenic heritable traits
• Genetic variation reveals millions of years of human history
  – “One can think of selective pressures as natural, in vivo human experiments in which we can measure the response of human populations to unknown perturbations, and these alterations can inform the function of genes within a given locus.” Raj et al. 2012
  – Understand the history of mutation, selection and recombination within the human genome

Page 9: SNPs Presentation Cavalcanti Lab

Potential Uses of SNP data

Ultimately, synergy of genomics and functional work will allow us to understand human traits and disease.

• Association Mapping: Genome Wide Association (GWA) studies, Pharmacogenomics
• Modeling Mendelian and Complex diseases
• eQTL and functional genomics
• Selection!

Page 10: SNPs Presentation Cavalcanti Lab

Selection: EHH and iHS

• Extended Haplotype Homozygosity (EHH)
• Integrated Haplotype Score (iHS)

Chromosome 2

Voight et al. 2006
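As a rough illustration of EHH, the sketch below uses the commonly quoted definition: among chromosomes carrying a chosen core allele, EHH at a given distance is the probability that two randomly drawn chromosomes are identical across every marker from the core out to that distance. The function name `ehh` and the string encoding of haplotypes are assumptions made for this example. iHS builds on this by integrating EHH over distance for the ancestral and derived alleles and standardizing the log ratio, which is not implemented here.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, end_index):
    """EHH from the core SNP out to end_index (inclusive).

    haplotypes: list of equal-length strings, one character per SNP.
    Only chromosomes carrying core_allele at core_index are considered.
    """
    lo, hi = sorted((core_index, end_index))
    carriers = [h[lo:hi + 1] for h in haplotypes if h[core_index] == core_allele]
    n = len(carriers)
    if n < 2:
        return 0.0  # not enough chromosomes to form a pair
    # Count pairs of identical extended haplotypes over all possible pairs.
    groups = Counter(carriers)
    return sum(comb(count, 2) for count in groups.values()) / comb(n, 2)
```

With the toy sample `["1100", "1100", "1101", "1011", "0000"]` and core allele `"1"` at position 0, EHH falls from 1.0 at the core to 0.5 one marker out and to about 0.17 at the last marker, the kind of decay that an EHH plot displays.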

Page 11: SNPs Presentation Cavalcanti Lab

Selection of Lassa Fever Susceptibility Genes in YRI populations

Andersen et al. (2012)

Page 12: SNPs Presentation Cavalcanti Lab

eQTL

Positive Selection

SLE susceptibility locus (rs11755393; GWAS p = 2.20 x 10^-8)

Slide from Replogle and Raj

Page 13: SNPs Presentation Cavalcanti Lab

International HapMap Project

• “to identify and catalog genetic similarities and differences in human beings”

• Haplotype Map: SNPs (genotypes) at separate loci whose alleles are statistically associated due to limited genetic recombination

HapMap Project

Page 14: SNPs Presentation Cavalcanti Lab

Linkage Disequilibrium (LD)

• Alleles at different loci are not independent due to

[Figure: two-locus haplotypes (AB, Ab, aB, ab) and allele frequencies (Af, af, Bf, bf), compared under linkage equilibrium and linkage disequilibrium]

Image by Gil McVean
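The equilibrium versus disequilibrium contrast can be made concrete with the standard two-locus LD statistics D, D' and r^2. The sketch below is an illustration that does not appear in the slides; the function name `ld_stats` is hypothetical, and it assumes the four haplotype frequencies are known and sum to 1.

```python
def ld_stats(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, |D'|, r^2) from the four two-locus haplotype frequencies."""
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = p_AB + p_Ab  # frequency of allele A at the first locus
    p_B = p_AB + p_aB  # frequency of allele B at the second locus
    # D: observed AB frequency minus the frequency expected under independence.
    d = p_AB - p_A * p_B
    # D is bounded by the allele frequencies; D' rescales it to [0, 1].
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2
```

Four equally frequent haplotypes (0.25 each) give D = 0 and r^2 = 0, that is, linkage equilibrium; a population containing only AB and ab at 0.5 each gives D' = 1 and r^2 = 1, complete disequilibrium.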

Page 15: SNPs Presentation Cavalcanti Lab

Origin of LD

The mutation arises on a particular genetic background

If the mutation increases in frequency, the associated haplotype will also increase in frequency.

Factors Increasing LD:
1) Genetic Drift (stochastic sampling)
2) Selection
3) Non-Random Mating
4) Population Structure

Over time the association between the new mutation and linked mutations will decay by recombination

Recombination is the only factor which decreases LD.


Image modified from Gil McVean
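The decay of LD by recombination is usually summarized by the textbook relation D_t = D_0 (1 - r)^t, where r is the recombination fraction between the two loci per generation. The sketch below assumes that model (random mating, no drift or selection); it is not taken from the slides, and the function names are hypothetical.

```python
import math


def ld_after_generations(d0, recomb_fraction, generations):
    """Expected D after the given number of generations: D_0 * (1 - r)^t."""
    return d0 * (1.0 - recomb_fraction) ** generations


def generations_to_halve(recomb_fraction):
    """Number of generations for D to fall to half of its starting value."""
    return math.log(0.5) / math.log(1.0 - recomb_fraction)
```

Under this model, unlinked loci (r = 0.5) lose half of their LD every generation, while tightly linked loci with r = 0.001 need roughly 690 generations to do the same, which is why haplotypes around a recent mutation stay recognizable for a long time.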

Page 16: SNPs Presentation Cavalcanti Lab

Haplotype

• ~10^7 common (MAF >1%) SNPs in the human genome
• ‘tag SNPs’ allow for identification of an individual’s haplotypes
• Estimated 300,000-600,000 tag SNPs in genome
• Genotyping: testing tag SNPs
• Sequencing: whole genome sequence

HapMap Project
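One way to see why a few hundred thousand tag SNPs can stand in for millions of common SNPs is a greedy, r^2-based selection: repeatedly pick the SNP that is in strong LD with the most SNPs not yet covered. This is a simplified sketch of one common strategy, not the procedure the HapMap Project itself used; the names `r_squared` and `greedy_tag_snps`, the 0/1 haplotype encoding, and the 0.8 threshold are all assumptions made for the example.

```python
def r_squared(haplotypes, i, j):
    """r^2 between SNP columns i and j of phased 0/1 haplotype strings."""
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n
    p_j = sum(h[j] == "1" for h in haplotypes) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # at least one SNP is monomorphic in this sample
    return (p_ij - p_i * p_j) ** 2 / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick tag SNPs so every SNP has r^2 >= threshold with some tag."""
    m = len(haplotypes[0])
    # captures[i]: the SNPs that SNP i would tag, including itself.
    captures = {
        i: {j for j in range(m)
            if i == j or r_squared(haplotypes, i, j) >= threshold}
        for i in range(m)
    }
    untagged = set(range(m))
    tags = []
    while untagged:
        # Choose the SNP covering the most untagged SNPs (lowest index on ties).
        best = max(sorted(untagged), key=lambda i: len(captures[i] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags
```

In a toy sample where two SNPs are perfectly correlated and a third is only weakly associated with them, the procedure keeps one SNP from the correlated pair plus the third SNP, so two genotyping assays recover information on all three sites.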

Page 17: SNPs Presentation Cavalcanti Lab

HapMap Populations

• 270 total DNA samples
• Yoruba in Ibadan, Nigeria (YRI)
• Japanese in Tokyo, Japan (JPT)
• Han Chinese in Beijing, China (CHB)
• CEPH (Utah residents with ancestry from northern and western Europe) (CEU)

Page 18: SNPs Presentation Cavalcanti Lab

HapMap Methodology

• Genotype individuals for several million SNPs
  – 1 SNP per 5kb or less
  – MAF >1% as estimated by TSC project, JSNP, dbSNP, and initial SNP map
  – Random shotgun sequencing to obtain additional SNPs
  – Coding and noncoding SNPs
• Data analysis to identify LD and Haplotype maps
• Tag SNPs are useful with haplotype and recombination map
• Data available online in multiple formats

http://hapmap.ncbi.nlm.nih.gov/downloads/index.html.en

Page 19: SNPs Presentation Cavalcanti Lab

HapMap Methodology (cont.)

• Phase III data released 2009

Page 20: SNPs Presentation Cavalcanti Lab

Reference Genome?

• Mosaic haploid DNA sequence

• GRCh37

Page 21: SNPs Presentation Cavalcanti Lab

1000 Genomes

• “to find most genetic variants that have frequencies of at least 1% in the populations studied”

• Low coverage sequencing of >2000 individuals, exome sequencing, trios

• Characterization of SNPs and Structural Variants (INDELs)

Page 22: SNPs Presentation Cavalcanti Lab

1000 Genomes Populations

• Yoruba in Ibadan, Nigeria (YRI)
• Japanese in Tokyo, Japan (JPT)
• Han Chinese in Beijing, China (CHB)
• CEPH (Utah residents with ancestry from northern and western Europe) (CEU)
• Luhya in Webuye, Kenya (LWK)
• Toscani in Italy (TSI)
• Peruvians in Lima, Peru (PEL)
• Mexican ancestry in Los Angeles, CA (MXL)
• And many more!

Page 23: SNPs Presentation Cavalcanti Lab

“Low-Coverage” Sequencing

• Sequencing:
  1) DNA copies broken into short pieces
  2) Each piece is sequenced (random pieces means most of genome is covered)
  3) Sequenced fragments are aligned and joined to determine complete genome
• 28X sequencing coverage necessary for complete genome
• Low-coverage sequencing (4X coverage): many pieces of individual genomes are missed
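A back-of-the-envelope way to compare 4X with 28X is to treat the read depth at each base as Poisson-distributed around the mean coverage (the Lander-Waterman style idealization). The sketch below rests on that assumption, which ignores sequencing bias and mapping problems; the function names and the read numbers in the usage note are hypothetical.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_size


def prob_depth_below(coverage, min_depth):
    """Poisson probability that a base is covered by fewer than min_depth reads."""
    return sum(
        math.exp(-coverage) * coverage ** k / math.factorial(k)
        for k in range(min_depth)
    )
```

For instance, 120 million reads of 100 bp over a 3 Gb genome give 4X mean coverage. Under the Poisson assumption about 1.8% of bases then receive no read at all and roughly 43% receive fewer than 4 reads, whereas at 28X the chance of fewer than 4 reads is negligible. Low-coverage projects therefore depend on combining information across many individuals.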

Page 24: SNPs Presentation Cavalcanti Lab

1000 Genomes Data

• Latest release:
  – 1092 samples
  – SNP, indel, and large deletion
  – Autosomes and chrX
  – ~38.2 M SNPs from low coverage and exome sequencing
• 1000genomes site has a link to an NCBI FTP with their latest data

Page 25: SNPs Presentation Cavalcanti Lab

VCF file format

• Variant Call Format 4.1: meta-info followed by header and data

• tab-delimited text file
• Compressed (.gz); for example, to keep the header lines and the lines containing "SNP":
  zcat file.vcf.gz | grep -e ^# -e SNP | bgzip -c > snps.vcf.gz
• http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
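For readers working in Python rather than the shell, the same filter can be approximated with the standard `gzip` module. This is a sketch under stated assumptions: the function name `filter_vcf_lines` is hypothetical, and the output is plain gzip rather than the BGZF blocks that `bgzip` writes, so it cannot be indexed with tabix unless it is recompressed.

```python
import gzip


def filter_vcf_lines(src_path, dst_path, keyword="SNP"):
    """Copy header lines and lines containing keyword; return data lines kept."""
    kept = 0
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)  # meta-info and column header lines
            elif keyword in line:
                dst.write(line)  # same substring match as `grep -e SNP`
                kept += 1
    return kept
```

Like the `grep` version, this is a plain substring match, so it keeps any record whose text contains "SNP" (for example an INFO entry such as `VT=SNP`) and does not inspect the REF and ALT alleles.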

Page 26: SNPs Presentation Cavalcanti Lab

Columns in VCF format

• CHROM: chromosome (no colons)
• POS: numerical reference position, with the 1st base having position 1 (some variants have multiple pos records)
• ID: semicolon-separated list of unique identifiers where available (ex. dbSNP rs number)
• REF: reference base(s) A,C,G,T,N (case insensitive) for a given variant
• ALT: comma-separated list of alternate non-reference alleles called on at least one of the samples
• QUAL: phred-scaled quality score for the assertion made in ALT, i.e. -10 log_10 prob(call in ALT is wrong)
• FILTER: another quality measure; PASS if this position has passed all filters
• INFO: semicolon-separated additional info; ex. AF (allele frequency), DB (dbSNP membership), VALIDATED
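The eight fixed columns map naturally onto a small record type. The sketch below splits one tab-delimited data line, decodes INFO into a dictionary, and converts a phred-scaled QUAL back into an error probability. The names `VcfRecord`, `parse_vcf_line`, `is_snp` and `phred_to_error_prob` are hypothetical, and real files have many edge cases that a dedicated library would handle better.

```python
from typing import NamedTuple, Optional


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    ids: list
    ref: str
    alts: list
    qual: Optional[float]
    filter: str
    info: dict


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse the 8 fixed columns of one VCF data line (not a '#' line)."""
    chrom, pos, id_, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    info_dict = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True  # flags such as DB have no value
    return VcfRecord(
        chrom=chrom,
        pos=int(pos),
        ids=[] if id_ == "." else id_.split(";"),
        ref=ref.upper(),
        alts=[] if alt == "." else [a.upper() for a in alt.split(",")],
        qual=None if qual == "." else float(qual),
        filter=flt,
        info=info_dict,
    )


def is_snp(rec: VcfRecord) -> bool:
    """True when REF and every ALT allele is a single base."""
    return (len(rec.ref) == 1 and bool(rec.alts)
            and all(len(a) == 1 and a in "ACGTN" for a in rec.alts))


def phred_to_error_prob(qual: float) -> float:
    """Invert QUAL = -10 * log10(P(call is wrong))."""
    return 10 ** (-qual / 10)
```

As a usage example, a line with REF `G`, ALT `A`, QUAL 30 and FILTER `PASS` parses to a SNP record whose call has an estimated 1 in 1000 chance of being wrong.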

Page 27: SNPs Presentation Cavalcanti Lab

Durbin et al. 2010

Page 28: SNPs Presentation Cavalcanti Lab

Interested?

• Get Prof. Cavalcanti to buy Human Evolutionary Genetics: Origins, Peoples and Disease

Page 29: SNPs Presentation Cavalcanti Lab

References

1. Sachidanandam R et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928-933.
2. Lercher MJ and Hurst LD (2002) Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet 18: 337-340.
3. Rogozin IB and Pavlov YI (2003) Theoretical analysis of mutational hotspots and their DNA sequence context specificity. Mutat Res 544(1): 65-85.
4. Ma X et al. (2012) Mutation Hot Spots in Yeast Caused by Long-Range Clustering of Homopolymeric Sequences. Cell Reports 1(1): 36-42.
5. Clark AG et al. (2005) Ascertainment bias in studies of human genome-wide polymorphism. Genome Res 15: 1496-1502.
6. Raj T et al. (2012) Alzheimer Disease Susceptibility Loci: Evidence for a Protein Network under Natural Selection. AJHG 90: 720-726.
7. Voight BF et al. (2006) A Map of Recent Positive Selection in the Human Genome. PLoS Biology 4(3): e72.
8. Andersen KG et al. (2012) Genome-wide scans provide evidence for positive selection of genes implicated in Lassa fever. Philos Trans R Soc Lond B Biol Sci 367(1590): 868-877.
9. Hapmap.org
10. McVean G (2004) Population Genetics of the Human Genome. Oxford Human Genome Lecture Series.
11. Gibbs RA et al. (2003) The International HapMap Project. Nature 426: 789-796.
12. 1000genomes.org
13. Durbin RM et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467(7319): 1061-1073.