An integrated map of genetic variation from 1,092 human genomes The 1000 Genomes Project Consortium http://www.1000genomes.org Nature 491, 56–65 (01 November 2012)

1000 Genomes project, Phase 1 results.

Page 1: An integrated map of genetic variation from 1,092

An integrated map of genetic variation from 1,092 human genomes

The 1000 Genomes Project Consortium
http://www.1000genomes.org

Nature 491, 56–65 (01 November 2012)

Page 2: An integrated map of genetic variation from 1,092

Primary goal
• to create a complete and detailed catalogue of human genetic variation, which in turn can be used for association studies relating genetic variation to disease.

Page 3: An integrated map of genetic variation from 1,092

Primary goal
• to discover >95% of the variants (e.g. SNPs, CNVs, indels) with minor allele frequencies as low as 1% across the genome and 0.1–0.5% in gene regions

• to estimate the population frequencies, haplotype backgrounds and linkage disequilibrium patterns of variant alleles

Page 4: An integrated map of genetic variation from 1,092

Secondary goals
• support of better SNP and probe selection for genotyping platforms in future studies

• improvement of the human reference sequence

• the completed database will be a useful tool for studying regions under selection, variation in multiple populations and understanding the underlying processes of mutation and recombination.

Page 5: An integrated map of genetic variation from 1,092

Project design
• to sequence each sample to about 4X coverage; at this depth sequencing cannot provide the complete genotype of each sample, but should allow the detection of most variants with frequencies as low as 1%.

• Combining the data from 2500 samples should allow highly accurate estimation (imputation) of the variants and genotypes for each sample that were not seen directly by the light sequencing.
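The trade-off described above can be illustrated with a rough back-of-the-envelope sketch. It assumes a simple Poisson model of read depth and an even split of coverage between the two chromosomes of a diploid sample; both are simplifying assumptions made here for illustration, not the project's actual calling model, and the function names are hypothetical.

```python
import math

def expected_alt_reads(n_samples, allele_freq, coverage):
    """Expected reads carrying the variant allele, summed over all samples."""
    copies = 2 * n_samples * allele_freq   # chromosomes carrying the variant
    per_copy_depth = coverage / 2          # assumption: coverage splits evenly over two chromosomes
    return copies * per_copy_depth

def p_het_unseen(coverage):
    """Poisson probability that a heterozygous carrier shows no read of the variant allele."""
    return math.exp(-coverage / 2)

reads = expected_alt_reads(n_samples=2500, allele_freq=0.01, coverage=4.0)
missed = p_het_unseen(coverage=4.0)
print(f"expected variant-supporting reads across the cohort: {reads:.0f}")
print(f"chance a single carrier shows no variant read: {missed:.3f}")
```

Under these assumptions a 1% variant is supported by many reads across the cohort even though any one carrier has a noticeable chance of showing none, which is why per-sample genotypes have to be completed by imputation.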

Page 6: An integrated map of genetic variation from 1,092

Project design / Stages
• The 1000 genomes full project has been divided into phases to represent the dispersed nature of the sample collection.

Page 7: An integrated map of genetic variation from 1,092

Project design / Stages / Pilot
Three pilot studies provided data to inform the design of the full-scale project:
• Pilot 1: low coverage pilot (2-4X, WGS of 180 samples)
• Pilot 2: high coverage pilot (20-60X, WGS of 2 mother-father-adult child trios)
• Pilot 3: the exon targeted pilot (50X, 1000 gene regions in 900 samples)
The pilot was completed in 2009.

Page 9: An integrated map of genetic variation from 1,092

Project design / Stages / Phase 1
Phase 1 represents low coverage and exome data analysis available for the first 1092 samples.
DONE! Results published in Nature 491, 56–65 (01 November 2012)

Page 10: An integrated map of genetic variation from 1,092

But that's not all!

Page 11: An integrated map of genetic variation from 1,092

Project design / Stages / Phase 2
• Phase 2 represents an expanded set of samples, around 1700 in number (the sequence data has been finalized).

• This data is being used for method development, both to improve on existing methods from phase 1 and to develop new methods to handle features like multi-allelic variant sites and true integration of complex variation and structural variants.

Page 12: An integrated map of genetic variation from 1,092

Project design / Stages / Phase 3
• Phase 3 represents 2500 samples including new African samples and samples from South Asia. The new methods developed in phase 2 will be applied to this data set and a final catalogue of variation will be released.

Page 13: An integrated map of genetic variation from 1,092

Amounts of Data
• Full genomic sequence of 1,700 individuals is now available (200TB of genomic data).

Page 14: An integrated map of genetic variation from 1,092

Amounts of Data
• > 2 human genomes every 24 hours
• 60-fold more sequence data than what has been published in DNA databases over the past 25 years.

Page 15: An integrated map of genetic variation from 1,092

Samples
• 14 populations
• 4 Ancestry-based groups

Page 17: An integrated map of genetic variation from 1,092

Samples / Ancestry-based groups
• Europe (IBS (Iberian Populations in Spain), GBR (British from England and Scotland), CEU (Utah residents with ancestry from northern and western Europe), FIN (Finnish in Finland), TSI (Toscani in Italia));

• East Asia (JPT (Japanese in Tokyo, Japan), CHB (Han Chinese in Beijing, China), CHS (Han Chinese South));

• Africa (ASW (African Ancestry in SW USA), YRI (Yoruba in Ibadan, Nigeria), LWK (Luhya in Webuye, Kenya));

• Americas (MXL (Mexican Ancestry in Los Angeles, CA, USA), PUR (Puerto Ricans in Puerto Rico), CLM (Colombians in Medellin, Colombia)).
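For downstream analysis it is often convenient to have this grouping as a lookup table. The sketch below simply transcribes the population codes listed above; the names `ANCESTRY_GROUPS`, `POP_TO_GROUP` and `group_of` are hypothetical helpers, not part of any project tooling.

```python
# Population codes per ancestry-based group, transcribed from the slide above.
ANCESTRY_GROUPS = {
    "Europe": ["IBS", "GBR", "CEU", "FIN", "TSI"],
    "East Asia": ["JPT", "CHB", "CHS"],
    "Africa": ["ASW", "YRI", "LWK"],
    "Americas": ["MXL", "PUR", "CLM"],
}

# Reverse index: population code -> ancestry-based group.
POP_TO_GROUP = {
    pop: group
    for group, pops in ANCESTRY_GROUPS.items()
    for pop in pops
}

def group_of(pop_code):
    """Return the ancestry-based group for a population code (case-insensitive)."""
    return POP_TO_GROUP[pop_code.upper()]

print(len(ANCESTRY_GROUPS), "groups,", len(POP_TO_GROUP), "populations")
print("YRI belongs to:", group_of("YRI"))
```

The counts printed by the sketch agree with the 14 populations and 4 ancestry-based groups stated on the Samples slide.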

Page 18: An integrated map of genetic variation from 1,092

Data
• combination of low-coverage (2–6x) whole-genome sequence data, targeted deep (50–100x) exome sequence data and dense SNP genotype data.

• the approach was augmented with statistical methods for selecting higher quality variant calls from candidates obtained using multiple algorithms, and to integrate SNP, indel and larger structural variants within a single framework

Page 21: An integrated map of genetic variation from 1,092

• A key goal of the 1000 Genomes Project was to identify more than 95% of SNPs at 1% frequency in a broad set of populations.

• The current resource includes ~50%, 98% and 99.7% of the SNPs with frequencies of ~0.1%, 1.0% and 5.0%, respectively, in ~2,500 UK-sampled genomes.

Page 23: An integrated map of genetic variation from 1,092

Genetic variation
• 3.60 million single nucleotide polymorphisms (SNPs), of which 24,000 were in GENCODE (coding) regions

• 344,000 small indels (440 coding), which gives a ratio of 1:10 with SNPs in human genomes, and demonstrates the strong selection against indels in coding regions.

• 717 large deletions (the most confident category of SVs that we currently can detect), of which 39 overlapped GENCODE regions.
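The 1:10 ratio and the depletion of indels in coding regions follow directly from the counts quoted above. A minimal arithmetic sketch (variable names are illustrative only):

```python
# Counts as quoted on the slide.
snps_total, snps_coding = 3_600_000, 24_000
indels_total, indels_coding = 344_000, 440

genome_ratio = indels_total / snps_total    # indels per SNP, genome-wide
coding_ratio = indels_coding / snps_coding  # indels per SNP, coding regions
depletion = genome_ratio / coding_ratio     # how much rarer indels are in coding sequence

print(f"genome-wide: about 1 indel per {1 / genome_ratio:.0f} SNPs")
print(f"coding:      about 1 indel per {1 / coding_ratio:.0f} SNPs")
print(f"relative depletion in coding regions: about {depletion:.1f}-fold")
```

Relative to SNPs, indels come out roughly five times rarer in coding regions than genome-wide, which is the pattern the slide attributes to selection.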

Page 24: An integrated map of genetic variation from 1,092

• Most common variants (94% of variants with frequency ≥5%) were known before the current phase of the project and had their haplotype structure mapped through earlier projects.

• Only 62% of variants in the range 0.5–5% and 13% of variants with frequencies of <0.5% had been described previously.

Page 27: An integrated map of genetic variation from 1,092

• Variants present at 10% and above across the entire sample are almost all found in all of the populations studied.

• By contrast, 17% of low-frequency variants in the range 0.5–5% were observed in a single ancestry group, and 53% of rare variants at 0.5% were observed in a single population.

Page 29: An integrated map of genetic variation from 1,092

• The derived allele frequency distribution shows substantial divergence between populations below a frequency of 40%, such that individuals from populations with substantial African ancestry carry up to three times as many low-frequency variants (0.5–5%) as those of European or East Asian origin, reflecting ancestral bottlenecks in non-African populations.

Page 30: An integrated map of genetic variation from 1,092

• However, individuals from all populations show an enrichment of rare variants (<0.5% frequency), reflecting recent explosive increases in population size and the effects of geographic differentiation.

Page 32: An integrated map of genetic variation from 1,092

• Variants present twice across the entire sample (referred to as f2 variants), typically the most recent of informative mutations, are found within the same population in 53% of cases.

• However, between-population sharing identifies recent historical connections.

Page 34: An integrated map of genetic variation from 1,092

• At the most highly conserved coding sites, 85% of non-synonymous variants and more than 90% of stop-gain and splice-disrupting variants are below 0.5% in frequency, compared with 65% of synonymous variants.

Page 35: An integrated map of genetic variation from 1,092

• Individuals typically carry more than 2,500 non-synonymous variants at conserved positions, of which 20–40 are likely to be damaging (2–5 of which are rare), and 150 loss-of-function (LOF) variants (splice-site variants, stop gains, frameshift indels), of which 10–20 are rare.

• Counting rare variants only, each individual carries 130–400 non-synonymous variants, 10–20 LOF variants, 2–5 damaging mutations, and 1–2 variants identified previously from cancer genome sequencing.

Page 37: An integrated map of genetic variation from 1,092

Bonus Track

Page 38: An integrated map of genetic variation from 1,092

• The non-synonymous to synonymous ratio among rare (<0.5%) variants is typically in the range 1–2, and among common variants in the range 0.5–1.5, suggesting that 25–50% of rare non-synonymous variants are deleterious.

• However, the segregating rare load among gene groups in KEGG pathways varies substantially.
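One way to read the step from ratios to a deleterious fraction is sketched below. It assumes that the non-synonymous to synonymous ratio among common variants approximates the ratio expected without selection, so that any excess among rare variants is attributed to deleterious alleles; this reading, the function name and the example ratios are illustrative assumptions, not the paper's stated procedure.

```python
def excess_nonsyn_fraction(ratio_common, ratio_rare):
    """Fraction of rare non-synonymous variants in excess of the common-variant baseline.

    Assumption: the common-variant N/S ratio stands in for the ratio expected
    in the absence of purifying selection.
    """
    if ratio_rare <= 0:
        raise ValueError("ratio_rare must be positive")
    return 1.0 - ratio_common / ratio_rare

# Hypothetical ratio pairs chosen from within the ranges quoted above.
for common, rare in [(0.75, 1.0), (0.75, 1.5)]:
    frac = excess_nonsyn_fraction(common, rare)
    print(f"common N/S = {common}, rare N/S = {rare}: excess fraction = {frac:.0%}")
```

With these illustrative pairs the excess comes out at 25% and 50%, the two ends of the range quoted on the slide.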
