Sample Characterization Michael A. Eberle GiaB, January 2014

140127 platinum genomes pedigree analyses


Page 1: 140127 platinum genomes pedigree analyses

Sample Characterization

Michael A. Eberle

GiaB, January 2014

Page 2: 140127 platinum genomes pedigree analyses


Pedigree including NA12878

Grandparents: 12889 12890 12891 12892

Parents: 12877 12878

Children: 12879 12880 12881 12882 12883 12884 12885 12887 12886 12888 12893

•  All 17 members sequenced to at least 50x depth (PCR-free protocol)

•  Variants are called across the pedigree using different software & technology

•  Inheritance information provides high-confidence, direct validation of variant calls


Page 3: 140127 platinum genomes pedigree analyses


Why sequence a pedigree?

[Figure: pedigree diagram showing the parental haplotypes (A C A G T A, A T C T G A, G T C G T C, G C A T T A) and the copies transmitted to the children. One transmitted copy reads A C A T T A.]

With a sufficiently large pedigree, the transmission of the parental chromosomes can be determined unambiguously.

Error: T in blue haplotype should be G
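
The check described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the talk. It assumes that the site in question has genotype G/T in both parents, and that the transmission vector (which parental haplotype each child inherited) is already known from nearby sites. The four-child transmission vector and the function name `consistent_phasings` are hypothetical.

```python
def consistent_phasings(father_gt, mother_gt, transmissions, child_gts):
    """Return every phasing of the parental alleles that explains all children.

    father_gt, mother_gt: two-character genotype strings such as "GT".
    transmissions: per child, (i, j) = index (0 or 1) of the paternal and
        maternal haplotype the child inherited, inferred from nearby sites.
    child_gts: per child, the called genotype as a two-character string.
    """
    results = []
    # A heterozygous genotype has two possible phasings, a homozygous one only one.
    for f_phase in sorted({father_gt, father_gt[::-1]}):
        for m_phase in sorted({mother_gt, mother_gt[::-1]}):
            if all(
                sorted(f_phase[i] + m_phase[j]) == sorted(gt)
                for (i, j), gt in zip(transmissions, child_gts)
            ):
                results.append((f_phase, m_phase))
    return results


# Hypothetical transmission vector for four children (not taken from the slide).
transmissions = [(0, 1), (0, 0), (1, 0), (0, 1)]

# Assumed true phasing: father G|T, mother G|T.
# The first child truly carries G (paternal) and T (maternal) but is miscalled T/T.
called = ["TT", "GG", "GT", "GT"]

trio_only = consistent_phasings("GT", "GT", transmissions[:1], called[:1])
full_pedigree = consistent_phasings("GT", "GT", transmissions, called)

print(len(trio_only))      # one phasing still explains the trio on its own
print(len(full_pedigree))  # no phasing explains all four children
```

With only the first child, a phasing exists that explains the miscall, so nothing looks wrong. Once the other children fix the phase of both parents, no phasing remains and the site is flagged as a conflict.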

Page 4: 140127 platinum genomes pedigree analyses


Why sequence a pedigree?

[Figure: the haplotype diagram from the previous slide, with one trio considered on its own.]

If only the trio were sequenced, this error would not be detected. When sequencing a trio, we can never eliminate alternative genotypes in some of the samples.

Figure annotations:

Could also be GG or GT

Either parent could also be TT
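
The ambiguity in a trio can be made concrete by enumerating which alternative genotypes would still pass a Mendelian check. The sketch below assumes the called genotypes are G/T for both parents and T/T for the child. That assignment is an assumption chosen to match the annotations above, and the helper names are hypothetical.

```python
from itertools import combinations_with_replacement

ALLELES = "GT"  # the two alleles seen at the example site


def is_mendelian(father, mother, child):
    """True if the child can get one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def alternatives(trio, role):
    """Other genotypes for one sample that keep the trio Mendelian-consistent."""
    out = []
    for alleles in combinations_with_replacement(ALLELES, 2):
        candidate = "".join(alleles)
        if candidate == trio[role]:
            continue  # skip the genotype that was actually called
        trial = dict(trio, **{role: candidate})
        if is_mendelian(trial["father"], trial["mother"], trial["child"]):
            out.append(candidate)
    return out


# Assumed calls at the example site (parents heterozygous, child called T/T).
trio = {"father": "GT", "mother": "GT", "child": "TT"}

for role in ("father", "mother", "child"):
    print(role, alternatives(trio, role))
```

Under these assumptions the child could equally be GG or GT and either parent could be TT, so the trio alone cannot confirm any of the three calls.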

Page 5: 140127 platinum genomes pedigree analyses

A large pedigree identifies most errors

[Figure: percent of sites perfectly constrained (y-axis, 0 to 100) versus number of siblings (x-axis, 1 to 11).]

•  Can identify a single error in >99.7% of the variant positions (11 sibs)

•  2 sibs allow phasing and identify errors in 25% of variant positions

•  More sibs add confidence to more variant calls

•  Trio never positively identifies the genotypes in every sample

“Perfectly constrained” means that the genotype information of any one sample could be removed and imputed from the phasing and the other samples' genotypes.
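
The slide does not say how the percentages were computed. One simple model, offered here only as an assumption, is that a site is fully checkable when all four parental haplotypes have been transmitted to at least one child. Under that model the fraction for n siblings is (1 - 2 * 0.5**n)**2, which gives exactly 25% for two siblings and a little over 99.7% for eleven. The function name is hypothetical.

```python
def fraction_all_haplotypes_transmitted(n_sibs):
    """Probability that every one of the four parental haplotypes
    is carried by at least one of n_sibs children at a given site.

    Assumes independent 50/50 transmission from each parent and
    ignores recombination within the site's neighbourhood.
    """
    # For one parent: 1 minus the chance that all children got the same haplotype.
    p_one_parent = 1 - 2 * 0.5 ** n_sibs
    # The two parents transmit independently.
    return p_one_parent ** 2


for n in range(1, 12):
    print(n, f"{100 * fraction_all_haplotypes_transmitted(n):.2f}%")
```

Agreement with the two quoted figures does not show that this is the calculation behind the plot, only that it is a plausible reading of it.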

Page 6: 140127 platinum genomes pedigree analyses

Cost to add more siblings

[Figure: percent of sites perfectly constrained (y-axis, 0 to 100) versus number of siblings (x-axis, 1 to 11). Marked points: "1 Trio of Sequencing" and "2 Trios of Sequencing / 4 sibs".]

Page 7: 140127 platinum genomes pedigree analyses


Understanding conflicts in the pedigree

Page 8: 140127 platinum genomes pedigree analyses

Somatic/cell-line deletions on chr22

[Figure: errors per 50kb (y-axis "# Errors", 0 to 300) and normalized depth (0 to 4) along chr22. Panel label: Errors in NA12878 & NA12893.]

Page 9: 140127 platinum genomes pedigree analyses

Somatic/cell-line deletions on chr22

[Figure: errors per 50kb (y-axis "# Errors", 0 to 300) and normalized depth (0 to 4) along chr22, with a 1Mb scale bar. Panel label: Errors in NA12878 & NA12893.]

None of the other children carry this deletion (though noise may indicate mosaicism).
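
The "errors per 50kb" track amounts to counting inheritance conflicts in fixed-width windows along the chromosome. A minimal sketch of that binning follows. The positions are toy values for illustration only, not data from the talk, and the function name is hypothetical.

```python
from collections import Counter


def conflicts_per_window(positions, window=50_000):
    """Count conflict positions in fixed-width windows.

    Returns a dict mapping each window's start coordinate to the
    number of conflicts that fall inside it.
    """
    counts = Counter((pos // window) * window for pos in positions)
    return dict(sorted(counts.items()))


# Toy conflict coordinates (illustrative only).
toy_positions = [10_500, 49_999, 50_000, 120_000, 125_500, 149_000]

print(conflicts_per_window(toy_positions))
```

A run of adjacent windows with elevated counts, together with reduced normalized depth, is the pattern the slide attributes to a cell-line deletion.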

Page 10: 140127 platinum genomes pedigree analyses

Read counts for the haplotypes inferred in NA12878 at location of cell line deletion (200x depth)

[Figure: distribution of allele counts (x-axis, 0 to 200) against fraction (y-axis, 0.00 to 0.10). Legend: Paternal haplotype (NA12891), Maternal haplotype (NA12892).]

•  Inferred the two haplotypes in NA12878 based on the other samples

•  Counts represent the predicted heterozygous locations

Page 11: 140127 platinum genomes pedigree analyses

Technical replicates validate de novo SNVs

[Figure: bar chart for NA12882. Y-axis: total errors / total conflicts (0 to 4000). Panel label: Results in Tech. Rep.]

•  1843 (~96%) replicate original call

•  82 (~4%) did not replicate

FPs?
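
The rounded percentages follow directly from the two counts on the slide, assuming they are the only two outcomes. A quick check:

```python
replicated = 1843      # calls reproduced in the technical replicate
not_replicated = 82    # calls that did not reproduce
total = replicated + not_replicated

pct_replicated = 100 * replicated / total
pct_not_replicated = 100 * not_replicated / total

print(total)
print(f"{pct_replicated:.1f}% replicated, {pct_not_replicated:.1f}% did not")
```

Both values round to the ~96% and ~4% quoted on the slide.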

Page 12: 140127 platinum genomes pedigree analyses


Thoughts on selecting the next samples for sequencing

•  Identify and sequence pedigrees with multiple siblings
   –  WGS every individual in the pedigree to identify haplotype transmission vectors
   –  One “high quality” family (2 parents & 4 sibs) provides a “better” reference than two lower quality trios for the same amount of sequencing
   –  Technical replicates allow alternative validation of biologically interesting calls, e.g. de novo mutations, gene conversion etc.

•  Choose one or two samples to target for long reads if sequencing-limited
   –  Sequencing both parents will provide 100% of the variants in the pedigree, though with four children only ~75% will be validated in the children
   –  Sequencing a child will guarantee that every variant has been sequenced in at least one of the parents, though it will only contain ~50% of the variants in the family

•  Quality of the DNA is important
   –  CEPH pedigree shows many cell line artifacts that are correctly genotyped but deviate from inheritance
   –  Cell line artifacts complicate the analysis