Sample Characterization
Michael A. Eberle
GiaB, January 2014
Pedigree including NA12878
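[Figure: three-generation pedigree. Grandparents: NA12889, NA12890, NA12891, NA12892. Parents: NA12877, NA12878. Children: NA12879, NA12880, NA12881, NA12882, NA12883, NA12884, NA12885, NA12887, NA12886, NA12888, NA12893.]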
• All 17 members sequenced to at least 50x depth (PCR-free protocol)
• Variants are called across the pedigree using different software and technologies
• Inheritance information provides high-confidence, direct validation of variant calls
Why sequence a pedigree?
[Figure: haplotype transmission through the pedigree, drawn as six-base haplotypes. Four haplotypes recur (ACAGTA, ATCTGA, GTCGTC, GCATTA); a single copy reads ACATTA.]
With a sufficiently large pedigree, the transmission of the parental chromosomes can be determined unambiguously.
Error: the T in the blue haplotype should be G.
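A minimal sketch of how known haplotype transmission exposes a single miscalled genotype. It assumes each child's inheritance vector (which paternal and which maternal haplotype it received) has already been determined from flanking sites. The sample names, vectors, and genotype calls are hypothetical and are not read off the slide.

```python
from itertools import product

# Parental haplotypes are indexed 0, 1 (father) and 2, 3 (mother).
# Hypothetical inheritance vectors: (paternal index, maternal index) per child.
VECTORS = {"c1": (0, 2), "c2": (0, 2), "c3": (1, 3), "c4": (0, 3)}

def implied_genotypes(haps, vectors):
    """Unphased genotypes implied by placing one allele on each parental haplotype."""
    geno = {
        "father": "".join(sorted(haps[0] + haps[1])),
        "mother": "".join(sorted(haps[2] + haps[3])),
    }
    for child, (p, m) in vectors.items():
        geno[child] = "".join(sorted(haps[p] + haps[m]))
    return geno

def consistent_assignments(observed, vectors, alleles="GT"):
    """Allele assignments to the four parental haplotypes that reproduce every call."""
    hits = []
    for haps in product(alleles, repeat=4):
        implied = implied_genotypes(haps, vectors)
        if all(implied[s] == "".join(sorted(g)) for s, g in observed.items()):
            hits.append(haps)
    return hits

def suspects(observed, vectors):
    """Samples whose removal alone makes the remaining calls consistent."""
    found = []
    for sample in observed:
        rest = {s: g for s, g in observed.items() if s != sample}
        rest_vectors = {c: v for c, v in vectors.items() if c in rest}
        if consistent_assignments(rest, rest_vectors):
            found.append(sample)
    return found

# Hypothetical calls at one site; c1 is miscalled GT (it should be GG).
calls = {"father": "GT", "mother": "GT", "c1": "GT",
         "c2": "GG", "c3": "TT", "c4": "GT"}
pedigree_ok = bool(consistent_assignments(calls, VECTORS))  # False: a conflict exists
flagged = suspects(calls, VECTORS)                          # ['c1']
```

Here c1 and c2 received the same two parental haplotypes, so their differing calls cannot both be right, and only dropping c1 restores consistency with the parents and the other siblings.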
Why sequence a pedigree?
[Figure: six-base haplotype diagrams (ACAGTA, ACATTA, ATCTGA, GTCGTC, GCATTA) illustrating the trio case.]
If only the trio were sequenced, this error would not be detected.
When sequencing a trio, we can never eliminate alternative genotypes in some of the samples.
Annotations on the figure:
Could also be GG or GT
Either parent could also be TT
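The trio ambiguity can be sketched with a plain Mendelian-consistency test: for one sample at a time, list every genotype that remains compatible with the other two calls. The trio below is hypothetical, and the function names are illustrative only.

```python
from itertools import combinations_with_replacement

def mendelian_consistent(father, mother, child):
    """True if the child can take one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)

def alternatives(trio, sample, alleles="GT"):
    """Genotypes for `sample` that stay consistent with the other two calls."""
    options = []
    for pair in combinations_with_replacement(alleles, 2):
        candidate = dict(trio, **{sample: "".join(pair)})
        if mendelian_consistent(candidate["father"],
                                candidate["mother"],
                                candidate["child"]):
            options.append("".join(pair))
    return options

# Hypothetical trio in which every call is heterozygous.
trio = {"father": "GT", "mother": "GT", "child": "GT"}
child_options = alternatives(trio, "child")    # ['GG', 'GT', 'TT']
father_options = alternatives(trio, "father")  # ['GG', 'GT', 'TT']
```

Every alternative passes the check, so a wrong call in any one of the three samples would go unnoticed.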
A large pedigree identifies most errors
[Figure: percent of sites perfectly constrained (y-axis, 0 to 100) versus number of siblings (x-axis, 1 to 11).]
• A single error can be identified in >99.7% of the variant positions (11 sibs)
• 2 sibs allow phasing and identify errors in 25% of variant positions
• More sibs add confidence to more variant calls
• A trio never positively identifies the genotypes in every sample
“Perfectly constrained” means that the genotype information of any one sample could be removed and then imputed from the phasing and the other samples' genotypes.
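One way to read the "perfectly constrained" definition is as a leave-one-out test at a single site: hide one sample's call, enumerate the alleles the parental haplotypes could carry, and ask whether the hidden genotype is forced. The sketch below assumes biallelic sites, known inheritance vectors for every child, and genotypes written as sorted two-letter strings. The family and calls are hypothetical, and this is not claimed to be the method behind the percentages quoted above.

```python
from itertools import product

def implied_genotypes(haps, vectors):
    """Genotypes implied by alleles on haplotypes 0, 1 (father) and 2, 3 (mother)."""
    pairs = {"father": (0, 1), "mother": (2, 3), **vectors}
    return {s: "".join(sorted(haps[i] + haps[j])) for s, (i, j) in pairs.items()}

def imputed_genotypes(observed, vectors, sample, alleles="GT"):
    """Genotypes `sample` could have, given every other sample's call."""
    others = {s: g for s, g in observed.items() if s != sample}
    options = set()
    for haps in product(alleles, repeat=4):
        implied = implied_genotypes(haps, vectors)
        if all(implied[s] == g for s, g in others.items()):
            options.add(implied[sample])
    return options

def perfectly_constrained(observed, vectors):
    """True if each call is the only genotype the remaining samples allow."""
    return all(
        imputed_genotypes(observed, vectors, s) == {g} for s, g in observed.items()
    )

# Hypothetical family with four siblings (genotypes as sorted strings).
family_vectors = {"c1": (0, 2), "c2": (0, 2), "c3": (1, 3), "c4": (0, 3)}
family_calls = {"father": "GT", "mother": "GT",
                "c1": "GG", "c2": "GG", "c3": "TT", "c4": "GT"}
family_result = perfectly_constrained(family_calls, family_vectors)  # True

# The same site seen through a single trio.
trio_result = perfectly_constrained(
    {"father": "GT", "mother": "GT", "c1": "GG"}, {"c1": (0, 2)}
)  # False
```

In the trio, each parent's untransmitted haplotype is observed only in that parent, so hiding the parent always leaves more than one possible genotype.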
Cost to add more siblings
[Figure: % sites perfectly constrained (y-axis, 0 to 100) versus number of siblings (x-axis, 1 to 11). Annotations: "1 Trio of Sequencing"; "2 Trios of Sequencing / 4 sibs".]
Understanding conflicts in the pedigree
Somatic/cell-line deletions on chr22
[Figure: number of errors per 50 kb (y-axis, 0 to 300; panels labeled "Errors in NA12878 & NA12893") and normalized depth (y-axis, 0 to 4) along chr22. Scale bar: 1 Mb.]
None of the other children carry this deletion (though noise may indicate mosaicism).
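The "errors per 50kb" track can be approximated by binning the coordinates of inheritance conflicts into fixed windows and flagging windows with unusually many conflicts. This is a minimal sketch; the coordinates and the threshold are hypothetical.

```python
from collections import Counter

WINDOW = 50_000  # window size in bases, matching the 50 kb bins on the plot

def errors_per_window(positions, window=WINDOW):
    """Map each window start coordinate to the number of conflicts inside it."""
    counts = Counter((pos // window) * window for pos in positions)
    return dict(sorted(counts.items()))

def dense_windows(counts, threshold):
    """Window starts whose conflict count reaches the threshold."""
    return [start for start, n in counts.items() if n >= threshold]

# Hypothetical conflict coordinates on one chromosome.
conflicts = [10_500, 49_999, 50_000, 120_000, 120_010, 120_020, 149_999]
counts = errors_per_window(conflicts)    # {0: 2, 50000: 1, 100000: 4}
hot = dense_windows(counts, threshold=3)  # [100000]
```

A run of adjacent dense windows that coincides with a drop in normalized depth is the pattern the slide attributes to a somatic or cell-line deletion.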
Read counts for the haplotypes inferred in NA12878 at the location of the cell-line deletion (200x depth)
[Figure: distribution of allele counts (x-axis, 0 to 200; y-axis, fraction, 0.00 to 0.10) for the paternal haplotype (NA12891) and the maternal haplotype (NA12892).]
• Inferred the two haplotypes in NA12878 based on the other samples
• Counts represent the predicted heterozygous locations
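A sketch of how per-haplotype read counts at the predicted heterozygous sites could be summarized. The counts are hypothetical. The conversion to a fraction of cells carrying the deletion assumes a simple model that is not stated in the slides: a fraction f of cells has lost one haplotype entirely and the rest are unaffected.

```python
def haplotype_fraction(paternal_counts, maternal_counts):
    """Share of reads at heterozygous sites that support the maternal haplotype."""
    pat = sum(paternal_counts)
    mat = sum(maternal_counts)
    return mat / (pat + mat)

def deleted_cell_fraction(minor_fraction):
    """Fraction of cells missing the under-represented haplotype.

    Assumed model: reads from that haplotype scale with (1 - f) and reads
    from the other with 1, so minor_fraction = (1 - f) / (2 - f).
    """
    return (1 - 2 * minor_fraction) / (1 - minor_fraction)

# Hypothetical read counts at three heterozygous sites inside the deletion.
paternal = [100, 110, 90]
maternal = [50, 55, 45]
m = haplotype_fraction(paternal, maternal)  # one third of reads are maternal
f = deleted_cell_fraction(m)                # about half of the cells affected
```

With no deletion both haplotypes would receive about half of the reads and the estimate would be near zero.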
Technical replicates validate de novo SNVs
[Figure: bar charts for NA12882 with axes labeled "Total Errors" and "Total Conflicts" (0 to 4000), showing the results in the technical replicate. Annotation: "FPs?"]
• 1843 (~96%) replicate the original call
• 82 (~4%) did not replicate
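The rounded percentages follow from the two counts on the slide. A quick check, assuming those two counts make up the whole set of calls tested in the replicate:

```python
replicated = 1843    # calls that replicate the original call
not_replicated = 82  # calls that did not replicate
total = replicated + not_replicated

def share(count, total):
    """Percentage of the total, rounded to one decimal place."""
    return round(100 * count / total, 1)

replicated_pct = share(replicated, total)          # 95.7
not_replicated_pct = share(not_replicated, total)  # 4.3
```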
Thoughts on selecting the next samples for sequencing
• Identify and sequence pedigrees with multiple siblings
  – WGS every individual in the pedigree to identify haplotype transmission vectors
  – One “high quality” family (2 parents & 4 sibs) provides a “better” reference than two lower quality trios for the same amount of sequencing
  – Technical replicates allow alternative validation of biologically interesting calls, e.g. de novo mutations, gene conversion, etc.
• Choose one or two samples to target for long reads if sequencing-limited
  – Sequencing both parents will provide 100% of the variants in the pedigree, though with four children only ~75% will be validated in the children
  – Sequencing a child will guarantee that every variant has been sequenced in at least one of the parents, though it will only contain ~50% of the variants in the family
• Quality of the DNA is important
  – CEPH pedigree shows many cell line artifacts that are correctly genotyped but deviate from inheritance
  – Cell line artifacts complicate the analysis