Why profile STRs?
Melissa Gymrek, 4/21/12, lobSTR
Intro. Algorithmic details Comparison Validation
Medical genetics
• Huntington disease: (CAG)n in huntingtin
• Fragile X syndrome: (CGG)n in the 5’ region of FMR1
• Synpolydactyly: (GCA/GCG)n in HOXD13
Forensics
Cell lineages
Despite a multitude of applications, STRs are not routinely profiled in whole-genome sequencing
Challenges in profiling STRs from short reads
1. Only reads entirely spanning an STR are informative
[Figure: example reads, labeled non-informative and informative]
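The first challenge can be written as a coordinate test. This is a minimal sketch, not lobSTR's code: the function name, the half-open 0-based coordinates, and the requirement of at least one flanking base on each side are all assumptions made for illustration.

```python
def spans_str(read_start: int, read_end: int, str_start: int, str_end: int) -> bool:
    """Return True if the read covers the whole STR plus at least one base on each side.

    Coordinates are half-open [start, end) on the reference (an assumption).
    """
    return read_start < str_start and read_end > str_end


# Hypothetical STR occupying reference positions [100, 140)
print(spans_str(80, 160, 100, 140))   # read reaches both flanks
print(spans_str(80, 130, 100, 140))   # read ends inside the repeat
```

A read that stops inside the repeat only gives a lower bound on the allele length, which is why it is treated as non-informative here.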
2. Non-reference alleles present as large indels
3. PCR stutter noise complicates calling
lobSTR provides an end-to-end solution for STR profiling
• Takes FASTA/FASTQ/BAM input
• Reports read alignments and STR alleles
• Supports multi-threading
Our definition of STRs
• Motif size 2 through 6, e.g. (AT)n, (CAG)n, (AGAT)n, (AGAGT)n, (AAAAAT)n
• At least 25 bp long in the reference sequence
• Allow for imperfect repeat sequences (Tandem Repeats Finder score > 50), e.g. TATATACATATATATATATTATA
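The three criteria above can be read as a filter over candidate repeats. The sketch below is illustrative only: the `RepeatCandidate` record and its field names are hypothetical, and the score is assumed to come from an earlier Tandem Repeats Finder run.

```python
from dataclasses import dataclass


@dataclass
class RepeatCandidate:
    motif: str         # repeat unit, e.g. "CAG"
    ref_length: int    # length of the repeat in the reference, in bp
    trf_score: float   # Tandem Repeats Finder score (computed elsewhere)


def is_str(c: RepeatCandidate) -> bool:
    """Apply the slide's definition: motif 2-6 bp, at least 25 bp long, score > 50."""
    return 2 <= len(c.motif) <= 6 and c.ref_length >= 25 and c.trf_score > 50


print(is_str(RepeatCandidate("CAG", 30, 60)))   # meets all three criteria
print(is_str(RepeatCandidate("AT", 20, 60)))    # too short in the reference
```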
Sensing
Aim: find informative reads and characterize the STR
Entropy score successfully detects STRs
Sj: sequence j; Σ: symbol alphabet (dinucleotides); fi: frequency of symbol i
[Figure values: 98.3%, 99.4%]
Repetitive example: AGCATATATATATATATATATATG
• Nucleotides: A 0.46, C 0.04, G 0.08, T 0.42; E = 1.54
• Dinucleotides: AG 0.04, GC 0.04, CA 0.04, AT 0.43, TA 0.39, TG 0.04, …; E(S) = 1.8
Non-repetitive example: CAGCTATTCGGGACTGAGCGGTAT
• Nucleotides: A 0.21, C 0.21, G 0.33, T 0.25; E = 1.97
• Dinucleotides: CA 0.04, AG 0.08, GC 0.08, TA 0.08, TT 0.04, TC 0.04, …; E(S) = 3.71
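The slide's numbers are consistent with a Shannon entropy in bits over overlapping symbols, E(S) = −Σ fi log2 fi. That formula is an assumption, since the original equation did not survive extraction. A minimal sketch under that assumption:

```python
import math
from collections import Counter


def entropy(seq: str, k: int = 2) -> float:
    """Shannon entropy (bits) of the overlapping k-mers in seq."""
    symbols = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    total = len(symbols)
    counts = Counter(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


repetitive = "AGCATATATATATATATATATATG"
non_repetitive = "CAGCTATTCGGGACTGAGCGGTAT"
for name, s in [("repetitive", repetitive), ("non-repetitive", non_repetitive)]:
    # k=1 scores single nucleotides, k=2 scores dinucleotides
    print(name, round(entropy(s, 1), 2), round(entropy(s, 2), 2))
```

With dinucleotides the two example sequences separate clearly (roughly 1.8 versus 3.7), while the single-nucleotide scores are much closer together. This is presumably why the dinucleotide alphabet is used.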
Entropy score partitions reads
TTTTGTGTGAAACCATGCTCGAGTGTGTGTGTGTGTGTATGTGTGTGTGTGTGTGTGTGTGGTGTCTTAAGACTGAAATATCTAAGATTAACTTGG
Left flank | STR region | Right flank
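One way to picture the partition is to score sliding windows of the read and flag the low-entropy ones as the STR region. In this sketch the window size, step, and threshold are hypothetical values chosen for illustration, not lobSTR's parameters.

```python
import math
from collections import Counter


def dinuc_entropy(seq: str) -> float:
    """Shannon entropy (bits) of overlapping dinucleotides."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    return -sum((c / total) * math.log2(c / total) for c in Counter(pairs).values())


def flag_windows(read: str, size: int = 24, step: int = 12, threshold: float = 2.2):
    """Return (start, entropy, is_repetitive) for each sliding window of the read."""
    out = []
    for start in range(0, len(read) - size + 1, step):
        e = dinuc_entropy(read[start:start + size])
        out.append((start, e, e < threshold))
    return out


read = ("TTTTGTGTGAAACCATGCTCGAGTGTGTGTGTGTGTGTATGTGTGTGTGTGTGTGTGTGTG"
        "GTGTCTTAAGACTGAAATATCTAAGATTAACTTGG")
windows = flag_windows(read)
for start, e, repetitive in windows:
    print(start, round(e, 2), "STR" if repetitive else "flank")
```

Windows at the two ends of this read score high and are treated as flanks. Windows inside the GT run score low and mark the STR region.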
Repetitive sequences show distinct spectral signatures
1. Convert sequence to matrix (example in figure: ACCGT → M)
2. Compute power spectra
[Figure: power spectra of 5-mer, 4-mer, 3-mer, and random sequences]
An STR of period k generates a strong signal in bin nc/k (c = 0, ±1, ±2, …)
The highest energy period gives the repeat unit length
• The period k whose first harmonic has the highest energy gives the repeat period
• The most frequent k-mer gives the repeat unit
TTTTGTGTGAAACCATGCTCGAGTGTGTGTGTGTGTGTATGTGTGTGTGTGTGTGTGTGTGGTGTCTTAAGACTGAAATATCTAAGATTAACTTGG
STR region: GTGTGTGTGTGTGTATGTGTGTGTGTGTGTGTGTG; repeat unit: GT
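The two bullets can be sketched as follows, assuming the matrix is a 4 × n base-indicator matrix and that the energy at frequency 1/k is summed over its four rows. This naive version is only an illustration. It evaluates the transform directly at 1/k instead of using an FFT, and it can be fooled by motifs with internal sub-periodicity (for example AGAT, whose A positions alone have period 2), which a real implementation would need to handle.

```python
import cmath
from collections import Counter


def period_energy(seq: str, k: int) -> float:
    """Energy of the first harmonic (frequency 1/k), summed over the A/C/G/T indicator rows."""
    total = 0.0
    for base in "ACGT":
        coeff = sum(cmath.exp(-2j * cmath.pi * t / k)
                    for t, ch in enumerate(seq) if ch == base)
        total += abs(coeff) ** 2
    return total / len(seq)


def detect_repeat(seq: str, periods=range(2, 7)):
    """Pick the period with the strongest first harmonic, then the most frequent k-mer."""
    k = max(periods, key=lambda p: period_energy(seq, p))
    kmers = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return k, kmers.most_common(1)[0][0]


print(detect_repeat("GT" * 15))
print(detect_repeat("CAG" * 10))
```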
Alignment
Aim: map STRs to the genome
We need to increase the specificity!
• Align only to known STR regions (Tandem Repeats Finder)
• Separate Burrows-Wheeler transforms (BWTs) store the left and right flanking regions of all genomic instances of each repeat
[Figure: reads routed by repeat unit, e.g. GTGTGTGTGGTGTGTGTGGTGTGTGTGGTGTGTGTG to the GT BWT and ATTATTATTATTATTATTATTATTATTATTATTATT to the ATT BWT]
How to avoid gapped alignment?
Divide and conquer approach:
TTTTGTGTGAAACCATGCTCGAGTGTGTGTGTGTGTGTATGTGTGTGTGTGTGTGTGTGTGGTGTCTTAAGACTGAAATATCTAAGATTAACTTGG
• Hits for one flank (GT BWT): Locus 145 (-), Locus 172 (+)
• Hits for the other flank (GT BWT): Locus 172 (+), Locus 25,678 (+)
• Shared hit: Locus 172
A unique match anchors flanking regions to the genome and determines the repeat unit difference from reference
[Figure: 100bp vs. 94bp, difference +6 bp]
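The anchoring step amounts to intersecting the candidate loci of the two flanks and then comparing spans. In this sketch the function names are hypothetical. Which flank produced which hit list is an assumption, as is reading 100bp as the span in the read and 94bp as the span in the reference.

```python
def anchor_read(left_hits, right_hits):
    """Return the single (locus, strand) hit shared by both flanks, or None if not unique."""
    shared = set(left_hits) & set(right_hits)
    return next(iter(shared)) if len(shared) == 1 else None


def repeat_unit_difference(read_span_bp: int, ref_span_bp: int, unit_len: int):
    """Difference from the reference, in bp and in repeat units."""
    diff_bp = read_span_bp - ref_span_bp
    return diff_bp, diff_bp / unit_len


# Hit lists taken from the slide; the left/right assignment is assumed
left_hits = [(145, "-"), (172, "+")]
right_hits = [(172, "+"), (25678, "+")]

print(anchor_read(left_hits, right_hits))
print(repeat_unit_difference(100, 94, len("GT")))
```

Because each flank is aligned without gaps and the allele is inferred from the distance between the anchored flanks, no gapped alignment through the repeat is needed.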
Allelotyping
Aim: determine the STR alleles
How to remove stutter noise?
• Use supervised learning to model PCR stutter noise
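• Train on male sex chromosome STRs, which are hemizygous, so reads that deviate from the single true allele can be attributed to noise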
Determine the most likely genotype
1. Enumerate possible allelotypes at each locus given observed reads, e.g. R = 13,13,13,14,14,15 x GT: (13,13) (14,14) (15,15) (13,14) (13,15) (14,15)
2. Calculate log likelihood for each allelotype (A,B) using the stutter model
3. Return maximum likelihood allelotype: (13,14)
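The three steps can be sketched with a toy stutter model. The step probabilities below are hypothetical placeholders, not the trained lobSTR model, and each read is assumed to come from either allele with equal probability.

```python
import math
from itertools import combinations_with_replacement

# Hypothetical stutter model: probability that a read is off by |d| repeat units
# from its true allele (per direction for d > 0). A trained model would replace this.
STEP_PROB = {0: 0.9, 1: 0.045, 2: 0.005}
FLOOR = 1e-6  # probability assigned to larger deviations


def read_prob(observed: int, allele: int) -> float:
    return STEP_PROB.get(abs(observed - allele), FLOOR)


def log_likelihood(reads, a: int, b: int) -> float:
    """Step 2: each read comes from allele a or b with probability 1/2."""
    return sum(math.log(0.5 * read_prob(r, a) + 0.5 * read_prob(r, b)) for r in reads)


def call_allelotype(reads):
    """Steps 1 and 3: enumerate allelotypes over observed alleles, return the best."""
    candidates = combinations_with_replacement(sorted(set(reads)), 2)
    return max(candidates, key=lambda ab: log_likelihood(reads, *ab))


reads = [13, 13, 13, 14, 14, 15]
print(call_allelotype(reads))
```

Under these placeholder probabilities the call for the slide's reads is (13,14). The single read of 15 is explained as stutter from the 14 allele.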
lobSTR outperforms mainstream short read aligners at STR loci
[Figure: bar charts comparing lobSTR, Bowtie, BWA, and NovoAlign; panels: speed (reads/s, axis 0 to 60,000) and # var reads]
# var reads as listed in the figure (order lobSTR, Bowtie, BWA, NovoAlign): 483, 0, 222, 293
lobSTR can detect pathogenic repeat expansions
Simulating a heterozygous carrier of HOXD13 pathology (7 trinucleotide expansion):
BWA fails to detect the pathogenic allele
lobSTR detects both alleles
lobSTR shows high concordance on biological replicates
Samples: blood gDNA and saliva gDNA
Genotype consistency: how many loci give identical genotype calls
e.g. (10,10), (10,11) → 0; (11,11), (11,11) → 1
Allele consistency: how many alleles agree between samples
e.g. (10,10), (10,11) → 0.5; (11,11), (11,11) → 1
[Figures: consistency by period (2, 3, 4, 5, 6, all)]
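The two per-locus consistency scores can be written out directly from the slide's examples. This sketch assumes diploid calls given as pairs of repeat counts; the function names are hypothetical.

```python
from collections import Counter


def genotype_consistency(g1, g2) -> int:
    """1 if both samples give the same unordered genotype, else 0."""
    return 1 if sorted(g1) == sorted(g2) else 0


def allele_consistency(g1, g2) -> float:
    """Fraction of the two alleles shared between samples (multiset overlap / 2)."""
    shared = Counter(g1) & Counter(g2)
    return sum(shared.values()) / 2


for a, b in [((10, 10), (10, 11)), ((11, 11), (11, 11))]:
    print(a, b, genotype_consistency(a, b), allele_consistency(a, b))
```

Averaging either score over all loci called in both samples would give the per-period values plotted in the figures.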
Validation with DNA electrophoresis shows high accuracy
[Figure: gDNA profiled by 3 runs of 101PE Illumina GAIIx with lobSTR, and by a genetic genealogy service]
14 CODIS markers:
• 8 correct
• 4 partially covered (< 3x)
• 1 incorrect
• 1 not covered
[Figure: gDNA from HGDP samples sequenced on Illumina GAIIx (109bp PE reads, 1-2x; 77bp PE reads, 5-7x), called with lobSTR, and compared with the HGDP-CEPH panel (~200 STR calls)]
At 3x coverage, lobSTR correctly returned 75% of genotype calls and 85% of allele calls
Genome-wide STR profiling of 126x genome confirms known STR mutation trends
• 6.1 million reads aligned to 180,000 STR loci
• 55% of called loci ≥20x had at least one non-reference allele
STR variation decreases with unit length
[Figure: genotype classes: Hom. ref, Het. ref/non-ref, Hom. non-ref, Het. non-ref]
Longer STR regions show increased variation
lobSTR reports different variations for trinucleotide STRs in introns vs. exons
• Introns are 3 times as likely to have 1 non-ref allele
• Introns are 5 times as likely to have 2 non-ref alleles
• 0 frameshift mutations detected in exons vs. 1.9% in introns
Conclusion
• lobSTR is an end-to-end solution for STR profiling in personal genomes
• It outperforms mainstream aligners at STRs
• Validation by:
- Consistency in biological replicates
- Mendelian inheritance in trios
- Capillary platform
- Biological reasoning
• Thousands of STR variations in a single individual: an enormous unexplored source of variation