lobSTR: A Short Tandem Repeat Profiler for Personal Genomes

Melissa Gymrek, RECOMB 2012

lobSTR: a novel pipeline for short tandem repeats profiling in personal genomes


Why profile STRs?

Melissa Gymrek 4/21/12 lobSTR

Medical Genetics

• Huntington disease: (CAG)n in huntingtin
• Synpolydactyly: (GCA/GCG)n in HOXD13
• Fragile X syndrome: (CGG)n in FMR1 (5')

Intro. Algorithmic details Comparison Validation

Why profile STRs?

Forensics

Why profile STRs?

Despite a multitude of applications, STRs are not routinely profiled in whole-genome sequencing

Cell lineages

Challenges in profiling STRs from short reads

1. Only reads entirely spanning an STR are informative

Non-informative reads:

Informative reads:
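To make the notion of an informative read concrete, here is a minimal sketch, assuming 0-based half-open reference coordinates. The function name `spans_str` and the `min_flank` parameter are hypothetical and are not part of lobSTR.

```python
def spans_str(read_start, read_end, str_start, str_end, min_flank=1):
    """Return True if the read covers the whole STR plus flanking bases.

    Coordinates are 0-based, half-open reference intervals (an assumption
    of this sketch). min_flank is a hypothetical parameter: the number of
    non-repeat bases required on each side of the STR.
    """
    left_flank = str_start - read_start
    right_flank = read_end - str_end
    return left_flank >= min_flank and right_flank >= min_flank


# A 40 bp STR at reference positions 1000-1040.
print(spans_str(980, 1080, 1000, 1040))   # read covers both flanks
print(spans_str(1010, 1110, 1000, 1040))  # read starts inside the STR
```

Only the first read would count as informative here, because the second one does not reveal where the repeat tract begins.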


Challenges in profiling STRs from short reads

2. Non-reference alleles present as large indels

3. PCR stutter noise complicates calling


lobSTR provides an end-to-end solution for STR-profiling

• Takes FASTA/FASTQ/BAM

• Reports read alignments and STR alleles

• Supports multi-threading


Our definition of STRs

• Motif size 2 through 6, e.g. (AT)n, (CAG)n, (AGAT)n, (AGAGT)n, (AAAAAT)n

• At least 25 bp long in the reference sequence

• Allow for imperfect repeat sequences (Tandem Repeats Finder score > 50), e.g. TATATACATATATATATATTATA

Sensing

Aim: Find informative reads and characterize STR

Entropy score successfully detects STRs

$E(S_j) = -\sum_{i \in \Sigma} f_i \log_2 f_i$

S_j: sequence j
Σ: symbol alphabet (dinucleotides)
f_i: frequency of symbol i

98.3% 99.4%

Sequence: AGCATATATATATATATATATATG

Dinucleotides:
i    f_i
AG   0.04
GC   0.04
CA   0.04
AT   0.43
TA   0.39
TG   0.04
…    …
E(S) = 1.8

Single nucleotides:
i    f_i
A    0.46
C    0.04
G    0.08
T    0.42
E = 1.54

Sequence: CAGCTATTCGGGACTGAGCGGTAT

Single nucleotides:
i    f_i
A    0.21
C    0.21
G    0.33
T    0.25
E = 1.97

Dinucleotides:
i    f_i
CA   0.04
AG   0.08
GC   0.08
TA   0.08
TT   0.04
TC   0.04
…    …
E(S) = 3.71
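The entropy values on this slide can be reproduced with a short sketch, assuming the score is the Shannon entropy (in bits) of overlapping k-mer frequencies. The function name `kmer_entropy` is hypothetical.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=2):
    """Shannon entropy (bits) of the overlapping k-mer frequencies in seq.

    Assumption of this sketch: k-mers are counted with overlap, which is
    consistent with the frequencies listed on the slide.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


repetitive = "AGCATATATATATATATATATATG"
non_repetitive = "CAGCTATTCGGGACTGAGCGGTAT"
print(kmer_entropy(repetitive), kmer_entropy(non_repetitive))
```

Under this assumption the dinucleotide entropies come out close to the 1.8 and 3.71 shown above, while the single-nucleotide entropies (`k=1`) are much closer to each other, which suggests why dinucleotides are the more useful alphabet here.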

Entropy score partitions reads

TTTTGTGTGAAACCATGCTCGAGTGTGTGTGTGTGTGTATGTGTGTGTGTGTGTGTGTGTGGTGTCTTAAGACTGAAATATCTAAGATTAACTTGG

Left flank | STR region | Right flank
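One way to picture the partition is a windowed entropy scan. The sketch below is illustrative only: the window size, the non-overlapping windows, and the threshold of 2.2 bits are assumptions chosen for the example, not lobSTR's actual parameters, and the test read is synthetic.

```python
import math
from collections import Counter


def dinucleotide_entropy(seq):
    """Shannon entropy (bits) of overlapping dinucleotides in seq."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(pairs).values())


def flag_low_entropy_windows(read, window=16, threshold=2.2):
    """Split the read into consecutive windows; True marks a likely STR window.

    window and threshold are hypothetical values for illustration.
    """
    flags = []
    for start in range(0, len(read) - window + 1, window):
        chunk = read[start:start + window]
        flags.append(dinucleotide_entropy(chunk) < threshold)
    return flags


# Synthetic read: 16 bp flank, 32 bp GT repeat, 16 bp flank.
read = "ACGTTGCATCAGGCTA" + "GT" * 16 + "TCAGGATCCGTAACGT"
print(flag_low_entropy_windows(read))
```

A read whose flags follow the pattern flank, repeat, flank would be a candidate informative read in this simplified view.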


Repetitive sequences show distinct spectral signatures

1. Convert sequence to matrix (example sequence: ACCGT, matrix M)
2. Compute power spectra

Plot legend: 5mer, 4mer, 3mer, random

An STR of period k generates a strong signal in bin nc/k (c = 0, ±1, ±2, …)
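The two steps can be sketched in plain Python. This assumes the matrix is a 4 × n base-indicator matrix (one row each for A, C, G, T) and that n is the sequence length; the slide's matrix did not survive extraction, so that encoding is an assumption. A direct O(n²) DFT is used to keep the example dependency-free, and the function names are hypothetical.

```python
import cmath


def power_spectrum(seq):
    """Sum of squared DFT magnitudes over the four base-indicator rows."""
    n = len(seq)
    spectrum = [0.0] * n
    for base in "ACGT":
        # One row of the assumed 4 x n indicator matrix.
        row = [1.0 if ch == base else 0.0 for ch in seq]
        for f in range(n):
            coeff = sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                        for t, x in enumerate(row))
            spectrum[f] += abs(coeff) ** 2
    return spectrum


def strongest_bin(seq):
    """Index of the strongest non-DC bin in the first half of the spectrum."""
    spec = power_spectrum(seq)
    half = len(seq) // 2
    return max(range(1, half + 1), key=lambda f: spec[f])


for seq in ("GT" * 20, "ATT" * 12):
    n, f = len(seq), strongest_bin(seq)
    print(n, f, n // f)  # length, strongest bin, implied period
```

For a perfect repeat of period k the strongest bin is n/k, matching the c = 1 case of the statement above.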


The highest energy period gives the repeat unit length

• The period k whose first harmonic has the highest energy gives the repeat period

• The most frequent k-mer gives the repeat unit

TTTTGTGTGAAACCATGCTCGAGTGTGTGTGTGTGTGTATGTGTGTGTGTGTGTGTGTGTGGTGTCTTAAGACTGAAATATCTAAGATTAACTTGG

STR region: GTGTGTGTGTGTGTATGTGTGTGTGTGTGTGTGTG

Repeat unit: GT
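Once the period is known, picking the repeat unit can be sketched as a k-mer count. Treating rotations such as GT and TG as the same unit is an assumption of this sketch, added to avoid ties between them; `canonical_rotation` and `repeat_unit` are hypothetical names.

```python
from collections import Counter


def canonical_rotation(kmer):
    """Lexicographically smallest rotation, so GT and TG count as one unit."""
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))


def repeat_unit(str_region, period):
    """Most frequent k-mer of length `period`, up to rotation."""
    counts = Counter(
        canonical_rotation(str_region[i:i + period])
        for i in range(len(str_region) - period + 1)
    )
    return counts.most_common(1)[0][0]


print(repeat_unit("GTGTGTGTGTGTGTATGTGTGTGTGTGTGTGTGTG", 2))  # GT
```

The single AT interruption in the region does not change the call, since imperfect repeats only contribute a few minority k-mers.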


Alignment

Aim: Map STRs to the genome

We need to increase the specificity!

• Align only to known STR regions (Tandem Repeats Finder)

• Separate Burrows-Wheeler transforms store the left and right flanking regions of all genomic instances of each repeat

GT BWT: GTGTGTGTGGTGTGTGTGGTGTGTGTGGTGTGTGTG

ATT BWT: ATTATTATTATTATTATTATTATTATTATTATTATT

How to avoid gapped alignment?

TTTTGTGTGAAACCATGCTCGAGTGTGTGTGTGTGTGTATGTGTGTGTGTGTGTGTGTGTGGTGTCTTAAGACTGAAATATCTAAGATTAACTTGG

Divide and conquer approach: each flank is looked up in the GT BWT

Candidate loci per flank:
Locus 145 (-), Locus 172 (+)
Locus 172 (+), Locus 25,678 (+)

A unique match (Locus 172) anchors the flanking regions to the genome and determines the repeat unit difference from the reference

100 bp vs. 94 bp: +6 bp
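The anchoring step can be sketched as a set intersection followed by a subtraction. This assumes the first pair of loci belongs to the left flank and the second to the right flank, and that 100 bp is the span observed in the read and 94 bp the span in the reference; the function names are hypothetical.

```python
def anchor_locus(left_hits, right_hits):
    """Return the single (locus, strand) shared by both flanks, else None."""
    shared = set(left_hits) & set(right_hits)
    return next(iter(shared)) if len(shared) == 1 else None


def repeat_difference(read_len, ref_len, period):
    """Length difference from the reference in bp and in repeat units."""
    diff_bp = read_len - ref_len
    return diff_bp, diff_bp / period


# Values taken from the slide; the left/right assignment is an assumption.
left_hits = [(145, "-"), (172, "+")]
right_hits = [(172, "+"), (25678, "+")]
print(anchor_locus(left_hits, right_hits))  # (172, '+')
print(repeat_difference(100, 94, 2))        # (6, 3.0)
```

Because only exact flank matches are needed, no gapped alignment across the repeat itself is required in this scheme.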


Allelotyping

Aim: determine the STR alleles

How to remove stutter noise?

• Use supervised learning to model PCR stutter noise

• Train on male sex chromosome STRs


Determine the most likely genotype

1. Enumerate possible allelotypes at each locus given the observed reads, e.g. R = 13, 13, 13, 14, 14, 15 × GT:

(13,13) (14,14) (15,15) (13,14) (13,15) (14,15)

2. Calculate the log likelihood for each allelotype (A,B) using the stutter model

3. Return the maximum likelihood allelotype: (13,14)
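The three steps can be sketched with a toy stutter model. The model below (a read matches the true allele with probability 0.9, is off by one unit with probability 0.05 per direction, and gets a small floor otherwise) is purely hypothetical and is not the trained lobSTR model; all function names are also assumptions.

```python
import math
from itertools import combinations_with_replacement


def read_prob(observed, allele, stutter=0.1, floor=1e-3):
    """Toy stutter model: P(observed repeat count | true allele).

    stutter and floor are hypothetical values; the weights are not
    normalised, which is acceptable for ranking in this illustration.
    """
    if observed == allele:
        return 1.0 - stutter
    if abs(observed - allele) == 1:
        return stutter / 2
    return floor


def log_likelihood(reads, genotype):
    """Log likelihood of the reads, each drawn from allele A or B equally."""
    a, b = genotype
    return sum(math.log(0.5 * read_prob(r, a) + 0.5 * read_prob(r, b))
               for r in reads)


def call_allelotype(reads):
    """Enumerate allelotypes over observed alleles and return the best one."""
    candidates = combinations_with_replacement(sorted(set(reads)), 2)
    return max(candidates, key=lambda g: log_likelihood(reads, g))


print(call_allelotype([13, 13, 13, 14, 14, 15]))  # (13, 14)
```

With these toy parameters the single 15-repeat read is explained as stutter from the 14 allele, so (13,14) scores higher than (13,15).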


lobSTR outperforms mainstream short read aligners at STR loci

Figure: bar charts comparing lobSTR, Bowtie, BWA, and NovoAlign. Axis: reads/s (0 to 60,000). Series: speed, # var reads. Bar labels: 483, 0, 222, 293.

lobSTR can detect pathogenic repeat expansions

Simulating a heterozygous carrier of HOXD13 pathology (7 trinucleotide expansion):

BWA fails to detect the pathogenic allele

lobSTR detects both alleles


lobSTR shows high concordance on biological replicates

Blood gDNA

Saliva gDNA

Genotype consistency: how many loci give identical genotype calls, e.g.
(10,10), (10,11) → 0
(11,11), (11,11) → 1

Plot x-axis: Period (2, 3, 4, 5, 6, all)

lobSTR shows high concordance on biological replicates

Blood gDNA

Saliva gDNA

Allele consistency: how many alleles agree between samples, e.g.
(10,10), (10,11) → 0.5
(11,11), (11,11) → 1
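Both replicate metrics can be written down in a few lines. This sketch assumes genotypes are unordered pairs and that allele consistency is the multiset overlap divided by two, which reproduces the examples on the slides; the function names are hypothetical.

```python
from collections import Counter


def genotype_consistency(g1, g2):
    """1 if both unordered genotypes are identical, else 0."""
    return int(sorted(g1) == sorted(g2))


def allele_consistency(g1, g2):
    """Fraction of the two alleles shared between genotypes.

    Assumption: shared alleles are counted as a multiset intersection.
    """
    shared = Counter(g1) & Counter(g2)
    return sum(shared.values()) / 2


for blood, saliva in [((10, 10), (10, 11)), ((11, 11), (11, 11))]:
    print(genotype_consistency(blood, saliva), allele_consistency(blood, saliva))
```

Averaging either score over all loci called in both samples would give a per-sample concordance figure.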

Plot x-axes: Period (2, 3, 4, 5, 6, all)

Validation with DNA electrophoresis shows high accuracy

• gDNA
• 3 runs of 101PE Illumina GAIIx, then lobSTR
• Genetic genealogy service
• 14 CODIS markers:
  - 8 correct
  - 4 partially covered (< 3x)
  - 1 incorrect
  - 1 not covered

Validation with DNA electrophoresis shows high accuracy

Sequencing (Illumina GAIIx):
• 109 bp PE reads, 1-2x
• 77 bp PE reads, 5-7x

HGDP samples (gDNA), then lobSTR

HGDP-CEPH panel: ~200 STR calls

At 3x coverage, lobSTR correctly returned 75% of genotype calls and 85% of allele calls

Genome-wide STR profiling of 126x genome confirms known STR mutation trends

• 6.1 million reads aligned to 180,000 STR loci

• 55% of called loci ≥20x had at least one non-reference allele

STR variation decreases with unit length

Plot legend: Hom. ref, Het. ref/non-ref, Hom. non-ref, Het. non-ref

Genome-wide STR profiling of 126x genome confirms known STR mutation trends

Longer STR regions show increased variation


lobSTR reports different variations for trinucleotide STRs in introns vs. exons


Introns 3 times as likely to have 1 non-ref allele

Introns 5 times as likely to have 2 non-ref alleles

0 frameshift mutations detected in exons vs. 1.9% in introns

Conclusion

• lobSTR is an end-to-end solution for STR profiling in personal genomes

• It outperforms mainstream aligners at STRs

• Validation by:
  - Consistency in biological replicates
  - Mendelian inheritance in trios
  - Capillary platform
  - Biological reasoning

• Thousands of STR variations in a single individual: an enormous unexplored source of variation

lobSTR


jura.wi.mit.edu/erlich/lobSTR/

Acknowledgements

Funding: National Defense Science and Engineering Graduate Fellowship

• Yaniv Erlich
• David Golan (Tel Aviv University)
• Saharon Rosset (Tel Aviv University)
• Dina Esposito
• Mona Sheikh