42
Edward Buckler USDA-ARS Cornell University http://www.maizegenetics.net Why can GBS be complicated? Tools for filtering, error correction and imputation.

USDA-ARS Cornell University Numerous PAVs and CNVs - Springer, Lai, Schnable in 2010 50% Plant 1 Plant 2 Plant 3 99% Person 1 Person 2 Person 3 Maize Humans. GWAS and Joint linkage

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

  • Edward BucklerUSDA-ARS

    Cornell Universityhttp://www.maizegenetics.net

    Why can GBS be complicated? Tools

    for filtering, error correction and

    imputation.

  • Many Organisms Are Diverse

    Humans are at the lower end of diversity, which results in most genomic and bioinformatics

    being optimized to this situation

    What is diverse: Grapes, Flies, Arabidopsis, and most dominant species on the planet

  • Maize has more molecular diversity than humans and apes combined

    Silent Diversity (Zhao PNAS 2000; Tenallion et al, PNAS 2001)

    1.34%0.09%

    1.42%

  • Maize genetic variation has been evolving for 5 million years

    Modern Variation Begins Evolving

    Sister Genus Diverges

    Zea species begin diverging

    Maize domesticated

    5mya

    4mya

    3mya

    2mya

    1mya

    War

    m

    Plio

    cene

    Col

    d Pl

    eist

    ocen

    eDivergence from Chimps

    Ardipithecus

    Homo erectus

    Modern HumansModern Variation Begins

    Australopithecus

  • What are our expectations with GBS?

  • High Diversity Ensures High Return on Sequencing

    • Proportion of informative markers– Highly repetitive – 15% not easily

    informative– Half the genome is not shared between two

    maize line• Potentially all of these are informative with a

    large enough database– Low copy shared proportion (1% diversity)

    • Bi-parental information = (1-0.01)^64bp = 48% informative

    • Association information = (1-0.05)^64bp= 97% informative

  • Expectation of marker distribution

    Biallelic, 17%

    Too Repetitiv

    e, 15%

    Non-polymor

    phic; 18%

    Presense/Absense

    , 50%

    Multiallelic, 34%

    Too Repetitiv

    e, 15%

    Non-polymorphic; 1%

    Presense/Absense

    , 50%

    Biparental population Across the species

  • Sequencing Error

  • Illumina Basic Error Rate is ~1%

    • Error rates are associated with distance from start of sequence– Bad – GBS puts these all at the same

    position– Good – Reverse reads can correct– Good – Error are consistent and modelable

  • Reads with errors

    • Perfect sequences:0.9964=52.5% of the 64bp sequences are

    perfect47.5 are NOT perfect

    The errors are autocorrelated so the proportion of perfect sequence is a little higher, and those with 2 or more is also higher.

  • Do we see these errors?• Assume 10,000 lines genotyped at

    0.5X coverage

    Base Type Read # (no SNP)

    Read # (w/ SNP)

    A Major 4950 4900

    C Minor 17 67 (50 real)

    G Error 17 17

    T Error 17 17

  • Do Errors Matter?• Yes –Imputation, Haplotype

    reconstruction• Maybe – GWAS for low frequency

    SNPs• No – GS, genetic distance, mapping

    on biparental populations

  • Expectations of Real SNPs

    • Vast majority are biallelic• Homozygosity is predicted by

    inbreeding coefficient• Allele frequency is constrained in

    structured populations• In linkage disequilibrium with

    neighboring SNPs

  • Studying Errors in Biparental populationsLimited range of alleles,

    expected allele frequencies, high LD

  • Maize RIL population expectations

    • Allele frequency 0% or 50%• Nearby sites should be in very high

    LD (r2>50%)• Most sites can be tested if multiple

    populations are available

  • Bi-parental populations allow identification of error, and non-Mendelian segregation

    Error

    Non-segregating

    Segregating

  • Bi-parental populations allow identification of error, and non-Mendelian segregation

    Error

  • Median error rate is 0.004, but there is a long tail of some high error sites

    Median

  • HapMap

    Process

    File (data structure)

    Clean Up and Imputation

    HapMap

    GBSHapMapFiltersPluginSite Coverage, Taxa Coverage, Inbreeding

    Coefficient, LD

    ImputationImputation &

    Phasing

    KinshipDistance

    PhylogenyLDGS

    GWAS

    MergeDuplicateSNPsPluginMerge reads from opposite sides

    BiParentalErrorCorrectionPluginError rate estimation, LD filters

    MergeIdenticalTaxaPluginError rate estimation, LD filters

  • MergeDuplicateSNPsPlugin

    • When restriction sites are less than 128bp apart, we may read SNP from both directions (strands)

    • ~13% of all sites• Fusing increases coverage• Fixes errors• -misMat = set maximum mismatch rate• -callHets = mismatch set to hets or not

  • Use the biology of your system to filter SNP

    • Hardy-Weinberg Disequilibrium• For inbreeding crops use the expected

    inbreeding coefficient to filter SNPs• How to deal with the low coverage—

    – Although many individuals only have 1X coverage (27% of the samples will have 2X or more). Use these individuals to evaluate the quality of segregation.

  • Product of Filtering

    • After filters, in maize we find 0.0018 error rate– AAaa = < 0.0018– AAAa = 0.8 at low coverage

    • SNPs in wrong location

  • Imputation

  • Missing DataTwo major sources:• Sampling

    • Low coverage often used in big genomes with inbred lines

    • Differential coverage caused by fragment size biases

    • Biological• Region on genome not shared between lines• Cut site polymorphisms

    We want to impute the missing sampling but not the biological

  • Standard Imputation

    Lots of algorithms: FastPhase, NPUTE, BEAGLE, etc.

    These are appropriate for high coverage loci, inbreds, and regions where biological missing is a rare condition

    Some can be slow for sample sizes that we have.

  • BEAGLE

    • http://faculty.washington.edu/browning/beagle/beagle.html

    • Currently the best published approach. Works well with many population designs.

    • Main weakness is scaling with thousands of samples.

    • Not efficient for mapping populations

  • Hidden Markov Model TASSEL GBS Imputation

    • Developed by Peter Bradbury• Aimed a GBS and biparental populations• Hidden Markov Model • First reconstruct parental gametes and then

    offspring• Very accurate at determining boundaries• Works well on Maize NAM inbred lines, and

    probably others.• AA BB error rate– 0.00005• AB > AA – 0.0278

  • The Viterbi Algorithm

    N = 1 2 3 4

    distance between two points = log[ P(obsn|gn) P(gn|gn‐1) ]

    homozygous parent 1

    heterozygous

    homozygous parent 2

    →→→

  • Defining the transition matrix

    • The probability of a homozygote → homozygote transition is a function of the recombination frequency and number of generations of selfing.

    • Calcula ng the probability of a heterozygote → homozygote transition is more complex in RILs. Even the ratio of transitions varies with generations of selfing.

    • An alternative is just to estimate the transition frequencies directly from the data.  

  • Calculating P(obs|genotype)

    • P(obs | genotype) is a direct function of the genotype calling error rates.

    • P(obs = parent 1| genotype = parent2) is just the rate at which homozygotes are called incorrectly as the other homozygote. 

    • These can be estimated initially from GBS data, then refined using an E‐M algorithm

  • An E‐M algorithm

    • The parameters defining the model are– initial state probabilities– transition matrix– P(obs | genotype) matrix

    • Algorithm– Maximize P(G|S)– Using this G, estimate the parameters– Iterate to convergence (always 

  • Hidden Markov Model TASSEL GBS Imputation

    • Available as TASSEL 4.0 plugin• CallParentAllelesPlugin• ViterbiAlgorithmPlugin

    • Most problems appears in poorly controlled populations

    • Substantial number of errors caused by sample/pollination mix up.

  • FindMergeHaplotypesPlugin +MinorWindowViterbiImputationPlugin• Imputation approach focused on speed and

    large sets of taxa with some closely related individuals.

    • Nearest neighbor approach combined with Viterbi

    • Strengths: Very accurate 2X of each segments)

  • Unstable Genomes

  • Using the Presence/Absence Variants

    • In species like maize, this is the majority of the data

    • Less subject to sequencing error• Need imputation methods to

    differentiate between missing from sampling and biologically missing

  • Only 50% of the maize genome is shared between two varieties

    Fu & Dooner 2002, Morgante et al. 2005, Brunner et al 2005Numerous PAVs and CNVs - Springer, Lai, Schnable in 2010

    50%

    Plant 1

    Plant 2 Plant 3

    99%

    Person 1

    Person 2 Person 3

    Maize Humans

  • GWAS and Joint linkage mapping

    Read depth (B73 vs Non-B73)

    Identify structural variations

    Tags

    GWAS

    JointLinkage

    If Map AlignIf 

    AlignY

    PAVsN

    Chromosome

    B73

    Non‐B73

    Bin 1 Bin 2 Bin 3

    PAV (Deletion) CNV (Duplication)

    Insertion

    Developed by Fei Lu

  • Dependent variable, distance (abs(genetic position – physical position))

    Attributes: P-value, likelihood ratio, tag count, etc.

    Machine learning models: decision tree, SVM, M5Rules, etc.

    Model training for mapping accuracy

    6.4 M

    0.5 M

    8.6 M

    GJ

    G

    J

    Y

    Y

    Y

    4.5 M mapped tags26M tags

    GWAS

    Joint linkage

    Model training and prediction using M5Rules

  • High-resolution mapped tags

    95%

    50%

    4.5 million mapped tags in all maize lines

    100 bp 10 Kb 1 Mb 10 Kb1 Gb 10 Gb

    Median resolution

    Framework of maize pan genome

  • Paired-End Sequencing

    • Most of the tags are less than 500bp in size.

    • Full contigs can be developed by sequencing them on a MiSeq with paired-end 250bp reads

    • Strategies – physically bridge from the mapped end, genetically anchor one end.

  • Phylogenetic (Cross Species) Analysis

    • When divergence

  • Future• Need better integration of Whole Genome

    Sequence data with pipeline– Add information on premature cut sites or

    mutated cut sites• Use paired-end read information• Full incorporation of presence/absence

    variants• Quantitative genotype tools for

    polyploids/GS