USDA-ARS Cornell University Numerous PAVs and CNVs - Springer, Lai, Schnable in 2010 50% Plant 1 Plant 2 Plant 3 99% Person 1 Person 2 Person 3 Maize Humans. GWAS and Joint linkage

Edward BucklerUSDA-ARS

Cornell Universityhttp://www.maizegenetics.net

Why can GBS be complicated? Tools

for filtering, error correction and

imputation.

Many Organisms Are Diverse

Humans are at the lower end of diversity, which results in most genomic and bioinformatics

being optimized to this situation

What is diverse: Grapes, Flies, Arabidopsis, and most dominant species on the planet

Maize has more molecular diversity than humans and apes combined

Silent Diversity (Zhao PNAS 2000; Tenallion et al, PNAS 2001)

1.34%0.09%

1.42%

Maize genetic variation has been evolving for 5 million years

Modern Variation Begins Evolving

Sister Genus Diverges

Zea species begin diverging

Maize domesticated

5mya

4mya

3mya

2mya

1mya

War

m

Plio

cene

Col

d Pl

eist

ocen

eDivergence from Chimps

Ardipithecus

Homo erectus

Modern HumansModern Variation Begins

Australopithecus

What are our expectations with GBS?

High Diversity Ensures High Return on Sequencing

• Proportion of informative markers– Highly repetitive – 15% not easily

informative– Half the genome is not shared between two

maize line• Potentially all of these are informative with a

large enough database– Low copy shared proportion (1% diversity)

• Bi-parental information = (1-0.01)^64bp = 48% informative

• Association information = (1-0.05)^64bp= 97% informative

Expectation of marker distribution

Biallelic, 17%

Too Repetitiv

e, 15%

Non-polymor

phic; 18%

Presense/Absense

, 50%

Multiallelic, 34%

Too Repetitiv

e, 15%

Non-polymorphic; 1%

Presense/Absense

, 50%

Biparental population Across the species

Sequencing Error

Illumina Basic Error Rate is ~1%

• Error rates are associated with distance from start of sequence– Bad – GBS puts these all at the same

position– Good – Reverse reads can correct– Good – Error are consistent and modelable

Reads with errors

• Perfect sequences:0.9964=52.5% of the 64bp sequences are

perfect47.5 are NOT perfect

The errors are autocorrelated so the proportion of perfect sequence is a little higher, and those with 2 or more is also higher.

Do we see these errors?• Assume 10,000 lines genotyped at

0.5X coverage

Base Type Read # (no SNP)

Read # (w/ SNP)

A Major 4950 4900

C Minor 17 67 (50 real)

G Error 17 17

T Error 17 17

Do Errors Matter?• Yes –Imputation, Haplotype

reconstruction• Maybe – GWAS for low frequency

SNPs• No – GS, genetic distance, mapping

on biparental populations

Expectations of Real SNPs

• Vast majority are biallelic• Homozygosity is predicted by

inbreeding coefficient• Allele frequency is constrained in

structured populations• In linkage disequilibrium with

neighboring SNPs

Studying Errors in Biparental populationsLimited range of alleles,

expected allele frequencies, high LD

Maize RIL population expectations

• Allele frequency 0% or 50%• Nearby sites should be in very high

LD (r2>50%)• Most sites can be tested if multiple

populations are available

Bi-parental populations allow identification of error, and non-Mendelian segregation

Error

Non-segregating

Segregating

Bi-parental populations allow identification of error, and non-Mendelian segregation

Error

Median error rate is 0.004, but there is a long tail of some high error sites

Median

HapMap

Process

File (data structure)

Clean Up and Imputation

HapMap

GBSHapMapFiltersPluginSite Coverage, Taxa Coverage, Inbreeding

Coefficient, LD

ImputationImputation &

Phasing

KinshipDistance

PhylogenyLDGS

GWAS

MergeDuplicateSNPsPluginMerge reads from opposite sides

BiParentalErrorCorrectionPluginError rate estimation, LD filters

MergeIdenticalTaxaPluginError rate estimation, LD filters

MergeDuplicateSNPsPlugin

• When restriction sites are less than 128bp apart, we may read SNP from both directions (strands)

• ~13% of all sites• Fusing increases coverage• Fixes errors• -misMat = set maximum mismatch rate• -callHets = mismatch set to hets or not

Use the biology of your system to filter SNP

• Hardy-Weinberg Disequilibrium• For inbreeding crops use the expected

inbreeding coefficient to filter SNPs• How to deal with the low coverage—

– Although many individuals only have 1X coverage (27% of the samples will have 2X or more). Use these individuals to evaluate the quality of segregation.

Product of Filtering

• After filters, in maize we find 0.0018 error rate– AAaa = < 0.0018– AAAa = 0.8 at low coverage

• SNPs in wrong location

Imputation

Missing DataTwo major sources:• Sampling

• Low coverage often used in big genomes with inbred lines

• Differential coverage caused by fragment size biases

• Biological• Region on genome not shared between lines• Cut site polymorphisms

We want to impute the missing sampling but not the biological

Standard Imputation

Lots of algorithms: FastPhase, NPUTE, BEAGLE, etc.

These are appropriate for high coverage loci, inbreds, and regions where biological missing is a rare condition

Some can be slow for sample sizes that we have.

BEAGLE

• http://faculty.washington.edu/browning/beagle/beagle.html

• Currently the best published approach. Works well with many population designs.

• Main weakness is scaling with thousands of samples.

• Not efficient for mapping populations

Hidden Markov Model TASSEL GBS Imputation

• Developed by Peter Bradbury• Aimed a GBS and biparental populations• Hidden Markov Model • First reconstruct parental gametes and then

offspring• Very accurate at determining boundaries• Works well on Maize NAM inbred lines, and

probably others.• AA BB error rate– 0.00005• AB > AA – 0.0278

The Viterbi Algorithm

N = 1 2 3 4

distance between two points = log[ P(obsn|gn) P(gn|gn‐1) ]

homozygous parent 1

heterozygous

homozygous parent 2

→→→

Defining the transition matrix

• The probability of a homozygote → homozygote transition is a function of the recombination frequency and number of generations of selfing.

• Calcula ng the probability of a heterozygote → homozygote transition is more complex in RILs. Even the ratio of transitions varies with generations of selfing.

• An alternative is just to estimate the transition frequencies directly from the data.

Calculating P(obs|genotype)

• P(obs | genotype) is a direct function of the genotype calling error rates.

• P(obs = parent 1| genotype = parent2) is just the rate at which homozygotes are called incorrectly as the other homozygote.

• These can be estimated initially from GBS data, then refined using an E‐M algorithm

An E‐M algorithm

• The parameters defining the model are– initial state probabilities– transition matrix– P(obs | genotype) matrix

• Algorithm– Maximize P(G|S)– Using this G, estimate the parameters– Iterate to convergence (always

Hidden Markov Model TASSEL GBS Imputation

• Available as TASSEL 4.0 plugin• CallParentAllelesPlugin• ViterbiAlgorithmPlugin

• Most problems appears in poorly controlled populations

• Substantial number of errors caused by sample/pollination mix up.

FindMergeHaplotypesPlugin +MinorWindowViterbiImputationPlugin• Imputation approach focused on speed and

large sets of taxa with some closely related individuals.

• Nearest neighbor approach combined with Viterbi

• Strengths: Very accurate 2X of each segments)

Unstable Genomes

Using the Presence/Absence Variants

• In species like maize, this is the majority of the data

• Less subject to sequencing error• Need imputation methods to

differentiate between missing from sampling and biologically missing

Only 50% of the maize genome is shared between two varieties

Fu & Dooner 2002, Morgante et al. 2005, Brunner et al 2005Numerous PAVs and CNVs - Springer, Lai, Schnable in 2010

50%

Plant 1

Plant 2 Plant 3

99%

Person 1

Person 2 Person 3

Maize Humans

GWAS and Joint linkage mapping

Read depth (B73 vs Non-B73)

Identify structural variations

Tags

GWAS

JointLinkage

If Map AlignIf

AlignY

PAVsN

Chromosome

B73

Non‐B73

Bin 1 Bin 2 Bin 3

PAV (Deletion) CNV (Duplication)

Insertion

Developed by Fei Lu

Dependent variable, distance (abs(genetic position – physical position))

Attributes: P-value, likelihood ratio, tag count, etc.

Machine learning models: decision tree, SVM, M5Rules, etc.

Model training for mapping accuracy

6.4 M

0.5 M

8.6 M

GJ

G

J

Y

Y

Y

4.5 M mapped tags26M tags

GWAS

Joint linkage

Model training and prediction using M5Rules

High-resolution mapped tags

95%

50%

4.5 million mapped tags in all maize lines

100 bp 10 Kb 1 Mb 10 Kb1 Gb 10 Gb

Median resolution

Framework of maize pan genome

Paired-End Sequencing

• Most of the tags are less than 500bp in size.

• Full contigs can be developed by sequencing them on a MiSeq with paired-end 250bp reads

• Strategies – physically bridge from the mapped end, genetically anchor one end.

Phylogenetic (Cross Species) Analysis

• When divergence

Future• Need better integration of Whole Genome

Sequence data with pipeline– Add information on premature cut sites or

mutated cut sites• Use paired-end read information• Full incorporation of presence/absence

variants• Quantitative genotype tools for

polyploids/GS

Documents

USDA-ARS Cornell University Numerous PAVs and CNVs - Springer, Lai, Schnable in 2010 50% Plant 1 Plant 2 Plant 3 99% Person 1 Person 2 Person 3 Maize Humans. GWAS and Joint linkage