International SheepGenomics Consortia
www.sheephapmap.org
Assembling the sheep genome via KAREN
John McEwan (AgResearch Invermay) on behalf of the
International Sheep Genomics Consortium
Why?
• We want to improve genetic gain in sheep
  – Can use whole genome selection
• Need a high density SNP chip
  – Need a genome sequence and SNPs
    » Need to sequence and assemble sheep
• Use new sequencing technology
• Job too big and expensive for one group
• International Consortia developed
Whole genome selection
• Major scientific advance
• Genome sequencing & SNP chips = “genome wide selection”
• As accurate as progeny testing,
  – but can be done at birth
• Suitable for
  – sex limited,
  – difficult to measure traits or
  – traits measured late in life
• Dairy cattle:
  – increase genetic gain 50-100%
  – while decreasing progeny testing costs
• Application in sheep is still being explored but has great advantages
• Numerous other species and uses for sequence
What is a SNP and what is a SNP chip?
• SNP = single nucleotide polymorphism
• SNP chip = test 60,000 to 1,000,000 SNPs
• WGS (whole genome selection) works by being able to:
  – predict status of other SNP variants nearby
  – includes variants that affect production traits

MELD            atcgcgtgtagctagtgctagctgctagctagctgatgca
ROM1_read12667  .............t..........................
AWA1_read00345  ........................................
SBF1_read06734  ........................................
TEX1_read00234  .............t..........................
ROM1_read10385  .............t..........................
TEX1_read39890  ........................................
ISGC – division of labor
• 6 sites
• AgR hosts core database
• tasks divided to best utilise skills
• best utilise resources
• history was DVDs…..
• versioning
• KAREN for transfer of data
• make available to world
Roche 454 FLX Skim Sequencing Strategy
Otago University: Romney, Texel, Scottish Blackface (0.5x each)
Baylor College HGSC: Merino, Poll Dorset, Awassi (0.5x each)
Repeat mask
Blast BT4.0 + BT2 addns
Assemble with Newbler
Meld against bovine scaffold
Reorder via ovine BES
Detect SNPs
dB Summary
Breed               Source  SeqCount    AvgLength  BaseCount
Awassi              Baylor  7113075     235        1,675,721,394
Merino              Baylor  9004167     220        1,995,873,002
Poll Dorset         Baylor  7917802     238        1,890,589,115
Romney              Otago   6008805     219        1,330,683,710
Scottish Blackface  Otago   5611006     229        1,273,929,900
Texel               Otago   6735328     227        1,529,979,986
TOTAL                       41,000,192             9,696,777,107
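The 0.5x per-animal target can be checked against the base counts in the dB Summary table. The following is a minimal sketch, assuming a genome length of 3.158 Gbp (the OA_ver.1.0 assembly length quoted later in this deck, N bases included); the names `BASE_COUNTS`, `GENOME_SIZE_BP` and `fold_coverage` are illustrative only.

```python
# Base counts per breed, copied from the dB Summary table.
BASE_COUNTS = {
    "Awassi": 1_675_721_394,
    "Merino": 1_995_873_002,
    "Poll Dorset": 1_890_589_115,
    "Romney": 1_330_683_710,
    "Scottish Blackface": 1_273_929_900,
    "Texel": 1_529_979_986,
}

# Assumption: genome length taken as the 3.158 Gbp assembly size from the deck.
GENOME_SIZE_BP = 3_158_000_000


def fold_coverage(base_count, genome_size=GENOME_SIZE_BP):
    """Return raw fold coverage: sequenced bases divided by genome length."""
    return base_count / genome_size


coverage = {breed: fold_coverage(n) for breed, n in BASE_COUNTS.items()}
total_bases = sum(BASE_COUNTS.values())
mean_coverage = fold_coverage(total_bases) / len(BASE_COUNTS)

if __name__ == "__main__":
    for breed, cov in coverage.items():
        print(f"{breed:20s} {cov:.2f}x")
    print(f"mean per animal      {mean_coverage:.2f}x")
```

This is raw coverage before repeat masking and mapping, so it is expected to sit somewhat above the mapped depth per animal.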
Data
• ~90 “runs” on 454
• Per run
  – Sequence ~130Mb
  – Processed data ~800Mb (quality ….)
  – Raw data 33Mb x 412 images = 13.6Gb
• Total 1224 + 72 = 1296Gb
• Actually more as used another technology
• This gives a false impression…..
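The totals on this slide follow from the per-run figures. A short worked check, assuming exactly 90 runs and 1 Gb = 1000 Mb (both are rounding assumptions; the variable names are illustrative):

```python
RUNS = 90                 # "~90 runs" on the slide, taken as exactly 90 here
IMAGE_MB = 33             # size of one raw image
IMAGES_PER_RUN = 412
PROCESSED_MB_PER_RUN = 800

raw_gb_per_run = IMAGE_MB * IMAGES_PER_RUN / 1000   # about 13.6 Gb
raw_total_gb = raw_gb_per_run * RUNS                # about 1224 Gb
processed_total_gb = PROCESSED_MB_PER_RUN * RUNS / 1000  # 72 Gb
total_gb = raw_total_gb + processed_total_gb        # about 1296 Gb

if __name__ == "__main__":
    print(f"raw per run: {raw_gb_per_run:.1f} Gb")
    print(f"raw total: {raw_total_gb:.0f} Gb")
    print(f"processed total: {processed_total_gb:.0f} Gb")
    print(f"total: {total_gb:.0f} Gb")
```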
Repeat masking
• Created own repeat database
  – Almost all are slight variants of existing repeats
  – Only masks ~2% more bases (40% total)
  – Speeds mapping to bovine genome
  – Takes about 4-5 days on ~120 CPUs
  – File size ~10Gbp…
  – Versioning important…
BLAST results
• Map to cattle
• 46% uniquely
• 4.4% unambiguously
• Issues: options, time taken, size of output
  – Weeks of processing time…..
Newbler assembly, e.g. 1Mbp region
Contigs plus Singletons
  number         2365
  numberOfBases  1,032,839
  avgSize bp     437
Coverage %
  unadj  51.6
  adj    52.7
• This could only be done in one location. Fast (several days). However, alternatives needed to be explored…
• Results needed to be transferred (~3Gbp)
Meld Process
Ovine contigs
Align (BLAST)
reference
contigs
MELD: CTAGTGCATGCTGCactTGCTataTGTGCtagNNgcATATTGCTGNNTGCTAT
Bovine reference scaffold
Figure 2. Creation of MELDed ovine sequence using bovine as a guide genome
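The idea in Figure 2 can be illustrated with a toy sketch. This is not the consortium's MELD code: it assumes each ovine contig already has an ungapped placement on the bovine scaffold (a start coordinate and an orientation, as a BLAST alignment might supply), ignores indels between the two species, and fills uncovered positions with N. The `Placement` class and `meld` function are hypothetical names.

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


@dataclass
class Placement:
    ref_start: int          # 0-based start on the guide scaffold (assumed known)
    sequence: str           # ovine contig sequence
    reverse: bool = False   # True if the contig aligned to the minus strand


def meld(placements, ref_length):
    """Order and orient contigs on the guide coordinates, padding gaps with N."""
    melded = ["N"] * ref_length
    for p in sorted(placements, key=lambda p: p.ref_start):
        seq = reverse_complement(p.sequence) if p.reverse else p.sequence
        for offset, base in enumerate(seq):
            pos = p.ref_start + offset
            # Where contigs overlap, the first one placed keeps the position.
            if 0 <= pos < ref_length and melded[pos] == "N":
                melded[pos] = base
    return "".join(melded)


if __name__ == "__main__":
    contigs = [Placement(6, "AAC", reverse=True), Placement(0, "ACGT")]
    print(meld(contigs, 10))
```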
MELD
• Contigs
  – Ordered
  – Orientated
  – Use BT4+
  – Contigs ~480bp
OA_ver.1.0 coverage
[Figure: OA_ver.1.0 coverage (0% to 90%) for each bovine chromosome BTA1 to BTA29 and BTAx, plus the BTAcontig, BTAreptig and BTAchrun categories. Series: adj percent non nN ver.1.0; coverage of nonrepetitive btau4 fraction ver.1.0]
Assembly 3.158 Gbp with 1.242 Gbp non N
Copy number variants
• QC process
  – Need sanity checks that assembly is correct
• CNVs
  – Regions >1000bp present variable numbers of times in genome
  – Often duplicated by unequal recombination
  – Can confuse SNP detection and are a very frequent source of assembly errors
• Detection
  – Use “adjusted” depth of ovine 454 reads mapped to BT4 genome
  – At each base pair count depth for each animal and average
  – Done using 50kb window with 1kb increments
• Results
  – Average depth per animal ~0.45X
  – 1-3 CNV regions detected/chromosome
  – Appear to be true CNVs
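The detection step (mean read depth in a 50kb window moved in 1kb increments) can be sketched as below. This is only an illustration: the input is assumed to be a per-base depth list for one chromosome, the depth "adjustment" mentioned on the slide is not modelled, and the `ratio` threshold in `flag_candidate_windows` is a hypothetical choice, not a value from the deck.

```python
def windowed_mean_depth(depth, window=50_000, step=1_000):
    """Return (window_start, mean_depth) pairs for a sliding window.

    depth: per-base read depth along one chromosome.
    """
    if len(depth) < window:
        return []
    # Prefix sums make each window total an O(1) lookup.
    prefix = [0]
    for d in depth:
        prefix.append(prefix[-1] + d)
    results = []
    for start in range(0, len(depth) - window + 1, step):
        total = prefix[start + window] - prefix[start]
        results.append((start, total / window))
    return results


def flag_candidate_windows(windows, expected_depth, ratio=2.0):
    """Return starts of windows whose mean depth is at least ratio x expected.

    The ratio threshold is an assumption made for this sketch.
    """
    return [start for start, mean in windows if mean >= ratio * expected_depth]


if __name__ == "__main__":
    toy_depth = [1] * 10 + [3] * 10 + [1] * 10
    windows = windowed_mean_depth(toy_depth, window=5, step=5)
    print(windows)
    print(flag_candidate_windows(windows, expected_depth=1.0))
```

Running the same windows separately for each animal and then averaging would follow the slide more closely; the toy example uses a tiny window only so the output is easy to read.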
Chromosome 1: as an example
Example putative CNV: BTA1:149Mbp
BTA1:149Mbp Gbrowse view
SNP detection
• SNP Detection Criteria
  – Stacking: collapsed where reads same (animal, plate, bp)
  – Depth: >3 (35% of sequence) and <9 reads deep
  – MAF: at least 2 reads present
  – SNP Class:
    » A: 2 or more animals present for both alleles
    » B: 2 or more animals present for at least 1 allele
    » C: alleles seen in one animal
  – SNP quality: read will be discarded if:
    » variants within 10bp either side
    » homopolymeric runs (n>4) within 5bp
    » indels within 10bp
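The depth, MAF and class criteria above can be expressed as a small sketch. It assumes reads at a position have already been collapsed for stacking and filtered for quality, and that each read is represented as an (animal, allele) pair; the function names are illustrative and this is not the consortium's pipeline.

```python
from collections import Counter


def passes_depth_and_maf(reads, min_depth=4, max_depth=8, min_minor_reads=2):
    """Apply the depth (>3 and <9) and minor-allele (at least 2 reads) filters.

    reads: list of (animal, allele) pairs covering one position.
    """
    if not (min_depth <= len(reads) <= max_depth):
        return False
    allele_counts = Counter(allele for _, allele in reads)
    if len(allele_counts) != 2:
        return False
    return min(allele_counts.values()) >= min_minor_reads


def classify_snp(reads):
    """Return 'A', 'B' or 'C' from how many animals carry each allele."""
    animals_by_allele = {}
    for animal, allele in reads:
        animals_by_allele.setdefault(allele, set()).add(animal)
    if len(animals_by_allele) != 2:
        return None  # not a biallelic site
    fewest, most = sorted(len(a) for a in animals_by_allele.values())
    if fewest >= 2:
        return "A"   # both alleles seen in 2 or more animals
    if most >= 2:
        return "B"   # only one allele seen in 2 or more animals
    return "C"       # each allele seen in a single animal


if __name__ == "__main__":
    # The six reads from the example alignment in this deck, at the variant column.
    example = [
        ("ROM1", "t"), ("AWA1", "a"), ("SBF1", "a"),
        ("TEX1", "t"), ("ROM1", "t"), ("TEX1", "a"),
    ]
    print(passes_depth_and_maf(example), classify_snp(example))
```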
How biased is the sampling?
Block depth (only blocks with reads are shown)
[Figure: count (0 to 2000) against block depth (minus one occurrence of the guide sequence) (0 to 20). Series: count (raw alignments); count (C minus stacks from same plate); expected Poisson]
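For reference, an "expected Poisson" curve of the kind plotted here can be generated as follows. This is a generic sketch: the mean depth `lam` and the number of blocks are inputs the reader must supply, since the values used for the figure are not given in the deck.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing depth k when depth is Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_block_counts(n_blocks, lam, max_depth=20):
    """Expected number of blocks at each depth 0..max_depth under random sampling."""
    return [n_blocks * poisson_pmf(k, lam) for k in range(max_depth + 1)]


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the call.
    for depth, count in enumerate(expected_block_counts(10_000, 2.5, max_depth=8)):
        print(depth, round(count, 1))
```

An observed histogram with a heavier tail than this curve would suggest stacked or otherwise non-random reads.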
Interim SNP Results

MELD            atcgcgtgtagctagtgctagctgctagctagctgatgca
ROM1_read12667  .............t..........................
AWA1_read00345  ........................................
SBF1_read06734  ........................................
TEX1_read00234  .............t..........................
ROM1_read10385  .............t..........................
TEX1_read39890  ........................................

4 unique reads to do a call
A = both alleles seen by 2 animals
B = 1 allele seen in 2 animals
C = both alleles seen one animal
2 = Infinium 2 SNPs: 1 probe, 50bp, no G/C, A/T SNPs
1 = Infinium 1 SNPs: 2 probes, 50bp
~69% pass design (0.8 threshold)
~200K SNPs or ~3 SNPs/50Kb
As expected but rather low
~5/50kb better
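The density figures can be checked with a short calculation. This sketch assumes a round genome length of 3 Gbp, which is an approximation introduced here rather than a number from the slide.

```python
GENOME_BP = 3_000_000_000   # assumption: round 3 Gbp genome length
WINDOW_BP = 50_000


def snps_per_window(n_snps, genome_bp=GENOME_BP, window_bp=WINDOW_BP):
    """Average number of SNPs per window if SNPs were spread evenly."""
    return n_snps / (genome_bp / window_bp)


def snps_needed(per_window, genome_bp=GENOME_BP, window_bp=WINDOW_BP):
    """Number of SNPs needed to reach a target density per window."""
    return per_window * genome_bp // window_bp


if __name__ == "__main__":
    print(f"{snps_per_window(200_000):.2f} SNPs per 50kb from ~200K SNPs")
    print(f"{snps_needed(5):,} SNPs needed for 5 per 50kb")
    print(f"{snps_per_window(60_000):.2f} SNPs per 50kb on a 60K chip")
```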
454 SNPs detected in genome (excluding chrUn)
class \ SNP_class  A        B        C       Grand Total
1                  33,149   34,451   4,782   72,382
2                  231,119  239,381  32,563  503,063
Grand Total        264,268  273,832  37,345  575,445
454 SNPs detected in genome (including chrUn)
class \ SNP_class  A        B        C       Grand Total
1                  34,256   35,592   4,949   74,797
2                  239,092  247,234  33,558  519,884
Grand Total        273,348  282,826  38,507  594,681
Information resources
BLAST & sequence download available
Up to date information on ISGC project aims and progress
www.sheephapmap.org https://isgcdata.agresearch.nz
Genome Annotation
• Visualize sequence and annotation
• Widely used
• Concept of “tracks”
• Each track has significant processing requirements
• Distribute tasks
• Versioning again important
• Significant data transfers
• Can have more than 50 tracks
SNP validation and selection
• Validation
  – Selected 112 Class A 454 SNPs
  – Assay with Sequenom
  – Aim is >85% validation rate in IMF (end of July)
  – Achieved 81%
• Select 60K SNPs for chip
  – Spacing algorithms used based on quality (est MAF, adjacent sequence) and position
  – Multiple runs
  – Target date Aug 22nd
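The deck does not describe the spacing algorithm itself. One simple way to combine quality and position, shown purely as an assumption-laden sketch, is to divide each chromosome into fixed bins and keep the best-scoring candidate in each bin. The `score` value stands in for whatever quality measure is used (for example estimated MAF and design score); the function name and the 50kb bin size are hypothetical.

```python
def select_spaced_snps(candidates, bin_size=50_000):
    """Keep the highest-scoring SNP in each fixed-size bin.

    candidates: iterable of (chromosome, position, score) tuples.
    Returns the selected tuples sorted by chromosome and position.
    """
    best = {}
    for chrom, pos, score in candidates:
        key = (chrom, pos // bin_size)
        if key not in best or score > best[key][2]:
            best[key] = (chrom, pos, score)
    return sorted(best.values())


if __name__ == "__main__":
    snps = [
        ("chr1", 1_000, 0.6),
        ("chr1", 20_000, 0.9),
        ("chr1", 60_000, 0.5),
        ("chr2", 10, 0.7),
    ]
    print(select_spaced_snps(snps))
```

A fixed-bin approach can still place two chosen SNPs close together across a bin boundary, which is one reason several runs with different settings might be compared.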
Future
• 60K chip: Aug 22nd final date for SNPs
  – Available December 2008
• Assembly
  – 2nd Assembly ~Aug 2008
    » BT4+sens blast+CAP3 assembler: expect 20% more sequence
  – 3rd Assembly ~Dec 2008
    » + all_vs_all: expect 10% more seq
  – 4th Assembly ~June 2009
    » As above but include ~4.5Gbp more seq inc paired end reads
• Application 10X coverage with ~200bp paired end reads June 2009
• For each assembly
  – annotate with Gbrowse
  – detect SNPs
Lessons learned
• 8th year of international consortia (~4th)
• Data volumes increasing rapidly (1000X)
• Initial data transfer is not the major issue
  – Storage, transfer, annotation is ongoing
  – Processing, synchronisation, sharing resources
• Generate more than 10X volume
• Small numerous 0.1-5Gb transfers
  – Needs reliable transparent high volume data transfer
  – Still issues with firewalls
  – Currently using phone for humans… why (location)?
Acknowledgements
AgResearch NZ: John McEwan, Gemma Payne, Nessa O’Sullivan, Tracey van Stijn, Theresa Wilson, Rudi Brauning, Alan McCulloch, Russell Smithies, Benoit Auvray
Baylor HGSC: Richard Gibbs, George Weinstock, Donna M. Muzny, Michael E. Holder, Lynne Nazareth, Rebecca L. Thorton, Christie Kovar
CSIRO: Brian Dalrymple, James Kijas, Ross Tellam, Wes Barris, Sean McWilliam, Abhirami Ratnakumar, David Townley
Genesis Faraday: Chris Warkup
Roslin Institute: Steve Bishop
sheepGENOMICS: Terry Longhurst (MLA), Rob Forage
UNE/sheepGENOMICS: Hutton Oddy
University of Otago University of Sydney USDA Jo Stanton Frank Nicholas Tim Smith Chrissie Curt van Tassell Mark
Funding: Genesis Faraday, University of Sydney ISL Grant, and Ovita NZ
Thanks KAREN and team