Wiggans, ARS Big Data Workshop – July 16, 2015 (1)
George R. Wiggans
Animal Genomics and Improvement Laboratory
Agricultural Research Service, USDA
Beltsville, MD 20705-2350, [email protected]
Big data in support of genetic improvement of dairy cattle
Mission
Genetic improvement of dairy cattle for economically important traits
  Yield (milk, fat, and protein)
  Conformation (overall and individual traits)
  Longevity (productive life)
  Fertility (conception and pregnancy rates)
  Calving (dystocia and stillbirth)
  Disease resistance (mastitis)
Data types
Identification information for animal, sire, and dam:
  Name
  ID number
  Birth date
  Breed
  Herd
  Country
Animal genotypes from marker panels that range from 2,900 to 777,962 markers
(Image courtesy of Illumina, Inc.)
Data types (continued)
Records for milk yield, fat percentage, protein percentage, and somatic cell count (1/month)
Appraiser-assigned scores for 16 body and udder characteristics related to conformation (e.g., stature)
Breeding records that include indicator for conception success
Calving difficulty scores and stillbirth occurrences
Data amounts
Pedigree records: 71,974,045
Animal genotypes: 1,035,590
Lactation records (since 1960): 132,629,200
Daily yield records (since 1990): 641,864,015
Reproduction event records: 176,559,035
Calving difficulty scores: 29,528,607
Stillbirth scores: 19,567,198
Computing environment
Computation server
  2.27 GHz CPU (32 cores, 64 threads)
  660 GB RAM
  2.7 TB local storage
Database server
  3.4 GHz CPU (12 cores, 24 threads)
  264 GB RAM
  1.3 TB local storage
Shared storage
  38 TB
Data management
Variable length segments for database rows to minimize space and overhead in identifying data
All marker genotypes for an animal stored each as a single byte in a character large object (CLOB)
All breedings and monthly milk yield and component information for a cow’s lactation stored in variable character data types
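The one-byte-per-marker layout can be illustrated with a minimal sketch. The slides do not give the actual genotype codes or database column layout, so the coding used here (0, 1, 2 for allele count and a hypothetical 5 for a missing call) and the function names are assumptions for illustration only.

```python
# Hypothetical coding (not from the slides): '0', '1', '2' = allele count,
# '5' = missing call. Each marker occupies exactly one character.
VALID_CODES = set("0125")

def pack_genotypes(calls):
    """Join per-marker codes into one string, one character per marker."""
    text = "".join(str(c) for c in calls)
    if len(text) != len(calls) or not set(text) <= VALID_CODES:
        raise ValueError("each genotype must be a single valid code")
    return text

def genotype_at(packed, marker_index):
    """With a fixed width of one character, marker i is simply character i."""
    return int(packed[marker_index])

packed = pack_genotypes([0, 2, 1, 5, 1])
print(packed, genotype_at(packed, 1))
```

Because every marker has the same width, any marker can be read by position without parsing the rest of the string, which is one plausible reason for storing the whole genotype in a single large object.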
Programming languages
C: Database interface, including data editing
FORTRAN: Calculation of genetic merit estimates
SAS: Data preparation, checking, and delivery
Calculation schedule
Triannual (April, August, December) genetic merit estimates from processed phenotypic data
Monthly genomic evaluations based on estimates of marker effects using genotypic data and triannual phenotype-based evaluations
Weekly evaluations using marker effect estimates from monthly evaluations
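The reason weekly evaluations can reuse monthly marker effect estimates is that, once effects are estimated, scoring a newly genotyped animal is a cheap weighted sum. The sketch below is a simplified illustration under that assumption; the slides do not describe the production model, which would involve more than this sum, and the function name and numbers are hypothetical.

```python
def genomic_prediction(genotypes, marker_effects, mean=0.0):
    """Sum of allele counts times previously estimated marker effects."""
    if len(genotypes) != len(marker_effects):
        raise ValueError("one effect estimate is needed per marker")
    return mean + sum(g * a for g, a in zip(genotypes, marker_effects))

# Hypothetical effect estimates carried over from the last monthly run.
effects = [1.5, -0.5, 2.0]
# Allele counts (0, 1, or 2) for an animal genotyped this week.
new_animal = [2, 1, 0]
value = genomic_prediction(new_animal, effects)
print(value)
```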
Transition to industry
Council on Dairy Cattle Breeding
  Database maintenance
  Calculation and distribution of genetic merit estimates
  Interface with evaluation users and data suppliers
ARS
  Research and development using data made available by Council
Research resource
Massive amount of genomic data
  Location of causal genetic variants
Investigation of haplotypes never found in a homozygous state
  Discovery of chromosomal abnormalities resulting in early embryonic death
Investigation of sons of heterozygous sires
  Specific markers associated with differences between sons by haplotype
Genetic merit of marketed Holstein bulls
[Figure: average net merit ($), from -300 to 600, plotted against year entered AI, 2000 to 2014. Annotated average gains: $19.42/year, $47.95/year, and $87.49/year.]
Working with sequence data
Sequence available from 1000 Bull Genomes Project hosted in Australia
Project funded by industry to sequence over 200 bulls to create a haplotype library
A posteriori granddaughter design to locate chromosomal segments of interest from 71 bulls each with over 100 genotyped and progeny-tested sons
Imputing sequence data
Haplotype library supports imputation
Genotypes from genotyping chips can be imputed to full sequence
Lower accuracy of sequence data compared with chip genotypes accommodated by dealing in dosages to represent allele content
Findhap v4 (VanRaden) is fast and more accurate than Beagle at low sequence coverage (2× to 4×)
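A dosage, as commonly defined, is the expected count of one allele given the probabilities of the three possible genotypes, so an uncertain sequence call contributes a fractional value rather than a hard 0, 1, or 2. The sketch below assumes that standard definition; the slides do not state how Findhap computes or stores dosages, and the function name is hypothetical.

```python
def allele_dosage(p_hom_ref, p_het, p_hom_alt):
    """Expected number of alternate alleles: 0*P(0) + 1*P(1) + 2*P(2)."""
    total = p_hom_ref + p_het + p_hom_alt
    if abs(total - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    return p_het + 2.0 * p_hom_alt

# A confident homozygous call gives an integer dosage.
certain = allele_dosage(0.0, 0.0, 1.0)
# A low-coverage call with spread-out probabilities gives a fractional one.
uncertain = allele_dosage(0.1, 0.6, 0.3)
print(certain, uncertain)
```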
Alignment of sequence data
Alignment – determining location of chromosomal segments provided by sequencer
Findmap – matches segment against library of haplotypes
Preserves low-frequency variants
Does not identify new variants
Uses a hash table to look up variants, enabling rapid processing
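The idea of matching a sequenced segment against a haplotype library through a hash table can be sketched with a plain dictionary keyed on short substrings. This is an illustration of the general technique only, not Findmap's actual algorithm; the library contents, the substring length `k`, and the function names are hypothetical.

```python
def build_index(haplotypes, k):
    """Map every length-k substring to the (haplotype name, offset) pairs where it occurs."""
    index = {}
    for name, seq in haplotypes.items():
        for start in range(len(seq) - k + 1):
            index.setdefault(seq[start:start + k], []).append((name, start))
    return index

def locate(read, index, k):
    """Look up the first k bases of a read; an unknown segment returns no hits."""
    return index.get(read[:k], [])

# Hypothetical two-haplotype library differing at one position.
library = {"hap_A": "ACGTACGGA", "hap_B": "ACGTTCGGA"}
index = build_index(library, k=5)
hits = locate("TTCGGA", index, k=5)
misses = locate("GGGGGG", index, k=5)
print(hits, misses)
```

Each lookup is a single dictionary access regardless of library size, which is what makes this kind of matching fast. A segment carrying a variant absent from the library finds no match, consistent with the point that new variants are not identified.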
Accuracy of Findhap vs. Beagle*
                  Sequence + HD             Imputed from HD
Program  Depth    Correct (%)  Correlation  Correct (%)  Correlation
Findhap  8×       98.7         0.981        95.0         0.926
Findhap  4×       95.8         0.939        93.1         0.897
Findhap  2×       91.3         0.879        89.2         0.837
Beagle   8×       99.0         0.984        97.1         0.956
Beagle   4×       95.0         0.918        78.2         0.582
Beagle   2×       79.5         0.602        63.5         0.100
*250 bulls had sequence + HD; 250 others were imputed from HD
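The two accuracy measures in the table can be illustrated with a small sketch. The slides do not say exactly how "Correct" and "Correlation" were computed, so this assumes the usual readings: the percentage of imputed genotypes equal to the true genotype, and the Pearson correlation between true and imputed allele counts. The genotype vectors are made up for the example.

```python
from math import sqrt

def percent_correct(true_calls, imputed_calls):
    """Percentage of positions where the imputed genotype equals the true one."""
    matches = sum(t == i for t, i in zip(true_calls, imputed_calls))
    return 100.0 * matches / len(true_calls)

def correlation(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up allele counts for eight markers in one animal.
true_calls = [0, 1, 2, 1, 0, 2, 1, 1]
imputed_calls = [0, 1, 2, 1, 0, 1, 1, 2]
pct = percent_correct(true_calls, imputed_calls)
r = correlation(true_calls, imputed_calls)
print(pct, r)
```

The two measures respond differently to errors: percent correct counts every mismatch equally, while correlation also reflects how far off each imputed value is and how variable the true genotypes are.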
Data storage and backup
Disk storage being added
  Compression option being investigated
Back up to tape with weekly submission to off-site storage
Expect to have Internet2 connection
  Facilitate sharing of sequence data
Summary
Highly successful program leading to annual increases in genetic merit for production efficiency
Large database of phenotypic and genomic data provided by industry
Research projects to determine mechanism of genetic control of economically important traits
Data processing techniques developed so that rapid turnaround could be realized