COMPUTER SCIENCE APPLIED TO ANIMAL BREEDING
INTRODUCTION
1. Data
• Examples
• Editing
• Management
2. Data bases
• Definitions
• Examples
• Structure of an effective data base
3. MS Excel
• Examples of (useful) data base functions
• Example data base in Excel
Copyright ©2013, Joanna Szyda
DATA: data structure
SIMPLE
• Data records not related
e.g.
1 7 9 5697 4.1
2 7 8 4890 3.8
3 6 5 7321 3.5
COMPLEX
• Relation between records
e.g.
1 7 9 2001 10.05.2005 23
1 7 9 2001 17.06.2005 34
1 7 9 2001 14.07.2005 30
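The relation between the "complex" records can be made explicit by grouping rows that share the same leading fields. This is a minimal sketch: the source does not name the columns, so treating the first four fields as a shared key, the fifth as a date and the sixth as a measurement is an assumption made only for illustration.

```python
from collections import defaultdict

# The three "complex" example records from the slide. Column meanings are
# not given in the source; the roles assumed here are hypothetical.
RAW = """\
1 7 9 2001 10.05.2005 23
1 7 9 2001 17.06.2005 34
1 7 9 2001 14.07.2005 30
"""

def group_by_key(text, key_width=4):
    """Group records that share their first `key_width` fields."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        key = tuple(fields[:key_width])
        date = fields[key_width]
        value = float(fields[key_width + 1])
        groups[key].append((date, value))
    return dict(groups)

groups = group_by_key(RAW)
for key, measurements in groups.items():
    print(key, measurements)
```

All three rows fall into one group, which is what distinguishes them from the unrelated "simple" records.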
DATA: simple
Weight gain of pigs
IND P.0 P.132 P.265 P.397 P.530
346 0.2999 1.3938 4.047 8.9365 14.4663
347 0.4265 1.9578 6.6809 15.9458 27.3269
348 0.4991 2.0284 6.0664 13.7166 22.7103
349 0.1739 1.2515 4.4695 11.0793 18.7735
350 0.3712 1.8365 5.9575 14.4277 23.8408
351 0.2727 1.3336 3.9884 8.7238 14.138
352 1.1542 3.7294 9.8721 20.2459 32.292
353 0.3175 1.7614 5.678 13.824 22.7556
354 0.1726 1.2156 4.464 11.2814 19.679
355 0.6935 2.8703 8.4873 19.1791 30.8544
356 0.5498 2.3433 7.2887 17.2022 28.4123
357 0.7276 2.5778 7.4177 16.2656 25.7423
358 0.5879 2.3876 7.0633 17.2328 28.7312
359 0.4806 2.339 7.7452 18.9444 31.8284
360 0.481 2.2166 7.087 17.0398 27.9577
361 0.2769 1.66 5.6707 14.9897 25.8092
362 0.7281 2.6245 7.3139 16.0735 26.359
363 0.3418 1.6791 5.6198 13.568 22.6985
364 0.3764 1.7024 5.2701 12.5866 21.5353
365 0.5849 2.1908 6.2308 13.3812 21.5758
DATA: complex
INSEMIK data base
Bull name | Bull ID | Semen production date | Vol 1 | Rm 1 | Rpl 1 | Conc 1 | Culling reason 1 | Vol 2 | Rm 2 | Rpl 2 | Conc 2 | Culling reason 2 | No of straws | Rpl after thawing
BEGONI | PL005053309889 | 13/08/2004 | 6 | 0 | 0 | 0 | NBR | 7 | 0 | 0 | 0 | NBR | 0 | 0
BEGONI | PL005053309889 | 16/08/2004 | 3.5 | 2 | 70 | 1104 | | 5 | 0 | 0 | 0 | NBR | 0 | 20
BEGONI | PL005053309889 | 19/08/2004 | 7 | 2 | 70 | 553 | | 10 | 0 | 0 | 0 | NBR | 0 | 30
BEGONI | PL005053309889 | 23/08/2004 | 3 | 2 | 70 | 1503 | | 5.5 | 3 | 80 | 1259 | | 676 | 60
BEGONI | PL005053309889 | 26/08/2004 | 10 | 3 | 80 | 1701 | | 4.5 | 2 | 70 | 1034 | | 1305 | 50
BEGONI | PL005053309889 | 02/09/2004 | 3.5 | 3 | 80 | 1353 | | 6 | 1 | 30 | 0 | NBR | 280 | 60
BEGONI | PL005053309889 | 06/09/2004 | 6 | 3 | 80 | 1638 | | 7 | 2 | 70 | 749 | | 925 | 50
BEGONI | PL005053309889 | 09/09/2004 | 4 | 2 | 70 | 908 | | 5.5 | 3 | 80 | 1290 | | 657 | 60
BEGONI | PL005053309889 | 13/09/2004 | 10 | 3 | 80 | 2164 | | 6.5 | 3 | 80 | 1433 | | 1925 | 50
BEGONI | PL005053309889 | 16/09/2004 | 6.5 | 2 | 70 | 560 | | 0 | 0 | 0 | 0 | | 167 | 50
DATA EDITING (cleaning) → data editing software
QUANTIFICATION
EDITING
PREPROCESSING
EDITING DATA
1. Editing data (cleaning step I) → deleting erroneous data
1 7 8 34.0 11 12 12 22
2 7 5 31.7 11 22 12 12
3 6 8 371 11 11 22 11
2. Preprocessing (cleaning step II) → deleting unnecessary
information = noise
1 7 8 34.0 11 12 12 22
2 7 5 31.7 11 22 12 12
3 6 8 371 11 11 22 11
3. Statistical analysis of data
• Standard methods
• New methods
• Data mining
4. Conclusions !!!
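The two cleaning steps can be scripted instead of being done by hand. The sketch below uses the three example records above. It assumes, hypothetically, that the fourth field is a measurement expected to lie between 0 and 100 and that only the first four fields are needed for the analysis; the source does not name the columns or state a valid range.

```python
# Example records from the slide. Column meanings are not given in the
# source, so the roles assigned below are assumptions for illustration.
RAW = """\
1 7 8 34.0 11 12 12 22
2 7 5 31.7 11 22 12 12
3 6 8 371 11 11 22 11
"""

VALUE_COL = 3               # assumed: fourth field holds the measurement
VALID_RANGE = (0.0, 100.0)  # assumed plausible range (hypothetical)
KEEP_COLS = [0, 1, 2, 3]    # assumed: the remaining fields are noise

def clean(text):
    """Step I: set aside records with out-of-range values.
    Step II: keep only the columns needed for the analysis."""
    kept, rejected = [], []
    for line in text.splitlines():
        fields = line.split()
        value = float(fields[VALUE_COL])
        if VALID_RANGE[0] <= value <= VALID_RANGE[1]:
            kept.append([fields[i] for i in KEEP_COLS])
        else:
            rejected.append(fields)
    return kept, rejected

kept, rejected = clean(RAW)
print("kept:", kept)
print("rejected:", rejected)
```

Rejected records are returned rather than silently discarded, so they can be inspected before the statistical analysis.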
1. Editing means e.g. removing non-informative data, modifying the data format, etc.
2. Manual data modification generates errors
3. … is a waste of time
4. … is impossible for many data sets because of their size
5. Use programs for data editing
DO NOT MODIFY DATA MANUALLY
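As an illustration of a data format modification done by a program rather than by hand, the sketch below rewrites dates from day.month.year to ISO format. It reuses the record layout of the earlier "complex" example and assumes the fifth field is the date; the function name and the choice of output format are hypothetical.

```python
from datetime import datetime

def reformat_dates(lines, date_col=4):
    """Rewrite the date field from DD.MM.YYYY to ISO YYYY-MM-DD.

    An invalid date raises ValueError instead of passing through unnoticed.
    """
    out = []
    for line in lines:
        fields = line.split()
        parsed = datetime.strptime(fields[date_col], "%d.%m.%Y")
        fields[date_col] = parsed.strftime("%Y-%m-%d")
        out.append(" ".join(fields))
    return out

records = [
    "1 7 9 2001 10.05.2005 23",
    "1 7 9 2001 17.06.2005 34",
    "1 7 9 2001 14.07.2005 30",
]
edited = reformat_dates(records)
print("\n".join(edited))
```

The same program can be rerun unchanged whenever the input file is updated, which is not true of manual edits.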
1. Use informative program names
2. Write extensive comments / documentation
3. Documentation contains:
• Goal of the program
• Date of the last modification
• Input / output file names and locations
• Input / output file record layouts
• Written in English
4. Automate the execution of particular programs
MAKE DOCUMENTATION
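One possible way to keep the documentation with the program is a header at the top of the source file that lists the items named above. The sketch below is only a template: the program name, file paths, record layout and missing-value codes are all hypothetical.

```python
"""
recode_missing.py  (hypothetical program name)

Goal:           Replace missing-value codes in a test-day yield file.
Last modified:  YYYY-MM-DD (update on every change)
Input file:     data/yields_raw.txt    (hypothetical path)
Output file:    data/yields_clean.txt  (hypothetical path)
Input layout:   animal_id  test_date  milk_yield  (whitespace separated)
Output layout:  same as input, missing values written as NA
"""

MISSING_CODES = {"-9", "."}  # assumed codes for missing data

def recode(line):
    """Return one record with missing-value codes replaced by 'NA'."""
    return " ".join("NA" if f in MISSING_CODES else f for f in line.split())

if __name__ == "__main__":
    # Printing the header makes the documentation available on demand.
    print(__doc__)
    print(recode("942 10.03.2004 -9"))
```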
EXAMPLE OF WORKING DOCUMENTATION
--------------------- running programmes
ASReml -NS9 /home/szyda/MICROARRAY/PROGRAMS/fixed.as
nohup R --vanilla < directslve.R > directslve.log &
--------------------- file extensions try:
*.as -> giv variance=0.01
*.as2 -> more sparse covariance matrix
*.as3 -> giv variance=0.03 noiter FINISHED
*.as4 -> using diagonal covariance matrix
*.as5 -> giv variance=0.03 noiter FINISHED
*.as6 -> giv variance=0.03 noiter FINISHED
*.as7 -> giv variance=0.03 noiter
--------------------- sequence of analysing data
1. readmatrix-1.f <- macierzN.csv -> fort.macierzN
2. readmatrix.sas <- kody.txt + fort.macierzN -> all_col.macierzN
3. readgenematrix1.f <- macierzN.csv + all_col.macierzN -> genecovN.txt
4. readgenematrix0.f <- genecovN.txt -> outputs on a terminal min/max value of the matrix
5. readgenematrix3.f <- genecovN.txt -> genecovN_100.full
6. directslve.R <- genecovN_100.full -> genecovN_100.inv1
7. readinversematrix1.f <- genecovN_100.inv1 -> genecovN_100.giv
8. readmatrix1.sas <- all_col.macierzN + betweenarrayM.out1 -> betweenarrays.out1.GID
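The documented sequence can also be kept in machine-readable form, which is a first step toward automated execution. The sketch below does not run any program; it only checks that every step reads files that are available at that point. Treating the three files that no step produces as raw inputs is an assumption, not something stated in the source.

```python
# (program, input files, output files) copied from the listing above.
STEPS = [
    ("readmatrix-1.f", ["macierzN.csv"], ["fort.macierzN"]),
    ("readmatrix.sas", ["kody.txt", "fort.macierzN"], ["all_col.macierzN"]),
    ("readgenematrix1.f", ["macierzN.csv", "all_col.macierzN"], ["genecovN.txt"]),
    ("readgenematrix0.f", ["genecovN.txt"], []),  # prints to the terminal only
    ("readgenematrix3.f", ["genecovN.txt"], ["genecovN_100.full"]),
    ("directslve.R", ["genecovN_100.full"], ["genecovN_100.inv1"]),
    ("readinversematrix1.f", ["genecovN_100.inv1"], ["genecovN_100.giv"]),
    ("readmatrix1.sas", ["all_col.macierzN", "betweenarrayM.out1"],
     ["betweenarrays.out1.GID"]),
]

# Assumed raw inputs: files that appear only on the input side.
RAW_INPUTS = {"macierzN.csv", "kody.txt", "betweenarrayM.out1"}

def check_order(steps, raw_inputs):
    """Return (program, file) pairs where a step needs a file not yet produced."""
    available = set(raw_inputs)
    problems = []
    for program, inputs, outputs in steps:
        for name in inputs:
            if name not in available:
                problems.append((program, name))
        available.update(outputs)
    return problems

problems = check_order(STEPS, RAW_INPUTS)
print(problems)
```

An empty result means the documented order is consistent; running the same check on a reordered list reports the steps whose inputs are missing.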
EDITING DATA – example tools
1. SAS
• User-friendly, all operating systems
2. Perl (BioPerl) and Python
• Very popular in bioinformatics, quite simple, all
operating systems, free:
• www.perl.org
• www.python.org
3. Fortran
• Popular in numerical calculations, elegant, simple, all
operating systems
4. Other programming languages: R, C, C++, Java
MANAGING DATA
DATABASE = an archived data set
LARGE FILES, DYNAMIC CHANGES
Database tasks:
• STORAGE
• ORGANIZING
• TESTING FOR ERRORS
• DISTRIBUTION
Database organization
• RECORDS = "rows"
• FIELDS
Database searching:
• SPECIFICATION OF SEARCH CRITERIA FOR FIELDS
• = INQUIRY (QUERY)
Copyright ©2014 Joanna Szyda
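Searching a database by specifying criteria for fields can be sketched with plain Python records. The `query` helper and the field names below are hypothetical; the two records reuse values from the COWS example table later in these slides.

```python
# Each record is one row; each key is one field.
RECORDS = [
    {"lab_no": 942, "birth_day": 15, "birth_month": 3, "birth_year": 2001, "breeder": 2},
    {"lab_no": 943, "birth_day": 23, "birth_month": 10, "birth_year": 2003, "breeder": 7},
]

def query(records, **criteria):
    """Return the records whose fields satisfy every criterion.

    A criterion is either a plain value (tested for equality) or a
    function that takes the field value and returns True or False.
    """
    def matches(record):
        for field, wanted in criteria.items():
            value = record[field]
            ok = wanted(value) if callable(wanted) else value == wanted
            if not ok:
                return False
        return True
    return [r for r in records if matches(r)]

born_after_2002 = query(RECORDS, birth_year=lambda year: year > 2002)
print(born_after_2002)
```

Usage with an equality criterion: `query(RECORDS, breeder=2)` returns the record for animal 942.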
1. Large data sets require data management using data
bases
2. The dynamics of updating data sets requires ongoing
management and control of the correctness of the
incoming data
3. Before analyzing the data - remove errors
4. Browsing and visualization of data (Excel, Notepad) are
often impossible
5. Different statistical packages used for analysis of data
require different input formats
6. A well designed and documented database and editing
programs can be reused for other analyses
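Point 5 is one reason to generate input files by program. The sketch below writes the same records in two layouts, comma-separated with a header and fixed-width without one. Which package needs which layout is not stated in the source; the field names and column widths are hypothetical, and the values come from the YIELDS example table later in these slides.

```python
import csv
import io

FIELDS = ["lab_no", "milk_yield", "fat_pct"]     # hypothetical field names
ROWS = [(942, 13, 5.1), (942, 19, 4.6), (943, 31, 4.0)]

def to_csv(rows):
    """Comma-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buf.getvalue()

def to_fixed_width(rows):
    """Fixed-width text, six characters per field, no header."""
    return "".join(f"{lab:>6d}{milk:>6d}{fat:>6.1f}\n" for lab, milk, fat in rows)

print(to_csv(ROWS))
print(to_fixed_width(ROWS))
```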
DATA BASE – key features
WHAT IS A DATABASE?
1. Set of archived data (electronic)
WHAT ARE THE KEY FEATURES OF A DATABASE?
1. Error Checking
2. Basic manipulation of data
3. Providing data
DATA BASE – key concepts
1. Field
• A single piece of information, e.g. individual code, year of birth
2. Record
• One row of data: a group of fields containing information about the same entity, e.g. the same individual, herd, etc.
3. Table
• A collection of data with defined columns = fields and rows = records
4. Relation
• A connection between tables
DATA BASE – types
SIMPLE
• A single data table or several independent tables
RELATIONAL
• Many data tables related to each other through indices
DATA BASE – simple
nr lab Individual ID abcg2 lepR btn3 btn1 btn2 dgat1 lep2a lep3 lept1 lep7
942 PL005006200324 1 3 2 1 2 2 2 1 1 2
943 PL005006200355 1 3 2 1 3 2 1 1 1 3
944 PL005006200416 1 3 2 1 3 2 3 3 1 1
945 PL005006200423 1 3 2 2 2 2 2 2 1 2
947 PL005006800463 1 3 2 1 2 1 3 2 1 1
948 PL005001502973 1 3 3 2 3 1 3 3 1 1
949 PL005001503178 1 2 1 1 3 1 2 1 1 2
DATA BASE – relational
TABLE: GENOTYPES
nr lab individual ID abcg2 lepR btn3 btn1 btn2 dgat1 lep2a lep3 lept1 lep7
942 PL005006200324 1 3 2 1 2 2 2 1 1 2
943 PL005006200355 1 3 2 1 3 2 1 1 1 3
TABLE: COWS
nr lab birth day birth month birth year breeder
942 15 03 2001 2
943 23 10 2003 7
TABLE: YIELDS
nr lab herd test day day test day month test day year milk yield % fat
942 1 10 3 2004 13 5.1
942 1 15 4 2004 19 4.6
…
943 2 01 3 2006 31 4.0
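The three tables can be linked through the shared "nr lab" column. The sketch below builds them in an in-memory SQLite database using Python's standard `sqlite3` module and joins them in one query. The SQL column names are hypothetical renderings of the slide headers, only the first two marker columns of GENOTYPES are stored for brevity, and the rows hidden behind "…" in YIELDS are not filled in.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cows (lab_no INTEGER PRIMARY KEY, birth_day INTEGER,
                   birth_month INTEGER, birth_year INTEGER, breeder INTEGER);
CREATE TABLE genotypes (lab_no INTEGER PRIMARY KEY REFERENCES cows(lab_no),
                        individual_id TEXT, abcg2 INTEGER, lepR INTEGER);
CREATE TABLE yields (lab_no INTEGER REFERENCES cows(lab_no), herd INTEGER,
                     test_day INTEGER, test_month INTEGER, test_year INTEGER,
                     milk_yield REAL, fat_pct REAL);
""")

# Rows copied from the example tables above.
con.executemany("INSERT INTO cows VALUES (?,?,?,?,?)",
                [(942, 15, 3, 2001, 2), (943, 23, 10, 2003, 7)])
con.executemany("INSERT INTO genotypes VALUES (?,?,?,?)",
                [(942, "PL005006200324", 1, 3), (943, "PL005006200355", 1, 3)])
con.executemany("INSERT INTO yields VALUES (?,?,?,?,?,?,?)",
                [(942, 1, 10, 3, 2004, 13, 5.1),
                 (942, 1, 15, 4, 2004, 19, 4.6),
                 (943, 2, 1, 3, 2006, 31, 4.0)])

# One row per test day, with the animal's ID and birth year attached.
rows = con.execute("""
    SELECT g.individual_id, c.birth_year, y.test_year, y.milk_yield
    FROM yields AS y
    JOIN cows AS c ON c.lab_no = y.lab_no
    JOIN genotypes AS g ON g.lab_no = y.lab_no
    ORDER BY y.lab_no, y.test_year, y.test_month, y.test_day
""").fetchall()
for row in rows:
    print(row)
```

Storing birth data and genotypes once per animal, and yields once per test day, avoids repeating the animal information on every yield record.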
DATA BASE – types of relations
1. One-to-one
• Table with genotypes ↔ table with pedigree
2. One-to-many
• Table with genotypes ↔ table with test day yields
3. Many-to-many
• Table with test day yields ↔ table with herd info
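The type of relation between two tables follows from how often each key value occurs on either side. The sketch below classifies a relation from the two key columns alone. The function name is hypothetical, the pedigree keys are assumed to contain one row per animal, and the herd-info keys are invented to show a table with several rows per herd.

```python
from collections import Counter

def relation_type(left_keys, right_keys):
    """Classify a relation from the key columns of two tables."""
    left_max = max(Counter(left_keys).values())
    right_max = max(Counter(right_keys).values())
    if left_max == 1 and right_max == 1:
        return "one-to-one"
    if left_max == 1 or right_max == 1:
        return "one-to-many"
    return "many-to-many"

genotype_keys = [942, 943]          # one genotype row per animal
pedigree_keys = [942, 943]          # assumed: one pedigree row per animal
yield_keys = [942, 942, 943]        # several test days per animal
yield_herds = [1, 1, 2]             # herd column of the yields table
herd_info_herds = [1, 1, 2, 2]      # hypothetical: several info rows per herd

print(relation_type(genotype_keys, pedigree_keys))
print(relation_type(genotype_keys, yield_keys))
print(relation_type(yield_herds, herd_info_herds))
```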
DATA BASE – example data base tools
1. MS Excel
• Well-known tool, simple databases, Windows
2. MS Access
• User-friendly, component of MS Office Professional, Windows
3. MySQL
• Relatively user-friendly, free: http://dev.mysql.com/, Windows + Linux
4. SAS
• Professional data management package, very expensive, all operating systems
5. Oracle
• Widely used professional package, all operating systems
DATA BASE – Excel example
1. Open the data in TextPad (produkcja.tx)
2. Open in Excel
3. Create a simple database
• Name the columns
• Describe the column names in the next sheet = create documentation
• Recode missing data (find - replace)
• Set a filter on the column with animal ID
• Set a numeric filter (e.g. above average)
• Highlight the selected data with color (conditional formatting)
• Define data validation criteria
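For comparison, the same operations can be sketched outside Excel. The contents of the production file are not shown in the source, so the records, the column names, the missing-value code -9 and the validation rule below are all hypothetical.

```python
from statistics import mean

# Hypothetical records standing in for the production file.
COLUMNS = ["animal_id", "milk_yield"]            # name the columns
RAW = [("A1", 13.0), ("A2", -9), ("A3", 31.0), ("A1", 19.0)]

rows = [dict(zip(COLUMNS, record)) for record in RAW]

# Recode missing data (the find - replace step); -9 is an assumed code.
for row in rows:
    if row["milk_yield"] == -9:
        row["milk_yield"] = None

# Filter on the animal ID column.
animal_a1 = [r for r in rows if r["animal_id"] == "A1"]

# Numeric filter: above average, ignoring missing values.
known = [r["milk_yield"] for r in rows if r["milk_yield"] is not None]
average = mean(known)
above_average = [r for r in rows
                 if r["milk_yield"] is not None and r["milk_yield"] > average]

# Data validation: assumed rule that yields lie above 0 and up to 100.
invalid = [r for r in rows
           if r["milk_yield"] is not None and not 0 < r["milk_yield"] <= 100]

print(animal_a1, above_average, invalid, sep="\n")
```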