COMPUTER SCIENCE APPLIED TO ANIMAL BREEDING

• Structure of an effective data base

• Example data base in Excel

1. Data

• Examples

• Editing

• Management

2. Data bases

• Definitions

• Examples

3. MS Excel

• Examples of (useful) data base functions

INTRODUCTION

Copyright ©2013, Joanna Szyda

DATA: data structure

SIMPLE

• Data records not related

e.g.

1 7 9 5697 4.1
2 7 8 4890 3.8
3 6 5 7321 3.5

COMPLEX

• Relation between records

e.g.

1 7 9 2001 10.05.2005 23
1 7 9 2001 17.06.2005 34
1 7 9 2001 14.07.2005 30
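The difference between the two structures can be sketched in Python. This is only an illustration: the slide does not name the columns, so the code assumes that the first four values of each complex record identify the same individual and that the last two are a date and a measurement. The function name `group_related` and the parameter `key_width` are made up for this example.

```python
from collections import defaultdict

# Rows from the "complex" example. The first four values repeat, so the
# three rows are related (assumed: they describe the same individual).
complex_records = [
    "1 7 9 2001 10.05.2005 23",
    "1 7 9 2001 17.06.2005 34",
    "1 7 9 2001 14.07.2005 30",
]


def group_related(records, key_width=4):
    """Group rows that share the same leading key fields."""
    groups = defaultdict(list)
    for line in records:
        fields = line.split()
        key = tuple(fields[:key_width])
        groups[key].append(tuple(fields[key_width:]))
    return dict(groups)


grouped = group_related(complex_records)
for key, rows in grouped.items():
    print(key, "->", len(rows), "related rows")
```

In the simple structure every row would form its own group, because no key is shared between rows.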

DATA: simple

Weight gain of pigs

IND  P.0     P.132   P.265   P.397    P.530
346  0.2999  1.3938  4.047   8.9365   14.4663
347  0.4265  1.9578  6.6809  15.9458  27.3269
348  0.4991  2.0284  6.0664  13.7166  22.7103
349  0.1739  1.2515  4.4695  11.0793  18.7735
350  0.3712  1.8365  5.9575  14.4277  23.8408
351  0.2727  1.3336  3.9884  8.7238   14.138
352  1.1542  3.7294  9.8721  20.2459  32.292
353  0.3175  1.7614  5.678   13.824   22.7556
354  0.1726  1.2156  4.464   11.2814  19.679
355  0.6935  2.8703  8.4873  19.1791  30.8544
356  0.5498  2.3433  7.2887  17.2022  28.4123
357  0.7276  2.5778  7.4177  16.2656  25.7423
358  0.5879  2.3876  7.0633  17.2328  28.7312
359  0.4806  2.339   7.7452  18.9444  31.8284
360  0.481   2.2166  7.087   17.0398  27.9577
361  0.2769  1.66    5.6707  14.9897  25.8092
362  0.7281  2.6245  7.3139  16.0735  26.359
363  0.3418  1.6791  5.6198  13.568   22.6985
364  0.3764  1.7024  5.2701  12.5866  21.5353
365  0.5849  2.1908  6.2308  13.3812  21.5758

DATA: complex

INSEMIK data base

Bull name  Bull ID         Semen production date  Vol 1  Rm 1  Rpl 1  Conc 1  Culling reason 1  Vol 2  Rm 2  Rpl 2  Conc 2  Culling reason 2  No of straws  Rpl after thawing
BEGONI     PL005053309889  13/08/2004             6      0     0      0       NBR               7      0     0      0       NBR               0             0
BEGONI     PL005053309889  16/08/2004             3.5    2     70     1104                      5      0     0      0       NBR               0             20
BEGONI     PL005053309889  19/08/2004             7      2     70     553                       10     0     0      0       NBR               0             30
BEGONI     PL005053309889  23/08/2004             3      2     70     1503                      5.5    3     80     1259                      676           60
BEGONI     PL005053309889  26/08/2004             10     3     80     1701                      4.5    2     70     1034                      1305          50
BEGONI     PL005053309889  02/09/2004             3.5    3     80     1353                      6      1     30     0       NBR               280           60
BEGONI     PL005053309889  06/09/2004             6      3     80     1638                      7      2     70     749                       925           50
BEGONI     PL005053309889  09/09/2004             4      2     70     908                       5.5    3     80     1290                      657           60
BEGONI     PL005053309889  13/09/2004             10     3     80     2164                      6.5    3     80     1433                      1925          50
BEGONI     PL005053309889  16/09/2004             6.5    2     70     560                       0      0     0      0                         167           50

DATA EDITING (cleaning) → data editing software

QUANTIFICATION

EDITION

PREPROCESSING

EDITING DATA

1. Editing data (cleaning step I) → deleting erroneous data

1 7 8 34.0 11 12 12 22

2 7 5 31.7 11 22 12 12

3 6 8 371 11 11 22 11

2. Preprocessing (cleaning step II) → deleting unnecessary information = noise

1 7 8 34.0 11 12 12 22

2 7 5 31.7 11 22 12 12

3 6 8 371 11 11 22 11

3. Statistical analysis of data

• Standard methods

• New methods

• Data mining

4. Conclusions !!!
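Cleaning step I can be carried out by a small program rather than by hand. The sketch below is a minimal illustration, not a prescribed method: it assumes that the fourth value in each row is the measured trait and that values outside 0 to 100 are errors. Both the column position and the limits are hypothetical, chosen only so that the value 371 in the third row is flagged.

```python
# Rows from the slide; the meaning of the columns is not given there.
raw_rows = [
    "1 7 8 34.0 11 12 12 22",
    "2 7 5 31.7 11 22 12 12",
    "3 6 8 371 11 11 22 11",
]

TRAIT_COLUMN = 3                # assumed: the fourth value is the trait
PLAUSIBLE_RANGE = (0.0, 100.0)  # hypothetical limits, for illustration only


def split_by_plausibility(rows, column=TRAIT_COLUMN, limits=PLAUSIBLE_RANGE):
    """Return (kept, rejected) rows; the input list is left unchanged."""
    low, high = limits
    kept, rejected = [], []
    for row in rows:
        value = float(row.split()[column])
        if low <= value <= high:
            kept.append(row)
        else:
            rejected.append(row)
    return kept, rejected


kept, rejected = split_by_plausibility(raw_rows)
print("kept:", kept)
print("rejected:", rejected)
```

Returning the rejected rows instead of silently dropping them leaves a record of what was removed and why.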

EDITING DATA

1. E.g. removing non-informative data, data format modification, etc.

2. Manual data modification generates errors

3. … is a waste of time

4. … is impossible for many data sets because of size

5. Use programs for data editing

DO NOT MODIFY DATA MANUALLY

EDITING DATA

1. Use informative program names

2. Write extensive comments / documentation

3. Documentation contains:

• Goal of the program

• Date of the last modification

• Input / output file names and locations

• Input / output file record layouts

• Written in English

4. Automated execution of particular programs

MAKE DOCUMENTATION
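One way to keep the documentation next to the program is a header comment that lists the items above. The following Python sketch shows such a header on a tiny editing program. The file names, the record layout and the missing-value code -999 are hypothetical placeholders, not taken from the lecture.

```python
"""
Goal:              remove records with a missing trait value
Last modification: YYYY-MM-DD (update on every change)
Input file:        data/raw_records.txt    (hypothetical name and location)
Output file:       data/clean_records.txt  (hypothetical name and location)
Input layout:      animal_id trait (two whitespace-separated fields per line)
Output layout:     same as input; rows whose trait equals MISSING are dropped
"""

MISSING = "-999"  # hypothetical missing-value code


def clean(lines):
    """Keep only lines whose second field is not the missing-value code."""
    return [line for line in lines if line.split()[1] != MISSING]


if __name__ == "__main__":
    # Made-up rows, used only to show the call.
    example = ["A1 12.5", "A2 -999", "A3 20.0"]
    print(clean(example))
```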

EDITING DATA

EXAMPLE OF WORKING DOCUMENTATION

--------------------- running programmes
ASReml -NS9 /home/szyda/MICROARRAY/PROGRAMS/fixed.as
nohup R --vanilla < directslve.R > directslve.log &
--------------------- file extensions
try:
*.as  -> giv variance=0.01
*.as2 -> more sparse covariance matrix
*.as3 -> giv variance=0.03 noiter FINISHED
*.as4 -> using diagonal covariance matrix
*.as5 -> giv variance=0.03 noiter FINISHED
*.as6 -> giv variance=0.03 noiter FINISHED
*.as7 -> giv variance=0.03 noiter
--------------------- sequence of analysing data
1. readmatrix-1.f <- macierzN.csv -> fort.macierzN
2. readmatrix.sas <- kody.txt + fort.macierzN -> all_col.macierzN
3. readgenematrix1.f <- macierzN.csv + all_col.macierzN -> genecovN.txt
4. readgenematrix0.f <- genecovN.txt -> outputs on a terminal min/max value of the matrix
5. readgenematrix3.f <- genecovN.txt -> genecovN_100.full
6. directslve.R <- genecovN_100.full -> genecovN_100.inv1
7. readinversematrix1.f <- genecovN_100.inv1 -> genecovN_100.giv
8. readmatrix1.sas <- all_col.macierzN + betweenarrayM.out1 -> betweenarrays.out1.GID
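A documented sequence like this can also be executed automatically. The sketch below is one possible way to do that in Python, under clear assumptions: the two steps are placeholders that only print a message, and the name `run_pipeline` is invented for this example. In real use each entry would hold the command line of one program from the documented sequence.

```python
import subprocess
import sys

# Placeholder steps: each one just prints a message. They stand in for the
# real programs, which would be listed here in the documented order.
STEPS = [
    [sys.executable, "-c", "print('step 1 done')"],
    [sys.executable, "-c", "print('step 2 done')"],
]


def run_pipeline(steps):
    """Run the steps one after another and collect what each one printed.

    check=True makes subprocess.run raise CalledProcessError when a step
    exits with an error, so later steps never run on bad input.
    """
    outputs = []
    for command in steps:
        result = subprocess.run(
            command, capture_output=True, text=True, check=True
        )
        outputs.append(result.stdout.strip())
    return outputs


if __name__ == "__main__":
    for line in run_pipeline(STEPS):
        print(line)
```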

EDITING DATA – example tools

1. SAS

• User-friendly, all operating systems

2. Perl (BioPerl) and Python

• Very popular in bioinformatics, quite simple, all operating systems, free:

• www.perl.org

• www.python.org

3. Fortran

• Popular in numerical calculations, elegant, simple, all operating systems

4. Other programming languages: R, C, C++, Java

MANAGING DATA

DATABASE an archived data set

LARGE FILES DYNAMIC CHANGES

Database tasks:

• STORAGE

• ORGANIZING

• TESTING FOR ERRORS

• DISTRIBUTION

Database organization

• RECORDS = "rows"

• FIELDS

Database searching:

• SPECIFICATION OF SEARCH CRITERIA FOR FIELDS = INQUIRY
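An inquiry of this kind can be sketched in a few lines of Python. The example is illustrative only: the records reuse three test-day rows that appear in the YIELDS table later in the lecture, the field names are written with underscores for convenience, and the function name `inquiry` is made up here.

```python
# Each record is one row; each dictionary key is one field.
records = [
    {"nr_lab": 942, "herd": 1, "milk_yield": 13, "fat_pct": 5.1},
    {"nr_lab": 942, "herd": 1, "milk_yield": 19, "fat_pct": 4.6},
    {"nr_lab": 943, "herd": 2, "milk_yield": 31, "fat_pct": 4.0},
]


def inquiry(rows, **criteria):
    """Return the rows for which every named field passes its test."""
    return [
        row
        for row in rows
        if all(test(row[field]) for field, test in criteria.items())
    ]


# One criterion on one field.
high_yield = inquiry(records, milk_yield=lambda v: v > 15)

# Two criteria on two fields; a row must satisfy both.
herd_one_high_fat = inquiry(
    records, herd=lambda v: v == 1, fat_pct=lambda v: v >= 5.0
)

print(high_yield)
print(herd_one_high_fat)
```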

1. Large data sets require data management using data bases

2. The dynamics of updating data sets requires ongoing management and control of the correctness of the incoming data

3. Remove errors before analyzing the data

4. Browsing and visualization of data (Excel, Notepad) are often impossible

5. Different statistical packages used for analysis of data require different input formats

6. A well designed and documented database and editing programs can be used for other analyses

MANAGING DATA

DATA BASE – key features

WHAT IS A DATABASE?

1. Set of archived data (electronic)

WHAT ARE THE KEY FEATURES OF A DATABASE?

1. Error Checking

2. Basic manipulation of data

3. Providing data

Copyright ©2014, Joanna Szyda

DATA BASE – key concepts

1. Field

• A single source of information, piece of data, e.g. individual code, year of birth

2. Record

• One row of data, a group of fields containing information about e.g. the same individual, herd, etc.

3. Table

• The collection of data in a table with defined columns = fields and rows = records

4. Relation

• Connection between tables

DATA BASE – types

SIMPLE

• A single data table or several independent tables

RELATIONAL

• Many data tables related to each other using indices

DATA BASE – simple

nr lab  Individual ID   abcg2  lepR  btn3  btn1  btn2  dgat1  lep2a  lep3  lept1  lep7
942     PL005006200324  1      3     2     1     2     2      2      1     1      2
943     PL005006200355  1      3     2     1     3     2      1      1     1      3
944     PL005006200416  1      3     2     1     3     2      3      3     1      1
945     PL005006200423  1      3     2     2     2     2      2      2     1      2
947     PL005006800463  1      3     2     1     2     1      3      2     1      1
948     PL005001502973  1      3     3     2     3     1      3      3     1      1
949     PL005001503178  1      2     1     1     3     1      2      1     1      2

DATA BASE – relational

TABLE: GENOTYPES

nr lab  individual ID   abcg2  lepR  btn3  btn1  btn2  dgat1  lep2a  lep3  lept1  lep7
942     PL005006200324  1      3     2     1     2     2      2      1     1      2
943     PL005006200355  1      3     2     1     3     2      1      1     1      3

TABLE: COWS

nr lab  birth day  birth month  birth year  breeder
942     15         03           2001        2
943     23         10           2003        7

TABLE: YIELDS

nr lab  herd  test day day  test day month  test day year  milk yield  % fat
942     1     10            3               2004           13          5.1
942     1     15            4               2004           19          4.6
943     2     01            3               2006           31          4.0
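The three tables can be linked through the shared field nr lab. The sketch below shows the idea with Python's built-in sqlite3 module and an in-memory database. It is an illustration under assumptions, not the lecture's own setup: only a subset of the columns is created, the column names are rewritten with underscores, and SQLite is used simply because it needs no installation.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # temporary database, nothing is saved
con.executescript("""
CREATE TABLE genotypes (nr_lab INTEGER PRIMARY KEY, individual_id TEXT, abcg2 INTEGER);
CREATE TABLE cows      (nr_lab INTEGER PRIMARY KEY, birth_year INTEGER, breeder INTEGER);
CREATE TABLE yields    (nr_lab INTEGER, herd INTEGER, test_year INTEGER,
                        milk_yield REAL, fat_pct REAL);
""")

# Values copied from the slide (subset of columns only).
con.executemany(
    "INSERT INTO genotypes VALUES (?, ?, ?)",
    [(942, "PL005006200324", 1), (943, "PL005006200355", 1)],
)
con.executemany(
    "INSERT INTO cows VALUES (?, ?, ?)",
    [(942, 2001, 2), (943, 2003, 7)],
)
con.executemany(
    "INSERT INTO yields VALUES (?, ?, ?, ?, ?)",
    [(942, 1, 2004, 13, 5.1), (942, 1, 2004, 19, 4.6), (943, 2, 2006, 31, 4.0)],
)

# Follow the relation: every test-day record is joined to the matching
# genotype and cow rows through nr_lab.
rows = con.execute("""
    SELECT g.individual_id, c.birth_year, y.test_year, y.milk_yield
    FROM yields AS y
    JOIN genotypes AS g ON g.nr_lab = y.nr_lab
    JOIN cows      AS c ON c.nr_lab = y.nr_lab
    ORDER BY y.nr_lab, y.milk_yield
""").fetchall()

for row in rows:
    print(row)
```

Each animal's ID and birth year are stored once, yet they appear next to every one of its test-day records in the query result.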

DATA BASE – types of relations

1. One-to-one

• Table with genotypes ↔ table with pedigree

2. One-to-many

• Table with genotypes ↔ table with test day yields

3. Many-to-many

• Table with test day yields ↔ table with herd info
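The type of a relation can be read off from how often each key value occurs in the two tables. The Python sketch below is a rough heuristic, not a formal definition: it assumes both key lists are non-empty and looks only at repeated keys. The key values are the nr lab entries of the GENOTYPES, COWS and YIELDS tables shown earlier, with COWS standing in for the pedigree table, which the lecture does not show. The function name `relation_type` is invented for this example.

```python
from collections import Counter


def relation_type(left_keys, right_keys):
    """Classify the link between two tables from their key columns.

    Assumes both lists are non-empty. A key that occurs more than once
    in a table puts that table on the "many" side of the relation.
    """
    left_repeats = max(Counter(left_keys).values()) > 1
    right_repeats = max(Counter(right_keys).values()) > 1
    if left_repeats and right_repeats:
        return "many-to-many"
    if left_repeats or right_repeats:
        return "one-to-many"
    return "one-to-one"


genotypes_keys = [942, 943]    # one row per animal
cows_keys = [942, 943]         # one row per animal
yields_keys = [942, 942, 943]  # several test days per animal

print(relation_type(genotypes_keys, cows_keys))
print(relation_type(genotypes_keys, yields_keys))
```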

DATA BASE – example data base tools

1. MS Excel

• Known tool, simple databases, Windows

2. MS Access

• User-friendly, component of MS Office Professional, Windows

3. MySQL

• Relatively user-friendly, free: http://dev.mysql.com/, Windows + Linux

4. SAS

• Professional data management package, very expensive, all operating systems

5. Oracle

• Widely used professional package, all operating systems

DATA BASE – Excel example

1. Open the data in TextPad (produkcja.tx)

2. Open in Excel

3. Create a simple database

• Name the columns

• In the next sheet, describe the column names = create documentation

• Recode missing data (find - replace)

• Set a filter on the column with animal ID

• Set a numeric filter (e.g. Above average)

• Color the selected data (conditional formatting)

• Define data validation criteria
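Some of these steps can also be written as a short program, which makes them repeatable. The Python sketch below mirrors three of them: recoding missing data, a numeric "above average" filter, and data validation. Everything in it is hypothetical: the rows, the missing-value codes and the validation limits are invented for illustration, and the contents of the lecture's production file are not reproduced.

```python
from statistics import mean

MISSING_CODES = {"", ".", "-999"}  # hypothetical codes for missing values

# Made-up rows standing in for the data file: (animal ID, trait value).
rows = [("A1", "12.5"), ("A2", "-999"), ("A3", "20.0"), ("A4", "15.5")]


def recode_missing(value):
    """'Find and replace': map every missing-value code to None."""
    return None if value in MISSING_CODES else float(value)


def is_valid(value, low=0.0, high=100.0):
    """Data validation: accept missing values and numbers within limits."""
    return value is None or low <= value <= high


table = [(animal, recode_missing(raw)) for animal, raw in rows]
invalid = [animal for animal, value in table if not is_valid(value)]

# Numeric filter: animals whose value lies above the average of known values.
known = [value for _, value in table if value is not None]
average = mean(known)
above_average = [
    animal for animal, value in table
    if value is not None and value > average
]

print("invalid:", invalid)
print("average:", average)
print("above average:", above_average)
```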

DATA EDITING & DATA MANAGEMENT

nr lab  individual ID
942     PL005006200324
943     PL005006200355

nr lab
942
943