Genomic selection, Valencia, UPV, April 2015: exercises

A. Legarra, JP Sánchez, ZG Vitezica

May 28, 2016

genoweb.toulouse.inra.fr/~alegarra/lab_all.pdf


Miscellaneous notes for good use of the practices

12/5/2015

Files for the exercises can be downloaded from http://genoweb.toulouse.inra.fr/~alegarra/course.tar.gz

28/3/2015

Exercises are conceived to be executed in a console of a Unix system or similar (Linux or Mac). In Windows, I recommend using MobaXterm, where almost the same utilities can be found. An alternative is a virtual machine installed on your Windows machine. If you use a server from a Windows machine, I recommend MobaXterm to handle the connections (PuTTY + Xming is another option).

Manipulating SNPs in Windows is difficult and very inefficient. Try to learn and use the console shell and Unix tools. Also learn at least one true computing language to handle these data sets (Fortran, C, Python…). Then you can use MobaXterm for Windows, which is a “Linux” console emulator.

To visualize the results (e.g., plot marker effects, or compute accuracy from cross-validation), one could use SAS (maybe Excel, but with 50,000 markers it starts being uncomfortable). However, this is not available everywhere and we use R.

Your “home” directory is /home/username and is usually abbreviated as “~”.

First of all, you need the right tools to do the exercises:

R, awk or gawk, and a Fortran compiler should be installed by the system administrator. A free compiler is gfortran.

QMSim, gs3, and the blupf90 suite (blupf90, remlf90, airemlf90, preGSf90, renumf90) should be downloaded from their respective web sites and then put into the /home/username/bin folder. This folder should be in your PATH (try “echo $PATH” at the command line). If not, include something like

    if [ -d "$HOME/bin" ] ; then
        PATH="$HOME/bin:$PATH"
    fi

in the file .bash_profile in /home/username; or, any time you call a program, use the complete path, e.g. /home/username/bin/blupf90.

This is (or should be) already in place for the course, but do not forget to do it “at home”.

If something is not working, maybe .bash_profile has not been correctly read. Try “source ~/.bash_profile”.

When you transfer files between Unix/Mac/Windows, the end of lines is different (Unix and Mac differ from Windows). This can be avoided by using flip (https://ccrma.stanford.edu/~craig/utility/flip/, for any system), dos2unix (for Unix machines), or Notepad++ (for Windows machines).
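If none of these utilities is at hand, the end-of-line conversion mentioned above can be sketched in a few lines of Python. This is only an illustration of what such tools do: the function names and the toy pedigree lines are assumptions, not part of the course material.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and lone CR line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(path_in: str, path_out: str) -> None:
    """Read a file in binary mode and write a copy with Unix line endings."""
    with open(path_in, "rb") as handle:
        data = handle.read()
    with open(path_out, "wb") as handle:
        handle.write(to_unix_newlines(data))


# Toy pedigree lines (hypothetical) as they would arrive from a Windows machine.
sample = b"1 0 0\r\n2 0 0\r\n3 1 2\r\n"
print(to_unix_newlines(sample))
```

Working in binary mode avoids any automatic newline translation by Python itself, so the file is changed only where a carriage return is found.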

Visual editors are gedit and nedit, and command-line editors are vim and joe. The server has (implicitly) a QWERTY keyboard; if you have a different one (e.g., a French AZERTY), try for instance “setxkbmap fr” to tell the server what your keyboard is.

Unix/Linux commands: reference guide
Spanish translation by Fran Delgado [http://kernelsource.org]

Working with files
ls: list directory contents
ls -al: long listing with attributes and hidden files
cd newdir/: change to directory newdir
cd: change to the home directory
pwd: show the current path
rm file: delete the file file
rm -r dir: delete the directory dir
rm -f file: delete file without prompting or error messages
rm -rf dir: same as above, but for the directory dir [**]
cp file1 file2: copy file1 to file2
cp -r dir1 dir2: copy dir1 to dir2 (created if it does not exist)
mv file1 file2: rename file1 as file2; if file2 is a directory, move file1 into it
ln -s file link: create a symbolic link named link pointing to file
touch file: create file or update its timestamp
cat > file: redirect standard input to file
more file: show the contents of file
head file: show the first 10 lines of file
tail file: show the last 10 lines of file
tail -f file: show the last 10 lines of file and keep printing as it grows

Process management
ps: show the user's active processes
top: show all running processes
kill pid: kill the process with id pid
killall proc: kill all processes named proc [**]
bg: list stopped or background jobs; resume a stopped job in the background
fg: bring the most recent job to the foreground
fg n: bring job n to the foreground

File permissions
chmod octal file: set the permissions of file to octal (user, group and others)
  4: read (r)
  2: write (w)
  1: execute (x)
Examples:
chmod 777: read/write/execute for everyone
chmod 755: rwx for the owner, rx for the group and others

SSH
ssh user@host: connect to host as user
ssh -p port user@host: connect to host on port port as user
ssh-copy-id user@host: add the key of user to host for authentication

Searching
grep pattern files: search for pattern in files
grep -r pattern dir: search recursively for pattern in dir
command | grep pattern: search for pattern in the output of command
locate file: find instances of file

System information
date: show the current date and time
cal: show the calendar of the current month
uptime: show how long the machine has been running
w: show the users logged in to the machine
whoami: show my user name
finger user: show information about user
uname -a: show kernel information
cat /proc/cpuinfo: CPU information
cat /proc/meminfo: memory information
man command: manual pages for command
df: free disk space
du: disk space used by directories
free: memory and swap usage
whereis app: locate the binary, source and manual page of app
which app: locate the command app

Compression
tar cf file.tar files: pack files into the archive file.tar
tar xf file.tar: extract the contents of file.tar
tar czf file.tar.gz files: pack and compress (gzip) files into file.tar.gz
tar xzf file.tar.gz: extract and decompress using gzip
tar cjf file.tar.bz2 files: pack and compress (bzip2) files into file.tar.bz2
tar xjf file.tar.bz2: extract and decompress using bzip2
gzip file: compress file and rename it file.gz
gzip -d file.gz: decompress file.gz back to file

Network
ping host: ping host and show the results
whois domain: information about the domain domain
dig domain: DNS configuration of domain
dig -x host: reverse DNS lookup of host
wget file: download file
wget -c file: continue an interrupted download

Installation
Install from source:
./configure
make
make install
dpkg -i pkg.deb: install a DEB package
rpm -Uvh pkg.rpm: install an RPM package

Key combinations
Ctrl+C: interrupt the current command
Ctrl+Z: suspend the current command; resume it with fg, or send it to the background with bg
Ctrl+D: leave the current session, similar to exit
Ctrl+W: delete one word in the current line
Ctrl+U: delete the whole line
!!: repeat the last command
exit: leave the current session

[**] use with great care

-- LAB 1 --

Prepared by I. Aguilar, A. Legarra, Z.G. Vitezica.

This lab has three objectives: first, to give a brief overview of the BLUPF90 family of programs; second, to simulate genomic data sets using the software QMSim; and third, to get familiar with a few Unix-Linux tools.

Copy the folder lab1: ‘cp -r /home/course/lab1/ .’ ← (do not forget the final dot!)

(1) BLUPF90 / REMLF90

(1.1) Exercise 1

Remember that you can look at the documentation for the BLUPF90 programs in the wiki, http://nce.ads.uga.edu/wiki/doku.php, and also in the file blupf90.pdf. Using the following example from Mrode and Thompson (2005, Linear Models for the Prediction of Animal Breeding Values), create the data, pedigree and parameter files using renumf90, and then run blupf90 to obtain solutions and accuracies.

Remember that the accuracy for animal i ($r_i$) can be calculated as

$r_i = \sqrt{1 - \mathrm{PEV}_i / \sigma^2_a}$

where PEV is the prediction error variance (S.E. = sqrt(PEV)) and $\sigma^2_a$ is the additive genetic variance.

The solutions from the example are: [table of solutions not available]
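The accuracy formula of Exercise 1 can be sketched as a small Python function that starts from the standard error of a solution. The function name and the numbers in the usage line are illustrative assumptions, not values from the exercise.

```python
import math


def accuracy_from_se(se: float, sigma2_a: float) -> float:
    """Accuracy r = sqrt(1 - PEV / sigma2_a), with PEV = se**2."""
    pev = se ** 2
    # Guard against tiny negative values caused by rounding in the printed S.E.
    return math.sqrt(max(0.0, 1.0 - pev / sigma2_a))


# Hypothetical values: standard error of 4.0 and additive variance of 20.0.
print(accuracy_from_se(4.0, 20.0))
```

An animal with no information has PEV equal to the additive genetic variance and therefore an accuracy of 0; as PEV shrinks towards 0 the accuracy tends to 1.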

(1.2) Exercise 2

Remember that you can look at the documentation for the REMLF90 program in the wiki, http://nce.ads.uga.edu/wiki/doku.php, and also in the file remlf90.pdf.

The parameter files for this exercise are in /home/course/lab1/f90/

The files with *99 contain data for up to 14 traits. The parameter file exmr99s1 uses these files for a single-trait model, exmr99s2 for a two-trait model, and exmr99s for a three-trait model.

Calculate variance components with remlf90 and airemlf90 using the parameter file exmr99s1. Record the number of rounds and the CPU time. If you use the command “time remlf90 …”, the CPU time will be printed after the program stops.

Extend the model to 2 traits by adding the observations in column 4 (parameter file exmr99s2). Repeat the computations for AIREMLF90 only. How much slower is REMLF90, and how much longer are the computations in the two-trait case?

(2) DATA SIMULATION (including genomics)

We will use the software QMSim (Sargolzaei & Schenkel, 2009, Bioinformatics 25:680-681). The software and its manual can be found at http://www.aps.uoguelph.ca/~msargol/qmsim/

The files for these exercises are in /home/course/lab1/QMsim/

To run QMSim, use:

    echo ex.prm | QMsim

(2.1) Exercise 1

1. Run the QMSim program. An example of the parameter file is ex01.prm. Note that the historical population was generated by mutation and drift over 100 generations (t), with an effective population size of 100 (t = 1 to 95) that was gradually expanded to 3,000 offspring (t = 100).

2. Now change ex01.prm and simulate a base population of 200 males and 2,600 females, and 5 generations of selection for a trait (e.g., live weight) with a phenotypic variance of 1.

   1. How many SNPs did you simulate?
   2. How many QTLs might potentially affect the phenotype?
   3. How many animals do you have in the recent population?
   4. Answer this question assuming a litter size equal to 12.
   5. What is the mean of the TBVs after 5 generations?
   6. If you use selection and culling based on EBVs, does the mean of the TBVs change?
   7. Include positive assortative mating.
   8. What is the value of the polygenic variance?

3. Take a look at the script cleaningAfterQmSim.sh. Run it by typing ./cleaningAfterQmSim.sh. This script creates the pedigree, phenotype and genotype files for BLUPF90 from the data simulated by QMSim. Which are the pedigree file, the phenotype file, and the marker file?

   Note that ./cleaningAfterQmSim.sh uses the directory r_ex01/

4. Using wc, check the number of animals in each file:

    wc -l pedigree.txt

   Do the same for the phenotype file.

5. Using freq_allele.awk, estimate the allele frequencies of the markers:

    ./freq_allele.awk r_ex01/p1_mrk_001.txt > freq

(2.2) Exercise 2

The data for this lab were created using a simulation for an animal model (Vitezica et al., 2011). Run QMSim using the parameter file GenRes_example.prm. Look at the simulated file p1_data_001.txt; it has the following columns:

 1: animal id
 2: sire id
 3: dam id
 4: sex
 5: generation
 6: number of male progeny
 7: number of female progeny
 8: inbreeding
 9: homozygosity
10: phenotype
11: simulated residual (e)
12: individual true breeding value for polygene
13: individual true breeding value for direct effect (qtl)
14: EBV from QMSim internal BLUP

From these simulated data, the files pheno.txt, pedigree.txt and mrk.txt are built with the script cleaningAfterQmSim_GenRes.sh. Run it by typing ./cleaningAfterQmSim_GenRes.sh. The file pheno.txt contains 1,800 individuals selected from the 28,800 in the pedigree (pedigree.txt); 915 animals have phenotypes and 885 do not (they have 0 as phenotype, and 0 is the code for a missing value in blupf90). This file has a first column of ones (the mean) plus the 14 columns of the simulated file. The marker data are in mrk.txt.

1. Before continuing the analysis, it is important to check the “quality” of the files for some typical errors. For example, are there duplicated animals in the pedigree? Check it using

    awk '{print $1}' pedigree.txt | sort | uniq -c | awk '$1>1'

2. What is the number of progeny of each sire?

    awk '{print $2}' pedigree.txt | sort | uniq -c

3. How many genotyped animals are there?

    wc -l mrk.txt

4. How many SNPs are genotyped?

    awk '{print length($2)}' mrk.txt

5. Does every animal have the same number of loci in the marker file?

    awk '{print length($2)}' mrk.txt | sort -u

6. Starting from the raw data, modify the renumf90 parameter file (renlab.par) according to the data file, to fit the model

    y = mean + animal + e

   Keep the true breeding values (TBV is in column 14 of pheno.txt). Do not forget to include the marker data in the renumbering.

(2.3) Exercise 3 (Handling data)

Take a look at the SNP data and get acquainted with a few Unix-Linux tools.

Copy the directory /home/course/lab1/mice to your folder:

    cp -rv /home/course/lab1/mice ./

Go to mice/data:

    cd mice/data

Look at the genotype file:

    less -S mice_genotypes.txt

This file is prepared to be used in GS3 and blupf90, and it is in fixed format. This is inconvenient because many programs (plink?) expect SNPs separated by spaces. However, some Fortran compilers (e.g., Intel) do not admit free format beyond a large number of columns, so it is better to compact the genotypes. This also saves a lot of space. The format is pretty easy to handle using awk, Fortran, and scripts (and possibly Perl, Python, etc.).

Alleles are codified as gene content {0,1,2} for genotypes {AA, Aa (or aA), aa}. Which allele is counted (“a” in this case) is arbitrary. Missing genotypes are coded as 5. This format comes from Paul VanRaden, slightly modified. I'll call it “UGA format”.

It is important to understand what fixed format means. It means that ids always go from position i to j, and genotypes always from position k to l. The typical error is not to take into account that ids have variable lengths; e.g., this file would not be read correctly:

    45 1111212…
    346 1121111…
    347 2022222…
    348 1111111…
    349 2022222…
    1350 1111212…

whereas this is a correct format:

      45 1111212…
     346 1121111…
     347 2022222…
     348 1111111…
     349 2022222…
    1350 1111212…

How do we detect whether our file is correctly formatted or not? I usually check that all lines have the same length:

    awk '{print length($0)}' mice_genotypes.txt | sort -u

You should find a single figure as the answer (i.e., all lines have the same length).

So, how many genotyped animals are there?

    wc -l mice_genotypes.txt

How many SNPs are genotyped?

    awk '{print length($2)}' mice_genotypes.txt

Does every animal have the same number of loci in the data file?

    awk '{print length($2)}' mice_genotypes.txt | sort -u
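The same fixed-format checks can be sketched in Python, as a minimal illustration of what the awk one-liners above compute. The function name and the toy ids and genotypes are assumptions made for the example.

```python
def check_fixed_format(lines):
    """Return the set of distinct line lengths and the set of distinct
    genotype-string lengths (second whitespace-separated field)."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    line_lengths = {len(row) for row in rows}
    snp_counts = {len(row.split()[1]) for row in rows}
    return line_lengths, snp_counts


# Toy records (hypothetical ids and genotypes).
records = [("45", "1111212"), ("346", "1121111"), ("1350", "1111212")]
bad = ["%s %s" % rec for rec in records]     # ids not padded: line length varies
good = ["%10s %s" % rec for rec in records]  # ids right-aligned in 10 positions

print(check_fixed_format(bad))
print(check_fixed_format(good))
```

A correctly formatted file yields exactly one line length and one genotype length; more than one value in either set points to the variable-length id problem described above, or to animals with a different number of loci.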

Let's extract the first 200 individuals while keeping the format. This is a one-liner; beware of single and double quotes!

    awk 'NR <= 200 {printf("%10s%1s%" length($2) "s\n", $1, " ", $2)}' mice_genotypes.txt > temp

Note that the format is defined by "%10s%1s%" length($2) "s\n", which means: 10 positions for the id, 1 position (for the space passed later as " "), as many positions as SNPs we have for each individual (in %" length($2) "s), and the line return \n. Otherwise awk would change our beautifully crafted format :-)

Extracting one SNP in whatever column we like (e.g., column 16) is very useful for GWAS scripting (we'll come back to this later):

    awk '{printf("%10s%1s%1s\n", $1, " ", substr($2,16,1))}' mice_genotypes.txt > temp
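For readers who prefer a scripting language, the two extractions above can be sketched in Python while keeping the same fixed layout (id in 10 positions, one space, then the genotypes). Function names and toy data are assumptions; SNP positions are 1-based, as in awk's substr.

```python
def first_n(lines, n):
    """Keep the first n individuals, rewriting each line in the fixed format."""
    out = []
    for line in lines[:n]:
        animal, genotypes = line.split()
        out.append("%10s %s" % (animal, genotypes))
    return out


def extract_snp(lines, position):
    """Keep the id and the single SNP at the given 1-based position."""
    out = []
    for line in lines:
        animal, genotypes = line.split()
        out.append("%10s %s" % (animal, genotypes[position - 1]))
    return out


# Toy file content (hypothetical ids and genotypes).
toy = ["%10s %s" % ("45", "1021"), "%10s %s" % ("346", "2110")]
print(first_n(toy, 1))
print(extract_snp(toy, 2))
```

Splitting on whitespace and re-padding the id makes the output format independent of how the input was aligned, which is the same job the printf format string does in the awk version.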

(3) OPTIONAL Exercise

1. Run the QMSim program with the parameter file ex02.prm. Note that the population structure involves an F2 design produced from inbred lines with divergent phenotypes.

   1. How many SNPs do you simulate?
   2. How many animals do you have in the cross between line 1 and line 2 after 2 generations?
   3. What are the values of inbreeding in lines 1 and 2?

2. Write a parameter file to simulate a backcross between the F1 and line 1.

-- LAB 2 --

Prepared by A. Legarra.

The objective of this exercise is to fit SNP-based models (BLUP_SNP, BayesCPi, Bayesian Lasso) with the software GS3.

This software and its manual can be found at http://genotoul.toulouse.inra.fr/~alegarra

Copy the folder lab2: ‘cp -r /home/course/lab2/ .’

(1) Determine pedigree-based variance components

1. Run the script ./cleaningAfterQmSim_GenRes.sh.

2. Estimate the allele frequencies of the markers using

    ./freq_allele.awk mrk.txt > freq

3. Run renumf90 with the parameter file renlab.par; the files renf90.par, renf90.dat and renaddxx.ped will be created.

Extract the animals with records:

    awk '$1!=0 {print $0}' renf90.dat > renf90.dat.wmp

The file renf90.dat.wmp contains only the animals with records (wmp: without missing phenotypes). Modify renf90.par to use renf90.dat.wmp instead of renf90.dat.

Run remlf90 with the parameter file renf90.par. It should give a heritability of 0.17; thus we assume that the estimates are $\sigma^2_g = 0.17$ for the genetic variance and $\sigma^2_e = 0.84$ for the residual variance. We will use these values for further analyses.

(2) BLUP_SNP

BLUP_SNP (also known as Random Regression BLUP; it is simply called “BLUP” in the GS3 manual) assumes that variance components are known. Therefore, we need to determine the variance components, and in particular the SNP variance for BLUP_SNP. According to Gianola et al. (2009), under some assumptions this is

$\sigma^2_{a0} \approx \sigma^2_g / (2 \sum_i p_i q_i)$

so we need to find $2 \sum_i p_i q_i$. This can be obtained from the file freq (a file with the frequencies of each SNP) that was generated by freq_allele.awk. For instance, you can compute it using R:

    a=read.table(file="freq",colClasses="numeric")
    summary(a)
    r=sum(2*a*(1-a))

or using awk:

    awk 'BEGIN{sum2pq=0}{sum2pq+=2*$1*(1-$1)}END{print sum2pq}' freq

The number $\sum p_i q_i$ will appear in the standard output of GS3 as well; look for it!

This gives $2 \sum_i p_i q_i = 8714$, so $\sigma^2_{a0} \approx \sigma^2_g / (2 \sum_i p_i q_i) = 0.17 / 8714 = 1.95089 \times 10^{-5}$. This is a small number, but remember that it is multiplied by more than 24,000 loci.
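The computation of the SNP variance described above can be sketched in Python. The function names and the toy frequencies are assumptions; the last line only reproduces the arithmetic 0.17 / 8714 quoted in the text.

```python
def sum_2pq(freqs):
    """Sum over SNPs of 2 * p * (1 - p), with p the allele frequency."""
    return sum(2.0 * p * (1.0 - p) for p in freqs)


def snp_variance(sigma2_g, freqs):
    """Approximate per-SNP variance: genetic variance divided by 2 * sum(p * q)."""
    return sigma2_g / sum_2pq(freqs)


# Toy allele frequencies (hypothetical); in the lab they come from the file freq.
freqs = [0.5, 0.2, 0.9, 0.35]
print(sum_2pq(freqs), snp_variance(0.17, freqs))

# Arithmetic from the text: genetic variance 0.17 and 2*sum(pq) = 8714.
print(0.17 / 8714)  # about 1.95e-05
```

SNPs at intermediate frequencies contribute most to the denominator, so a panel with many rare alleles gives a larger per-SNP variance for the same genetic variance.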

So let's launch GS3:

    gs3 gs3.BLUP.par

It generates a file with solutions (including SNP effects) called solutions, and another with EBVs, gs3.BLUP.par_EBVs.

Let's take a look at the SNP results in R:

    a1=read.table("solutions",header=TRUE)
    summary(a1)
    snp=a1[a1$effect==2,]
    summary(snp)
    plot(snp$solution)

No “large” SNP can be easily spotted.

Let's compute in R the accuracy, comparing the EBVs with the TBVs, which are in column 4 of the data file. Open an R session:

    a=read.table("gs3.BLUP.par_EBVs",header=T)
    summary(a)
    b=read.table("renf90.dat.wmp")
    summary(b)
    cor(a$g_overall,b$V4)
    [1] 0.6605722

This is actually pretty good, but these animals were in the training file.

(3) Compute the accuracy for the candidates to selection

The animals that are candidates to selection (such as young bulls) have genotypes but no phenotypes. We will predict the EBVs of these animals using the SNP effects (which are in the file solutions) and compare them with the TBVs. First use

    awk '$1==0 {print $0}' renf90.dat > renf90.dat.mp

The file renf90.dat.mp contains only the animals without records (mp: missing phenotypes).

Use this data file and PREDICT in the parameter file (gs3.BLUP.PREDICT.par) and run it. This creates a file called predictions, and the EBV file as well.

Now let's check the quality of the prediction:

    a=read.table("gs3.BLUP.PREDICT.par_EBVs",header=T)
    b=read.table("renf90.dat.mp")
    cor(a$g_overall,b$V4)
    [1] 0.4937833

which is good, but not as good as for the training data, as could be expected.

NOTE: To use PREDICT, the data file needs the same format as in the preceding analysis, so that GS3 can compute things in the correct order. A complication is that all levels of all effects that exist in the PREDICT file have to exist in the estimation parameter file as well. E.g., if you do cross-validation and the whole data set has 1500 levels of herd, then you need to “declare” the 1500 levels in both files.

Other diagnostics should be done. What are the bias and inflation of the EBVs? This can be checked by fitting the linear model $TBV = a + b\,EBV$, i.e., in R:

    summary(lm(b$V4 ~ a$g_overall))

Also, the method closest to “best” should have the minimum MSE, which can be easily computed as

    mean((b$V4 - a$g_overall)**2)
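The two diagnostics above (regression of TBV on EBV, and MSE) can be sketched without R as follows. This is a plain least-squares fit under the assumption that TBVs and EBVs are already paired in the same order; the function names and the toy values are illustrative only.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def regress_tbv_on_ebv(tbv, ebv):
    """Least-squares fit of TBV = a + b * EBV; returns (a, b)."""
    mean_ebv, mean_tbv = mean(ebv), mean(tbv)
    sxy = sum((x - mean_ebv) * (y - mean_tbv) for x, y in zip(ebv, tbv))
    sxx = sum((x - mean_ebv) ** 2 for x in ebv)
    b = sxy / sxx
    a = mean_tbv - b * mean_ebv
    return a, b


def mse(tbv, ebv):
    """Mean squared error between true and estimated breeding values."""
    return mean([(y - x) ** 2 for x, y in zip(ebv, tbv)])


# Toy values (hypothetical): EBVs more dispersed than the TBVs.
ebv = [-1.0, 0.0, 0.5, 2.0]
tbv = [-0.4, 0.1, 0.2, 1.1]
a, b = regress_tbv_on_ebv(tbv, ebv)
print("intercept a =", a, "slope b =", b, "MSE =", mse(tbv, ebv))
```

As a reading guide, an intercept a away from 0 suggests bias, and a slope b below 1 suggests that the EBVs are inflated (more dispersed than the TBVs they predict).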

(4) BAYESIAN ANALYSES BY GIBBS SAMPLING

NOTE: These analyses might be slow because they use MCMC. You can reduce the number of iterations in the parameter file, for instance from 10000 to 4000; results will usually be poorer.

(4.1) BayesC

This method estimates the variances as well, and starting from a BLUP_SNP parameter file it is straightforward: change “method” from BLUP to VCE (but it will take much longer, and you need to verify convergence of the MCMC, etc.). The parameter file is gs3.BayesC.par. It usually gives similar results in livestock populations (but not in the mice data set), as the approximate identity for the variances is rather good.

To compute the correlations, run GS3 with the predict file, which is the same (gs3.BLUP.PREDICT), and do as above, changing the file names appropriately:

    a=read.table("gs3.BayesC.par_EBVs",header=T)
    b=read.table("renf90.dat.wmp",header=F)
    cor(a$g_overall,b$V4)
    [1] 0.6654839

    a=read.table("gs3.BLUP.PREDICT.par_EBVs",header=T)
    b=read.table("renf90.dat.mp",header=F)
    cor(a$g_overall,b$V4)
    [1] 0.4954621

So finally, the correlation (TBV, EBV) is 0.66 for the training set and 0.50 for the validation set.

(4.2) BayesCPi

We will fix π to 999/1000, i.e., only 1 SNP in every 1000 is supposed to “have” an effect. (There are 750 QTLs simulated and 24,000 SNPs, so this is completely wrong.) Parameter π can also be estimated, but in our experience π is very much confounded with $\sigma^2_a$, so when one increases the other decreases.

A good guess, or starting value, for $\sigma^2_{a0}$ is

$\sigma^2_{a0} \approx \sigma^2_g / ((1 - \pi)\, 2 \sum_i p_i q_i) = 1.952699 \times 10^{-5}$

(note that this $\sigma^2_{a0}$ is defined for those SNPs whose effect is not 0).

The parameter file is modified by stating “mixture TRUE” in the last line. We also fix the proportions in the a priori beta distribution so that π is fixed in practice to 999/1000, as follows:

    A PRIORI a 1d8 999d8
    ...
    USE MIXTURE T

Note that GS3 has the opposite notation for π in the documentation.

We launch it as usual:

    ./gs3 gs3.BayesCPi.par

In the output you can see the number of SNPs “drawn” at each iteration: “includeda”. In the parameter file I have put 10,000 iterations, but this is too few. We usually run 100,000 iterations or more. Imagine: ideally, it should explore all combinations of 24 SNPs in a space of some 24,000 SNPs. (Hence, in my opinion this is a source of inefficiency, and other options such as BLUP_SNP, non-linear Lasso, ElasticNet or VanRaden's nonlinearA should be better, especially for very large SNP data sets.)

With these caveats, accuracy is 0.42 in the validation data set and 0.57 in the training data set. This is not a very good accuracy, but possibly fixing π = 0.999 was not a good idea. A value of 0.99, or estimating π, is possibly a better idea.

However, if we look at individual SNP effects using R, there are interesting things:

    a1=read.table("solutions",header=TRUE)
    summary(a1)
    snp=a1[a1$effect==2,]
    summary(snp)
    plot(snp$solution)
    plot(abs(snp$solution),cex=2,pch=18)

A very large SNP effect can be spotted at position ~19204; this is possibly the location of a large QTL. We have found that this way of fitting BayesCPi is as good as other methods to spot QTL locations.

(4.3) Bayesian Lasso (BL)

This is rather similar to BayesC as well, but we do not use the “mixture” option and we do use the option

    OPTION BayesianLasso Tibshirani

which corresponds to BL2Var in Legarra et al. (2011) (the option ParkCasella has been planned but not yet implemented; volunteers welcome!). The starting value for λ is deduced from the starting value for $\sigma^2_{a0}$ (for instance, $\sigma^2_{a0} \approx \sigma^2_g / ((1 - \pi)\, 2 \sum_i p_i q_i)$) as

$\lambda^2 = 2 / \sigma^2_{a0}$

because of the equivalence

$\sigma^2_g \approx 2 \sum_i p_i q_i \cdot 2 / \lambda^2$

(Legarra et al., 2011).
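The relation between λ and the SNP variance stated above can be sketched numerically. The function names are assumptions; the inputs 0.17 and 8714 are the values used earlier in this lab, and no mixture proportion is applied here.

```python
import math


def lasso_lambda(sigma2_a0):
    """Starting value for lambda from lambda**2 = 2 / sigma2_a0."""
    return math.sqrt(2.0 / sigma2_a0)


def implied_genetic_variance(lam, sum2pq):
    """Genetic variance implied by lambda: 2 * sum(pq) * 2 / lambda**2."""
    return sum2pq * 2.0 / lam ** 2


# Values from the text: genetic variance 0.17 and 2*sum(pq) = 8714.
sigma2_a0 = 0.17 / 8714
lam = lasso_lambda(sigma2_a0)
print("lambda =", lam)
print("implied genetic variance =", implied_genetic_variance(lam, 8714))
```

Going from the SNP variance to λ and back recovers the genetic variance, which is a quick way to check that the starting value written in the parameter file is on the intended scale.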

The BL is generally faster than BayesCPi because it does not compute any likelihood. Also, it does not require this “exploration”, so in my opinion mixing should be much better.

The accuracy is 0.66 with the training data, and 0.50 with the validation data.

Things to do

Other diagnostics should be done. What are the bias and inflation of the EBVs? This can be checked by fitting the linear model $TBV = a + b\,EBV$, i.e., in R:

    summary(lm(b$V4 ~ a$g_overall))

Also, the method closest to “best” should have the minimum MSE, which can be easily computed as

    mean((b$V4 - a$g_overall)**2)

You can also take a look at the file var, with samples of the variance components, to check convergence. The simplest way to do it, using R, is:

1) visually discard the burn-in iterations;

2) with the remaining samples, compare the results using the first half vs. the second half. If the chain has converged to the posterior distribution, the results should be similar.
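The two-step convergence check above can be sketched as follows. The sketch assumes that the samples of one variance component have already been read into a list (the layout of the file var is not described here), and that the burn-in length was chosen by eye; the function name and the toy chain are hypothetical.

```python
def compare_halves(samples, burn_in):
    """Discard the burn-in, then return the means of the first and second
    halves of the remaining samples."""
    kept = samples[burn_in:]
    half = len(kept) // 2
    first, second = kept[:half], kept[half:]
    return sum(first) / len(first), sum(second) / len(second)


# Toy chain (hypothetical): 5 burn-in samples far from the stationary values.
chain = [10.0, 9.0, 7.0, 5.0, 3.0, 1.0, 3.0, 1.0, 3.0]
print(compare_halves(chain, burn_in=5))
```

Similar means in both halves are compatible with convergence but do not prove it; a clear difference is a sign that the burn-in was too short or that the chain needs more iterations.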

You can change any of the parameters (π, variances) and compare the results.

You should take a look at the distribution of the estimates of SNP effects in the file solutions.

If you read the documentation, you will find out that GS3 can estimate the genetic parameters with pedigree only (no SNPs whatsoever) and can also separate the genetic variance due to SNPs from the genetic variance due to pedigree.

(5) OPTIONAL Exercises: BLUP_SNP in R with mice data

Prediction using SNPs in R

Go to the directory /mice_data_R

There is an R program, blupgenomic_crossval.r, that estimates SNP effects using the mice data. It uses BLUP_SNP, and the variance components are those in Legarra et al. (2008, Genetics).

There are three data sets, with 1095, 10946, and 5473 SNPs.

In the same directory, open R. Copy and paste the commands up to the line

    # ---------------------------------------------------
    # END OF FIRST PART
    # ---------------------------------------------------

You will see, first, the computing time for solving BLUP_SNP (it takes 17 s on my “old” (>3 years) laptop); and second, a result (print(cor(y,yhat))): an empirical correlation between the EBVs and the raw phenotypes. More or fewer SNPs can be included by setting the variable “set” to different values, in

    set=1 #10946
    #2 #5473
    #3 #1095

(do not forget to re-run the whole thing, i.e., submit the lines from the top of the program).

There are two methods, PCG and GSRU; on my computer PCG takes 17 s. Find out how long GSRU takes (you can change the convergence criteria as well).

However, this correlation is way too high because we are predicting “training” data. What we are interested in is the prediction of unseen phenotypes (or progeny performances). So, the next piece of the program does cross-validation, splitting the animals into training and validation sets, with increasing training sizes and 10 replicates for each size of the training data set (R is wonderful for this kind of thing).

So run the script up to

    # ---------------------------------------------------
    # END OF SECOND PART
    # ---------------------------------------------------

A plot of the correlation against the size of the training population is generated. You can also do plot(size,pp). Note that more training data implies more accuracy, but the increase seems to be asymptotic.

On the other hand, look at the dispersion of the correlations for a given training size. It is fairly large; this means that a point estimate of this correlation can be rather inaccurate, especially if we have too few individuals in either the training or the validation data set (e.g., at the extremes of the graph). This shows nicely that one should not try to draw too many conclusions from a cross-validation analysis.

What you can do next is to play with the number of SNPs and/or the variances. There are also lots of diagnostics that can be done.

-- LAB 3 --

Prepared by I. Aguilar and A. Legarra.

The objective of this exercise is to compute and examine the genomic relationship matrix G with the software pregsf90, and to fit GBLUP models using blupf90.

We will use mice data from Legarra et al. (2008, Genetics).

Copy the folder lab3 to your directory: ‘cp -r /home/course/lab3/ .’

(1) Compute genomic relationships with compute_G

Look at the program compute_G.f90. This home-made program computes relationships following VanRaden's (2008) first G,

$\mathbf{G} = \mathbf{Z}\mathbf{Z}' / (2 \sum_i p_i (1 - p_i))$, with the sum over all SNPs,

or second G,

$\mathbf{G} = \frac{1}{n_{snp}} \sum_i \frac{\mathbf{Z}_i \mathbf{Z}_i'}{2 p_i q_i}$

(which was used by Yang et al. (2010) in human genetics). Actually, to make G positive definite and invertible, it computes

$\mathbf{G} = 0.95\, \mathbf{Z}\mathbf{Z}' / (2 \sum_i p_i (1 - p_i)) + 0.05\, \mathbf{I}$

Compile it:

    ifort -heap-arrays compute_G.f90 -o compute_G

or

    gfortran compute_G.f90 -o compute_G

Run it:

    $ ./compute_G
    genofile?
    mice_genotypes.txt
    out file for G :
    mice_genotypes.txt.G
    out file for G-1 :
    mice_genotypes.txt.Gi
    which G?
    1 - VanRaden firstG = ZZ'/sum(2pq)
    2 - VanRaden secondG = Yang et al. = mean (Z_i Z_i' /(2p_i q_i))
    1
    write out G ? (T:F)
    T
    write out G inverse? (T:F)
    F
    Column position in file for the first marker: 12
    Format to read SNP file: (i10,1x,10946i1)
    Number of SNPs : 10946
    nanim= 1884 nsnp= 10946
    ...
    average freq 0.512276949805317 var(freq) 7.708975460608908E-002
    X re-setup
    G computed, time 108.8205
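To make the first G concrete, here is a minimal pure-Python sketch of the blended formula 0.95 ZZ'/(2 Σ p(1-p)) + 0.05 I. It is not the code of compute_G.f90: it assumes that allele frequencies are estimated from the genotypes themselves, that Z is the gene content centred by 2p, and that there are no missing genotypes (code 5); the function name and the toy genotypes are hypothetical.

```python
def vanraden_first_g(genotypes, blend=0.05):
    """Blended G = (1 - blend) * ZZ' / (2 * sum(p * (1 - p))) + blend * I.

    genotypes: one list per animal with gene content 0, 1 or 2 for each SNP.
    """
    n_anim = len(genotypes)
    n_snp = len(genotypes[0])
    # Allele frequencies estimated from the data (an assumption of this sketch).
    freqs = [sum(row[j] for row in genotypes) / (2.0 * n_anim) for j in range(n_snp)]
    sum2pq = sum(2.0 * p * (1.0 - p) for p in freqs)
    # Centre gene content by twice the allele frequency.
    z = [[row[j] - 2.0 * freqs[j] for j in range(n_snp)] for row in genotypes]
    g = [[0.0] * n_anim for _ in range(n_anim)]
    for i in range(n_anim):
        for k in range(n_anim):
            zz = sum(z[i][j] * z[k][j] for j in range(n_snp))
            g[i][k] = (1.0 - blend) * zz / sum2pq + (blend if i == k else 0.0)
    return g


# Toy genotypes (hypothetical): 3 animals, 3 SNPs.
toy = [[0, 1, 2], [2, 1, 0], [1, 1, 1]]
for row in vanraden_first_g(toy):
    print(row)
```

The double loop is quadratic in the number of animals and linear in the number of SNPs, which is why a compiled program is used for the 1884 animals and 10946 SNPs of the mice data.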

It creates a file, mice_genotypes.txt.G. Take a look. Let's see what the diagonal and off-diagonal elements look like; if SNPs are in Hardy-Weinberg equilibrium, they should average 1 and (almost) 0, respectively. Actually, in HW the whole matrix G needs to average 0. Take a look using:

    awk '$1==$2' mice_genotypes.txt.G | less

The diagonal looks to be around 1, but there are values higher and lower than 1. The minimum is 0.82; the maximum is 1.29.

We can extract a file with the diagonal elements using awk:

    awk '$1==$2' mice_genotypes.txt.G > diag

We can use awk to compute the mean (you can also use R to read diag):

    awk 'BEGIN{r=0}; {r=r+$5}; END{print r/NR}' diag

which is 1.03, so this population has an average inbreeding of 0.03. The base population was actually composed of 8 inbred lines, whose descendants were mated for 50 generations. Now, use:

    awk '$1!=$2' mice_genotypes.txt.G > offdiag

for the off-diagonals. Off-diagonal elements have a minimum of -0.28 and a maximum of 1.17. This maximum possibly corresponds to a pair of twins (or clones) that do exist in the population. What is the mean of the off-diagonals?
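The diagonal and off-diagonal means asked for above can also be sketched in Python. The sketch assumes only what the awk commands use: the two first fields are the row and column ids and the value of G is in field 5. The function name, the toy lines and the content of fields 3 and 4 are hypothetical.

```python
def diag_offdiag_means(lines, value_col=5):
    """Return (mean of diagonal, mean of off-diagonal) elements.

    Each line holds the row id, the column id and, in field value_col
    (1-based, as in awk), the value of G.
    """
    diag, offdiag = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        value = float(fields[value_col - 1])
        if fields[0] == fields[1]:
            diag.append(value)
        else:
            offdiag.append(value)
    return sum(diag) / len(diag), sum(offdiag) / len(offdiag)


# Toy lines (hypothetical); fields 3 and 4 are placeholders.
toy_g = ["1 1 0 0 1.02", "1 2 0 0 -0.10", "2 1 0 0 -0.10", "2 2 0 0 1.04"]
print(diag_offdiag_means(toy_g))
```

With value_col=3 the same function can be applied to a file where the value of G is in the third field.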

Activities:

1) There is at least one pair of twins. Can you find them?

2) Plot a histogram of relationships. What does it look like?

3) The inbreeding of an individual is the coancestry of its parents, i.e., half the additive relationship between its parents; therefore, half the average of the off-diagonal elements is roughly equal to the average inbreeding in pedigree relationships. Is this true in this G matrix?

4) OPTIONAL (this may take time). Take the code. Right after the computation of allelic frequencies (line 117), put freq(i)=0.5 to fix the allelic frequencies to 0.5. Compile and run the exercise again. What do the diagonals and off-diagonals look like?

NOTE: to read things in R use:

    a=read.table("mice_genotypes.txt.G",colClasses="numeric")

This software allows computing of genomic relationships in an efficient manner, and they can be

used for GBLUP or Single Step. Its computes G and prepares matrix H-1

for Single Step.

After computing the genomic relationship matrix, we will use it.

To running pregsf90 needs:

(1) a “standard” BLUPF90 parameter file: mice.par with an unrelated pedigree file creates as: awk '{print $1,0,0}' pedigri.dat > ped_unrelated

and (2) one extra option: OPTION SNP_file mice_genotypes.txt

This option makes pregsf90 reading a genotype file mice_genotypes.txt and an “equivalences”

file mice_genotypes.txt_XrefID created with (only when all animals have genotype): awk '{print $1,$1}' mice_genotypes.txt > mice_genotypes.txt_XrefID

Here, we have the OPTIONS that make G closest to “pure” 𝐆 = 𝐙𝐙′/2 ∑ 𝑝𝑖(1 − 𝑝𝑖)𝑎𝑙𝑙 𝑆𝑁𝑃𝑠 : OPTION thrStopCorAG -1d0

OPTION tunedG 0

OPTION AlphaBeta 0.99 0.01

Which means, no control of similarity between pedigree and genomic; no “tuning” à la Vitezica

et al. to make G and A compatible; and use 𝑮∗ = 0.99𝑮 + 0.01𝑨 (where A is actually an

identity matrix because the pedigree is “unrelated”).

We are also going to write matrix G and its inverse, $\mathbf{G}^{-1}$, to files. For this we need to add the following OPTIONS:

OPTION saveAscii
OPTION saveG
OPTION saveGInverse

Run it:

$ preGSf90
name of parameter file?
mice.par

The printout on screen is very informative; do read it!

There are files with $2\sum p_i q_i$ (sum2pq), allele frequencies (freqdata.out), and a file with $\mathbf{G}^{-1} - \mathbf{A}_{22}^{-1}$ (GimA22i), which is used in the inverse of matrix H.

Take a look at the G so created. Compute the means of the diagonals and off-diagonals. (Hint if you use awk: the field with the value of G(i,j) is $3, not $5.)

Are they identical to those obtained with compute_G?
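As an alternative to awk, the means can be computed with a short Python sketch. It assumes, following the hint above, that each line of the ASCII file holds "i j G(i,j)" with the value in the third field; the function name and the toy lines are invented, and the file name in the final comment is an assumption to be checked against your own output.

```python
def diagonal_and_offdiagonal_means(lines):
    """Mean of diagonal and of off-diagonal elements from 'i j value' lines."""
    diag_sum = off_sum = 0.0
    diag_n = off_n = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip empty or incomplete lines
        i, j, value = fields[0], fields[1], float(fields[2])
        if i == j:
            diag_sum += value
            diag_n += 1
        else:
            off_sum += value
            off_n += 1
    return diag_sum / diag_n, off_sum / off_n


# Invented toy lines in the assumed "i j value" layout.
toy_lines = [
    "1 1 1.0",
    "1 2 0.2",
    "2 2 1.2",
    "1 3 -0.1",
    "2 3 0.0",
    "3 3 0.8",
]
mean_diag, mean_off = diagonal_and_offdiagonal_means(toy_lines)
print(mean_diag, mean_off)

# With a real file (name assumed here), something like:
# with open("G") as handle:
#     print(diagonal_and_offdiagonal_means(handle))
```

Because G is symmetric, the off-diagonal mean is the same whether the file stores one triangle or the full matrix.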

(2) Computing genomic relationships with pregsf90


Take a look at $\mathbf{G}^{-1}$ (file Gi).

Remember that blupf90 does (G)BLUP using genomic information if it exists. GBLUP can be seen as a Single Step where all animals in the data have genotypes, but here we will fit a GBLUP in two ways.

First, we will introduce the inverse of the genomic relationship matrix (Gi) from an external file.

The f90 series of programs has a utility to include external files with covariance structures; see

the wiki

http://nce.ads.uga.edu/wiki/doku.php?id=user_defined_files_for_covariances_of_random_effects

for an explanation.

So you can use the parameter file mice_gblupGi.par in blupf90 to test GBLUP. After running, keep the solutions for the EBV in a file (ebv.Gi) using awk:

awk '($1==1 && $2==2) {print $0}' solutions > ebv.Gi
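For readers more at ease with Python than awk, the same filter can be sketched as follows. It assumes, as the awk command implies, that the first two fields of each line of the solutions file are the trait and effect numbers; the function name and the toy lines are invented for illustration.

```python
def select_solutions(lines, trait=1, effect=2):
    """Keep lines whose first two fields equal the given trait and effect."""
    kept = []
    for line in lines:
        fields = line.split()
        # Header or malformed lines do not start with two integers: skip them.
        if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        if int(fields[0]) == trait and int(fields[1]) == effect:
            kept.append(line.rstrip("\n"))
    return kept


# Invented toy content mimicking a solutions file.
toy_solutions = [
    "trait/effect level  solution",
    "   1   1         1         20.5",
    "   1   2         1          0.31",
    "   1   2         2         -0.12",
]
ebv = select_solutions(toy_solutions)
for line in ebv:
    print(line)
```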

Second, compute GBLUP letting blupf90 compute matrix G. Use the parameter file mice_gblup.par to test it. It will compute the G matrix from markers as

$\mathbf{G} = 0.95\, \mathbf{Z}\mathbf{Z}' / \left( 2 \sum_{\text{all SNPs}} p_i (1 - p_i) \right) + 0.05\, \mathbf{I}$

To do this we create a “fake” pedigree where animals are unrelated. This has the advantage of including non-genotyped animals, with 1 on the diagonal and 0 elsewhere.

Keep the solutions for the EBV in another file:

awk '($1==1 && $2==2) {print $0}' solutions > ebv.H

Look at this file and see that, even though only genotyped animals have EBVs different from zero, all the animals in the pedigree are in the file.

To compare the results of both GBLUPs, we have a problem. The estimation using user_file (the “first”) has only 1884 animals. This “second” estimation has 2272, including non-genotyped animals. The problem is that some ids are recoded; for instance, animal 1 in the first analysis is animal 345 in the second. Therefore we need to select in ebv.H the results for genotyped animals using the original id (which is in columns 5 or 9 of mice_data).

(3) Genetic evaluations using GBLUP

We can use

./select_animals.awk id ebv.H > output

This program first reads the file id, which lists the ids to keep in the final file. It then reads the second file, ebv.H, and writes to the output only the records of animals that are in the first file.
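The logic of this two-file selection can be sketched in Python as below. This is not the select_animals.awk script itself: the function name, the toy data and, in particular, the column of the second file that holds the animal id (key_column) are assumptions to adapt to your files.

```python
def select_animals(id_lines, record_lines, key_column=3):
    """Keep records whose id (in key_column, 1-based) is listed in id_lines."""
    # First file: one id per line (first field), stored in a set for fast lookup.
    keep = {line.split()[0] for line in id_lines if line.strip()}
    selected = []
    for line in record_lines:
        fields = line.split()
        # key_column is a hypothetical default; check where the id really is.
        if len(fields) >= key_column and fields[key_column - 1] in keep:
            selected.append(line.rstrip("\n"))
    return selected


# Invented toy data: ids to keep, and records with the id in the third field.
toy_ids = ["2", "3"]
toy_records = [
    "1 2 1 0.00",
    "1 2 2 0.31",
    "1 2 3 -0.12",
    "1 2 4 0.00",
]
output = select_animals(toy_ids, toy_records)
for line in output:
    print(line)
```

Reading the ids into a set first means the second file is scanned only once, whatever the number of animals.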

Then, use R:

a=read.table("ebv.Gi",colClasses="numeric")

b=read.table("output",colClasses="numeric")

summary(a)

summary(b)

cor(a$V4,b$V4)

plot(a$V4,b$V4)

You can check mice_blup.par to use pedigree relationships instead, with file

pedigri.dat.

All these parameter files run a genetic evaluation for Body Weight.

GREML is basically the same, but uses airemlf90 (or remlf90, which is slower) instead of blupf90.

First, estimate variance components for Body Weight using the files above. What do you get? Are the parameter estimates the same?

Now analyse the 4th column (trait: Body Length) of the data file, using both pedigree and genomic relationships. What do you get? Are the parameter estimates the same for pedigree and genomics?

(4) Using GREML


Prepared by Z. Vitezica, I. Aguilar, D. Lourenco and H. Wang.

The objective of these exercises is to fit Single-Step GBLUP models using the software preGSf90, blupf90 and postgsf90. We will obtain EBVs for all the individuals in the pedigree (with and without genotypes).

Copy the folder lab4:

cp -r /home/course/lab4/ .

Genomic evaluation methods assume that all the animals are genotyped and phenotyped. This is most often false. For instance, young dairy bulls that are candidates for selection may be genotyped but have no phenotypes.

An extension of the genomic relationship matrix G can be constructed in which genomic

relationships are propagated to all individuals (with and without genotype), resulting in a

combined relationship matrix H, which can be used in a BLUP procedure called the

Single Step Genomic BLUP.

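Genomic EBVs are obtained using all available information (pedigree + phenotypes + genotypes) in the mixed model equations (e.g., Aguilar et al., 2010):

$$
\begin{pmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} \\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \mathbf{H}^{-1}\lambda \end{pmatrix}
\begin{pmatrix} \hat{\mathbf{b}} \\ \hat{\mathbf{u}} \end{pmatrix}
=
\begin{pmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{pmatrix}
$$

where $\lambda = \sigma_e^2 / \sigma_u^2$. Note that $\mathbf{H}$ above can be seen as a modification of regular pedigree relationships to accommodate genomic relationships. Remember that the inverse of $\mathbf{H}$ is

$$
\mathbf{H}^{-1} = \mathbf{A}^{-1} + \begin{pmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{G}^{-1} - \mathbf{A}_{22}^{-1} \end{pmatrix}.
$$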
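The structure of the inverse of H can be checked on a tiny example. The Python sketch below adds $\mathbf{G}^{-1} - \mathbf{A}_{22}^{-1}$ to the block of genotyped animals in $\mathbf{A}^{-1}$. It is only an illustration, not what preGSf90 does internally: the function names, the three-animal pedigree and the values in G are invented, and only two genotyped animals are handled so that a 2x2 inverse suffices.

```python
def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def h_inverse(a_inv, g, a22, genotyped):
    """A^{-1} plus (G^{-1} - A22^{-1}) in the rows/columns of genotyped animals."""
    g_inv = invert_2x2(g)
    a22_inv = invert_2x2(a22)
    h_inv = [row[:] for row in a_inv]  # copy, so a_inv is left untouched
    for r, i in enumerate(genotyped):
        for c, j in enumerate(genotyped):
            h_inv[i][j] += g_inv[r][c] - a22_inv[r][c]
    return h_inv


# Invented pedigree: animals 0 and 1 are unrelated parents of animal 2.
# A = [[1, 0, .5], [0, 1, .5], [.5, .5, 1]], whose inverse is:
A_INV = [[1.5, 0.5, -1.0], [0.5, 1.5, -1.0], [-1.0, -1.0, 2.0]]
# Animals 1 and 2 are genotyped; A22 is their block of A.
A22 = [[1.0, 0.5], [0.5, 1.0]]
# Invented genomic relationships for the two genotyped animals.
G_TOY = [[1.0, 0.6], [0.6, 1.0]]

H_INV = h_inverse(A_INV, G_TOY, A22, genotyped=[1, 2])
for row in H_INV:
    print(" ".join(f"{x:8.4f}" for x in row))
```

Two things are worth noticing: rows and columns of non-genotyped animals keep their pedigree values, and if G were equal to A22 the result would be exactly the pedigree inverse.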

In this lab, we will use the simulated data from Lab 1. The file pheno.txt contains the 15,560 individuals with phenotypes, out of the 28,800 that are in the pedigree (pedigree.txt). The file mrk.txt contains the marker data of the sires; all 920 sires are genotyped.

Remember that the pheno.txt file has the following columns:

1: mean
2: animal id
3: sire id
4: dam id
5: sex
6: generation
7: number of male progeny
8: number of female progeny
9: inbreeding
10: homozygosity
11: phenotype
12: simulated residual (e)
13: individual true breeding value for polygene
14: individual true breeding value for direct effect (qtl)
15: EBV from QMSim internal BLUP

(1) Animal model for Single-Step GBLUP

-- LAB 4 --

(1) Starting from the raw data, modify the renumf90 parameter file (renlab4.par) according to the data file and so as to fit the model: y = mean + animal + e

The true breeding values and the generation information must be in the output data file (renf90.dat). Do not trust renlab4.par in its present state! Then run renumf90 with your modified file.

(2) Check renf90.par, renf90.dat and renaddxx.ped. What is the content of each column in the data file?

(3) For the renaddxx.ped file, check the content of each column against the wiki:

http://nce.ads.uga.edu/wiki/doku.php?id=readme.renumf90

How many genotyped animals are in the file?

awk '$6>=10' renadd02.ped | wc -l

(4) Estimate variance components considering and ignoring marker information. The procedure is basically the same: you use airemlf90 in both cases. But what do you change in the parameter file?

What do you get? Are the parameter estimates the same for pedigree and genomics?

(5) Using the variance components estimated above, predict the EBVs for these animals using (1) a model with marker information (Single-Step) and (2) a model with no marker information (pedigree only). Compare the solutions.
