Supplementary Material NG-TR22292 · 1 Supplementary Material SNP and haplotype mapping for genetic analysis in the Rat The STAR consortium 1Kathrin Saar, 2Alfred Beck, 3Marie-Thérèse

1

Supplementary Material SNP and haplotype mapping for genetic analysis in the Rat The STAR consortium

1Kathrin Saar, 2Alfred Beck, 3Marie-Thérèse Bihoreau, 4Ewan Birney, 3Denise Brocklebank, 4Yuan Chen, 5Edwin Cuppen, 6Stephanie Demonchy, 7Joaquin Dopazo, 4Paul Flicek, 6Mario Foglio,8,9Asao Fujiyama, 6Ivo G. Gut, 3Dominique Gauguier, 10,11Roderic Guigo, 5Victor Guryev, 1Matthias Heinig, 1Oliver Hummel, 12Niels Jahn, 2Sven Klages, 13,14Vladimir Kren, 2Michael Kube, 2Heiner Kuhl, 15Takashi Kuramoto, 8Yoko Kuroki, 6Doris Lechner, 1,16Young-Ae Lee, 10,11Nuria Lopez-Bigas, 6G. Mark Lathrop, 15Tomoji Mashimo, 7Ignacio Medina, 3Richard Mott, 1Giannino Patone, 6Jeanne-Antide Perrier-Cornet, 12Matthias Platzer, 13,14Michal Pravenec, 2Richard Reinhardt, 8Yoshiyuki Sakaki, 12Markus Schilhabel, 1Herbert Schulz, 15Tadao Serikawa, 11Medya Shikhagaie, 8Shouji Tatsumoto, 12Stefan Taudien, 8Atsushi Toyoda, 15Birger Voigt, 6Diana Zelenika, 1Heike Zimdahl, 1Norbert Hubner

2

Supplementary Figure 1a

3

Supplementary Figure 1b

4

Supplementary Figure 1c

5

Supplementary Figure 1a, b, c: Neighbor-Joining tree of laboratory rat strains.

The bootstrap values over 50% (based on 1,000 replicates) are shown next to

the branches. All positions containing alignment gaps and missing data were

eliminated only in pairwise sequence comparisons (pairwise deletion option).

Branches containing several isolates from the same strain were collapsed. Three

trees are shown: 1a is based on the complete set (20,283 positions genotyped in

167 rat samples); 1b: set exclusively genotyped on Illumina GoldenGate platform

(10,665 positions); 1c: set exclusively genotyped on Affymetrix Targeted

Genotyping platform (8,935 positions). We defined 10 clusters of rat strains

(coloured), which are supported by bootstrap value exceeding 75%.

6

Supplementary Figure 2: Effect of reduction in SNP density or number of rats

genotyped on haplotype block structure. Example of linkage disequilibrium (r-

squared) and defined haplotype blocks (black triangles) are given for rat

chromosomes 19 and 20. First row represents haplotype block structures for full

dataset with the following rows showing blocks for sets lacking rats from the most

divergent BN strain ('no BN'); reduced number of strains ('-10% strains', 3

random replicates) and reduced SNP density ('-10% SNPs', 3 random

replicates). The figure shows that block structure is robust to the selection of

strains and/or markers.

Supplementary Figure 2

7

Supplementary Figure 3: Decay of linkage disequilibrium with distance for

laboratory mice and rats. Laboratory mouse strains show greater linkage

disequilibrium if compared to rat strains. Average R-squared was calculated as a

function of distance between SNPs. Four datasets were used for calculations:

genotype data from laboratory rats, 17.9 thousand genomic positions in 167 (red)

and in 38 most diverged rats (brown); genotype data from 38 laboratory mouse

strains for 17.9 thousands (same SNP density/distribution as in rat, dark blue)

and for full set of 96 thousand SNPs (light blue).


8


9

Supplementary Figure 4 Genetic and physical map of the GKxBN F2 cross.

10

Supplementary Figure 5: Three SNP-based genetic maps. The maps are constructed for the mapping panels HXB-BXH

(a), FXLE-LEXF (b), and for GKxBN (c) after removal of markers that generate suspiciously large map distances. Genetic

positions (given in cM, x-axis) of observed SDP are represented by vertical lines for all autosomes (ordered ascending on

the y-axis).


a b c

11

Supplementary Figure 6a: Rearrangement of contigs on the X chromosome.

The X chromosome was split into four fragments (red, blue, green and yellow) at

contig boundaries where no linkage between the flanking markers could be

detected (LOD < 2) (supplementary Table 5). Contig boundaries are marked on

the ideograms. We rearranged the fragments in the order (see “X_new” in the

upper bar) that generates the smallest average recombination fraction in the

three populations.

Supplementary Figure 6a

12

Supplementary Figure 6b: Genetic maps of the X chromosome.

Supplementary Figure 6b

13

Supplementary Figure 7: QTL mapping for cholesterol and lipid parameters. QTL as the result of Composite Interval Mapping (CIM) for Cholesterol, High Density Lipoproteins, Low Density Lipoproteins and Triglycerides are shown in red, green, yellow and blue, respectively. The LOD thresholds were determined by at least 1,000 genome wide permutations for each trait with a significance level of 0.05 and are indicated with horizontal bars, where the green bar for HDL is covered by the same valued blue Triglyceride line (3.6). The calculated LOD scores are the result of the comparison of two nested hypothesis (H0 with likelihood L0 is nested within H1 with likelihood L1) and is two times the negative natural log of the ratio of the likelihoods (-2ln(L0/L1). QTL were considered independently if their peaks showed a minimal distance of 5 cM and at least 1 LOD difference between top and valley. Multiple testing across the various traits investigated was corrected using FDR and LOD scores were considered significant at P<0.051. The lower panel displays the additive effect of the LE/Stm and F344/Stm alleles at the corresponding QTL locations. All detected QTL are availableURL1.


14

Supplementary Table 1: Characterization of identified SNPs with regard to putative function.

all SNPs Coding region

Splicing Promoter region 3’-UTR OMIM

Splice Site TFBS DNA-triplex miRNA target Cancer Disease Consequence type # of SNPs Omega

Donor Acceptor

intergenic 208,972 upstream 14,737 1,019 568 180 744 downstream 12,725 158 729 3’-UTR 625 1 132 5 42 5’-UTR 330 3 15 intronic 94,091 57 63 1,516 5,635 essential splice site 21 1 1 splice site 303 5 35 non synonymous coding 1,160 56 1 18 70 synonymous coding 2,599 2 43 178 stop gained 21 Total # of SNPs 325,788

15

Supplementary Table 2: In silico mapping. Median number of putative QTL exceeding given LogP threshold for a scenario of ten 5% QTL. One hundred inbred Rat strains were evaluated by simulation for their potential for WGA mapping of QTL.

LogP threshold # putative QTL

4.1 1,412.0 5 681.5 6 248.0 7 154.0 8 19.5

16

Supplementary Table 3: Comparative analysis of genetic maps constructed in the GKxBN F2 cross with microsatellite and SNP markers using Joinmap. Number of markers mapped and average spacing between resolved genetic positions are given. *Taken from 2.

Microsatellite markers * SNP markers Chr. Mapped Average (cM) Typed Mapped Average (cM)

1 333 1.52 936 594 1.24 2 192 1.26 1,005 711 1.03 3 110 1.97 646 399 1.11 4 159 1.67 712 501 1.10 5 172 1.45 704 520 1.06 6 102 1.47 534 375 0.95 7 104 1.66 545 382 0.98 8 108 1.64 488 345 1.19 9 75 2.11 404 267 1.18

10 128 1.57 457 311 1.10 11 40 2.94 330 210 1.25 12 50 1.95 208 133 1.32 13 78 2.02 349 206 1.26 14 72 2.15 401 317 1.37 15 53 2.63 339 221 1.26 16 45 2.78 300 178 1.58 17 92 1.54 379 234 0.98 18 59 1.81 323 193 1.14 19 41 1.98 204 146 1.11 20 45 2.41 225 134 1.13

Total 2,058 1.77 9,489 6,377 1.17

17

Supplementary Table 4: Details of three SNP-based genetic maps. The maps are constructed for the mapping panels HXB-BXH, FXLE-LEXF (recombinant inbred [RI] strains), and for GKxBN (F2 cross). Number of observed SDPs (N), genetic length (cM) and average spacing between resolved genetic positions (cM) are given.

SDPs mapped in HXB-BXH

RI strains SDPs mapped in FXLE-LEXF

RI strains SDPs mapped in GKxBN

F2 rats

Chr N Length Average distance

N Length Average distance

N Length Average distance

1 111 180 1.65 111 189 1.72 191 188 1.19 2 130 170 1.32 76 125 1.67 214 162 1.01 3 118 134 1.14 70 107 1.55 110 135 1.35 4 67 103 1.56 59 95 1.65 156 122 1.00 5 107 131 1.23 75 121 1.63 167 123 0.89 6 77 97 1.28 46 79 1.74 118 92 0.92 7 83 98 1.20 64 102 1.62 141 102 0.90 8 62 95 1.58 67 115 1.74 119 101 1.05 9 74 105 1.46 50 84 1.72 92 83 1.11

10 86 98 1.15 52 102 2.01 120 92 0.99 11 44 64 1.53 38 55 1.48 73 50 0.99 12 47 52 1.13 17 42 2.64 56 59 1.21 13 34 38 1.15 27 62 2.37 54 67 1.49 14 62 74 1.21 35 74 2.19 97 83 1.08 15 61 75 1.26 37 64 1.78 105 91 0.99 16 44 65 1.52 38 57 1.54 64 72 1.36 17 52 134 2.62 41 64 1.60 104 75 0.94 18 56 70 1.27 33 70 2.19 87 67 0.95 19 48 65 1.38 26 55 2.21 69 76 1.15 20 41 58 1.46 35 63 1.85 61 61 1.08 X 26 77 2.96 23 112 4.87 22 77 3.51

total 1430 1984 1.39 1020 1837 1.80 2220 1977 0.89

SDPs are derived from 20,283 SNPs in the HXB-BXH and FXLE-LEXF RI panel and from 9,691

SNPs in the GKxBN F2 cross.

18

Supplementary Table 5: Linkage breaks between contigs on the X chromosome. Flanking markers on the contigs of the X chromosome were identified and used to compute the LOD scores for linkage between markers of adjacent contigs. There are three contig boundaries where there is no linkage between flanking markers (LOD < 2) marked in bold.

Contig

From To LOD score HXB-BXH LOD score FXLE-LEXFLOD score GKxBN

1 2 9.33 4.15 24.11 2 3 0.12 0.32 0.00 3 4 9.33 6.66 no markers 4 5 9.33 no markers no markers 5 6 0.01 no markers 0.00 6 7 0.01 no markers no markers 7 8 7.41 no markers no markers 8 9 no markers no markers 20.81 9 10 no markers no markers no markers

10 11 no markers no markers no markers 11 12 3.16 1.77 no markers 12 13 9.33 no markers no markers 13 14 no markers no markers no markers 14 15 no markers no markers 26.49 15 16 no markers no markers 24.41 16 17 1.22 no markers 6.53 17 18 6.11 3.19 16.71 18 19 no markers no markers no markers 19 20 no markers no markers no markers 20 21 no markers no markers no markers

19

Supplementary Table 6: Summary of the obtained QTL. Chromosome number, peak location in cM and the LOD score are shown. Parameters newly examined in the Rat are printed in red. Note: QTL were counted separately if their peaks showed a minimal distance of 5 cM and at least 1 LOD difference between top and valley. Figures on all QTL are availableURL1.

Trait Chromosome - cM (LOD score) Body weight 5 weeks 1 - 10 (7.4), 5 - 91 (3.9), 5 - 104 (6.7), 6 -79 (6.0), 6 - 91 (11.4), 9 - 3 (4.1),

14 - 13 (8.0) Body weight 6 weeks 6 - 78 (5.0), 6 - 91 (8.1) Body weight 10 weeks 2 - 54 (5.9), 10 - 88 (5.3), 10 - 96 (6.2), 13 - 33 (9.0) Brain weight 10 weeks Heart weight 10 weeks 10 - 92 (4.4), 13 - 33 (8.7) Lung weight 10 weeks 14 - 3 (10.0) Liver weight 10 weeks 2 - 56 (4.0), 10 - 97 (6.1), 13 - 33 (6.1) Kidney weight 10 weeks 1 - 148 (3.0) Spleen weight 10 weeks 1 - 11 (6.3), 1 - 18 (4.5) Adrenals weight 10 weeks 2 - 105 (4.3), 2 - 111 (5.0), 18 - 55 (6.9) Testis weight 10 weeks 3 - 80 (4.2), 6 - 63 (6.6), 11 - 8 (5.9), 11 - 21 (4.0) Relative Brain weight 10 weeks 1 - 156 (3.0), 2 - 1 (2.5), 13 - 25 (2.6), 13 - 34 (3.1), 18 - 52 (3.2), 19 - 57

(3.5) Relative Heart weight 10 weeks 10 - 72 (5.0), 10 - 80 (5.5), 13 - 2 (4.7) Relative Lung weight 10 weeks 3 - 94 (3.7), 10 - 67 (8.4), 13 - 13 (4.6) Relative Liver weight 10 weeks 2 - 103 (3.3), 3 - 33 (3.6), 4 - 86 (5.9) Relative Kidney weight 10 weeks 15 - 40 (7.5), 19 - 1 (5.9), 20 - 0 (4.2) Relative Spleen weight 10 weeks 1 - 3 (6.2), 3 – 63 (7.3), 10 - 81 (8.3) Relative Adrenals weight 10 weeks 18 - 64 (7.5) Relative Testis weight 10 weeks 1 - 161 (2.5), 4 - 1 (2.5) Systolic Blood Pressure 10 - 16 (4.3), 10 - 23 (7,8), 10 - 101 (7.1), 19 - 18 (5.4) Heart Rate 1 - 168 (4.9), 1 - 178 (5.4), 3 - 84 (4.9), 11 - 12 (6.2), 14 - 61 (4.5) Body temperature 10 - 11 (5.2), 14 - 48 (4.1) Red Blood Cell count 3 - 18 (4.8), 11 - 9 (4.0), 11 - 17 (8.0), 17 - 13 (4.6), 19 - 28 (5.8) Hemoglobin concentration 1 - 90 (6.9), 2 - 45 (5.4), 2 - 55 (12.5), 18 - 9 (7.6) Hematocrit 4 - 85 (3.6), 12 - 0 (8.2), 12 - 8 (5.0) Mean Corpuscular Volume 1 - 104 (7.6), 1 - 110 (4.9), 6 - 81 (7.0), 6 - 88 (4.6), 17 - 55 (4.7) Mean Cell Hemoglobin mass 2 - 16 (4.3), 2 - 36 (4.2), 6 - 15 (6.3), 11 - 43 (6.2), 18 - 61 (3.54), 20 - 59

(8.5) Mean Cell Hemoglobin concentration 5 - 100 (3.7), 11 - 22 (4.1), 12 - 0 (7.6), 12 - 8 (5.7) White Blood Cell count 5 - 79 (6.5), 18 - 28 4.4) Platelet number 4 - 41 (3.7), 8 - 79 (4.6), 17 - 27 (8.4) Prothrombin Time 9 - 30 (3.9), 13 - 33 (7.4) Activated Partial Thromboplastin Time 1 - 79 (5.2), 5 - 69 (3.9), 7 - 57 (6.1), 7 - 62 (10.4), 7 - 68 (4.6), 7 - 73 (6.8),

14 - 19 (6.2), 18 - 51 (8.1) Glutamate Oxalacetate Transaminase 1 - 167 (8.6), 1 - 173 (7.0), 10 - 63 (3.9), 16 - 30 (5.3) Glutamate Pyrovate Transaminase 2 - 65 (5.1), 12 - 1 (7.4), 12 - 6 (6.5), 14 - 5 (4.0) Alkaline Phosphatase 1 - 88 (4.9), 5 - 42 (5.0), 5 - 52 (11.6), 5 - 61 (4.0) Total Protein Albumin 8 - 18 (6.4), 8 - 24 (8.5), 10 - 9 (4.7), 10 - 15 (4.9), 10 - 80 (8.0), 18 - 28

(7.6), 18 - 34 (5.9) Albumin Total Protein Ratio Glucose 2 - 5 (9.0), 4 - 7 (4.8), 17 - 23 (4.7) Total Cholesterol 2 - 3 (6.3), 14 - 68 (5.6), 18 - 37 (10.5) High Density Lipoprotein 5 - 14 (4.2), 13 - 0 (3.8), 13 - 6 (4.2), 13 - 12 (4.0), 18 - 37 (12.2), 20 - 10

(5.9) Low Density Lipoprotein 5 - 76 (4.2), 5 - 81 (7.5), 9 - 0 (4.9), 10 - 53 (3.74), 10 - 63 (5.6), 15 - 24 (5.0) Triglyceride 2 - 12 (6.8), 7 - 40 (4.3), 7 - 46 (4.6), 13 - 4 (4.3), 13 - 14 (6.9) Total Bilirubin 3 - 52 (4.2), 8 - 55 (7.0), 13 - 14 (4.5), 14 - 17 (11.5), 20 - 8 (3.6) Blood Urea Nitrogen 1 - 46 (5.8), 9 - 48 (6.5), 9 - 53 (4.8)

20

Supplementary Table 6 (continuation)

Trait Chromosome - cM (LOD score) Creatinine 1 - 178 (6.2), 4 - 72 (8.4) Inorganic Phosphate 1 - 125 (5.7), 20 - 7.2 (4.0) Calcium (plasma) 10 - 70 (7.7), 10 - 76 (3.8), 17 - 13 (3.7), 17 - 23 (8.5), 17 - 30 (4.4) Sodium (plasma) 1 - 125 (5.2), 1 - 132 (3.0), 9 - 42 (3.1), 9 - 49 (4.1), 10 - 65 (2.8), 10 - 73

(5.6), 10 - 81 (3.4), 14 - 6 (7.7) Potassium (plasma) 1 - 151 (4.0), 4 - 7 (5.0), 4 - 12 (4.4), 17 - 50 (4.2) Chloride (plasma) 6 - 1 (6.4), 15 - 34 (7.4) White Blood Cells Eosinophiles 3 - 0 (5.0), 17 - 27 (3.6) White Blood Cells Stab from leukocytes 1 - 59 (6.2), 13 - 22 (3.9) White Blood Cells Segmented leukocytes 7 - 98 (4.2), 8 - 30 (4.0), 9 - 62 (6.1), 10 - 26 (5.5), 10 - 86 (6.4), 10 - 95 (5.8) White Blood Cells Lymphocytes 1 - 103 (6.7), 1 - 108 (3.9), 16 - 31 (5.6) White Blood Cells Monocytes 3 - 43 (4.6), 6 - 94 (6.0), 9 - 47 (3.9), 17 - 12 (6.0) Urine volume 2 - 9 (6.7), 7 – 14 (6.6), 9 - 70 (6.6), 20 - 32 (4.3) Sodium (urine) 2 - 12 (3.8), 5 - 24 (5.9), 9 - 69 (5.0), 9 - 74 (10.6), 9 - 80 (4.6), 13 - 0 (7.4),

13 - 6 (7.7) Potassium (urine) 2 - 105 (4.1), 4 - 85 (7.8), 5 - 54 (3.4), 16 - 1 (6.0) Chloride (urine) 8 - 16 (6.1), 10 - 17 (8.4), 13 - 4 (4.3), 13 - 14 (5.9), 16 - 40 (5.1) Relative Urine volume 3 - 10 (5.8), 3 - 15 (5.1), 10 - 8 (4.4) Relative Sodium concentration (urine) 2 - 12 (4.4), 9 - 80 (6.5), 14 - 23 (3.5) Relative Potassium concentration (urine) 2 - 11 (7.1), 15 - 7 (5.9) Relative Chloride concentration (urine) 4 - 14 (5.7), 5 - 60 (5.1), 13 - 54 (4.6) Rearings 10 - 95 (4.4) Forelimb grip strength 6 - 80 (3.9), 6 - 89 (7.4), 13 - 51 (5.6) Hindlimb grip strength 18 - 53 (5.9), 18 - 62 (10.3) Landing foot splay 5 - 93 (3.9) Locomotor acitvity 10 min 1 - 130 (6.0), 1 - 139 (4.1), 14 - 17 (4.5) Locomotor acitvity 20 min 8 - 3 (3.9) Locomotor acitvity 30 min 4 - 99 (5.2), 14 - 0 (5.9), 20 - 9 (8.4) Total Locomotor acitvity 4 - 109 (4.7), 9 - 22 (5.0), 9 - 28 (3.5), 10 - 98 (6.4) Passive Avoidance test training time 15 - 1 (3.6), 20 - 31 (4.1), 20 - 54 (5.2), 20 - 59 (4.3) Passive Avoidance test retention time 10 - 12 (4.4), 15 - 13 (6.3)

21

Supplementary Methods

Animals. A set of recombinant inbred (RI) strains was produced by inbreeding

between members of the F2 generation resulting from the cross of the two highly

inbred strains Brown Norway (BN-Lx/Cub) and SHR/Ola3, designated here as BN

and SHR. In the present study 31 RI strains (BXH-HXB) at > F60 were used. Details

of the GKxBN F2 cross are given in4.

The 33 FXLE-LEXF RI strains and their parental strains, F344/Stm and

LE/Stm, were originally generated at the Saitama Cancer Center Research Institute5.

LE/Stm was derived from a closed Long Evans colony from the Ben May Laboratory

for Cancer Research of the University of Chicago, and F344/Stm originated from

F344/DuCrj (Charles River Japan). Several RI lines had substrains that branched out

at the 7th to 11th generation after an attempt to fix the coat color (N = 8 substrains).

Details on the history of these RI strains are described elsewhere5,6.

Phenotyping. Phenotypic profiles for the RI lines FXLE-LEXF consisted of the

following 7 categories covering 109 parameters; 1) Functional Observational Battery

(FOB, neurobehavioral test), 2) behavior studies, 3) blood pressure, 4) urine

parameters, 5) biochemical blood tests, 6) hematology, and 7) anatomy. All

measurements were performed on all male rats from each strain from 5 to 10 weeks of

age. The detailed protocols used for measurements of these parameters are

availableURL1 and were previously described7. QTL analysis was performed using a

subset of 74 quantitative parameters, which were part of the above mentioned

phenotypic profiles (supplementary Table 6).

Genomic shotgun fragment sequencing. Sequence base calling was performed by

Phred and mated read pairs were assembled using the Perl script pairassemble.pl,

calling the PHRAP assembler. Consensus sequences of assembled read pairs and

sequences of unpaired reads with a minimum length of 100 bp, respectively, were

used for subsequent analysis. This Whole Genome Shotgun project has been

deposited at EMBL/GenBank and in the NCBI genome project databaseURL2.

22

SNP selection and genotyping. SNP information was available from sequences of

the inbred rat strains WKY/Bbb, GK/Ox, SHRSP/Bbb, SS/Jr, and from SD (Celera),

compared to BN. For the 10K Assay panel (Affymetrix), the contribution of SNPs of

these strains was almost evenly distributed. A final set of 9,691 SNPs was chosen for

the 10K assay panel. This resource of candidate SNPs contains variations which map

uniquely to the draft rat genome sequence assembly (RGSC_v3.4) within a +/-100 bp

window. Candidate SNPs within repetitive elements (masked by RepeatMasker for

species Rattus) have been omitted. The 10K Targeted Genotyping technology is based

on primer annealing and extension. Thus, only SNPs that map uniquely within +/-20

basepairs of surrounding sequence were selected. The minimal Phred score was set to

30 for variation and for neighboring bases. The final selection was based on an equal

distribution throughout the genome (without chromosome Y).

The selection of 10,752 SNP markers for genotyping seven Illumina GoldenGate

panels was computationally processed in a custom-made database management

system (DBMS) designed for genotyping projects. The DBMS administrates changes

across genome sequence assembly versions (NCBI build releases 2, 3, and 4), locates

SNP markers by sequence alignment, validates allele consistency between SNP

discovery results and reference genome sequence, compares candidate SNP marker

panels, as well as integrates NCBI data from RefSeq, dbSNP, UniSTS and Entrez

Gene databases required for selection analysis. The selection of SNP markers was

based on quality indicators obtained from the annotation databases, Illumina’s SNP

score for assessment of success, and genome coverage. For Illumina SNP selection,

we created an initial set containing all known rat SNPs that have a unique position in

the genome, lacking unfavorable flanking regions, and have annotated data that are

consistent with the sequence alignment analysis. Illumina scored this initial set, and

we excluded all markers that scored lower than 0.4 or were already present on the

10K assay panel. At the moment of making a selection between nearby markers with

similar score, the priority was given to the marker with the best public data. A

favorable indicator was the information known of SNP marker in gene loci. The

resulting SNP set was utilized as the starting marker pool from where the final

candidate markers were selected. The selection approach was recursively iterated with

23

the aim of ending up with one SNP every 100kb with the highest possible Illumina

score. We genotyped a total of 10,752 SNPs dispatched within 7 panels of 1,536

multiplexed SNPs.

Genomic DNA was extracted from rat tissues (liver) using a standard phenol

chloroform protocol. DNA was dissolved in TAE and distributed to the two

genotyping platforms. For Illumina GoldenGate genotyping DNA was used at a

concentration of 50ng/µl. Genotyping was carried out using the GoldenGate protocol

in a fully automated BeadLab8. Samples were processed in 96-well plates. 91 wells

per 96-well plate were samples. The 5 controls included positive, negative and

replicated samples for blinded quality control. For Affymetrix Targeted Genotyping,

DNA was used in 150ng/µl concentrations. Genotyping was carried out using the

GeneChip® Scanner 3000 Targeted Genotyping System protocol from Affymetrix,

originally described as MIP technology9,10. We processed samples on 96 well-plates,

genotyping 23 samples and one control per plate.

Data was subjected to stringent quality control procedures eliminating samples

and SNPs that did not reach sufficiently high call rates. All SNPs with more than 10%

heterozygous genotypes in the inbred strains where removed from the analysis in the

final data set. Also, SNPs with a call rate below 90% were dropped. Our conclusive

data set of 20,283 SNPs comprised 99.2% of all SNPs genotyped with an overall

success rate of 98.7%, covering the genome with an average distance of 130kb.

For the estimation of heterozygosity in the SD rat, we deduce that homozygous

SNPs are supported on average by fewer aligned reads than heterozygous SNPs (as

expected with this relatively low sequence coverage data (here 1.5x for SD)). For a

heterozygous SNP to be called heterozygous, we require at least two reads to align,

whereas a homozygous SNP will be called homozygous on the basis of a single

alignment, which leads to a systematic underestimation of the true heterozygosity in

this strain. This means that some fraction of the SNPs that were called homozygous

might actually be heterozygous, but, on the other hand, virtually none of the SNPs that

were called heterozygous are false positives (i.e. homozygous). 85% of the SNPs that

are called homozygous are supported by more than 1 read and 100% of the

heterozygous called SNPs are supported by more than one read. This defines the

lower limit of at least 14% heterozygous in the sequenced SD. As there will be some

24

residual heterozygosity in the 15% of the homozygous calls that are only supported by

a single sequence read the fraction of heterozygotes will actually be slightly larger.

Computing functional predictions. In the promotor meta-alignment approach, a

collection of position weight matrices (PWMs) that associate Transcription Factors

(TF) to nucleotide sequence motifs (called TFs-maps) are aligned between rat and

mouse promoters to identify conserved TFBS11. The potential effect of SNPs in

posttranscriptional regulation was assessed - the microRNA target prediction program

miRanda12 was used to scan all the regions containing SNPs in 3'UTR for rat

microRNA target sites. 132 SNPs in 3’UTR regions affect microRNA targets.

MicroRNAs are single-stranded RNA molecules about 21-23 nucleotides in length,

which are complementary to sites in 3'UTR regions.

Linkage disequilibrium and haplotype structure. For comparison of cross-species

haplotype block size correlation between mouse and rat strains we used full sets of

SNPs/strains. Using Ensembl compara database we established anchor points –

orthologous nucleotides between rat and mouse genomes with distance of 10 kb

between consecutive anchors (distance based on rat genome assembly). Taking one

anchor point at a time we determined the size of the haplotype block that contains the

anchor in mouse and in rat genomes. For a complete set of 63,064 anchors we

calculated a Spearman correlation between mouse and rat block sizes. In this analysis

we completely skipped regions with undefined orthology and also SNP-poor regions,

i.e. if anchor point belongs to an interval with distance between two adjacent SNPs

exceeding 500 kb.

QTL Analysis. The LOD threshold was determined by at least 1,000 genome wide

permutations for each trait with a significance level of 0.05. The Forward Regression

Method was used and CIM was performed with a walk speed of 2 cM, CIM Model 6,

5 control markers and a 10 cM Window size adjacent to both sides of the control

25

markers. The calculated LOD scores are the result of the comparison of two nested

hypothesis (H0 with likelihood L0 is nested within H1 with likelihood L1) and is two

times the negative natural log of the ratio of the likelihoods (-2ln(L0/L1). The

calculated QTL were counted separately if their peaks showed a minimal distance of 5

cM and at least 1 LOD difference between top and valley. Note that 8 sublines were

included in the composite interval mapping. These strains - extended with the

character B, C or D at the end of their name - are not completely independent like the

other RI lines since they were derived between the 7th and 11th generation from the

respective progenitor line. Hence, their genetic variance is by the factor 3 - 4 lower if

compared with other RI strains. About 25% of SNPs out of 20,283 show a

polymorphism between the parental strains; usual 12-16% in independent RI strains

and 3 -5 % in the sublines. This different genomic character has two effects on the

calculation conditions: first, it reduces the average numbers of recombinations and

second, these sublines could bias the detection of the QTL. However, the high number

of markers and the resulting very dense genetic map allow neglecting the subline

effect. A genome-wide permutation test subsequently followed by composite interval

mapping using only 27 independent RI strains revealed almost unchanged threshold

values and an almost identical QTL pattern. Since the additional recombinations of

the sublines in comparison to the independent lines are spread across the whole

genome, they do positively contribute to the detection of putative QTL.

Simulations on genome wide association. A simulator program was written in Perl

which takes the observed SNP genotypes across the 100 most diverse strains as fixed,

and then simulates QTL of defined magnitudes at any number of randomly selected

SNPs and adds a random Normally –distributed environmental effect of defined

variance. For a simulated QTL with variance V at a SNP with minor allele frequency

p, the magnitude of the allelic effect is [V/(4p(1-p))]0.5. Thus in the simulations, the

allelic effects vary in order to take account of the allele frequencies of the SNPs

selected to be QTL. Once the phenotypes have been simulated then 3-SNP haplotype

association is performed at every SNP, and the significance measured by the negative

base 10 logarithm of the ANOVA P-value (logP). The genome-wide maximum logP,

or logPmax, was used to define thresholds for genome-wide significance. The simulator

26

was run using several parameter settings: (i) To establish genome-wide significance

levels under the null hypothesis of no genetic effect, 100 simulations were run where

no QTL were present. The median logPmax across all simulations was 4.10, and which

used as a threshold for genome-wide significance; on average half of all genome scans

would exceed this threshold once. (ii) Fifty minor QTL each accounting for 1% of the

variance (i.e. 50% genetic variance), an infinitesimal model in which each QTL

should be undetectable but where half the variance can be explained as differences

between strains. (iii) A single major QTL accounting for 50% of the total phenotypic

variance. (iv) Ten minor QTL each accounting for 5% of the variance (i.e. 50%

genetic variance in total). Note that, because it is theoretically possible to make

unlimited numbers of replicate observations on inbred strains, the fraction of variance

attributable to environmental effects can be reduced by working with the averages of

R replicates of each of the 100 strains. Thus if environmental effects account for the

fraction Ve of the phenotypic variance in one animal (so that Vg + Ve = 1, where is the

fraction of variance explained by all the QTL and for simplicity we consider only

additive genetic effects), they will account for Ve/(RVg + Ve) of the variance, in the

mean phenotype measured across the R replicates. Hence the scenario (iii) could

correspond either to a one-replicate experiment in which the major QTL explains 50%

of the variance or to a 10-replicate experiment in which the major QTL explains only

9.1% of the variance. In practice there are constraints in maintaining a constant

environment over a large number of replicates, and the scenarios simulated here are

consistent with realistic number of replicates (say 5-10) being used.

Supplementary Note

Polymorphisms in the Rat genome and generation of a SNP map

To assess the utility of the SNPs for rat genetics we genotyped a subset (n=20,283) in

167 inbred strains, 2 RI panels of 31 and 33 strains, and tested a subset (n=9,691) in

89 F2 animals. The allele-frequencies of SNPs across the inbred strains are

approximately evenly distributed between 3% and 50%. The genotyping across the

320 individual rats reported here was carried out with two different multiplex

technologies (Illumina GoldenGate Assay8 and Affymetrix Targeted Genotyping9,10),

27

each producing roughly 50% of the data (see Methods and supplementary Methods).

We assayed seven 1,536-plex assays that can be run on the Illumina BeadLab station

(10,752 SNPs). In parallel, we have developed a 10K rat Targeted Genotyping panel

to be run on an Affymetrix platform (9,691 SNPs). The rationale behind this

separation was to evaluate the appropriate technology for efficient SNP genotyping in

the rat. These genotyping tools now are commercially available from Illumina and

Affymetrix, respectively. Both panels together yielded 20,283 validated SNPs. We

genotyped 1,057 SNPs with both platforms in 231 rats as a control with a concordance

of 99.8%. Furthermore, we constructed separate phylogenetic trees using data from

each genotyping platform and found that the clustering is very stable even for those

nodes that are not supported by a high bootstrap value (supplementary Figures 1a, b,

c).

Accessibility of the data

Our data is freely accessible through the following web sites: (i) STAR websiteURL3,

(ii) EnsemblURL4, (iii) Wellcome Trust Center for Human GeneticsURL5, URL6, (iv) Rat

Genome DatabaseURL7, and (v) GeneNetworksURL8. SNPs have been deposited in

dbSNP and sequence traces in the NCBI Trace ArchiveURL2.

The complete set of novel SNPs reported and the entire set of genotypes across

all rat strains are publicly accessible to view in detail and to download via Ensembl,

where we have developed new interfaces to facilitate the use of this information. The

BioMart tool provides a particularly flexible interface to this data, where both

genomic position queries and gene list queries can be used to select a set of SNPs

which are polymorphic between two strain combinations. STAR SNPs are fully

integrated with SNPs from other sources on Ensembl overview displays such as

ContigView. Variation-specific interfaces include TranscriptSNPView13, which

displays both strain specific consequences of SNPs within a specific Ensembl

transcript and the extent of sequence coverage for those strains for which we

generated sequence. Beyond the visual displays sequence and genotype data is

available for downloadURL4, URL5. The full extent of the sequencing coverage is

displayed on the Rat SequenceAlignView (available from the “Resequencing

28

Alignment” link to the left of the main genome views) page where the sequence of the

STAR strains, the SD strain and the F344/Stm strain is accessible as a multiple

alignment; this is most useful for the SD strain where there is reasonable genome

coverage. Advanced users can access the data through the Ensembl API for

bioinformatics analysis (for additional details seeURL4). A tool to select subsets of

SNPs is availableURL5,6, and visualization of polymorphisms along chromosomes is

availableURL6. To facilitate access to the data in the context of additional information

the data represented here has been integrated in other databases, namely RGDURL7 and

GeneNetworksURL8. As well as the documentation on the Ensembl website to provide

a more in depth introduction to this data, specific tailored courses for using this

information are available; these can be run at any institution. (For more information,

contact [email protected].) Finally, data presented for functional assessment of

the SNP consequence type is comprehensively availableURL9.

Supplementary URL List

URL1: http://www.anim.med.kyoto-u.ac.jp/nbr/phenotype URL 2 http://www.ncbi.nlm.nih.gov/sites/entrez?db=Nucleotide; URL 3 http://www.snp-star.eu; URL 4 http://www.ensembl.org/; URL 5 http://gscan.well.ox.ac.uk/gsBleadingEdge/rat.snp.selector.cgi; URL 6 http://www.well.ox.ac.uk/rat_mapping_resources/SNPbased_maps.html; URL7 http://rgd.mcw.edu/strains/ URL 8 http://www.genenetwork.org/; URL 9 http://bg.upf.edu/funcSTAR.

29

Reference List

1. Benjamini, Y. & Yekutieli, D. Quantitative trait Loci analysis using the false

discovery rate. Genetics 171, 783-790 (2005).

2. Wilder, S. P. et al. Integration of the rat recombination and EST maps in the rat genomic sequence and comparative mapping analysis with the mouse genome. Genome Res. 14, 758-765 (2004).

3. Pravenec, M., Klir, P., Kren, V., Zicha, J., & Kunes, J. An analysis of spontaneous hypertension in spontaneously hypertensive rats by means of new recombinant inbred strains. J.Hypertens. 7, 217-221 (1989).

4. Gauguier, D. et al. Chromosomal mapping of genetic loci associated with non-insulin dependent diabetes in the GK rat. Nat.Genet. 12, 38-43 (1996).

5. Shisa, H. et al. The LEXF: a new set of rat recombinant inbred strains between LE/Stm and F344. Mamm.Genome 8, 324-327 (1997).

6. Tachibana, M. et al. Quantitative trait loci determining weight reduction of testes and pituitary by diethylstilbesterol in LEXF and FXLE recombinant inbred strain rats. Exp.Anim 55, 91-95 (2006).

7. Mashimo, T., Voigt, B., Kuramoto, T., & Serikawa, T. Rat Phenome Project: the untapped potential of existing rat strains. J.Appl.Physiol 98, 371-379 (2005).

8. Oliphant, A., Barker, D. L., Stuelpnagel, J. R., & Chee, M. S. BeadArray technology: enabling an accurate, cost-effective approach to high-throughput genotyping. Biotechniques Suppl, 56-1 (2002).

9. Hardenbol, P. et al. Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat.Biotechnol. 21, 673-678 (2003).

10. Hardenbol, P. et al. Highly multiplexed molecular inversion probe genotyping: over 10,000 targeted SNPs genotyped in a single tube assay. Genome Res. 15, 269-275 (2005).

11. Blanco, E., Messeguer, X., Smith, T. F., & Guigo, R. Transcription factor map alignment of promoter regions. PLoS.Comput.Biol. 2, e49 (2006).

12. John, B. et al. Human MicroRNA targets. PLoS.Biol. 2, e363 (2004).

13. Cunningham, F. et al. TranscriptSNPView: a genome-wide catalog of mouse coding variation. Nat.Genet. 38, 853 (2006).

Documents

Supplementary Material NG-TR22292 · 1 Supplementary Material SNP and haplotype mapping for genetic analysis in the Rat The STAR consortium 1Kathrin Saar, 2Alfred Beck, 3Marie-Thérèse