Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
1
Supplementary Material SNP and haplotype mapping for genetic analysis in the Rat The STAR consortium
1Kathrin Saar, 2Alfred Beck, 3Marie-Thérèse Bihoreau, 4Ewan Birney, 3Denise Brocklebank, 4Yuan Chen, 5Edwin Cuppen, 6Stephanie Demonchy, 7Joaquin Dopazo, 4Paul Flicek, 6Mario Foglio,8,9Asao Fujiyama, 6Ivo G. Gut, 3Dominique Gauguier, 10,11Roderic Guigo, 5Victor Guryev, 1Matthias Heinig, 1Oliver Hummel, 12Niels Jahn, 2Sven Klages, 13,14Vladimir Kren, 2Michael Kube, 2Heiner Kuhl, 15Takashi Kuramoto, 8Yoko Kuroki, 6Doris Lechner, 1,16Young-Ae Lee, 10,11Nuria Lopez-Bigas, 6G. Mark Lathrop, 15Tomoji Mashimo, 7Ignacio Medina, 3Richard Mott, 1Giannino Patone, 6Jeanne-Antide Perrier-Cornet, 12Matthias Platzer, 13,14Michal Pravenec, 2Richard Reinhardt, 8Yoshiyuki Sakaki, 12Markus Schilhabel, 1Herbert Schulz, 15Tadao Serikawa, 11Medya Shikhagaie, 8Shouji Tatsumoto, 12Stefan Taudien, 8Atsushi Toyoda, 15Birger Voigt, 6Diana Zelenika, 1Heike Zimdahl, 1Norbert Hubner
2
Supplementary Figure 1a
3
Supplementary Figure 1b
4
Supplementary Figure 1c
5
Supplementary Figure 1a, b, c: Neighbor-Joining tree of laboratory rat strains.
The bootstrap values over 50% (based on 1,000 replicates) are shown next to
the branches. All positions containing alignment gaps and missing data were
eliminated only in pairwise sequence comparisons (pairwise deletion option).
Branches containing several isolates from the same strain were collapsed. Three
trees are shown: 1a is based on the complete set (20,283 positions genotyped in
167 rat samples); 1b: set exclusively genotyped on Illumina GoldenGate platform
(10,665 positions); 1c: set exclusively genotyped on Affymetrix Targeted
Genotyping platform (8,935 positions). We defined 10 clusters of rat strains
(coloured), which are supported by bootstrap value exceeding 75%.
6
Supplementary Figure 2: Effect of reduction in SNP density or number of rats
genotyped on haplotype block structure. Example of linkage disequilibrium (r-
squared) and defined haplotype blocks (black triangles) are given for rat
chromosomes 19 and 20. First row represents haplotype block structures for full
dataset with the following rows showing blocks for sets lacking rats from the most
divergent BN strain ('no BN'); reduced number of strains ('-10% strains', 3
random replicates) and reduced SNP density ('-10% SNPs', 3 random
replicates). The figure shows that block structure is robust to the selection of
strains and/or markers.
Supplementary Figure 2
7
Supplementary Figure 3: Decay of linkage disequilibrium with distance for
laboratory mice and rats. Laboratory mouse strains show greater linkage
disequilibrium if compared to rat strains. Average R-squared was calculated as a
function of distance between SNPs. Four datasets were used for calculations:
genotype data from laboratory rats, 17.9 thousand genomic positions in 167 (red)
and in 38 most diverged rats (brown); genotype data from 38 laboratory mouse
strains for 17.9 thousands (same SNP density/distribution as in rat, dark blue)
and for full set of 96 thousand SNPs (light blue).
Supplementary Figure 3
8
Supplementary Figure 4
9
Supplementary Figure 4 Genetic and physical map of the GKxBN F2 cross.
10
Supplementary Figure 5: Three SNP-based genetic maps. The maps are constructed for the mapping panels HXB-BXH
(a), FXLE-LEXF (b), and for GKxBN (c) after removal of markers that generate suspiciously large map distances. Genetic
positions (given in cM, x-axis) of observed SDP are represented by vertical lines for all autosomes (ordered ascending on
the y-axis).
Supplementary Figure 5
a b c
11
Supplementary Figure 6a: Rearrangement of contigs on the X chromosome.
The X chromosome was split into four fragments (red, blue, green and yellow) at
contig boundaries where no linkage between the flanking markers could be
detected (LOD < 2) (supplementary Table 5). Contig boundaries are marked on
the ideograms. We rearranged the fragments in the order (see “X_new” in the
upper bar) that generates the smallest average recombination fraction in the
three populations.
Supplementary Figure 6a
12
Supplementary Figure 6b: Genetic maps of the X chromosome.
Supplementary Figure 6b
13
Supplementary Figure 7: QTL mapping for cholesterol and lipid parameters. QTL as the result of Composite Interval Mapping (CIM) for Cholesterol, High Density Lipoproteins, Low Density Lipoproteins and Triglycerides are shown in red, green, yellow and blue, respectively. The LOD thresholds were determined by at least 1,000 genome wide permutations for each trait with a significance level of 0.05 and are indicated with horizontal bars, where the green bar for HDL is covered by the same valued blue Triglyceride line (3.6). The calculated LOD scores are the result of the comparison of two nested hypothesis (H0 with likelihood L0 is nested within H1 with likelihood L1) and is two times the negative natural log of the ratio of the likelihoods (-2ln(L0/L1). QTL were considered independently if their peaks showed a minimal distance of 5 cM and at least 1 LOD difference between top and valley. Multiple testing across the various traits investigated was corrected using FDR and LOD scores were considered significant at P<0.051. The lower panel displays the additive effect of the LE/Stm and F344/Stm alleles at the corresponding QTL locations. All detected QTL are availableURL1.
Supplementary Figure 7
14
Supplementary Table 1: Characterization of identified SNPs with regard to putative function.
all SNPs Coding region
Splicing Promoter region 3’-UTR OMIM
Splice Site TFBS DNA-triplex miRNA target Cancer Disease Consequence type # of SNPs Omega
Donor Acceptor
intergenic 208,972 upstream 14,737 1,019 568 180 744 downstream 12,725 158 729 3’-UTR 625 1 132 5 42 5’-UTR 330 3 15 intronic 94,091 57 63 1,516 5,635 essential splice site 21 1 1 splice site 303 5 35 non synonymous coding 1,160 56 1 18 70 synonymous coding 2,599 2 43 178 stop gained 21 Total # of SNPs 325,788
15
Supplementary Table 2: In silico mapping. Median number of putative QTL exceeding given LogP threshold for a scenario of ten 5% QTL. One hundred inbred Rat strains were evaluated by simulation for their potential for WGA mapping of QTL.
LogP threshold # putative QTL
4.1 1,412.0 5 681.5 6 248.0 7 154.0 8 19.5
16
Supplementary Table 3: Comparative analysis of genetic maps constructed in the GKxBN F2 cross with microsatellite and SNP markers using Joinmap. Number of markers mapped and average spacing between resolved genetic positions are given. *Taken from 2.
Microsatellite markers * SNP markers Chr. Mapped Average (cM) Typed Mapped Average (cM)
1 333 1.52 936 594 1.24 2 192 1.26 1,005 711 1.03 3 110 1.97 646 399 1.11 4 159 1.67 712 501 1.10 5 172 1.45 704 520 1.06 6 102 1.47 534 375 0.95 7 104 1.66 545 382 0.98 8 108 1.64 488 345 1.19 9 75 2.11 404 267 1.18
10 128 1.57 457 311 1.10 11 40 2.94 330 210 1.25 12 50 1.95 208 133 1.32 13 78 2.02 349 206 1.26 14 72 2.15 401 317 1.37 15 53 2.63 339 221 1.26 16 45 2.78 300 178 1.58 17 92 1.54 379 234 0.98 18 59 1.81 323 193 1.14 19 41 1.98 204 146 1.11 20 45 2.41 225 134 1.13
Total 2,058 1.77 9,489 6,377 1.17
17
Supplementary Table 4: Details of three SNP-based genetic maps. The maps are constructed for the mapping panels HXB-BXH, FXLE-LEXF (recombinant inbred [RI] strains), and for GKxBN (F2 cross). Number of observed SDPs (N), genetic length (cM) and average spacing between resolved genetic positions (cM) are given.
SDPs mapped in HXB-BXH
RI strains SDPs mapped in FXLE-LEXF
RI strains SDPs mapped in GKxBN
F2 rats
Chr N Length Average distance
N Length Average distance
N Length Average distance
1 111 180 1.65 111 189 1.72 191 188 1.19 2 130 170 1.32 76 125 1.67 214 162 1.01 3 118 134 1.14 70 107 1.55 110 135 1.35 4 67 103 1.56 59 95 1.65 156 122 1.00 5 107 131 1.23 75 121 1.63 167 123 0.89 6 77 97 1.28 46 79 1.74 118 92 0.92 7 83 98 1.20 64 102 1.62 141 102 0.90 8 62 95 1.58 67 115 1.74 119 101 1.05 9 74 105 1.46 50 84 1.72 92 83 1.11
10 86 98 1.15 52 102 2.01 120 92 0.99 11 44 64 1.53 38 55 1.48 73 50 0.99 12 47 52 1.13 17 42 2.64 56 59 1.21 13 34 38 1.15 27 62 2.37 54 67 1.49 14 62 74 1.21 35 74 2.19 97 83 1.08 15 61 75 1.26 37 64 1.78 105 91 0.99 16 44 65 1.52 38 57 1.54 64 72 1.36 17 52 134 2.62 41 64 1.60 104 75 0.94 18 56 70 1.27 33 70 2.19 87 67 0.95 19 48 65 1.38 26 55 2.21 69 76 1.15 20 41 58 1.46 35 63 1.85 61 61 1.08 X 26 77 2.96 23 112 4.87 22 77 3.51
total 1430 1984 1.39 1020 1837 1.80 2220 1977 0.89
SDPs are derived from 20,283 SNPs in the HXB-BXH and FXLE-LEXF RI panel and from 9,691
SNPs in the GKxBN F2 cross.
18
Supplementary Table 5: Linkage breaks between contigs on the X chromosome. Flanking markers on the contigs of the X chromosome were identified and used to compute the LOD scores for linkage between markers of adjacent contigs. There are three contig boundaries where there is no linkage between flanking markers (LOD < 2) marked in bold.
Contig
From To LOD score HXB-BXH LOD score FXLE-LEXFLOD score GKxBN
1 2 9.33 4.15 24.11 2 3 0.12 0.32 0.00 3 4 9.33 6.66 no markers 4 5 9.33 no markers no markers 5 6 0.01 no markers 0.00 6 7 0.01 no markers no markers 7 8 7.41 no markers no markers 8 9 no markers no markers 20.81 9 10 no markers no markers no markers
10 11 no markers no markers no markers 11 12 3.16 1.77 no markers 12 13 9.33 no markers no markers 13 14 no markers no markers no markers 14 15 no markers no markers 26.49 15 16 no markers no markers 24.41 16 17 1.22 no markers 6.53 17 18 6.11 3.19 16.71 18 19 no markers no markers no markers 19 20 no markers no markers no markers 20 21 no markers no markers no markers
19
Supplementary Table 6: Summary of the obtained QTL. Chromosome number, peak location in cM and the LOD score are shown. Parameters newly examined in the Rat are printed in red. Note: QTL were counted separately if their peaks showed a minimal distance of 5 cM and at least 1 LOD difference between top and valley. Figures on all QTL are availableURL1.
Trait Chromosome - cM (LOD score) Body weight 5 weeks 1 - 10 (7.4), 5 - 91 (3.9), 5 - 104 (6.7), 6 -79 (6.0), 6 - 91 (11.4), 9 - 3 (4.1),
14 - 13 (8.0) Body weight 6 weeks 6 - 78 (5.0), 6 - 91 (8.1) Body weight 10 weeks 2 - 54 (5.9), 10 - 88 (5.3), 10 - 96 (6.2), 13 - 33 (9.0) Brain weight 10 weeks Heart weight 10 weeks 10 - 92 (4.4), 13 - 33 (8.7) Lung weight 10 weeks 14 - 3 (10.0) Liver weight 10 weeks 2 - 56 (4.0), 10 - 97 (6.1), 13 - 33 (6.1) Kidney weight 10 weeks 1 - 148 (3.0) Spleen weight 10 weeks 1 - 11 (6.3), 1 - 18 (4.5) Adrenals weight 10 weeks 2 - 105 (4.3), 2 - 111 (5.0), 18 - 55 (6.9) Testis weight 10 weeks 3 - 80 (4.2), 6 - 63 (6.6), 11 - 8 (5.9), 11 - 21 (4.0) Relative Brain weight 10 weeks 1 - 156 (3.0), 2 - 1 (2.5), 13 - 25 (2.6), 13 - 34 (3.1), 18 - 52 (3.2), 19 - 57
(3.5) Relative Heart weight 10 weeks 10 - 72 (5.0), 10 - 80 (5.5), 13 - 2 (4.7) Relative Lung weight 10 weeks 3 - 94 (3.7), 10 - 67 (8.4), 13 - 13 (4.6) Relative Liver weight 10 weeks 2 - 103 (3.3), 3 - 33 (3.6), 4 - 86 (5.9) Relative Kidney weight 10 weeks 15 - 40 (7.5), 19 - 1 (5.9), 20 - 0 (4.2) Relative Spleen weight 10 weeks 1 - 3 (6.2), 3 – 63 (7.3), 10 - 81 (8.3) Relative Adrenals weight 10 weeks 18 - 64 (7.5) Relative Testis weight 10 weeks 1 - 161 (2.5), 4 - 1 (2.5) Systolic Blood Pressure 10 - 16 (4.3), 10 - 23 (7,8), 10 - 101 (7.1), 19 - 18 (5.4) Heart Rate 1 - 168 (4.9), 1 - 178 (5.4), 3 - 84 (4.9), 11 - 12 (6.2), 14 - 61 (4.5) Body temperature 10 - 11 (5.2), 14 - 48 (4.1) Red Blood Cell count 3 - 18 (4.8), 11 - 9 (4.0), 11 - 17 (8.0), 17 - 13 (4.6), 19 - 28 (5.8) Hemoglobin concentration 1 - 90 (6.9), 2 - 45 (5.4), 2 - 55 (12.5), 18 - 9 (7.6) Hematocrit 4 - 85 (3.6), 12 - 0 (8.2), 12 - 8 (5.0) Mean Corpuscular Volume 1 - 104 (7.6), 1 - 110 (4.9), 6 - 81 (7.0), 6 - 88 (4.6), 17 - 55 (4.7) Mean Cell Hemoglobin mass 2 - 16 (4.3), 2 - 36 (4.2), 6 - 15 (6.3), 11 - 43 (6.2), 18 - 61 (3.54), 20 - 59
(8.5) Mean Cell Hemoglobin concentration 5 - 100 (3.7), 11 - 22 (4.1), 12 - 0 (7.6), 12 - 8 (5.7) White Blood Cell count 5 - 79 (6.5), 18 - 28 4.4) Platelet number 4 - 41 (3.7), 8 - 79 (4.6), 17 - 27 (8.4) Prothrombin Time 9 - 30 (3.9), 13 - 33 (7.4) Activated Partial Thromboplastin Time 1 - 79 (5.2), 5 - 69 (3.9), 7 - 57 (6.1), 7 - 62 (10.4), 7 - 68 (4.6), 7 - 73 (6.8),
14 - 19 (6.2), 18 - 51 (8.1) Glutamate Oxalacetate Transaminase 1 - 167 (8.6), 1 - 173 (7.0), 10 - 63 (3.9), 16 - 30 (5.3) Glutamate Pyrovate Transaminase 2 - 65 (5.1), 12 - 1 (7.4), 12 - 6 (6.5), 14 - 5 (4.0) Alkaline Phosphatase 1 - 88 (4.9), 5 - 42 (5.0), 5 - 52 (11.6), 5 - 61 (4.0) Total Protein Albumin 8 - 18 (6.4), 8 - 24 (8.5), 10 - 9 (4.7), 10 - 15 (4.9), 10 - 80 (8.0), 18 - 28
(7.6), 18 - 34 (5.9) Albumin Total Protein Ratio Glucose 2 - 5 (9.0), 4 - 7 (4.8), 17 - 23 (4.7) Total Cholesterol 2 - 3 (6.3), 14 - 68 (5.6), 18 - 37 (10.5) High Density Lipoprotein 5 - 14 (4.2), 13 - 0 (3.8), 13 - 6 (4.2), 13 - 12 (4.0), 18 - 37 (12.2), 20 - 10
(5.9) Low Density Lipoprotein 5 - 76 (4.2), 5 - 81 (7.5), 9 - 0 (4.9), 10 - 53 (3.74), 10 - 63 (5.6), 15 - 24 (5.0) Triglyceride 2 - 12 (6.8), 7 - 40 (4.3), 7 - 46 (4.6), 13 - 4 (4.3), 13 - 14 (6.9) Total Bilirubin 3 - 52 (4.2), 8 - 55 (7.0), 13 - 14 (4.5), 14 - 17 (11.5), 20 - 8 (3.6) Blood Urea Nitrogen 1 - 46 (5.8), 9 - 48 (6.5), 9 - 53 (4.8)
20
Supplementary Table 6 (continuation)
Trait Chromosome - cM (LOD score) Creatinine 1 - 178 (6.2), 4 - 72 (8.4) Inorganic Phosphate 1 - 125 (5.7), 20 - 7.2 (4.0) Calcium (plasma) 10 - 70 (7.7), 10 - 76 (3.8), 17 - 13 (3.7), 17 - 23 (8.5), 17 - 30 (4.4) Sodium (plasma) 1 - 125 (5.2), 1 - 132 (3.0), 9 - 42 (3.1), 9 - 49 (4.1), 10 - 65 (2.8), 10 - 73
(5.6), 10 - 81 (3.4), 14 - 6 (7.7) Potassium (plasma) 1 - 151 (4.0), 4 - 7 (5.0), 4 - 12 (4.4), 17 - 50 (4.2) Chloride (plasma) 6 - 1 (6.4), 15 - 34 (7.4) White Blood Cells Eosinophiles 3 - 0 (5.0), 17 - 27 (3.6) White Blood Cells Stab from leukocytes 1 - 59 (6.2), 13 - 22 (3.9) White Blood Cells Segmented leukocytes 7 - 98 (4.2), 8 - 30 (4.0), 9 - 62 (6.1), 10 - 26 (5.5), 10 - 86 (6.4), 10 - 95 (5.8) White Blood Cells Lymphocytes 1 - 103 (6.7), 1 - 108 (3.9), 16 - 31 (5.6) White Blood Cells Monocytes 3 - 43 (4.6), 6 - 94 (6.0), 9 - 47 (3.9), 17 - 12 (6.0) Urine volume 2 - 9 (6.7), 7 – 14 (6.6), 9 - 70 (6.6), 20 - 32 (4.3) Sodium (urine) 2 - 12 (3.8), 5 - 24 (5.9), 9 - 69 (5.0), 9 - 74 (10.6), 9 - 80 (4.6), 13 - 0 (7.4),
13 - 6 (7.7) Potassium (urine) 2 - 105 (4.1), 4 - 85 (7.8), 5 - 54 (3.4), 16 - 1 (6.0) Chloride (urine) 8 - 16 (6.1), 10 - 17 (8.4), 13 - 4 (4.3), 13 - 14 (5.9), 16 - 40 (5.1) Relative Urine volume 3 - 10 (5.8), 3 - 15 (5.1), 10 - 8 (4.4) Relative Sodium concentration (urine) 2 - 12 (4.4), 9 - 80 (6.5), 14 - 23 (3.5) Relative Potassium concentration (urine) 2 - 11 (7.1), 15 - 7 (5.9) Relative Chloride concentration (urine) 4 - 14 (5.7), 5 - 60 (5.1), 13 - 54 (4.6) Rearings 10 - 95 (4.4) Forelimb grip strength 6 - 80 (3.9), 6 - 89 (7.4), 13 - 51 (5.6) Hindlimb grip strength 18 - 53 (5.9), 18 - 62 (10.3) Landing foot splay 5 - 93 (3.9) Locomotor acitvity 10 min 1 - 130 (6.0), 1 - 139 (4.1), 14 - 17 (4.5) Locomotor acitvity 20 min 8 - 3 (3.9) Locomotor acitvity 30 min 4 - 99 (5.2), 14 - 0 (5.9), 20 - 9 (8.4) Total Locomotor acitvity 4 - 109 (4.7), 9 - 22 (5.0), 9 - 28 (3.5), 10 - 98 (6.4) Passive Avoidance test training time 15 - 1 (3.6), 20 - 31 (4.1), 20 - 54 (5.2), 20 - 59 (4.3) Passive Avoidance test retention time 10 - 12 (4.4), 15 - 13 (6.3)
21
Supplementary Methods
Animals. A set of recombinant inbred (RI) strains was produced by inbreeding
between members of the F2 generation resulting from the cross of the two highly
inbred strains Brown Norway (BN-Lx/Cub) and SHR/Ola3, designated here as BN
and SHR. In the present study 31 RI strains (BXH-HXB) at > F60 were used. Details
of the GKxBN F2 cross are given in4.
The 33 FXLE-LEXF RI strains and their parental strains, F344/Stm and
LE/Stm, were originally generated at the Saitama Cancer Center Research Institute5.
LE/Stm was derived from a closed Long Evans colony from the Ben May Laboratory
for Cancer Research of the University of Chicago, and F344/Stm originated from
F344/DuCrj (Charles River Japan). Several RI lines had substrains that branched out
at the 7th to 11th generation after an attempt to fix the coat color (N = 8 substrains).
Details on the history of these RI strains are described elsewhere5,6.
Phenotyping. Phenotypic profiles for the RI lines FXLE-LEXF consisted of the
following 7 categories covering 109 parameters; 1) Functional Observational Battery
(FOB, neurobehavioral test), 2) behavior studies, 3) blood pressure, 4) urine
parameters, 5) biochemical blood tests, 6) hematology, and 7) anatomy. All
measurements were performed on all male rats from each strain from 5 to 10 weeks of
age. The detailed protocols used for measurements of these parameters are
availableURL1 and were previously described7. QTL analysis was performed using a
subset of 74 quantitative parameters, which were part of the above mentioned
phenotypic profiles (supplementary Table 6).
Genomic shotgun fragment sequencing. Sequence base calling was performed by
Phred and mated read pairs were assembled using the Perl script pairassemble.pl,
calling the PHRAP assembler. Consensus sequences of assembled read pairs and
sequences of unpaired reads with a minimum length of 100 bp, respectively, were
used for subsequent analysis. This Whole Genome Shotgun project has been
deposited at EMBL/GenBank and in the NCBI genome project databaseURL2.
22
SNP selection and genotyping. SNP information was available from sequences of
the inbred rat strains WKY/Bbb, GK/Ox, SHRSP/Bbb, SS/Jr, and from SD (Celera),
compared to BN. For the 10K Assay panel (Affymetrix), the contribution of SNPs of
these strains was almost evenly distributed. A final set of 9,691 SNPs was chosen for
the 10K assay panel. This resource of candidate SNPs contains variations which map
uniquely to the draft rat genome sequence assembly (RGSC_v3.4) within a +/-100 bp
window. Candidate SNPs within repetitive elements (masked by RepeatMasker for
species Rattus) have been omitted. The 10K Targeted Genotyping technology is based
on primer annealing and extension. Thus, only SNPs that map uniquely within +/-20
basepairs of surrounding sequence were selected. The minimal Phred score was set to
30 for variation and for neighboring bases. The final selection was based on an equal
distribution throughout the genome (without chromosome Y).
The selection of 10,752 SNP markers for genotyping seven Illumina GoldenGate
panels was computationally processed in a custom-made database management
system (DBMS) designed for genotyping projects. The DBMS administrates changes
across genome sequence assembly versions (NCBI build releases 2, 3, and 4), locates
SNP markers by sequence alignment, validates allele consistency between SNP
discovery results and reference genome sequence, compares candidate SNP marker
panels, as well as integrates NCBI data from RefSeq, dbSNP, UniSTS and Entrez
Gene databases required for selection analysis. The selection of SNP markers was
based on quality indicators obtained from the annotation databases, Illumina’s SNP
score for assessment of success, and genome coverage. For Illumina SNP selection,
we created an initial set containing all known rat SNPs that have a unique position in
the genome, lacking unfavorable flanking regions, and have annotated data that are
consistent with the sequence alignment analysis. Illumina scored this initial set, and
we excluded all markers that scored lower than 0.4 or were already present on the
10K assay panel. At the moment of making a selection between nearby markers with
similar score, the priority was given to the marker with the best public data. A
favorable indicator was the information known of SNP marker in gene loci. The
resulting SNP set was utilized as the starting marker pool from where the final
candidate markers were selected. The selection approach was recursively iterated with
23
the aim of ending up with one SNP every 100kb with the highest possible Illumina
score. We genotyped a total of 10,752 SNPs dispatched within 7 panels of 1,536
multiplexed SNPs.
Genomic DNA was extracted from rat tissues (liver) using a standard phenol
chloroform protocol. DNA was dissolved in TAE and distributed to the two
genotyping platforms. For Illumina GoldenGate genotyping DNA was used at a
concentration of 50ng/µl. Genotyping was carried out using the GoldenGate protocol
in a fully automated BeadLab8. Samples were processed in 96-well plates. 91 wells
per 96-well plate were samples. The 5 controls included positive, negative and
replicated samples for blinded quality control. For Affymetrix Targeted Genotyping,
DNA was used in 150ng/µl concentrations. Genotyping was carried out using the
GeneChip® Scanner 3000 Targeted Genotyping System protocol from Affymetrix,
originally described as MIP technology9,10. We processed samples on 96 well-plates,
genotyping 23 samples and one control per plate.
Data was subjected to stringent quality control procedures eliminating samples
and SNPs that did not reach sufficiently high call rates. All SNPs with more than 10%
heterozygous genotypes in the inbred strains where removed from the analysis in the
final data set. Also, SNPs with a call rate below 90% were dropped. Our conclusive
data set of 20,283 SNPs comprised 99.2% of all SNPs genotyped with an overall
success rate of 98.7%, covering the genome with an average distance of 130kb.
For the estimation of heterozygosity in the SD rat, we deduce that homozygous
SNPs are supported on average by fewer aligned reads than heterozygous SNPs (as
expected with this relatively low sequence coverage data (here 1.5x for SD)). For a
heterozygous SNP to be called heterozygous, we require at least two reads to align,
whereas a homozygous SNP will be called homozygous on the basis of a single
alignment, which leads to a systematic underestimation of the true heterozygosity in
this strain. This means that some fraction of the SNPs that were called homozygous
might actually be heterozygous, but, on the other hand, virtually none of the SNPs that
were called heterozygous are false positives (i.e. homozygous). 85% of the SNPs that
are called homozygous are supported by more than 1 read and 100% of the
heterozygous called SNPs are supported by more than one read. This defines the
lower limit of at least 14% heterozygous in the sequenced SD. As there will be some
24
residual heterozygosity in the 15% of the homozygous calls that are only supported by
a single sequence read the fraction of heterozygotes will actually be slightly larger.
Computing functional predictions. In the promotor meta-alignment approach, a
collection of position weight matrices (PWMs) that associate Transcription Factors
(TF) to nucleotide sequence motifs (called TFs-maps) are aligned between rat and
mouse promoters to identify conserved TFBS11. The potential effect of SNPs in
posttranscriptional regulation was assessed - the microRNA target prediction program
miRanda12 was used to scan all the regions containing SNPs in 3'UTR for rat
microRNA target sites. 132 SNPs in 3’UTR regions affect microRNA targets.
MicroRNAs are single-stranded RNA molecules about 21-23 nucleotides in length,
which are complementary to sites in 3'UTR regions.
Linkage disequilibrium and haplotype structure. For comparison of cross-species
haplotype block size correlation between mouse and rat strains we used full sets of
SNPs/strains. Using Ensembl compara database we established anchor points –
orthologous nucleotides between rat and mouse genomes with distance of 10 kb
between consecutive anchors (distance based on rat genome assembly). Taking one
anchor point at a time we determined the size of the haplotype block that contains the
anchor in mouse and in rat genomes. For a complete set of 63,064 anchors we
calculated a Spearman correlation between mouse and rat block sizes. In this analysis
we completely skipped regions with undefined orthology and also SNP-poor regions,
i.e. if anchor point belongs to an interval with distance between two adjacent SNPs
exceeding 500 kb.
QTL Analysis. The LOD threshold was determined by at least 1,000 genome wide
permutations for each trait with a significance level of 0.05. The Forward Regression
Method was used and CIM was performed with a walk speed of 2 cM, CIM Model 6,
5 control markers and a 10 cM Window size adjacent to both sides of the control
25
markers. The calculated LOD scores are the result of the comparison of two nested
hypothesis (H0 with likelihood L0 is nested within H1 with likelihood L1) and is two
times the negative natural log of the ratio of the likelihoods (-2ln(L0/L1). The
calculated QTL were counted separately if their peaks showed a minimal distance of 5
cM and at least 1 LOD difference between top and valley. Note that 8 sublines were
included in the composite interval mapping. These strains - extended with the
character B, C or D at the end of their name - are not completely independent like the
other RI lines since they were derived between the 7th and 11th generation from the
respective progenitor line. Hence, their genetic variance is by the factor 3 - 4 lower if
compared with other RI strains. About 25% of SNPs out of 20,283 show a
polymorphism between the parental strains; usual 12-16% in independent RI strains
and 3 -5 % in the sublines. This different genomic character has two effects on the
calculation conditions: first, it reduces the average numbers of recombinations and
second, these sublines could bias the detection of the QTL. However, the high number
of markers and the resulting very dense genetic map allow neglecting the subline
effect. A genome-wide permutation test subsequently followed by composite interval
mapping using only 27 independent RI strains revealed almost unchanged threshold
values and an almost identical QTL pattern. Since the additional recombinations of
the sublines in comparison to the independent lines are spread across the whole
genome, they do positively contribute to the detection of putative QTL.
Simulations on genome wide association. A simulator program was written in Perl
which takes the observed SNP genotypes across the 100 most diverse strains as fixed,
and then simulates QTL of defined magnitudes at any number of randomly selected
SNPs and adds a random Normally –distributed environmental effect of defined
variance. For a simulated QTL with variance V at a SNP with minor allele frequency
p, the magnitude of the allelic effect is [V/(4p(1-p))]0.5. Thus in the simulations, the
allelic effects vary in order to take account of the allele frequencies of the SNPs
selected to be QTL. Once the phenotypes have been simulated then 3-SNP haplotype
association is performed at every SNP, and the significance measured by the negative
base 10 logarithm of the ANOVA P-value (logP). The genome-wide maximum logP,
or logPmax, was used to define thresholds for genome-wide significance. The simulator
26
was run using several parameter settings: (i) To establish genome-wide significance
levels under the null hypothesis of no genetic effect, 100 simulations were run where
no QTL were present. The median logPmax across all simulations was 4.10, and which
used as a threshold for genome-wide significance; on average half of all genome scans
would exceed this threshold once. (ii) Fifty minor QTL each accounting for 1% of the
variance (i.e. 50% genetic variance), an infinitesimal model in which each QTL
should be undetectable but where half the variance can be explained as differences
between strains. (iii) A single major QTL accounting for 50% of the total phenotypic
variance. (iv) Ten minor QTL each accounting for 5% of the variance (i.e. 50%
genetic variance in total). Note that, because it is theoretically possible to make
unlimited numbers of replicate observations on inbred strains, the fraction of variance
attributable to environmental effects can be reduced by working with the averages of
R replicates of each of the 100 strains. Thus if environmental effects account for the
fraction Ve of the phenotypic variance in one animal (so that Vg + Ve = 1, where is the
fraction of variance explained by all the QTL and for simplicity we consider only
additive genetic effects), they will account for Ve/(RVg + Ve) of the variance, in the
mean phenotype measured across the R replicates. Hence the scenario (iii) could
correspond either to a one-replicate experiment in which the major QTL explains 50%
of the variance or to a 10-replicate experiment in which the major QTL explains only
9.1% of the variance. In practice there are constraints in maintaining a constant
environment over a large number of replicates, and the scenarios simulated here are
consistent with realistic number of replicates (say 5-10) being used.
Supplementary Note
Polymorphisms in the Rat genome and generation of a SNP map
To assess the utility of the SNPs for rat genetics we genotyped a subset (n=20,283) in
167 inbred strains, 2 RI panels of 31 and 33 strains, and tested a subset (n=9,691) in
89 F2 animals. The allele-frequencies of SNPs across the inbred strains are
approximately evenly distributed between 3% and 50%. The genotyping across the
320 individual rats reported here was carried out with two different multiplex
technologies (Illumina GoldenGate Assay8 and Affymetrix Targeted Genotyping9,10),
27
each producing roughly 50% of the data (see Methods and supplementary Methods).
We assayed seven 1,536-plex assays that can be run on the Illumina BeadLab station
(10,752 SNPs). In parallel, we have developed a 10K rat Targeted Genotyping panel
to be run on an Affymetrix platform (9,691 SNPs). The rationale behind this
separation was to evaluate the appropriate technology for efficient SNP genotyping in
the rat. These genotyping tools now are commercially available from Illumina and
Affymetrix, respectively. Both panels together yielded 20,283 validated SNPs. We
genotyped 1,057 SNPs with both platforms in 231 rats as a control with a concordance
of 99.8%. Furthermore, we constructed separate phylogenetic trees using data from
each genotyping platform and found that the clustering is very stable even for those
nodes that are not supported by a high bootstrap value (supplementary Figures 1a, b,
c).
Accessibility of the data
Our data is freely accessible through the following web sites: (i) STAR websiteURL3,
(ii) EnsemblURL4, (iii) Wellcome Trust Center for Human GeneticsURL5, URL6, (iv) Rat
Genome DatabaseURL7, and (v) GeneNetworksURL8. SNPs have been deposited in
dbSNP and sequence traces in the NCBI Trace ArchiveURL2.
The complete set of novel SNPs reported and the entire set of genotypes across
all rat strains are publicly accessible to view in detail and to download via Ensembl,
where we have developed new interfaces to facilitate the use of this information. The
BioMart tool provides a particularly flexible interface to this data, where both
genomic position queries and gene list queries can be used to select a set of SNPs
which are polymorphic between two strain combinations. STAR SNPs are fully
integrated with SNPs from other sources on Ensembl overview displays such as
ContigView. Variation-specific interfaces include TranscriptSNPView13, which
displays both strain specific consequences of SNPs within a specific Ensembl
transcript and the extent of sequence coverage for those strains for which we
generated sequence. Beyond the visual displays sequence and genotype data is
available for downloadURL4, URL5. The full extent of the sequencing coverage is
displayed on the Rat SequenceAlignView (available from the “Resequencing
28
Alignment” link to the left of the main genome views) page where the sequence of the
STAR strains, the SD strain and the F344/Stm strain is accessible as a multiple
alignment; this is most useful for the SD strain where there is reasonable genome
coverage. Advanced users can access the data through the Ensembl API for
bioinformatics analysis (for additional details seeURL4). A tool to select subsets of
SNPs is availableURL5,6, and visualization of polymorphisms along chromosomes is
availableURL6. To facilitate access to the data in the context of additional information
the data represented here has been integrated in other databases, namely RGDURL7 and
GeneNetworksURL8. As well as the documentation on the Ensembl website to provide
a more in depth introduction to this data, specific tailored courses for using this
information are available; these can be run at any institution. (For more information,
contact [email protected].) Finally, data presented for functional assessment of
the SNP consequence type is comprehensively availableURL9.
Supplementary URL List
URL1: http://www.anim.med.kyoto-u.ac.jp/nbr/phenotype URL 2 http://www.ncbi.nlm.nih.gov/sites/entrez?db=Nucleotide; URL 3 http://www.snp-star.eu; URL 4 http://www.ensembl.org/; URL 5 http://gscan.well.ox.ac.uk/gsBleadingEdge/rat.snp.selector.cgi; URL 6 http://www.well.ox.ac.uk/rat_mapping_resources/SNPbased_maps.html; URL7 http://rgd.mcw.edu/strains/ URL 8 http://www.genenetwork.org/; URL 9 http://bg.upf.edu/funcSTAR.
29
Reference List
1. Benjamini, Y. & Yekutieli, D. Quantitative trait Loci analysis using the false
discovery rate. Genetics 171, 783-790 (2005).
2. Wilder, S. P. et al. Integration of the rat recombination and EST maps in the rat genomic sequence and comparative mapping analysis with the mouse genome. Genome Res. 14, 758-765 (2004).
3. Pravenec, M., Klir, P., Kren, V., Zicha, J., & Kunes, J. An analysis of spontaneous hypertension in spontaneously hypertensive rats by means of new recombinant inbred strains. J.Hypertens. 7, 217-221 (1989).
4. Gauguier, D. et al. Chromosomal mapping of genetic loci associated with non-insulin dependent diabetes in the GK rat. Nat.Genet. 12, 38-43 (1996).
5. Shisa, H. et al. The LEXF: a new set of rat recombinant inbred strains between LE/Stm and F344. Mamm.Genome 8, 324-327 (1997).
6. Tachibana, M. et al. Quantitative trait loci determining weight reduction of testes and pituitary by diethylstilbesterol in LEXF and FXLE recombinant inbred strain rats. Exp.Anim 55, 91-95 (2006).
7. Mashimo, T., Voigt, B., Kuramoto, T., & Serikawa, T. Rat Phenome Project: the untapped potential of existing rat strains. J.Appl.Physiol 98, 371-379 (2005).
8. Oliphant, A., Barker, D. L., Stuelpnagel, J. R., & Chee, M. S. BeadArray technology: enabling an accurate, cost-effective approach to high-throughput genotyping. Biotechniques Suppl, 56-1 (2002).
9. Hardenbol, P. et al. Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat.Biotechnol. 21, 673-678 (2003).
10. Hardenbol, P. et al. Highly multiplexed molecular inversion probe genotyping: over 10,000 targeted SNPs genotyped in a single tube assay. Genome Res. 15, 269-275 (2005).
11. Blanco, E., Messeguer, X., Smith, T. F., & Guigo, R. Transcription factor map alignment of promoter regions. PLoS.Comput.Biol. 2, e49 (2006).
12. John, B. et al. Human MicroRNA targets. PLoS.Biol. 2, e363 (2004).
13. Cunningham, F. et al. TranscriptSNPView: a genome-wide catalog of mouse coding variation. Nat.Genet. 38, 853 (2006).