3
letter nature genetics • volume 21 • march 1999 323 Reliable identification of large numbers of candidate SNPs from public EST data Kenneth H. Buetow 1 , Michael N. Edmonson 2 & Anna B. Cassidy 2 1 Laboratory of Population Genetics, NCI, NIH, Bethesda, Maryland 20892, USA. 2 Fox Chase Cancer Center, Philadelphia, Pennsylvania 19111, USA. Correspondence should be addressed to K.H.B. (e-mail: [email protected]). High-resolution genetic analysis of the human genome promises to provide insight into common disease susceptibility. To per- form such analysis will require a collection of high-throughput, high-density analysis reagents. We have developed a polymor- phism detection system that uses public-domain sequence data. This detection system is called the single nucleotide polymor- phism pipeline (SNPpipeline). The analytic core of the SNPpipeline is composed of three components: PHRED, PHRAP and DEMIGLACE. PHRED and PHRAP are components of a sequence analysis suite developed to perform the semi-auto- mated analysis required for large-scale genomes 1,2 (provided courtesy of P. Green). Using these informatics tools, which exam- ine redundant raw expressed sequence tag (EST) data, we have identified more than 3,000 candidate single-nucleotide polymor- phisms (SNPs). Empiric validation studies of a set of 192 candi- dates indicate that 82% identify variation in a sample of ten Centre d’Etudes Polymorphism Humain (CEPH) individuals. Our results suggest that existing sequence resources may serve as a valuable source for identifying genetic variation. We conducted a test of the feasibility of this approach to SNP identification using the 9-29-97 version of the NCBI UNIGENE sets 3 . To increase the information content available to the pipeline (and to reduce overhead), we restricted the data analysed to clusters for which a minimum of 10 chromatograms were available from the Washington University FTP web site (ref. 4; http://genome.wustl.edu/est/esthmpg.html). This approach reduced the number of UNIGENE sets considered to 8,262. The SNPpipeline identified 3,089 ‘candidate’ SNPs with scores of 0.99 or greater from these sets (Fig. 2). We interpret a score of this magnitude to mean that there is heterogeneity in nucleotide composition at this location in this assembly with 99% certainty. These ‘high-confidence’ candidate SNPs are dis- tributed among 1,963 UNIGENE sets, 628 of which contain multiple candidate SNPs. Only 164 of 3,089 (5.3%) candidates identified here represent duplicates of those described by the Whitehead Genome Center 5 . We performed two sets of experiments to determine whether the candidate SNPs represent true polymorphic sites. In the first set of experiments, we designed short (on average 156 bp) sequence tagged sites (STSs) to test 96 candidate SNPs. Most (83/96; 86%) STSs produced a single PCR-amplification product. We then bi-directionally sequenced the STSs using a pool of DNA samples from ten independent CEPH samples. We inspected the sequence data from this pooled sample for bi-directional evidence of multiple bases at the same location when assembled with the data originally used to identify the candidate. In cases in which no evidence of multiple bases was observed in the pooled sample, we sequenced the ten individual samples and inspected the sequences for evidence of multiple or alternative bases at the candidate loca- tion within the assembly. To validate the pooled sample outcomes we also sequenced ten SNPs identified in the pooled set in the ten individual samples. Only 59 of 83 STSs produced bi-directional sequence of sufficient quality to evaluate reliably the candidate location. Among those, however, 54 (92%) had an alternative base. We observed 36 of 59 (61%) in the pooled sample. In all ten instances tested, the alternative base we observed in the pooled sample was also observed in one or more individual samples. In the second set of experiments we examined another set of 96 candidate SNPs. We selected high-scoring candidates for evalua- tion on the basis of whether the candidate nucleotide variant was detected as a change in a restriction endonuclease (RE) cleavage site. To determine this, we recovered RE target sequences, cleav- age patterns and manufacturers from the REBASE database (http://www.neb.com/rebase). We used the target sequences to scan the ‘high-confidence’ targets and to predict restriction digest patterns in the STSs. Larger STSs were designed (of approximately 600 bp) to facilitate scoring of the RE digestion products. A sample RE validation outcome is shown (Fig. 1). Fig. 1 RFLP confirmation of candidate SNP in UNIGENE set Hs.54515. The agarose gel contains two sets of experiments. The top of the gel shows alterna- tive restriction patterns observed in ten independent CEPH samples. The two most external lanes are Φ174/HaeIII fragments used as size markers. The next ten lanes (from left to right) show the CEPH digestion patterns. In this sample, only two of three digest patterns are observed. The bottom of the gel shows the pattern associated with CEPH family 1347. Immediately adjacent to the size standards are the two parents, 01 and 02. Flanked by the parents are ten off- spring samples (03-11,16). In this family all three possible genotypic alterna- tives are observed, confirming mendelian transmission. bp bp © 1999 Nature America Inc. • http://genetics.nature.com © 1999 Nature America Inc. • http://genetics.nature.com

document

  • Upload
    anna-b

  • View
    213

  • Download
    0

Embed Size (px)

Citation preview

Page 1: document

letter

nature genetics • volume 21 • march 1999 323

Reliable identification of large numbers of candidateSNPs from public EST data

Kenneth H. Buetow1, Michael N. Edmonson2 & Anna B. Cassidy2

1Laboratory of Population Genetics, NCI, NIH, Bethesda, Maryland 20892, USA. 2Fox Chase Cancer Center, Philadelphia, Pennsylvania 19111, USA.Correspondence should be addressed to K.H.B. (e-mail: [email protected]).

High-resolution genetic analysis of the human genome promisesto provide insight into common disease susceptibility. To per-form such analysis will require a collection of high-throughput,high-density analysis reagents. We have developed a polymor-phism detection system that uses public-domain sequence data.This detection system is called the single nucleotide polymor-phism pipeline (SNPpipeline). The analytic core of theSNPpipeline is composed of three components: PHRED, PHRAPand DEMIGLACE. PHRED and PHRAP are components of asequence analysis suite developed to perform the semi-auto-mated analysis required for large-scale genomes1,2 (providedcourtesy of P. Green). Using these informatics tools, which exam-ine redundant raw expressed sequence tag (EST) data, we haveidentified more than 3,000 candidate single-nucleotide polymor-phisms (SNPs). Empiric validation studies of a set of 192 candi-dates indicate that 82% identify variation in a sample of tenCentre d’Etudes Polymorphism Humain (CEPH) individuals. Ourresults suggest that existing sequence resources may serve as avaluable source for identifying genetic variation.We conducted a test of the feasibility of this approach to SNPidentification using the 9-29-97 version of the NCBI UNIGENEsets3. To increase the information content available to thepipeline (and to reduce overhead), we restricted the dataanalysed to clusters for which a minimum of 10 chromatogramswere available from the Washington University FTP web site (ref.4; http://genome.wustl.edu/est/esthmpg.html). This approachreduced the number of UNIGENE sets considered to 8,262.

The SNPpipeline identified 3,089 ‘candidate’ SNPs withscores of 0.99 or greater from these sets (Fig. 2). We interpret ascore of this magnitude to mean that there is heterogeneity innucleotide composition at this location in this assembly with99% certainty. These ‘high-confidence’ candidate SNPs are dis-tributed among 1,963 UNIGENE sets, 628 of which containmultiple candidate SNPs. Only 164 of 3,089 (5.3%) candidatesidentified here represent duplicates of those described by theWhitehead Genome Center5.

We performed two sets of experiments to determine whetherthe candidate SNPs represent true polymorphic sites. In the firstset of experiments, we designed short (on average 156 bp)sequence tagged sites (STSs) to test 96 candidate SNPs. Most(83/96; 86%) STSs produced a single PCR-amplification product.We then bi-directionally sequenced the STSs using a pool of DNAsamples from ten independent CEPH samples. We inspected thesequence data from this pooled sample for bi-directional evidenceof multiple bases at the same location when assembled with thedata originally used to identify the candidate. In cases in which noevidence of multiple bases was observed in the pooled sample, wesequenced the ten individual samples and inspected the sequencesfor evidence of multiple or alternative bases at the candidate loca-tion within the assembly. To validate the pooled sample outcomes

we also sequenced ten SNPs identified in the pooled set in the tenindividual samples. Only 59 of 83 STSs produced bi-directionalsequence of sufficient quality to evaluate reliably the candidatelocation. Among those, however, 54 (92%) had an alternativebase. We observed 36 of 59 (61%) in the pooled sample. In all teninstances tested, the alternative base we observed in the pooledsample was also observed in one or more individual samples.

In the second set of experiments we examined another set of 96candidate SNPs. We selected high-scoring candidates for evalua-tion on the basis of whether the candidate nucleotide variant wasdetected as a change in a restriction endonuclease (RE) cleavagesite. To determine this, we recovered RE target sequences, cleav-age patterns and manufacturers from the REBASE database(http://www.neb.com/rebase). We used the target sequences toscan the ‘high-confidence’ targets and to predict restrictiondigest patterns in the STSs. Larger STSs were designed (ofapproximately 600 bp) to facilitate scoring of the RE digestionproducts. A sample RE validation outcome is shown (Fig. 1).

Fig. 1 RFLP confirmation of candidate SNP in UNIGENE set Hs.54515. Theagarose gel contains two sets of experiments. The top of the gel shows alterna-tive restriction patterns observed in ten independent CEPH samples. The twomost external lanes are Φ174/HaeIII fragments used as size markers. The nextten lanes (from left to right) show the CEPH digestion patterns. In this sample,only two of three digest patterns are observed. The bottom of the gel showsthe pattern associated with CEPH family 1347. Immediately adjacent to the sizestandards are the two parents, 01 and 02. Flanked by the parents are ten off-spring samples (03−11,16). In this family all three possible genotypic alterna-tives are observed, confirming mendelian transmission.

bp bp

© 1999 Nature America Inc. • http://genetics.nature.com©

199

9 N

atu

re A

mer

ica

Inc.

• h

ttp

://g

enet

ics.

nat

ure

.co

m

Page 2: document

letter

324 nature genetics • volume 21 • march 1999

Only 49 of 96 (51%) produced PCR products. To assess whetherthis was simply a function of the larger STSs, we designed smallerSTSs (with a target size of 300 bp) to examine the 47 RE targets thatfailed to amplify in the second validation experiment. Twenty-threeredesigned STSs produced a clean, single-band amplification prod-uct, bringing the total PCR success rate in the second experiment to75%. Among the 72 successful PCR products, 4 (5.5%) producedsingle bands larger than those expected from the EST sequence.

We scored the 72 successful PCR amplimers for the presence ofRFLPs. Of these, 53 (74%) had variable restriction patterns in one ormore of 10 CEPH individuals examined. To test whether the variantsdescribed above were consistent with mendelian transmission, wegenetically mapped ten loci (four from the first experiment, six fromthe second). We performed RFLP analysis using the seven-familysubset of the CEPH panel used for reference map construction byboth Genethon and CHLC. In each case we observed mendeliantransmission and integrated the locus into the reference genetic mapwithout the introduction of additional recombination events.

To determine whether SNPs identified by this approach haveuseful heterozygosities when examined in a larger set of samples,we validated 10 SNPs identified as RFLPs by screening 90 inde-pendent CEPH individuals. We observed the fraction of het-erozygous individuals to range from 0.1023 to 0.5281. The meanvalue for the 10 markers examined was 0.38. In only one instancedid we observe the frequency of the rare allele to be less than 0.10.

Our validation experiments indicate an overall success rate of82% among candidates from which a PCR product was obtained.The PCR success rate (81%) for these 3´ UTR loci is comparablewith that seen in other STS-based projects5. The PCR develop-ment cost associated with these gene-based reagents is compara-ble with development costs for PCR-based genomic SNPprojects. This can be quantified by comparing the ‘raw’ discoveryrates with those associated with other SNP identificationapproaches. If all 192 STSs targets constructed are considered as adenominator, the 107 SNPs identified by this mining approachresult in an overall yield of 56%. This compares favourably withthe 14% rate observed without mining by the Whitehead groupfrom DNA chip-based re-sequencing of STS targets5.

We identified SNPs at a lower rate (82%) than the candidateSNP score threshold of 99%. The SNP score reflects the proba-bility that a position within a given assembly has heterogeneityin nucleotide composition. This is influenced by the number ofsequences, the sources of the libraries and the diversity oflibraries in an assembly. The libraries used for gene discovery bythe various EST sequencing projects are not selected to be repre-sentative of the European-American sample used in the valida-tion studies. Therefore valid sequence differences obtained frommore diverse populations may not be observed in the currentvalidation population. Even if they are present in the populationexamined, the validation sample size of ten used may be toosmall to confirm anything but common variants.

It is possible to assess objectively whether failure to find pre-dicted variants is attributable to their frequency in the popula-tion. We determined approximate allele frequencies of thecandidate SNPs from the library and sequence data that com-pose the assemblies using gene counting allele estimation meth-ods. We estimated allele frequencies by assuming that eachlibrary showing only one variant is homozygous for thatnucleotide and each library contributing both alternative allelesis heterozygous. As might be expected, we observed a significantdifference (t=2.96, d.f.=109, P=0.004) in rare allele frequencyfor candidates that were positive in the validation experiment(0.38) when compared with the negative outcomes (0.26).

We speculate that the results obtained to date represent apromising beginning. The volume of sequence data has grownappreciably since this pilot was initiated. The current collectionof candidates was chosen from a limited number of UNIGENEsets chosen to have a large number of chromatographs avail-able. This is not required. The SNP score threshold of 99% isalso arbitrary. If this value were reduced to 90%, theSNPpipeline would detect 4,523 candidates among the 8,262UNIGENE sets examined.

Detailed experimental outcomes, including primers, STSs, SNPlocations, chromatographs and SNP assemblies, are available(http://www.chlc.org/cgai). A prototype SNP detection engineusing the software applied here is available through this site.

Fig. 2 Candidate SNP withinUNIGENE set Hs.83816. Left,PHRAP-generated multiplesequence assembly and align-ment from sequence datacalled using PHRED. The pur-ple vertical bar shows thelocation of the candidatevariant with the less frequentnucleotide in red. The box (tothe immediate right of thebutton ‘Prev’) indicates theSNP score for this candidate.Additional SNPs in this assem-bly can be viewed by clickingon the radio buttons. To theright of the assembly is aninformation window provid-ing the location and qualityinformation (amplitude, posi-tion and PHRAP quality score)associated with the callednucleotide (and alternative ifone was present). Below thisbox are chromatographs thatcorrespond to individualsequences in the assembly.The information box andchromatographs are dis-played by clicking on a givennucleotide in the assembly.

© 1999 Nature America Inc. • http://genetics.nature.com©

199

9 N

atu

re A

mer

ica

Inc.

• h

ttp

://g

enet

ics.

nat

ure

.co

m

Page 3: document

letter

nature genetics • volume 21 • march 1999 325

MethodsSNPpipeline. The SNPpipeline brings a sequence and its ancillary informa-tion, which have been previously identified to be a ‘set’, from public datasources to a local data store. If trace information was available, the sequencewas re-called with quality scores assigned to each nucleotide (via PHRED).To reduce noise, low quality sequence (as determined from features orPHRED) was trimmed from each member. The members of the set werethen assembled/aligned (via PHRAP) and candidate variants identified andscored (by DEMIGLACE). The results of these analyses were stored in anextended version of the CHLC database6. Primers were designed for all highprobability ‘candidate’ polymorphisms using the PRIMER program(obtained from the Whitehead Genome Center). We used PCR conditionssimilar to those currently used to design primers for the CHLC markers.DEMIGLACE is the name given to the tools (generated here) that identifycandidate polymorphisms from sequence assemblies. Like PHRAP, it usesnucleotide quality measures to discriminate regions likely to contain vari-ants from those where differences are likely due to low sequence quality.Paralogue clustering problems are addressed by assessing the genetic dis-tance between members of sequences, which have been grouped into clus-ters. To address low sequence quality, we developed tools that permit usersto examine the sequences trace data at the location of ‘putative’ polymor-phisms. Furthermore, we developed a SNP prediction statistic that uses thebase-calling quality information to assess objectively the probability thatmore than one base is present at a given location within an assembly. Candi-date SNPs are inspected through a graphical user interface (Fig. 2). Thisinterface allows the user to step through the assembly inspecting multiplecandidate SNPs. It is also possible to display graphically the chromatographsand quality information for alternative calls. The tool also permits inspec-tion of the clusters constructed through the tree-building process. TheSNPpipeline is written primarily in Perl 5, using the Tk extension for graph-ical interfaces and the Sybperl module for database access. For accessibilityvia the Internet, we implemented a subset of the tools in Java. Each clusterwas converted into FASTA format and assembled with PHRAP, substitutingsequence and quality data from local EST data where available. Followingset refinement, subclusters were searched for SNPs.

Set refinement. To identify the related members in sets, a distance matrixbased on substitutions observed between all possible pairs of sequence wascalculated. We used this matrix of distances to construct trees that showedthe degree of relatedness among sequences. In this application the Jin-Neidistance metric7 was calculated. We constructed trees using the UPGMAmethod8. To identify clusters within each contig, the tree was traversedbetween each sequence from each node. The mean distance (and variance)was used to define clusters within the contig such that all distances werewithin the 95% confidence interval of the mean. We performed analysisusing the programs from the PHYLIP analysis suite (courtesy of J. Felsen-stein; http://evolution.genetics.washington.edu/phylip.html).

Identifying potential SNPs. Sequence and alignment data (from PHRAP .acefile), quality scores for the sequence base calls (from PHRED quality files),distance information (from PHYLIP dnadist and neighbour programs) and

1. Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Base-calling of automated sequencertraces using phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).

2. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II.Error probabilities. Genome Res. 8, 186–194 (1998).

3. Schuler, G.D. Pieces of the puzzle: expressed sequence tags and the catalog ofhuman genes. J. Mol. Med. 75, 694–698 (1997).

4. Hillier, L. et al. Generation and analysis of 280,000 human expressed sequencetags. Genome Res. 6, 807–828 (1996).

5. Wang, D.G. et al. Large-scale identification, mapping, and genotyping of single-

nucleotide polymorphisms in the human genome. Science 280, 1077–1082(1998).

6. Murray, J.C. et al. A comprehensive human linkage map with centimorgandensity. Cooperative Human Linkage Center (CHLC). Science 265, 2049–2054(1994).

7. Jin, L. & Nei, M. Limitations of the evolutionary parsimony method ofphylogenetic analysis. Mol. Biol. Evol. 7, 82−102 (1990).

8. Sokal, R.R. & Sneath, P.H.A. Principles of Numerical Taxonomy (W.H. Freeman, SanFrancisco, 1963).

base-calling data (from PHRED ‘−d’ switch) were loaded into memory. Wealigned the sequences, and examined each vertical chunk (‘slice’) of theresulting assembly for disagreement. Any such slice was considered a candi-date SNP (DEMIGLACE). A number of filters are used by DEMIGLACE toeliminate slices that are not likely to represent true polymorphisms. Theseinclude filters that: (i) exclude sequences in any given slice from SNP consid-eration where neighbouring sequence quality scores drop 40% or more; (ii)exclude calls in which peak amplitude is below the fifteenth percentile of allbase calls for that nucleotide type; (iii) disqualify regions of a sequence hav-ing a high number of disagreements with the consensus from participating inSNP calculations; (iv) remove from consideration any base call with an alter-native call in which the peak takes up 25% or more of the area of the calledpeak; (v) exclude variations that occur in only one read direction. PHREDquality scores were converted into probability-of-error values for eachnucleotide in the slice. We used standard Baysian methods to calculate theposterior probability that there is evidence of nucleotide heterogeneity at agiven location. The prior probability used was 0.001.

Validation experiments. We validated candidate SNPs either throughsequencing of CEPH individuals or RFLP analysis. We used CEPH individ-uals 133101, 133102, 133301, 133302, 134001, 134002, 134101, 134102,134501, 134502 (sequencing experiment) or 133101, 133102, 133201,133202, 134701, 134702, 136201, 136202, 141301 and 141302 (RFLPanalysis) in the analysis. We used mendelian transmission and geneticmapping in the seven-family subset 884, 1331, 1362, 1413, 1416, 1332 and1347 of the CEPH panel used in reference map construction5 to validatethe SNPs.

STS development. Using primers designed by the SNPpipeline, weobtained optimal PCR conditions using standard reaction mixes and vary-ing annealing temperature. We prepared and amplified PCR reactions (5 µlreaction volume; 0.2 mM dNTPs, 1×GeneAmp 10×PCR buffer (PerkinElmer), 0.5 µM primers, 15 ng DNA and either AmpliTaq or AmpliTaqGold DNA Polymerase (Perkin Elmer)) using the ABI Prism 877 IntegratedThermal Cycler (Perkin Elmer). We then evaluated STS products produc-ing a single band on 2% agarose gels by sequencing or RFLP analysis.

Sequencing. We purified PCR products (15 µl) using Wizard PCR PrepsDNA purification system (Promega). We sequenced purified PCR productsbi-directionally using the ABI 377XL with ABI Prism dRhodamine termi-nator cycle sequencing ready reaction kit with AmpliTaq DNA Polymerase,FS (Applied Biosystems).

RFLP analysis. We digested amplified products (5 µl) with RE (2 U) in theappropriate reaction buffer in a final volume (10 µl) and incubated at theappropriate temperature for 1 h. We ran digested products on a 2% agarosegel containing 0.5% ethidium bromide, ultraviolet transilluminated andphotographed the gels.

Received 17 November 1998; accepted 27 January 1999.

© 1999 Nature America Inc. • http://genetics.nature.com©

199

9 N

atu

re A

mer

ica

Inc.

• h

ttp

://g

enet

ics.

nat

ure

.co

m