
Finding one's way through DNA


As increasingly detailed maps of the human genome are generated, and the Human Genome Project (HGP) moves into the gene-discovery/sequencing phase, a new set of problems confronts human genomists. They include: how to use mapping data efficiently to locate genes of interest; how to increase the rate at which DNA sequences are generated; and how to interpret the sequence data to infer gene function. All of these issues are becoming more pressing as the amount of genome data increases, and these and other problems facing the genome community were the theme of a recent meeting*. The problems of genome mapping, DNA sequencing and the use of these data to understand human biology were addressed by 20 speakers from academic, government and biotechnology laboratories.

Improving sample throughput

Genome mapping is producing maps of the human genome at a resolution that is both enhancing positional-cloning strategies and making linkage-based molecular diagnostics possible. Techniques for improving throughput, both physically (i.e. in the number of samples handled) and in the analysis of the data generated, are important for gene localization and discovery, as well as for molecular diagnostics.

Techniques and instruments are being developed using horizontal ultrathin gel electrophoresis (HUGE) technologies, which could enable genotyping (12 markers) of >800 samples per hour (Robert L. Brumley, GeneSys Technologies Inc., Mazomanie, WI, USA). These systems will be further enhanced by computer algorithms that can assist in deconvoluting the data to enable allele assignment in polymerase chain reaction (PCR)-based experiments using polymorphic genetic markers such as CA-repeats (Mark W. Perlin, Carnegie Mellon University, Pittsburgh, PA, USA). Perlin also discussed a technique called inner product mapping, which allows consistent, integrated maps to be made from independently generated data based on mapping sequence-tagged sites (STSs) to yeast artificial chromosomes (YACs) and radiation hybrids.

*The meeting 'DNA Sequencing, Mapping, and Bioinformatics', organized by IBC, was held in San Francisco, CA, USA, 28-29 July 1994.

Another way of increasing the throughput in genotyping for disease-gene localization is to multiplex the process at any one of several stages (Tim P. Keith, Collaborative Research Inc., Waltham, MA, USA). Multiplexing, i.e. processing a number of samples in parallel, can be applied to the PCR step, by using multiple primers, and to the electrophoresis step, by combining several samples per lane, which are later deconvoluted at the hybridization stage using probes that identify a unique primer in each lane. Scientists at Collaborative Research have used this methodology to screen >500 markers in a genome-wide survey for the gene(s) associated with manic depressive illness. Other disease genes that are being sought in genome-wide screening using other methods include the gene(s) for non-insulin-dependent diabetes mellitus (Patricia Killian, Glaxo Research Institute, Triangle Park, NC, USA).

Increasing the efficiency of DNA sequencing

Increases in the efficiency of DNA sequencing will come both from greater automation and fine-tuning of existing methodologies, and from the development of new non-electrophoretic methods. Automation is already enabling individual laboratories to achieve sequencing rates approaching 1 × 10⁶ bases of finished sequence per year. The 'dream' system for DNA sequencing would be, according to Lloyd Smith (University of Wisconsin, Madison, WI, USA), a turnkey system, where DNA is put in at one end of the system and finished, assembled DNA sequence comes out at the other, at the rate of approximately one 40 kb cosmid per day. Clearly, reaching this goal will involve automating as many of the steps in the process as possible. Using robotics to assemble sets of 'good' templates, i.e. clones that will contribute unique information to the final assembled sequence, is an important starting point in any large-scale sequencing project (John Mulligan, Darwin Molecular Technologies Inc., Bothell, WA, USA). Automation is an important part of a megabase-per-year sequencing effort at the Lawrence Berkeley Laboratory (LBL) Human Genome Center (Berkeley, CA, USA), described by Martin J. Pollard. LBL has a large-scale sequencing effort involving both human and Drosophila DNA, which uses automation at all stages from clone picking to oligonucleotide synthesis, DNA preparation, PCR assays and sequencing reactions.

The throughput of DNA sequencing can also be enhanced by maximizing the amount of sequence data obtained per sample sequenced. By using ultrathin gels and uncoupling run temperatures and voltages, it is possible to read consistently >800 bases per reaction (Dean Burgi, Genomyx Corp., S. San Francisco, CA, USA). For these very long reads, template quality and gel and sample handling are very important. The yield of sequence data from automated sequencing machines can also be improved by using, in the sequencing reactions, enzymes that have been engineered for good processivity, while minimizing artefacts caused by unfavorable properties such as exonuclease activity (Richard D. Abramson, Roche Molecular Systems Inc., Alameda, CA, USA).

Capillary electrophoresis (CE) is also playing a part in optimizing DNA sequencing. CE has many of the properties that make ultrathin gels so useful, i.e. superior heat dissipation and short run times. The use of new matrices has overcome the problems originally encountered in casting acrylamide gels in capillaries, and this has enabled new gels to be made quickly and reproducibly (Kenneth D. Konrad, Beckman Instruments Inc., Fullerton, CA, USA). Instruments incorporating CE are being developed and will have cycle times (the time from the beginning of one run to the beginning of the next) of the order of 90 minutes. So what should we expect from these enhancements to current sequencing technology? A 'back-of-the-envelope' calculation suggests that a 48 capillary machine, capable of reading 500 bases per run, with a two-hour cycle time and making four runs a day, would have a throughput of nearly 100 000 bases of raw sequence per day.
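The back-of-the-envelope figure can be checked directly. The following is a minimal sketch; the function and parameter names are illustrative and do not come from the meeting, and the round-the-clock bound is an added assumption rather than something any speaker claimed.

```python
def raw_bases_per_day(capillaries: int, bases_per_run: int, runs_per_day: int) -> int:
    """Raw (unassembled) bases read per day by one instrument."""
    return capillaries * bases_per_run * runs_per_day


def max_runs_per_day(cycle_time_hours: float, hours_available: float = 24.0) -> int:
    """Upper bound on runs per day, ASSUMING the machine could run around the clock."""
    return int(hours_available // cycle_time_hours)


# Figures quoted in the text: 48 capillaries, 500 bases per run, four runs a day.
quoted = raw_bases_per_day(capillaries=48, bases_per_run=500, runs_per_day=4)
print(quoted)  # 96000
```

The product is 96 000 bases, which is the 'nearly 100 000' quoted above. Four two-hour cycles fill only eight hours, so under the continuous-operation assumption `max_runs_per_day(2.0)` gives 12 runs and a correspondingly higher ceiling.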

Meeting report

© 1994, Elsevier Science Ltd TIBTECH OCTOBER 1994 (VOL 12)


Multiplex sequencing has been brought on-line at Collaborative Research (Gerard F. Vovis), where 10 Mb of raw DNA sequence has been generated, yielding >900 kb of finished sequence from a mycobacterium, and ~50 kb of human genomic sequence. Using what would now be considered conventional technologies, Lee Rowen (University of Washington School of Medicine, Seattle, WA, USA) has sequenced >600 kb of the human, and ~400 kb of the mouse T-cell receptor loci. Their current throughput is approximately one cosmid's worth of finished sequence per week.

There was a general consensus among individuals involved in large-scale sequencing projects that the real bottlenecks for such projects in the future will be in the areas of data acquisition, sequence assembly and data analysis. One area of data acquisition that is being improved is automatic base calling on automated-sequencing machines, particularly in assigning a reliability to each base call. The ability to assign such probabilities would make assembly easier and faster and would reduce the total number of errors (Michael G. Walker, Stanford School of Medicine, Stanford University, Stanford, CA, USA).
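To see why a per-base reliability helps assembly, consider one column of overlapping reads that disagree. The sketch below is a deliberately crude heuristic (summing the probability that each call is correct) and is not the method presented at the meeting; the function name and the example values are hypothetical.

```python
from collections import defaultdict


def consensus_base(calls):
    """Pick the base with the greatest summed reliability.

    calls: list of (base, probability_correct) pairs covering one
    alignment column. Summing probabilities is a simplifying assumption.
    """
    weight = defaultdict(float)
    for base, p_correct in calls:
        weight[base] += p_correct
    return max(weight, key=weight.get)


# Hypothetical column: one confident 'A' against two doubtful 'G' calls.
column = [("A", 0.99), ("G", 0.55), ("G", 0.40)]
print(consensus_base(column))  # A
```

A simple majority vote would report G here; weighting by reliability lets the single high-confidence call prevail, which is the kind of error reduction the paragraph describes.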

New sequencing technologies

The other hope for improved sequencing efficiency is the development of new technologies. Considerable effort has gone into developing methods for DNA sequencing based on mass spectrometry. Using matrix-assisted laser desorption/ionization and time-of-flight mass spectrometry, oligonucleotides 40 nucleotides long can be resolved from those that are 41 nucleotides long, and chains of up to ~90 residues can be detected (Chris Becker, SRI International, Menlo Park, CA, USA). While mass-spectrometric methods are not yet developed sufficiently to make a significant impact on DNA sequencing or genome analysis, another non-conventional technique, sequencing by hybridization, appears ready to move into the research and, possibly, the medical-diagnostic laboratory (Robert J. Lipshutz, Affymetrix, Santa Clara, CA, USA). Arrays of oligonucleotides can be fabricated using techniques that are similar to photolithography, which has been very successful in the semiconductor industry. Arrays of all possible 8-mers (65 536) have been synthesized and tested.
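The figure of 65 536 is simply 4^8, the number of distinct sequences of length eight over the four bases. A minimal sketch follows, with illustrative function names; it treats hybridization as ideal (exact matches only, one strand) and says nothing about how the arrays discussed at the meeting are actually laid out.

```python
from itertools import product

BASES = "ACGT"


def all_kmers(k: int):
    """Yield every DNA sequence of length k, in lexicographic order."""
    for letters in product(BASES, repeat=k):
        yield "".join(letters)


def kmers_in(sequence: str, k: int) -> set:
    """Set of k-mers present in a sequence, i.e. the probes an ideal array would detect."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


n_probes = sum(1 for _ in all_kmers(8))
print(n_probes)  # 65536
```

For a target such as `"ACGTACGTA"`, `kmers_in(target, 8)` returns the two overlapping 8-mers it contains; reconstructing a sequence from such overlaps is the idea behind sequencing by hybridization.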

Data handling and analysis

Clearly, as more mapping and sequencing data are generated, the challenge of analyzing, storing and making these data available to a diverse user community will become greater. As of 15 June 1994, >37 000 loci, representing ~5000 genes, had been mapped on the human genome. Making human-genome data accessible to researchers, clinicians and biotechnologists is a primary function of the Genome Data Base (GDB) (David T. Kingsbury, Johns Hopkins University, Baltimore, MD, USA). GDB and other genetic- and sequence-orientated databases must not only give a very diverse user community access to the data stored in a particular database, but also provide a 'seamless' way of moving among databases to retrieve information that is relevant to the needs of a particular user. Kingsbury discussed a framework for a proposed federation of databases that would begin to address this issue.

Many patterns exist in DNA sequences that are critical to gene function. Regulatory regions, protein-coding regions, and RNA-processing sites are but a few of the types of sequences that can give an insight into gene function if they can be recognized reliably. Signal-processing technology, originally developed in the defense industry, has been applied to these problems by transforming DNA-sequence information using wavelet, cyclostationary and Fourier techniques and presenting this transformed data to a neural-network pattern recognizer (Terence R. Thompson, Metron Biomedical, Inc., Reston, VA, USA). Applying these methods to the problem of protein-coding-region recognition has met with some success and a high-speed prototype system is being built for this application.
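As one illustration of what a Fourier transform of a DNA sequence can look like, the sketch below measures signal power at a period of three bases, a periodicity widely reported in protein-coding regions because of codon structure. It is an assumption-laden toy, not the system described above: it uses no wavelet or cyclostationary analysis and no neural network, and the function name is hypothetical.

```python
import cmath


def period3_power(sequence: str) -> float:
    """Power at frequency 1/3, summed over the four base-indicator sequences.

    Each base is turned into a 0/1 indicator sequence; the discrete Fourier
    coefficient at frequency 1/3 is computed for each, and the squared
    magnitudes are summed and divided by the sequence length.
    """
    n = len(sequence)
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, b in enumerate(sequence)
            if b == base
        )
        total += abs(coeff) ** 2
    return total / n


periodic = "ATG" * 30   # artificial sequence with perfect 3-base periodicity
uniform = "A" * 90      # no 3-base structure at all
print(period3_power(periodic) > period3_power(uniform))  # True
```

In a real pipeline a score of this kind would be computed in a sliding window and passed, with other features, to a classifier; the two test sequences here are artificial extremes chosen only to make the contrast visible.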

Other computer-based approaches to pattern recognition in DNA sequences were also discussed. GRAIL is a very successful system that uses artificial intelligence and machine-learning techniques to locate biologically important features (protein-coding regions, transcription-control regions, repetitive-DNA elements, etc.) in DNA sequences (Richard J. Mural, Oak Ridge National Laboratory, Oak Ridge, TN, USA). The system can be accessed by e-mail, or through a client-server-based graphical user interface (XGRAIL) and is used by >1000 labs worldwide. It provides very accurate gene models, as well as a seamless interface to database-search tools (genQuest), and has proved useful in locating genes in anonymous genomic DNA sequences.

Once functionally important regions of DNA sequence have been determined, many approaches will be needed to gain understanding of the roles of their associated proteins in normal physiology and development. Databases that cross-reference protein-sequence data with the wealth of functional, structural and genetic data associated with these proteins are critical to our ability to use comparative-sequence analysis to identify gene function (Temple F. Smith, Boston University, Boston, MA, USA). High-throughput cDNA sequencing is providing another powerful approach to gene identification and the study of gene expression (Jeffery J. Seilhamer, INCYTE Pharmaceuticals Inc., Palo Alto, CA, USA). Keeping track of the quantities of various cDNAs found in libraries from different tissues and/or different developmental stages gives insight into the patterns of gene expression and may help in assigning gene function.
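The bookkeeping behind this approach amounts to counting how often each cDNA turns up in each library. A minimal sketch follows; the gene names, tissue names and clone lists are hypothetical toy data, and the function name is illustrative rather than taken from any system described at the meeting.

```python
from collections import Counter


def relative_abundance(clone_ids):
    """Fraction of sequenced clones in one library that match each gene."""
    counts = Counter(clone_ids)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


# Hypothetical libraries: each entry is the gene that a sequenced cDNA clone matched.
libraries = {
    "liver": ["geneA", "geneA", "geneA", "geneB"],
    "brain": ["geneB", "geneC", "geneC", "geneC"],
}
profiles = {tissue: relative_abundance(ids) for tissue, ids in libraries.items()}
print(profiles["liver"]["geneA"])  # 0.75
```

Comparing such profiles across tissues or developmental stages shows where a transcript is abundant, rare or absent, which is the kind of evidence the paragraph suggests may help in assigning function.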

There is little doubt that the explosion of gene-mapping and DNA-sequence data will continue, and that this will have a major effect on biomedical research and, ultimately, on human welfare. The tools for generating and interpreting these data are at hand, or under development, and the next ten years should be a heady time for the fledgling science of genomics.

Richard J. Mural

Biology Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA.
