0018-9162/02/$17.00 © 2002 IEEE July 2002 25

PERSPECTIVES

Computers Are from Mars, Organisms Are from Venus

In recent years, biology and computer science have undergone an increasingly intense interdisciplinary merger, one that has attracted media attention and eager investors. The resulting media buzz arises from the strenuous struggle as these two disciplines come together. Is the sound we hear the agonizing grind of unmatched gears, or the smooth hum of an accelerating machine?

I suspect the latter. Biology and computer science share a natural affinity.

Erwin Schrödinger envisioned life as an aperiodic crystal, observing that the organizing structure of life is neither completely regular, like a pure crystal, nor completely chaotic and without structure, like dust in the wind. Perhaps this is why biological information has never satisfactorily yielded to classical mathematical analysis. A simple look out the window, however, shows the great abundance of structure in biological objects, from the fractals found in the branches of an oak tree to the symmetries of DNA’s double helix.

Machine computations combine elegant algorithms with brute-force calculations—which seems a reasonable approach to this aperiodic structure. Likewise, computing seeks to create a machine that can flexibly solve diverse problems. In nature, such plastic problem solving resides uniquely in the domain of organic matter. Whether through historical evolution or individual behavior, organisms always adaptively solve the problems their environment poses. Thus, examining how organisms solve problems can lead to new computation and algorithm-development approaches.

BIOLOGY MEETS COMPUTER SCIENCE

Biology is the youngest of the natural sciences. When its collected information reaches a critical density, a natural science progresses from information gathering to information processing. This latter activity has become the dominant one in more mature sciences like physics, in which theoretical abstractions and predictions play a primary role because new information is scarce.

Until recently, the major activity in biology has been gathering new information in the lab and field. The growth of biological information in the past five years, especially at the molecular level, has been astonishing. The growth curve of the information stored in GenBank (http://www.ncbi.nlm.nih.gov/), the primary molecular-biology information database, follows an exponential curve that closely mimics Moore’s law—doubling every 18 months or so.

Combining cold silicon and hot protoplasm may constitute a marriage of opposites, but this union could produce genetics research prodigies.

Junhyong Kim
Yale University

When I started doing laboratory work about 10 years ago, it took us around five days to obtain 200 base pairs of DNA sequence data. Two years ago, biotechnology corporation Celera produced the fruit fly genome’s 170 million or so bases in a few months—a feat that researchers apparently can now perform in a few weeks. Current estimates indicate that the public-sector sequencing capacity devoted to just the Human Genome Project has reached about 28 million bases a month, with private capacity climbing to several times that figure.

This volume of information overwhelms us. No single person, or even a large group, can ever hope to make sense of this information, especially at the rate researchers produce it. By helping researchers process this vast collection of data, computer science can assist in dispersing this information storm.

Currently, the two most successful uses of computers in biology are comparative sequence analysis and in silico cloning—the process of using a computer search of existing databases to clone a gene.

Comparative sequence analysis

When researchers isolate new molecular sequence data in the laboratory, they want to know everything about that sequence. A first step is to see if other researchers have already studied any molecular sequences similar to this sequence. Two mechanisms generate similarity of sequences. Like human relatives, biomolecular sequences related by evolutionary descent share sequence similarities. The structural requirement of performing a particular function also constrains two molecules with similar function to resemble each other. Therefore, researchers can learn a great deal about a biomolecular sequence by comparing extrapolated information to similar, already well-studied sequences.

Probably the most widely used computational tool in biology, BLAST—Basic Local Alignment Search Tool—searches databases like GenBank for all sequences similar to a target sequence. These days, when researchers isolate a new molecular sequence, the first thing they usually do is run a BLAST search against existing databases.
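The local-alignment idea behind BLAST can be made concrete with the exact dynamic program it approximates: the Smith-Waterman algorithm. The sketch below is illustrative, not BLAST itself—BLAST gains its speed through heuristics, and the scoring values here (match, mismatch, gap) are simplified assumptions rather than the tuned substitution matrices real tools use.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Score the best local alignment between sequences a and b.

    Illustrative scoring parameters; production tools use tuned
    substitution matrices and affine gap penalties.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of any alignment ending at a[i-1], b[j-1];
    # the 0 in the max lets an alignment restart anywhere, making it local.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Because the alignment is local, a short query embedded anywhere inside a longer database sequence still earns its full match score.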

In silico cloning

Finding the information to perform successful cloning—an important activity in biology—is comparable to searching for a particular sentence in a book that resides somewhere in a huge library.

Finding a sentence might encompass searching by semantics, such as “find a sentence that expresses the angst of a young prince”; or by syntax, such as “find a sentence that starts with a preposition, present subjunctive verb,” and so forth; or by a pattern, such as “find a sentence that starts with ‘To be or not to be.’” Likewise, in an experimental setting, we may want to clone a gene by its phenotype, say, olfaction; by its structure, say, a G protein-coupled receptor; or by a pattern fragment, say, the DNA pattern “ACCAGTC.”

Performing these searches in the laboratory resembles physically going to the library to look for a book that contains the information we seek: Both activities present many logistical problems. Genome projects are somewhat like putting all the library’s books on a CD-ROM: Although doing so alleviates the physical access problems, other problems remain. Storing the genome data is like having all the books on a CD without help guides such as titles, annotations, and cross-references. Fulfilling the request to “find a sentence that expresses the angst of a young prince” would require much reading and interpretation. However, conducting a pattern search would simplify the task by helping to sift through the information rapidly. Similarly, using in silico cloning to search existing databases is a significant benefit of genome projects. Using biological semantics to search genome databases would be a major achievement.
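On a plain-text view of a genome, the pattern-fragment search reduces to string scanning. The sketch below, with an invented toy genome string, looks for a fragment on both strands; checking the reverse complement reflects the fact that a gene can lie on either strand of the double helix.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))

def find_pattern(genome, pattern):
    """Return start positions where pattern occurs on either strand."""
    hits = []
    for probe in (pattern, reverse_complement(probe := pattern) if False else reverse_complement(pattern)):
        start = genome.find(probe)
        while start != -1:
            hits.append(start)
            start = genome.find(probe, start + 1)
    return sorted(hits)

# Toy genome, invented for illustration:
positions = find_pattern("GGACCAGTCTT", "ACCAGTC")
```

A real search would of course scan millions of bases, but the logic is the same: the pattern “ACCAGTC” is found at position 2 of the toy string.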

Computational approaches that use the computer to guide the more expensive and time-consuming wet-lab experiments can help solve long-standing laboratory problems. In a collaborative project with Drosophila biologist John Carlson, my colleagues and I successfully used a quasiperiodic feature classifier (QFC), a new computer algorithm, to solve a 15-year-old problem of isolating the fruit fly’s olfactory genes.1 Through identifying G protein-coupled receptors in the Drosophila genome, the computer program narrowed the possible candidate genes sufficiently to make the experimental work manageable.

Synthesizing information into knowledge

These days, biologists use computers routinely to assist with many activities, including

• biomolecular sequence alignment,
• assembly of DNA pieces,
• multivariate analysis of large-scale gene expressions, and
• metabolic pathway analysis.

Just as conducting normal business activities has become nearly impossible without informatics support, so too has biological research come increasingly to depend on informatics. What other important challenges confront computational biology?

Historically, and somewhat inherently, biology is a diverse field that collects information from many distributed sources. For example, 50 different laboratories around the world might study a particular gene. Given these heterogeneous information sources, collecting and integrating them into a coherent information set presents a major problem.

Given the pace of data production, the fundamental task of curating and integrating distributed databases is critically important. The problems encountered range from relatively simple to almost philosophical. Simple problems include different labs giving the same object 10 different names. Difficult problems include the seemingly trivial challenge of how to define a gene—a practical rather than philosophical problem for a database specialist. Currently, most major databases, such as GenBank and Swiss-PROT (http://www.ebi.ac.uk/swissprot/index.html), operate partially by human curation and partially by various automated data-sharing schemes. Issues of data quality, interoperation, and integration of dispersed information sources remain problematic.

Computational biology researchers use existing information to extrapolate knowledge about novel biomolecules. Because genome projects generate raw data without giving them biological meaning, annotating this raw data with all the pieces of information that might be biologically relevant becomes important. Useful information includes whether a stretch of DNA contains an amino acid coding sequence, transposons, or a regulatory sequence and, if an amino acid is coded, what its putative function is, and so on.

One detailed annotation addressed the 3,000,000 bases that surround the Drosophila melanogaster ADH sequence (http://www.flybase.org/). When researchers completed the Drosophila sequence, they held an annotation jamboree to provide as much biological context as possible for the raw sequences.2 The researchers used a variety of computer tools combined with human interpretation to carry out much of this annotation. However, given the rate at which researchers now generate DNA sequence information, automatically annotating the raw data presents a computational challenge, and careful human analysis is becoming increasingly difficult. The tools necessary for addressing this problem include gene prediction, gene classification, comparative genomics, and evolutionary modeling.

Although researchers increasingly synthesize biological information into general theories, our knowledge remains scattered and complex. For example, we have a great deal of detailed information about the molecular events governing the early development of Drosophila. However, this extremely complex knowledge takes the form of statements like “gene A and B interact to positively induce gene C, but under the presence of gene D, the positive regulation is modulated by….” We usually do not use an a priori theory to gather coordinated data; instead, hundreds of different research groups make independent observations hoping to synthesize that theory. These hundreds of independent observations appear in thousands of articles that use scores of variations in terminology, methodology, and so forth.

Worse, all this data applies to only a tiny segment of the field: the biology of fruit fly development. To cope with the information explosion represented by the tens of thousands of research articles that appear in print each year, we need a system that performs automatic knowledge extraction: a computer program that scans all these articles, classifies them, and produces synthetic new information. While daunting, such a project would be useful not only in the biological context but also in everyday life. No doubt this is why automatic text retrieval and knowledge extraction have become an important research area in computer science (http://www.cs.cmu.edu/cald/research.html), which has just begun to provide new tools for biological research.3

COMPUTATIONAL BIOLOGY’S HOLY GRAIL

When asked what the Holy Grail of computational biology is, most researchers would answer that it is either sequence-structure-function prediction or computing the genotype-phenotype map.

Predicting molecular structure

Sequence-structure-function prediction refers to the idea that, given a molecule’s sequence identity, we would like to predict its three-dimensional structure and, from that structure, infer its molecular function. In the past, researchers were widely successful in deducing the universal genetic code, which consists of a relational map from DNA sequences to amino acid sequences, revealing the amino acid identity once we know the DNA sequence identity. This code’s universality and elegant combinatorial structure are a remarkable fact of nature. In addition, our knowledge of the code has extremely practical consequences.


Identifying the amino acid sequence of a protein is difficult and costly, but if we have the genetic code, we only need to identify the corresponding DNA sequence. When going from the amino acid sequence to protein structure, if we could determine a direct relationship between the amino acid sequence and the corresponding protein’s three-dimensional structure, we should be able to generate a relational map from the primary amino acid sequences to the folded protein structures. Deducing this map would be another wonderful triumph that would prove tremendously useful. Once we have this second genetic code, obtaining complete information on the corresponding protein structure would only require knowing the DNA sequence.

Several factors make structure prediction from sequence extremely difficult, however. In the case of the genetic code, the possible values consist of the 20 different amino acids. Proteins consist of approximately 1,000 different major structures—called folds—each with tens of thousands of variations. In proteins, the physical forces that govern the interaction of the hundreds to thousands of amino acid residues determine the structure. We do not know the details of these interactions. Even if we knew them, computing the consequences of these forces would be nearly impossible because doing so requires solving a physics problem that involves hundreds to thousands of nonideal bodies.

Still, we have made significant progress in this area, thanks in part to the Critical Assessment of Structure Prediction (CASP) competition (http://www.ncbi.nlm.nih.gov/Structure/RESEARCH/casp3/index.shtml). This worldwide open challenge invites computational biologists to predict protein structures drawn from test cases that have been solved experimentally but not yet publicly released. CASP’s contest format has generated a great response, resulting in dramatic improvements in prediction rates.

From genotype to phenotype

Researchers have uncovered reasonable evidence indicating that a protein’s structure approximately determines its molecular functions, such as catalysis, DNA binding, and cell component binding. Therefore, some researchers think a relational map between structure and function should be deducible, perhaps as a third genetic code. This idea also drives so-called rational drug design.

If researchers could predict the action of a protein by looking at its three-dimensional structure—much as an engineer looks at an automobile design—they could say, for example, “If we streamline the structure here and reduce the bump here, we will get a better working drug.” Further, if researchers solve the sequence-structure problem, they would also know exactly how to make those changes.

Unfortunately, this ideal remains beyond reach, not least because we have at best only a murky idea of what protein function means, both in theory and practice. An object’s function often depends on context. For example, a screw that holds a chair together has a markedly different context than the screw on a car jack. Because the parts of an object that confer function also often interrelate, we can’t chop off a part and expect the function to remain unchanged.

All of these concerns bring us to the genotype-phenotype map problem. A genotype refers to the genetically encoded information in an individual genome. Technical details aside, practically speaking, the genotype consists of the sequence identity for a person’s entire DNA. The phenotype refers to any measured trait of a particular individual—for example, a person’s hair color, body weight, propensity to heart attack, and so on. Genetic encoding does not determine all these measurements, but the idea is that they result from genetic information interacting with a particular environmental context. For every environmental context, there is a map between the genotype and phenotype. Thus, given a person’s DNA sequence and environmental context, we should be able to determine most measured traits. That is why many drug companies have begun to ask how the efficacy of a given drug treatment interacts with the recipient’s genotype.

Ultimately, we must determine if we can take an engineering approach to biological objects. Some of the most spectacular advances in biology came from the optimistic view that we will achieve a physics-like understanding of this discipline. Perhaps, then, the Holy Grail of biology is a mechanical, physical understanding of living organisms. Is there in fact a grand theory of the organism—a set of laws that govern its form and function? Evolutionary theory has provided such laws and theories for the generative process of populations. Can we now obtain similar theories for an individual generative process? Clearly, the growing body of molecular data and its computation will play a fundamental role in forming the answer to this question.


DNA COMPUTING AND GENETIC ALGORITHMS

Leonard Adleman4 first suggested the possibility of using DNA for computation. Adleman used sequence-specific hybridization of DNA molecules and polymerase chain reaction to solve the computational problem of finding a Hamiltonian path in a directed graph—a route that starts at a particular vertex of a graph and completely traverses each vertex exactly once before ending at a second designated vertex. A sample Hamiltonian problem would be to find the route, if it exists, from New York to Boston, that goes through each of 10 major Eastern seaboard cities exactly once.

Molecular computing

More specifically, Adleman was motivated by the computational difficulty presented by the class of NP-complete problems—here NP stands for nondeterministic polynomial time. Each problem in this class has the property that one can efficiently verify whether a proposed solution is indeed an actual solution, but no one knows how to efficiently find an actual solution if one exists. For the past 30 years, computer scientists have generally held that no efficient algorithm exists for any of the NP-complete problems. The problem of finding a Hamiltonian path in a directed graph is NP-complete. It is easy to verify whether a proposed sequence of vertices indeed constitutes a Hamiltonian path. However, researchers believe that there is no efficient, deterministic algorithm for finding a Hamiltonian path in a given directed graph—assuming such a path exists.

One useful observation that can be made about any NP-complete problem is that a finite set of potential solutions to propose always exists—although that set may be of exponential size. For the directed Hamiltonian path problem, we need only consider the set of all possible permutations of the graph’s set of vertices. If a Hamiltonian path exists in the graph, it must be represented by one of the permutations. We say that the problem can be solved efficiently nondeterministically because if we can guess one of the possible permutations, that permutation can be verified as a solution—or not—efficiently. Hence, we can efficiently solve the problem of finding a Hamiltonian path with an exponential number of processors, each verifying a different permutation. In Adleman’s alternate formulation, we use the hybridization of an exponentially vast number of DNA molecules, in parallel, to solve the problem of actually finding a permutation that represents a Hamiltonian path.
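The asymmetry described here—cheap verification versus expensive search—can be sketched directly. The tiny graph and city names below are invented for illustration; the point is that verifying a proposed path takes time linear in its length, while exhaustive search must, in the worst case, try a factorial number of permutations, exactly the "guess and verify" structure Adleman exploited in parallel with DNA.

```python
from itertools import permutations

def is_hamiltonian_path(edges, path):
    """Verify a proposed path in linear time: every vertex appears once
    and every consecutive pair is a directed edge."""
    return (len(set(path)) == len(path) and
            all((a, b) in edges for a, b in zip(path, path[1:])))

def find_hamiltonian_path(vertices, edges, start, end):
    """Exhaustive search over permutations of the interior vertices --
    exponential work, one cheap verification per guess."""
    interior = [v for v in vertices if v not in (start, end)]
    for middle in permutations(interior):
        path = (start, *middle, end)
        if is_hamiltonian_path(edges, path):
            return path
    return None
```

On a four-city toy graph, the search finds the unique route; with 10 cities it would already face 8! interior orderings.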

Since Adleman’s original work, additional research has shown that biologists can use DNA to encode a universal computer, the same kind of computer we use on our desktops, and that this property stems purely from DNA’s ability to find complementary sequence pairs. Yet considerable skepticism still surrounds the practical use of such computers. The main issues involve concerns about

• encoding the problem and reading the output, which can take days for even simple problems;
• inherent computational errors; and
• the amount of DNA required to solve practical hard problems.

As an example of this last issue, currently the largest supercomputers perform approximately 10^19 elementary switch operations per second. Although the molecular interactions of DNA hybridization can be extremely fast, cycling such operations takes approximately 10^2 seconds. It’s unclear how many logical-circuit switch operations equal a single DNA hybridization reaction. However, if we assume equivalency, obtaining 10^19 switch operations would require approximately 10 mmoles of DNA. This would amount to about 100 grams of DNA using, say, 24 base pair sequences—an unbelievably large amount of DNA for a molecular experiment.

Despite these problems, the idea of DNA computers has opened exciting new avenues of research in nanotechnology and computation models. Although real organisms do perform complex computations,5 whether we can harness this ability for our specific use remains an open question.

Genetic algorithms

Researchers such as John Holland laid the foundations for genetic algorithms in the 1960s when they began describing the possibility of machines that can adaptively solve complex problems. In recent years, the term evolutionary computing has been applied to the diverse field that flowered from this seminal work because the predominant idea is to solve complex problems—such as a Hamiltonian problem—by emulating the evolutionary adaptive behavior of real organisms.

Strict evolutionary adaptation requires three components. First, the organism should have a property or suite of properties that governs its differential survival. Second, individuals should inherit these properties. Third, a mechanism should generate variations of these properties via mutation.


Evolutionary computing thus involves generating a population of computer programs and—by tying their survival to their problem-solving ability—selecting those that are particularly good at solving a posed problem. The reproduction and inheritance property ensures that this selection process continues through many cycles. The mutation property allows the population to continuously try new variants of the solutions. Many studies show that these algorithm classes may be particularly suitable for hard problems where “the objective function landscape is rough.”

We can better understand the meaning of this phrase if we consider that many difficult problems can be visualized as searching for the highest mountain peak in a mountain range. In this metaphor, the height of the mountains represents an optimal criterion, often called the objective function. The land these mountains occupy represents the search space—the collection of possible solutions whose appropriateness the objective function measures.

For example, in the protein-structure-determination problem, the optimal criterion might be the total free energy of the folded structure, and the search space would consist of all possible ways to fold an amino acid sequence. Together, the search space and the objective function comprise the problem’s landscape.

Many computer algorithms use a rule, such as “look around and go up the steepest direction,” to traverse such a landscape. But, just as in real life, a rugged landscape often foils these simple rules. Although researchers remain uncertain as to exactly how the theory works, it seems that the principles used in evolutionary algorithms work well in such situations.
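A toy example shows how the "go up the steepest direction" rule fails on rugged ground. The one-dimensional landscape below is an invented list of heights; started near the left, the climber stops on a minor local peak and never reaches the tallest mountain.

```python
def hill_climb(landscape, start):
    """Greedy ascent: step to the taller neighbor until neither is taller."""
    position = start
    while True:
        neighbors = [p for p in (position - 1, position + 1)
                     if 0 <= p < len(landscape)]
        best = max(neighbors, key=lambda p: landscape[p])
        if landscape[best] <= landscape[position]:
            return position  # a peak: no uphill step remains
        position = best

# A rugged "mountain range" (heights are illustrative):
heights = [1, 3, 2, 1, 2, 9, 4]
```

Starting at position 0, the climber halts at index 1 (height 3), a local peak; only a start in the right valley reaches the global peak at index 5 (height 9). Evolutionary methods sidestep this trap by keeping a whole population of climbers and mutating them across the landscape.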

Evolutionary computing versus artificial life

Despite its appearance of evolving intelligence and self-direction, evolutionary computation should not be confused with artificial life. Evolutionary computing generally functions as a problem-solving device, whereas artificial life mimics an organism’s behavior.

These two ideas do mix, however, in evolutionary programs. Basically, the code from these programs self-replicates and adaptively changes in response to some kind of selection scheme, usually their ability to solve a problem in the shortest amount of time. In genetic algorithms, researchers schematically code into the programs the problem to be solved. Further, this architecture does not change; only the scheme’s parameters change, adaptively.

In evolutionary programming, on the other hand, the actual execution of the program changes fundamentally. An example involving the protein-structure-determination problem will make this clearer. Given a string of amino acids, we must compute the three-dimensional position of individual atoms that will minimize total free energy. To implement a genetic algorithm, we might start by assigning random three-dimensional coordinates to the atoms to obtain a set of random solutions. Next, we evaluate the solutions’ implied free energy and select for the lower free-energy solutions. We then change the three-dimensional coordinates to mutate the solutions.
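The steps just described—random solutions, energy evaluation, selection of the lower-energy half, mutation—can be sketched as a minimal genetic algorithm. The energy function below is a deliberate stand-in (sum of squared coordinates, minimized at the origin), not a physical force field, and every parameter is illustrative.

```python
import random

def toy_energy(coords):
    """Stand-in for free energy: sum of squared coordinates, minimal at
    the origin. A real version would score 3D atomic positions with a
    physical force field."""
    return sum(x * x for x in coords)

def genetic_minimize(dims=3, pop_size=30, generations=100, seed=0):
    """Random solutions -> evaluate -> select lower-energy half ->
    mutate survivors to refill the population, repeated for many cycles."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(dims)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=toy_energy)               # evaluate
        survivors = population[: pop_size // 2]       # select
        population = survivors + [                    # mutate to refill
            [x + rng.gauss(0, 0.3) for x in rng.choice(survivors)]
            for _ in range(pop_size - len(survivors))
        ]
    return min(population, key=toy_energy)
```

Because the best solutions survive each generation unchanged, the lowest energy found never increases from one cycle to the next; mutation supplies the new variants that let it keep decreasing.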

Under the evolutionary programming approach, on the other hand, we implement a program that takes as input a string of amino acids and produces as output a set of coordinates. Then, the program itself randomly changes, rearranging its own code to, for example, duplicate some parts, delete others, and so on. The environment selects those programs that produce the right solution and make efficient computations.

The idea for a program that replicates and changes its own code first arose among Bell Labs researchers in the 1950s.6 Tom Ray, a former tropical ecologist, created Tierra, the first truly evolving self-replicating program. An artificial world of simplified program languages that allows programs to mutate, replicate, and evolve, Tierra originally randomly generated small programs, then selected those that learned to self-replicate. Soon, other variants appeared that self-replicated more efficiently. Further, a complete ecology of computer programs evolved, including those that reproduced as a parasite hitching a ride on normal programs.

Many developments have followed Ray’s work, including Avida (http://www.krl.caltech.edu/avida/), an Internet program based on concepts similar to Tierra, which lives on the Web and seeks spare computer cycles to examine issues pertaining to evolutionary biology.7

INTERDISCIPLINARY RESEARCH

Although many of the most exciting scientific advances have arisen from interdisciplinary work, in practice such collaboration is difficult. Many of the most creative people have trouble finding support and institutional positions. While combining the knowledge of two different fields can be difficult, we can overcome such problems if we work hard and do our homework. There is absolutely no reason why an expert in biological sciences should not also be deeply knowledgeable in computer science, mathematics, and statistics. Thus, the main obstacles to interdisciplinary science stem from our hubris, which manifests itself in two fallacies: collaboration and expertism.

Collaborative dissonance

The collaboration fallacy occurs in several steps. First, an expert in one field needs “help in just this little problem” and nothing else. A biologist, for example, might approach a computer scientist and say, “I have this gene-finding problem. Can you run a program for me?” The biologist assumes that

• the computer scientist will not be interested in the biology,
• it’s too involved to explain everything, and
• the computer scientist would not understand the explanation anyway.

Running such a program usually requires tedious work on the computer scientist’s part. Lacking the scientific context, such as why the gene is interesting and the problem important, the computing expert has no motivation to collaborate.

Second, the desire to revolutionize the other field arises. After having been told the problem’s context, the computer scientist may say, “Your whole concept of a gene is flawed—here’s how we should define a gene.” Indeed, such fundamental restructuring of an idea in Field A using insights from Field B can be extremely rewarding and revolutionary. However, this kind of attitude does not apply to every interdisciplinary problem. Moreover, every discipline has its own narrative tradition for posing and answering questions. Making progress requires each expert to respect the other field’s tradition. Thus, fundamental revisions are a long-term, careful undertaking not to be embarked upon hastily.

Third, the collaborators assume that “an expert in Field A will have nothing useful to say about Field B and vice versa.” Specifically, when two experts get together, they expect each other to stay within their own domains and communicate solely through some narrowly prescribed interface.

If a computer scientist comments on the molecular mechanisms of carcinogenesis, or if a biologist makes a remark about the lambda calculus, our first instinct is to wonder whether this person knows what he or she is talking about. This is an extremely unfortunate situation, given that such insights from outsiders can lead to important breakthroughs.

Hubris of expertism

Thus does the certainty of our scientific pedigree create the expert fallacy. We live in an age of specialization. Experts rule. We talk of "consulting the world's foremost expert in X" and "being taught by an expert in Y."

True, we have amassed more information today than ever before, and no single person can be expected to absorb even a small fraction of that knowledge. When two distinct disciplines such as biology and computer science come together, it may be asking too much for one person to acquire the equivalent knowledge of a specialist in each field. Yet historically significant advances have frequently been made by such cross-disciplinary collaboration.

By its very nature, the expert concept embodies greed. It is not enough to be an expert in a discipline; you must strive to be the foremost expert. If others attempt to contribute to your discipline, they threaten your goal of exceeding all others. Thus, the computer science expert may resist the notion that a biologist might know something about computer science, just as the biology expert may resist the notion of a computer scientist understanding biology.

More importantly, given a collection of experts from different fields, all will strive to position their expertise as being the most important. Thus do we learn to disdain the soft and applied sciences, the non-natural sciences, mere engineering, and so on. Worse, this need to differentiate ourselves steadily shrinks the domain of our expertise until we limit ourselves to, say, the Department of the Drosophila's Left Wing.

To make interdisciplinary research successful, we must jettison this idea of the expert. All knowledge is equal. Indeed, if we really knew which knowledge is important and which is not, we could all use it with shared certainty. The growth of knowledge, whether personal or fieldwide, is haphazard and full of windings and intricate turnings. Our best hope lies in keeping our eyes open just in case something interesting happens. We should pursue knowledge vigorously but with agnostic value judgments.

ACCELERATING PROGRESS

Assuming we can overcome the obstacles to interdisciplinary collaboration, what can we expect from the interaction between biology and computer science? In an ideal world, creativity and insight, not technical processes, would drive science.

I remember the thrill of sitting at a terminal attached to one of the early supercomputers and suddenly realizing I was getting results back as fast as I could think of things to do. My grand hope for computational biology is that we will be able to provide this responsiveness in the future for all biological problems. Not being a card-carrying computer scientist, I cannot say what their dreams would be in relation to biology. But I suspect they wouldn't differ much from the idea of a new computation model that leads to robust, adaptive, flexible problem-solving machines, just like organisms.

The human brain contains about 10^12 neurons with a peak synaptic activity of about 1 kHz, resulting in roughly 10^15 neuronal operations per second. We obviously do not know how a single synaptic event translates into a computer chip's elementary switch operations. But, with respect to pure information-processing capability, if that number falls between 10,000 and 100,000 such operations, it seems our thinking machines can be made comparable to the human brain. At that point, we need only determine the software and the computation model to achieve genuine artificial intelligence. This is where the interface between computation and biology becomes important.
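The arithmetic above can be sketched as a back-of-envelope script. Note that the 10,000-to-100,000 conversion range between a single synaptic event and elementary switch operations is the article's stated assumption, not a measured figure:

```python
# Back-of-envelope estimate of the brain's raw information-processing rate,
# using the figures quoted in the text.
neurons = 1e12           # approximate neuron count in the human brain
peak_rate_hz = 1e3       # peak synaptic activity per neuron, ~1 kHz

# ~1e15 neuronal operations per second
neuronal_ops_per_sec = neurons * peak_rate_hz

# Assumed range for how many elementary switch operations one synaptic
# event might correspond to (the article's hypothetical 1e4-1e5 range).
for switch_ops_per_event in (1e4, 1e5):
    total = neuronal_ops_per_sec * switch_ops_per_event
    print(f"{switch_ops_per_event:.0e} switch ops/event -> {total:.0e} ops/s")
```

Under these assumptions the brain's equivalent switching rate lands in the 10^19 to 10^20 operations-per-second range, which is the basis for the comparability claim.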

We can anticipate a great prognosis for interdisciplinary research in biology and computers. Practically speaking, career opportunities in computational biology are increasing exponentially. Scientifically, computational approaches are rapidly devouring the problems that are so easy to approach using a computer, yet so difficult to tackle in the laboratory.

The interface between computation and biology thus offers one of the most exciting growth fields today. Certainly, some of today's claims and speculations will turn out to be noise, and some of the heat will inevitably cool. However, computers and biology form a relationship so natural and compelling that the growth of either field will critically rely on their interaction. Our best hope lies in enthusiastically embracing this relationship without the hubris of prejudice, thus hastening a time when individuals who can discuss quantum computation and post-translational regulation with equal facility initiate a wave of synergistic revolutions.


Junhyong Kim is an associate professor of ecology and evolutionary biology at Yale University. His research interests include computational biology, statistical theory, and molecular evolution. Kim received a PhD in ecology and evolution from the State University of New York, Stony Brook. He is a member of the Society for Systematic Biology and the Genetics Society of America. Contact him at [email protected].
