

ScientificAmerican.com special online issue no. 2

THE FUTURE OF THE WEB

The dotcom bubble may have finally burst but there can be no doubt that the Internet has forever changed the way we communicate, do business and find information of all kinds. Scientific American has regularly covered the advances making this transformation possible. And during the past five years alone, many leading researchers and computer scientists have aired their views on the Web in our pages.

In this collection, expert authors discuss a range of topics—from XML and hypersearching the Web to filtering information and preserving the Internet in one vast archive. Other articles cover more recent ideas, including ways to make Web content more meaningful to machines and plans to create an operating system that would span the Internet as a whole. —The Editors

TABLE OF CONTENTS

Filtering Information on the Internet
BY PAUL RESNICK; SCIENTIFIC AMERICAN, MARCH 1997
Look for the labels to decide if unknown software and World Wide Web sites are safe and interesting.

Preserving the Internet
BY BREWSTER KAHLE; SCIENTIFIC AMERICAN, MARCH 1997
An archive of the Internet may prove to be a vital record for historians, businesses and governments.

Searching the Internet
BY CLIFFORD LYNCH; SCIENTIFIC AMERICAN, MARCH 1997
Combining the skills of the librarian and the computer scientist may help organize the anarchy of the Internet.

XML and the Second-Generation Web
BY JON BOSAK AND TIM BRAY; SCIENTIFIC AMERICAN, MAY 1999
The combination of hypertext and a global Internet started a revolution. A new ingredient, XML, is poised to finish the job.

Hypersearching the Web
BY MEMBERS OF THE "CLEVER" PROJECT; SCIENTIFIC AMERICAN, JUNE 1999
With the volume of on-line information in cyberspace growing at a breakneck pace, more effective search tools are desperately needed. A new technique analyzes how Web pages are linked together.

The Semantic Web
BY TIM BERNERS-LEE, JAMES HENDLER, AND ORA LASSILA; SCIENTIFIC AMERICAN, MAY 2001
A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities.

The Worldwide Computer
BY DAVID P. ANDERSON AND JOHN KUBIATOWICZ; SCIENTIFIC AMERICAN, MARCH 2002
An operating system spanning the Internet would bring the power of millions of the world's Internet-connected PCs to everyone's fingertips.


FILTERING INFORMATION ON THE INTERNET
Look for the labels to decide if unknown software and World Wide Web sites are safe and interesting
by Paul Resnick

The Internet is often called a global village, suggesting a huge but close-knit community that shares common values and experiences. The metaphor is misleading. Many cultures coexist on the Internet and at times clash. In its public spaces, people interact commercially and socially with strangers as well as with acquaintances and friends. The city is a more apt metaphor, with its suggestion of unlimited opportunities and myriad dangers.

To steer clear of the most obviously offensive, dangerous or just boring neighborhoods, users can employ some mechanical filtering techniques that identify easily definable risks. One technique is to analyze the contents of on-line material. Thus, virus-detection software searches for code fragments that it knows are common in virus programs. Services such as AltaVista and Lycos can either highlight or exclude World Wide Web documents containing particular words. My colleagues and I have been at work on another filtering technique based on electronic labels that can be added to Web sites to describe digital works. These labels can convey characteristics that require human judgment—whether the Web page is funny or offensive—as well as information not readily apparent from the words and graphics, such as the Web site's policies about the use or resale of personal data.

The Massachusetts Institute of Technology's World Wide Web Consortium has developed a set of technical standards called PICS (Platform for Internet Content Selection) so that people can electronically distribute descriptions of digital works in a simple, computer-readable form. Computers can process these labels in the background, automatically shielding users from undesirable material or directing their attention to sites of particular interest. The original impetus for PICS was to allow parents and teachers to screen materials they felt were inappropriate for children using the Net. Rather than censoring what is distributed, as the Communications Decency Act and other legislative initiatives have tried to do, PICS enables users to control what they receive.

What’s in a Label?

PICS labels can describe any aspect of a document or a Web site. The first labels identified items that might run afoul of local indecency laws. For example, the Recreational Software Advisory Council (RSAC) adapted its computer-game rating system for the Internet. Each RSACi (the "i" stands for "Internet") label has four numbers, indicating levels of violence, nudity, sex and potentially offensive language. Another organization, SafeSurf, has developed a vocabulary with nine separate scales. Labels can reflect other concerns beyond indecency, however. A privacy vocabulary, for example, could describe Web sites' information practices, such as what personal information they collect and whether they resell it. Similarly, an intellectual-property vocabulary could describe the conditions under which an item could be viewed or reproduced [see "Trusted Systems," by Mark Stefik, page 78]. And various Web-indexing organizations could develop labels that indicate the subject categories or the reliability of information from a site.

Labels could even help protect computers from exposure to viruses. It has become increasingly popular to download small fragments of computer code, bug fixes and even entire applications from Internet sites. People generally trust that the software they download will not introduce a virus; they could add a margin of safety by checking for labels that vouch for the software's safety. The vocabulary for such labels might indicate which virus checks have been run on the software or the level of confidence in the code's safety.


FILTERING SYSTEM for the World Wide Web allows individuals to decide for themselves what they want to see. Users specify safety and content requirements (a), which label-processing software (b) then consults to determine whether to block access to certain pages (marked with a stop sign). Labels can be affixed by the Web site's author (c), or a rating agency can store its labels in a separate database (d).


In the physical world, labels can be attached to the things they describe, or they can be distributed separately. For example, the new cars in an automobile showroom display stickers describing features and prices, but potential customers can also consult independent listings such as consumer-interest magazines. Similarly, PICS labels can be attached or detached. An information provider that wishes to offer descriptions of its own materials can directly embed labels in Web documents or send them along with items retrieved from the Web. Independent third parties can describe materials as well. For instance, the Simon Wiesenthal Center, which tracks the activities of neo-Nazi groups, could publish PICS labels that identify Web pages containing neo-Nazi propaganda. These labels would be stored on a separate server; not everyone who visits the neo-Nazi pages would see the Wiesenthal Center labels, but those who were interested could instruct their software to check automatically for the labels.

Software can be configured not merely to make its users aware of labels but to act on them directly. Several Web software packages, including CyberPatrol and Microsoft's Internet Explorer, already use the PICS standard to control users' access to sites. Such software can make its decisions based on any PICS-compatible vocabulary. A user who plugs in the RSACi vocabulary can set the maximum acceptable levels of language, nudity, sex and violence. A user who plugs in a software-safety vocabulary can decide precisely which virus checks are required.
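To make the mechanics concrete, here is a minimal Python sketch of that kind of label-based blocking, assuming a label has already been parsed into numeric ratings; the category names, the 0-to-4 scale and the thresholds are illustrative stand-ins, not the actual RSACi encoding.

# Illustrative PICS-style filtering: compare a document's label ratings
# against the maximum levels the user is willing to accept.
# Category names and the 0-4 scales are assumptions for illustration only.

USER_MAXIMUMS = {"language": 1, "nudity": 0, "sex": 0, "violence": 2}

def allow_access(label_ratings, maximums=USER_MAXIMUMS):
    """Return True only if every category the user cares about is rated
    and falls within the user's limit; unrated categories are blocked,
    mirroring a cautious configuration."""
    for category, maximum in maximums.items():
        rating = label_ratings.get(category)
        if rating is None or rating > maximum:
            return False
    return True

page_label = {"language": 1, "nudity": 0, "sex": 0, "violence": 0}
print(allow_access(page_label))        # True: the page is displayed
print(allow_access({"violence": 4}))   # False: access is blocked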

In addition to blocking unwanted materials, label processing can assist in finding desirable materials. If a user expresses a preference for works of high literary quality, a search engine might be able to suggest links to items labeled that way. Or if the user prefers that personal data not be collected or sold, a Web server can offer a version of its service that does not depend on collecting personal information.

Establishing Trust

Not every label is trustworthy. The creator of a virus can easily distribute a misleading label claiming that the software is safe. Checking for labels merely converts the question of whether to trust a piece of software to one of trusting the labels. One solution is to use cryptographic techniques that can determine whether a document has been changed since its label was created and to ensure that the label really is the work of its purported author.
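The sketch below illustrates only the tamper-detection idea. It binds a label to a document with a keyed hash; a real deployment would use public-key digital signatures so that anyone can verify a label without sharing a secret, and the key and label text here are invented.

import hashlib
import hmac

SECRET_KEY = b"label-authors-signing-key"   # hypothetical shared secret

def sign_label(label_text, document_text):
    """Bind a label to the exact document contents it describes."""
    digest = hashlib.sha256(document_text.encode("utf-8")).hexdigest()
    message = (label_text + digest).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

def label_is_valid(label_text, document_text, signature):
    """Detect whether the label or the labeled document changed after signing."""
    return hmac.compare_digest(sign_label(label_text, document_text), signature)

doc = "...contents of the downloaded program or Web page..."
label = "virus-checked=yes confidence=high"
signature = sign_label(label, doc)
print(label_is_valid(label, doc, signature))                # True
print(label_is_valid(label, doc + " tampered", signature))  # False: document changed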

That solution, however, simply changes the question again, from one of trusting a label to one of trusting the label's author. Alice may trust Bill's labels if she has worked with him for years or if he runs a major software company whose reputation is at stake. Or she might trust an auditing organization of some kind to vouch for Bill.


[Illustration: sample labels read "contains educational material suitable for young children," "contains Nazi party literature," "contains violence," "free of viruses" and "contains material of high literary quality"; the panels show (a) the user's requirements, (b) label-processing software, (c) author labels and (d) a database of independent labels.]


Of course, some labels address matters of personal taste rather than points of fact. Users may find themselves not trusting certain labels, simply because they disagree with the opinions behind them. To get around this problem, systems such as GroupLens and Firefly recommend books, articles, videos or musical selections based on the ratings of like-minded people. People rate items with which they are familiar, and the software compares those ratings with opinions registered by other users. In making recommendations, the software assigns the highest priority to items approved by people who agreed with the user's evaluations of other materials. People need not know who agreed with them; they can participate anonymously, preserving the privacy of their evaluations and reading habits.
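A minimal sketch of that recommendation idea, with invented users, items and a deliberately crude similarity measure; production systems such as GroupLens used more careful statistical weighting.

ratings = {   # hypothetical ratings on a 1-to-5 scale
    "alice": {"book_a": 5, "book_b": 1, "book_c": 4},
    "bob":   {"book_a": 5, "book_b": 2, "book_d": 5},
    "carol": {"book_a": 1, "book_b": 5, "book_d": 1},
}

def agreement(user, other):
    """Crude similarity: average closeness on the items both have rated."""
    shared = set(ratings[user]) & set(ratings[other])
    if not shared:
        return 0.0
    closeness = [1 - abs(ratings[user][i] - ratings[other][i]) / 4 for i in shared]
    return sum(closeness) / len(shared)

def recommend(user):
    """Score unseen items by the similarity-weighted ratings of other users."""
    scores = {}
    for other in ratings:
        if other == user:
            continue
        weight = agreement(user, other)
        for item, value in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + weight * value
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))   # 'book_d' is suggested: Bob agrees with Alice and liked it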

Widespread reliance on labeling raises a number of social concerns. The most obvious are the questions of who decides how to label sites and what labels are acceptable. Ideally, anyone could label a site, and everyone could establish individual filtering rules. But there is a concern that authorities could assign labels to sites or dictate criteria for sites to label themselves. In an example from a different medium, the television industry, under pressure from the U.S. government, has begun to rate its shows for age appropriateness.

Mandatory self-labeling need not lead to censorship, so long as individuals can decide which labels to ignore. But people may not always have this power. Improved individual control removes one rationale for central control but does not prevent its imposition. Singapore and China, for instance, are experimenting with national "firewalls"—combinations of software and hardware that block their citizens' access to certain newsgroups and Web sites.

Another concern is that even without central censorship, any widely adopted vocabulary will encourage people to make lazy decisions that do not reflect their values. Today many parents who may not agree with the criteria used to assign movie ratings still forbid their children to see movies rated PG-13 or R; it is too hard for them to weigh the merits of each movie by themselves.

Labeling organizations must choose vocabularies carefully to match the criteria that most people care about, but even so, no single vocabulary can serve everyone's needs. Labels concerned only with rating the level of sexual content at a site will be of no use to someone concerned about hate speech. And no labeling system is a full substitute for a thorough and thoughtful evaluation: movie reviews in a newspaper can be far more enlightening than any set of predefined codes.

Perhaps most troubling is the suggestion that any labeling system, no matter how well conceived and executed, will tend to stifle noncommercial communication. Labeling requires human time and energy; many sites of limited interest will probably go unlabeled. Because of safety concerns, some people will block access to materials that are unlabeled or whose labels are untrusted. For such people, the Internet will function more like broadcasting, providing access only to sites with sufficient mass-market appeal to merit the cost of labeling.

While lamentable, this problem is an inherent one that is not caused by labeling. In any medium, people tend to avoid the unknown when there are risks involved, and it is far easier to get information about material that is of wide interest than about items that appeal to a small audience.

Although the Net nearly eliminates the technical barriers to communication with strangers, it does not remove the social costs. Labels can reduce those costs, by letting us control when we extend trust to potentially boring or dangerous software or Web sites. The challenge will be to let labels guide our exploration of the global city of the Internet and not limit our travels.

The Author

PAUL RESNICK joined AT&T Labs–Research in 1995 as the founding member of the Public Policy Research group. He is also chairman of the PICS working group of the World Wide Web Consortium. Resnick received his Ph.D. in computer science in 1992 from the Massachusetts Institute of Technology and was an assistant professor at the M.I.T. Sloan School of Management before moving to AT&T.


Further Reading

Rating the Net. Jonathan Weinberg in Hastings Communications and Entertainment Law Journal, Vol. 19; March 1997 (in press). Available on the World Wide Web at http://www.msen.com/~weinberg/rating.htm

Recommender Systems. Special section in Communications of the ACM, Vol. 40, No. 3; March 1997 (in press).

The Platform for Internet Content Selection home page is available on the World Wide Web at http://www.w3.org/PICS

COMPUTER CODE for a PICS standards label is typically read by label-processing software, not humans. This sample label rates both the literary quality and the violent content of the Web site http://www.w3.org/PICS


PRESERVING THE INTERNET
An archive of the Internet may prove to be a vital record for historians, businesses and governments
by Brewster Kahle

Manuscripts from the library of Alexandria in ancient Egypt disappeared in a fire. The early printed books decayed into unrecognizable shreds. Many of the oldest cinematic films were recycled for their silver content. Unfortunately, history may repeat itself in the evolution of the Internet—and its World Wide Web.

No one has tried to capture a comprehensive record of the text and images contained in the documents that appear on the Web. The history of print and film is a story of loss and partial reconstruction. But this scenario need not be repeated for the Web, which has increasingly evolved into a storehouse of valuable scientific, cultural and historical information.

The dropping costs of digital storage mean that a permanent record of the Web and the rest of the Internet can be preserved by a small group of technical professionals equipped with a modest complement of computer workstations and data storage devices. A year ago I and a few others set out to realize this vision as part of a venture known as the Internet Archive.

By the time this article is published, we will have taken a snapshot of all parts of the Web freely and technically accessible to us. This collection of data will measure perhaps as much as two trillion bytes (two terabytes) of data, ranging from text to video to audio recording. In comparison, the Library of Congress contains about 20 terabytes of text information. In the coming months, our computers and storage media will make records of other areas of the Internet, including the Gopher information system and the Usenet bulletin boards. The material gathered so far has already proved a useful resource to historians. In the future, it may provide the raw material for a carefully indexed, searchable library.

The logistics of taking a snapshot of the Web are relatively simple. Our Internet Archive operates with a staff of 10 people from offices located in a converted military base—the Presidio—in downtown San Francisco; it also runs an information-gathering computer in the San Diego Supercomputer Center at the University of California at San Diego.

The software on our computers "crawls" the Net—downloading documents, called pages, from one site after another. Once a page is captured, the software looks for cross references, or links, to other pages. It uses the Web's hyperlinks—addresses embedded within a document page—to move to other pages. The software then makes copies again and seeks additional links contained in the new pages. The crawler avoids downloading duplicate copies of pages by checking the identification names, called uniform resource locators (URLs), against a database. Programs such as Digital Equipment Corporation's AltaVista also employ crawler software for indexing Web sites.
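In outline, such a crawler needs little more than a queue of pages to visit and a record of URLs already seen. The Python sketch below uses only the standard library; the starting URL is hypothetical, and a real crawler would add politeness rules, error handling and parallel downloads.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of anchor tags on a downloaded page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

def crawl(start_url, limit=100):
    """Breadth-first crawl that skips URLs already seen (the duplicate check)."""
    seen, queue, archive = set(), [start_url], {}
    while queue and len(archive) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue
        archive[url] = page                     # store a copy of the page
        parser = LinkExtractor()
        parser.feed(page)
        queue.extend(urljoin(url, link) for link in parser.links)
    return archive

# pages = crawl("http://example.org/")   # hypothetical starting point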

What makes this experiment possible is the dropping cost of data storage. The price of a gigabyte (a billion bytes) of hard-disk space is $200, whereas tape storage using an automated mounting device costs $20 a gigabyte. We chose hard-disk storage for a small amount of data that users of the archive are likely to access frequently and a robotic device that mounts and reads tapes automatically for less used information. A disk drive accesses data in an average of 15 milliseconds, whereas tapes require four minutes. Frequently accessed information might be historical documents or a set of URLs no longer in use.
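A back-of-the-envelope comparison using the figures just quoted, for a snapshot on the order of the two terabytes mentioned earlier:

# Rough comparison at the prices quoted in the text (circa 1997).
snapshot_gb = 2000             # a snapshot of roughly two terabytes
disk_cost = snapshot_gb * 200  # $200 per gigabyte of hard disk
tape_cost = snapshot_gb * 20   # $20 per gigabyte of robotic tape storage
print(disk_cost, tape_cost)    # 400000 40000: disk costs ten times as much
# The price of that speed: about 15 milliseconds to reach data on disk
# versus roughly four minutes to mount and read a tape.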

We plan to update the information gathered at least every few months. The first full record required nearly a year to compile. In future passes through the Web, we will be able to update only the information that has changed since our last perusal.

The text, graphics, audio clips and other data collected from the Web will never be comprehensive, because the crawler software cannot gain access to many of the hundreds of thousands of sites. Publishers restrict access to data or store documents in a format inaccessible to simple crawler programs. Still, the archive gives a feel of what the Web looks like during a given period of time even though it does not constitute a full record.

After gathering and storing the public contents of the Internet, what services will the archive provide?



We possess the capability of supplying documents that are no longer available from the original publisher, an important function if the Web's hypertext system is to become a medium for scholarly publishing. Such a service could also prove worthwhile for business research. And the archival data might serve as a "copy of record" for the government or other institutions with publicly available documents. So, over time, the archive would come to resemble a digital library.

Keeping Missing Links

Historians have already found the material useful. David Allison of the Smithsonian Institution has tapped into the archive for a presidential election Web site exhibit at the museum, a project he compares to saving videotapes of early television campaign advertisements. Many of the links for these Web sites, such as those for Texas Senator Phil Gramm's campaign, have already disappeared from the Internet.

Creating an archive touches on an array of issues, from privacy to copyright. What if a college student created a Web page that had pictures of her then current boyfriend? What if she later wanted to "tear them up," so to speak, yet they lived on in the archive? Should she have the right to remove them? In contrast, should a public figure—a U.S. senator, for instance—be able to erase data posted from his or her college years? Does collecting information made available to the public violate the "fair use" provisions of the copyright law? The issues are not easily resolved.

To address these worries, we let authors exclude their works from the archive. We are also considering allowing researchers to obtain broad censuses of the archive data instead of individual documents—one could count the total number of references to pachyderms on the Web, for instance, but not look at a specific elephant home page. These measures, we hope, will suffice to allay immediate concerns about privacy and intellectual-property rights. Over time, the issues addressed in setting up the Internet Archive might help resolve the larger policy debates on intellectual property and privacy by testing concepts such as fair use on the Internet.

The Internet Archive complements other projects intended to ensure the longevity of information on the Internet. The Commission on Preservation and Access in Washington, D.C., researches how to ensure that data are not lost as the standard formats for digital storage media change over the years. In another effort, the Internet Engineering Task Force and other groups have labored on technical standards that give a unique identification name to digital documents. These uniform resource names (URNs), as they are called, could supplement the URLs that currently access Web documents. Giving a document a URN attempts to ensure that it can be traced after a link disappears, because estimates put the average lifetime for a URL at 44 days. The URN would be able to locate other URLs that still provided access to the desired documents.

Other, more limited attempts to archive parts of the Internet have also begun. DejaNews keeps a record of messages on the Usenet bulletin boards, and InReference archives Internet mailing lists. Both support themselves with revenue from advertisers, a possible funding source for the Internet Archive as well. Until now, I have funded the project with money I received from the sale of an Internet software and services company. Major computer companies have also donated equipment.

It will take many years before an infrastructure that assures Internet preservation becomes well established—and for questions involving intellectual-property issues to resolve themselves. For our part, we feel that it is important to proceed with the collection of the archival material because it can never be recovered in the future. And the opportunity to capture a record of the birth of a new medium will then be lost.

The Author

BREWSTER KAHLE founded the Internet Archive in April 1996. He invented the Wide Area Information Servers (WAIS) system in 1989 and started a company, WAIS, Inc., in 1992 to commercialize this Internet publishing software. The company helped to bring commercial and government agencies onto the Internet by selling publishing tools and production services. Kahle also served as a principal designer of the Connection Machine, a supercomputer produced by Thinking Machines. He received a bachelor's degree from the Massachusetts Institute of Technology in 1982.


SEARCHING THE INTERNET
Combining the skills of the librarian and the computer scientist may help organize the anarchy of the Internet
by Clifford Lynch

One sometimes hears the Internet characterized as the world's library for the digital age. This description does not stand up under even casual examination. The Internet—and particularly its collection of multimedia resources known as the World Wide Web—was not designed to support the organized publication and retrieval of information, as libraries are. It has evolved into what might be thought of as a chaotic repository for the collective output of the world's digital "printing presses." This storehouse of information contains not only books and papers but raw scientific data, menus, meeting minutes, advertisements, video and audio recordings, and transcripts of interactive conversations. The ephemeral mixes everywhere with works of lasting importance.

In short, the Net is not a digital library. But if it is to continue to grow and thrive as a new means of communication, something very much like traditional library services will be needed to organize, access and preserve networked information. Even then, the Net will not resemble a traditional library, because its contents are more widely dispersed than a standard collection. Consequently, the librarian's classification and selection skills must be complemented by the computer scientist's ability to automate the task of indexing and storing information. Only a synthesis of the differing perspectives brought by both professions will allow this new medium to remain viable.

At the moment, computer technology bears most of the responsibility for organizing information on the Internet. In theory, software that automatically classifies and indexes collections of digital data can address the glut of information on the Net—and the inability of human indexers and bibliographers to cope with it. Automating information access has the advantage of directly exploiting the rapidly dropping costs of computers and avoiding the high expense and delays of human indexing.

But, as anyone who has ever sought information on the Web knows, these automated tools categorize information differently than people do. In one sense, the job performed by the various indexing and cataloguing tools known as search engines is highly democratic. Machine-based approaches provide uniform and equal access to all the information on the Net.

SEARCH ENGINE operates by visiting, or "crawling" through, World Wide Web sites, pictured as blue globes. The yellow and blue lines represent the output from and input to the engine's server (red tower at center), where Web pages are downloaded. Software on the server computes an index (tan page) that can be accessed by users.




In practice, this electronic egalitarianism can prove a mixed blessing. Web "surfers" who type in a search request are often overwhelmed by thousands of responses. The search results frequently contain references to irrelevant Web sites while leaving out others that hold important material.

Crawling the Web

The nature of electronic indexing can be understood by examining the way Web search engines, such as Lycos or Digital Equipment Corporation's AltaVista, construct indexes and find information requested by a user. Periodically, they dispatch programs (sometimes referred to as Web crawlers, spiders or indexing robots) to every site they can identify on the Web—each site being a set of documents, called pages, that can be accessed over the network. The Web crawlers download and then examine these pages and extract indexing information that can be used to describe them. This process—details of which vary among search engines—may include simply locating most of the words that appear in Web pages or performing sophisticated analyses to identify key words and phrases. These data are then stored in the search engine's database, along with an address, termed a uniform resource locator (URL), that represents where the file resides. A user then deploys a browser, such as the familiar Netscape, to submit queries to the search engine's database. The query produces a list of Web resources, the URLs that can be clicked on to connect to the sites identified by the search.

Existing search engines service millions of queries a day. Yet it has become clear that they are less than ideal for retrieving an ever growing body of information on the Web. In contrast to human indexers, automated programs have difficulty identifying characteristics of a document such as its overall theme or its genre—whether it is a poem or a play, or even an advertisement.

The Web, moreover, still lacks standards that would facilitate automated indexing. As a result, documents on the Web are not structured so that programs can reliably extract the routine information that a human indexer might find through a cursory inspection: author, date of publication, length of text and subject matter. (This information is known as metadata.) A Web crawler might turn up the desired article authored by Jane Doe. But it might also find thousands of other articles in which such a common name is mentioned in the text or in a bibliographic reference.

Publishers sometimes abuse the indiscriminate character of automated indexing. A Web site can bias the selection process to attract attention to itself by repeating within a document a word, such as "sex," that is known to be queried often. The reason: a search engine will display first the URLs for the documents that mention a search term most frequently. In contrast, humans can easily see around simpleminded tricks.
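The core of such an engine, an inverted index mapping each word to the pages that contain it, with results ordered by raw term frequency, can be sketched in a few lines of Python. The naive frequency ranking shown is precisely what keyword stuffing exploits; the URLs and page text are invented.

from collections import defaultdict

def build_index(pages):
    """pages maps URL -> page text. Returns word -> {URL: occurrence count}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, term):
    """Return URLs ordered by raw term frequency, most mentions first."""
    postings = index.get(term.lower(), {})
    return sorted(postings, key=postings.get, reverse=True)

pages = {   # invented pages
    "http://example.org/review": "a thoughtful review of web search engines",
    "http://example.org/stuffed": "sex sex sex sex buy now",
}
index = build_index(pages)
print(search(index, "sex"))   # the keyword-stuffed page comes out on top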

The professional indexer can describe the components of individual pages of all sorts (from text to video) and can clarify how those parts fit together into a database of information. Civil War photographs, for example, might form part of a collection that also includes period music and soldier diaries. A human indexer can describe a site's rules for the collection and retention of programs in, say, an archive that stores Macintosh software. Analyses of a site's purpose, history and policies are beyond the capabilities of a crawler program.

Another drawback of automated indexing is that most search engines recognize text only. The intense interest in the Web, though, has come about because of the medium's ability to display images, whether graphics or video clips. Some research has moved forward toward finding colors or patterns within images [see box, "Finding Pictures on the Web"]. But no program can deduce the underlying meaning and cultural significance of an image (for example, that a group of men dining represents the Last Supper).

At the same time, the way information is structured on the Web is changing so that it often cannot be examined by Web crawlers. Many Web pages are no longer static files that can be analyzed and indexed by such programs. In many cases, the information displayed in a document is computed by the Web site during a search in response to the user's request. The site might assemble a map, a table and a text document from different areas of its database, a disparate collection of information that conforms to the user's query. A newspaper's Web site, for instance, might allow a reader to specify that only stories on the oil-equipment business be displayed in a personalized version of the paper. The database of stories from which this document is put together could not be searched by a Web crawler that visits the site.

A growing body of research has attempted to address some of the problems involved with automated classification methods. One approach seeks to attach metadata to files so that indexing systems can collect this information. The most advanced effort is the Dublin Core Metadata program and an affiliated endeavor, the Warwick Framework—the first named after a workshop in Dublin, Ohio, the other for a colloquy in Warwick, England. The workshops have defined a set of metadata elements that are simpler than those in traditional library cataloguing and have also created methods for incorporating them within pages on the Web.

Categorization of metadata might range from title or author to type of document (text or video, for instance). Either automated indexing software or humans may derive the metadata, which can then be attached to a Web page for retrieval by a crawler. Precise and detailed human annotations can provide a more in-depth characterization of a page than can an automated indexing program alone.
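As an illustration of what such metadata might look like, the sketch below emits Dublin Core-style elements as HTML meta tags; the "DC." naming shown is one common embedding convention, and the sample values are invented.

# Emit Dublin Core-style metadata as HTML meta tags that a page author
# (or indexing software) could place in a document's header.
record = {   # sample values are invented
    "title":   "Civil War Photograph Collection",
    "creator": "Jane Doe",
    "date":    "1997-03-01",
    "type":    "image",
    "subject": "United States history; photography",
}

def to_meta_tags(metadata):
    return "\n".join(
        '<meta name="DC.{0}" content="{1}">'.format(element, value)
        for element, value in metadata.items()
    )

print(to_meta_tags(record))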

GROWTH AND CHANGE on the Internet are reflected in the burgeoning number of Web sites, host computers and commercial, or ".com," sites.

APPROXIMATE NUMBER OF WEB SITES (.com SITES AS PERCENT OF ALL SITES)
JUNE 1993: 130 (2%)
DEC. 1993: 620 (5%)
JUNE 1994: 2,740 (14%)
DEC. 1994: 10,000 (18%)
JUNE 1995: 23,500 (31%)
JAN. 1996: 100,000 (50%)
JUNE 1996: 230,000 (68%)
JAN. 1997: 650,000 (63%)

NUMBER OF HOST COMPUTERS (IN MILLIONS)
JAN. 1993: 1.3
JAN. 1994: 2.2
JAN. 1995: 4.9
JAN. 1996: 9.5
JULY 1996: 12.9

SOURCE: Matthew K. Gray


Where costs can be justified, human indexers have begun the laborious task of compiling bibliographies of some Web sites. The Yahoo database, a commercial venture, classifies sites by broad subject area. And a research project at the University of Michigan is one of several efforts to develop more formal descriptions of sites that contain material of scholarly interest.

Finding Pictures on the Web
by Gary Stix, staff writer

The Internet came into its own a few years ago, when the World Wide Web arrived with its dazzling array of photography, animation, graphics, sound and video that ranged in subject matter from high art to the patently lewd. Despite the multimedia barrage, finding things on the hundreds of thousands of Web sites still mostly requires searching indexes for words and numbers.

Someone who types the words "French flag" into the popular search engine AltaVista might retrieve the requested graphic, as long as it were captioned by those two identifying words. But what if someone could visualize a blue, white and red banner but did not know its country of origin?

Ideally, a search engine should allow the user to draw or scan in a rectangle with vertical thirds that are colored blue, white and red—and then find any matching images stored on myriad Web sites. In the past few years, techniques that combine key-word indexing with image analysis have begun to pave the way for the first image search engines.

Although these prototypes suggest possibilities for the indexing of visual information, they also demonstrate the crudeness of existing tools and the continuing reliance on text to track down imagery. One project, called WebSEEk, based at Columbia University, illustrates the workings of an image search engine. WebSEEk begins by downloading files found by trolling the Web. It then attempts to locate file names containing acronyms, such as GIF or MPEG, that designate graphics or video content. It also looks for words in the names that might identify the subject of the files. When the software finds an image, it analyzes the prevalence of different colors and where they are located. Using this information, it can distinguish among photographs, graphics and black-and-white or gray images. The software also compresses each picture so that it can be represented as an icon, a miniature image for display alongside other icons. For a video, it will extract key frames from different scenes.

A user begins a search by selecting a category from a menu—"cats," for example. WebSEEk provides a sampling of icons for the "cats" category. To narrow the search, the user can click on any icons that show black cats. Using its previously generated color analysis, the search engine looks for matches of images that have a similar color profile. The presentation of the next set of icons may show black cats—but also some marmalade cats sitting on black cushions. A visitor to WebSEEk can refine a search by adding or excluding certain colors from an image when initiating subsequent queries. Leaving out yellows or oranges might get rid of the odd marmalade. More simply, when presented with a series of icons, the user can also specify those images that do not contain black cats in order to guide the program away from mistaken choices. So far WebSEEk has downloaded and indexed more than 650,000 pictures from tens of thousands of Web sites.
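The color-profile matching described above can be approximated with a coarse color histogram and a simple overlap score. The sketch below is a generic illustration of the idea rather than WebSEEk's actual algorithm, and the pixel data are invented.

def color_histogram(pixels, bins=4):
    """pixels: iterable of (r, g, b) values 0-255. Returns a normalized
    histogram over a coarse bins x bins x bins quantization of RGB space."""
    counts = {}
    pixels = list(pixels)
    for r, g, b in pixels:
        key = (r * bins // 256, g * bins // 256, b * bins // 256)
        counts[key] = counts.get(key, 0) + 1
    total = len(pixels) or 1
    return {key: n / total for key, n in counts.items()}

def similarity(hist_a, hist_b):
    """Histogram intersection: 1.0 for identical color profiles, 0.0 for disjoint."""
    return sum(min(hist_a.get(k, 0.0), hist_b.get(k, 0.0)) for k in hist_a)

black_cat = color_histogram([(10, 10, 10)] * 90 + [(200, 180, 160)] * 10)
marmalade = color_histogram([(230, 140, 30)] * 80 + [(10, 10, 10)] * 20)
print(similarity(black_cat, marmalade))   # low score: mostly different colors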

Other image-searching projects include efforts at the University of Chicago, the University of California at San Diego, Carnegie Mellon University, the Massachusetts Institute of Technology's Media Lab and the University of California at Berkeley. A number of commercial companies, including IBM and Virage, have crafted software that can be used for searching corporate networks or databases. And two companies—Excalibur Technologies and Interpix Software—have collaborated to supply software to the Web-based indexing concerns Yahoo and Infoseek.

One of the oldest image searchers, IBM's Query by Image Content (QBIC), produces more sophisticated matching of image features than, say, WebSEEk can.

AUTOMATED INDEXING, used by Web crawler software, analyzes a page (left panel) by designating most words as indexing terms (top center) or by grouping words into simple phrases (bottom center). Human indexing (right) gives additional context about the subject of a page.




Not Just a Library

The extent to which either human classification skills or automated indexing and searching strategies are needed will depend on the people who use the Internet and on the business prospects for publishers. For many communities of scholars, the model of an organized collection—a digital library—still remains relevant. For other groups, an uncontrolled, democratic medium may provide the best vehicle for information dissemination. Some users, from financial analysts to spies, want comprehensive access to raw databases of information, free of any controls or editing. For them, standard search engines provide real benefits because they forgo any selective filtering of data.

The diversity of materials on the Net goes far beyond the scope of the traditional library. A library does not provide quality rankings of the works in a collection. Because of the greater volume of networked information, Net users want guidance about where to spend the limited amount of time they have to research a subject. They may need to know the three "best" documents for a given purpose. They want this information without paying the costs of employing humans to critique the myriad Web sites. One solution that again calls for human involvement is to share judgments about what is worthwhile. Software-based rating systems have begun to let users describe the quality of particular Web sites [see "Filtering Information on the Internet," by Paul Resnick, page 62].

Software tools search the Internet and also separate the good from the bad. New programs may be needed, though, to ease the burden of feeding the crawlers that repeatedly scan Web sites. Some Web site managers have reported that their computers are spending enormous amounts of time in providing crawlers with information to index, instead of servicing the people they hope to attract with their offerings.


It is able not only to pick out the colors in an image but also to gauge texture by several measures—contrast (the black and white of zebra stripes), coarseness (stones versus pebbles) and directionality (linear fence posts versus omnidirectional flower petals). QBIC also has a limited ability to search for shapes within an image. Specifying a pink dot on a green background turns up flowers and other photographs with similar shapes and colors, as shown above. Possible applications range from the selection of wallpaper patterns to enabling police to identify gang members by clothing type.

All these programs do nothing more than match one visual feature with another. They still require a human observer—or accompanying text—to confirm whether an object is a cat or a cushion. For more than a decade, the artificial-intelligence community has labored, with mixed success, on nudging computers to ascertain directly the identity of objects within an image, whether they are cats or national flags. This approach correlates the shapes in a picture with geometric models of real-world objects. The program can then deduce that a pink or brown cylinder, say, is a human arm.

One example is software that looks for naked people, a program that is the work of David A. Forsyth of Berkeley and Margaret M. Fleck of the University of Iowa. The software begins by analyzing the color and texture of a photograph. When it finds matches for flesh colors, it runs an algorithm that looks for cylindrical areas that might correspond to an arm or leg. It then seeks other flesh-colored cylinders, positioned at certain angles, which might confirm the presence of limbs. In a test last fall, the program picked out 43 percent of the 565 naked people among a group of 4,854 images, a high percentage for this type of complex image analysis. It registered, moreover, only a 4 percent false positive rate among the 4,289 images that did not contain naked bodies. The nudes were downloaded from the Web; the other photographs came primarily from commercial databases.
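Those percentages can be turned into rough counts, and from them an estimate of how often a flagged image really was a nude; the precision figure is an inference from the reported numbers, not a result stated in the study.

# Reported results: 43% of 565 nudes found; 4% false positives among 4,289 others.
true_positives  = round(0.43 * 565)    # about 243 naked-people images flagged correctly
false_positives = round(0.04 * 4289)   # about 172 innocent images flagged by mistake
precision = true_positives / (true_positives + false_positives)
print(true_positives, false_positives, round(precision, 2))   # 243 172 0.59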

The challenges of computer vision will most likely remain for a decade or so to come. Searches capable of distinguishing clearly among nudes, marmalades and national flags are still an unrealized dream. As time goes on, though, researchers would like to give the programs that collect information from the Internet the ability to understand what they see.


To address this issue, Mike Schwartz and his colleagues at the University of Colorado at Boulder developed software, called Harvest, that lets a Web site compile indexing data for the pages it holds and to ship the information on request to the Web sites for the various search engines. In so doing, Harvest's automated indexing program, or gatherer, can avoid having a Web crawler export the entire contents of a given site across the network.

Crawler programs bring a copy of each page back to their home sites to extract the terms that make up an index, a process that consumes a great deal of network capacity (bandwidth). The gatherer, instead, sends only a file of indexing terms. Moreover, it exports only information about those pages that have been altered since they were last accessed, thus alleviating the load on the network and the computers tied to it.
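A gatherer of this sort might be sketched as follows: it indexes only pages modified since the previous export and ships a compact file of terms rather than the pages themselves. The data structures and field names are illustrative, not Harvest's actual formats.

import time

def gather(site_pages, last_export_time):
    """site_pages maps URL -> (modified_timestamp, text), maintained locally
    by the Web site. Returns {URL: sorted indexing terms} for pages changed
    since the last export, a far smaller payload than the pages themselves."""
    export = {}
    for url, (modified, text) in site_pages.items():
        if modified > last_export_time:
            export[url] = sorted(set(text.lower().split()))
    return export

site = {   # invented local page table
    "http://example.org/new": (time.time(), "harvest gatherer demonstration page"),
    "http://example.org/old": (0.0, "an old page nobody has touched"),
}
print(gather(site, last_export_time=1.0))   # only the recently changed page is exported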

Gatherers might also serve a different function. They may give publishers a framework to restrict the information that gets exported from their Web sites. This degree of control is needed because the Web has begun to evolve beyond a distribution medium for free information. Increasingly, it facilitates access to proprietary information that is furnished for a fee. This material may not be open for the perusal of Web crawlers. Gatherers, though, could distribute only the information that publishers wish to make available, such as links to summaries or samples of the information stored at a site.

As the Net matures, the decision to opt for a given information collection method will depend mostly on users. For which users will it then come to resemble a library, with a structured approach to building collections? And for whom will it remain anarchic, with access supplied by automated systems?

Users willing to pay a fee to underwrite the work of authors, publishers, indexers and reviewers can sustain the tradition of the library. In cases where information is furnished without charge or is advertiser supported, low-cost computer-based indexing will most likely dominate—the same unstructured environment that characterizes much of the contemporary Internet. Thus, social and economic issues, rather than technological ones, will exert the greatest influence in shaping the future of information retrieval on the Internet.

The Author

CLIFFORD LYNCH is director of library automation at the University of California's Office of the President, where he oversees MELVYL, one of the largest public-access information retrieval systems. Lynch, who received a doctorate in computer science from the University of California, Berkeley, also teaches at Berkeley's School of Information Management and Systems. He is a past president of the American Society for Information Science and a fellow of the American Association for the Advancement of Science. He leads the Architectures and Standards Working Group for the Coalition for Networked Information.

Further Reading

The Harvest Information Discovery and Access System. C. M. Bowman et al. in Computer Networks and ISDN Systems, Vol. 28, Nos. 1–2, pages 119–125; December 1995.

The Harvest Information Discovery and Access System is available on the World Wide Web at http://harvest.transarc.com

The Warwick Metadata Workshop: A Framework for the Deployment of Resource Description. Lorcan Dempsey and Stuart L. Weibel in D-lib Magazine, July–August 1996. Available on the World Wide Web at http://www.dlib.org/dlib/july96/07contents.html

The Warwick Framework: A Container Architecture for Diverse Sets of Metadata. Carl Lagoze, ibid.

HARVEST, a new search-engine architecture, would derive indexing terms using software called gatherers that reside at Web sites (brown boxes near globes) or operate in a central computer (brown hexagon). By so doing, the search engine can avoid downloading all the documents from a Web site, an activity that burdens network traffic. The search engine's server (red structure at center) would simply ask the gatherers (dark blue arrows) for a file of key words (red arrows) that could be processed into an index (tan page) for querying by a user.


XML AND THE SECOND-GENERATION WEB
The combination of hypertext and a global Internet started a revolution. A new ingredient, XML, is poised to finish the job
by Jon Bosak and Tim Bray

Give people a few hints, and they can figure out the rest. They can look at this page, see some large type followed by blocks of small type and know that they are looking at the start of a magazine article. They can look at a list of groceries and see shopping instructions. They can look at some rows of numbers and understand the state of their bank account.

Computers, of course, are not that smart; they need to be told exactly what things are, how they are related and how to deal with them. Extensible Markup Language (XML for short) is a new language designed to do just that, to make information self-describing. This simple-sounding change in how computers communicate has the potential to extend the Internet beyond information delivery to many other kinds of human activity. Indeed, since XML was completed in early 1998 by the World Wide Web Consortium (usually called the W3C), the standard has spread like wildfire through science and into industries ranging from manufacturing to medicine.

The enthusiastic response is fueled by a hope that XML will solve some of the Web's biggest problems. These are widely known: the Internet is a speed-of-light network that often moves at a crawl; and although nearly every kind of information is available on-line, it can be maddeningly difficult to find the one piece you need.



XML BRIDGES the incompatibilities of computer systems, allowing people to search for and exchange scientific data, commercial products and multilingual documents with greater ease and speed.


MARKED UP WITH XML TAGS, one file—containing, say, movie listings for an entire city—can be displayed on a wide variety of devices. "Stylesheets" can filter, reorder and render the listings as a Web page with graphics for a desktop computer, as a text-only list for a handheld organizer and even as audible speech for a telephone. In the illustration, the audible-speech rendering reads, "Star Trek: Insurrection is showing at the MondoPlex 2000. Showtimes are at 2:15, 4:30, 6:45, 9:00 and 11:15. Tickets are $8.50 each for adults and $5.00 each for children." The handheld version lists the same showtimes as bare text, and the conventional screen version adds controls to select a showtime and buy tickets. The XML source behind all three renderings begins:

<movie>
  <title>Star Trek: Insurrection</title>
  <star>Patrick Stewart</star>
  <star>Brent Spiner</star>
  <theatre>
    <theatre-name>MondoPlex 2000</theatre-name>
    <showtime>1415</showtime>
    <showtime>1630</showtime>
    <showtime>1845</showtime>
    <showtime>2100</showtime>
    <showtime>2315</showtime>
    <price>
      <adult-price>8.50</adult-price>
      <child-price>5.00</child-price>
    </price>
  </theatre>
  <theatre>
    <theatre-name>Bigscreen 1</theatre-name>
    <showtime>1930</showtime>
    <price>
      <adult-price>6.00</adult-price>
    </price>
  </theatre>
</movie>
<movie>
  <title>Shakespeare in Love</title>
  <star>Gwyneth

Both problems arise in large part from the nature of the Web's main language, HTML (shorthand for Hypertext Markup Language). Although HTML is the most successful electronic-publishing language ever invented, it is superficial: in essence, it describes how a Web browser should arrange text, images and pushbuttons on a page. HTML's concern with appearances makes it relatively easy to learn, but it also has its costs.

One is the difficulty in creating a Web site that functions as more than just a fancy fax machine that sends documents to anyone who asks. People and companies want Web sites that take orders from customers, transmit medical records, even run factories and scientific instruments from half a world away. HTML was never designed for such tasks.

So although your doctor may be able to pull up your drug reaction history on his Web browser, he cannot then e-mail it to a specialist and expect her to be able to paste the records directly into her hospital's database. Her computer would not know what to make of the information, which to its eyes would be no more intelligible than <H1>blah blah</H1> <BOLD>blah blah blah</BOLD>. As programming legend Brian Kernighan once noted, the problem with "What You See Is What You Get" is that what you see is all you've got.

Those angle-bracketed labels in the example just above are called tags. HTML has no tag for a drug reaction, which highlights another of its limitations: it is inflexible. Adding a new tag involves a bureaucratic process that can take so long that few attempt it. And yet every application, not just the interchange of medical records, needs its own tags.

Thus the slow pace of today's on-line bookstores, mail-order catalogues and other interactive Web sites. Change the quantity or shipping method of your order, and to see the handful of digits that have changed in the total, you must ask a distant, overburdened server to send you an entirely new page, graphics and all. Meanwhile your own high-powered machine sits waiting idly, because it has only been told about <H1>s and <BOLD>s, not about prices and shipping options.

Thus also the dissatisfying quality of Web searches. Because there is no way to mark something as a price, it is effectively impossible to use price information in your searches.

Something Old, Something New

The solution, in theory, is very simple: use tags that say what the information is, not what it looks like. For example, label the parts of an order for a shirt not as boldface, paragraph, row and column—what HTML offers—but as price, size, quantity and color. A program can then recognize this document as a customer order and do whatever it needs to do: display it one way or display it a different way or put it through a bookkeeping system or make a new shirt show up on your doorstep tomorrow.

We, as members of a dozen-strong W3C working group, began crafting such a solution in 1996. Our idea was powerful but not entirely original. For generations, printers scribbled notes on manuscripts to instruct the typesetters. This "markup" evolved on its own until 1986, when, after decades of work, the International Organization for Standardization (ISO) approved a system for the creation of new markup languages.

Named Standard Generalized Markup Language, or SGML, this language for describing languages—a metalanguage—has since proved useful in many large publishing applications. Indeed, HTML was defined using SGML. The only problem with SGML is that it is too general—full of clever features designed to minimize keystrokes in an era when every byte had to be accounted for. It is more complex than Web browsers can cope with.

Our team created XML by removing frills from SGML to arrive at a more streamlined, digestible metalanguage. XML consists of rules that anyone can follow to create a markup language from scratch. The rules ensure that a single compact program, often called a parser, can process all these new languages.


<Together XML and XSL allow publishers to pour a publication into myriad forms—write once and publish everywhere. />


cialist. If the medical profession usesXML to hammer out a markup lan-guage for encoding medical records—and in fact several groups have alreadystarted work on this—then your doctor’se-mail could contain <patient> <name>blah blah </name> <drug-allergy> blahblah blah </drug-allergy> </patient>.Programming any computer to recog-nize this standard medical notation andto add this vital statistic to its databasebecomes straightforward.
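
A sketch of what that might look like in practice, again in Python; the <patient> markup follows the article’s made-up example, and the patient name, allergy value and database schema are placeholders invented for illustration.

    import sqlite3
    import xml.etree.ElementTree as ET

    EMAIL_BODY = ("<patient><name>Jane Doe</name>"
                  "<drug-allergy>penicillin</drug-allergy></patient>")

    record = ET.fromstring(EMAIL_BODY)
    name = record.findtext("name")
    allergy = record.findtext("drug-allergy")

    # Add the vital statistic to a (hypothetical) local database.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE allergies (patient TEXT, substance TEXT)")
    db.execute("INSERT INTO allergies VALUES (?, ?)", (name, allergy))
    print(db.execute("SELECT * FROM allergies").fetchall())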

Just as HTML created a way for every computer user to read Internet documents, XML makes it possible, despite the Babel of incompatible computer systems, to create an Esperanto that all can read and write. Unlike most computer data formats, XML markup also makes sense to humans, because it consists of nothing more than ordinary text.

The unifying power of XML arises from a few well-chosen rules. One is that tags almost always come in pairs. Like parentheses, they surround the text to which they apply. And like quotation marks, tag pairs can be nested inside one another to multiple levels.

The nesting rule automatically forces a certain simplicity on every XML document, which takes on the structure known in computer science as a tree. As with a genealogical tree, each graphic and bit of text in the document represents a parent, child or sibling of some other element; relationships are unambiguous. Trees cannot represent every kind of information, but they can represent most kinds that we need computers to understand. Trees, moreover, are extraordinarily convenient for programmers. If your bank statement is in the form of a tree, it is a simple matter to write a bit of software that will reorder the transactions or display just the cleared checks.
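
A minimal Python sketch of that idea follows; the element names (<statement>, <transaction>) and the “cleared” attribute are assumptions made up for this example, not a real banking format.

    import xml.etree.ElementTree as ET

    STATEMENT = """
    <statement>
      <transaction cleared="yes"><payee>Utility Co.</payee><amount>120.00</amount></transaction>
      <transaction cleared="no"><payee>Bookstore</payee><amount>35.50</amount></transaction>
      <transaction cleared="yes"><payee>Grocer</payee><amount>82.10</amount></transaction>
    </statement>
    """

    tree = ET.fromstring(STATEMENT)
    cleared = [t for t in tree.findall("transaction") if t.get("cleared") == "yes"]
    # Reorder and filter locally, without asking a server for a new page.
    for t in sorted(cleared, key=lambda t: float(t.findtext("amount"))):
        print(t.findtext("payee"), t.findtext("amount"))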

Another source of XML’s unifying strength is its reliance on a new standard called Unicode, a character-encoding system that supports intermingling of text in all the world’s major languages. In HTML, as in most word processors, a document is generally in one particular language, whether that be English or Japanese or Arabic. If your software cannot read the characters of that language, then you cannot use the document. The situation can be even worse: software made for use in Taiwan often cannot read mainland-Chinese texts because of incompatible encodings. But software that reads XML properly can deal with any combination of any of these character sets. Thus, XML enables exchange of information not only between different computer systems but also across national and cultural boundaries.

An End to the World Wide Wait

As XML spreads, the Web should become noticeably more responsive. At present, computing devices connected to the Web, whether they are powerful desktop computers or tiny pocket planners, cannot do much more than get a form, fill it out and then swap it back and forth with a Web server until a job is completed. But the structural and semantic information that can be added with XML allows these devices to do a great deal of processing on the spot. That not only will take a big load off Web servers but also should reduce network traffic dramatically.

To understand why, imagine going to an on-line travel agency and asking for all the flights from London to New York on July 4. You would probably receive a list several times longer than your screen could display. You could shorten the list by fine-tuning the departure time, price or airline, but to do that, you would have to send a request across the Internet to the travel agency and wait for its answer. If, however, the long list of flights had been sent in XML, then the travel agency could have sent a small Java program along with the flight records that you could use to sort and winnow them in microseconds, without ever involving the server. Multiply this by a few million Web users, and the global efficiency gains become dramatic.

As more of the information on the Net is labeled with industry-specific XML tags, it will become easier to find exactly what you need. Today an Internet search for “stockbroker jobs” will inundate you with advertisements but probably turn up few job listings—most will be hidden inside the classified ad services of newspaper Web sites, out of a search robot’s reach. But the Newspaper Association of America is even now building an XML-based markup language for classified ads that promises to make such searches much more effective.

Even that is just an intermediate step. Librarians figured out a long time ago that the way to find information in a hurry is to look not at the information itself but rather at much smaller, more focused sets of data that guide you to the useful sources: hence the library card catalogue. Such information about information is called metadata.

From the outset, part of the XML project has been to create a sister standard for metadata. The Resource Description Framework (RDF), finished this past February, should do for Web data what catalogue cards do for library books. Deployed across the Web, RDF metadata will make retrieval far faster and more accurate than it is now. Because the Web has no librarians and every Webmaster wants, above all else, to be found, we expect that RDF will achieve a typically astonishing Internet growth rate once its power becomes apparent.

There are of course other ways to find things besides searching. The Web is after all a “hypertext,” its billions of pages connected by hyperlinks—those underlined words you click on to get whisked from one to the next. Hyperlinks, too, will do more when powered by XML. A standard for XML-based hypertext, named XLink and due later this year from the W3C, will allow you to choose from a list of multiple destinations. Other kinds of hyperlinks will insert text or images right where you click, instead of forcing you to leave the page.

Perhaps most useful, XLink will enable authors to use indirect links that point to entries in some central database rather than to the linked pages themselves. When a page’s address changes, the author will be able to update all the links that point to it by editing just one database record. This should help eliminate the familiar “404 File Not Found” error that signals a broken hyperlink.
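
The mechanics of such indirection can be imitated in a few lines of Python. This is only a sketch of the idea (the link database and its entries are invented here), not the XLink syntax itself, which the article notes was still due from the W3C.

    # A central link database maps stable link names to current addresses.
    LINK_DATABASE = {
        "xml-spec": "http://www.w3.org/TR/1998/REC-xml-19980210",
    }

    def resolve(link_name):
        """Return the current address for an indirect link, if one is registered."""
        return LINK_DATABASE.get(link_name)  # None would be the only route to a "404"

    # When the page moves, only the one database record changes; the documents
    # that refer to the name "xml-spec" need not be edited at all.
    LINK_DATABASE["xml-spec"] = "http://www.w3.org/TR/REC-xml"
    print(resolve("xml-spec"))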

The combination of more efficient processing, more accurate searching and more flexible linking will revolutionize the structure of the Web and make possible completely new ways of accessing information. Users will find this new Web faster, more powerful and more useful than the Web of today.

< XML enables exchange of information not only between different computer systems but also across national and cultural boundaries. />

[Illustration: screens from a hypothetical XML browser showing Softland Airlines flights from London (LHR) to New York (JFK) on July 4, 1999, including a scheduled-flight list, a round-trip fare finder, a seat-selection page and a booking confirmation with fare restrictions.]

Some Assembly Required

Of course, it is not quite that simple. XML does allow anyone to design a new, custom-built language, but designing good languages is a challenge that should not be undertaken lightly. And the design is just the beginning: the meanings of your tags are not going to be obvious to other people unless you write some prose to explain them, nor to computers unless you write some software to process them.

A moment’s thought reveals why. If all it took to teach a computer to handle a purchase order were to label it with <purchase-order> tags, we wouldn’t need XML. We wouldn’t even need programmers—the machines would be smart enough to take care of themselves.

What XML does is less magical but quite effective nonetheless. It lays down ground rules that clear away a layer of programming details so that people with similar interests can concentrate on the hard part—agreeing on how they want to represent the information they commonly exchange. This is not an easy problem to solve, but it is not a new one, either.

Such agreements will be made, because the proliferation of incompatible computer systems has imposed delays, costs and confusion on nearly every area of human activity. People want to share ideas and do business without all having to use the same computers; activity-specific interchange languages go a long way toward making that possible. Indeed, a shower of new acronyms ending in “ML” testifies to the inventiveness unleashed by XML in the sciences, in business and in the scholarly disciplines [see the box “New Languages for Science”].

Before they can draft a new XML language, designers must agree on three things: which tags will be allowed, how tagged elements may nest within one another and how they should be processed. The first two—the language’s vocabulary and structure—are typically codified in a Document Type Definition, or DTD. The XML standard does not compel language designers to use DTDs, but most new languages will probably have them, because they make it much easier for programmers to write software that understands the markup and does intelligent things with it.
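
For a flavor of what a DTD looks like, here is a tiny one for the hypothetical shirt-order markup sketched earlier. The declarations are illustrative only; note that Python’s standard library checks only well-formedness, so validating a document against the DTD would require a separate validating parser.

    import xml.etree.ElementTree as ET

    # Vocabulary and structure for the hypothetical order language, as a DTD.
    ORDER_DTD = """
    <!ELEMENT order (item+)>
    <!ELEMENT item (description, size, color, quantity, price)>
    <!ELEMENT description (#PCDATA)>
    <!ELEMENT size (#PCDATA)>
    <!ELEMENT color (#PCDATA)>
    <!ELEMENT quantity (#PCDATA)>
    <!ELEMENT price (#PCDATA)>
    <!ATTLIST price currency CDATA #REQUIRED>
    """

    doc = ("<order><item><description>Oxford shirt</description><size>M</size>"
           "<color>blue</color><quantity>2</quantity>"
           "<price currency='USD'>29.95</price></item></order>")
    ET.fromstring(doc)  # raises ParseError if the document is not even well formed
    print("well formed")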

Programmers will also need a set of guidelines that describe, in human language, what all the tags mean. HTML, for instance, has a DTD but also hundreds of pages of descriptive prose that programmers refer to when they write browsers and other Web software.

XML HYPERLINK can open a menu of several options. One option might insert an image, such as a plane seating chart, into the current page (red arrow). Others could run a small program to book a flight (yellow arrow) or reveal hidden text (green arrow). The links can also connect to other pages (blue arrow).

A Question of Style

For users, it is what those programs do, not what the descriptions say, that is important. In many cases, people will want software to display XML-encoded information to human readers. But XML tags offer no inherent clues about how the information should look on screen or on paper.

This is actually an advantage for publishers, who would often like to “write once and publish everywhere”—to distill the substance of a publication and then pour it into myriad forms, both printed and electronic. XML lets them do this by tagging content to describe its meaning, independent of the display medium. Publishers can then apply rules organized into “stylesheets” to reformat the work automatically for various devices. The standard now being developed for XML stylesheets is called the Extensible Stylesheet Language, or XSL.
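
The effect can be imitated in a few lines of Python. The sketch below is not XSL itself (the standard, as noted above, was still being developed) and the movie listings are invented, but it shows the same division of labor: one tagged file, several renderings.

    import xml.etree.ElementTree as ET

    LISTINGS = """
    <listings>
      <movie><title>Metropolis</title><theater>Rialto</theater><time>7:30</time></movie>
      <movie><title>Modern Times</title><theater>Bijou</theater><time>9:00</time></movie>
    </listings>
    """
    movies = ET.fromstring(LISTINGS).findall("movie")

    # "Stylesheet" 1: an HTML list for a desktop browser.
    html = "<ul>" + "".join(
        f"<li><b>{m.findtext('title')}</b> at {m.findtext('theater')}, {m.findtext('time')}</li>"
        for m in movies) + "</ul>"

    # "Stylesheet" 2: a terse text list for a handheld organizer or a speech synthesizer.
    text = "\n".join(f"{m.findtext('time')} {m.findtext('title')} ({m.findtext('theater')})"
                     for m in movies)

    print(html)
    print(text)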

The latest versions of several Web browsers can read an XML document, fetch the appropriate stylesheet, and use it to sort and format the information on the screen. The reader might never know that he is looking at XML rather than HTML—except that XML-based sites run faster and are easier to use.

People with visual disabilities gain a free benefit from this approach to publishing. Stylesheets will let them render XML into Braille or audible speech. The advantages extend to others as well: commuters who want to surf the Web in their cars may also find it handy to have pages read aloud.

Although the Web has been a boon to science and to scholarship, it is commerce (or rather the expectation of future commercial gain) that has fueled its lightning growth. The recent surge in retail sales over the Web has drawn much attention, but business-to-business commerce is moving on-line at least as quickly. The flow of goods through the manufacturing process, for example, begs for automation. But schemes that rely on complex, direct program-to-program interaction have not worked well in practice, because they depend on a uniformity of processing that does not exist.

For centuries, humans have successfully done business by exchanging standardized documents: purchase orders, invoices, manifests, receipts and so on. Documents work for commerce because they do not require the parties involved to know about one another’s internal procedures. Each record exposes exactly what its recipient needs to know and no more. The exchange of documents is probably the right way to do business on-line, too. But this was not the job for which HTML was built.

XML, in contrast, was designed for document exchange, and it is becoming clear that universal electronic commerce will rely heavily on a flow of agreements, expressed in millions of XML documents pulsing around the Internet.

Thus, for its users, the XML-powered Web will be faster, friendlier and a better place to do business. Web site designers, on the other hand, will find it more demanding. Battalions of programmers will be needed to exploit new XML languages to their fullest. And although the day of the self-trained Web hacker is not yet over, the species is endangered. Tomorrow’s Web designers will need to be versed not just in the production of words and graphics but also in the construction of multilayered, interdependent systems of DTDs, data trees, hyperlink structures, metadata and stylesheets—a more robust infrastructure for the Web’s second generation.

The Authors

JON BOSAK and TIM BRAY played crucial roles in the development of XML. Bosak, an on-line information technology architect at Sun Microsystems in Mountain View, Calif., organized and led the World Wide Web Consortium working group that created XML. He is currently chair of the W3C XML Coordination Group and a representative to the Organization for the Advancement of Structured Information Standards. Bray is co-editor of the XML 1.0 specification and the related Namespaces in XML and serves as co-chair of the W3C XML Syntax Working Group. He managed the New Oxford English Dictionary Project at the University of Waterloo in 1986, co-founded Open Text Corporation in 1989 and launched Textuality, a programming firm in Vancouver, B.C., in 1996.

New Languages for Science

XML offers a particularly convenient way for scientists to exchange theories, calculations and experimental results. Mathematicians, among others, have long been frustrated by Web browsers’ ability to display mathematical expressions only as pictures. MathML now allows them to insert equations into their Web pages with a few lines of simple text. Readers can then paste those expressions directly into algebra software for calculation or graphing.

Chemists have gone a step further, developing new browser programs for their XML-based Chemical Markup Language (CML) that graphically render the molecular structure of compounds described in CML Web pages. Both CML and Astronomy Markup Language will help researchers sift quickly through reams of journal citations to find just the papers that apply to the object of their study. Astronomers, for example, can enter the sky coordinates of a galaxy to pull up a list of images, research papers and instrument data about that heavenly body.

XML will be helpful for running experiments as well as analyzing their results. National Aeronautics and Space Administration engineers began work last year on Astronomical Instrument ML (AIML) as a way to enable scientists on the ground to control the SOFIA infrared telescope as it flies on a Boeing 747. AIML should eventually allow astronomers all over the world to control telescopes and perhaps even satellites through straightforward Internet browser software.

Geneticists may soon be using Biosequence ML (BSML) to exchange and manipulate the flood of information produced by gene-mapping and gene-sequencing projects. A BSML browser built and distributed free by Visual Genomics in Columbus, Ohio, lets researchers search through vast databases of genetic code and display the resulting snippets as meaningful maps and charts rather than as obtuse strings of letters. —The Editors

Hypersearching the Web

With the volume of on-line information in cyberspace growing at a breakneck pace, more effective search tools are desperately needed. A new technique analyzes how Web pages are linked together

by Members of the Clever Project

Every day the World Wide Web grows by roughly a million electronic pages, adding to the hundreds of millions already on-line. This staggering volume of information is loosely held together by more than a billion annotated connections, called hyperlinks. For the first time in history, millions of people have virtually instant access from their homes and offices to the creative output of a significant—and growing—fraction of the planet’s population.

But because of the Web’s rapid, chaotic growth, the resulting network of information lacks organization and structure. In fact, the Web has evolved into a global mess of previously unimagined proportions. Web pages can be written in any language, dialect or style by individuals with any background, education, culture, interest and motivation. Each page might range from a few characters to a few hundred thousand, containing truth, falsehood, wisdom, propaganda or sheer nonsense. How, then, can one extract from this digital morass high-quality, relevant pages in response to a specific need for certain information?

In the past, people have relied on search engines that hunt for specific words or terms. But such text searches frequently retrieve tens of thousands of pages, many of them useless. How can people quickly locate only the information they need and trust that it is authentic and reliable?

We have developed a new kind of search engine that exploits one of the Web’s most valuable resources—its myriad hyperlinks. By analyzing these interconnections, our system automatically locates two types of pages: authorities and hubs. The former are deemed to be the best sources of information on a particular topic; the latter are collections of links to those locations. Our methodology should enable users to locate much of the information they desire quickly and efficiently.

The Challenges of Search Engines

Computer disks have become increasingly inexpensive, enabling the storage of a large portion of the Web at a single site. At its most basic level, a search engine maintains a list, for every word, of all known Web pages containing that word. Such a collection of lists is known as an index. So if people are interested in learning about acupuncture, they can access the “acupuncture” list to find all Web pages containing that word.
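
In outline, such an index is just a map from each word to the set of pages containing it. A bare-bones Python sketch (the sample pages are invented):

    from collections import defaultdict

    pages = {
        "page1": "acupuncture clinics in boston",
        "page2": "history of acupuncture",
        "page3": "boston restaurant guide",
    }

    # Build the index: for every word, the set of pages that contain it.
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.split():
            index[word].add(url)

    print(sorted(index["acupuncture"]))  # -> ['page1', 'page2']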

Creating and maintaining this index is highly challenging [see “Searching the Internet,” by Clifford Lynch; Scientific American, March 1997], and determining what information to return in response to user requests remains daunting. Consider the unambiguous query for information on “Nepal Airways,” the airline company. Of the roughly 100 (at the time of this writing) Web pages containing the phrase, how does a search engine decide which 20 or so are the best? One difficulty is that there is no exact and mathematically precise measure of “best”; indeed, it lies in the eye of the beholder.

Search engines such as AltaVista, Infoseek, HotBot, Lycos and Excite use heuristics to determine the way in which to order—and thereby prioritize—pages. These rules of thumb are collectively known as a ranking function, which must apply not only to relatively specific and straightforward queries (“Nepal Airways”) but also to much more general requests, such as for “aircraft,” a word that appears in more than a million Web pages. How should a search engine choose just 20 from such a staggering number?

Simple heuristics might rank pages by the number of times they contain the query term, or they may favor instances in which that text appears earlier. But such approaches can sometimes fail spectacularly. Tom Wolfe’s book The Kandy-Kolored Tangerine-Flake Streamline Baby would, if ranked by such heuristics, be deemed very relevant to the query “hernia,” because it begins by repeating that word dozens of times. Numerous extensions to these rules of thumb abound, including approaches that give more weight to words that appear in titles, in section headings or in a larger font.
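
A toy version of such a ranking function, with the weights chosen arbitrarily for illustration, might look like this:

    def heuristic_score(query, title, body):
        """Crude text-only ranking: term frequency, early occurrence, title bonus."""
        query, title, body = query.lower(), title.lower(), body.lower()
        frequency = body.count(query)
        position = body.find(query)
        earliness = 1.0 / (1 + position) if position >= 0 else 0.0
        title_bonus = 2.0 if query in title else 0.0
        return frequency + earliness + title_bonus

    # A page that merely repeats the word scores deceptively well.
    print(heuristic_score("hernia", "A novel", "hernia hernia hernia ..."))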

Such strategies are routinely thwarted by many commercial Web sites that design their pages in certain ways specifically to elicit favorable rankings. Thus, one encounters pages whose titles are “cheap airfares cheap airfares cheap airfares.” Some sites write other carefully chosen phrases many times over in colors and fonts that are invisible to human viewers. This practice, called spamming, has become one of the main reasons why it is currently so difficult to maintain an effective search engine.

Spamming aside, even the basic assumptions of conventional text searches are suspect. To wit, pages that are highly relevant will not always contain the query term, and others that do may be worthless. A major cause of this problem is that human language, in all its richness, is awash in synonymy (different words having the same meaning) and polysemy (the same word having multiple meanings). Because of the former, a query for “automobile” will miss a deluge of pages that lack that word but instead contain “car.” The latter manifests itself in a simple query for “jaguar,” which will retrieve thousands of pages about the automobile, the jungle cat and the National Football League team, among other topics.

WEB PAGES (white dots) are scattered over the Internet with little structure, making it difficult for a person in the center of this electronic clutter to find only the information desired. Although this diagram shows just hundreds of pages, the World Wide Web currently contains more than 300 million of them. Nevertheless, an analysis of the way in which certain pages are linked to one another can reveal a hidden order.

One corrective strategy is to augment search techniques with stored information about semantic relations between words. Such compilations, typically constructed by a team of linguists, are sometimes known as semantic networks, following the seminal work on the WordNet project by George A. Miller and his colleagues at Princeton University. An index-based engine with access to a semantic network could, on receiving the query for “automobile,” first determine that “car” is equivalent and then retrieve all Web pages containing either word. But this process is a double-edged sword: it helps with synonymy but can aggravate polysemy.

Even as a cure for synonymy, the solution is problematic. Constructing and maintaining a semantic network that is exhaustive and cross-cultural (after all, the Web knows no geographical boundaries) are formidable tasks. The process is especially difficult on the Internet, where a whole new language is evolving—words such as “FAQs,” “zines” and “bots” have emerged, whereas other words such as “surf” and “browse” have taken on additional meanings.

Our work on the Clever project at IBM originated amid this perplexing array of issues. Early on, we realized that the current scheme of indexing and retrieving a page based solely on the text it contained ignores more than a billion carefully placed hyperlinks that reveal the relations between pages. But how exactly should this information be used?

When people perform a search for “Harvard,” many of them want to learn more about the Ivy League school. But more than a million locations contain “Harvard,” and the university’s home page is not the one that uses it the most frequently, the earliest or in any other way deemed especially significant by traditional ranking functions. No entirely internal feature of that home page truly seems to reveal its importance.

Indeed, people design Web pages with all kinds of objectives in mind. For instance, large corporations want their sites to convey a certain feel and project a specific image—goals that might be very different from that of describing what the company does. Thus, IBM’s home page does not contain the word “computer.” For these types of situations, conventional search techniques are doomed from the start.

To address such concerns, human architects of search engines have been tempted to intervene. After all, they believe they know what the appropriate responses to certain queries should be, and developing a ranking function that will automatically produce those results has been a troublesome undertaking. So they could maintain a list of queries like “Harvard” for which they will override the judgment of the search engine with predetermined “right” answers.

This approach is being taken by a number of search engines. In fact, a service such as Yahoo! contains only human-selected pages. But there are countless possible queries. How, with a limited number of human experts, can one maintain all these lists of precomputed responses, keeping them reasonably complete and up-to-date, as the Web meanwhile grows by a million pages a day?

Searching with Hyperlinks

In our work, we have been attacking the problem in a different way. We have developed an automatic technique for finding the most central, authoritative sites on broad search topics by making use of hyperlinks, one of the Web’s most precious resources. It is the hyperlinks, after all, that pull together the hundreds of millions of pages into a web of knowledge. It is through these connections that users browse, serendipitously discovering valuable information through the pointers and recommendations of people they have never met.

The underlying assumption of our approach views each link as an implicit endorsement of the location to which it points. Consider the Web site of a human-rights activist that directs people to the home page of Amnesty International. In this case, the reference clearly signifies approval.

FINDING authorities and hubs can be tricky because of the circular way in which they are defined: an authority is a page that is pointed to by many hubs; a hub is a site that links to many authorities. The process, however, can be performed mathematically. Clever, a prototype search engine, assigns initial scores to candidate Web pages on a particular topic. Clever then revises those numbers in repeated series of calculations, with each iteration dependent on the values of the previous round. The computations continue until the scores eventually settle on their final values, which can then be used to determine the best authorities and hubs.

AUTHORITIES AND HUBS help to organize information on the Web, however informally and inadvertently. Authorities are sites that other Web pages happen to link to frequently on a particular topic. For the subject of human rights, for instance, the home page of Amnesty International might be one such location. Hubs are sites that tend to cite many of those authorities, perhaps in a resource list or in a “My Favorite Links” section on a personal home page.

Of course, a link may also exist purely for navigational purposes (“Click here to return to the main menu”), as a paid advertisement (“The vacation of your dreams is only a click away”) or as a stamp of disapproval (“Surf to this site to see what this fool says”). We believe, however, that in aggregate—that is, when a large enough number is considered—Web links do confer authority.

In addition to expert sites that have garnered many recommendations, the Web is full of another type of page: hubs that link to those prestigious locations, tacitly radiating influence outward to them. Hubs appear in guises ranging from professionally assembled lists on commercial sites to inventories of “My Favorite Links” on personal home pages. So even if we find it difficult to define “authorities” and “hubs” in isolation, we can state this much: a respected authority is a page that is referred to by many good hubs; a useful hub is a location that points to many valuable authorities.

These definitions look hopelessly circular. How could they possibly lead to a computational method of identifying both authorities and hubs? Thinking of the problem intuitively, we devised the following algorithm. To start off, we look at a set of candidate pages about a particular topic, and for each one we make our best guess about how good a hub it is and how good an authority it is. We then use these initial estimates to jump-start a two-step iterative process.

First, we use the current guesses about the authorities to improve the estimates of hubs—we locate all the best authorities, see which pages point to them and call those locations good hubs. Second, we take the updated hub information to refine our guesses about the authorities—we determine where the best hubs point most heavily and call these the good authorities. Repeating these steps several times fine-tunes the results.

We have implemented this algorithm in Clever, a prototype search engine. For any query of a topic—say, acupuncture—Clever first obtains a list of 200 pages from a standard text index such as AltaVista. The system then augments these by adding all pages that link to and from that 200. In our experience, the resulting collection, called the root set, will typically contain between 1,000 and 5,000 pages.

For each of these, Clever assigns initial numerical hub and authority scores. The system then refines the values: the authority score of each page is updated to be the sum of the hub scores of other locations that point to it; a hub score is revised to be the sum of the authority scores of locations to which a page points. In other words, a page that has many high-scoring hubs pointing to it earns a higher authority score; a location that points to many high-scoring authorities garners a higher hub score. Clever repeats these calculations until the scores have more or less settled on their final values, from which the best authorities and hubs can be determined. (Note that the computations do not preclude a particular page from achieving a top rank in both categories, as sometimes occurs.)
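
The iteration just described fits in a few lines of Python. The toy link graph below is invented; in Clever the graph would be the root set assembled for the query.

    # links[p] is the set of pages that page p points to (a toy root set).
    links = {
        "hub1": {"auth1", "auth2"},
        "hub2": {"auth1", "auth2", "auth3"},
        "auth1": set(), "auth2": {"auth1"}, "auth3": set(),
    }
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(5):  # a handful of rounds is enough in practice
        # Authority score: sum of the hub scores of the pages pointing here.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # Hub score: sum of the authority scores of the pages pointed to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # Normalize so the scores settle instead of growing without bound.
        a_norm = sum(v * v for v in auth.values()) ** 0.5
        h_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}

    print("best authority:", max(auth, key=auth.get))
    print("best hub:", max(hub, key=hub.get))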

The algorithm might best be understood in visual terms. Picture the Web as a vast network of innumerable sites, all interconnected in a seemingly random fashion. For a given set of pages containing a certain word or term, Clever zeroes in on the densest pattern of links between those pages.

As it turns out, the iterative summation of hub and authority scores can be analyzed with stringent mathematics. Using linear algebra, we can represent the process as the repeated multiplication of a vector (specifically, a row of numbers representing the hub or authority scores) by a matrix (a two-dimensional array of numbers representing the hyperlink structure of the root set). The final results of the process are hub and authority vectors that have equilibrated to certain numbers—values that reveal which pages are the best hubs and authorities, respectively. (In the world of linear algebra, such a stabilized row of numbers is called an eigenvector; it can be thought of as the solution to a system of equations defined by the matrix.)
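
In symbols (the notation here is supplied for illustration, following the description in the text): let A be the adjacency matrix of the root set, with A_{ij} = 1 when page i links to page j and 0 otherwise, let a be the vector of authority scores and h the vector of hub scores. Each round of the calculation is then

    \[ a \leftarrow A^{\mathsf{T}} h, \qquad h \leftarrow A\,a, \]

followed by a normalization of each vector. Repeated application drives a toward the principal eigenvector of A^{\mathsf{T}}A and h toward the principal eigenvector of AA^{\mathsf{T}}, which is why the final scores do not depend on the initial guesses.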

With further linear algebraic analysis, we have shown that the iterative process will rapidly settle to a relatively steady set of hub and authority scores. For our purposes, a root set of 3,000 pages requires about five rounds of calculations. Furthermore, the results are generally independent of the initial estimates of scores used to start the process. The method will work even if the values are all initially set to be equal to 1. So the final hub and authority scores are intrinsic to the collection of pages in the root set.

A useful by-product of Clever’s iterative processing is that the algorithm naturally separates Web sites into clusters. A search for information on abortion, for example, results in two types of locations, pro-life and pro-choice, because pages from one group are more likely to link to one another than to those from the other community.

From a larger perspective, Clever’s algorithm reveals the underlying structure of the World Wide Web. Although the Internet has grown in a hectic, willy-nilly fashion, it does indeed have an inherent—albeit inchoate—order based on how pages are linked.

The Link to Citation Analysis

Methodologically, the Clever algorithm has close ties to citation analysis, the study of patterns of how scientific papers make reference to one another. Perhaps the field’s best-known measure of a journal’s importance is the “impact factor.” Developed by Eugene Garfield, a noted information scientist and founder of Science Citation Index, the metric essentially judges a publication by the number of citations it receives.

On the Web, the impact factor would correspond to the ranking of a page simply by a tally of the number of links that point to it. But this approach is typically not appropriate, because it can favor universally popular locations, such as the home page of the New York Times, regardless of the specific query topic.

Even in the area of citation analysis, researchers have attempted to improve Garfield’s measure, which counts each reference equally. Would not a better strategy give additional weight to citations from a journal deemed more important? Of course, the difficulty with this approach is that it leads to a circular definition of “importance,” similar to the problem we encountered in specifying hubs and authorities. As early as 1976 Gabriel Pinski and Francis Narin of CHI Research in Haddon Heights, N.J., overcame this hurdle by developing an iterated method for computing a stable set of adjusted scores, which they termed influence weights. In contrast to our work, Pinski and Narin did not invoke a distinction between authorities and hubs. Their method essentially passes weight directly from one good authority to another.

CYBERCOMMUNITIES (shown in different colors) populate the Web. An exploration of this phenomenon has uncovered various groups on topics as arcane as oil spills off the coast of Japan, fire brigades in Australia and resources for Turks living in the U.S. The Web is filled with hundreds of thousands of such finely focused communities.

This difference raises a fundamental point about the Web versus traditional printed scientific literature. In cyberspace, competing authorities (for example, Netscape and Microsoft on the topic of browsers) frequently do not acknowledge one another’s existence, so they can be connected only by an intermediate layer of hubs. Rival prominent scientific journals, on the other hand, typically do a fair amount of cross-citation, making the role of hubs much less crucial.

A number of groups are also investigating the power of hyperlinks for searching the Web. Sergey Brin and Lawrence Page of Stanford University, for instance, have developed a search engine dubbed Google that implements a link-based ranking measure related to the influence weights of Pinski and Narin. The Stanford scientists base their approach on a model of a Web surfer who follows links and makes occasional haphazard jumps, arriving at certain places more frequently than others. Thus, Google finds a single type of universally important page—intuitively, locations that are heavily visited in a random traversal of the Web’s link structure. In practice, for each Web page Google basically sums the scores of other locations pointing to it. So, when presented with a specific query, Google can respond by quickly retrieving all pages containing the search text and listing them according to their preordained ranks.
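
A minimal sketch of that random-surfer scoring in Python follows; the toy link graph and the 15 percent chance of a haphazard jump are illustrative assumptions, not parameters reported in the article.

    links = {  # a toy Web: each page and the pages it links to
        "A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"],
    }
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    jump = 0.15  # probability of an occasional haphazard jump

    for _ in range(50):
        new = {p: jump / len(pages) for p in pages}
        for p in pages:
            for q in links[p]:
                new[q] += (1 - jump) * rank[p] / len(links[p])
        rank = new

    # The most "universally important" pages, independent of any query.
    print(sorted(pages, key=rank.get, reverse=True))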

Google and Clever have two main differences. First, the former assigns initial rankings and retains them independently of any queries, whereas the latter assembles a different root set for each search term and then prioritizes those pages in the context of that particular query. Consequently, Google’s approach enables faster response. Second, Google’s basic philosophy is to look only in the forward direction, from link to link. In contrast, Clever also looks backward from an authoritative page to see what locations are pointing there. In this sense, Clever takes advantage of the sociological phenomenon that humans are innately motivated to create hublike content expressing their expertise on specific topics.

The Search Continues

We are exploring a number of ways to enhance Clever. A fundamental direction in our overall approach is the integration of text and hyperlinks. One strategy is to view certain links as carrying more weight than others, based on the relevance of the text in the referring Web location. Specifically, we can analyze the contents of the pages in the root set for the occurrences and relative positions of the query topic and use this information to assign numerical weights to some of the connections between those pages. If the query text appeared frequently and close to a link, for instance, the corresponding weight would be increased.

Our preliminary experiments suggest that this refinement substantially increases the focus of the search results. (A shortcoming of Clever has been that for a narrow topic, such as Frank Lloyd Wright’s house Fallingwater, the system sometimes broadens its search and retrieves information on a general subject, such as American architecture.) We are investigating other improvements, and given the many styles of authorship on the Web, the weighting of links might incorporate page content in a variety of ways.

We have also begun to construct lists of Web resources, similar to the guides put together manually by employees of companies such as Yahoo! and Infoseek. Our early results indicate that automatically compiled lists can be competitive with handcrafted ones. Furthermore, through this work we have found that the Web teems with tightly knit groups of people, many with offbeat common interests (such as weekend sumo enthusiasts who don bulky plastic outfits and wrestle each other for fun), and we are currently investigating efficient and automatic methods for uncovering these hidden communities.

The World Wide Web of today is dramatically different from that of just five years ago. Predicting what it will be like in another five years seems futile. Will even the basic act of indexing the Web soon become infeasible? And if so, will our notion of searching the Web undergo fundamental changes? For now, the one thing we feel certain in saying is that the Web’s relentless growth will continue to generate computational challenges for wading through the ever increasing volume of on-line information.

The Authors

THE CLEVER PROJECT: Soumen Chakrabarti, Byron Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan and Andrew Tomkins are research staff members at the IBM Almaden Research Center in San Jose, Calif. Jon M. Kleinberg is an assistant professor in the computer science department at Cornell University. David Gibson is completing his Ph.D. at the computer science division at the University of California, Berkeley.

The authors began their quest for exploiting the hyperlink structure of the World Wide Web three years ago, when they first sought to develop improved techniques for finding information in the clutter of cyberspace. Their work originated with the following question: If computation were not a bottleneck, what would be the most effective search algorithm? In other words, could they build a better search engine if the processing didn’t have to be instantaneous? The result was the algorithm described in this article. Recently the research team has been investigating the Web phenomenon of cybercommunities.

Further Reading

Search Engine Watch (www.searchenginewatch.com) contains information on the latest progress in search engines. The WordNet project is described in WordNet: An Electronic Lexical Database (MIT Press, 1998), edited by Christiane Fellbaum. The iterative method for determining hubs and authorities first appeared in Jon M. Kleinberg’s paper “Authoritative Sources in a Hyperlinked Environment” in Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, edited by Howard Karloff (SIAM/ACM–SIGACT, 1998). Improvements to the algorithm are described at the Web site of the IBM Almaden Research Center (www.almaden.ibm.com/cs/k53/clever.html). Introduction to Informetrics (Elsevier Science Publishers, 1990), by Leo Egghe and Ronald Rousseau, provides a good overview of citation analysis. Information on the Google project at Stanford University can be obtained from www.google.com on the World Wide Web.


THE SEMANTIC WEB

A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities

by TIM BERNERS-LEE, JAMES HENDLER and ORA LASSILA


The entertainment system was belting out the Beatles’ “We Can Work It Out” when the phone rang. When Pete answered, his phone turned the sound down by sending a message to all the other local devices that had a volume control. His sister, Lucy, was on the line from the doctor’s office: “Mom needs to see a specialist and then has to have a series of physical therapy sessions. Biweekly or something. I’m going to have my agent set up the appointments.” Pete immediately agreed to share the chauffeuring.

At the doctor’s office, Lucy instructed her Semantic Web agent through her handheld Web browser. The agent promptly retrieved information about Mom’s prescribed treatment from the doctor’s agent, looked up several lists of providers, and checked for the ones in-plan for Mom’s insurance within a 20-mile radius of her home and with a rating of excellent or very good on trusted rating services. It then began trying to find a match between available appointment times (supplied by the agents of individual providers through their Web sites) and Pete’s and Lucy’s busy schedules. (The emphasized keywords indicate terms whose semantics, or meaning, were defined for the agent through the Semantic Web.)

In a few minutes the agent presented them with a plan. Pete didn’t like it—University Hospital was all the way across town from Mom’s place, and he’d be driving back in the middle of rush hour. He set his own agent to redo the search with stricter preferences about location and time. Lucy’s agent, having complete trust in Pete’s agent in the context of the present task, automatically assisted by supplying access certificates and shortcuts to the data it had already sorted through.

Almost instantly the new plan was presented: a much closer clinic and earlier times—but there were two warning notes. First, Pete would have to reschedule a couple of his less important appointments. He checked what they were—not a problem. The other was something about the insurance company’s list failing to include this provider under physical therapists: “Service type and insurance plan status securely verified by other means,” the agent reassured him. “(Details?)”

Lucy registered her assent at about the same moment Pete was muttering, “Spare me the details,” and it was all set. (Of course, Pete couldn’t resist the details and later that night had his agent explain how it had found that provider even though it wasn’t on the proper list.)

Expressing Meaning

Pete and Lucy could use their agents to carry out all these tasks thanks not to the World Wide Web of today but rather the Semantic Web that it will evolve into tomorrow. Most of the Web’s content today is designed for humans to read, not for computer programs to manipulate meaningfully. Computers can adeptly parse Web pages for layout and routine processing—here a header, there a link to another page—but in general, computers have no reliable way to process the semantics: this is the home page of the Hartman and Strauss Physio Clinic, this link goes to Dr. Hartman’s curriculum vitae. The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users. Such an agent coming to the clinic’s Web page will know not just that the page has keywords such as “treatment, medicine, physical, therapy” (as might be encoded today) but also that Dr. Hartman works at this clinic on Mondays, Wednesdays and Fridays and that the script takes a date range in yyyy-mm-dd format and returns appointment times. And it will “know” all this without needing artificial intelligence on the scale of 2001’s Hal or Star Wars’s C-3PO. Instead these semantics were encoded into the Web page when the clinic’s office manager (who never took Comp Sci 101) massaged it into shape using off-the-shelf software for writing Semantic Web pages along with resources listed on the Physical Therapy Association’s site.

The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. The first steps in weaving the Semantic Web into the structure of the existing Web are already under way. In the near future, these developments will usher in significant new functionality as machines become much better able to process and “understand” the data that they merely display at present. The essential property of the World Wide Web is its universality.

Overview / Semantic Web

■ To date, the World Wide Web has developed most rapidly as a medium of documents for people rather than of information that can be manipulated automatically. By augmenting Web pages with data targeted at computers and by adding documents solely for computers, we will transform the Web into the Semantic Web.

■ Computers will find the meaning of semantic data by following hyperlinks to definitions of key terms and rules for reasoning about them logically. The resulting infrastructure will spur the development of automated Web services such as highly functional agents.

■ Ordinary users will compose Semantic Web pages and add new definitions and rules using off-the-shelf software that will assist with semantic markup.


The power of a hypertext link is that “anything can link to anything.” Web technology, therefore, must not discriminate between the scribbled draft and the polished performance, between commercial and academic information, or among cultures, languages, media and so on. Information varies along many axes. One of these is the difference between information produced primarily for human consumption and that produced mainly for machines. At one end of the scale we have everything from the five-second TV commercial to poetry. At the other end we have databases, programs and sensor output. To date, the Web has developed most rapidly as a medium of documents for people rather than for data and information that can be processed automatically. The Semantic Web aims to make up for this.

Like the Internet, the Semantic Web will be as decentralized as possible. Such Web-like systems generate a lot of excitement at every level, from major corporation to individual user, and provide benefits that are hard or impossible to predict in advance. Decentralization requires compromises: the Web had to throw away the ideal of total consistency of all of its interconnections, ushering in the infamous message “Error 404: Not Found” but allowing unchecked exponential growth.

Knowledge Representation

For the Semantic Web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning. Artificial-intelligence researchers have studied such systems since long before the Web was developed. Knowledge representation, as this technology is often called, is currently in a state comparable to that of hypertext before the advent of the Web: it is clearly a good idea, and some very nice demonstrations exist, but it has not yet changed the world. It contains the seeds of important applications, but to realize its full potential it must be linked into a single global system.

Traditional knowledge-representation systems typically have been centralized, requiring everyone to share exactly the same definition of common concepts such as “parent” or “vehicle.” But central control is stifling, and increasing the size and scope of such a system rapidly becomes unmanageable.

Moreover, these systems usually carefully limit the questions that can be asked so that the computer can answer reliably—or answer at all. The problem is reminiscent of Gödel’s theorem from mathematics: any system that is complex enough to be useful also encompasses unanswerable questions, much like sophisticated versions of the basic paradox “This sentence is false.” To avoid such problems, traditional knowledge-representation systems generally each had their own narrow and idiosyncratic set of rules for making inferences about their data. For example, a genealogy system, acting on a database of family trees, might include the rule “a wife of an uncle is an aunt.” Even if the data could be transferred from one system to another, the rules, existing in a completely different form, usually could not.

Semantic Web researchers, in contrast, accept that paradoxes and unanswerable questions are a price that must be paid to achieve versatility. We make the language for the rules as expressive as needed to allow the Web to reason as widely as desired. This philosophy is similar to that of the conventional Web: early in the Web’s development, detractors pointed out that it could never be a well-organized library; without a central database and tree structure, one would never be sure of finding everything. They were right. But the expressive power of the system made vast amounts of information available, and search engines (which would have seemed quite impractical a decade ago) now produce remarkably complete indices of a lot of the material out there.

The challenge of the Semantic Web, therefore, is to provide a language that expresses both data and rules for reasoning about the data and that allows rules from any existing knowledge-representation system to be exported onto the Web.

Adding logic to the Web—the means to use rules to make inferences, choose courses of action and answer questions—is the task before the Semantic Web community at the moment. A mixture of mathematical and engineering decisions complicates this task. The logic must be powerful enough to describe complex properties of objects but not so powerful that agents can be tricked by being asked to consider a paradox. Fortunately, a large majority of the information we want to express is along the lines of “a hex-head bolt is a type of machine bolt,” which is readily written in existing languages with a little extra vocabulary.

Two important technologies for developing the Semantic Web are already in place: eXtensible Markup Language (XML) and the Resource Description Framework (RDF). XML lets everyone create their own tags—hidden labels such as <zip code> or <alma mater> that annotate Web pages or sections of text on a page. Scripts, or programs, can make use of these tags in sophisticated ways, but the script writer has to know what the page writer uses each tag for. In short, XML allows users to add arbitrary structure to their documents but says nothing about what the structures mean [see “XML and the Second-Generation Web,” by Jon Bosak and Tim Bray; Scientific American, May 1999].

Meaning is expressed by RDF, which encodes it in sets of triples, each triple being rather like the subject, verb and object of an elementary sentence. These triples can be written using XML tags. In RDF, a document makes assertions that particular things (people, Web pages or whatever) have properties (such as “is a sister of,” “is the author of”) with certain values (another person, another Web page). This structure turns out to be a natural way to describe the vast majority of the data processed by machines. Subject and object are each identified by a Universal Resource Identifier (URI), just as used in a link on a Web page. (URLs, Uniform Resource Locators, are the most common type of URI.) The verbs are also identified by URIs, which enables anyone to define a new concept, a new verb, just by defining a URI for it somewhere on the Web.
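
A rough sketch of the idea, with the triples held as Python tuples of URIs (the URIs themselves are invented for illustration):

    # (subject, verb, object) triples, each term named by a URI.
    triples = [
        ("http://example.org/people#lucy",
         "http://example.org/terms#is-sister-of",
         "http://example.org/people#pete"),
        ("http://example.org/docs/report.html",
         "http://example.org/terms#has-author",
         "http://example.org/people#lucy"),
    ]

    # Anyone can coin a new verb simply by defining a URI for it somewhere on the
    # Web; software that meets the verb can follow the URI to learn its meaning.
    for subject, verb, obj in triples:
        print(subject, "--", verb, "->", obj)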

Human language thrives when using the same term to mean somewhat different things, but automation does not. Imagine that I hire a clown messenger service to deliver balloons to my customers on their birthdays. Unfortunately, the service transfers the addresses from my database to its database, not knowing that the "addresses" in mine are where bills are sent and that many of them are post office boxes. My hired clowns end up entertaining a number of postal workers—not necessarily a bad thing but certainly not the intended effect. Using a different URI for each specific concept solves that problem. An address that is a mailing address can be distinguished from one that is a street address, and both can be distinguished from an address that is a speech.

The triples of RDF form webs of information about related things. Because RDF uses URIs to encode this information in a document, the URIs ensure that concepts are not just words in a document but are tied to a unique definition that everyone can find on the Web. For example, imagine that we have access to a variety of databases with information about people, including their addresses. If we want to find people living in a specific zip code, we need to know which fields in each database represent names and which represent zip codes. RDF can specify that "(field 5 in database A) (is a field of type) (zip code)," using URIs rather than phrases for each term.
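To make the field-and-zip-code example concrete, the following sketch (with invented database names, field identifiers and URIs) shows how a program could use such metadata triples to pick out the zip-code column of whatever databases it encounters:

    # Metadata triples declaring which field of each database holds a zip code.
    # Database names, field identifiers and URIs are hypothetical.
    ZIP_CODE = "http://example.org/terms#zip-code"
    IS_FIELD_OF_TYPE = "http://example.org/terms#is-field-of-type"

    metadata = [
        ("databaseA/field5", IS_FIELD_OF_TYPE, ZIP_CODE),
        ("databaseB/field2", IS_FIELD_OF_TYPE, ZIP_CODE),
    ]

    def zip_code_fields(triples):
        """Return the identifiers of every field declared to hold a zip code."""
        return [s for s, p, o in triples if p == IS_FIELD_OF_TYPE and o == ZIP_CODE]

    print(zip_code_fields(metadata))   # ['databaseA/field5', 'databaseB/field2']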

Ontologies

Of course, this is not the end of the story, because two databases may use different identifiers for what is in fact the same concept, such as zip code. A program that wants to compare or combine information across the two databases has to know that these two terms are being used to mean the same thing. Ideally, the program must have a way to discover such common meanings for whatever databases it encounters.

A solution to this problem is provided by the third basic component of the Semantic Web, collections of information called ontologies. In philosophy, an ontology is a theory about the nature of existence, of what types of things exist; ontology as a discipline studies such theories. Artificial-intelligence and Web researchers have co-opted the term for their own jargon, and for them an ontology is a document or file that formally defines the relations among terms. The most typical kind of ontology for the Web has a taxonomy and a set of inference rules.

The taxonomy defines classes of objects and relations among them. For example, an address may be defined as a type of location, and city codes may be defined to apply only to locations, and so on. Classes, subclasses and relations among entities are a very powerful tool for Web use. We can express a large number of relations among entities by assigning properties to classes and allowing subclasses to inherit such properties. If city codes must be of type city and cities generally have Web sites, we can discuss the Web site associated with a city code even if no database links a city code directly to a Web site.

Inference rules in ontologies supply further power. An ontology may express the rule "If a city code is associated with a state code, and an address uses that city code, then that address has the associated state code." A program could then readily deduce, for instance, that a Cornell University address, being in Ithaca, must be in New York State, which is in the U.S., and therefore should be formatted to U.S. standards. The computer doesn't truly "understand" any of this information, but it can now manipulate the terms much more effectively in ways that are useful and meaningful to the human user.
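A toy version of such a taxonomy and inference rule, written as plain Python rather than in any real ontology language, might look like the sketch below; the class names, city codes and data records are invented:

    # Toy taxonomy: an address is a kind of location, and city codes apply to locations.
    subclass_of = {"address": "location"}

    def is_a(kind, ancestor):
        """Walk the subclass chain to see whether 'kind' is a kind of 'ancestor'."""
        while kind is not None:
            if kind == ancestor:
                return True
            kind = subclass_of.get(kind)
        return False

    # Facts: which state code goes with which city code, plus one address record.
    state_of_city = {"ithaca": "NY"}
    address = {"kind": "address", "city_code": "ithaca"}

    # Inference rule from the text: if an address uses a city code, and that city
    # code is associated with a state code, the address has that state code too.
    if is_a(address["kind"], "location") and address["city_code"] in state_of_city:
        address["state_code"] = state_of_city[address["city_code"]]

    print(address)   # {'kind': 'address', 'city_code': 'ithaca', 'state_code': 'NY'}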

With ontology pages on the Web, solutions to terminology (and other) problems begin to emerge. The meaning of terms or XML codes used on a Web page can be defined by pointers from the page to an ontology. Of course, the same problems as before now arise if I point to an ontology that defines addresses as containing a zip code and you point to one that uses postal code.

Glossary

HTML: Hypertext Markup Language. The language used to encode formatting, links and other features on Web pages. Uses standardized "tags" such as <H1> and <BODY> whose meaning and interpretation is set universally by the World Wide Web Consortium.

XML: eXtensible Markup Language. A markup language like HTML that lets individuals define and use their own tags. XML has no built-in mechanism to convey the meaning of the user's new tags to other users.

RESOURCE: Web jargon for any entity. Includes Web pages, parts of a Web page, devices, people and more.

URL: Uniform Resource Locator. The familiar codes (such as http://www.sciam.com/index.html) that are used in hyperlinks.

URI: Universal Resource Identifier. URLs are the most familiar type of URI. A URI defines or specifies an entity, not necessarily by naming its location on the Web.

RDF: Resource Description Framework. A scheme for defining information on the Web. RDF provides the technology for expressing the meaning of terms and concepts in a form that computers can readily process. RDF can use XML for its syntax and URIs to specify entities, concepts, properties and relations.

ONTOLOGIES: Collections of statements written in a language such as RDF that define the relations between concepts and specify logical rules for reasoning about them. Computers will "understand" the meaning of semantic data on a Web page by following links to specified ontologies.

AGENT: A piece of software that runs without direct human control or constant supervision to accomplish goals provided by a user. Agents typically collect, filter and process information found on the Web, sometimes with the help of other agents.

SERVICE DISCOVERY: The process of locating an agent or automated Web-based service that will perform a required function. Semantics will enable agents to describe to one another precisely what function they carry out and what input data are needed.


This kind of confusion can be resolved if ontologies (or other Web services) provide equivalence relations: one or both of our ontologies may contain the information that my zip code is equivalent to your postal code.

Our scheme for sending in the clowns to entertain my customers is partially solved when the two databases point to different definitions of address. The program, using distinct URIs for different concepts of address, will not confuse them and in fact will need to discover that the concepts are related at all. The program could then use a service that takes a list of postal addresses (defined in the first ontology) and converts it into a list of physical addresses (the second ontology) by recognizing and removing post office boxes and other unsuitable addresses. The structure and semantics provided by ontologies make it easier for an entrepreneur to provide such a service and can make its use completely transparent.
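In greatly simplified form, the two pieces described here, an equivalence relation between terms and a service that strips out post office boxes, could be sketched as follows; the ontology URIs and addresses are invented:

    # One ontology calls the concept "zip code," another calls it "postal code."
    equivalent_terms = {
        "http://ontologyA.example/zip-code": "http://ontologyB.example/postal-code",
    }

    def same_concept(term1, term2):
        """True if the two terms are identical or declared equivalent."""
        return (term1 == term2
                or equivalent_terms.get(term1) == term2
                or equivalent_terms.get(term2) == term1)

    # A toy conversion service: turn mailing addresses into street addresses
    # by dropping anything that is a post office box.
    def physical_addresses(mailing_addresses):
        return [a for a in mailing_addresses if not a.upper().startswith("P.O. BOX")]

    billing = ["12 Elm St, Ithaca NY", "P.O. Box 99, Ithaca NY"]
    print(physical_addresses(billing))   # ['12 Elm St, Ithaca NY']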

Ontologies can enhance the functioning of the Web in many ways. They can be used in a simple fashion to improve the accuracy of Web searches—the search program can look for only those pages that refer to a precise concept instead of all the ones using ambiguous keywords. More advanced applications will use ontologies to relate the information on a page to the associated knowledge structures and inference rules. An example of a page marked up for such use is online at http://www.cs.umd.edu/~hendler. If you send your Web browser to that page, you will see the normal Web page entitled "Dr. James A. Hendler." As a human, you can readily find the link to a short biographical note and read there that Hendler received his Ph.D. from Brown University. A computer program trying to find such information, however, would have to be very complex to guess that this information might be in a biography and to understand the English language used there.

For computers, the page is linked to an ontology page that defines information about computer science departments. For instance, professors work at universities and they generally have doctorates. Further markup on the page (not displayed by the typical Web browser) uses the ontology's concepts to specify that Hendler received his Ph.D. from the entity described at the URI http://www.brown.edu/—the Web page for Brown. Computers can also find that Hendler is a member of a particular research project, has a particular e-mail address, and so on. All that information is readily processed by a computer and could be used to answer queries (such as where Dr. Hendler received his degree) that currently would require a human to sift through the content of various pages turned up by a search engine.

In addition, this markup makes it much easier to develop programs that can tackle complicated questions whose answers do not reside on a single Web page. Suppose you wish to find the Ms. Cook you met at a trade conference last year. You don't remember her first name, but you remember that she worked for one of your clients and that her son was a student at your alma mater. An intelligent search program can sift through all the pages of people whose name is "Cook" (sidestepping all the pages relating to cooks, cooking, the Cook Islands and so forth), find the ones that mention working for a company that's on your list of clients and follow links to Web pages of their children to track down if any are in school at the right place.
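The filtering such an agent performs can be pictured as a few joins over structured data; the sketch below uses made-up records standing in for the semantically marked-up pages the text has in mind:

    # Hypothetical structured records an agent might have gathered from
    # semantically marked-up pages; every value here is invented.
    people = [
        {"name": "Cook", "employer": "Acme Corp", "children": ["Ian"]},
        {"name": "Cook", "employer": "Other LLC", "children": []},
    ]
    schools = {"Ian": "State University"}      # child -> school
    my_clients = {"Acme Corp"}
    my_alma_mater = "State University"

    matches = [
        p for p in people
        if p["name"] == "Cook"
        and p["employer"] in my_clients
        and any(schools.get(child) == my_alma_mater for child in p["children"])
    ]
    print(matches)   # only the Ms. Cook who fits all the remembered clues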

Agents

The real power of the Semantic Web will be realized when people create many programs that collect Web content from diverse sources, process the information and exchange the results with other programs. The effectiveness of such software agents will increase exponentially as more machine-readable Web content and automated services (including other agents) become available. The Semantic Web promotes this synergy: even agents that were not expressly designed to work together can transfer data among themselves when the data come with semantics.

An important facet of agents' functioning will be the exchange of "proofs" written in the Semantic Web's unifying language (the language that expresses logical inferences made using rules and information such as those specified by ontologies). For example, suppose Ms. Cook's contact information has been located by an online service, and to your great surprise it places her in Johannesburg. Naturally, you want to check this, so your computer asks the service for a proof of its answer, which it promptly provides by translating its internal reasoning into the Semantic Web's unifying language. An inference engine in your computer readily verifies that this Ms. Cook indeed matches the one you were seeking, and it can show you the relevant Web pages if you still have doubts. Although they are still far from plumbing the depths of the Semantic Web's potential, some programs can already exchange proofs in this way, using the current preliminary versions of the unifying language.

Another vital feature will be digital signatures, which are encrypted blocks of data that computers and agents can use to verify that the attached information has been provided by a specific trusted source. You want to be quite sure that a statement sent to your accounting program that you owe money to an online retailer is not a forgery generated by the computer-savvy teenager next door.

THE AUTHORS

TIM BERNERS-LEE, JAMES HENDLER and ORA LASSILA are individually and collectively obsessed with the potential of Semantic Web technology. Berners-Lee is director of the World Wide Web Consortium (W3C) and a researcher at the Laboratory for Computer Science at the Massachusetts Institute of Technology. When he invented the Web in 1989, he intended it to carry more semantics than became common practice. Hendler is professor of computer science at the University of Maryland at College Park, where he has been doing research on knowledge representation in a Web context for a number of years. He and his graduate research group developed SHOE, the first Web-based knowledge representation language to demonstrate many of the agent capabilities described in this article. Hendler is also responsible for agent-based computing research at the Defense Advanced Research Projects Agency (DARPA) in Arlington, Va. Lassila is a research fellow at the Nokia Research Center in Boston, chief scientist of Nokia Venture Partners and a member of the W3C Advisory Board. Frustrated with the difficulty of building agents and automating tasks on the Web, he co-authored W3C's RDF specification, which serves as the foundation for many current Semantic Web efforts.


Agents should be skeptical of assertions that they read on the Semantic Web until they have checked the sources of information. (We wish more people would learn to do this on the Web as it is!)
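Real digital signatures on the Semantic Web would rely on public-key cryptography; purely to illustrate the check-before-you-trust idea, the sketch below verifies an assertion with a shared-secret HMAC from the Python standard library. The retailer, the secret and the claim are all invented:

    import hmac, hashlib

    SHARED_SECRET = b"key agreed with the online retailer"   # illustrative only

    def sign(statement: bytes) -> str:
        return hmac.new(SHARED_SECRET, statement, hashlib.sha256).hexdigest()

    def accept(statement: bytes, signature: str) -> bool:
        """An agent accepts an assertion only if its signature checks out."""
        return hmac.compare_digest(sign(statement), signature)

    claim = b"Mary owes OnlineRetailer $25"
    good_sig = sign(claim)

    print(accept(claim, good_sig))                               # True: trusted source
    print(accept(b"Mary owes OnlineRetailer $2500", good_sig))   # False: forgery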

Many automated Web-based services already exist without semantics, but other programs such as agents have no way to locate one that will perform a specific function. This process, called service discovery, can happen only when there is a common language to describe a service in a way that lets other agents "understand" both the function offered and how to take advantage of it. Services and agents can advertise their function by, for example, depositing such descriptions in directories analogous to the Yellow Pages.
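In miniature, such a directory is just a mapping from machine-readable function descriptions to service addresses. The sketch below, with invented ontology terms and URLs, shows one agent advertising a capability and another looking it up:

    # A toy "Yellow Pages" for service discovery.
    directory = {}   # function description (ontology term) -> list of service URLs

    def advertise(function_term, service_url):
        directory.setdefault(function_term, []).append(service_url)

    def discover(function_term):
        """Return the services claiming to perform the described function."""
        return directory.get(function_term, [])

    advertise("http://example.org/functions#appointment-booking",
              "http://providers.example/clinic-scheduler")

    print(discover("http://example.org/functions#appointment-booking"))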

Some low-level service-discovery schemes are currently available, such as Microsoft's Universal Plug and Play, which focuses on connecting different types of devices, and Sun Microsystems's Jini, which aims to connect services. These initiatives, however, attack the problem at a structural or syntactic level and rely heavily on standardization of a predetermined set of functionality descriptions. Standardization can only go so far, because we can't anticipate all possible future needs.

The Semantic Web, in contrast, is more flexible. The consumer and producer agents can reach a shared understanding by exchanging ontologies, which provide the vocabulary needed for discussion. Agents can even "bootstrap" new reasoning capabilities when they discover new ontologies. Semantics also makes it easier to take advantage of a service that only partially matches a request.

A typical process will involve the creation of a "value chain" in which subassemblies of information are passed from one agent to another, each one "adding value," to construct the final product requested by the end user. Make no mistake: to create complicated value chains automatically on demand, some agents will exploit artificial-intelligence technologies in addition to the Semantic Web. But the Semantic Web will provide the foundations and the framework to make such technologies more feasible.

Putting all these features together results in the abilities exhibited by Pete's and Lucy's agents in the scenario that opened this article. Their agents would have delegated the task in piecemeal fashion to other services and agents discovered through service advertisements. For example, they could have used a trusted service to take a list of providers and determine which of them are in-plan for a specified insurance plan and course of treatment. The list of providers would have been supplied by another search service, et cetera. These activities formed chains in which a large amount of data distributed across the Web (and almost worthless in that form) was progressively reduced to the small amount of data of high value to Pete and Lucy—a plan of appointments to fit their schedules and other requirements.

In the next step, the Semantic Web will break out of the virtual realm and extend into our physical world. URIs can point to anything, including physical entities, which means we can use the RDF language to describe devices such as cell phones and TVs. Such devices can advertise their functionality—what they can do and how they are controlled—much like software agents.

[Illustration: SOFTWARE AGENTS will be greatly facilitated by semantic content on the Web. In the depicted scenario, Lucy's agent tracks down a physical therapy clinic for her mother that meets a combination of criteria and has open appointment times that mesh with her and her brother Pete's schedules. Ontologies that define the meaning of semantic data play a key role in enabling the agent to understand what is on the Semantic Web, interact with sites and employ other automated services. (1) Lucy issues instructions. (2) Her agent follows hyperlinks in the request to ontologies where key terms are defined; links to ontologies are used at every step. (3) After getting treatment info from the doctor's computer and schedule info from Lucy's and Pete's computers, the agent goes to a provider finder service. (4) The finder service sends out its own agents to look at semantics-enhanced insurance company lists and provider sites. (5) Lucy's agent and the finder service negotiate using ontologies and agree on payment for its service. (6) Lucy's agent interacts with the selected individual provider sites to find one with suitable open appointment times, which it tentatively reserves. (7) The agent sends the appointment plan to Lucy and Pete at Pete's home (per Lucy's request) for their approval.]


Being much more flexible than low-level schemes such as Universal Plug and Play, such a semantic approach opens up a world of exciting possibilities.

For instance, what today is called home automation requires careful configuration for appliances to work together. Semantic descriptions of device capabilities and functionality will let us achieve such automation with minimal human intervention. A trivial example occurs when Pete answers his phone and the stereo sound is turned down. Instead of having to program each specific appliance, he could program such a function once and for all to cover every local device that advertises having a volume control—the TV, the DVD player and even the media players on the laptop that he brought home from work this one evening.
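Programmed once, such a rule might amount to little more than the following sketch, in which every local device that advertises a volume-control capability is turned down when the phone is answered; the device names and the capability term are invented:

    # Hypothetical local devices, each advertising its capabilities semantically.
    devices = [
        {"name": "living-room TV", "capabilities": {"volume-control"}, "volume": 8},
        {"name": "DVD player",     "capabilities": {"volume-control"}, "volume": 6},
        {"name": "work laptop",    "capabilities": {"volume-control"}, "volume": 5},
        {"name": "thermostat",     "capabilities": {"temperature"},    "volume": None},
    ]

    def on_phone_answered(devices, quiet_level=1):
        """Turn down every device that says it has a volume control."""
        for device in devices:
            if "volume-control" in device["capabilities"]:
                device["volume"] = min(device["volume"], quiet_level)

    on_phone_answered(devices)
    print([(d["name"], d["volume"]) for d in devices])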

The first concrete steps have already been taken in this area, with work on developing a standard for describing functional capabilities of devices (such as screen sizes) and user preferences. Built on RDF, this standard is called Composite Capability/Preference Profile (CC/PP). Initially it will let cell phones and other nonstandard Web clients describe their characteristics so that Web content can be tailored for them on the fly. Later, when we add the full versatility of languages for handling ontologies and logic, devices could automatically seek out and employ services and other devices for added information or functionality. It is not hard to imagine your Web-enabled microwave oven consulting the frozen-food manufacturer's Web site for optimal cooking parameters.
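The spirit of such a capability profile can be suggested with a small dictionary and a content-tailoring check; the field names and values below are invented and merely echo the kinds of properties CC/PP is meant to describe:

    # A made-up device profile in the spirit of CC/PP: the client describes
    # itself, and the server tailors content accordingly.
    phone_profile = {
        "screen_width_px": 176,
        "screen_height_px": 208,
        "supports_images": True,
        "preferred_language": "en",
    }

    def tailor_page(profile):
        """Pick a page layout suited to the client's declared screen size."""
        if profile["screen_width_px"] < 320:
            return "single-column text layout, thumbnail images"
        return "full desktop layout"

    print(tailor_page(phone_profile))   # single-column text layout, thumbnail images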

Evolution of Knowledge

The Semantic Web is not "merely" the tool for conducting individual tasks that we have discussed so far. In addition, if properly designed, the Semantic Web can assist the evolution of human knowledge as a whole.

Human endeavor is caught in an eternal tension between the effectiveness of small groups acting independently and the need to mesh with the wider community. A small group can innovate rapidly and efficiently, but this produces a subculture whose concepts are not understood by others. Coordinating actions across a large group, however, is painfully slow and takes an enormous amount of communication. The world works across the spectrum between these extremes, with a tendency to start small—from the personal idea—and move toward a wider understanding over time.

An essential process is the joining together of subcultures when a wider common language is needed. Often two groups independently develop very similar concepts, and describing the relation between them brings great benefits. Like a Finnish-English dictionary, or a weights-and-measures conversion table, the relations allow communication and collaboration even when the commonality of concept has not (yet) led to a commonality of terms.

The Semantic Web, in naming every concept simply by a URI, lets anyone express new concepts that they invent with minimal effort. Its unifying logical language will enable these concepts to be progressively linked into a universal Web. This structure will open up the knowledge and workings of humankind to meaningful analysis by software agents, providing a new class of tools by which we can live, work and learn together.


MORE TO EXPLORE

Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. Tim Berners-Lee, with Mark Fischetti. HarperSanFrancisco, 1999.

World Wide Web Consortium (W3C): www.w3.org/

W3C Semantic Web Activity: www.w3.org/2001/sw/

An introduction to ontologies: www.SemanticWeb.org/knowmarkup.html

Simple HTML Ontology Extensions Frequently Asked Questions (SHOE FAQ): www.cs.umd.edu/projects/plus/SHOE/faq.html

DARPA Agent Markup Language (DAML) home page: www.daml.org/

What Is the Killer App?

AFTER WE GIVE a presentation about the Semantic Web, we're often asked, "Okay, so what is the killer application of the Semantic Web?" The "killer app" of any technology, of course, is the application that brings a user to investigate the technology and start using it. The transistor radio was a killer app of transistors, and the cell phone is a killer app of wireless technology.

So what do we answer? "The Semantic Web is the killer app."

At this point we're likely to be told we're crazy, so we ask a question in turn: "Well, what's the killer app of the World Wide Web?"

Now we're being stared at kind of fish-eyed, so we answer ourselves: "The Web is the killer app of the Internet. The Semantic Web is another killer app of that magnitude."

The point here is that the abilities of the Semantic Web are too general to be thought about in terms of solving one key problem or creating one essential gizmo. It will have uses we haven't dreamed of.

Nevertheless, we can foresee some disarming (if not actually killer) apps that will drive initial use. Online catalogs with semantic markup will benefit both buyers and sellers. Electronic commerce transactions will be easier for small businesses to set up securely with greater autonomy. And one final example: you make reservations for an extended trip abroad. The airlines, hotels, soccer stadiums and so on return confirmations with semantic markup. All the schedules load directly into your date book and all the expenses directly into your accounting program, no matter what semantics-enabled software you use. No more laborious cutting and pasting from e-mail. No need for all the businesses to supply the data in half a dozen different formats or to create and impose their own standard format.


The Worldwide Computer

By David P. Anderson and John Kubiatowicz

An operating system spanning the Internet would bring the power of millions of the world's Internet-connected PCs to everyone's fingertips

When Mary gets home from work and goes to her PC to check e-mail, the PC isn't just sitting there. It's working for a biotech company, matching gene sequences to a library of protein molecules. Its DSL connection is busy downloading a block of radio telescope data to be analyzed later. Its disk contains, in addition to Mary's own files, encrypted fragments of thousands of other files. Occasionally one of these fragments is read and transmitted; it's part of a movie that someone is watching in Helsinki. Then Mary moves the mouse, and this activity abruptly stops. Now the PC and its network connection are all hers.

This sharing of resources doesn't stop at her desktop computer. The laptop computer in her satchel is turned off, but its disk is filled with bits and pieces of other people's files, as part of a distributed backup system. Mary's critical files are backed up in the same way, saved on dozens of disks around the world.

Later, Mary watches an independent film on her Internet-connected digital television, using a pay-per-view system. The movie is assembled on the fly from fragments on several hundred computers belonging to people like her.

Mary's computers are moonlighting for other people. But they're not giving anything away for free. As her PC works, pennies trickle into her virtual bank account. The payments come from the biotech company, the movie system and the backup service. Instead of buying expensive "server farms," these companies are renting time and space, not just on Mary's two computers but on millions of others as well. It's a win-win situation. The companies save money on hardware, which enables, for instance, the movie-viewing service to offer obscure movies. Mary earns a little cash, her files are backed up, and she gets to watch an indie film. All this could happen with an Internet-scale operating system (ISOS) to provide the necessary "glue" to link the processing and storage capabilities of millions of independent computers.

Internet-Scale Applications

ALTHOUGH MARY'S WORLD is fictional—and an Internet-scale operating system does not yet exist—developers have already produced a number of Internet-scale, or peer-to-peer, applications that attempt to tap the vast array of underutilized machines available through the Internet [see box "Existing Distributed Systems"].


These applications accomplish goals that would be difficult, unaffordable or impossible to attain using dedicated computers. Further, today's systems are just the beginning: we can easily conceive of archival services that could be relied on for hundreds of years and intelligent search engines for tomorrow's Semantic Web [see "The Semantic Web," by Tim Berners-Lee, James Hendler and Ora Lassila; Scientific American, May 2001].

Unfortunately, the creation of Internet-scale applications remains an imposing challenge. Developers must build each new application from the ground up, with much effort spent on technical matters, such as maintaining a database of users, that have little to do with the application itself. If Internet-scale applications are to become mainstream, these infrastructure issues must be dealt with once and for all.

We can gain inspiration for eliminating this duplicate effort from operating systems such as Unix and Microsoft Windows. An operating system provides a virtual computing environment in which programs operate as if they were in sole possession of the computer. It shields programmers from the painful details of memory and disk allocation, communication protocols, scheduling of myriad processes, and interfaces to devices for data input and output. An operating system greatly simplifies the development of new computer programs. Similarly, an Internet-scale operating system would simplify the development of new distributed applications.


Existing Distributed Systems

COMPUTING

GIMPS (Great Internet Mersenne Prime Search): www.mersenne.org/
Searches for large prime numbers. About 130,000 people are signed up, and five new primes have been found, including the largest prime known, which has four million digits.

distributed.net: www.distributed.net/
Has decrypted several messages by using brute-force searches through the space of possible encryption keys. More than 100 billion keys are tried each second on its current decryption project. Also searches for sets of numbers called optimal Golomb rulers, which have applications in coding and communications.

SETI@home (Search for Extraterrestrial Intelligence): http://setiathome.berkeley.edu/
Analyzes radio telescope data, searching for signals of extraterrestrial origin. A total of 3.4 million users have devoted more than 800,000 years of processor time to the task.

folding@home: http://folding.stanford.edu/
Run by Vijay Pande's group in the chemistry department at Stanford University, this project has about 20,000 computers performing molecular-dynamics simulations of how proteins fold, including the folding of Alzheimer amyloid-beta protein.

Intel/United Devices cancer research project: http://members.ud.com/projects/cancer/
Searches for possible cancer drugs by testing which of 3.5 billion molecules are best shaped to bind to any one of eight proteins that cancers need to grow.

STORAGE

Napster: www.napster.com/
Allowed users to share digital music. A central database stored the locations of all files, but data were transferred directly between user systems. Songwriters and music publishers brought a class-action lawsuit against Napster. The parties reached an agreement whereby rights to the music would be licensed to Napster and artists would be paid, but the new fee-based service had not started as of January 2002.

Gnutella: www.gnutella.com/
Provides a private, secure shared file system. There is no central server; instead a request for a file is passed from each computer to all its neighbors.

Freenet: http://freenetproject.org/
Offers a similar service to Gnutella but uses a better file-location protocol. Designed to keep file requesters and suppliers anonymous and to make it difficult for a host owner to determine or be held responsible for the Freenet files stored on his computer.

Mojo Nation: www.mojonation.net/
Also similar to Gnutella, but files are broken into small pieces that are stored on different computers to improve the rate at which data can be uploaded to the network. A virtual payment system encourages users to provide resources.

Fasttrack P2P Stack: www.fasttrack.nu/
A peer-to-peer system in which more powerful computers become search hubs as needed. This software underlies the Grokster, MusicCity ("Morpheus") and KaZaA file-sharing services.


An ISOS consists of a thin layer of software (an ISOS agent) that runs on each "host" computer (such as Mary's) and a central coordinating system that runs on one or more ISOS server complexes. This veneer of software would provide only the core functions of allocating and scheduling resources for each task, handling communication among host computers and determining the reimbursement required for each machine. This type of operating system, called a microkernel, relegates higher-level functions to programs that make use of the operating system but are not a part of it. For instance, Mary would not use the ISOS directly to save her files as pieces distributed across the Internet. She might run a backup application that used ISOS functions to do that for her. The ISOS would use principles borrowed from economics to apportion computing resources to different users efficiently and fairly and to compensate the owners of the resources.

Two broad types of applications might benefit from an ISOS. The first is distributed data processing, such as physical simulations, radio signal analysis, genetic analysis, computer graphics rendering and financial modeling. The second is distributed online services, such as file storage systems, databases, hosting of Web sites, streaming media (such as online video) and advanced Web search engines.

What's Mine Is Yours

COMPUTING TODAY operates predominantly as a private resource; organizations and individuals own the systems that they use. An ISOS would facilitate a new paradigm in which it would be routine to make use of resources all across the Internet. The resource pool—hosts able to compute or store data and networks able to transfer data between hosts—would still be individually owned, but they could work for anyone. Hosts would include desktops, laptops, server computers, network-attached storage devices and maybe handheld devices.

The Internet resource pool differs from private resource pools in several important ways. More than 150 million hosts are connected to the Internet, and the number is growing exponentially. Consequently, an ISOS could provide a virtual computer with potentially 150 million times the processing speed and storage capacity of a typical single computer. Even when this virtual computer is divided up among many users, and after one allows for the overhead of running the network, the result is a bigger, faster and cheaper computer than the users could own privately. Continual upgrading of the resource pool's hardware causes the total speed and capacity of this über-computer to increase even faster than the number of connected hosts. Also, the pool is self-maintaining: when a computer breaks down, its owner eventually fixes or replaces it.

[Illustration: MOONLIGHTING COMPUTERS. With Internet-scale applications, PCs around the world can work during times when they would otherwise sit idle. Here's how it works: (1) Mary's home computer works while she's away. It's one of millions of PCs that are crunching data and delivering file fragments for the network. (2) When Mary gets back on her PC, the work for the network is automatically suspended. (3) Her laptop stores backup copies of encrypted fragments of other users' files. The laptop is connected only occasionally, but that suffices. (4) Later, Mary watches an obscure indie movie that is consolidated from file fragments delivered by the network. (5) An Internet-scale operating system (ISOS) coordinates all the participating computers and pays them for their work.]

THE AUTHORS

DAVID P. ANDERSON and JOHN KUBIATOWICZ are both associated with the University of California, Berkeley. Anderson was on the faculty of the computer science department from 1985 to 1991. He is now director of the SETI@home project and chief science officer of United Devices, a provider of distributed computing software that is allied with the distributed.net project. Kubiatowicz is an assistant professor of computer science at Berkeley and is chief architect of OceanStore, a distributed storage system under development with many of the properties required for an ISOS.



Extraordinary parallel data transmission is possible with the Internet resource pool. Consider Mary's movie, being uploaded in fragments from perhaps 200 hosts. Each host may be a PC connected to the Internet by an antiquated 56k modem—far too slow to show a high-quality video—but combined they could deliver 10 megabits a second, better than a cable modem. Data stored in a distributed system are available from any location (with appropriate security safeguards) and can survive disasters that knock out sections of the resource pool. Great security is also possible, with systems that could not be compromised without breaking into, say, 10,000 computers.
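The arithmetic behind that claim is straightforward, as this back-of-the-envelope sketch shows (using the illustrative figures above):

    hosts = 200                    # PCs each uploading one fragment of the movie
    per_host_kbit_s = 56           # an antiquated 56k modem apiece
    combined_mbit_s = hosts * per_host_kbit_s / 1000
    print(combined_mbit_s)         # about 11.2 Mbit/s, comfortably above a cable modem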

In this way, the Internet-resource paradigm can increase the bounds of what is possible (such as higher speeds or larger data sets) for some applications, whereas for others it can lower the cost. For certain applications it may do neither—it's a paradigm, not a panacea. And designing an ISOS also presents a number of obstacles.

Some characteristics of the resource pool create difficulties that an ISOS must deal with. The resource pool is heterogeneous: Hosts have different processor types and operating systems. They have varying amounts of memory and disk space and a wide range of Internet connection speeds. Some hosts are behind firewalls or other similar layers of software that prohibit or hinder incoming connections. Many hosts in the pool are available only sporadically; desktop PCs are turned off at night, and laptops and systems using modems are frequently not connected. Hosts disappear unpredictably—sometimes permanently—and new hosts appear.

The ISOS must also take care not to antagonize the owners of hosts. It must have a minimal impact on the non-ISOS uses of the hosts, and it must respect limitations that owners may impose, such as allowing a host to be used only at night or only for specific types of applications. Yet the ISOS cannot trust every host to play by the rules in return for its own good behavior. Owners can inspect and modify the activities of their hosts. Curious and malicious users may attempt to disrupt, cheat or spoof the system. All these problems have a major influence on the design of an ISOS.

Who Gets What?

AN INTERNET-SCALE operating system must address two fundamental issues—how to allocate resources and how to compensate resource suppliers. A model based on economic principles in which suppliers lease resources to consumers can deal with both issues at once. In the 1980s researchers at Xerox PARC proposed and analyzed economic approaches to apportioning computer resources. More recently, Mojo Nation developed a file-sharing system in which users are paid in a virtual currency ("mojo") for use of their resources and they in turn must pay mojo to use the system. Such economic models encourage owners to allow their resources to be used by other organizations, and theory shows that they lead to optimal allocation of resources.

Even with 150 million hosts at its disposal, the ISOS will be dealing in "scarce" resources, because some tasks will request and be capable of using essentially unlimited resources. As it constantly decides where to run data-processing jobs and how to allocate storage space, the ISOS must try to perform tasks as cheaply as possible. It must also be fair, not allowing one task to run efficiently at the expense of another. Making these criteria precise—and devising scheduling algorithms to achieve them, even approximately—are areas of active research.

[Illustration: HOW A DISTRIBUTED SERVICE WOULD OPERATE. (1) Acme Movie Service wants to distribute movies to viewers. Acme requests hosts from the ISOS server, the system's traffic cop. The ISOS server sends a host list, for which Acme pays. (2) ISOS tells its agent programs on the host computers to perform tasks for Acme. ISOS pays the hosts for the use of their resources. (3) Acme sends its movie service agent program to the hosts. Acme splits its movie into fragments and also sends them to hosts. (4) Mary orders an Acme movie and pays Acme directly. (5) The movie service instructs its agents to send Mary the movie. ISOS pays the hosts for their work. (6) Hundreds of hosts send small pieces of the movie file to Mary's Internet-enabled TV. (7) The movie is assembled, and Mary is free to enjoy her Acme movie.]


The economic system for a shared network must define the basic units of a resource, such as the use of a megabyte of disk space for a day, and assign values that take into account properties such as the rate, or bandwidth, at which the storage can be accessed and how frequently it is available to the network. The system must also define how resources are bought and sold (whether they are paid for in advance, for instance) and how prices are determined (by auction or by a price-setting middleman).

Within this framework, the ISOS must accurately and securely keep track of resource usage. The ISOS would have an internal bank with accounts for suppliers and consumers that it must credit or debit according to resource usage. Participants can convert between ISOS currency and real money. The ISOS must also ensure that any guarantees of resource availability can be met: Mary doesn't want her movie to grind to a halt partway through. The economic system lets resource suppliers control how their resources are used. For example, a PC owner might specify that her computer's processor can't be used between 9 A.M. and 5 P.M. unless a very high price is paid.
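In greatly simplified form, the bookkeeping and the owner's time-of-day rule could look something like this sketch; the account names, prices and policy are invented:

    # Toy ISOS bank: credit suppliers and debit consumers for resource use.
    accounts = {"mary": 0.0, "acme-movies": 100.0}

    def charge(consumer, supplier, units, price_per_unit):
        cost = units * price_per_unit
        accounts[consumer] -= cost
        accounts[supplier] += cost

    def marys_cpu_price(hour):
        """Mary's policy: her processor is expensive during the working day."""
        return 0.10 if 9 <= hour < 17 else 0.001   # dollars per CPU-hour

    charge("acme-movies", "mary", units=3, price_per_unit=marys_cpu_price(hour=22))
    print(accounts)   # Mary has earned a little cash for three night-time CPU-hours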

Money, of course, encourages fraud, and ISOS participants have many ways to try to defraud one another. For instance, resource sellers, by modifying or fooling the ISOS agent program running on their computer, may return fictitious results without doing any computation.

Researchers have explored statistical methods for detecting malicious or malfunctioning hosts. A recent idea for preventing unearned computation credit is to ensure that each work unit has a number of intermediate results that the server can quickly check and that can be obtained only by performing the entire computation. Other approaches are needed to prevent fraud in data storage and service provision.
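One way to picture the intermediate-result idea is with a stand-in hash chain: the host reports a checkpoint every so many steps, and the server recomputes a few randomly chosen gaps between checkpoints. This is only an illustration, not the mechanism any real project uses:

    import hashlib, random

    INTERVAL = 1000   # steps between reported checkpoints

    def hash_chain(value: bytes, steps: int) -> bytes:
        for _ in range(steps):
            value = hashlib.sha256(value).digest()
        return value

    def do_work(seed: bytes, steps: int) -> dict:
        """Stand-in for a work unit: report a checkpoint every INTERVAL steps."""
        checkpoints, value = {0: seed}, seed
        for step in range(INTERVAL, steps + 1, INTERVAL):
            value = hash_chain(value, INTERVAL)
            checkpoints[step] = value
        return checkpoints

    def spot_check(checkpoints: dict, samples: int = 2) -> bool:
        """Recompute a few gaps between checkpoints; each check is cheap, and a
        host that skipped the work risks being caught on any sampled gap."""
        steps = sorted(checkpoints)
        for step in random.sample(steps[1:], samples):
            if hash_chain(checkpoints[step - INTERVAL], INTERVAL) != checkpoints[step]:
                return False
        return True

    reported = do_work(b"work unit 42", steps=5000)
    print(spot_check(reported))   # True for an honest host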

The cost of ISOS resources to end users will converge to a fraction of the cost of owning the hardware. Ideally, this fraction will be large enough to encourage owners to participate and small enough to make many Internet-scale applications economically feasible. A typical PC owner might see the system as a barter economy in which he gets free services, such as file backup and Web hosting, in exchange for the use of his otherwise idle processor time and disk space.

A Basic Architecture

WE ADVOCATE two basic principles in our ISOS design: a minimal core operating system and control by central servers.


Primes and Crimes

By Graham P. Collins

NO ONE HAS SEEN signs of extraterrestrials using a distributed computation project (yet), but people have found the largest-known prime numbers, five-figure reward money—and legal trouble.

The Great Internet Mersenne Prime Search (GIMPS), operating since 1996, has turned up five extremely large prime numbers so far. The fifth and largest was discovered in November 2001 by 20-year-old Michael Cameron of Owen Sound, Ontario. Mersenne primes can be expressed as 2^P – 1, where P is itself a prime number. Cameron's is 2^13,466,917 – 1, which would take four million digits to write out. His computer spent 45 days discovering that his number is a prime; altogether the GIMPS network expended 13,000 years of computer time eliminating other numbers that could have been the 39th Mersenne.

The 38th Mersenne prime, a mere two million digits long, earned its discoverer (Nayan Hajratwala of Plymouth, Mich.) a $50,000 reward for being the first prime with more than a million digits. A prime with 10 million digits will win someone $100,000.

A Georgia computer technician, on the other hand, has found nothing but trouble through distributed computation. In 1999 David McOwen installed the client program for the "distributed.net" decryption project on computers in seven offices of the DeKalb Technical Institute, along with Y2K upgrades. During the Christmas holidays, the computers' activity was noticed, including small data uploads and downloads each day. In January 2000 McOwen was suspended, and he resigned soon thereafter.

Case closed? Case barely opened: The Georgia Bureau of Investigation spent 18 months investigating McOwen as a computer criminal, and in October 2001 he was charged with eight felonies under Georgia's computer crime law. The one count of computer theft and seven counts of computer trespass each carry a $50,000 fine and up to 15 years in prison. On January 17, a deal was announced whereby McOwen will serve one year of probation, pay $2,100 in restitution and perform 80 hours of community service unrelated to computers or technology.

Graham P. Collins is a staff writer and editor.


A computer operating system that provides only core functions is called a microkernel. Higher-level functions are built on top of it as user programs, allowing them to be debugged and replaced more easily. This approach was pioneered in academic research systems and has influenced some commercial systems, such as Windows NT. Most well-known operating systems, however, are not microkernels.

The core facilities of an ISOS include resource allocation (long-term assignment of hosts' processing power and storage), scheduling (putting jobs into queues, both across the system and within individual hosts), accounting of resource usage, and the basic mechanisms for distributing and executing application programs. The ISOS should not duplicate features of local operating systems running on hosts.

The system should be coordinated by servers operated by the ISOS provider, which could be a government-funded organization or a consortium of companies that are major resource sellers and buyers. (One can imagine competing ISOS providers, but we will keep things simple and assume a unique provider.) Centralization runs against the egalitarian approach popular in some peer-to-peer systems, but central servers are needed to ensure privacy of sensitive data, such as accounting data and other information about the resource hosts. Centralization might seem to require a control system that will become excessively large and unwieldy as the number of ISOS-connected hosts increases, and it appears to introduce a bottleneck that will choke the system anytime it is unavailable. These fears are unfounded: a reasonable number of servers can easily store information about every Internet-connected host and communicate with them regularly. Napster, for example, handled almost 60 million clients using a central server. Redundancy can be built into the server complex, and most ISOS online services can continue operating even with the servers temporarily unavailable.

The ISOS server complex would maintain databases of resource descriptions, usage policies and task descriptions. The resource descriptions include, for example, the host's operating system, processor type and speed, total and free disk space, memory space, performance statistics of its network connections, and statistical descriptions of when it is powered on and connected to the network. Usage policies spell out the rules an owner has dictated for using her resources. Task descriptions include the resources assigned to an online service and the queued jobs of a data-processing task.

To make their computers available to the network, resource sellers contact the server complex (for instance, through a Web site) to download and install an ISOS agent program, to link resources to their ISOS account, and so on. The ISOS agent manages the host's resource usage. Periodically it obtains from the ISOS server complex a list of tasks to perform.
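In outline, an ISOS agent is a small loop that reports what the host can offer, asks the coordinating servers for work and respects the owner's limits. The sketch below fakes the server with a local function, and every field name is invented:

    # Description of this host's resources, as the agent might report it.
    host_description = {
        "cpu_ghz": 1.0,
        "free_disk_gb": 20,
        "connected": True,
        "owner_policy": {"allow_hours": range(22, 24)},   # late evening only
    }

    def fetch_task_list(description):
        """Stand-in for contacting the ISOS server complex; the description
        would accompany the request so the servers can match tasks to hosts."""
        return [{"task_id": 1, "kind": "gene-matching", "cpu_hours": 0.1}]

    def owner_allows(policy, hour):
        return hour in policy["allow_hours"]

    def agent_step(now_hour):
        """One pass of the agent: only work when the owner's policy permits."""
        if not owner_allows(host_description["owner_policy"], now_hour):
            return "idle (owner policy)"
        tasks = fetch_task_list(host_description)
        return f"running {len(tasks)} task(s)"

    print(agent_step(now_hour=23))   # running 1 task(s)
    print(agent_step(now_hour=10))   # idle (owner policy)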

Resource buyers send the servers task requests and application agent programs (to be run on hosts). An online service provider can ask the ISOS for a set of hosts on which to run, specifying its resource requirements (for example, a distributed backup service could use sporadically connected resource hosts—Mary's laptop—which would cost less than constantly connected hosts). The ISOS supplies the service with addresses and descriptions of the granted hosts and allows the application agent program to communicate directly between hosts on which it is running. The service can request new hosts when some become unavailable.

[Illustration: WHAT AN INTERNET-SCALE OPERATING SYSTEM COULD DO. By harnessing the massive unused computing resources of the global network, an ISOS would make short work of daunting number-crunching tasks and data storage. Here are just a few of the possibilities: 3-D rendering and animation; file backup and archiving for hundreds of years; matching gene sequences; streaming media pay-per-view service; financial modeling; SETI analysis of celestial radio signals.]


The ISOS does not dictate how clients make use of an online service, how the service responds or how clients are charged by the service (unlike the ISOS-controlled payments flowing from resource users to host owners).

An Application Toolkit

IN PRINCIPLE, the basic facilities of the ISOS—resource allocation, scheduling and communication—are sufficient to construct a wide variety of applications. Most applications, however, will have important subcomponents in common. It is useful, therefore, to have a software toolkit to further assist programmers in building new applications. Code for these facilities will be incorporated into applications on resource hosts. Examples of these facilities include:

Location independent routing. Applications running with the ISOS can spread copies of information and instances of computation among millions of resource hosts. They have to be able to access them again. To facilitate this, applications name objects under their purview with Globally Unique Identifiers (GUIDs). These names enable "location independent routing," which is the ability to send queries to objects without knowing their location. A simplistic approach to location independent routing could involve a database of GUIDs on a single machine, but that system is not amenable to handling queries from millions of hosts. Instead the ISOS toolkit distributes the database of GUIDs among resource hosts. This kind of distributed system is being explored in research projects such as the OceanStore persistent data storage project at the University of California at Berkeley.
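A very rough sketch of the underlying idea is to assign each GUID to a host by hashing, so that any node can compute where to send a query without consulting a central database. The host names below are invented, and real systems such as OceanStore use far more sophisticated routing:

    import hashlib

    hosts = ["host-a.example", "host-b.example", "host-c.example"]

    def guid_for(name: str) -> str:
        """Globally Unique Identifier: here, simply a hash of the object's name."""
        return hashlib.sha256(name.encode()).hexdigest()

    def home_host(guid: str) -> str:
        """Every node applies the same rule, so a query can be routed to the
        responsible host without asking any central database."""
        return hosts[int(guid, 16) % len(hosts)]

    guid = guid_for("mary-backup-fragment-17")
    print(guid[:12], "->", home_host(guid))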

Persistent data storage. Information stored by the ISOS must be able to survive a variety of mishaps. The persistent data facility aids in this task with mechanisms for encoding, reconstructing and repairing data. For maximum survivability, data are encoded with an "m-of-n" code. An m-of-n code is similar in principle to a hologram, from which a small piece suffices for reconstructing the whole image. The encoding spreads information over n fragments (on n resource hosts), any m of which are sufficient to reconstruct the data. For instance, the facility might encode a document into 64 fragments, any 16 of which suffice to reconstruct it. Continuous repair is also important. As fragments fail, the repair facility would regenerate them. If properly constructed, a persistent data facility could preserve information for hundreds of years.
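A toy m-of-n code in this spirit can be built from polynomial interpolation: the data chunks define a polynomial, the fragments are its values at n points, and any m fragments recover the data. This is purely illustrative (real facilities use much more efficient codes), and it assumes Python 3.8 or later for the modular inverse:

    P = 65537                      # prime larger than any 2-byte chunk value

    def lagrange_eval(points, x):
        """Evaluate, at x, the unique polynomial through the given (xi, yi) points (mod P)."""
        total = 0
        for xi, yi in points:
            num, den = 1, 1
            for xj, _ in points:
                if xj != xi:
                    num = num * (x - xj) % P
                    den = den * (xi - xj) % P
            total = (total + yi * num * pow(den, -1, P)) % P
        return total

    def encode(chunks, n):
        """Spread m data chunks over n fragments (n >= m)."""
        data_points = list(enumerate(chunks))            # f(0..m-1) = the data
        return [(x, lagrange_eval(data_points, x)) for x in range(n)]

    def decode(fragments, m):
        """Rebuild the m data chunks from any m surviving fragments."""
        return [lagrange_eval(fragments[:m], x) for x in range(m)]

    chunks = [int.from_bytes(b, "big") for b in (b"AR", b"CH", b"IV")]   # m = 3
    fragments = encode(chunks, n=5)
    survivors = [fragments[1], fragments[3], fragments[4]]              # any 3 of 5
    print(decode(survivors, m=3) == chunks)                             # True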

Secure update. New problems arise when applications need to update stored information. For example, all copies of the information must be updated, and the object's GUID must point to its latest copy. An access control mechanism must prevent unauthorized persons from updating information. The secure update facility relies on Byzantine agreement protocols, in which a set of resource hosts come to a correct decision even if a third of them are trying to lead the process astray.

Other facilities. The toolkit also assists by providing additional facilities, such as format conversion (to handle the heterogeneous nature of hosts) and synchronization libraries (to aid in cooperation among hosts).

An ISOS suffers from a familiar catch-22 that slows the adoption of many new technologies: Until a wide user base exists, only a limited set of applications will be feasible on the ISOS. Conversely, as long as the applications are few, the user base will remain small. But if a critical mass can be achieved by convincing enough developers and users of the intrinsic usefulness of an ISOS, the system should grow rapidly.

The Internet remains an immense untapped resource. The revolutionary rise in popularity of the World Wide Web has not changed that—it has made the resource pool all the larger. An Internet-scale operating system would free programmers to create applications that could run on this World Wide Computer without worrying about the underlying hardware. Who knows what will result? Mary and her computers will be doing things we haven't even imagined.

MORE TO EXPLORE

The Ecology of Computation. B. A. Huberman. North-Holland, 1988.

The Grid: Blueprint for a New Computing Infrastructure. Edited by Ian Foster and Carl Kesselman. Morgan Kaufmann Publishers, 1998.

Peer-to-Peer: Harnessing the Power of Disruptive Technologies. Edited by Andy Oram. O’Reilly & Associates, 2001.

Many research projects are working toward an Internet-scale operating system, including:

Chord: www.pdos.lcs.mit.edu/chord/

Cosm: www.mithral.com/projects/cosm/

Eurogrid: www.eurogrid.org/

Farsite: http://research.microsoft.com/sn/farsite/

Grid Physics Network (Griphyn): www.griphyn.org/

OceanStore: http://oceanstore.cs.berkeley.edu/

Particle Physics Data Grid: www.ppdg.net/

Pastry: www.research.microsoft.com/~antr/pastry/

Tapestry: www.cs.berkeley.edu/~ravenben/tapestry/
