

Workshop Report on a Future Information Infrastructure for the Physical Sciences

The Facts of the Matter: Finding, understanding, and using information about our physical world

Hosted by the Department of Energy at the National Academy of Sciences

May 30-31, 2000

Preface

Forty years ago it took days, weeks or even months for information regarding an interesting discovery to be communicated to the relevant community of scientists and engineers. At that time, most of us kept a collection of postcards that we used to request reprints of articles as they appeared in the journals we read. This was the situation at the time that Ted Maiman reported his results using ruby as a medium to make a laser. Some twenty years later, this time interval was shortened to days by fax machines when Müller and Bednorz revealed their experiments demonstrating high temperature superconductivity. Today, with the Internet, that interval has shrunk to seconds, minutes, at most, hours. I suspect that some astronomical observations are announced within seconds of the time it takes to type a note and launch it on the Internet to a few tens or hundreds of one’s closest colleagues. This situation makes it imperative to have a system in place that allows the rapid communication of information of value to scientists and engineers who are engaged in what has become an intensely competitive research environment.

This Workshop seeks to cast light on the problem of communication and dissemination of information within the physical sciences community, and to make practical suggestions regarding steps that can be taken to improve the situation.

Alvin W. Trivelpiece
Formerly, Director of ORNL

Acknowledgement

The Department of Energy sponsors of the Workshop gratefully acknowledge the enthusiastic participation and contributions of the Workshop Chair and Panelists who developed this report.

Alvin Trivelpiece, Chair
R. Stephen Berry, University of Chicago
Martin Blume, American Physical Society
Jose-Marie Griffiths, University of Michigan
Lee Holcomb, National Aeronautics and Space Administration
Kirk McDonald, Princeton University
Krishna Rajan, Rensselaer Polytechnic Institute
Kent Smith, National Library of Medicine
Derek Winstanley, Illinois State Water Survey

Table of Contents

Executive Summary
The Challenge
The Purpose
Major Themes and Conclusions
  Scope of the Initiative
  Information Types
  Infrastructure, Products, and Services
  Archiving, Preservation and Access to Information
  Research, Education, and The Public Interest
  Quality
  Participation of Sectors of the Economy
  Leadership
  Funding
  Timing
  Reward Structure
  International Diplomacy
  A Name
  The Biomedical Model: Validation of the Concept
Findings
Implementation
  Doing Better At What We’re Doing Now (FY01)
  Mobilizing For What Is Possible Tomorrow (FY02–03)
  Realizing The Future Potential (Out Years)
Appendices
  Appendix A, Workshop Panelists and Participants
  Appendix B, Agenda
  Appendix C, Working Group Themes
  Appendix D, The Biomedical Model
  Appendix E, Suggested R&D for the Future Information Infrastructure for the Physical Sciences–Dr. Krishna Rajan

Executive Summary

On May 30–31, 2000, a Workshop was held at the National Academy of Sciences to address questions regarding an “Information Infrastructure for the Physical Sciences” to increase the productivity of the scientific enterprise in the United States.

Chaired by Dr. Alvin W. Trivelpiece, formerly Director of Oak Ridge National Laboratory, the Workshop was composed of a panel of experts in science, science policy, information science, and scientific publishing. Other participants included representatives from the community of potential stakeholders in such an enterprise.

The concept of an “Information Infrastructure for the Physical Sciences” is built on a history of studies that have called for a national information resource in the sciences. The study by the President’s Information Technology Advisory Committee (PITAC)[1] states a vision that “An individual can access, query, or print any ... magazine, data item, or reference document... by simply clicking a mouse...” The Loken Report[2] notes that this is a “time of true revolution in the communication of scientific information,” and discusses the concept of a “worldwide Physics Information System”. It references a “National Library of Science,” and concludes with the recommended goal of a scientific information system from which “all the world’s formal scientific literature is available, on-line, to scientific workers throughout the world, for a world scientific database.” These reports served as the base to begin discussions. For well over 50 years, since the introduction of the computer, studies have called for a better focus on the information problem.

The rapid advances in information technology that are dramatically altering the nature of scientific communications were recognized. Traditional means of access to the scholarly record are no longer sufficient to meet researchers’ needs and expectations or even to follow the rapid pace of scientific developments. New concepts to fill the knowledge gap are emerging. Today we are in a new and unique environment. The ability to compete is first based on our ability to know quickly. The value is not in having the knowledge, but in how it is used. Scientists have already been changing the way science is being done; institutions are ripe for change. The Internet and distributed information technologies now provide us a means to realize what were only visions in the past. These are basic assumptions that were supported by the results of this Workshop.

Scope of the Initiative

Begin with the Physical Sciences so that it is doable, but engineer it so that it is scalable.

Information Types

We need a conceptual change that allows us to better integrate information types (e.g., text, data, images, animations) as well as information at various stages in the analysis process (raw data, partially processed data, text summaries of analyzed data in varying degrees of analyses, and unreviewed or peer-reviewed documents). Such a conceptual change would facilitate the serendipity and insights that can be gained by dealing with these multiple information types and their interrelationships.

Information Products and Services

Information Infrastructure has to be more than a storage and retrieval capability. Rather, in the long term, it has to support fundamentally new ways of doing science.


[1] “Information Technology Research: Investing in Our Future,” President’s Information Technology Advisory Committee, February 1999, p. 13.
[2] Report of the American Physical Society (APS) Task Force on Electronic Information Systems, popularly known as “The Loken Report”, published in the Bulletin of the American Physical Society, Vol. 36, No. 4, p. 1119, 1991; or available online at http://publish.aps.org/eprint/reports/lokenrep.html. Note: APS has commissioned an update to this report.

Archiving, Preservation and Access to Information

Archiving of data is one of the key concerns. The electronic record is much more fragile than the media that came before (print, microforms) because it comes faster, in bigger volume and is gone more quickly. There is a real need for assurance of portability, regardless of media developments.

Research, Education, and the Public Interest

The physical science information infrastructure of the future must begin by focusing on the producers of the information; i.e., the scientific community, but must recognize the potential for positively impacting education and the general public interest.

Quality

Quality in data and data processing is one of the cornerstones of success. The system must take into account the quality, the reliability, and usability aspects of the information.

Participation of Sectors of the Economy

We must continue to strive for broad and innovative collaborations. Stability comes from diversity. Though we may of necessity need to build the infrastructure with a focus on specific tasks and disciplines, we must encourage the broadest participation at the planning table to allow for future involvements.

Leadership

It is clear that we need a point of convergence and leadership to develop the common agenda and move it forward.

Funding

The government has a responsibility to disseminate the results of federally sponsored research as broadly as possible as a public good. The amount of resources needed should be benchmarked against model efforts.

Timing

The time is now, the need is now.

Infrastructure Elements

As a result of the Workshop and focused discussion, findings support the need for:

A common knowledge base that seeks in an integrated approach to provide comprehensive access and facilitate the reuse of worldwide sources of physical sciences information, regardless of where they reside, what platform(s) they reside on, or what format or data structure they employ.

A point of convergence for ensuring the awareness, availability, use, and development of information technologies and tools to facilitate information assimilation, data analyses, peer communication and collaboration, sharing of preliminary research results, remote experimentation, validation of experimental results, etc; and

An openly available source of information to serve all users, from students to scientists to concerned citizens, in a highly efficient electronic environment, with tools to assist users in their quest for information and ultimately knowledge.

Implementation

In considering implementation, the requirements for the infrastructure must deal with three time horizons:

• Doing Better at What We’re Doing Now (FY01)

• Mobilizing for What is Possible Tomorrow (FY02–03)

• Realizing the Future Potential (Out Years)

Some immediate activities were cited that confirmed what coordination and leadership in the near term could accomplish. Other efforts are beginning to come on the landscape and can be moved forward now. Of utmost importance is continuation of the planning process to include other agencies and other stakeholders.


Conclusions

Taking the National Library of Medicine’s amazing success in the delivery of medical information as a benchmark, it is clear that much can be done to positively impact research and practice in the physical sciences.

The overall conclusion of the workshop was an enthusiastic endorsement of a vision of a national infrastructure that benefits not just the scientific community but the national good. It could ultimately impact not only research and development (R&D), but also education and applications to our everyday lives. It would be a step to integrate the whole of science to provide a basis to improve society, the economy, and the environment.


The rapid advances in information technology have dramatically altered the nature of scientific communications. Traditional means of access to the scholarly record are no longer sufficient to meet researchers’ needs and expectations or even to follow the rapid pace of scientific developments. Though scientists and engineers have well-established individual patterns for information discovery, often these patterns are focused in areas of specialization and not attuned to interdisciplinary opportunities. More often than not, scientists and engineers are overwhelmed with a mass of information within their own specialty and are even more hard pressed to stay abreast of work done in other areas that could contribute to their efforts. This often results in missed opportunities or wasted resources. New concepts to fill the knowledge gap are emerging.

The current concept of a Future Information Infrastructure for the Physical Sciences, initially proposed by the Department of Energy, is built on a history of national studies that have called for a national information resource in the sciences. The study by the President’s Information Technology Advisory Committee (PITAC)[3] states a vision that “An individual can access, query, or print any ... magazine, data item, or reference document... by simply clicking a mouse...”. The Loken Report[4] notes that this is a “time of true revolution in the communication of scientific information,” and discusses the concept of a “worldwide Physics Information System.” It references a “National Library of Science,” and concludes with the recommended goal of a scientific information system from which “all the world’s formal scientific literature is available, on-line, to scientific workers throughout the world, for a world scientific database.” These reports served as the basis to begin discussions.

For well over 50 years, since the introduction of the computer, studies have called for a better focus on the information problem. So, why now? The difference today is that we are in a new and unique environment. Our ability to compete is first based on our ability to know quickly. The value is not only in having the knowledge, but in using it. Scientists have already been changing the way science is being done; institutions are ripe for change. The Internet and distributed information technologies now provide us a means to realize what were only visions in the past. These are basic assumptions that were supported by the results of this Workshop.


[3] “Information Technology Research: Investing in Our Future,” President’s Information Technology Advisory Committee, February 1999, p. 13.
[4] Report of the American Physical Society (APS) Task Force on Electronic Information Systems, popularly known as “The Loken Report”, published in the Bulletin of the American Physical Society, Vol. 36, No. 4, p. 1119, 1991; or available online at http://publish.aps.org/eprint/reports/lokenrep.html. Note: APS has commissioned an update to this report.

This is the report of a Workshop on “A Future Information Infrastructure for the Physical Sciences” held at the National Academy of Sciences on May 30–31, 2000. The Department of Energy was the host for the Workshop, and it was chaired by Dr. Alvin Trivelpiece, formerly Director of the Oak Ridge National Laboratory, with support from expert panelists representing major disciplines in the physical sciences as well as various professional roles in the scientific and technical information life cycle. The other participants in the Workshop were drawn from the community of potential stakeholders in such an enterprise. A list of the panelists and participants is provided in Appendix A. The Workshop agenda is provided in Appendix B.

The Challenge

Our Physical World

Physical processes are all around us and the scientific research we do and the data that support it have dramatic impacts on the social and economic strength and security of our country.

Business and industry convert research results into the tools and products we often take for granted. It is essential that the linkage from research results and rate of transfer to and from the business and industry communities keep pace with the global communication processes that are evolving through the use of the Internet.

It is clear that the U.S. is losing ground in patents issued, in comparison to patents issued to foreign citizens. While the number of patents issued to U.S. citizens has risen 220 percent in the period of 1963–1998, the number of patents issued to foreign citizens during the same period has risen 790 percent.[5]

Further, the percent change in Real Gross Domestic Product for the period of 1970–1996 shows a similar pattern. U.S. percent change during this period has shown a positive 210 percent (1996 GDP was $7.6T) while the East Asia/Pacific Rim (1996 GDP was $10.4T) has shown a positive 456 percent change in Real Gross Domestic Product.[6]

Both indicators would suggest that while we are continuing to show positive increases in both areas, our rate of improvement is not keeping pace with world competitors.

In addition, there are many incidents that can be cited where the absence of good data applied effectively has cost millions and caused significant harm. We can think of the Challenger accident where data were known about the temperature effects on the “O” ring, but they were not brought to bear in the decision making process.

Our industries are impacted by effective access and use of data. Five percent of the $135 billion in the U.S. chemical industry costs could be saved if data needs were brought to a level of completeness comparable to that of other design tools.[7]

Finally, the impact of the use of good data can be brought home to our everyday lives. For example, in 1982, $119 billion was spent in prevention or as a result of the fracture of materials in the U.S., and it is estimated that $35 billion could have been saved by use of best practices and technology. In 1975, $70 billion was spent on corrosion. Fifteen percent of this expenditure could have been avoided through available knowledge.[8] Fracture and corrosion impact the cars we drive, the buildings we live in, and the tools we use.
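The savings figures quoted above can be put on a common footing with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, the inputs are the dollar amounts and percentages stated in the text, and the amounts come from different years and are not adjusted for inflation, so the results should not be added together.

```python
def share_of(total_billion, percent):
    """Return `percent` percent of a total given in billions of dollars."""
    return total_billion * percent / 100


def percent_of(part_billion, total_billion):
    """Return what percentage `part_billion` is of `total_billion`, to one decimal."""
    return round(part_billion / total_billion * 100, 1)


# Figures taken from the text; each refers to a different year.
chemical_savings = share_of(135, 5)    # 5 percent of $135 billion
corrosion_savings = share_of(70, 15)   # 15 percent of $70 billion (1975)
fracture_share = percent_of(35, 119)   # $35 billion of $119 billion (1982)

print(f"Chemical industry: about ${chemical_savings} billion")
print(f"Corrosion: about ${corrosion_savings} billion")
print(f"Fracture: about {fracture_share} percent of the cost")
```

Read this way, the estimated avoidable share is roughly 5 percent for chemical industry costs, 15 percent for corrosion, and close to 30 percent for fracture.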

Our Human Resources

In 1996, there were over 3 million scientist and engineer jobs in the U.S. It is projected that this number will grow to 4.4 million by 2006, which is a 44 percent increase.[9] During the 1996–2006 time period, the demand for scientists and engineers is expected to increase at more than three times the rate for all other occupations.[10] Yet, the growth in the number of scientists and engineers to meet this need is not apparent. Data from 1997 show 5,500 Masters degrees and 4,500 Doctoral degrees in the physical sciences awarded by U.S. academic institutions.[11] This number of postsecondary degrees in the physical sciences has remained fairly constant over the last 30 years, despite increases in population and increasing reliance on new technology as a key component of our economy. Given this possible shortage in talent, more efficient tools and processes to collect, organize, and synthesize physical science information must be used to optimize the human resources we will have available.[12]

It is estimated that a scientist spends between 35 and 50 percent of his or her productive time using and producing scientific and technical information.[13] Given the dramatic improvements in information technology, it should be possible to reduce this time and to make it far more productive. Yet, most agencies allocate far less than one percent of the federal investment in R&D to ensure that information is produced, obtained and utilized productively.

[5] Workshop presentation by R.L. Scott, Director for Project and Program Development, U.S. Department of Energy, Office of Scientific and Technical Information, May 30, 2000.
[6] Ibid, Scott.
[7] D.W.H. Roth, Jr., Allied Corporation, Morristown, NJ, “The Chemical Industry.” (Presentation at workshop coordinated by U.S. House Committee on Science and Technology, Congressional Research Service, and the Numerical Data Advisory Board of the National Academy of Sciences, Towards a National S&T Data Policy, Washington, DC, April 1983.)
[8] a) R.P. Reed et al., “The Economic Effects of Fracture in the United States,” (Part 1 - A Synopsis of the September 30, 1982 Report to NBS by Battelle Columbus Laboratories, NBS Spec. Publ. 647-1, U.S. Government Printing Office, Washington, DC, 1983). b) L.H. Bennett et al., “Economic Effects of Metallic Corrosion in the United States, Part I,” (A Report to the Congress by the National Bureau of Standards, NBS Spec. Publ. 511-1, U.S. Government Printing Office, Washington, DC, 1978).
[9] Ibid, Scott.
[10] Ibid, Scott.
[11] Ibid, Scott.
[12] Ibid, Scott.
[13] Wood, Fred, Office of Technology Assessment, “Helping America Compete: The Role of Federal Scientific & Technical Information,” Report to US Congress, 1990.

The Purpose

The purpose of the Workshop was to “obtain input from the scientific community regarding the merits of the concept of a ‘Future Information Infrastructure for the Physical Sciences’ that would offer both a comprehensive collection of scientific and technical information in the physical sciences and services that would facilitate scientific communication and increase the productivity of the scientific enterprise in the United States. The Infrastructure would impact science methods and science education as well as the scientific record as a public good.”

The Workshop consisted of general sessions and other discussions. A table of themes produced by the working groups is provided in Appendix C.

Major Themes and Conclusions

The Workshop addressed questions in several key areas regarding an information infrastructure for the physical sciences. How could scientists benefit from such an initiative? What scientific information/communication needs could such an initiative serve? What kinds of information should be included? Are there any useful infrastructure models available that would facilitate the concept? What mechanisms might exist or be developed for securing future scientific community input?

The overall conclusion of the workshop was an enthusiastic endorsement of a vision of a national infrastructure that benefits not just the scientific community but the national good. It could ultimately impact not only R&D, but also education and applications to our everyday lives. It would be a step to integrate the whole of science to provide a basis to improve society, the economy, and the environment. The following text includes conclusions that address these and other questions, along with key points and examples, which collectively endorse and expand upon the overall concept.

Scope of the Initiative

Begin with the Physical Sciences so that it is doable, but engineer it so that it is scalable.

The boundaries between physical sciences and other sciences in terms of advancing our understanding and solving scientific problems are increasingly less defined. The interfaces and the interactions among disciplines are often where the important discoveries are made. Ultimately a national information infrastructure for all of science is needed. We should not be looking at disciplines like stovepipes. Rather we should look across disciplines to determine what related or contributory work has been done or is underway that supports a researcher’s disciplinary focus. Interdisciplinary issues like information technology, nanotechnology or biotechnology are current examples. Equally recognized is the importance of bringing in stakeholders from other fields at an early stage.

Starting with selected areas within the physical sciences is a scalable approach that allows the effort to gain experience as it draws in stakeholders. From the opening talk of the Workshop, the question of why limit this to the physical sciences was raised. The need to start with a defined scope that is doable but scalable was recognized.

Information Types

We need a conceptual change that allows us to better integrate information types (e.g., text, data, images, animations) as well as information at various stages in the analysis process (raw data, partially processed data, text summaries of analyzed data in varying degrees of analyses, and unreviewed or peer-reviewed documents). Such a conceptual change would facilitate the serendipity and insights that can be gained by dealing with these multiple information types and their interrelationships.

The information community has developed significantly advanced systems to provide access to corpora of published documentation. Especially using Internet technology, science communities have produced an explosion of new types of information to create and exchange. Much of this has been done by spontaneous efforts of interested parties. Figure A provides examples of information exchange that the DOE physicist community has created. The long-range challenge in access to information is with non-published text, with data and particularly images. The challenge is to build on the spontaneous efforts and support their integration through an accessible information infrastructure. A few research projects have been leaders in plowing this ground.[14] Much more work needs to be done and government-funded initiatives must continue to be supported.

Even in the physics community, science is not as systematic as some would like to think. In addition to non-textual information, scientists also need types of information other than the highest levels of peer-reviewed journals and the commercially published record. They need more gray literature, they need prepublications (preprints), and they need to know where data may be wrong or ideas are half-filtered. We need to have systems that can help us with lessons learned from others’ experiences. These can be both successes and failures. In some cases, failures are at least as important to know about as successes.

The project to build a neutrino factory, a possible future DOE particle accelerator facility, is a case in point. This project involves over 200 physicists worldwide, and has produced hundreds of documents of which fewer than 10 have been published in peer-reviewed journals. Exchange of information among the geographically distributed workers is essential for success of such a project and has resulted in the creation of several special-purpose, Internet-based information archives. More desirable would be an information infrastructure beyond that of particular projects, but which is flexible enough to provide easy creation of and access to “libraries” of technical documents related to a specific project. Examples of what might be integrated in a national infrastructure are included in Figure B.

[14] See summary of dli1 and dli2 at http://www.dli2.nsf.gov.

Increasingly, scientists want access to knowledge resources ranging from raw data to published information to interpretations of information, all through a common desktop interface. Systems should also access processing theories so the user can have the whole paradigm at the desktop.

In addition to primary databases, metadata and derivative databases are also needed to help organize the way data are identified and, therefore, retrieved and as indicators of data quality and data structures.
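To make the role of metadata concrete, the sketch below shows one way a catalog entry might record how an information object is identified, what stage of analysis it represents, and what is known about its quality and structure. It is a minimal illustration only: the record fields, the `find` helper, and the sample entries are all hypothetical and are not drawn from the Workshop or from any existing system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MetadataRecord:
    """Hypothetical catalog entry describing one information object."""
    identifier: str    # illustrative ID, not a real catalog scheme
    title: str
    info_type: str     # "text", "data", "image", or "animation"
    stage: str         # "raw", "partially processed", "summary", "peer-reviewed", ...
    data_format: str   # describes the data structure, e.g. "CSV table"
    quality_note: str  # free-text indicator of data quality


def find(records: List[MetadataRecord],
         info_type: Optional[str] = None,
         stage: Optional[str] = None) -> List[MetadataRecord]:
    """Return the records that match every filter that was supplied."""
    return [
        r for r in records
        if (info_type is None or r.info_type == info_type)
        and (stage is None or r.stage == stage)
    ]


# Invented sample entries, for illustration only.
CATALOG = [
    MetadataRecord("rec-001", "Detector run, unprocessed counts",
                   "data", "raw", "binary event file", "uncalibrated"),
    MetadataRecord("rec-002", "Calibrated detector counts",
                   "data", "partially processed", "CSV table", "calibration applied"),
    MetadataRecord("rec-003", "Analysis of detector run",
                   "text", "peer-reviewed", "PDF document", "refereed"),
]

raw_data = find(CATALOG, info_type="data", stage="raw")
print([r.identifier for r in raw_data])
```

Keeping the information type and the analysis stage as separate fields lets one query span raw data, processed data, and reviewed documents, which is the kind of integration across information types that the text calls for.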

Infrastructure, Products, and Services

Information Infrastructure has to be more than a storage and retrieval capability. Rather, in the long term, it has to support fundamentally new ways of doing science.

Information infrastructure can totally change the way we do research and education and the way we bring things to market. We need to go beyond traditional products and perspectives. Researchers want access to a spectrum of knowledge resources and a range of all types of digital objects. They want access to other researchers. Scientists also want desktop access to facilities from the local research laboratories to those at the North Pole so that instrument data acquisition and processing can be easily integrated in the analysis and collaboration processes.

Figure A. Examples of Information Exchange in Physics

The well-known LANL archive must be regarded as a prototype for many key aspects of the proposed Information Infrastructure Initiative. This archive is, however, limited to documents prepared according to a specified electronic standard. The URL is:

http://xxx.lanl.gov/

Two information search engines that are fairly specific to elementary particle physics are SLAC SPIRES and the CERN Library catalogue:

http://www.slac.stanford.edu/spires/
http://weblib.cern.ch//Home/Library_Catalogue/

Another kind of useful information is catalogs of

Conferences:
http://physics.web.cern.ch/Physics/Events/#otherlists

Research Institutions:
http://physics.web.cern.ch/Physics/HEPWebSites.html

Particle Accelerators:
http://www-elsa.physik.uni-bonn.de/accelerator_list.html

A Catalog of Catalogs:
http://www.hep.net/

Special-purpose, public-domain computer codes for the creation of new information. Some of the scattered sites are:

http://www.beamtheory.nscl.msu.edu/cosy/
http://www-ap.fnal.gov/MARS/
http://laacg1.lanl.gov/laacg/services/parmela.html
http://laacg1.lanl.gov/laacg/services/possup.html
http://wwwinfo.cern.ch/asd/geant4/geant4.html

A text-based survey of this problem is

http://wwwslap.cern.ch/collective/bibliography/bib2html/codes.html

but a better solution is needed for the 21st century!

They want remote computational tools where a problem is launched on the net and finds its solution wherever that may be. Information must become a more active tool in science. The peer review process should take place as part of the work in progress. Further, scientists need an electronic laboratory notebook to be shared “on the fly.” Tools for value added processing of data such as data mining and recombining need to be focused, applied to the advancement of the scientific process, and provided more pervasively as a tool for research. There need to be capabilities to visualize data from 2D to immersion. Scientists should be able to query for what they don’t know, not what they know.

The following are some examples of experimental efforts and prototypes. The technologies are being developed through projects around the world. The information infrastructure needs to provide a focal point to help catalog these initiatives and bring them together with the information sources.

• Today, we do remote experiments through Internet connections with instruments. Lehigh University taps this research process allowing real time interaction. Students should be able to observe this scientific process in action as part of the learning experience.

• The University of Tennessee developed a system that finds the best available computing facility nation-wide to solve very complex parallel processing/supercomputing algorithm problems. The decision is made automatically and transparently to the researcher.

• The Space and Aeronomy Research Collaboratory (SPARC) [http://www.windows.umich.edu/sparc/] at the University of Michigan uses remote instrumentation in collaboratories, which can move through theoretical models with visualizations, observing both empirical and theoretical results. Theorists and experimentalists interact on models in real time. This has been changing the way scientists see their science.


Figure B. Examples of Information Archives for the Neutrino Factory

The URL for the entire neutrino factory project is:

http://www.cap.bnl.gov/mumu/mu_home_page.html

The Fermilab-based neutrino-factory document archive is:

http://www-mucool.fnal.gov/htbin/mcnote1LinePrint

The CERN-based neutrino-factory document archive is:

http://molat.home.cern.ch/molat/neutrino/nfnotes.html

A panel member, Dr. K. McDonald, maintains photos (and one video clip) as well as links to text documents:

http://www.hep.princeton.edu/~mcdonald/mumu/
http://www.hep.princeton.edu/~mcdonald/mumu/target
http://www.hep.princeton.edu/~mcdonald/nufact/

Special-topic bibliographies include:

http://www.hep.princeton.edu/~mcdonald//mumu/nuphys/
http://www.hep.princeton.edu/~mcdonald/mumu/physics/

Another example of a very useful kind of Internet-based information archive is:

http://www.hep.anl.gov/ndk/hypertext/nu_industry.html

This provides access to theory and results related to neutrino physics, but also gives links to the experiments themselves for those interested in grubbier details.

Scientists need systems that integrate simulation data with empirical and theoretical data. In the area of materials sciences, scientists have phenomenological databases in such areas as crystallography, materials properties, and thermodynamics. Some began as early as the 1920s. Each of these databases is independent and was created for a certain use, but not for repurposing and reuse. They often don’t physically or mathematically interrelate. If new materials are desired, the scientist must interrelate them. The paradigm must change. There can be great temptation to use such data in contexts where they may be inapplicable. As these databases become available to wider circles of users, it becomes increasingly important to supplement them with metadata describing their ranges of tested validity, and with tests to determine their validity for use in new contexts. An example: model multidimensional potential surfaces developed for interpreting spectra may contain serious deviations from reality that only become apparent if the surfaces are used for other purposes; notably, to study scattering processes.
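As one illustration of what metadata on ranges of tested validity might look like in practice, the following is a minimal Python sketch. It is not drawn from any existing database: the class names, fields, units, and numerical values are all hypothetical and chosen only to show the idea of checking a proposed new use against the range in which a data set was actually tested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidityRange:
    """Tested range for one variable of a data set (hypothetical schema)."""
    variable: str
    low: float
    high: float
    units: str

    def covers(self, value: float) -> bool:
        return self.low <= value <= self.high


@dataclass(frozen=True)
class DatasetMetadata:
    """Descriptive record attached to a data set (hypothetical schema)."""
    name: str
    intended_use: str
    ranges: tuple

    def out_of_range(self, conditions: dict) -> list:
        """Return variables that are untested or outside their tested range."""
        tested = {r.variable: r for r in self.ranges}
        problems = []
        for variable, value in conditions.items():
            tested_range = tested.get(variable)
            if tested_range is None or not tested_range.covers(value):
                problems.append(variable)
        return problems


# Illustrative values only; not taken from any real potential surface.
surface = DatasetMetadata(
    name="model potential surface",
    intended_use="interpreting spectra",
    ranges=(ValidityRange("energy", 0.0, 0.5, "eV"),),
)

# A proposed reuse at a higher energy than was ever tested gets flagged.
flagged = surface.out_of_range({"energy": 2.0})
```

A check of this kind does not establish that data are valid in a new context; it only warns a user when a request falls outside what the producers documented as tested.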

Synthetic organic chemistry is a field where a shift in approach has proven enormously successful. In the past, the synthesis of each new chemical was an independent invention. The major change in organic synthesis came when computers began to predict the rational paths to follow for syntheses, that is, to predefine processes to pursue. This model needs expansion to other disciplines, and innovative deployment of information technology is the solution to this integration.

The infrastructure further needs to encourage a diversity of scientific and technical information supplied by the publishing community. The coordinating layer needs to help account for security, legal and property rights, and provenance and quality of the suppliers. The Internet has forever changed the dynamic of intellectual property rights. There have been dramatic changes in the relationship of the producer, publisher, user and vendor. Licensing has superseded the first sale doctrine. Balances must be struck between the nature of scientific information as a public good and information as a commodity. There will need to be ways to access both toll roads and freeways on the information highway for scientists.

One must go from specific content products to linking to create a knowledge environment. And the validation of content from the front end is critical to this. The front end must provide for selectivity and quality indicators to help find the right information from the increasing volume of data available. The National Library of Medicine has developed such a model for the biomedical sciences, which is also needed in the physical sciences.

The infrastructure will provide opportunity for others to actively participate with new and better value-added services and capabilities. It is important that the need for open standards be a continuing responsibility and a point of convergence for the information infrastructure. The need for open standards ranges from technology to content. In the best interests of its research enterprise, the government should continue to actively promote open systems for the exchange of scientific data.

It is important that information about given topics can be associated through time even as jargon and vocabularies change. This use of terminology needs a leadership vision and can have substantial impact on the progress of scholarship. To address such a need in biomedicine, the National Library of Medicine developed the Unified Medical Language System (UMLS). Such a system requires a substantial commitment of resources to develop, but once done, opens up connections through time and disciplines where vocabulary was once a barrier to knowledge discovery and retrieval.
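The idea of associating information through time as vocabulary changes can be sketched with a small concept table. This is only an illustration under stated assumptions: the table, function names, and catalog entries below are hypothetical, and the sketch does not represent how the National Library of Medicine's system is built. It uses two historical element names (columbium for niobium, wolfram for tungsten) as examples of terms that changed.

```python
# Hypothetical concept table: each preferred term lists labels used for the
# same concept in different periods or communities (illustrative entries).
CONCEPTS = {
    "niobium": {"niobium", "columbium"},
    "tungsten": {"tungsten", "wolfram"},
}

# Reverse index from any known label to its preferred term.
LABEL_TO_CONCEPT = {
    label: preferred
    for preferred, labels in CONCEPTS.items()
    for label in labels
}


def normalize(term):
    """Map a label to its preferred term; unknown labels pass through."""
    key = term.strip().lower()
    return LABEL_TO_CONCEPT.get(key, key)


def find_records(records, query):
    """Return titles of records indexed under the same concept as the query."""
    wanted = normalize(query)
    return [title for title, keyword in records if normalize(keyword) == wanted]


# Made-up catalog entries as (title, indexing keyword) pairs.
records = [
    ("Alloy survey, older indexing", "Columbium"),
    ("Thin film study, current indexing", "niobium"),
    ("Filament materials", "wolfram"),
]

# A query in today's vocabulary also retrieves the record indexed under the
# older term.
matches = find_records(records, "niobium")
```

The substantial cost noted above lies in building and maintaining the concept table itself, not in the lookup.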

To build the information infrastructure, a strong information science community for the physical sciences is needed. Library and information scientists must gain knowledge of the physical sciences so they can help envision what is possible, and the physical scientists need to be information and information technology literate to provide the applications perspective. The advent of the Internet and many associated information technologies has brought the communities together, and the infrastructure needs this cooperation.


Archiving, Preservation and Access to Information

Archiving of data is one of the key concerns. The electronic record is much more fragile than the media that came before (print, microforms) because it comes faster, in bigger volume and is gone more quickly.15

There is a real need for assurance of portability, regardless of media developments.

The critical need for a reliable, stable archive, which is trusted by all parties, is a primary concern in information management today. For government information, there is a national responsibility to protect the taxpayers’ investment. With regard to other information that is produced with government funds, it is important to address the reliability of the private sector as a custodian of such resources.

There has been an active discussion in the information industry of roles in archiving. At one time, publishers looked to libraries for archiving. Now there is economic value in the collection; so publishers are taking on archival functions, or at least controlling the use of electronic archives in ways not done with paper backfiles. Libraries in turn are concerned about long-term preservation. In recent negotiations between publishers and the government, the publishers are looking at economic models where they are willing to allow third party archives since most of their revenues are coming through current issues. New models are evolving, but we must have interim measures and work rapidly to see long-term options before valuable knowledge is lost. One example of an innovative approach is that of the American Geophysical Union, which has set up a trust fund for data migration to aid in long-term preservation.

Preservation is a prerequisite for access but is not the same thing. To be fully useful, the massive amounts of information must be accessible from anywhere. The concept of a distributed archive or repository is under development by the Corporation for National Research Initiatives (CNRI) with National Science Foundation funds under the leadership of Dr. Robert Kahn. CENDI16 (an interagency group of Scientific and Technical Information managers of the nine major science and technology agencies in the federal government) is working with CNRI on concepts of federated repositories for government information. This work may be critical to the open systems standards that will allow for the massive and distributed scientific information archives of the future. Many of the agencies that are involved in this effort would also be key participants in the information infrastructure for the physical sciences, implying that future coordination should be facilitated by current relationships.

In addition to electronic information, there is a critical need for stewardship of all collections, regardless of medium. It is important to address issues of legacy information, that is, information that is not in useable digital form. Old data are still in significant demand. The Defense Technical Information Center reports that 11.4 percent of their 62,100 requests in the last year were for material over 25 years old. The National Technical Information Service reports similar demands for older material.

All of these initiatives just scratch the surface of capturing the knowledge base in the physical sciences. There are lessons learned, there are laboratory notebooks and there are experiences that are not documented. There are also new forms of digital objects such as visualizations, results of simulations, and processes of collaborations that are important to the scholarly scientific record that we have not yet begun to effectively address in terms of capture and preservation. Leadership is needed to explore the new issues raised by new information technologies and new ways of doing science. The coordinating role of the infrastructure initiative should assert leadership in exploring these archiving issues.


15 “Digital Electronic Archiving: The State of the Art and The State of the Practice,” Carroll, Bonnie C. and Gail M. Hodge. Report sponsored by International Council for Scientific and Technical Information [http://www.icsti.org] and CENDI (see footnote 14), April 1999.
16 CENDI members are: Commerce, Energy, EPA, NASA, National Libraries of Medicine, Agriculture, and Education, Defense, and Interior. For more information the URL is www.dtic.mil/cendi/.

Research, Education, and The Public Interest

The physical science information infrastructure of the future must begin by focusing on the producers of the information, i.e. the scientific community, but must also recognize the potential for positively impacting education and the general public interest.

Audiences for the information infrastructure for the physical sciences exist at every level of technical sophistication. Prioritization, however, is a must if the project is to have any realistic chance of success. From the practical perspective, the early adopters and the audience that will allow infrastructure development to proceed initially is the scientific community.

However, the physical science information infrastructure should, in the very early stages, consider effective ways to benefit both the education community and the public. It should focus on higher education, high-end initiatives such as Internet2 (a high-end consortium of universities, with a mission to advance telecommunications technologies in support of academic research), and particularly on methods for involving diverse populations.

In the global economy of the 21st century, educators must have the information they need to optimize their performance in both the classroom and the laboratory. A 1998 report found that U.S. science and math achievement for grade 12 students falls substantially below what we would expect when compared to other countries. In science, we rank 16th, with a score of 480 and in math, we rank 19th, with a score of 461, significantly below the world average of 500.17 It is the middle school years that are critical in preparing for a strong and diverse pool of scientists and engineers for the future. In order for the seed corn of education to take root, children must become interested in science and technology at an early age, and we must strive to hold their interest by making it come alive and by providing access to age-appropriate information. It is one thing to be computer literate as a “gamer;” it is quite another to use the tool to expand the intellectual capacity of a child directed toward productive ends.18 How do we keep them interested? The infrastructure should provide for actively involving these groups.

We need to find ways to get science to the public. NASA ensures that every program must have outreach and education. The information infrastructure should become a viable tool for supporting this outreach.

At the bottom line, science needs to be publicized and democratized. From the education perspective, a physical science infrastructure with resources for educational use presents wonderful opportunities. The advances in the infrastructure will be supportive of and paralleled by advances in education, learning, and data simulations in ways we can hardly imagine today.

Quality

Quality in data and data processing is one of the cornerstones of success. The system must take into account the quality, the reliability, and usability aspects of the information.

The infrastructure has to help be selective in the quality of information that is made accessible. Given that the volume of data being produced today outpaces our ability to absorb it, we need to ensure we have quality discriminators built into the information infrastructure.

It was noted that the irony of the information age is that it gives credibility to uninformed opinion. The value of data lies in its use and we must insure against “garbage in, garbage out” or, even worse, “garbage in, gospel out,” or “garbage in, garbage at greater speeds.” In medicine, this can easily and directly mean the difference between life and death. In the application of other sciences, it can also directly impact our personal lives as we note failures in the physical environment we have created.


17 Ibid, Scott
18 Ibid, Scott

New models for data quality are emerging as early experiments with new paradigms show results. For example, the LANL preprint server (http://xxx.lanl.gov/) was a “quasi spontaneous change” and is now becoming a trusted source for physical science information, even with the caveats that limit peer review. Some researchers in the physical sciences have shown they value unrefereed articles. The information infrastructure for the physical sciences must understand the dynamics of these changes and capitalize on them.

While we generate and manipulate data ever more easily, we must not lose the knowledge about what the data mean. High-quality metadata are critical in knowing the genesis of the information content. Through metadata and other means, we must try to ensure that the data and systems are transparent. We must work toward ensuring that the scientific expertise involved with the creation and manipulation of data is properly captured in the interpretation and application processes. Attention to this issue will require national leadership.

Participation of Sectors of the Economy

We must continue to strive for broad and innovative collaborations. Stability comes from diversity. Though we may of necessity need to build the infrastructure with a focus on specific tasks and disciplines, we must encourage the broadest participation at the planning table to allow for future involvements.

Given that we now live in an information economy and one whose economic models of publishing and communication have been dramatically shaken by the new information technologies, the development of the physical sciences information infrastructure must be sensitive to the value-added roles of all the participants in the information industry. There are different parts of the industry, e.g. primary versus secondary publishing, which have very different issues as the life cycle of scientific information becomes more electronic and more distributed.

Historically, there has been a symbiosis between publishers and scientists. As the diversity and complexity of the information industry have grown, this relationship has changed through the increased use of the Internet and the tools available. The government role in this, because it is a major funder of scientists, has been changing.

The Department of Energy’s recent development of PubSCIENCE is an excellent example of an innovative and effective collaboration between a government agency and the private sector publishing industry to bring science information to a worldwide user community. This new tool provides increased visibility of scientific article citations while providing revenue potential for the publishing community through expanded subscriptions and pay-per-view full text services.

Another government-funded initiative is that sponsored by the National Institutes of Health (NIH). NIH is planning to provide a publishing infrastructure for original articles. It will be opened to all publishers. NIH’s original thinking about this system was to provide better access to government-generated information. The functionality of the system has undergone a number of changes as the public and private sectors have expressed their interests and concerns.

The compression of product cycle time is a key issue in the relationship among the sectors. In the past, certain developments in information systems were pre-commercial, and prototypes were developed within government. In the rapidly enabling information technology environment of today, the government must move rapidly and innovatively forward in providing systems to meet the needs of its scientists and engineers. It has no motivation for duplicating what already is done well in the private sector. However, the government has a responsibility to move forward and be good stewards of its information capital.

It is difficult to generalize regarding when a product duplicates an offering by the private sector. There are issues of A-76 studies and the Economy Act of 1932, which focus on the roles of the public sector and agencies in getting the job of government done and place different demands on agencies. OMB A-130, which focuses on information policy, also has many conflicting directions (e.g., calling for diversity of sources, requiring the government to go to electronic information management and dissemination, and restricting pricing to the incremental costs of dissemination). Government requirements are complex, but understanding them is a prerequisite to good relations among the stakeholder communities. Similarly, the private sector, both for profit and not for profit (which both have significant roles in scientific and technical information), has its own inherently distinct interests. Finally, the academic sector also has its own unique and changing nature and must be a major player in any scientific enterprise. Clearly, an education process must be an integral part of partnership building. To facilitate communication and cooperation, clear and open notification of developments and plans is essential.

At the bottom line, the worst course is to close off paths of action or cooperation. We should focus on how to construct an information system to encourage broad participation. We must look for innovative partnerships between all sectors, and collaboration will be a cornerstone of success.

Leadership

It is clear that we need a point of convergence and leadership to develop the common agenda and move it forward.

Given that the U.S. government spends over $12 billion in research and development in the physical sciences, we must ensure that the results of taxpayers’ investment are properly mobilized in the national interest. Given that the volume of information, the “information explosion,” is increasing at unprecedented rates and outstripping our ability to deal with it in human terms,19 we must ensure quality information is delivered in a useful and timely manner. And given the potential to take advantage of the power of new information technologies, we must have infrastructural resources to meet researchers’ needs and expectations and add new value to the advancing processes of science and engineering.

We must actively shape the future. There is a need to have a point of convergence and leadership to mobilize the stakeholder communities to provide their value-added contributions in the physical sciences particularly and in science in general.

The Department of Energy (DOE), because it is the largest funder of research and development in the physical sciences, because it has already invested in the information infrastructure and coordination role, and because it is a clear champion of the vision, should be tasked to continue to move the agenda forward.

The structure of the information infrastructure that is envisioned will make access to the comprehensive collection of content and services from all sectors seamless. The nature of the infrastructure will go beyond the mechanics of access to the cooperative, distributed model in which governance is coordinated and certain services are offered centrally, but implementation is distributed.

Given the size of the DOE commitment to physical science research and the leadership role that DOE’s Office of Scientific and Technical Information (OSTI) has played, it is desirable for them to take the lead, but other organizations or groups should be encouraged to participate from the earliest stages.

From the top and from the beginning, there is a need to operate with a structured means of obtaining ongoing senior scientific advice. The National Library of Medicine has a Board of Regents engaging the top medical and informatics people in the country, and this has proven invaluable to their planning processes. In the interests of time and to begin the process in the physical sciences, it will be good to go through an existing organizational infrastructure for senior scientific input. There are several committees dealing with physical science coordination. Within DOE, the most appropriate advisory body is the Office of Advanced Scientific Computing and Research Advisory Committee.


19 “Beyond Databases and E-mail,” Science, Vol. 261, 13 August 1993, p.841.

Funding

The government has a responsibility to disseminate the results of federally sponsored research as broadly as possible as a public good. The amount of resources needed should be benchmarked against model efforts.

R&D is funded federally to produce a public good. As an integral part of the R&D process, information is both an input and an output. The value of this process comes from the use of the knowledge output and is enhanced by the effective use of the existing knowledge base. The Web makes it possible for information to be searched at essentially zero or marginal cost once the information is made available. However, loading the information on a server, making it searchable, and providing the gatekeeper function to promote reliability all require capital investment. Adding in the tools to gain the added processing and collaboration technologies also requires coordination and investment. Managing the coordination and providing leadership requires staffing investments. And finally, there is a role to advance the state of the art and practice in physical science informatics. These are some of the key roles of the information infrastructure initiative, and they require a national commitment to their continuity.

Although it was beyond the scope of the Workshop to estimate the cost of initiating and developing the concept proposed, other benchmarks were reviewed and a budget adequate to create an information infrastructure to serve the size of the R&D enterprise is required. The NLM, a model for an area of science, has a FY2000 budget of over $214 million. It supports the research interests of the $17.8 billion NIH budget. This ratio provides a loose parameter. The budget for physical science research in the U.S. is over $12 billion.
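To make the "loose parameter" concrete, the short Python calculation below applies the NLM-to-NIH ratio to the physical sciences figure. The three dollar amounts are those quoted above; the function name and the resulting number are illustrative only. The Workshop did not state or endorse a budget figure, and because two of the inputs are given as "over" a stated amount, the result should be read as an order-of-magnitude indication rather than an estimate.

```python
# Figures quoted in the report, in US dollars.
NLM_BUDGET_FY2000 = 214e6
NIH_BUDGET = 17.8e9
PHYSICAL_SCIENCE_RESEARCH = 12e9


def benchmark_budget(model_info_budget, model_research_budget, target_research_budget):
    """Scale an information budget by the ratio seen in a model discipline."""
    ratio = model_info_budget / model_research_budget
    return ratio, ratio * target_research_budget


ratio, implied = benchmark_budget(
    NLM_BUDGET_FY2000, NIH_BUDGET, PHYSICAL_SCIENCE_RESEARCH
)

print(f"NLM budget as a share of the NIH budget: {ratio:.2%}")
print(f"Same share of physical science research: ${implied / 1e6:.0f} million")
# prints a share of 1.20% and a figure of $144 million
```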

Timing

The time is now, the need is now!

The time is right for such an initiative because the challenge is a national one, the need for leadership and convergence is pervasively felt, technology is enabling, and the positive impact on the public good would be significant.

For well over 50 years, since the introduction of the computer, studies have called for better management of the nation’s scientific and technical information resources.

So why now? Because the window of opportunity for the scientific community to influence the growth of the Internet and the Web is still open as these are used more and more for commerce and entertainment. That window will shrink (and may even close) if the next generation Internet does not come into being.

So why not now? Progress and change are incredibly rapid today. We have an economy that is based on information. Time to market can be measured in days. We need an information support structure where the impetus is in tune with the rate of change and the conduct of research. Science is global and the competition is intense. In an electronic world, the facts of the matter are no longer conveyed on paper or material but rather by electrons. These are not retrievable by human senses (as is the printed word) without the help of the information technology infrastructure. Once missed, they are gone. The most current information and the best technology services can now be in the same desktop toolbox. Our scientists need these tools to continue to compete in this increasingly competitive marketplace.

Reward Structure

Any analysis, economic or social, that might be developed to guide the construction of a scientific information system must recognize and accommodate the reward system of the sciences. The immediate rewards of doing science come from dissemination and acceptance of ideas, not from immediate monetary recompense. This occurs at all levels, from the apprenticeship of graduate school and postdoctoral work through the journeyman stage of pre-tenured faculty to the master craftsman level of tenured faculty. Understanding this value system is central to effective design of the future information infrastructure for the physical sciences.


Findings

International Diplomacy

The proposed information infrastructure for the physical sciences can be an asset or tool in international diplomacy. For example, the United States is spending considerable sums of money on retraining the Russian scientific establishment from military to civilian applications. An information system could be a tool in facilitating that development.

Today sending data across the ocean can be a problem. International cooperation, including mirror sites, becomes a tool for United States as well as foreign researchers.

A Name

The name of the initiative is important and it should reflect the key nature of what needs to be achieved. The initiative is more than a library, more than a physical infrastructure, and must help to advance the state of the art of physical science informatics. The concept of “institute” serves as an umbrella for advancing comprehensiveness of content, the infrastructure, and the enabling tools. Institute for Physical Sciences Information (IPSI) was selected as the operating moniker.

The Biomedical Model: Validation Of The Concept

Throughout the Workshop, the life sciences model was held up as a target. Clearly, the National Library of Medicine, focused on the electronic delivery of information to the desktop, provides an excellent model from the life sciences for the overall initiative. So what makes this model so successful? An analysis of NLM’s success factors with current points of reference in the physical sciences is provided in Appendix D.


20 Ibid, Scott; (Note that openly available is not necessarily free of charge. As access to open government information gets increasingly integrated with private sector resources, we will need to find ways that toll roads and freeways can comfortably coexist and the user will determine the best path for his/her needs.)

As a result of the Workshop and focused discussion, the infrastructure envisioned by the Department of Energy was in fact confirmed and findings support the establishment of the following:

A common knowledge base that seeks in an integrated approach to provide comprehensive access and facilitate the reuse of worldwide sources of scientific information, initially in the physical sciences, regardless of where they reside, what platform(s) they reside on, or what format or data structure they employ.

A point of convergence for ensuring the awareness, availability, use, and development of information technologies and tools to facilitate information assimilation, data analyses, peer communication and collaboration, sharing of preliminary research results, remote experimentation, validation of experimental results, and other uses not yet envisioned.

An openly available source of information to serve all users, from students to scientists to concerned citizens, in a highly efficient electronic environment, with tools to assist users in their quest for information and ultimately knowledge.20

Implementation

The requirements for the Institute for Physical Sciences Information (IPSI) (the operating moniker) must deal with three time horizons, each of which presents its own challenges:

• Doing better at what we’re doing now (FY01)

• Mobilizing for what is possible tomorrow (FY02–03)

• Realizing the future potential (Out Years)

Development cannot effectively proceed without resources, and an implementation plan should be proposed based on the concepts outlined in the May 2000 Concept Paper, modified by the findings of the Workshop.21 The amount of resources needed to accomplish the development of the infrastructure should be benchmarked against model efforts of other disciplines and initiatives. This report of the Workshop provides a vision and some goals. From here, we must flesh out the strategies and specific actions to reach these goals.

Doing Better At What We’re Doing Now (FY 01)

Great progress has already been made in the delivery of textual information to the desktop.22 A Department of Energy system provides access to the full text of over 1,000 journals from 24 publishers through a bibliographic front end that allows effective searching and identification of articles of interest. Another system provides desktop access to the gray literature produced by the Department of Energy. In fact, as a result of the Workshop itself, an understanding in principle was reached between NASA, DOE, and DOD to create a Gray Literature23 (GrayLIT) Network. It will be a combination of the full text gray literature that NASA, DOD, and DOE make available on the web.

DOE’s Office of Scientific and Technical Information will overlay a distributed search tool on top of already existing databases at the three Agencies. There is a preprint network that provides access to over 1,000 preprint servers, including the leader that was developed for physical sciences by Los Alamos National Laboratory. Now that there is proof of concept and a critical mass, many new sites are being offered for connection.
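The general idea of overlaying one search tool on several existing databases can be sketched in a few lines of Python. This is an assumption-laden illustration, not a description of the system OSTI will build: the in-memory catalogs, titles, and function names are hypothetical stand-ins, and a real tool would query each agency's database over the network and would need ranking, deduplication, and error handling.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the three agency databases (made-up titles).
CATALOGS = {
    "DOE": ["Neutrino target studies", "Superconducting magnet report"],
    "NASA": ["Remote sensing of aerosols", "Magnet cooling for spacecraft"],
    "DOD": ["Laser materials survey"],
}


def search_one(agency, query):
    """Search a single catalog; a real system would issue a network query."""
    needle = query.lower()
    return [(agency, title) for title in CATALOGS[agency] if needle in title.lower()]


def distributed_search(query):
    """Send the same query to every catalog in parallel and merge the hits."""
    with ThreadPoolExecutor() as pool:
        per_agency = list(pool.map(lambda agency: search_one(agency, query), CATALOGS))
    merged = [hit for hits in per_agency for hit in hits]
    return sorted(merged)


hits = distributed_search("magnet")
```

The design point is that the databases stay where they are and remain under each agency's control; only the query and the merged result list pass through the overlay.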

These are some of the immediate activities that can be cited that confirm what coordination and leadership in the near term can accomplish.

Other immediate opportunities for enhancement include increased comprehensiveness of journal and gray literature. As was suggested at the Workshop, we need to go further and begin to build databases on the fractal model: begin with one and extend and connect and extend and connect. We need to optimize our current systems with the goal that we can reach relevant content within three clicks. As we extend our systems and the heterogeneity of resources and participants involved, we need to be able to assure that our interfaces do not lead to dead ends and false paths.

There are other efforts beginning to appear on the landscape that can be moved forward now. One example is joining papers from different journals in different disciplines on topics such as nanoscale technology. The American Association for the Advancement of Science (AAAS) is already working in this area, and these efforts should be visibly supported. Effectively, this builds virtual journals of interdisciplines, especially for hot topics. The American Physical Society would welcome participation in such experimental efforts.


21 “Future Information Infrastructure for the Physical Sciences: Concept Paper, May 2000,” White paper distributed at Workshop, May 30-31, 2000, 5 pages.
22 Examples of DOE systems were given in the handout package and by Walter Warnick in his talk. These are further described at http://www.osti.gov.
23 Gray literature is defined by GreyNet, the Grey Literature Network Service (Farace, Dominic, 1997) as, “that which is produced on all levels of government, academics, business and industry in print and electronic formats, but which is not controlled by commercial publishers.”

And, finally, we must continue the planning process and be inclusive. In the first stage of the plan, DOE should have a call for participation from other agencies and other stakeholders. This Workshop was the first step.

Mobilizing For What Is Possible Tomorrow (FY02-03)

We need to create a capability for testbeds so that the system will remain involved in cutting-edge developments and can evolve as new experiments are tried and new ways of advancing science with the knowledge bases are developed.

The Institute for Physical Sciences Information should be set up ready for change, with the opportunity to adapt built in to the model. To move ahead at the forefront of informatics opportunities, we need to build on what leaders and centers of excellence are doing in other disciplines. In particular, the physical sciences should find areas where there are good intersections with the biomedical sciences. Joint testbeds should be initiated to test and deploy the results of cutting-edge information technology R&D programs.

There was general recognition that the process must be kick-started and that some additional resources need to be brought to bear.

Because this is a rapidly evolving area, devising specific agendas for the mid-term is difficult. Under such dynamic circumstances, it is imperative that particular attention be paid to guidance from the scientific community.

Realizing The Future Potential (Out Years)

As part of the IPSI, we must have both intramural and extramural research capabilities in the developing fields of disciplinary informatics. We must also focus on a commitment to extend the boundaries of the infrastructure to interface with other science areas’ information resources, including the biomedical and environmental sciences.

Agencies such as DARPA and NSF fund the basic research in these areas. The mission agencies such as DOE, NASA or DOD can bring the results forward through prototyping and testing.

We need R&D to invent this; we need to build testbeds to prove this; and we need the loop back to education to fuel this. Appendix E provides some suggestions for the necessary R&D from a Panel member.

And we must actively complete the linkages with the infrastructures in the other sciences to realize the ultimate vision of a science information infrastructure for the U.S. Other fields like biology and medicine will lead in some areas. The physical sciences will innovate in other areas. Throughout the development of all of these systems, interoperability and standards compatibility will be a long-term goal. Within the next four to six years, it should become a reality.


Appendices

Appendix A
Workshop Panelists and Participants

Panelists
Alvin Trivelpiece (Chair), formerly Director, Oak Ridge National Laboratory
R. Stephen Berry, The University of Chicago
Martin Blume, American Physical Society
Jose-Marie Griffiths, University of Michigan
Lee Holcomb, NASA
Kirk McDonald, Princeton University
Krishna Rajan, Rensselaer Polytechnic Institute
Kent Smith, National Library of Medicine
Derek Winstanley, Illinois State Water Survey

Participants
Fran Buckley, Government Printing Office
Eileen Collins, National Science Foundation
Blane Dessy, Department of Justice
Denise Diggin, DOE Energy Library
Susan DiMattia, Special Libraries Association
Gail Feldman, Archive.org
Eleanor Frierson, National Agricultural Library
Laura Garwin, Nature Magazine
Daniel Greenstein, Digital Library Federation
Howard Harris, University of Maryland
Larry Lannom, Corporation for National Research Initiatives
C. Diane Martin, National Science Foundation
Kurt Molholm, Defense Technical Information Center
Donna Scheeder, Special Libraries Association
Bill Sittig, Library of Congress
Mike Spinella, American Association for the Advancement of Science
Paul Uhlir, National Research Council
Greg Wood, Internet2

Staff Facilitator/Editor
Bonnie C. Carroll, Information International Associates, Inc.


Appendix B
Workshop for a Future Information Infrastructure for the Physical Sciences
Agenda

Day 1, Tuesday, May 30, 2000, 2:00 PM–5:15 PM

Welcome and Workshop Objectives
    Al Trivelpiece, formerly Director of ORNL

A Future Information Infrastructure for the Physical Sciences: Concept and Assumptions
    Walt Warnick, Department of Energy

Internet 2 and the Changing Nature of Scientific Communication
    Greg Wood, Internet2

A Future Information Infrastructure for the Physical Sciences: Partner and User Considerations
    RL Scott, Department of Energy

The National Library of Medicine: A Case Study
    Kent Smith, National Library of Medicine

Discussion Panel: Views and Opportunities
    Jose-Marie Griffiths, University of Michigan
    Krishna Rajan, Rensselaer Polytechnic Institute
    Lee Holcomb, NASA
    Kirk McDonald, Princeton University

Day 2, Wednesday, May 31, 2000, 9:00 AM–12:00 N

9:00–9:45 Plenary Discussion
Present the list of considerations with open discussion.

9:45–11:00 Discussion Sections

Why

• Understanding the Need

• Success Factors

What

• Content

• Services

Working Group Leaders:
Stephen Berry, University of Chicago
Derek Winstanley, Illinois State Water Survey

11:00–12:00 Reports of the Working Groups and Discussion

Wrap-up and Next Steps


Appendix C
Working Group Themes

What: Content and Services

Working mission statement: provide an infrastructure for the physical sciences with standards, links, etc., to facilitate resource location and access across sources and media for the purpose of generating new knowledge through coordination and integration.

• Private vs. Public: Accessibility among sources should be collaborative, not exclusionary

• Portal that shows reliability

• Long term stability

• Assurance of portability and accessibility

• Access to older literature, with fully searchable text, making adjustments for changing vocabularies through time

• Profile based automatic notification of information (individualized)

• Means to identify others who have common interests who wish to be identified (privacy issues acknowledged): community of practice

• Easy passage among different kinds of information: papers (text) to primary data to simulations and back again (through different categories)

• Function of scientific unity: meta-tools development, e.g. synthetic organic chemistry. Traditionally, each scientist figured out a unique synthesis path. Now computer programs allow systematic development of pathways. Knowledge based program to conduct synthesis. Need to have systematic approach to looking for such tools.

• Testbeds to demonstrate applications of information infrastructures (education as an easy model)

• Access to cross cutting fields: subservice of access to different vocabularies in different fields

• Visualization tools: how to make accessible to support scientific discovery

• Reliable stewardship of unique bodies of information: content and systems for accessibility

• Need for helpdesks for finding and interpretation and analysis of information

• Security of data, characterization of consistency, reliability (including authenticity), validation information including later testing applied retroactively (forward and backward referencing)

• Flexible infrastructure to allow for the continued evolution of services that are today not apparent

• Current customers and data providers envision new services; information science can envision future possibilities. Need to develop the informatics of the physical sciences to move the agenda forward

• Adaptive Management Model


Why: Understanding the Need and Success Factors

Needs of the Communities

Funders
• Facilitate resource location and access, across sources and media, for the purpose of generating new knowledge through co-operation, co-ordination and integration
• Increase national competitive advantage
• Maximize taxpayers’ investment

Customers
• Verify reliability and quality
• Access currently unavailable information
• Archive reliably information and data

Industry
• “Pre-competitive” approach

Success Factors
• Create strong advocacy groups (professional groups, policy makers, general public)
• Identify champions
• Network of collaborators
• Integral part of the research process
• Quality
• Develop comprehensive plan
• The planning process is key. It is the source of major new initiatives.
• Organization and structure


Appendix D
The Biomedical Model

Throughout the Workshop the biomedical sciences were held up as a target. So what is it that makes this model so successful? And what already exists in the physical sciences that could be ingredients in a successful information infrastructure for the physical sciences? The following table provides some points of reference:

NLM’s Success Factors, each followed by its Point of Reference in the Physical Sciences

• Rich History of Achievement: DOE has produced the world’s most comprehensive database in Energy Science and Technology, now containing over 5.5 million records. The combination of journal literature (PubSCIENCE), gray literature (Information Bridge), preprint literature (PrePRINT Network), and EnergyFiles begins to lay the infrastructure for the near term base for the IPSI.

• Strong Congressional Mandate: The DOE-enabling legislation (Atomic Energy Act of 1954), states “The dissemination of scientific and technical information relating to atomic energy should be permitted and encouraged so as to provide that free interchange of ideas and criticism which is essential to scientific and industrial progress and public understanding and to enlarge the fund of technical information.” Although this is the basis for past work, it would need significant strengthening to commit to the national leadership needed for the IPSI.

• National Network of Libraries of Medicine: There is a strong network of science and technology libraries, with outstanding resources in all of the National Laboratories. There is no exact equivalent to the Medical Library Association, but there are divisions of the Special Libraries Association that help to support library networking.

• Applying the latest in computer and communication technologies to health problems: DOE has a strong history and record as a leader in these technologies applied to the physical sciences including ESNet and DOE high performance computing. Most of the National Laboratories are centers of high performance capabilities, particularly for scientific computing.

• Being an integral part of the research process: Within the DOE Office of Science, there is the Office of Scientific and Technical Information. This office has been a champion of the vision for the IPSI.

• Dynamic Planning Process: There needs to be more systematic involvement of the best and brightest in the physical sciences for this initiative just like the role the Board of Regents plays for NLM.

• Strong Advocacy Groups: The major Library Associations and the Federal Depository Library Program are just the most rudimentary beginning of a strong advocacy coalition. This Workshop was a step in focusing on this issue.

• Effective Publicity Program: Top DOE management commitment is becoming increasingly visible and the media are recognizing the initiatives that have been taken. There is now a rather famous picture of the Secretary of Energy doing the ribbon cutting for PubSCIENCE.

• Good Organizational Structure: DOE has the structure under the Office of Science with a major information facility in Oak Ridge, Tennessee.

• Garner Necessary Resources (Dollars, People, and Facilities): Fiscal budgets have been good for raw infrastructure such as high performance computing and networking. Application specifically to the management of content has substantially lagged behind both this infrastructure and the funding of research and development.

• Quality: Historically, the information products from the DOE have had worldwide recognition and use. Bibliographic products have been enhanced and transitioned to full text delivery mechanisms. New collaborative initiatives are partnering DOE with the private publishing industry. And the DOE has been recognized as a model agency in working with the GPO for electronic public access to scientific and technical information.


Two additional factors were added to the original list presented by NLM. First, NLM began with peer reviewed scientific journals in the highest quality bibliographic database in the industry. In keeping up with the pace of the times they were the pioneers in adding citation linking to their bibliographic database in PubMed (their major web-based information service); they have added free public access to PubMed24; they have added new types of content to their journal literature including information on clinical trials in process; and they have linked to websites to direct the public to quality information about health issues (PubMedPlus). Most recently they have entered the publishing arena with PubMedCentral as a direct reaction to the changing nature of scientific publishing and as a reflection of the enabling Internet technologies.

Second, NLM has made a substantial investment in metadata standards and ontologies that help to structure knowledge. The Unified Medical Language System allows consistency across time and perspective (i.e., doctor, researcher, nurse, medical records manager, pharmacist).


24 DOE/OSTI has followed the NLM PubMed lead with PubSCIENCE by adding citation linking to their bibliographic database and working with the Government Printing Office to provide public access to this web-based information product.

Appendix E
Suggested R&D for the Future Information Infrastructure for the Physical Sciences

Dr. Krishna Rajan, Professor of Materials Engineering
Rensselaer Polytechnic Institute

Information Infrastructure: Managing Scientific Complexity for Knowledge Discovery

Traditionally, computers have been used as storage media for large volumes of data and as tools for carrying out extensive numerical computations and simulations. Recently, computers have started to take a more active role in guiding the scientist through the research and discovery process with the help of data mining methods for automatic discovery of patterns in large volumes of data. The challenge is now to make these data mining methods ubiquitous and an integral part of the data collection and verification process. The physical sciences and engineering offer a unique challenge in data mining due to the variety of data types and their complex interconnections. For instance, during the engineering design process, there is a need to integrate multiple, heterogeneous databases to reach new and even unexpected conclusions as well as to use databases actively to design new processing strategies. This complex coupling of data models, data analysis methods and physical methods offers a unique computing challenge that has not yet been addressed sufficiently in information technology research. Some issues to focus on may include:

• How do we combine heterogeneous databases so that we can discover interesting patterns from them? These databases contain data from different length scales, from simulations as well as experiments. For instance, in the field of materials science, how does information on phase stability extracted from thermochemical data compare with information derived from electronic structure calculations? These databases need to be combined with information on microstructure and physical properties derived from experiment and simulations. From these independently organized databases, one can compare and search for associations and patterns that can lead to ways of relating information among these different datasets.

• What are the most interesting patterns that can be extracted from the existing data? The new patterns to be discovered should reflect the complex relationships that exist in both spatial and temporal dimensions. One needs to explore datasets on engineering performance. These may include, for example, chemistry, crystal structure, microstructure, diffusion behavior, and processing methodologies, among others. Such a pattern search process can potentially yield associations between seemingly disparate data sets as well as establish possible correlations between parameters that are not easily studied experimentally in a coupled manner.

• How can we use mined associations from large volumes of data to guide future experiments and simulations? Data mining methods should be incorporated as part of design and testing methodologies to increase the efficiency of the material application process. One of the challenges in mining the results of different experiments and simulations is the fact that results of many of them describe engineering problems at different scales and in different forms (from visual, such as the form of microstructure, to numerical, like material properties). Coupled to this is the fact that properties and structure are time dependent. These temporal based data sets will not only come from reported experimental data but also from simulations which will permit us to explore a wider set of “virtual” data. This in turn requires knowledge of time dependent behavior at different length scales ranging from atomic motion to morphological changes in microstructure.
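As a small illustration of the first question above, the sketch below joins two independently organized datasets on a shared material identifier and checks whether the two kinds of values are associated. It is only a sketch under stated assumptions: the material labels, the numbers, and the function names (join_on_key, pearson) are invented for the example, and a plain Pearson correlation stands in for whatever data mining method a real study would choose.

```python
from math import sqrt
from typing import Dict, List, Tuple

# Hypothetical values, for illustration only. Each dataset is keyed by a
# material label but was assembled independently, so the keys only
# partly overlap.
thermo: Dict[str, float] = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}  # e.g. from thermochemical data
calc: Dict[str, float] = {"B": 4.1, "C": 6.0, "D": 7.9, "E": 10.0}   # e.g. from electronic structure calculations


def join_on_key(left: Dict[str, float],
                right: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Keep only the materials present in both datasets (an inner join)."""
    keys = sorted(left.keys() & right.keys())
    return [(k, left[k], right[k]) for k in keys]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)


rows = join_on_key(thermo, calc)
r = pearson([row[1] for row in rows], [row[2] for row in rows])
print("materials in both datasets:", [row[0] for row in rows])
print("correlation:", round(r, 3))
```

The join step is where heterogeneity shows up in practice: records that lack a counterpart in the other dataset are silently dropped here, whereas a real system would need to report them and reconcile differing identifiers, units, and length scales before any pattern search.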
