Managing Scientific Information
Making the Internet Work for Big Science
Professor Greg Riccardi
Florida State University
Department of Computer Science, UK National e-Science Centre
23 Oct, 2002 2
Overview
  Information on the Web: Current Status
  Providing Information on the Internet
  General Conditions of the Web and Internet
  Resources Needed for Big Science
  What is The Grid?
  How the Grid Might Support Databases
  Computer Science Challenges and Research Opportunities
Is this the Internet?
Edinburgh on a Normal Day
Can We Find Web Information?
  Use Google to find travel times from Edinburgh to Aberdeen
    Search for “railroad times Aberdeen Edinburgh”
    Why was no usable information returned?
      Vocabulary problem: “railroad” is not a service (in UK)
    Search for “train times Aberdeen Edinburgh”
  Even better, use a ticket service
    GNER.co.uk
    Thetrainline.com
Finding Information on the Web
  Consider comparative shopping
    Provide a capability to compare prices
    Allow people to see pages of prices
  Example of pricewatch.com
    Prices of memory
    Where do prices come from?
    Can you extract the information content from the Web pages?
  How can we make the Web provide information?
    Can we establish a way to share price info?
    Will vendors participate?
XML Creates Opportunity
  Possibility: Use XML to represent information
    <item type="PC3500 DDR">
      <vendor name="memorylabs.com"/>
      <manuf name="Samsung"/>
      <size>512</size>
      <price>169.00</price>
    </item>
  Strategy for sharing
    Industry creates standard XML schema
    Vendors create files of prices
    Comparison shopping sites grab files and create presentations of information
    Purchasing agents
  How would shopping sites and purchasing agents find the sources of information?
  Would vendors agree to publish?
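The sharing strategy on this slide can be sketched in Python. This is a minimal sketch, not an industry schema: it assumes each vendor publishes a file of `<item>` elements shaped like the example above. The `<prices>` wrapper element, the second vendor `example-vendor.test` and its price, and the function names `parse_offers` and `cheapest_offers` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical vendor files. Only the first <item> comes from the slide;
# the <prices> wrapper and the second vendor are assumptions.
FEED_A = """<prices>
  <item type="PC3500 DDR">
    <vendor name="memorylabs.com"/>
    <manuf name="Samsung"/>
    <size>512</size>
    <price>169.00</price>
  </item>
</prices>"""

FEED_B = """<prices>
  <item type="PC3500 DDR">
    <vendor name="example-vendor.test"/>
    <manuf name="Samsung"/>
    <size>512</size>
    <price>175.50</price>
  </item>
</prices>"""


def parse_offers(xml_text):
    """Turn one vendor file into a list of plain dictionaries."""
    offers = []
    for item in ET.fromstring(xml_text).iter("item"):
        offers.append({
            "type": item.get("type"),
            "vendor": item.find("vendor").get("name"),
            "manuf": item.find("manuf").get("name"),
            "size": int(item.findtext("size")),
            "price": float(item.findtext("price")),
        })
    return offers


def cheapest_offers(feeds):
    """Return the lowest-priced offer for each (type, size) pair."""
    best = {}
    for feed in feeds:
        for offer in parse_offers(feed):
            key = (offer["type"], offer["size"])
            if key not in best or offer["price"] < best[key]["price"]:
                best[key] = offer
    return best


if __name__ == "__main__":
    for key, offer in cheapest_offers([FEED_A, FEED_B]).items():
        print(key, offer["vendor"], offer["price"])
```

A comparison site would still have to discover the vendor files, which is the open question the slide ends on.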
Semantic Web: Information on Web
  The Semantic Web
    Tim Berners-Lee’s idea: definition from http://www.w3.org/2001/sw/
    The Semantic Web is the abstract representation of data on the World Wide Web, based on the RDF standards and other standards to be defined.
  Resource Description Framework (RDF) is an emerging standard for representing Web resources
    http://www.w3.org/RDF/
  The Semantic Web requires sites to provide documents marked up to define information content
    I.e. XML documents
    With an agreed ontology
Can the Semantic Web Work?
  According to Henry S. Thompson, U. Edinburgh
    Talk given at Global Grid Forum, July 2002
  The Semantic Web is based on metadata
    Metadata describes resources systematically
    Suppliers can record what a document or resource is for or about
    Search engines can work with meaningful information
  What would we need to make the Semantic Web work?
    A standard syntax for metadata
    One or more standard vocabularies
      Allow search engines, producers, and consumers to speak the same language
    Lots of documents and resources with metadata attached
    Attribution and trust
    Access and security
Web Services
  Machine-to-machine exchange of information
    Web servers deliver XML in response to HTTP requests
  Metadata defined with Web Services Description Language (WSDL)
    Gives structure of information content of a service
  UDDI commercial registry for services
    Possible use: try to create a contract to process 1000 credit card transactions
    Look for services, ask for prices, negotiate, etc.
    Microsoft, IBM, ebXML
Discovering Web Services
  Again according to Henry S. Thompson, U. Edinburgh
  The Semantic Web cannot work without inference
    The crucial missing step is the inference engine
  Information sources say what they can provide
  Information users say what they want
    The 2 specifications are not obviously related!
  The user must be able to
    Find the resources
    Determine their suitability (and cost)
    Create a request in the proper form
    Process the returned data
The Reality of Web Services
  Quoting Henry S. Thompson, U. Edinburgh
  Forget the headline stuff
    Cars negotiating with petrol stations
    Agents choosing a specialist based on available appointment slots
  The focus in practice is on exploiting the move to asynchronous distributed applications
    Within the enterprise, not between enterprises
    Using pre-negotiated vocabularies, and little or no discovery
  IT-intensive enterprises see Web Services primarily as a way to reduce their EAI/middleware bills
  Big Science needs more than Web Services
    Portal technology for presentation of information to people
    Open Grid Services Architecture (OGSA) to extend Web Services with capabilities needed for scientific collaborations
Portal Technology
  A portal is a Web page that collects and presents information from many sources
    Tailored for needs of publisher
    Tailorable for needs of consumer
  JetSpeed Apache Portal Project
    FSU K12 Education portal
      http://edtech.oddl.fsu.edu:8080/K12/
    Indiana Community Grids portal
      http://ptlportal.communitygrids.iu.edu/portal/
Big Science and Grid Technology
  Big Science
    Massive data
    Massive computing
    Geographic distribution of people
    Geographic distribution of resources
  Biggest emerging problems
    How to make people efficient
  Example science fields
    Economics, earth sciences, astronomy, mechanical engineering, aerospace, bioinformatics, medicine
  Grid Technology
    Software support for heterogeneous distributed applications
What Does Big Science Need?
  Massive amounts of computing and storage
    Distribution of computing and storage facilities
    Efficient movement of massive data sets
    Location-independent processing
  Tools that are easy to use
    Management of data and computation
    Control over software development
  Discovery of computing and information resources
  Collaboration between geographically distributed people
How Big is Big?
  What is happening to data sizes?
  1990 at Jefferson Lab, USA: Planning for new facility
    Estimated 10 megabytes/sec sustained data rate from equipment in 1998
    1 terabyte per day
    200 terabytes per year
    Data storage on £5 million tape silo
    Tape costs of £200,000 per year
  Consumer examples
    Data from digital cameras
      4 megapixel, 3 bytes per pixel
      12 megabytes/picture
      Compressed much less
    Data from digital video cameras
      4 megabytes/sec
    Kazaa digital DVD video sharing
      See effect on networking at Florida State University
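The figures on this slide can be checked with a little arithmetic. This is a minimal sketch that assumes decimal units (1 terabyte = 10**6 megabytes, 1 megapixel = 10**6 pixels); the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_tb(rate_mb_per_sec):
    """Terabytes per day at a sustained rate (decimal units: 1 TB = 10**6 MB)."""
    return rate_mb_per_sec * SECONDS_PER_DAY / 1_000_000


def days_to_reach(yearly_tb, rate_mb_per_sec):
    """Days of continuous running needed to accumulate yearly_tb terabytes."""
    return yearly_tb / daily_volume_tb(rate_mb_per_sec)


def picture_size_mb(megapixels, bytes_per_pixel):
    """Uncompressed picture size in megabytes (1 megapixel = 10**6 pixels)."""
    return megapixels * bytes_per_pixel


if __name__ == "__main__":
    print(f"{daily_volume_tb(10):.3f} TB/day at 10 MB/sec")
    print(f"{days_to_reach(200, 10):.0f} days of running for 200 TB")
    print(f"{picture_size_mb(4, 3)} MB per uncompressed picture")
```

Under these assumptions, 10 megabytes/sec gives a little under 1 terabyte per day, so the slide's daily figure is a rounding. 200 terabytes per year corresponds to roughly 230 days of continuous data taking.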
Peer to Peer Networking (P2P)
  P2P is sharing resources
  Directories of services
    Centralized access to directories
    Directory search engines
  Distributed interaction between Client and Server
    Exploit client-server locality (Sun JXTA)
    Mobile telephones plan to remove towers
  Future DVD Distribution Scheme
    First purchasers buy and download from central site
    Subsequent purchasers download P2P
    Everyone must buy license before using
      Be the first on your block to buy the new movie
    Server receives payment for every download
    Cable broadband has shared local bus structure
Jefferson Lab Hall D in 2007
Atlas Detector at CERN
Hall D at Jefferson Lab is not very big
Computation and Data Rates for Hall D
  Expected date for full data collection: 2007
    Estimates based on expected CPU speeds
  Raw data collection and analysis
    15,000 events per second, 5 KB/event, 75 MB/sec
    At 1/3 duty factor, 0.75 PB/year
    To analyze each event twice, 50 CPUs
    Analyzed data 1.5 PB/year
  Computational simulations of experiments
    Expect to need 5,000 events/sec
    Expect 0.1 CPU-sec/event
    Need 500 CPUs
    Simulated data 0.75 PB/year
  Total data rate of 3 petabytes per year
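The Hall D estimates follow from a few multiplications, sketched below. The sketch assumes decimal units (1 MB = 1000 KB, 1 PB = 10**9 MB) and a 365-day year; the function names are hypothetical. The 50-CPU analysis estimate is not reproduced because the slide does not give the per-event analysis time.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def rate_mb_per_sec(events_per_sec, kb_per_event):
    """Data rate in MB/sec for a given event rate and event size."""
    return events_per_sec * kb_per_event / 1000


def yearly_volume_pb(mb_per_sec, duty_factor):
    """Petabytes per year when data is taken for duty_factor of the time."""
    return mb_per_sec * SECONDS_PER_YEAR * duty_factor / 1e9


def cpus_needed(events_per_sec, cpu_sec_per_event):
    """CPUs required to keep up with a steady event rate."""
    return events_per_sec * cpu_sec_per_event


if __name__ == "__main__":
    raw_rate = rate_mb_per_sec(15_000, 5)
    print(f"raw rate: {raw_rate:.0f} MB/sec")
    print(f"raw data: {yearly_volume_pb(raw_rate, 1 / 3):.2f} PB/year")
    print(f"simulation CPUs: {cpus_needed(5_000, 0.1):.0f}")
    # Yearly totals as stated on the slide: raw + analyzed + simulated.
    print(f"total: {0.75 + 1.5 + 0.75} PB/year")
```

With a full calendar year the raw volume comes out near 0.79 PB, so the slide's 0.75 PB/year presumably reflects rounding or a slightly shorter running year.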
Hall D Computing Tasks
  First Pass Analysis
  Data Mining
  Physics Analysis
  Partial Wave Analysis
  Physics Analysis
  Acquisition
  Monitoring
  Slow Controls
  Data Archival
  Planning
  Simulation
  Publication
  Calibrations
Hall D Collaboration Map
Meeting Computational Challenges
  Moore’s Law: Computer performance increases by a factor of 2 every 18 months.
  Gilder’s Law: Network bandwidth triples every 12 months.
  Dennis’s Law: Neither Moore’s Law nor Gilder’s Law will solve our computing problems.
  Solving the information management problems requires people working on the software and developing a workable computing environment.
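The two growth laws on this slide can be compared by compounding them over the same period. This is a minimal sketch; the five-year horizon (roughly 2002 to the 2007 Hall D date) and the function name `growth_factor` are assumptions made for illustration.

```python
def growth_factor(multiplier, period_months, elapsed_months):
    """Compound growth: apply `multiplier` once per `period_months`."""
    return multiplier ** (elapsed_months / period_months)


if __name__ == "__main__":
    months = 60  # assumed horizon of five years
    moore = growth_factor(2, 18, months)   # doubling every 18 months
    gilder = growth_factor(3, 12, months)  # tripling every 12 months
    print(f"Moore's Law over {months} months:  x{moore:.1f}")
    print(f"Gilder's Law over {months} months: x{gilder:.0f}")
```

Taken at face value, bandwidth would grow far faster than processor performance over such a period. That is consistent with the slide's point that hardware growth alone does not address the information management problem.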
Database Research is not Dead
  Consider relative speeds of devices
  Pentium 120, circa 1996
    120 MHz processor
    33 MHz memory bus
    64 MByte memory
    10 GByte disk
    10 Mbit/sec Ethernet
  Today’s Pentium 4
    2800 MHz processor (x 24)
    400 MHz system bus (x 12)
    1 GByte memory (x 16)
    300 GByte disk (x 30)
    100 Mbit/sec Ethernet (x 10)
The speed and size of data storage is far outstripping the speed of processors
Hence: Data management is becoming more and more important
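The multipliers in the comparison can be recomputed from the two specifications. This is a minimal sketch that assumes 1 GByte = 1024 MByte for the memory figure; the dictionary keys and the function name `improvement` are hypothetical.

```python
# Figures from the slide; the key names are illustrative only.
PENTIUM_120 = {
    "processor_mhz": 120,
    "bus_mhz": 33,
    "memory_mbyte": 64,
    "disk_gbyte": 10,
    "ethernet_mbit": 10,
}

PENTIUM_4 = {
    "processor_mhz": 2800,
    "bus_mhz": 400,
    "memory_mbyte": 1024,  # 1 GByte, assuming 1024 MByte per GByte
    "disk_gbyte": 300,
    "ethernet_mbit": 100,
}


def improvement(old, new):
    """Ratio new/old for every component present in both specifications."""
    return {name: new[name] / old[name] for name in old}


if __name__ == "__main__":
    for name, ratio in improvement(PENTIUM_120, PENTIUM_4).items():
        print(f"{name}: x{ratio:.1f}")
```

The computed processor and bus ratios are a little over 23 and 12, so the slide's x 24 and x 12 are roundings. Disk capacity shows the largest ratio and the network link the smallest.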
What Scientists Need
  Self-Sustaining Infrastructure
    With regular well defined structure
  Adequately Described Components
    Function, Behaviour, QoS, …
  Models Supporting Analysis & Reasoning
    Finding appropriate components
    Determining how they compose
  Tools for Composition, Diagnosis & Change
  Sustainable Economic Model
  Reason to Trust the System’s Dependability
Grid Technology
  Virtual Organisations
    Sharing & Collaboration
  Security
    Single sign-in, delegation
  Distribution & fast FTP
    But Various Protocols
  Resource Management
    Discovery
    Process Creation
    Scheduling
    Monitoring
  Portability
    Ubiquitous APIs & Modules
  Government Agency Buy in
Foster, I., Kesselman, C. and Tuecke, S., The Anatomy of the Grid: Enabling Scalable Virtual Organizations, Intl. J. Supercomputer Applications, 15(3), 2001
Grid Computing Advantages
  Provide identical access for all collaborators
  Utilize all intellectual resources
    JLab, universities, remote sites
    Scientists, students
  Maximize total funding resources while meeting the total computing need
  Reduce systems’ complexity
    Partitioning of facility tasks, to manage and focus resources
  Optimize computing resources to solve problems
    Tier-n or “Grid” Model
  Reduce long-term computational management problems
Foundations for Grid Sites
  Grid Services
  Data Services
  Computing Services
  Information Services
  Interactive Services
  Batch Services
  Needs Very Reliable Hardware & Software at Remote Sites
  Needs Very Reliable, Easy to Install Software at Remote Sites
Open Grid Services Features
  WSDL + WSIL
    Description
    Discovery
  Tools & Platforms
    Apache Tomcat/Axis
    Globus
  Invocation
    SOAP
    RPC
    EJB
  Representations
    XML + Schema
  Life Time Management
    Factories
    Transient & Persistent GS
  GS Handles
  GS Records
  Soft State
  Notification
  Authentication
    Certificates + Delegation
  Change Management
  Platform Independence
Foster, I., Kesselman, C., Nick, J. and Tuecke, S., The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration
Grid Data Services OGSA-DAIS
  Common access for Physicists everywhere.
  Utilizing all intellectual resources
    JLab, universities, remote sites
    Scientists, students
  Maximize total funding resources while meeting the total computing need.
  Reduce Systems’ complexity
    Partitioning of facility tasks, to manage and focus resources.
  Optimization of computing resources to solve the problem.
    Tier-n or “Grid” Model.
  Reduce long-term computational management problems.
Sample Grid Data Service System
  Components
    Client
    Grid Service Registry
      Metadata about Services and Factories
      Delivers GSH of factory
    Grid Data Service Factory
      Metadata about Services
      Creates GDS and returns its GSH
    Grid Data Service
      Provides access to database
    Database
  Messages
    Client to Registry: <find factory>
    Registry to Client: <factory GSH>
    Client to Factory: <create Grid Data Service>
    Factory to Grid Data Service: Create Grid Data Service
    Factory to Client: <Grid Data Service GSH>
    Client to Grid Data Service: <data request>
    Grid Data Service to Client: <data document>
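The interaction on this slide (find a factory through the registry, ask the factory to create a Grid Data Service, then send the data request) can be sketched with plain Python objects. This is an illustrative sketch only, not the OGSA-DAI API. The class and method names, the `gsh://` handle strings, the `resolver` dictionary standing in for handle resolution, and the dictionary standing in for the database are all hypothetical.

```python
import itertools


class GridDataService:
    """Provides access to a database (here a dict of table name -> rows)."""

    def __init__(self, handle, database):
        self.handle = handle
        self._database = database

    def perform(self, table):
        # Stand-in for <data request> answered by a <data document>.
        return list(self._database.get(table, []))


class GridDataServiceFactory:
    """Creates Grid Data Services and returns their handles (GSH)."""

    def __init__(self, handle, database, resolver):
        self.handle = handle
        self._database = database
        self._resolver = resolver  # hypothetical handle -> object table
        self._ids = itertools.count(1)

    def create_grid_data_service(self):
        handle = f"gsh://gds/{next(self._ids)}"
        self._resolver[handle] = GridDataService(handle, self._database)
        return handle


class GridServiceRegistry:
    """Holds metadata about factories and delivers the GSH of a factory."""

    def __init__(self):
        self._factories = {}

    def register(self, description, factory):
        self._factories[description] = factory.handle

    def find_factory(self, description):
        return self._factories[description]


def run_client():
    resolver = {}
    database = {"T1": [(1, 2, 3), (4, 5, 6)]}
    factory = GridDataServiceFactory("gsh://factory/1", database, resolver)
    resolver[factory.handle] = factory
    registry = GridServiceRegistry()
    registry.register("T1 data", factory)

    # <find factory> answered by <factory GSH>
    factory_gsh = registry.find_factory("T1 data")
    # <create Grid Data Service> answered by <Grid Data Service GSH>
    gds_gsh = resolver[factory_gsh].create_grid_data_service()
    # <data request> answered by <data document>
    return resolver[gds_gsh].perform("T1")


if __name__ == "__main__":
    print(run_client())
```

The client only ever holds handles. In this sketch that is what lets the registry, the factory and the data service live in different places.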
Requesting Data from a GDS
  <GDSS>
    <Body>
      <Statement name="xyz1">
        SELECT m1, m2, m3 FROM T1 WHERE …
      </Statement>
      <Delivery>
        <From>xyz1</From>
        <To>GSH of C</To>
      </Delivery>
      <Execute>xyz1</Execute>
    </Body>
  </GDSS>
  <query specification>
  Requester A
  Grid Data Service
  GridDataServicePort
  <data document>
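A request document of the shape shown on this slide can be assembled and sanity-checked with the Python standard library. This is a minimal sketch: the element names come from the slide, and the rest is assumption. In particular, the helper names `build_request` and `undefined_references` are hypothetical, and so is the rule that `<From>` and `<Execute>` must name a declared `<Statement>`. The slide elides the WHERE clause, so the example query omits it.

```python
import xml.etree.ElementTree as ET


def build_request(name, sql, deliver_to):
    """Build a <GDSS> request with one statement, one delivery and one execute."""
    gdss = ET.Element("GDSS")
    body = ET.SubElement(gdss, "Body")
    statement = ET.SubElement(body, "Statement", {"name": name})
    statement.text = sql
    delivery = ET.SubElement(body, "Delivery")
    ET.SubElement(delivery, "From").text = name
    ET.SubElement(delivery, "To").text = deliver_to
    ET.SubElement(body, "Execute").text = name
    return ET.tostring(gdss, encoding="unicode")


def undefined_references(xml_text):
    """List names used by <From> or <Execute> that no <Statement> declares."""
    body = ET.fromstring(xml_text).find("Body")
    declared = {s.get("name") for s in body.findall("Statement")}
    used = [e.text.strip() for e in body.findall("Delivery/From")]
    used += [e.text.strip() for e in body.findall("Execute")]
    return [name for name in used if name not in declared]


if __name__ == "__main__":
    # The WHERE clause is elided on the slide, so it is left out here.
    document = build_request("xyz1", "SELECT m1, m2, m3 FROM T1", "GSH of C")
    print(document)
    print("undefined references:", undefined_references(document))
```

Building the document through an XML library rather than by string concatenation keeps the open and close tags matched.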
Request with Separate Delivery
  A makes request
  B and C receive resulting data
  <query response>
  <query specification>
  Requester A
  Grid Data Service
  GridDataServicePort
  Requester C
  Requester B
  <data document>
  <transport specification>
  <data document>
  <transport specification>
  query identifier
CS Challenges and Research
  Semantic Web
    Ontologies
    Inference engines
    XML databases
    Efficient access to XML resources
  Grid Architecture: OGSI
    Transport
    Repositories and replication
    Security and authorization
  OGSA-DAI
    Registries and discovery
    Integration of DB and scripting languages
    XML database update and query
    Distributed query processing
Web References
  Web search
    http://www.pricewatch.com
    http://www.google.com
    http://www.mysimon.com
  Web Services
    http://www.w3.org/2001/sw/ and http://www.w3.org/RDF/
    ebXML: http://www.ebxml.org/
    Microsoft UDDI repository: http://uddi.microsoft.com
    IBM UDDI: http://www7b.boulder.ibm.com/wsdd/downloads/UDDIregistry.html
  Peer to Peer
    Sun JXTA: http://www.jxta.org/
    Kazaa: http://www.kazaa.com
  National e-Science Centre
    Home: http://www.nesc.ed.uk/
    Talks: http://umbriel.dcs.gla.ac.uk/NeSC/general/presentations/
  Grid and Grid Computing
    Grid Forum: http://www.gridforum.org
    OGSA: http://www.gridforum.org/ogsi-wg/
    OGSA-DAIS: http://www.gridforum.org/ogsi-wg/