Managing Scientific Information: Making the Internet Work for Big Science

Professor Greg Riccardi, Florida State University

Department of Computer Science, UK National e-Science Centre

23 Oct, 2002 2

Overview

Information on the Web: Current Status
Providing Information on the Internet
General Conditions of the Web and Internet
Resources Needed for Big Science
What is The Grid?
How the Grid Might Support Databases
Computer Science Challenges and Research Opportunities


Is this the Internet?


Edinburgh on a Normal Day


Can We Find Web Information?

Use Google to find travel times from Edinburgh to Aberdeen
  Search for “railroad times Aberdeen Edinburgh”
  Why was no usable information returned?
    Vocabulary problem
    “railroad” is not a service (in UK)
  Search for “train times Aberdeen Edinburgh”
Even better, use a ticket service
  GNER.co.uk
  Thetrainline.com

Finding Information on the Web

Consider comparative shopping
  Provide a capability to compare prices
  Allow people to see pages of prices
Example of pricewatch.com
  Prices of memory
  Where do prices come from?
Can you extract the information content from the Web pages?
How can we make the Web provide information?
  Can we establish a way to share price info?
  Will vendors participate?

XML Creates Opportunity

Possibility: Use XML to represent information
  <item type="PC3500 DDR">
    <vendor name="memorylabs.com"/>
    <manuf name="Samsung"/>
    <size>512</size>
    <price>169.00</price>
  </item>
Strategy for sharing
  Industry creates standard XML schema
  Vendors create files of prices
  Comparison shopping sites grab files and create presentations of information
  Purchasing agents
How would shopping sites and purchasing agents find the sources of information?
Would vendors agree to publish?
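A minimal sketch of what a comparison shopping site might do with such price files, assuming vendors publish items in the shape shown on this slide. The `<prices>` wrapper element, the second vendor, and the function name are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical price file: the <prices> wrapper and the second vendor are
# assumptions; only the first <item> comes from the slide.
PRICE_FEED = """
<prices>
  <item type="PC3500 DDR">
    <vendor name="memorylabs.com"/>
    <manuf name="Samsung"/>
    <size>512</size>
    <price>169.00</price>
  </item>
  <item type="PC3500 DDR">
    <vendor name="example-vendor.test"/>
    <manuf name="Samsung"/>
    <size>512</size>
    <price>174.50</price>
  </item>
</prices>
""".strip()

def cheapest_offers(xml_text):
    """Return {(item type, size): (vendor, price)} for the lowest price seen."""
    best = {}
    for item in ET.fromstring(xml_text).iter("item"):
        key = (item.get("type"), int(item.findtext("size")))
        vendor = item.find("vendor").get("name")
        price = float(item.findtext("price"))
        if key not in best or price < best[key][1]:
            best[key] = (vendor, price)
    return best

offers = cheapest_offers(PRICE_FEED)
```

Because the markup names each field, the site reads prices directly instead of scraping them from presentation pages. Finding the files in the first place remains the open question raised above.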

Semantic Web: Information on Web

The Semantic Web
  Tim Berners-Lee’s idea: definition from http://www.w3.org/2001/sw/
  “The Semantic Web is the abstract representation of data on the World Wide Web, based on the RDF standards and other standards to be defined.”
Resource Description Framework (RDF) is an emerging standard for representing Web resources
  http://www.w3.org/RDF/
The Semantic Web requires sites to provide documents marked up to define information content
  I.e. XML documents
  With an agreed ontology

Can the Semantic Web Work?

According to Henry S. Thompson, U. Edinburgh
  Talk given at Global Grid Forum July 2002
The Semantic Web is based on metadata
  Metadata describes resources systematically
  Suppliers can record what a document or resource is for or about
  Search engines can work with meaningful information
What would we need to make Semantic Web work?
  A standard syntax for metadata
  One or more standard vocabularies
    Allow search engines, producers, and consumers to speak the same language
  Lots of documents and resources with metadata attached
  Attribution and trust
  Access and security
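To illustrate why a standard vocabulary matters, here is a small sketch in which search terms are mapped to agreed concepts before matching against supplier metadata. The vocabulary, the resources, and the URLs are all hypothetical; a real system would use RDF and a published ontology rather than Python dictionaries.

```python
# Hypothetical shared vocabulary: several everyday words map to one concept.
VOCABULARY = {
    "railroad": "rail-service",
    "railway": "rail-service",
    "train": "rail-service",
    "times": "timetable",
    "timetable": "timetable",
}

# Hypothetical resources with metadata recorded by their suppliers.
RESOURCES = [
    {"url": "http://example.test/rail", "about": {"rail-service", "timetable"}},
    {"url": "http://example.test/memory", "about": {"memory-prices"}},
]

def to_concepts(query):
    """Translate the words of a query into vocabulary concepts."""
    return {VOCABULARY[w] for w in query.lower().split() if w in VOCABULARY}

def search(query):
    """Return resources whose metadata covers every concept in the query."""
    wanted = to_concepts(query)
    return [r["url"] for r in RESOURCES if wanted and wanted <= r["about"]]

us_style = search("railroad times Aberdeen Edinburgh")
uk_style = search("train times Aberdeen Edinburgh")
```

With the shared vocabulary, the “railroad” and “train” queries resolve to the same concepts and return the same resource, which plain keyword matching did not manage in the earlier search example.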

Web Services

Machine-to-machine exchange of information
  Web servers deliver XML in response to HTTP requests
Metadata defined with Web Services Description Language (WSDL)
  Gives structure of information content of a service
UDDI commercial registry for services
  Possible use: try to create a contract to process 1000 credit card transactions
  Look for services, ask for prices, negotiate, etc.
  Microsoft, IBM, ebXML

Discovering Web Services

Again according to Henry S. Thompson, U. Edinburgh
The Semantic Web cannot work without inference
  The crucial missing step is the inference engine
  Information sources say what they can provide
  Information users say what they want
  The 2 specifications are not obviously related!
The user must be able to
  Find the resources
  Determine their suitability (and cost)
  Create a request in the proper form
  Process the returned data
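A toy sketch of the inference step, assuming a provider's description and a user's request can be related through “is a kind of” rules. The rules, the service names, and the concepts are hypothetical; a real inference engine would work over an ontology rather than a single dictionary.

```python
# Hypothetical "is a kind of" rules that an inference engine could apply.
IS_A = {
    "train-timetable": "rail-travel-info",
    "rail-travel-info": "travel-info",
}

# Hypothetical provider descriptions: service name -> what it says it provides.
PROVIDERS = {
    "timetable-service": "train-timetable",
    "price-service": "memory-prices",
}

def generalizations(concept):
    """Return the concept followed by everything it can be inferred to be."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

def find_services(wanted):
    """Match what a user wants against what providers say they offer."""
    return sorted(name for name, offered in PROVIDERS.items()
                  if wanted in generalizations(offered))

matches = find_services("travel-info")
```

The provider never mentions “travel-info”, yet following the rules relates the two specifications. This covers only the first of the four user steps, finding the resources.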

The reality of Web Services

Quoting Henry S. Thompson, U. Edinburgh
Forget the headline stuff
  Cars negotiating with petrol stations
  Agents choosing a specialist based on available appointment slots
The focus in practice is on exploiting the move to asynchronous distributed applications
  Within the enterprise, not between enterprises
  Using pre-negotiated vocabularies, and little or no discovery
IT-intensive enterprises see Web Services primarily as a way to reduce their EAI/middleware bills
Big Science needs more than Web Services
  Portal technology for presentation of information to people
  Open Grid Services Architecture (OGSA) to extend Web Services with capabilities needed for scientific collaborations

Portal Technology

A portal is a Web page that collects and presents information from many sources
  Tailored for needs of publisher
  Tailorable for needs of consumer
JetSpeed Apache Portal Project
FSU K12 Education portal
  http://edtech.oddl.fsu.edu:8080/K12/
Indiana Community Grids portal
  http://ptlportal.communitygrids.iu.edu/portal/

Big Science and Grid Technology

Big Science
  Massive data
  Massive computing
  Geographic distribution of people
  Geographic distribution of resources
Biggest emerging problems
  How to make people efficient
Example science fields
  Economics, earth sciences, astronomy, mechanical engineering, aerospace, bioinformatics, medicine
Grid Technology
  Software support for heterogeneous distributed applications

What Does Big Science Need?

Massive amounts of computing and storage
  Distribution of computing and storage facilities
  Efficient movement of massive data sets
  Location-independent processing
Tools that are easy to use
  Management of data and computation
  Control over software development
Discovery of computing and information resources
Collaboration between geographically distributed people

How Big is Big?

What is happening to data sizes?
1990 at Jefferson Lab, USA: Planning for new facility
  Estimated 10 megabytes/sec sustained data rate from equipment in 1998
  1 terabyte per day
  200 terabytes per year
  Data storage on £5 million tape silo
  Tape costs of £200,000 per year
Consumer examples
  Data from digital cameras
    4 megapixel, 3 bytes per pixel
    12 megabytes/picture
    Compressed much less
  Data from digital video cameras
    4 megabytes/sec
  Kazaa digital DVD video sharing
    See effect on networking at Florida State University
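The figures on this slide can be checked with a few lines of arithmetic. This is a sketch that assumes decimal units (1 TB = 10^6 MB); the function names are invented for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def terabytes_per_day(megabytes_per_sec):
    """Daily volume at a sustained rate, using decimal units (1 TB = 10**6 MB)."""
    return megabytes_per_sec * SECONDS_PER_DAY / 1e6

def picture_megabytes(megapixels, bytes_per_pixel):
    """Uncompressed image size in megabytes."""
    return megapixels * bytes_per_pixel

daily_tb = terabytes_per_day(10)       # roughly 0.86 TB, which the slide rounds to 1
days_for_200_tb = 200 / daily_tb       # days of running needed to reach 200 TB
picture_mb = picture_megabytes(4, 3)   # 12 MB per uncompressed picture
```

At 10 megabytes/sec, the 200 terabytes per year corresponds to roughly 230 days of sustained data taking. That would be consistent with equipment that does not run every day, although the slide does not state this assumption.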

Peer to Peer Networking (P2P)

P2P is sharing resources
Directories of services
  Centralized access to directories
  Directory search engines
Distributed interaction between Client and Server
  Exploit client-server locality (Sun JXTA)
  Mobile telephones plan to remove towers
Future DVD Distribution Scheme
  First purchasers buy and download from central site
  Subsequent purchasers download P2P
  Everyone must buy license before using
  Be the first on your block to buy the new movie
  Server receives payment for every download
  Cable broadband has shared local bus structure


Jefferson Lab Hall D in 2007


Atlas Detector at CERN

Hall D at Jefferson Lab is not very big

Computation and Data Rates for Hall D

Expected date for full data collection 2007
  Estimates based on expected CPU speeds
Raw data collection and analysis
  15,000 events per second, 5 KB/event, 75 MB/sec
  At 1/3 duty factor, 0.75 PB/year
  To analyze each event twice, 50 CPUs
  Analyzed data 1.5 PB/year
Computational simulations of experiments
  Expect to need 5,000 events/sec
  Expect 0.1 CPU-sec/event
  Need 500 CPUs
  Simulated data 0.75 PB/year
Total data rate of 3 petabytes per year
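The Hall D estimates follow from simple rate arithmetic. The sketch below reproduces them under two assumptions that the slide does not state: decimal units and a 365-day year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year

def rate_mb_per_sec(events_per_sec, kb_per_event):
    """Data rate in MB/sec, using decimal units (1 MB = 1000 KB)."""
    return events_per_sec * kb_per_event / 1000

def petabytes_per_year(mb_per_sec, duty_factor):
    """Yearly volume in PB (1 PB = 10**9 MB) at the given duty factor."""
    return mb_per_sec * SECONDS_PER_YEAR * duty_factor / 1e9

raw_rate = rate_mb_per_sec(15_000, 5)          # 75 MB/sec of raw data
raw_pb = petabytes_per_year(raw_rate, 1 / 3)   # close to 0.79, quoted as 0.75
simulation_cpus = 5_000 * 0.1                  # events/sec times CPU-sec/event
total_pb = 0.75 + 1.5 + 0.75                   # raw + analyzed + simulated
```

The raw-data figure comes out slightly above the 0.75 PB/year on the slide, which suggests the slide rounds down or assumes a shorter running year. The total of the three quoted volumes is 3 PB/year.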

Hall D Computing Tasks

[Diagram of computing tasks: Planning, Simulation, Acquisition, Monitoring, Slow Controls, Calibrations, Data Archival, First Pass Analysis, Data Mining, Partial Wave Analysis, Physics Analysis, Publication]


Hall D Collaboration Map


Meeting Computational Challenges

Moore’s law: Computer performance increases by a factor of 2 every 18 months.

Gilder’s Law: Network bandwidth triples every 12 months.

Solving the information management problems requires people working on the software and developing a workable computing environment.

Dennis’s Law: Neither Moore’s Law nor Gilder’s Law will solve our computing problems.
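To put the two growth laws side by side, the sketch below compounds each over the same interval. The five-year window (2002 to the planned 2007 start of Hall D data collection) is an assumption chosen for illustration.

```python
def growth_factor(multiplier, period_months, elapsed_months):
    """Improvement after elapsed_months when capacity grows by
    `multiplier` every `period_months`."""
    return multiplier ** (elapsed_months / period_months)

months = 5 * 12  # assumption: 2002 to 2007

moore = growth_factor(2, 18, months)    # about a factor of 10
gilder = growth_factor(3, 12, months)   # a factor of 243
```

Even if both laws hold, they only enlarge the hardware. They do not organise the data or the people, which is the point of Dennis's Law.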

Database Research is not Dead

Consider relative speeds of devices
Pentium 120, circa 1996
  120 MHz processor
  33 MHz memory bus
  64 MByte memory
  10 GByte disk
  10 Mbit/sec ethernet
Today’s Pentium 4
  2800 MHz processor (x 24)
  400 MHz system bus (x 12)
  1 GByte memory (x 16)
  300 GByte disk (x 30)
  100 Mbit/sec ethernet (x 10)
The speed and size of data storage are far outstripping the speed of processors
Hence: Data management is becoming more and more important
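The multipliers quoted on this slide can be recomputed from the two machine descriptions. In this sketch the dictionary keys are invented labels, and 1 GByte is taken as 1024 MByte, which is an assumption.

```python
# Figures from the slide; key names are labels invented for this sketch.
PENTIUM_120 = {"processor MHz": 120, "bus MHz": 33, "memory MB": 64,
               "disk GB": 10, "ethernet Mbit/s": 10}
PENTIUM_4 = {"processor MHz": 2800, "bus MHz": 400, "memory MB": 1024,
             "disk GB": 300, "ethernet Mbit/s": 100}

# Ratio of the newer machine to the older one, component by component.
ratios = {name: PENTIUM_4[name] / PENTIUM_120[name] for name in PENTIUM_120}

for name, ratio in ratios.items():
    print(f"{name}: x{ratio:.1f}")

fastest_growing = max(ratios, key=ratios.get)
```

The processor ratio comes out near 23.3, which the slide rounds to 24. Disk capacity shows the largest growth of the five components and the network the smallest.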

What Scientists Need

Self-Sustaining Infrastructure
  With regular well defined structure
Adequately Described Components
  Function, Behaviour, QoS, …
Models Supporting Analysis & Reasoning
  Finding appropriate components
  Determining how they compose
Tools for Composition, Diagnosis & Change
Sustainable Economic Model
Reason to Trust the System’s Dependability

Grid Technology

Virtual Organisations
  Sharing & Collaboration
Security
  Single Sign in, delegation
Distribution & fast FTP
  But Various Protocols
Resource Management
  Discovery
  Process Creation
  Scheduling
  Monitoring
Portability
  Ubiquitous APIs & Modules
Government Agency Buy in

Foster, I., Kesselman, C. and Tuecke, S., The Anatomy of the Grid: Enabling Virtual Organisations, Intl. J. Supercomputer Applications, 15(3), 2001

Grid Computing Advantages

Provide identical access for all collaborators
Utilize all intellectual resources
  JLab, universities, remote sites
  Scientists, students
Maximize total funding resources while meeting the total computing need
Reduce systems’ complexity
  Partitioning of facility tasks, to manage and focus resources
Optimize computing resources to solve problems
  Tier-n or “Grid” Model
Reduce long-term computational management problems

Foundations for Grid Sites

[Diagram labels: Grid Services, Data Services, Computing Services, Information Services, Interactive Services, Batch Services]
Needs very reliable hardware and software at remote sites
Needs very reliable, easy to install software at remote sites

Open Grid Services Features

WSDL + WSIL
  Description
  Discovery
Tools & Platforms
  Apache Tomcat/Axis
  Globus
Invocation
  SOAP
  RPC
  EJB
Representations
  XML + Schema
Life Time Management
  Factories
  Transient & Persistent GS
  GS Handles
  GS Records
  Soft State
Notification
Authentication
  Certificates + Delegation
Change Management
Platform Independence

Foster, I., Kesselman, C., Nick, J. and Tuecke, S., The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration

Grid Data Services OGSA-DAIS

Common access for physicists everywhere
Utilizing all intellectual resources
  JLab, universities, remote sites
  Scientists, students
Maximize total funding resources while meeting the total computing need
Reduce systems’ complexity
  Partitioning of facility tasks, to manage and focus resources
Optimization of computing resources to solve the problem
  Tier-n or “Grid” Model
Reduce long-term computational management problems

Sample Grid Data Service System

[Diagram: a Client, a Grid Service Registry, a Grid Data Service Factory, a Grid Data Service, and a Database]
Messages in the diagram
  Client to Registry: <find factory>, answered with <factory GSH>
  Client to Factory: <create Grid Data Service>, answered with <Grid Data Service GSH>
  Client to Grid Data Service: <data request>, answered with <data document>
Registry
  Metadata about Services and Factories
  Delivers GSH of factory
Factory
  Metadata about Services
  Creates GDS and returns its GSH
Grid Data Service
  Provides access to database
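The registry, factory, and service roles can be sketched with plain Python objects. This is only an in-process illustration of the interaction pattern, not the OGSA-DAI API: the class names, the handle strings, and the use of an in-memory SQLite database as the stand-in database are all assumptions.

```python
import itertools
import sqlite3

class GridDataService:
    """Provides access to a database (here an in-memory SQLite stand-in)."""
    def __init__(self, connection):
        self._connection = connection

    def request(self, sql):
        # The returned rows play the role of the "data document".
        return self._connection.execute(sql).fetchall()

class GridDataServiceFactory:
    """Creates Grid Data Services and returns their handles."""
    _ids = itertools.count(1)

    def __init__(self, connection, services):
        self._connection = connection
        self._services = services  # handle -> object lookup table

    def create(self):
        handle = f"gsh:gds/{next(self._ids)}"
        self._services[handle] = GridDataService(self._connection)
        return handle

class Registry:
    """Holds metadata about factories and delivers a factory's handle."""
    def __init__(self):
        self._factories = {}

    def register(self, description, handle):
        self._factories[description] = handle

    def find_factory(self, description):
        return self._factories[description]

# Set-up: handles are plain strings resolved through a dictionary.
handles = {}
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE T1 (m1, m2, m3)")
db.execute("INSERT INTO T1 VALUES (1, 2, 3)")
handles["gsh:factory/1"] = GridDataServiceFactory(db, handles)
registry = Registry()
registry.register("T1 database", "gsh:factory/1")

# Client side: find the factory, create a service, request data.
factory_gsh = registry.find_factory("T1 database")
gds_gsh = handles[factory_gsh].create()
rows = handles[gds_gsh].request("SELECT m1, m2, m3 FROM T1")
```

The client never touches the database directly. It only holds handles, first to the factory and then to the service the factory created for it.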

Requesting Data from a GDS

<GDSS>
  <Body>
    <Statement name="xyz1">SELECT m1, m2, m3 FROM T1 WHERE …</Statement>
    <Delivery>
      <From>xyz1</From>
      <To>GSH of C</To>
    </Delivery>
    <Execute>xyz1</Execute>
  </Body>
</GDSS>

[Diagram: Requester A sends a <query specification> to the GridDataServicePort of the Grid Data Service, which returns a <data document>]
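A requester could assemble a request document of this shape programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`. The element names follow the slide, while the function name, the query without a WHERE clause, and the handle string are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_request(name, sql, deliver_to):
    """Build a request document shaped like the example on the slide."""
    root = ET.Element("GDSS")
    body = ET.SubElement(root, "Body")
    statement = ET.SubElement(body, "Statement", {"name": name})
    statement.text = sql
    delivery = ET.SubElement(body, "Delivery")
    ET.SubElement(delivery, "From").text = name
    ET.SubElement(delivery, "To").text = deliver_to
    ET.SubElement(body, "Execute").text = name
    return ET.tostring(root, encoding="unicode")

# The query and the handle value are hypothetical placeholders.
document = build_request("xyz1", "SELECT m1, m2, m3 FROM T1", "gsh:requester/C")

# Round-trip check: the document parses back into the same structure.
parsed = ET.fromstring(document)
```

Building the document through an XML library rather than string concatenation keeps the tags balanced and escapes any special characters in the SQL text.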

Request with Separate Delivery

A makes request
B and C receive resulting data

[Diagram: Requester A sends a <query specification> to the GridDataServicePort of the Grid Data Service and receives a <query response> with a query identifier; Requester B and Requester C each have a <transport specification> and receive a <data document>]

CS Challenges and Research

Semantic Web
  Ontologies
  Inference engines
  XML databases
  Efficient access to XML resources
Grid Architecture: OGSI
  Transport
  Repositories and replication
  Security and authorization
OGSA-DAI
  Registries and discovery
  Integration of DB and scripting languages
  XML database update and query
  Distributed query processing

Web References

Web search
  http://www.pricewatch.com
  http://www.google.com
  http://www.mysimon.com
Web Services
  Semantic Web: http://www.w3.org/2001/sw/
  RDF: http://www.w3.org/RDF/
  ebXML: http://www.ebxml.org/
  Microsoft UDDI repository: http://uddi.microsoft.com
  IBM UDDI: http://www7b.boulder.ibm.com/wsdd/downloads/UDDIregistry.html
Peer to Peer
  Sun JXTA: http://www.jxta.org/
  Kazaa: http://www.kazaa.com
National e-Science Centre
  Home: http://www.nesc.ed.uk/
  Talks: http://umbriel.dcs.gla.ac.uk/NeSC/general/presentations/
Grid and Grid Computing
  Grid Forum: http://www.gridforum.org
  OGSA: http://www.gridforum.org/ogsi-wg/
  OGSA-DAIS: http://www.gridforum.org/ogsi-wg/
