1
Artemis: Integrating Scientific Data on the Grid
Rattapoom Tuchinda, Snehal Thakkar, Yolanda Gil, Ewa Deelman
2
Outline
Motivation
  Data integration needs in scientific applications
  Distributed computing in grids
Problem statement
Artemis architecture
Evaluation
Related Work
Conclusions and future work
3
Scientific Data Integration
Large-scale, cross-disciplinary scientific data collection, storage, and analysis exacerbates heterogeneity and dynamics
  National Virtual Observatory (NVO)
  Earth System Grid (ESG)
4
Grid Computing [Foster & Kesselman 04]
Grids provide middleware services for distributed computing:
  Seamless integration and management of resources – OGSA
  Job submission and execution management – Condor
  Resource availability & performance – Monitoring and Directory Svc (MDS)
  Data replication for robustness and efficiency – Replica Loc Svc (RLS)
  Descriptions of data sources – Metadata Catalog Services (MCS)
[Figure, from [Kesselman 04]. Annotations:
  Discovery: many sources of data, services, computation; registries (R) organize services of interest to a community
  Access: data integration activities may require access to, & exploration/analysis of, data at many locations; exploration & analysis may involve complex, multi-step workflows
  Resource management (RM) is needed to ensure progress & arbitrate competing demands
  Security & policy (security service, policy service) must underlie access & management decisions]
5
Scientific Data Storage and Access
Data sources are very heterogeneous
  Data that results from various instruments, disciplines, and types of analyses
  Wide variety of data storage systems (files, DBs, servers, etc.)
Data sources are highly distributed
  Data stored in different locations on the grid
  Data is replicated in multiple locations
Data sources are highly dynamic
  Data grows continuously, new data models are routine
  New data sources regularly appear
  Data sources may become unavailable sporadically
Data available at unprecedented scale
  Very soon petabytes
These challenges are in the way of scientific progress in many disciplines
6
Data Storage and Access in Grids
Data described with metadata attributes
  Attribute names may not be consistent across different sources
  Metadata descriptions often stored separately from the data itself
Metadata Catalog Service (MCS) [Moore et al 01, Singh et al 03]
  Stores descriptive metadata and allows users to query based on desired attributes
  Addresses heterogeneity of data source implementations and access
7
Sample Query
search constraints:
  keywords = "atmospheric data" or "climate data" or "climate model"
  model type = "CCSM" or "PCM"
  period = 2001
search results:
  Files, collections, or views:
    /CCSM2/b20.007/atm
    /PCM/B06.62/atm
    /PCM/B06.20/atm
    /PCM/B06.21/atm
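The sample query can be read as an attribute filter over catalog records. The sketch below is only an illustration of that reading, not the MCS API: it assumes that alternatives for one attribute are OR-ed and that different attributes are AND-ed, and the `CATALOG` records (including the `/PCM/B05.10/ocn` entry and all attribute values) are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical catalog records: logical name -> metadata attributes.
# The attribute values are invented for illustration only.
CATALOG: Dict[str, Dict[str, object]] = {
    "/CCSM2/b20.007/atm": {"keywords": {"climate model"}, "model type": "CCSM", "period": 2001},
    "/PCM/B06.62/atm": {"keywords": {"atmospheric data"}, "model type": "PCM", "period": 2001},
    "/PCM/B05.10/ocn": {"keywords": {"ocean data"}, "model type": "PCM", "period": 1999},
}

# Constraints from the slide: each attribute maps to its set of accepted values.
QUERY: Dict[str, Set[object]] = {
    "keywords": {"atmospheric data", "climate data", "climate model"},
    "model type": {"CCSM", "PCM"},
    "period": {2001},
}


def matches(attrs: Dict[str, object], constraints: Dict[str, Set[object]]) -> bool:
    """True if every constrained attribute has at least one accepted value."""
    for name, accepted in constraints.items():
        value = attrs.get(name)
        # Multi-valued attributes are stored as sets; wrap single values.
        values = value if isinstance(value, (set, frozenset)) else {value}
        if not (values & accepted):
            return False
    return True


def search(catalog: Dict[str, Dict[str, object]],
           constraints: Dict[str, Set[object]]) -> List[str]:
    """Return the logical names of all records that satisfy the constraints."""
    return sorted(name for name, attrs in catalog.items() if matches(attrs, constraints))


if __name__ == "__main__":
    for name in search(CATALOG, QUERY):
        print(name)
```

With the hypothetical records above, the ocean-data entry is filtered out on both keywords and period, and the two atmospheric entries are returned.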
8
Problem Statement
Users should have seamless single point access
  Should not have to formulate a different query for each source
  Should not manage the unavailability of data sources
Users need assistance formulating the queries
  Data models may have different attribute names and representations (even from the same source)
  New data models/metadata attributes created all the time
[Figure labels: catalogs MCS1, MCS2, MCS3; databases DB1, DB2, DB3; queries q1, q2, q3; attribute names stime, etime, starttime, endtime, descr, sub; one source marked "currently unavailable"]
9
Artemis
A mixed-initiative data integration system that:
  Shields users from diversity in attribute representations
  Assists users in formulating queries step by step
  Manages the access and availability of dynamic collections of data sources
Integrates and extends various AI techniques:
  Data integration
  Ontology
  Dialogue wizards
10
Approach
[Figure: a Query Formulation Wizard and a Query Mediator sit in front of three metadata catalogs (MetadataCatalog1, MetadataCatalog2, MetadataCatalog3), each describing a data source. Catalog attribute names shown: stime, etime, …; starttime, endtime, …; description, subject. An ontology groups stime and starttime under "Start time" and etime and endtime under "End time", both under "Time". Example query: Find files with Start time > 500000 ^ End time < 600000]
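The figure's idea, one ontology-level query rewritten into each catalog's own attribute names, can be sketched as follows. This is an illustrative sketch rather than the Artemis implementation: which catalog uses which attribute names is an assumption (the figure does not make the assignment recoverable), and `MAPPINGS`, `rewrite`, and `rewrite_for_all` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

# (term or attribute name, comparison operator, value)
Constraint = Tuple[str, str, int]

# Ontology term -> attribute name in each catalog.
# The assignment of names to catalogs is assumed for illustration.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "MetadataCatalog1": {"Start time": "stime", "End time": "etime"},
    "MetadataCatalog2": {"Start time": "starttime", "End time": "endtime"},
    "MetadataCatalog3": {},  # has no time attributes in this sketch
}

# The example query from the slide, stated in ontology terms.
QUERY: List[Constraint] = [("Start time", ">", 500000), ("End time", "<", 600000)]


def rewrite(constraints: List[Constraint],
            mapping: Dict[str, str]) -> Optional[List[Constraint]]:
    """Translate ontology terms to catalog attributes; None if a term is unmapped."""
    rewritten: List[Constraint] = []
    for term, op, value in constraints:
        if term not in mapping:
            return None
        rewritten.append((mapping[term], op, value))
    return rewritten


def rewrite_for_all(constraints: List[Constraint]) -> Dict[str, List[Constraint]]:
    """Produce one catalog-specific query per catalog that can express the query."""
    result: Dict[str, List[Constraint]] = {}
    for catalog, mapping in MAPPINGS.items():
        rewritten = rewrite(constraints, mapping)
        if rewritten is not None:
            result[catalog] = rewritten
    return result


if __name__ == "__main__":
    for catalog, query in rewrite_for_all(QUERY).items():
        print(catalog, query)
```

The user states the constraint once, in ontology terms; catalogs that cannot express a term are simply left out of the result.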
11
Artemis Architecture
[Figure components: MCS Wizard (entity selection, filters); Dynamic Model Generator; Prometheus Query Mediator; ontology; model mappings; models; three Metadata Catalog Services, each with a data source]
12
MCS Wizard
Based on the Agent Wizard [Tuchinda 2003]
Domain experts create mappings between ontologies and metadata attributes
Users can then pick the ontology and the mappings relevant to their domain
Guides the user through available operations and filters consistent with the models of the data
13
Prometheus Query Mediator
Data integration system from earlier research [Thakkar et al. 2004] [Knoblock et al 2003]
  Provides unified query interface to a wide variety of data sources
  Relational model
  Requires pre-defined domain model relating sources to domain relations
Extended in Artemis to support:
  Source relations: various MCSs
  Domain relations: File, View, Collection
  Dynamic domain model based on availability of data sources
14
Dynamic Model Generation
Generate mediator model dynamically by querying MCSs
  Convert object oriented model of MCSs to relational model of the mediator
  Handles dynamic nature of data by generating new domain models at query time
Intuitive idea
  Query MCSs one at a time for all possible attributes of different objects
  Create domain relation for each object type with all possible attributes
  Create rules defining each MCS as data source
  Relate various data sources to domain relations
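The first two steps of the intuitive idea can be sketched as below. This is a minimal sketch under stated assumptions, not the Artemis generator: each catalog is represented by a hypothetical zero-argument `fetch` function returning its file object schemas, only a single object type (files) is handled, and an unreachable catalog is assumed to raise `ConnectionError`.

```python
from typing import Callable, Dict, List, Tuple

Schema = Dict[str, List[str]]  # object name -> attribute names


def generate_model(
    fetchers: Dict[str, Callable[[], Schema]],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Query catalogs one at a time and build the domain and source attribute lists.

    Returns (domain_attrs, source_attrs), where domain_attrs is the union of all
    attributes seen and source_attrs maps each reachable catalog to its attributes.
    """
    domain_attrs: List[str] = []
    source_attrs: Dict[str, List[str]] = {}
    for mcs_name, fetch in fetchers.items():
        try:
            schema = fetch()
        except ConnectionError:
            # Catalog is currently unavailable: leave it out of this model.
            continue
        attrs: List[str] = []
        for object_attrs in schema.values():
            for attr in object_attrs:
                if attr not in attrs:
                    attrs.append(attr)
        source_attrs[mcs_name] = attrs
        for attr in attrs:
            if attr not in domain_attrs:
                domain_attrs.append(attr)
    return domain_attrs, source_attrs


# Hypothetical catalogs used only to exercise the sketch.
def fetch_mcs1() -> Schema:
    return {"File1": ["starttime", "endtime", "frequency"],
            "File2": ["starttime", "endtime", "frequency", "amplitude"]}


def fetch_mcs2() -> Schema:
    return {"File3": ["starttime", "endtime", "lat", "lon", "temp"],
            "File4": ["starttime", "endtime", "lat", "lon", "windspeed"]}


def fetch_mcs3() -> Schema:
    raise ConnectionError("currently unavailable")


if __name__ == "__main__":
    domain, sources = generate_model(
        {"MCS1": fetch_mcs1, "MCS2": fetch_mcs2, "MCS3": fetch_mcs3})
    print(domain)
    print(sources)
```

Because the model is rebuilt from whatever the catalogs report at call time, new attributes and unavailable catalogs are picked up without any manual change to the domain model.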
15
Dynamic Model Generator (Cont’d)
Example
  MCS 1:
    File1(starttime, endtime, frequency), File2(starttime, endtime, frequency, amplitude)
  MCS 2:
    File3(starttime, endtime, lat, lon, temp), File4(starttime, endtime, lat, lon, windspeed)
Domain relation
  File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name)
Source relations
  MCS1File(starttime, endtime, frequency, amplitude, name)
  MCS2File(starttime, endtime, lat, lon, temp, windspeed, name)
Domain rules
  File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
    MCS1File(starttime, endtime, frequency, amplitude, name) ^
    (lat = '') ^ (lon = '') ^ (temp = '') ^ (windspeed = '')
  File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
    MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) ^
    (frequency = '') ^ (amplitude = '')
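The pattern behind these domain rules is mechanical: the head lists every domain attribute, the body names the source relation, and every domain attribute the source lacks is bound to the empty string. A small sketch of that rendering follows; `render_rule` is a hypothetical helper, and the output is a plain string in the slide's notation rather than input for any particular datalog engine.

```python
from typing import List

DOMAIN_ATTRS: List[str] = ["starttime", "endtime", "frequency", "amplitude",
                           "lat", "lon", "temp", "windspeed", "name"]

MCS1_ATTRS: List[str] = ["starttime", "endtime", "frequency", "amplitude", "name"]
MCS2_ATTRS: List[str] = ["starttime", "endtime", "lat", "lon", "temp", "windspeed", "name"]


def render_rule(domain_attrs: List[str], source_name: str,
                source_attrs: List[str]) -> str:
    """Render one domain rule, binding attributes missing from the source to ''."""
    head = f"File({', '.join(domain_attrs)})"
    body = [f"{source_name}({', '.join(source_attrs)})"]
    # Attributes the source does not provide are set to the empty string.
    body += [f"({attr} = '')" for attr in domain_attrs if attr not in source_attrs]
    return f"{head} :- " + " ^ ".join(body)


RULES: List[str] = [
    render_rule(DOMAIN_ATTRS, "MCS1File", MCS1_ATTRS),
    render_rule(DOMAIN_ATTRS, "MCS2File", MCS2_ATTRS),
]

if __name__ == "__main__":
    for rule in RULES:
        print(rule)
```

Running it prints the two rules of the example, one per source relation.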
16
Query Processing
When Prometheus receives a query it determines which MCSs are relevant
  Relevant MCSs are determined by comparing the constraints of the query with the constraints of the MCSs
  MCSs that do not satisfy constraints of the query are not used in the query
    For example, if the query asked for finding files that contained data for some lat, lon then MCS1 would not be queried
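The relevance check can be approximated with a simple rule: a catalog is skipped when the query constrains an attribute that the catalog does not provide. The sketch below assumes exactly that simplification, which is weaker than full constraint reasoning in a mediator; `SOURCES` and `relevant_sources` are hypothetical names and the constraint values are illustrative.

```python
from typing import Dict, List, Set, Tuple

Constraint = Tuple[str, str, float]  # (attribute, operator, value)

# Attributes each catalog can supply, taken from the two-catalog example.
SOURCES: Dict[str, Set[str]] = {
    "MCS1": {"starttime", "endtime", "frequency", "amplitude", "name"},
    "MCS2": {"starttime", "endtime", "lat", "lon", "temp", "windspeed", "name"},
}


def relevant_sources(sources: Dict[str, Set[str]],
                     constraints: List[Constraint]) -> List[str]:
    """Keep only the sources that provide every attribute the query constrains."""
    constrained = {attr for attr, _op, _value in constraints}
    return sorted(name for name, attrs in sources.items() if constrained <= attrs)


# A query that asks for files covering some lat/lon range (illustrative values).
LAT_LON_QUERY: List[Constraint] = [
    ("lat", ">", 33), ("lat", "<", 34),
    ("lon", "<", -118), ("lon", ">", -119),
]

# A query that constrains only attributes both catalogs share.
TIME_QUERY: List[Constraint] = [("starttime", ">", 50000), ("endtime", "<", 60000)]

if __name__ == "__main__":
    print(relevant_sources(SOURCES, LAT_LON_QUERY))
    print(relevant_sources(SOURCES, TIME_QUERY))
```

For the lat/lon query only MCS2 remains, matching the slide's example; the time-only query keeps both catalogs.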
17
Query Processing: Example
Let’s say the user uses the MCS Wizard to form the following query.
  Q(name) :-
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) ^
    (lat > 33) ^ (lat < 34) ^ (lon < -118) ^ (lon > -119) ^
    (starttime > 50000) ^ (endtime < 60000)
The Prometheus mediator would generate a datalog program with the query and domain rules
  File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
    MCS1File(starttime, endtime, frequency, amplitude, name) ^
    (lat = '') ^ (lon = '') ^ (temp = '') ^ (windspeed = '')
  File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
    MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) ^
    (frequency = '') ^ (amplitude = '')
18
Query Processing: Example
Let’s say the user uses the MCS Wizard to form the following query.
  Q(name) :-
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) ^
    (lat > 33) ^ (lat < 34) ^ (lon < -118) ^ (lon > -119) ^
    (starttime > 50000) ^ (endtime < 60000)
The Prometheus mediator would generate a datalog program with the query and domain rules
  File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
    MCS1File(starttime, endtime, frequency, amplitude, name) ^
    (lat = '') ^ (lon = '') ^ (temp = '') ^ (windspeed = '')
  File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
    MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) ^
    (frequency = '') ^ (amplitude = '')
The mediator determines that the constraints on lat and lon in the first rule (lat = '', lon = '') are not compatible with the order constraints on lat and lon in the query, so only MCS2 is queried
19
Artemis: Top level Selection
20
Artemis: Filtering
21
Evaluation
Enabled users to query 12 different MCSs
  Covering information from three different applications
    LIGO, ESG, and Geo-spatial data warehouse
  Covering 17,000 different files
  Metadata consisted of about 300 different attributes
Simulated addition of metadata to MCSs and failure of several MCSs while system was running
22
Related Work
MCS [Singh et al 03]
  Organize metadata about objects on the data grid
  Object oriented schema to support user defined metadata attributes
  Difficult for users to keep track of diverse attribute names
  No semantic information is attached to the attributes
Agent Wizard [Tuchinda et al. 2003]
  Interactive application that guides user by dividing complex tasks into a series of simpler question answering tasks
  Challenge is to model complex task as set of simpler subtasks
Prometheus Mediator [Thakkar et al. 2004]
  Data integration system that can efficiently integrate data from a wide variety of data sources
  Key restriction is that relational schema for data sources and domain must be known in advance
23
Related Work (Cont’d)
Mygrid [Wroe 2003]
  Model data sources as semantic web services
  Integration of data sources is represented as a workflow
  Requires that data sources have fixed schema and associated semantics
Model-based mediator system for scientific data management [Ludascher 2003]
  Data sources provide semantic information regarding their data
  The provided information is used to generate domain model for a mediator system
  Assumption is that semantic information is provided by different data sources of interest
24
Conclusions
Contributions:
  Mixed-initiative approach to help scientists query objects on the data grid
  Isolate users from heterogeneity of data sources
  Manage distributed dynamic data
Future Work:
  Algorithm to determine when to dynamically generate domain model
  Better support for specifying model mappings
  Artemis available as a grid service
  More extensive testing and usability studies
25
?