
UPPSALA DATABASE LABORATORY

Managing Scientific Queries over Distributed Data in a Grid Environment

Ruslan Fomkin

January 20, 2006 NGN workshop Uppsala


UU- IT - UDBL Ruslan Fomkin

Uppsala DataBase Laboratory (UDBL)

Supervisor
• Prof. T. Risch

Database research
• How to build extensible middleware query processing that allows scalable, application-oriented search over different kinds of wrapped information sources

http://www.it.uu.se/research/group/udbl/


AMOS II

[Figure: AMOS II architecture. Labels: Applications (Simulation, Visualization, Analysis); Queries and Continuous Queries; Virtual Mediator Database (Queries and views); Wrappers and Plug-ins; Data sources (Patient Monitoring, Grid, historical Measurements, Relational Databases)]


Ongoing Research at UDBL

Stream Queries on BlueGene: Erik Zeitler, MSc
FEM Databases: Kjell Orsborn, PhD
Mediating Web Services: Manivasakan Sabesan, BSc
Semantic Web Queries to Hidden Web: Johan Petrini, MSc
Stream Data Manager: Milena Ivanova, PhD
Expensive GRID Queries: Ruslan Fomkin, MSc


Outline

Introduction
The project
Test application
Developed framework
Conclusion
Future work


Scientific Applications, Grid and Databases

A lot of scientific data
• Complex structure
• Stored in files distributed in the Grid

Scientific analyses can be represented as declarative queries
• Complex queries with numerical computations
• Long-running or batch queries

Utilization of the computational resources of the Grid


Parallel Object Query System for Expensive Computations (POQSEC)

Query processor for scientific applications
• high-level interface to specify the analyses
• automatically generates execution plans and evaluates them

Requirements
• Scalable, efficient, flexible, transparent

Properties
• Distributed and parallel


Layered Architecture of the System

POQSEC provides
• scientific query management

Grid provides
• computation management
• file management
• NorduGrid Middleware

Application area provides
• computational libraries
• data management libraries
• ROOT library

[Figure: layered architecture. Labels: User; POQSEC; Application libraries (ROOT); Grid (NorduGrid); Data; Clusters]


Our Test Application

From particle physics
Analysis of collision events for the presence of Higgs particles
Data produced by ATLAS simulation software
• stored in files
• distributed in the Grid (e.g. NorduGrid)
• managed by the ROOT library


Object-Relational Schema of the Application Data

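[Figure: object-relational schema of the application data. Types: Event, Particle, Lepton, Muon, Electron, Jet. Event is connected to Particle by the relationship particles (1:n). Attributes shown: PxMiss, PyMiss, Px, Py, Pz, Kf, Ee. Legend: inheritance, relationship]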
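The schema can be mirrored in ordinary classes. The following is a minimal Python sketch, not the AMOS II schema definition itself. The attribute placement (PxMiss and PyMiss on Event; Px, Py, Pz and Ee on Particle; Kf on Lepton), the inheritance of Lepton and Jet from Particle and of Muon and Electron from Lepton, and the sample values are all assumptions read off the figure labels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Particle:
    # Momentum components and energy (placement on Particle is an assumption).
    Px: float
    Py: float
    Pz: float
    Ee: float


@dataclass
class Lepton(Particle):
    # Kf is what the talk's queries compare to find opposite leptons.
    Kf: int = 0


@dataclass
class Muon(Lepton):
    pass


@dataclass
class Electron(Lepton):
    pass


@dataclass
class Jet(Particle):
    pass


@dataclass
class Event:
    # Missing momentum components (assumed to be attributes of Event).
    PxMiss: float
    PyMiss: float
    # 1:n relationship "particles" from Event to Particle.
    particles: List[Particle] = field(default_factory=list)


# Illustrative values only.
ev = Event(PxMiss=1.0, PyMiss=-2.0)
ev.particles.append(Muon(Px=10.0, Py=0.0, Pz=5.0, Ee=12.0, Kf=13))
ev.particles.append(Jet(Px=-8.0, Py=1.0, Pz=2.0, Ee=9.0))
leptons = [p for p in ev.particles if isinstance(p, Lepton)]
```

Subtype queries such as "all leptons of an event" then reduce to a type test over the particles relationship.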


General Query of the Analysis

Selection of those events that satisfy predicates containing numerical operations

SELECT ev
FROM Event ev
WHERE jetvetocut(ev) AND
      zvetocut(ev) AND
      topcut(ev) AND
      misseecuts(ev) AND
      leptoncuts(ev) AND
      threeleptoncut(ev);

Each predicate is called a cut in the application area
Predicates are defined as queries
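The shape of this query, a conjunction of independent predicates over each event, can be sketched in Python as follows. This is only an illustration of the selection logic. The event representation and the two toy cuts with their thresholds are made up and do not correspond to the real cuts named in the query.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, float]        # stand-in for the Event type
Cut = Callable[[Event], bool]   # a cut is a predicate over one event


def select_events(events: Iterable[Event], cuts: List[Cut]) -> List[Event]:
    """Return the events for which every cut holds (the WHERE ... AND ... part)."""
    return [ev for ev in events if all(cut(ev) for cut in cuts)]


# Two toy cuts with hypothetical thresholds.
def toy_missing_energy_cut(ev: Event) -> bool:
    return ev["missing_energy"] >= 40.0


def toy_lepton_count_cut(ev: Event) -> bool:
    return ev["n_leptons"] == 3


events = [
    {"missing_energy": 55.0, "n_leptons": 3},
    {"missing_energy": 10.0, "n_leptons": 3},
    {"missing_energy": 60.0, "n_leptons": 2},
]
selected = select_events(events, [toy_missing_energy_cut, toy_lepton_count_cut])
```

Because each event is tested independently of the others, the same selection can be run over disjoint subsets of the events and the results concatenated.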


Example of a predicate: Z-veto cut

Either the event does not have a pair of oppositely charged leptons, or the invariant mass of the pair is not close to the mass of a Z particle

CREATE FUNCTION zvetocut(Event ev) -> Event AS
  SELECT ev
  WHERE NOTANY(oppositeLeptons(ev)) OR
        abs(invMass(oppositeLeptons(ev)) - zMass) >= minZMass;

CREATE FUNCTION oppositeLeptons(Event ev) -> bag of <Lepton, Lepton> AS
  SELECT l1, l2
  FROM Lepton l1, Lepton l2
  WHERE l1 = particles(ev) AND
        l2 = particles(ev) AND
        Kf(l1) = -Kf(l2);
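For readers unfamiliar with the query language, the same cut can be sketched in Python. Several things here are assumptions: invMass is not defined on the slide, so the sketch uses the standard relativistic invariant mass of the pair; the values of Z_MASS and MIN_Z_MASS are illustrative; and the second branch is read as "at least one opposite pair lies outside the Z window", which is one possible reading of the bag-valued comparison.

```python
import math
from dataclasses import dataclass
from itertools import permutations
from typing import List, Tuple

Z_MASS = 91.19      # GeV, approximate Z boson mass
MIN_Z_MASS = 10.0   # GeV, hypothetical threshold (not given in the talk)


@dataclass(frozen=True)
class Lepton:
    Px: float
    Py: float
    Pz: float
    Ee: float
    Kf: int


def opposite_leptons(leptons: List[Lepton]) -> List[Tuple[Lepton, Lepton]]:
    """All ordered pairs (l1, l2) with Kf(l1) = -Kf(l2)."""
    return [(a, b) for a, b in permutations(leptons, 2) if a.Kf == -b.Kf]


def inv_mass(pair: Tuple[Lepton, Lepton]) -> float:
    """Assumed definition: sqrt((E1 + E2)^2 - |p1 + p2|^2)."""
    a, b = pair
    e = a.Ee + b.Ee
    px, py, pz = a.Px + b.Px, a.Py + b.Py, a.Pz + b.Pz
    return math.sqrt(max(e * e - (px * px + py * py + pz * pz), 0.0))


def z_veto_cut(leptons: List[Lepton]) -> bool:
    pairs = opposite_leptons(leptons)
    if not pairs:                      # NOTANY(oppositeLeptons(ev))
        return True
    return any(abs(inv_mass(p) - Z_MASS) >= MIN_Z_MASS for p in pairs)


# Back-to-back pairs: one near the Z mass, one far below it.
near_z = [Lepton(45.6, 0.0, 0.0, 45.6, 13), Lepton(-45.6, 0.0, 0.0, 45.6, -13)]
far_from_z = [Lepton(20.0, 0.0, 0.0, 20.0, 11), Lepton(-20.0, 0.0, 0.0, 20.0, -11)]
```

Note that, like the query, opposite_leptons returns each unordered pair twice, once in each order.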


Current Framework

Basic tool for utilizing NorduGrid through the Advanced Resource Connector (ARC)

Submission mechanism
• submit query
• parallelize query into several subqueries
• generate job scripts (one per subquery)

Babysitter functionality
Data exchange mechanism through files


Client and Coordinator Part

POQSEC client
• personal database with application schema
• ROOT wrapper

Coordinator server
• receives queries
• creates jobs

Grid Meta-Database
• computational resources
• data files

Submission Database
• received submissions
• created jobs

Babysitter
• interactions with ARC

[Figure: Grid client node. Labels: POQSEC Client, Query Coordinator, Coordinator server, Grid Meta-Database, Submission Database, Job queue, Babysitter, Local Storage, ARC Client]


Query Submission

Query submission
• query file name selection
• degree of parallelism
• CPU time for each job

Coordinator server creates jobs
• same query
• partitions of data with equal size
• same CPU time provided by user
• corresponding job script files

Submission and its jobs are saved in the Submission Database
Created jobs are added to the Job queue
Script files are saved to Local Storage

[Figure: Grid client node architecture]
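The job creation step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the POQSEC implementation: "equal size" is interpreted as an equal number of input files per job, the Job record and file names are hypothetical, and the generated script is a made-up plain-text description rather than ARC's actual job format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    job_id: int
    query: str
    input_files: List[str]
    cpu_minutes: int
    script: str


def partition(files: List[str], parts: int) -> List[List[str]]:
    """Split files into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(files), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(files[start:start + size])
        start += size
    return chunks


def create_jobs(query: str, files: List[str],
                parallelism: int, cpu_minutes: int) -> List[Job]:
    """One job per partition: same query, same CPU time, different input files."""
    jobs = []
    for i, chunk in enumerate(partition(files, parallelism)):
        # Hypothetical job description; a real script would target ARC.
        script = "\n".join([
            f"job: {i}",
            f"cpu_minutes: {cpu_minutes}",
            f"query: {query}",
            "inputs: " + " ".join(chunk),
        ])
        jobs.append(Job(i, query, chunk, cpu_minutes, script))
    return jobs


files = [f"events_{n}.root" for n in range(8)]
jobs = create_jobs("SELECT ev FROM Event ev WHERE zvetocut(ev);", files, 4, 20)
```

With eight files and a degree of parallelism of four, each job receives two files and the same requested CPU time.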


Jobs Submission

Babysitter
• Takes jobs from the Job queue
• Submits each job to the ARC client
• Changes the status of submitted jobs in the Submission Database

ARC client
• finds a Computing Element (CE)
• submits the job to the corresponding ARC Grid Manager

[Figure: Grid client node connected through the ARC Client to the ARC Grid Managers of two CEs]
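The Babysitter's submission step can be sketched in Python as a loop that drains the job queue. FakeArcClient, the job names, the status strings and the identifier format are all hypothetical stand-ins; the sketch does not call any real ARC interface.

```python
from collections import deque
from typing import Deque, Dict


class FakeArcClient:
    """Stand-in for the ARC client; a real one would find a CE and submit there."""

    def __init__(self) -> None:
        self._next = 0

    def submit(self, script: str) -> str:
        self._next += 1
        return f"grid-job-{self._next}"   # hypothetical Grid job identifier


def submit_queued_jobs(job_queue: Deque[str], scripts: Dict[str, str],
                       arc: FakeArcClient, submission_db: Dict[str, dict]) -> None:
    """Take each job from the queue, submit it, and record its new status."""
    while job_queue:
        job_name = job_queue.popleft()
        grid_id = arc.submit(scripts[job_name])
        submission_db[job_name] = {"status": "submitted", "grid_id": grid_id}


queue = deque(["job-0", "job-1"])
scripts = {"job-0": "subquery 0", "job-1": "subquery 1"}
db = {"job-0": {"status": "created"}, "job-1": {"status": "created"}}
submit_queued_jobs(queue, scripts, FakeArcClient(), db)
```

Keeping the Grid identifier next to the status is what later allows the job to be polled and its result fetched.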


Job Execution

ARC Grid Manager
• downloads input files
• submits the job to the Local Batch System (LBS)

After some delay, the LBS starts the Executor on an allocated CE node

During execution, the Executor
• executes the given subquery
• accesses data through the ROOT wrapper
• saves the result to files on CE Storage

[Figure: job execution on a CE. Labels: ARC Grid Manager, LBS Queue, CE node (Executor, wrapper), CE Storage, SE, SE]


Downloading Result

Babysitter
• polls the ARC client for job statuses
• requests download of results for finished jobs

Results are downloaded to Local Storage
The user can retrieve the result when all jobs are ready

[Figure: Grid client node and the ARC Grid Managers and CE Storage of two CEs]
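One polling round of the Babysitter can be sketched in Python as below. The status strings and the get_status and download callbacks are hypothetical stand-ins for the interaction with the ARC client, and the Grid is simulated by scripted answers.

```python
from typing import Callable, Dict, List


def poll_once(jobs: Dict[str, str],
              get_status: Callable[[str], str],
              download: Callable[[str], None]) -> bool:
    """Poll every job not yet downloaded; fetch results of finished jobs.

    `jobs` maps a job id to its last known status. Returns True when all
    results are downloaded, i.e. when the user can retrieve the full result.
    """
    for job_id, status in jobs.items():
        if status == "downloaded":
            continue
        if get_status(job_id) == "finished":
            download(job_id)
            jobs[job_id] = "downloaded"
    return all(s == "downloaded" for s in jobs.values())


# Simulated Grid: each status request returns the next scripted answer.
scripted = {"job-0": ["running", "finished"], "job-1": ["finished"]}
local_storage: List[str] = []


def fake_status(job_id: str) -> str:
    answers = scripted[job_id]
    return answers.pop(0) if len(answers) > 1 else answers[0]


jobs = {"job-0": "submitted", "job-1": "submitted"}
first = poll_once(jobs, fake_status, local_storage.append)
second = poll_once(jobs, fake_status, local_storage.append)
```

A long-running babysitter would repeat such rounds with a pause between them until the function returns True.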


Conclusion

We provide
• declarative query interface for representing scientific queries
• parallel query execution in the Grid (generating scripts)
• babysitter to keep track of job execution

Query parallelization is important

                      Standalone desktop   Grid, one job   Grid, four jobs
Response time         190 min              225 min         24 min
Requested CPU time    -                    200 min         20 min
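The effect of parallelization can be quantified from the reported response times. The short Python calculation below uses only the three response times from the slide; the variable names are illustrative.

```python
# Response times in minutes, as reported on the slide.
standalone = 190
grid_one_job = 225
grid_four_jobs = 24

speedup_vs_desktop = standalone / grid_four_jobs      # about 7.9
speedup_vs_one_job = grid_one_job / grid_four_jobs    # about 9.4
grid_overhead = grid_one_job - standalone             # 35 minutes

print(f"vs desktop: {speedup_vs_desktop:.1f}x, "
      f"vs one Grid job: {speedup_vs_one_job:.1f}x, "
      f"single-job Grid overhead: {grid_overhead} min")
```

Running a single job in the Grid is slower than the standalone desktop by 35 minutes, while four parallel jobs cut the response time to roughly one eighth of the desktop time.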


Future work

Estimating the execution time of a query
Dealing with underestimation of execution time
Automatically deciding on the degree of parallelism and resource brokering
• adaptive
• based on current load and job statistics
Dealing with failures in the Grid
POOL wrapper


Thank you for your attention! Questions?