Distributed Databases in HEP
Igor A. Gaponenko (LBNL/NERSC) [email protected]
HENPC Group Meeting (LBNL), 10/28/2005



Contents

• Why is that a problem?
  – Mainly to get a smooth start of the talk
• LCG 3D Project
  – Goals, clients, approaches, technologies, status…
• RAL
• ORACLE Streams
• FroNTier
• News from the ROOT databases front!
• The distributed CDB of BaBar (whiteboard drawings?)
• Conclusions


Why is that a problem at all?

• In short, because…
  – Objective reasons: HEP experiments have outgrown the limits of a single computer center, where all data used to be stored and most of the data processing/analysis used to be conducted
  – Subjective reasons: we’ve been “spoiled” by tremendous advances in programming and database technologies (RDBMS, OODBMS)
• Contemporary experiments have distributed:
  – Data processing (reconstruction)
  – Event simulation (production)
  – Physics analysis
• As a result, the data:
  – Get created in many locations simultaneously
  – Get consumed in many more locations (if analysis is counted)
• Not only do we need to distribute the event data; the environment needed to interpret the data also has to be passed around. The typical “environment” includes:
  – Detector geometry
  – Detector alignments
  – Conditions
  – Calibrations
  – Run parameters, configurations, etc.


Why Databases?

• One may ask: why not ship _all_ the environment along with the events?
  – …as it used to be in the “good old” days
• The answer lies in the storage and usage model of the environment data:
  – Significantly more complex and diverse from a structural point of view
  – Produced separately from events
  – The same event (collection) can be produced and/or interpreted in different environments (calibrations are an example), quite often resulting in a one-to-many relationship between events and environments
    • Sometimes the choice has to be made dynamically
  – Besides, storing the environment for each event (or even for each collection) can be too expensive
• The bottom line:
  – Even though the event data and the environment are related to each other, objectively they are produced and distributed in essentially different ways!
• Contemporary databases (as a technology in a broad sense) provide a good mechanism for storing the environment. That mechanism is:
  – Flexible
  – Extendable
  – Tunable
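The one-to-many relationship between events and environments can be illustrated with a minimal sketch. It assumes a hypothetical `calibrations` table keyed by a run range and a version tag; the schema, the names and the numbers are invented for illustration and do not come from any experiment's conditions database. The same run is interpreted with different calibration versions, and the choice is made at lookup time.

```python
import sqlite3

# Hypothetical schema: a calibration is valid for a run range and belongs to a version.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calibrations ("
    " first_run INTEGER, last_run INTEGER, version TEXT, gain REAL)"
)
conn.executemany(
    "INSERT INTO calibrations VALUES (?, ?, ?, ?)",
    [
        (1, 100, "online", 1.00),
        (1, 100, "reprocessed", 1.03),
        (101, 200, "online", 0.98),
    ],
)


def lookup_gain(run, version):
    """Return the gain valid for the run under the chosen version, or None."""
    row = conn.execute(
        "SELECT gain FROM calibrations"
        " WHERE first_run <= ? AND ? <= last_run AND version = ?",
        (run, run, version),
    ).fetchone()
    return row[0] if row else None


# The same event (run 42) interpreted in two different environments.
gain_online = lookup_gain(42, "online")
gain_reprocessed = lookup_gain(42, "reprocessed")
```

Storing the calibration once per run range, rather than once per event, is what keeps the environment small compared with the event data.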


LCG 3D

(some slides borrowed from others)


LCG 3D
• 3D stands for “Distributed Databases Deployment”
• Established: Fall 2004
• Project Leader: Dirk Duellmann (CERN/IT)
• Web site: https://uimon.cern.ch/twiki/bin/view/ADCgroup/LCG3DWiki
• 3 workshops so far:
  – Oct 2004, Jan 2005: technology oriented, evaluations
  – Oct 2005: preparation for large scale deployments
• Joint project between:
  – “Service users” (experiments, s/w projects)
  – “Service providers” (LCG tiers)
• Major clients/participants/contributors:
  – (ATLAS, ALICE, LHCb)
  – CMS goes its own way (FroNTier)!!!

3D: 3 workshops, 3 clients


LCG 3D: Goals

• Given:
  – Experiments are using (or planning to use) RDBMS-s
• 3D is an attempt to introduce common standards and services as a part of LCG, to help experiments with the problem of distributing non-event data
• Declared goals:
  – Define distributed database services and application access allowing LCG applications and services to find relevant database back-ends, authenticate and use the provided data in a location independent way.
  – Help to avoid the costly parallel development of data distribution, backup and high availability mechanisms in each experiment or grid site in order to limit the support costs.
  – Enable a distributed deployment of an LCG database infrastructure with a minimal number of LCG database administration personnel.

Quoted from the LCG 3D Web Site


LCG 3D: Non-Goals

• Store all database data
  – Experiments are free to deploy databases and replicate data under their responsibility
• Setup a single monolithic distributed database system
  – Given constraints like WAN connections one can not assume that a single synchronously updated database would work or give sufficient availability.
• Setup a single vendor system
  – Technology independence and multi-vendor implementation will be required to minimize the long term risks and to adapt to the different requirements/constraints on different tiers.
• Impose a CERN centric infrastructure to participating sites
  – CERN is one equal partner of other LCG sites on each tier
• Decide on an architecture, implementation, new services, policies
  – Produce a technical proposal for all of those to LCG PEB/GDB

Dirk Duellmann’s slide


Supported database technologies (“database services” in 3D)

• ORACLE
  – Tier0 (and perhaps Tier1) sites
• MySQL
  – Tier0 and Tier1 sites
  – Engines: InnoDB (fully ACID compliant) or MyISAM
  – Also available in a server-less mode
• SQLite
  – Tier1+ sites
  – Server-less technology, all in one “database” file
• ROOT I/O
  – Tier1+ sites
  – Not quite a database technology
  – Wasn’t originally in the scope of 3D, but users badly want it!!!


Targeted database applications

• Those used in event reconstruction and/or analysis:
  – Run configurations/parameters
  – Detector description/geometry
  – Detector alignment
  – Conditions (calibrations, constants)
• General kinds:
  – Detector construction
  – Monitoring
  – Bookkeeping
  – LCG LFC catalogs
  – Etc.

Quite often these three are combined into the Conditions/DB


General approach to the distribution

• Generally follows BaBar’s CDB approach (deployed 3 years ago):
  – Writable Master(-s) -> read-only Replicas
    • Simple to manage/synchronize
    • Unlimited scalability (for readers)
• Have writable database(-s) at central location(-s) (Tier 0)
  – Use a reliable technology (Oracle, MySQL/InnoDB)
• Produce copies to be used read-only elsewhere (Tier 1, 2, …)
  – Use “free” database technologies: MySQL, SQLite
  – Translate into a non-database ROOT based format (ALICE)
• Synchronize database installations using LCG 3D services or by other (experiment specific) methods (see subsequent slides for more info)
• An alternative to local database replicas, automatic caches, is also under investigation (by 3D):
  – FroNTier (FNAL)
  – Not much progress so far
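The writable master to read-only replica pattern can be sketched with SQLite alone. This is only an illustration under stated assumptions: the `alignment` table and the file name are hypothetical, an in-memory database stands in for the central master, and the snapshot is taken with the `backup` call of Python's `sqlite3` module rather than with any LCG 3D service.

```python
import os
import sqlite3
import tempfile

# Writable master (stands in for the central Tier 0 database).
master = sqlite3.connect(":memory:")
master.execute("CREATE TABLE alignment (run INTEGER PRIMARY KEY, dx REAL)")
master.executemany("INSERT INTO alignment VALUES (?, ?)", [(1, 0.01), (2, 0.02)])
master.commit()

# Produce a single-file snapshot that could be shipped to another site.
replica_path = os.path.join(tempfile.mkdtemp(), "alignment_replica.db")
snapshot = sqlite3.connect(replica_path)
master.backup(snapshot)  # copies the whole master database into the file
snapshot.close()

# Readers open the replica in read-only mode through an SQLite URI.
replica = sqlite3.connect(f"file:{replica_path}?mode=ro", uri=True)
rows = replica.execute("SELECT run, dx FROM alignment ORDER BY run").fetchall()

try:
    replica.execute("INSERT INTO alignment VALUES (3, 0.03)")
    write_rejected = False
except sqlite3.OperationalError:
    write_rejected = True  # the replica refuses writes
```

Because readers never write, any number of such files can be distributed without coordination; only the master needs a reliable, transactional back-end.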


Starting Point for a Service Architecture?

[Diagram: tiered database deployment with nodes labeled O and M]
• T0: autonomous
• T1: db backbone, all data replicated, reliable service
• T2: local db cache, subset data, only local service
• T3/4
• Legend: Oracle Streams, cross vendor extract, MySQL Files, Proxy Cache



Main issues with the distribution

• A variety of existing database applications
• Database services (ORACLE, MySQL, etc.) aren’t compatible:
  – At the level of SQL standard implementation and database schemas
  – At the level of a common programmatic API
  – Lack of “out-of-the-box, across-the-borders” replication tools
• One of the options suggested by LCG 3D:
  – Introduce RAL, the Relational Abstraction Layer (a sort of ODBC, JDBC)
  – RAL is an (almost) SQL-free C++ “true OO” API
  – Rewrite applications in terms of RAL
  – This makes it easy to implement data distribution based on RAL (one of the methods)

See a separate slideshow
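The idea of replicating through an abstraction layer can be sketched as follows. The interface here is hypothetical and is not the RAL API (RAL itself is C++ and covered in the separate slideshow); `TableStore`, `SqliteStore`, `MemoryStore` and `replicate` are names invented for this illustration. The point is that the copy routine touches only the abstract interface, so it works between any two back-ends.

```python
import sqlite3
from abc import ABC, abstractmethod


class TableStore(ABC):
    """Hypothetical SQL-free interface; not the actual RAL API."""

    @abstractmethod
    def insert(self, table, row):
        ...

    @abstractmethod
    def rows(self, table):
        ...


class SqliteStore(TableStore):
    """Back-end that keeps (key, value) rows in an SQLite database."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)

    def insert(self, table, row):
        # Table names are trusted in this sketch; values are bound as parameters.
        self._conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{table}" (key TEXT, value TEXT)'
        )
        self._conn.execute(f'INSERT INTO "{table}" VALUES (?, ?)', row)

    def rows(self, table):
        query = f'SELECT key, value FROM "{table}" ORDER BY key'
        return self._conn.execute(query).fetchall()


class MemoryStore(TableStore):
    """Back-end that keeps rows in plain Python lists."""

    def __init__(self):
        self._tables = {}

    def insert(self, table, row):
        self._tables.setdefault(table, []).append(tuple(row))

    def rows(self, table):
        return sorted(self._tables.get(table, []))


def replicate(source, target, table):
    """Copy one table between any two back-ends using only the abstract API."""
    for row in source.rows(table):
        target.insert(table, row)


master = SqliteStore()
master.insert("run_config", ("magnet", "on"))
master.insert("run_config", ("trigger", "v7"))
replica = MemoryStore()
replicate(master, replica, "run_config")
```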


Application s/w and Distribution Options

[Diagram: layers labeled client s/w, network, db & cache servers, and db file storage; components labeled APP, RAL, SQLite file, web cache (twice), Oracle, MySQL]

RAL = relational abstraction layer



(All) Distribution Options and Impact on Deployment and Apps

• DB vendor native replication
  – Requires the same (or at least a similar) schema for all applications running against replicas of the database
• Commercial heterogeneous database replication solutions
• Relational abstraction based replication
  – Requires that applications are based on an agreed mapping between different back-ends
    • Possibly enforced by the abstraction layer
    • Otherwise by the application programmer
• Application level replication
  – Requires a common API (or data exchange format) for different implementations of one application
    • E.g. POOL file catalogs, ConditionsDB (MySQL/Oracle)
  – Free to choose the backend database schema to exploit specific capabilities of a database vendor
    • E.g. large table partitioning in the case of the Conditions Database



DB Vendor Native Distribution

• ORACLE
  – Table-to-table via asynchronous “streams” (see next slides)
  – Potentially extensible to other database vendors through an API
    • There seem to be troubles with this (there is a talk about it at the Oct 2005 LCG 3D Workshop)
  – Has been successfully evaluated by CERN/IT
• MySQL
  – A native replication mechanism exists
  – ATLAS has made some progress in testing this in cooperation with 3D
  – BaBar is considering this for the migrated CDB and Configuration databases


(ORACLE) STREAMS Overview

• Flexible feature for information sharing

• Basic elements:

– Capture

– Staging

– Consumption

• Replicate data from one database to one or more databases

• Databases can be non-identical copies

Eva Dafonte Perez’s slide


STREAMS Architecture

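[Diagram: data flow through three stages]
• Capture (SOURCE DATABASE): user changes are logged in the REDO LOG; the CAPTURE PROCESS captures the changes as LCRs and places them in the SOURCE QUEUE
• Staging: events are propagated from the SOURCE QUEUE to the DESTINATION QUEUE
• Consumption (TARGET DATABASE, the replica): the APPLY PROCESS takes LCRs from the DESTINATION QUEUE and applies the changes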
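The capture, staging and consumption stages can be mimicked in a few lines. This is a conceptual sketch only: it uses plain Python containers and none of the Oracle Streams API, and every name in it (`redo_log`, `source_queue`, `apply_changes`, and so on) is chosen for illustration.

```python
from collections import deque

source_db = {}
target_db = {}
redo_log = []                # every user change is logged here first
source_queue = deque()       # staging area on the source side
destination_queue = deque()  # staging area on the target side
captured_upto = 0            # position in the redo log already captured


def user_change(key, value):
    """Apply a change on the source database and log it."""
    source_db[key] = value
    redo_log.append(("upsert", key, value))


def capture():
    """Turn new redo-log entries into change records and stage them."""
    global captured_upto
    for entry in redo_log[captured_upto:]:
        source_queue.append(entry)
    captured_upto = len(redo_log)


def propagate():
    """Move staged change records from the source to the destination queue."""
    while source_queue:
        destination_queue.append(source_queue.popleft())


def apply_changes():
    """Consume change records and apply them to the replica."""
    while destination_queue:
        op, key, value = destination_queue.popleft()
        if op == "upsert":
            target_db[key] = value


user_change(3, "Manuel")
replica_before = dict(target_db)  # asynchronous: nothing has been applied yet
capture()
propagate()
apply_changes()
```

The replica lags behind the source until all three steps have run, which is the sense in which this kind of replication is asynchronous.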



TESTBED Configuration

[Diagram: testbed sites CERN (source database), CNAF, RAL, Sinica, FNAL, GridKA, BNL; FroNtier & OEM are also shown]
• Step 1: the statement
    create table emp (id number, name varchar2,….)
  is executed on the source database, and the EMP table appears at every site
• Step 2: the statement
    insert into emp values ( 03, “Manuel”,….)
  is executed on the source database, and the row (03, Manuel, …) appears in EMP at every site



MySQL replication & clusters

• “Replication”
  – One-way asynchronous replication, similar to what’s found in ORACLE
  – Designed for performance of read-only operations
    • SELECT and similar queries
  – Based on capturing changes stored in a “binary log” file
  – Full and incremental replication supported
  – A chain (tree) of replicas is also possible
  – Provides a foundation for backups that are non-intrusive to the master database
    • Backups require a server shutdown; replication doesn’t. Therefore backups can be made on a slave rather than directly on the master.
• “Cluster”
  – Synchronous replication
  – Designed for performance of both update and read-only operations
    • CREATE, INSERT, UPDATE queries



FroNTier (FNAL)

(slides borrowed)


The FroNtier Project

• Goal: Assemble a toolkit, using standard web technologies, to provide high performance, scalable, database access through a stateless, multi-tier architecture.
• Pilot project Ntier tested the technology:
  – Tomcat, HTTP, Squid
  – Client monitoring w/ existing CDF tools (udp messages)
• FroNtier project was established to provide a production system for CDF and other interested users
• http://whcdf03.fnal.gov/ntier-wiki/FrontPage


FroNtier Overview

[Diagram: four layers in a chain; FroNtier components are shown in yellow]
• Client: FroNtier Client API Library
• Caching: Squid Proxy/Caching Server (HTTP)
• FroNtier Server: FroNtier Servlet running under Tomcat, http://jakarta.apache.org/tomcat/index.html (HTTP)
• Database (or other persistency service) (JDBC)
• Additional labels: CDF Persistent Object Templates (Java), XML Server Descriptors, DDL for Table Descriptions, C++ Headers and Stubs


The FroNtier Servlet

1. Client sends request (URI)
2. Command Parser translates URI into commands + values
3. Servicer Factory gets XSD (XML Server Descriptor) from database and
4. Instantiates a Servicer
5. Servicer queries database and
6. Results sent for encoding
7. Encoder marshals (serializes) the data to requesting client

[Diagram: Client, Command Parser, Servicer Factory, Servicer, Encoder, XSD, Database and Calibration Database, with arrows numbered 1 to 7 matching the steps above]
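The request path through the servlet, with a cache in front of it, can be sketched as below. Everything here is an assumption made for illustration: the URI format, the `server.example` host, the in-memory `DATABASE` and the JSON encoding are hypothetical, since the slides do not show the real FroNtier request syntax or wire format. The cache dictionary stands in for the Squid layer.

```python
import json
from urllib.parse import parse_qs, urlparse

# Hypothetical stand-in for the calibration database.
DATABASE = {
    "CalibRunLists": [
        {"cid": 7, "calib_run": 1001},
        {"cid": 8, "calib_run": 1002},
    ]
}
db_queries = 0  # counts how often the database is actually hit
cache = {}      # stands in for the proxy/caching server


def parse_command(uri):
    """Step 2: translate the URI into a command (object type) and its values."""
    query = parse_qs(urlparse(uri).query)
    return query["type"][0], int(query["cid"][0])


def servicer(object_type, cid):
    """Steps 3 to 5: select the rows requested for this object type."""
    global db_queries
    db_queries += 1
    return [row for row in DATABASE[object_type] if row["cid"] == cid]


def encode(rows):
    """Steps 6 and 7: serialize the result for the client."""
    return json.dumps(rows, sort_keys=True)


def handle(uri):
    """Serve identical URIs from the cache so the database is queried once."""
    if uri not in cache:
        cache[uri] = encode(servicer(*parse_command(uri)))
    return cache[uri]


first = handle("http://server.example/Frontier?type=CalibRunLists&cid=7")
second = handle("http://server.example/Frontier?type=CalibRunLists&cid=7")
```

Because the request is fully described by its URI and the server keeps no session state, a standard web cache can answer repeated requests without contacting the database.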


FroNtier XML Server Descriptor (XSD)

• Object name and version information
• Response description
• The SQL mapping to the database
  – Select statement
  – From statement
  – Where clause
  – Special modifiers (order by, etc)

    <descriptor type="CalibRunLists" version="1" xsdversion="1">
      <attribute position="1" type="int" field="calib_run" />
      <attribute position="2" type="int" field="calib_version" />
      <attribute position="3" type="string" field="data_status" />
      <select> calib_run, calib_version, data_status </select>
      <from> CalibRunLists </from>
      <where>
        <clause> cid = @param </clause>
        <param position="1" type="int" key="cid"/>
      </where>
      <final> </final>
    </descriptor>
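How such a descriptor might be turned into a query can be sketched as follows. The descriptor text is the one from the slide; the rest is assumed for illustration: the substitution of `@param` by an SQL placeholder, the `build_query` helper, the SQLite table and the sample row are not taken from the FroNtier implementation.

```python
import sqlite3
import xml.etree.ElementTree as ET

DESCRIPTOR = """
<descriptor type="CalibRunLists" version="1" xsdversion="1">
  <attribute position="1" type="int" field="calib_run" />
  <attribute position="2" type="int" field="calib_version" />
  <attribute position="3" type="string" field="data_status" />
  <select> calib_run, calib_version, data_status </select>
  <from> CalibRunLists </from>
  <where>
    <clause> cid = @param </clause>
    <param position="1" type="int" key="cid"/>
  </where>
  <final> </final>
</descriptor>
"""


def build_query(descriptor_xml):
    """Return (sql, parameter keys) assembled from the descriptor's SQL mapping."""
    root = ET.fromstring(descriptor_xml)
    select = root.findtext("select").strip()
    table = root.findtext("from").strip()
    clause = root.findtext("where/clause").strip()
    keys = [param.get("key") for param in root.findall("where/param")]
    # Assumption: "@param" marks a bound value; here it becomes a placeholder.
    where = clause.replace("@param", "?")
    return f"SELECT {select} FROM {table} WHERE {where}", keys


sql, keys = build_query(DESCRIPTOR)

# Hypothetical table and row, only to show the generated query running.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CalibRunLists ("
    " cid INTEGER, calib_run INTEGER, calib_version INTEGER, data_status TEXT)"
)
conn.execute("INSERT INTO CalibRunLists VALUES (7, 1001, 2, 'GOOD')")
request = {"cid": 7}
rows = conn.execute(sql, [request[key] for key in keys]).fetchall()
```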


FroNtier client API features

• Compatible with C and C++

• Portable

– 32 and 64 bit systems tested

• Transparent object access

– Type conversion detection

– Preserves data integrity

• Multi-object requests

• Easy runtime configuration

• Extensive error reporting

– Adjustable log levels

[Diagram: User application, FroNtier API, FroNtier Service]


CDF FroNtier Testing at FNAL/SDSC (San Diego Supercomputer Center)

[Diagram: SDSC CAF, SDSC Squid, FNAL Launchpad, CDF Oracle@FNAL]

• SiChipPed (Silicon Chip Pedestals) objects are usually about 0.5 MB, up to 1.7 MB in size
• SvxBeamPosition (Silicon tracker beam position) objects are 502 bytes
• The real savings are also in the reduced DB access.

[Plot: Access times for direct Oracle and Frontier. Two panels, SiChipPed (access time axis from 1e-03 to 1e+01 s) and SvxBeamPosition (access time axis from 1e-03 to 1.0 s), each comparing Oracle and Frontier]


News from the ROOT v5 front
(The ROOT team has made a bid in the distributed DB business)

(slides borrowed from Rene’s talk presented at the October 2005 LCG 3D Workshop)


ROOT File types & Access (SQL implemented in 1999)

[Diagram: the user accesses data through two families of classes]
• TFile, TKey/TTree, TStreamerInfo: X.root files (local file, http, rootd/xrootd, Castor, Dcache, RFIO, Chirp) and X.xml files (local file)
• TSQLServer, TSQLRow, TSQLResult: Oracle, SapDb, PgSQL, MySQL


RDBC (from V. Onuchin) (implemented in 2000)

• RDBC aims for JDBC 2.0 compliance.
  – It contains a set of classes corresponding to the JDBC 2.0 ones: TSQLDriverManager, TSQLConnection, TSQLStatement, TSQLPreparedStatement, TSQLCallableStatement, TSQLResultSet, TSQLResultSetMetadata, TSQLDatabaseMetadata
• RDBC aims for ROOT SQL compliance, e.g. TSQLResult is a subclass of TSQLResultSet
• The RDBC implementation is based on the libodbc++ library (http://orcane.net/freeodbc++) developed by Manush Dodunekov [email protected]
• The connection string can be either JDBC style, i.e. <dbms>://<host>[:<port>][/<database>], or ODBC style (as DSN), e.g. "dsn=minos;uid=scott;pwd=tiger"
• Exception handling is implemented via the ROOT signal-slot communication mechanism.
• RDBC has an interface that allows storing ROOT objects in a relational database as BLOBs.
  – For example, it is possible to store ROOT histograms and trees as cells of an SQL table.
• RDBC provides connection pooling, i.e. reusing opened connections during a ROOT session.
• RDBC has an interface that allows converting TSQLResultSets to ROOT TTrees
• RDBC with Carrot (ROOT Apache Module) allows creating a three-tier architecture.

Used by Phenix and Minos


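File types & Access in 5.04

[Diagram: the user accesses data through two families of classes; TTreeSQL is new on the SQL side]
• TFile, TKey/TTree, TStreamerInfo: X.root files (local file, http, rootd/xrootd, Castor, Dcache, RFIO, Chirp) and X.xml files (local file)
• TSQLServer, TSQLRow, TSQLResult, TTreeSQL: Oracle, SapDb, PgSQL, MySQL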


New RDBMS interface in v5

• New class TTreeSQL
  – Supports TTrees containing branches created using a leaf list (e.g. hsimple.C).
• Access any RDBMS tables from TTree::Draw
• Create a TTree in split mode, creating an RDBMS table and filling it.
• The table can be processed by SQL directly.
• The interface uses the normal I/O engine
  – including support for Automatic Schema Evolution.


TTreeSQL Syntax

• Currently:
  – ROOT:
      TFile *file = new TFile("simple.root","RECREATE");
      TTree *tree; file->GetObject("ntuple",tree);
  – MySQL:
      TSQLServer *dbserver = TSQLServer::Connect("mysql://…",db,user,passwd);
      TTree *tree = new TTreeSQL(dbserver,"rootDev","ntuple");
• Coming:
  – TTree *tree = TTree::Open("root:/simple.root/ntuple");
  – TTree *tree = TTree::Open("mysql://host../rootDev/ntuple");


ROOT & RDBMS: go & nogo

• ROOT interface with RDBMS is minimal

• Because there are many different use cases, we see many users with their own interface that seems appropriate in most cases.

• Because of scalability issues, the move to read-only files in a distributed environment is becoming obvious.

• We prefer to invest in a direction that we believe is very important for data analysis:

– Optimize the use of read-only files in a distributed environment: size, read speed, read ahead & cache, selective reads (rows &columns) with Trees.

– Optimize the performance: xrootd, load balancing, authentication with caching for interaction, robustness.


TArchiveFile and TZIPFile

• TArchiveFile is an abstract class that describes an archive file containing multiple sub-files, like a ZIP or TAR archive.

• The TZIPFile class describes a ZIP archive file containing multiple ROOT sub-files. Notice that the ROOT files should not be compressed when being added to the ZIP file, since ROOT files are normally already compressed. To create the file multi.zip do:

    zip -n root multi file1.root file2.root

• The ROOT files in an archive can be simply accessed like this:

    TFile *f = TFile::Open("multi.zip#file2.root")
  or
    TFile *f = TFile::Open("root://mymachine/multi.zip#2")

• A TBrowser and TChain interface will follow shortly.
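The "store without recompressing" idea behind `zip -n root` can be shown with Python's standard `zipfile` module. This is an analogy only and does not use ROOT: the byte strings are stand-ins for already-compressed ROOT files, and the archive is kept in memory instead of being written as multi.zip.

```python
import io
import zipfile

# Stand-ins for already-compressed ROOT files.
members = {"file1.root": b"payload-one", "file2.root": b"payload-two"}

# ZIP_STORED adds members without compressing them again.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
    for name, payload in members.items():
        archive.writestr(name, payload)

# A single sub-file can be read back by name without unpacking the rest.
with zipfile.ZipFile(buffer) as archive:
    info = archive.getinfo("file2.root")
    stored_uncompressed = info.compress_type == zipfile.ZIP_STORED
    second_file = archive.read("file2.root")
    names = archive.namelist()
```

Since stored members sit in the archive byte for byte, a reader can seek straight to one sub-file, which is what makes a multi-file archive practical for remote access.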


Class TGrid (abstract interface)

    //--- General GRID
    const char *GridUrl() const
    const char *GetGrid() const
    const char *GetHost() const
    const char *GetUser() const
    const char *GetPw() const
    const char *GetOptions() const
    Int_t GetPort() const

    //--- Catalogue Interface
    virtual TGridResult *Command(const char *command, Bool_t interactive = kFALSE, UInt_t stream = kFALSE)
    virtual TGridResult *Query(const char *path, const char *pattern, const char *conditions, const char *options)
    virtual TGridResult *LocateSites()
    virtual TGridResult *ls(const char *ldn = "", Option_t *options = "")
    virtual Bool_t cd(const char *ldn = "", Bool_t verbose = kFALSE)
    virtual Bool_t mkdir(const char *ldn = "", Option_t *options = "")
    virtual Bool_t rmdir(const char *ldn = "", Option_t *options = "")
    virtual Bool_t register(const char *lfn, const char *turl, Long_t size, const char *se, const char *guid)
    virtual Bool_t rm(const char *lfn, Option_t *option = "")

    //--- Job Submission Interface
    virtual TGridJob *Submit(const char *jdl)
    virtual TGridJDL *GetJDLGenerator()

    //--- Load desired plugin and setup connection to GRID
    static TGrid *Connect(const char *grid, const char *uid, const char *pw, const char *options)


Access to File Catalogues

E.g. Alien FC. The same style of interface could be implemented for other GRID file catalogues.

    // Connect (TGrid::Connect returns a pointer)
    TGrid *alien = TGrid::Connect("alien://");

    // Query
    TGridResult *res = alien->Query("/alice/cern.ch/user/p/peters/analysis/miniesd/", "*.root");

    // List of files
    TList *listf = res->GetFileInfoList();

    // Create chain
    TChain chain("Events", "session");
    chain.AddFileInfoList(listf);

    // Start PROOF
    TProof proof("remote");

    // Process your query
    chain.Process("selector.C");

TGrid example with Alien


Replica of a DB subset

[Diagram: a remote TZipFile and a local TZipFile, at T0 and T1, connected via http, xrootd, castor, dcache..]


Current status: Who uses What

• ALICE
  – PostgreSQL:
    • Detector Construction DB
  – ORACLE:
    • Detector Construction (read-only copy at CERN)
  – MySQL:
    • DAQ/ONLINE
  – ROOT files:
    • Condition/DB:
      – Basically the only database required to be distributed
      – Using GRID distributed catalogs service
  – Very little (if any?) use of 3D
• ATLAS
  – Most advanced use of databases compared to others
  – ORACLE, MySQL via RAL for Conditions (COOL), Geometry, DD, POOL catalogs, some POOL collections (via Object-to-Relational mapping)
  – SQLite for distributed Geometry
• LHCb
  – ORACLE:
    • Everything in Tier0 & Tier1, no databases in Tier2+
• CMS (also CDF)
  – Much less developed compared to others
  – FroNtier:
    • For everything?
    • Actual databases are hidden behind the scenes
• BaBar
  – Objectivity/DB (in the process of being phased out)
  – ROOT I/O for all database applications
  – MySQL + ROOT (as BLOB-s) for CDB only, MySQL only for Config/DB


Conclusions

• Database technology is used extensively in (LHC) HEP experiments, and its use keeps growing
• Various RDBMS and non-RDBMS (ROOT) technologies are in use
• Very little progress in establishing common database distribution services
  – LCG 3D doesn’t seem to play the role it potentially could; perhaps it’s just a matter of time(?)
• Very little progress in establishing common standards for database applications and their implementations
  – COOL is the only noticeable exception, though it has its own problems in its conceptual model (not technology neutral, no room for ROOT)
• The bottom line:
  – Still a “zoo”, even within experiments
  – Still, significant progress has been made in understanding how things “should look”
