10/28/2005 Distributed Databases in HEP @ HENPC Group Meeting (LBNL) 1
Distributed Databases in HEP
Igor A. Gaponenko (LBNL/NERSC)
Contents
• Why is that a problem?
– Mainly to get a smooth start of the talk
• LCG 3D Project
– Goals, clients, approaches, technologies, status…
• RAL
• ORACLE Streams
• FroNTier
• News from the ROOT databases front!
• The distributed CDB of BaBar (whiteboard drawings?)
• Conclusions
Why is that a problem at all?
• In short, because…
– Objective reasons: HEP experiments have outgrown the limits of a single computing center, where all data used to be stored and most of the data processing/analysis used to be conducted
– Subjective reasons: we’ve been “spoiled” by tremendous advances in programming and database technologies (RDBMS, OODBMS)
• Contemporary experiments have distributed:
– Data processing (reconstruction)
– Event simulation (production)
– Physics analysis
• As a result, the data:
– Get created in many locations simultaneously
– Get consumed in many more locations (if analysis is counted)
• Not only do we need to distribute the event data; the environment needed to interpret the data has to be passed around as well. The typical “environment” includes:
– Detector geometry
– Detector alignments
– Conditions
– Calibrations
– Run parameters, configurations, etc.
Why Databases?
• One may ask: why not ship _all_ of the environment along with the events?
– …as it used to be done in the “good old” days
• The answer is found in the storage and usage model for the environment data:
– Significantly more complex and diverse from a structural point of view
– Produced separately from events
– The same event (collection) can be produced and/or interpreted in different environments (calibrations are an example), quite often resulting in a one-to-many relationship between events and environments
  • Sometimes the choice has to be made dynamically
– Besides, storing the environment for each event (and even for each collection) can be too expensive
• The bottom line:
– Even though event data and the environment are related to each other, they are produced and distributed in essentially different ways!
• Contemporary databases (as a technology in a broad sense) provide a good mechanism to store the environment. And that mechanism is:
– Flexible
– Extendable
– Tunable
LCG 3D
(some slides borrowed from others)
LCG 3D
• 3D stands for “Distributed Databases Deployment”
• Established: Fall 2004
• Project Leader: Dirk Duellmann (CERN/IT)
• Web site:
– https://uimon.cern.ch/twiki/bin/view/ADCgroup/LCG3DWiki
• 3 workshops so far:
– Oct 2004, Jan 2005: technology oriented, evaluations
– Oct 2005: preparation for large scale deployments
• Joint project between:
– “Service users” (experiments, s/w projects)
– “Service providers” (LCG tiers)
• Major clients/participants/contributors:
– (ATLAS, ALICE, LHCb)
– CMS goes its own way (FroNTier)!!!
(Callout: 3D, 3 Workshops, 3 Clients)
LCG 3D: Goals
• Given:
– Experiments are using (or planning to use) RDBMS-s
• 3D is an attempt to introduce common standards and services as a part of LCG to help experiments with the problem of distributing non-event data
• Declared goals:
– Define distributed database services and application access allowing LCG applications and services to find relevant database back-ends, authenticate and use the provided data in a location independent way.
– Help to avoid the costly parallel development of data distribution, backup and high availability mechanisms in each experiment or grid site in order to limit the support costs.
– Enable a distributed deployment of an LCG database infrastructure with a minimal number of LCG database administration personnel.
Quoted from LCG 3D Web Site
LCG 3D: Non-Goals
• Store all database data
– Experiments are free to deploy databases and replicate data under their responsibility
• Setup a single monolithic distributed database system
– Given constraints like WAN connections one can not assume that a single synchronously updated database would work or give sufficient availability.
• Setup a single vendor system
– Technology independence and multi-vendor implementation will be required to minimize the long term risks and to adapt to the different requirements/constraints on different tiers.
• Impose a CERN centric infrastructure to participating sites
– CERN is one equal partner of other LCG sites on each tier
• Decide on an architecture, implementation, new services, policies
– Produce a technical proposal for all of those to LCG PEB/GDB
Dirk Duellmann’s slide
Supported database technologies (“database services” in 3D)
• ORACLE
– Tier0 (and perhaps Tier1) sites
• MySQL
– Tier0 and Tier1 sites
– Engines: InnoDB (fully ACID compliant) or MyISAM
– Also available in a server-less mode
• SQLite
– Tier1+ sites
– Server-less technology, all in one “database” file
• ROOT I/O
– Tier1+ sites
– Not quite a database technology
(Callout on ROOT I/O: wasn’t originally in the scope of 3D, but users badly want it!!!)
Targeted database applications
• Those used in event reconstruction and/or analysis:
– Run configurations/parameters
– Detector description/geometry
– Detector alignment
– Conditions (calibrations, constants)
• General kinds
– Detector construction
– Monitoring
– Bookkeeping
– LCG LFC catalogs
– Etc.
(Callout: quite often these three are combined into the Conditions/DB)
General approach to the distribution
• Generally follow BaBar’s CDB approach (deployed 3 years ago):
– Writable Master(-s) -> read-only Replicas
  • Simple to manage/synchronize
  • Unlimited scalability (for readers)
• Have writable database(-s) at central location(-s) (Tier 0)
– Use a reliable technology (Oracle, MySQL/InnoDB)
• Produce read-only copies to be used read-only elsewhere (Tier 1, 2, …)
– Use “free” database technologies: MySQL, SQLite
– Translate into a non-database ROOT based format (ALICE)
• Synchronize database installations using LCG 3D services or by other (experiment specific) methods (see subsequent slides for more info)
• An alternative to local database replicas, automatic caches, is also under investigation (by 3D):
– FroNTier (FNAL)
– Not much progress so far
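The "writable master, read-only replicas" pattern described on this slide can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the class names `Master` and `Replica`, the string key/value table and the version counter are all hypothetical, and nothing here is taken from the BaBar CDB or LCG 3D code.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical names throughout; not BaBar CDB or LCG 3D code.
using Table = std::map<std::string, std::string>;  // key -> payload

// Writable master: the only place where updates are accepted.
class Master {
public:
    void put(const std::string& key, const std::string& value) {
        data_[key] = value;
        ++version_;
    }
    // Publish a copy of the current state that replicas may share.
    std::shared_ptr<const Table> snapshot() const {
        return std::make_shared<Table>(data_);
    }
    unsigned version() const { return version_; }
private:
    Table data_;
    unsigned version_ = 0;
};

// Read-only replica: offers lookups only and is refreshed from the master.
class Replica {
public:
    void sync(const Master& m) {
        snapshot_ = m.snapshot();
        version_ = m.version();
    }
    // Returns nullptr when the key is unknown to this replica.
    const std::string* find(const std::string& key) const {
        if (!snapshot_) return nullptr;
        auto it = snapshot_->find(key);
        return it == snapshot_->end() ? nullptr : &it->second;
    }
    unsigned version() const { return version_; }
private:
    std::shared_ptr<const Table> snapshot_;
    unsigned version_ = 0;
};
```

Because a replica has no write path, any number of readers can be served from copies, and a replica simply stays at an older version until its next `sync`.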
Starting Point for a Service Architecture?
[Diagram: tiered layout with nodes labelled O and M]
• T0: autonomous
• T1: db back bone, all data replicated, reliable service
• T2: local db cache, subset data, only local service
• T3/4
• Distribution paths: Oracle Streams, cross vendor extract, MySQL files, proxy cache
Dirk Duellmann’s slide
Main issues with the distribution
• A variety of existing database applications
• Database services (ORACLE, MySQL, etc.) aren’t compatible:
– At the level of implementing SQL standards, database schemas
– At the level of a common programmatic API
– Lack of “out-of-the-box, across-the-borders” replication tools
• One of the options suggested by LCG 3D:
– Introduce RAL, the Relational Abstraction Layer (sort of ODBC, JDBC)
– RAL is an (almost) SQL-free C++ “true OO” API
– Rewrite applications in terms of RAL
– Makes it easy to implement the data distribution based on RAL (one of the methods)
See a separate slideshow
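The idea of an abstraction layer that hides vendor differences, and on top of which replication can be written once, can be sketched as follows. The interface below is hypothetical and is not the real RAL API; `RelationalBackend`, `InMemoryBackend` and `replicate` are names invented for this illustration, and the in-memory back-end merely stands in for an Oracle, MySQL or SQLite implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical interface for illustration; NOT the real RAL API.
using Row = std::map<std::string, std::string>;  // column name -> value

class RelationalBackend {
public:
    virtual ~RelationalBackend() = default;
    virtual void insert(const std::string& table, const Row& row) = 0;
    // Return all rows of 'table' whose 'column' equals 'value'.
    virtual std::vector<Row> select(const std::string& table,
                                    const std::string& column,
                                    const std::string& value) const = 0;
};

// Stand-in back-end keeping rows in memory; a real one would wrap a
// vendor client library behind the same interface.
class InMemoryBackend : public RelationalBackend {
public:
    void insert(const std::string& table, const Row& row) override {
        tables_[table].push_back(row);
    }
    std::vector<Row> select(const std::string& table,
                            const std::string& column,
                            const std::string& value) const override {
        std::vector<Row> out;
        auto t = tables_.find(table);
        if (t == tables_.end()) return out;
        for (const Row& r : t->second) {
            auto c = r.find(column);
            if (c != r.end() && c->second == value) out.push_back(r);
        }
        return out;
    }
private:
    std::map<std::string, std::vector<Row>> tables_;
};

// Replication written once against the interface: copy matching rows
// from any source back-end to any destination back-end.
inline std::size_t replicate(const RelationalBackend& src,
                             RelationalBackend& dst,
                             const std::string& table,
                             const std::string& column,
                             const std::string& value) {
    std::size_t copied = 0;
    for (const Row& r : src.select(table, column, value)) {
        dst.insert(table, r);
        ++copied;
    }
    return copied;
}
```

The application and the `replicate` helper never mention a vendor, which is the property the slide attributes to rewriting applications in terms of an abstraction layer.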
Application s/w and Distribution Options
[Diagram labels: APP, RAL, client s/w, network, web cache (two instances), SQLite file, Oracle, MySQL, db & cache servers, db file storage]
RAL = relational abstraction layer
Dirk Duellmann’s slide
(All) Distribution Options - and Impact on Deployment and Apps
• DB Vendor native replication
– Requires same (or at least similar) schema for all applications running against replicas of the database
• Commercial heterogeneous database replication solutions
• Relational Abstraction based replication
– Requires that applications are based on an agreed mapping between different back-ends
  • Possibly enforced by the abstraction layer
  • Otherwise by the application programmer
• Application level replication
– Requires common API (or data exchange format) for different implementations of one application
  • Eg POOL File catalogs, ConditionsDB (MySQL/Oracle)
– Free to choose backend database schema to exploit specific capabilities of a database vendor
  • Eg large table partitioning in the case of the Conditions Database
Dirk Duellmann’s slide
DB Vendor Native Distribution
• ORACLE
– Table-to-table via asynchronous “streams” (see next slides)
– Potentially extensible to other database vendors through API
  • There seem to be troubles with this (there was a talk on this at the Oct 2005 LCG 3D Workshop)
– Has been successfully evaluated by CERN/IT
• MySQL
– Native replication mechanism exists
– ATLAS has some progress in testing this in a cooperation with 3D
– BaBar is considering this for the migrated CDB and Configuration databases
(ORACLE) STREAMS Overview
• Flexible feature for information sharing
• Basic elements:
– Capture
– Staging
– Consumption
• Replicate data from one database to one or more databases
• Databases can be non-identical copies
Eva Dafonte Perez’s slide
STREAMS Architecture
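[Diagram: three stages, capture, staging and consumption]
• Source database: user changes are logged to the redo log
• Capture process: captures changes from the redo log and places them as LCRs in the source queue
• Propagation: events are propagated from the source queue to the destination queue
• Apply process: takes LCRs from the destination queue and applies the changes to the target database (replica)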
Eva Dafonte Perez’s slide
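The capture, staging and consumption flow in the diagram can be modelled with a toy pipeline. This is a sketch of the shape of the data flow only, not of Oracle's implementation: `ChangeRecord` plays the role of an LCR, and all type and function names are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of capture / staging / consumption.
struct ChangeRecord {          // plays the role of an LCR
    std::string key;
    std::string value;
};

struct Database {
    std::map<std::string, std::string> rows;
    std::vector<ChangeRecord> redoLog;   // every user change is logged here

    void userChange(const std::string& k, const std::string& v) {
        rows[k] = v;
        redoLog.push_back({k, v});
    }
};

using Queue = std::deque<ChangeRecord>;

// Capture: turn redo-log entries not yet seen into queued change records.
// 'captured' remembers how far into the log the capture process has read.
inline void captureChanges(const Database& src, std::size_t& captured,
                           Queue& sourceQueue) {
    for (; captured < src.redoLog.size(); ++captured)
        sourceQueue.push_back(src.redoLog[captured]);
}

// Staging: propagate records from the source to the destination queue.
inline void propagateChanges(Queue& sourceQueue, Queue& destinationQueue) {
    while (!sourceQueue.empty()) {
        destinationQueue.push_back(sourceQueue.front());
        sourceQueue.pop_front();
    }
}

// Consumption: apply queued records, in order, to the target (replica).
inline void applyChanges(Queue& destinationQueue, Database& target) {
    while (!destinationQueue.empty()) {
        const ChangeRecord& c = destinationQueue.front();
        target.rows[c.key] = c.value;
        destinationQueue.pop_front();
    }
}
```

Since the three steps communicate only through queues, they can run at different times, which is what makes this style of replication asynchronous.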
TESTBED Configuration
[Diagram: sites CERN, CNAF, RAL, Sinica, FNAL, GridKA and BNL; a box “CERN SOURCE DATABASE”; a box labelled “FroNtier & OEM”; an EMP table at each site]
Example statements shown at the source database:
create table emp (id number, name varchar2,….)
insert into emp values ( 03, 'Manuel',….)
Afterwards each EMP copy in the diagram shows the row: 03 Manuel …
Eva Dafonte Perez’s slide
MySQL replication & clusters
• “Replication”
– One-way asynchronous replication, similar to what’s found in ORACLE
– Designed for performance of read-only operations
  • SELECT and similar queries
– Based on capturing changes stored in a “binary log” file
– Full and incremental replication supported
– A chain (tree) of replicas is also possible
– Provides a foundation for non-intrusive (to a master database) backups
  • Backups require shutting down a server; replication does not. Backups can therefore be made on a slave rather than directly on the master.
• “Cluster”
– Synchronous replication
– Designed for performance of both update and read-only operations
  • CREATE, INSERT, UPDATE queries
FroNTier (FNAL)
(slides borrowed)
The FroNtier Project
• Goal: Assemble a toolkit, using standard web technologies, to provide high performance, scalable, database access through a stateless, multi-tier architecture.
• Pilot project Ntier tested the technology:
– Tomcat, HTTP, Squid
– Client monitoring w/ existing CDF tools (udp messages)
• FroNtier project was established to provide a production system for CDF and other interested users
• http://whcdf03.fnal.gov/ntier-wiki/FrontPage
FroNtier Overview
[Diagram; FroNtier components are highlighted in yellow on the original slide]
• Layers: Client, Caching, FroNtier Server, Database
• Components: FroNtier Client API Library; Squid Proxy/Caching Server; FroNtier Servlet running under Tomcat; Database (or other persistency service)
• Other labels: CDF Persistent Object Templates (Java); XML Server Descriptors; DDL for Table Descriptions; C++ Headers and Stubs
• Protocols on the links: HTTP, HTTP, JDBC
The FroNtier Servlet
1. Client sends request (URI)
2. Command Parser translates URI into commands + values
3. Servicer Factory gets XSD (XML Server Descriptor) from database and
4. Instantiates a Servicer
5. Servicer queries database and
6. Results sent for encoding
7. Encoder marshals (serializes) the data to requesting client
[Diagram: numbered arrows 1 to 7 connecting Client, Command Parser, Servicer Factory, XSD Database, Servicer, Calibration Database and Encoder]
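Step 2 of the servlet flow, turning a request URI into commands and values, could look roughly like the parser below. The request format `path?key=value&key=value` and the use of a `type` key for the object name are assumptions made for this sketch; the slide does not give the actual FroNtier URI syntax, and `Command` and `parseRequest` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical request format "path?key=value&key=value"; the real
// FroNtier URI syntax is not given on the slide.
struct Command {
    std::string object;                         // requested object type
    std::map<std::string, std::string> params;  // remaining key/value pairs
};

inline Command parseRequest(const std::string& uri) {
    Command cmd;
    // Skip everything up to and including '?'; if there is no '?',
    // treat the whole string as the query part.
    std::string::size_type start = uri.find('?');
    start = (start == std::string::npos) ? 0 : start + 1;
    while (start < uri.size()) {
        std::string::size_type end = uri.find('&', start);
        if (end == std::string::npos) end = uri.size();
        const std::string pair = uri.substr(start, end - start);
        const std::string::size_type eq = pair.find('=');
        if (eq != std::string::npos) {          // pairs without '=' are ignored
            const std::string key = pair.substr(0, eq);
            const std::string value = pair.substr(eq + 1);
            if (key == "type") cmd.object = value;
            else cmd.params[key] = value;
        }
        start = end + 1;
    }
    return cmd;
}
```

Because the whole request is carried in the URI and the server keeps no session state, identical requests produce identical URIs, which is what lets an HTTP cache such as Squid answer repeated requests.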
FroNtier XML Server Descriptor (XSD)
• Object name and version information
• Response description
• The SQL mapping to the database
– Select statement
– From statement
– Where clause
– Special modifiers (order by, etc)
<descriptor type="CalibRunLists" version="1" xsdversion="1">
  <attribute position="1" type="int" field="calib_run" />
  <attribute position="2" type="int" field="calib_version" />
  <attribute position="3" type="string" field="data_status" />
  <select> calib_run, calib_version, data_status </select>
  <from> CalibRunLists </from>
  <where>
    <clause> cid = @param </clause>
    <param position="1" type="int" key="cid"/>
  </where>
  <final> </final>
</descriptor>
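To make the SQL mapping concrete, the sketch below assembles a query from the select, from and where parts of a descriptor like the one above. It is a guess at the general idea rather than the FroNtier servlet's code: the `Descriptor` struct, `isInteger` and `buildQuery` are hypothetical, and the integer check exists only because the descriptor declares the parameter as `type="int"`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory form of the descriptor shown above.
struct Descriptor {
    std::vector<std::string> select;  // columns from <select>
    std::string from;                 // table from <from>
    std::string whereClause;          // e.g. "cid = @param"
};

// Accept only decimal integers, since the descriptor declares type="int".
inline bool isInteger(const std::string& s) {
    if (s.empty()) return false;
    std::string::size_type i = (s[0] == '-') ? 1 : 0;
    if (i == s.size()) return false;
    for (; i < s.size(); ++i)
        if (s[i] < '0' || s[i] > '9') return false;
    return true;
}

// Build the SQL text; returns an empty string if the value is not an int.
inline std::string buildQuery(const Descriptor& d, const std::string& value) {
    if (!isInteger(value)) return "";
    std::string sql = "SELECT ";
    for (std::size_t i = 0; i < d.select.size(); ++i) {
        if (i) sql += ", ";
        sql += d.select[i];
    }
    sql += " FROM " + d.from;
    std::string where = d.whereClause;
    const std::string marker = "@param";
    const std::string::size_type pos = where.find(marker);
    if (pos != std::string::npos) where.replace(pos, marker.size(), value);
    if (!where.empty()) sql += " WHERE " + where;
    return sql;
}
```

Textual substitution is used here only to keep the example short; a production server would more likely bind the value as a query parameter through its database API.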
FroNtier client API features
• Compatible with C and C++
• Portable
– 32 and 64 bit systems tested
• Transparent object access
– Type conversion detection
– Preserves data integrity
• Multi-object requests
• Easy runtime configuration
• Extensive error reporting
– Adjustable log levels
[Diagram labels: User application, FroNtier API, FroNtier Service]
CDF FroNtier Testing at FNAL/SDSC
(San Diego Supercomputer Center)
[Diagram labels: CDF Oracle @ FNAL, FNAL Launchpad, SDSC Squid, SDSC CAF]
• SiChipPed (Silicon Chip Pedestals) objects are usually about 0.5 MB, up to 1.7 MB in size.
• SvxBeamPosition (silicon tracker beam position) objects are 502 bytes.
• The real savings are also in the reduced DB access.
[Plot: access times for direct Oracle and Frontier. Two panels, SiChipPed and SvxBeamPosition, each comparing Oracle and Frontier; axis: access time (s), from 1e-03 to 1e+01 for SiChipPed and from 1e-03 to 1.0 for SvxBeamPosition]
News from the ROOT v5 front
(ROOT team has made a bid in the distributed DB business)
(slides borrowed from Rene’s talk presented at October 2005 LCG 3D Workshop)
ROOT File types & Access (SQL implemented in 1999)
[Diagram labels]
• Classes seen by the user: TFile, TKey/TTree, TStreamerInfo; TSQLServer, TSQLRow, TSQLResult
• Local files: X.root, X.xml
• Remote file access: http, rootd/xrootd, RFIO, Chirp, Castor, Dcache
• RDBMS back-ends: Oracle, SapDb, PgSQL, MySQL
RDBC (from V. Onuchin) (implemented in 2000)
• The RDBC aims for JDBC 2.0 compliance.
– It contains the set of classes corresponding to the JDBC 2.0 ones
– TSQLDriverManager, TSQLConnection, TSQLStatement, TSQLPreparedStatement, TSQLCallableStatement, TSQLResultSet, TSQLResultSetMetadata, TSQLDatabaseMetadata
• The RDBC aims for ROOT SQL compliance, e.g. TSQLResult is a subclass of TSQLResultSet
• The RDBC implementation is based on the libodbc++ library (http://orcane.net/freeodbc++) developed by Manush Dodunekov
• The connection string can be either JDBC style, i.e. <dbms>://<host>[:<port>][/<database>], or
– ODBC style (as DSN), e.g. "dsn=minos;uid=scott;pwd=tiger"
• Exception handling is implemented via the ROOT signal-slot communication mechanism.
• RDBC has an interface which allows ROOT objects to be stored in a relational database as BLOBs.
– For example, it is possible to store ROOT histograms and trees as cells of an SQL table.
• RDBC provides connection pooling, i.e. reusing opened connections during a ROOT session.
• RDBC has an interface which allows TSQLResultSets to be converted to ROOT TTrees
• RDBC with Carrot (ROOT Apache Module) allows a three-tier architecture to be created.
(Callout: used by Phenix and Minos)
File types & Access in 5.04
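[Diagram labels; same layout as the earlier “ROOT File types & Access” slide, with TTreeSQL added]
• Classes seen by the user: TFile, TKey/TTree, TStreamerInfo; TSQLServer, TSQLRow, TSQLResult; TTreeSQL
• Local files: X.root, X.xml
• Remote file access: http, rootd/xrootd, RFIO, Chirp, Castor, Dcache
• RDBMS back-ends: Oracle, SapDb, PgSQL, MySQL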
New RDBMS interface in v5
• New class TTreeSQL
– Supports TTrees containing branches created using a leaf list (e.g. hsimple.C).
• Access any RDBMS tables from TTree::Draw
• Create a TTree in split mode creating a RDBMS table and filling it.
• The table can be processed by SQL directly.
• The interface uses the normal I/O engine
– including support for Automatic Schema Evolution.
TTreeSQL Syntax
• Currently:
– ROOT:
TFile *file = new TFile("simple.root","RECREATE");
TTree *tree; file->GetObject("ntuple",tree);
– MySQL:
TSQLServer *dbserver = TSQLServer::Connect("mysql://…",db,user,passwd);
TTree *tree = new TTreeSQL(dbserver,"rootDev","ntuple");
• Coming:
– TTree *tree = TTree::Open("root:/simple.root/ntuple");
– TTree *tree = TTree::Open("mysql://host../rootDev/ntuple");
ROOT & RDBMS: go & nogo
• ROOT interface with RDBMS is minimal
• Because there are many different use cases, we see many users with their own interface that seems appropriate in most cases.
• Because of scalability issues, the move to read-only files in a distributed environment is becoming obvious.
• We prefer to invest in a direction that we believe is very important for data analysis:
– Optimize the use of read-only files in a distributed environment: size, read speed, read ahead & cache, selective reads (rows & columns) with Trees.
– Optimize the performance: xrootd, load balancing, authentication with caching for interaction, robustness.
TArchiveFile and TZIPFile
• TArchiveFile is an abstract class that describes an archive file containing multiple sub-files, like a ZIP or TAR archive.
• The TZIPFile class describes a ZIP archive file containing multiple ROOT sub-files. Notice that the ROOT files should not be compressed when being added to the ZIP file, since ROOT files are normally already compressed. To create the file multi.zip do:
zip -n root multi file1.root file2.root
• The ROOT files in an archive can be simply accessed like this:
TFile *f = TFile::Open("multi.zip#file2.root")
or
TFile *f = TFile::Open("root://mymachine/multi.zip#2")
• A TBrowser and TChain interface will follow shortly.
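The `archive#member` naming used in the two `TFile::Open` calls above can be illustrated with a small stand-alone helper. This is not ROOT code: `ArchiveRef` and `splitArchiveUrl` are hypothetical names, and treating a purely numeric suffix such as `#2` as a position rather than a file name is an assumption based on the second example.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of ROOT: split "archive#member".
struct ArchiveRef {
    std::string archive;   // everything before the last '#'
    std::string member;    // text after '#', empty if there is none
    bool byIndex;          // true when 'member' consists of digits only
};

inline ArchiveRef splitArchiveUrl(const std::string& url) {
    ArchiveRef ref;
    ref.byIndex = false;
    const std::string::size_type hash = url.rfind('#');
    if (hash == std::string::npos) {   // plain file, no archive member
        ref.archive = url;
        return ref;
    }
    ref.archive = url.substr(0, hash);
    ref.member = url.substr(hash + 1);
    ref.byIndex = !ref.member.empty() &&
        ref.member.find_first_not_of("0123456789") == std::string::npos;
    return ref;
}
```

Keeping the member selector inside the URL means the same string works for a local archive and for one served remotely, which fits the read-only, file-based distribution model argued for on the previous slide.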
Class TGrid (abstract interface)
//--- General GRID
const char *GridUrl() const
const char *GetGrid() const
const char *GetHost() const
const char *GetUser() const
const char *GetPw() const
const char *GetOptions() const
Int_t GetPort() const
//--- Catalogue Interface
virtual TGridResult *Command(const char *command, Bool_t interactive = kFALSE, UInt_t stream = kFALSE)
virtual TGridResult *Query(const char *path, const char *pattern, const char *conditions, const char *options)
virtual TGridResult *LocateSites()
virtual TGridResult *ls(const char *ldn = "", Option_t *options = "")
virtual Bool_t cd(const char *ldn = "", Bool_t verbose = kFALSE)
virtual Bool_t mkdir(const char *ldn = "", Option_t *options = "")
virtual Bool_t rmdir(const char *ldn = "", Option_t *options = "")
virtual Bool_t register(const char *lfn, const char *turl, Long_t size, const char *se, const char *guid)
virtual Bool_t rm(const char *lfn, Option_t *option = "")
//--- Job Submission Interface
virtual TGridJob *Submit(const char *jdl)
virtual TGridJDL *GetJDLGenerator()
//--- Load desired plugin and setup connection to GRID
static TGrid *Connect(const char *grid, const char *uid, const char *pw, const char *options)
Access to File Catalogues
eg Alien FC
(Callout: the same style of interface could be implemented for other GRID File Catalogues)
TGrid example with Alien
// Connect (TGrid::Connect returns a pointer to the loaded plugin)
TGrid *alien = TGrid::Connect("alien://");
// Query
TGridResult *res = alien->Query("/alice/cern.ch/user/p/peters/analysis/miniesd/", "*.root");
// List of files
TList *listf = res->GetFileInfoList();
// Create chain
TChain chain("Events", "session");
chain.AddFileInfoList(listf);
// Start PROOF
TProof proof("remote");
// Process your query
chain.Process("selector.C");
Replica of a DB subset
[Diagram labels: T0, T1, local TZipFile, remote TZipFile; access via http, xrootd, castor, dcache..]
Current status: Who uses What
• ALICE
– PostgreSQL:
  • Detector Construction DB
– ORACLE:
  • Detector Construction (read-only copy at CERN)
– MySQL:
  • DAQ/ONLINE
– ROOT files
  • Condition/DB:
    – Basically the only database required to be distributed
    – Using GRID distributed catalogs service
– Very little (if any?) use of 3D
• ATLAS
– Most advanced use of databases compared to others
– ORACLE, MySQL via RAL for Conditions (COOL), Geometry, DD, POOL catalogs, some POOL collections (via Object-to-Relational mapping)
– SQLite for distributed Geometry
• LHCb
– ORACLE:
  • Everything in Tier0 & Tier1, no databases in Tier2+
• CMS (also CDF)
– Much less developed compared to others
– FroNtier:
  • For everything?
  • Actual databases are hidden behind the scene
• BaBar
– Objectivity/DB (in a process of phasing out)
– ROOT I/O for all database applications
– MySQL + ROOT (as BLOB-s) for CDB only, MySQL only for Config/DB
Conclusions
• Database technology is used extensively in (LHC) HEP experiments, and its use keeps growing
• Various RDBMS and non-RDBMS (ROOT) technologies are in use
• Very little progress in establishing common database distribution services
– LCG 3D doesn’t seem to play the role it potentially could; perhaps it’s just a matter of time(?)
• Very little progress in establishing common standards for database applications and their implementations
– COOL is the only noticeable exception, though it has its own problems in its conceptual model (not technology neutral, no room for ROOT)
• The bottom line:
– Still a “zoo”, even within experiments
– Though significant progress has been made in understanding how things “should look”