File-Metadata Management System For The LHCb Experiment
Carmine Cioffi
Department of Physics, University of Oxford
CHEP04 Interlaken, 27 September 2004
Outline
• What are metadata and why we need them in the LHCb experiment
• The File-Metadata Management System
  – The two-schema strategy
  – XML and the warehousing database
  – Services and specialised views
  – Relationship between the warehousing database and views
  – Web services
• ARDA and future planning
Metadata
• Generally speaking, metadata are data that characterise data files.
• The two facets of metadata:
  – Job provenance: everything you ever wanted to know about how a data file was created.
  – Bookkeeping: how do I identify the datasets I am interested in for my analysis?
• Metadata are needed to get straight to the files of interest, avoiding unnecessary access to the data storage.
The two schema strategy
• The two-schema strategy consists of having a database (the warehousing DB) and a view of it, each with its own schema.
  – The Warehousing DataBase (WDB) is meant to store data in a simple way but be flexible enough to accept new data.
  – The View is designed to be efficient for the service it is made for.
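A minimal sketch of what a key-value warehousing layout could look like, using Python's built-in sqlite3 module rather than the Oracle database used by LHCb. The table names, column names and parameter values are hypothetical and chosen only for illustration. The point is that a parameter nobody planned for can be stored without touching the schema.

```python
import sqlite3

# Hypothetical, simplified layout: names are illustrative, not the LHCb schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (file_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Key-value table: one row per (file, parameter name).
    CREATE TABLE file_params (
        file_id INTEGER NOT NULL REFERENCES files(file_id),
        name    TEXT NOT NULL,
        value   TEXT NOT NULL
    );
""")

def add_file(conn, name, params):
    """Store a file together with an arbitrary set of named parameters."""
    cur = conn.execute("INSERT INTO files (name) VALUES (?)", (name,))
    file_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO file_params (file_id, name, value) VALUES (?, ?, ?)",
        [(file_id, key, str(value)) for key, value in params.items()],
    )
    return file_id

def params_of(conn, file_id):
    """Return all parameters of one file as a dict of strings."""
    rows = conn.execute(
        "SELECT name, value FROM file_params WHERE file_id = ?", (file_id,)
    )
    return dict(rows.fetchall())

first = add_file(conn, "a.dst", {"EventType": "bb", "NbEvents": 500})
# "Luminosity" is a new kind of parameter; no ALTER TABLE is needed to store it.
second = add_file(conn, "b.dst", {"EventType": "bb", "NbEvents": 750, "Luminosity": "2e32"})
```

The price of this flexibility is that every value is stored as text and nothing forces a file to carry a complete set of parameters.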
Entity-Relationship model for WDB
XML and the insertion of data
• Because of its key-value strategy, the WDB is liable to be corrupted:
  – Data with any semantics can be inserted.
  – Partial information can be inserted.
• To prevent this, the data must be presented in XML format. A predefined DTD/XML Schema then makes it possible to verify the correctness of the data.
The DTD for the insertion of job-related metadata

<!ELEMENT Job ((JobOption|TypedParameter|InputFile|OutputFile)*)>
<!ELEMENT JobOption EMPTY>
<!ELEMENT TypedParameter EMPTY>
<!ELEMENT InputFile EMPTY>
<!ELEMENT OutputFile ((Parameter|Quality)*)>
<!ELEMENT Parameter EMPTY>
<!ELEMENT Quality (Parameter*)>

<!ATTLIST Job            ConfigName    CDATA #REQUIRED
                         ConfigVersion CDATA #REQUIRED
                         Date          CDATA #REQUIRED>
<!ATTLIST JobOption      Recipient     CDATA #REQUIRED
                         Name          CDATA #REQUIRED
                         Value         CDATA #REQUIRED>
<!ATTLIST TypedParameter Name          CDATA #REQUIRED
                         Value         CDATA #REQUIRED
                         Type          (Info|Environment_Variable) #REQUIRED>
<!ATTLIST InputFile      Name          CDATA #REQUIRED>
<!ATTLIST OutputFile     Name          CDATA #REQUIRED
                         TypeName      CDATA #REQUIRED
                         TypeVersion   CDATA #REQUIRED>
<!ATTLIST Parameter      Name          CDATA #REQUIRED
                         Value         CDATA #REQUIRED>
<!ATTLIST Quality        Group         CDATA #REQUIRED
                         Flag          CDATA #REQUIRED>
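To make the DTD concrete, here is a rough sketch of a document that follows it, plus a hand-written check of the same rules. Python's standard library does not validate against a DTD, so the `REQUIRED` and `CHILDREN` tables below are transcribed by hand from the declarations above; a production system would presumably rely on a validating parser instead. All attribute values in the sample documents are made up.

```python
import xml.etree.ElementTree as ET

# Required attributes per element, transcribed from the ATTLIST declarations.
REQUIRED = {
    "Job": {"ConfigName", "ConfigVersion", "Date"},
    "JobOption": {"Recipient", "Name", "Value"},
    "TypedParameter": {"Name", "Value", "Type"},
    "InputFile": {"Name"},
    "OutputFile": {"Name", "TypeName", "TypeVersion"},
    "Parameter": {"Name", "Value"},
    "Quality": {"Group", "Flag"},
}
# Allowed child elements, transcribed from the ELEMENT declarations.
CHILDREN = {
    "Job": {"JobOption", "TypedParameter", "InputFile", "OutputFile"},
    "OutputFile": {"Parameter", "Quality"},
    "Quality": {"Parameter"},
}

def check(element, errors=None):
    """Collect rule violations; a hand-rolled check, not a real DTD validator."""
    errors = [] if errors is None else errors
    tag = element.tag
    if tag not in REQUIRED:
        errors.append(f"unknown element <{tag}>")
        return errors
    missing = REQUIRED[tag] - set(element.attrib)
    if missing:
        errors.append(f"<{tag}> is missing {sorted(missing)}")
    if tag == "TypedParameter" and element.get("Type") not in (None, "Info", "Environment_Variable"):
        errors.append("<TypedParameter> has an invalid Type")
    for child in element:
        if child.tag not in CHILDREN.get(tag, set()):
            errors.append(f"<{child.tag}> is not allowed inside <{tag}>")
        else:
            check(child, errors)
    return errors

def validate(xml_text):
    """Parse a document and return a list of problems (empty if none found)."""
    root = ET.fromstring(xml_text)
    if root.tag != "Job":
        return [f"root element must be <Job>, got <{root.tag}>"]
    return check(root)

GOOD = """<Job ConfigName="DC04" ConfigVersion="v1" Date="2004-09-27">
  <TypedParameter Name="Host" Value="node01" Type="Info"/>
  <InputFile Name="in.sim"/>
  <OutputFile Name="out.dst" TypeName="DST" TypeVersion="1">
    <Parameter Name="NbEvents" Value="500"/>
    <Quality Group="Production" Flag="OK"><Parameter Name="Note" Value="none"/></Quality>
  </OutputFile>
</Job>"""
# Partial information: exactly the kind of input the WDB must reject.
BAD = '<Job ConfigName="DC04"><OutputFile Name="out.dst"/></Job>'

good_errors = validate(GOOD)
bad_errors = validate(BAD)
```

The second document illustrates the "partial information" case: it parses as XML but lacks attributes that the DTD marks as #REQUIRED.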
Services and the specialised views
• Complex SQL queries sometimes perform poorly for bulk lookups.
  – But the WDB contains all the information about the files, which can be used to generate specialised views for specific services.
• Knowing the service, the views can be optimised to give the best performance.
Example of view with service and applications
• This example shows the specialised view that sits behind the XML-RPC and servlet services.
• These services are used by GANGA and the web browser.
[Diagram: specialised view schema]
  Replica: FILE_ID, REPLICA, LOCATION
  DT_JobSummary: JOB_ID, CONFIG, DBVERSION, EVENTTYPE, JOBDATE, LABORATORY, PROGRAM0, INPUTFILE0, PROGRAM1, INPUTFILE1, PROGRAM2, INPUTFILE2
  DT_FileSummary: FILE_ID, JOB_ID, EVENTTYPE, EVENTDESCRIPTION, NBEVENTS, FILETYPE, FILENAME, FILESIZE
[Diagram labels: Jython, Web Server, SERVLETS, XMLRPC, Web Browser, GANGA application]
Generation of the specialised View
• Done periodically or on demand, based on the needs of the experiment (every night for LHCb). This is fast even though the WDB contains many GB.
[Diagram: Warehouse DB → SQL script → Specialised View]
  Warehouse DB tables: Jobs, JobParams, Files, FileParams, TypeParams, QualityParams
  Warehouse DB column labels shown: ConfigName, ConfigVersion, Date, Name, Value, Type, LogName
  Specialised View tables:
    Replica: FILE_ID, REPLICA, LOCATION
    DT_JobSummary: JOB_ID, CONFIG, DBVERSION, EVENTTYPE, JOBDATE, LABORATORY, PROGRAM0, INPUTFILE0, PROGRAM1, INPUTFILE1, PROGRAM2, INPUTFILE2
    DT_FileSummary: FILE_ID, JOB_ID, EVENTTYPE, EVENTDESCRIPTION, NBEVENTS, FILETYPE, FILENAME, FILESIZE
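The generation step can be pictured as a script that pivots key-value rows into one flat, indexed table. The sketch below uses sqlite3 and a hypothetical, much reduced warehouse layout; the actual LHCb SQL script is not shown in the slides, so the statements, the parameter names and the sample rows are assumptions. Only the target columns (FILE_ID, JOB_ID, FILENAME, EVENTTYPE, NBEVENTS, FILESIZE) are taken from the DT_FileSummary table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Minimal stand-in for the warehouse side (hypothetical layout and data).
    CREATE TABLE files (file_id INTEGER PRIMARY KEY, job_id INTEGER, name TEXT);
    CREATE TABLE file_params (file_id INTEGER, name TEXT, value TEXT);
    INSERT INTO files VALUES (1, 10, 'a.dst'), (2, 10, 'b.dst');
    INSERT INTO file_params VALUES
        (1, 'EventType', 'bb'), (1, 'NbEvents', '500'),
        (2, 'EventType', 'bb'), (2, 'NbEvents', '750'), (2, 'FileSize', '1200');
""")

# Rebuild the flat table from scratch: one row per file, one column per parameter.
REBUILD_VIEW = """
    DROP TABLE IF EXISTS dt_filesummary;
    CREATE TABLE dt_filesummary AS
    SELECT f.file_id AS file_id,
           f.job_id  AS job_id,
           f.name    AS filename,
           MAX(CASE WHEN p.name = 'EventType' THEN p.value END) AS eventtype,
           MAX(CASE WHEN p.name = 'NbEvents' THEN CAST(p.value AS INTEGER) END) AS nbevents,
           MAX(CASE WHEN p.name = 'FileSize' THEN CAST(p.value AS INTEGER) END) AS filesize
    FROM files f LEFT JOIN file_params p ON p.file_id = f.file_id
    GROUP BY f.file_id, f.job_id, f.name;
    -- Index the column the service is expected to filter on.
    CREATE INDEX idx_filesummary_eventtype ON dt_filesummary (eventtype);
"""

def rebuild_view(conn):
    """Regenerate the specialised view; safe to call repeatedly (e.g. nightly)."""
    conn.executescript(REBUILD_VIEW)

rebuild_view(conn)
rows = conn.execute(
    "SELECT filename, nbevents, filesize FROM dt_filesummary "
    "WHERE eventtype = 'bb' ORDER BY file_id"
).fetchall()
```

A file without a given parameter simply gets NULL in that column, so the service can query a plain table without joining against the key-value rows.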
Some Numbers
• LHCb is using Oracle 9i technology for its DB
  – It is hosted on a cluster of two ‘Sun Fire 280R’ machines
  – Each with two 750 MHz processors
  – 2 GB RAM
  – 600 GB HD
• The DB contains ~20 GB of data
  – Shared between real data and indexing tables
  – ~2M job rows
  – ~5.5M file rows
  – ~57M rows in parameters
LHCb services
• LHCb currently uses two services to access the information in the databases:
  – Servlet service:
    • allows the selection of datasets based on their history (job provenance) from a web browser.
  – XML-RPC service:
    • gives access to the WDB data and allows it to be modified
    • allows GANGA to access Bookkeeping data.
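As an illustration of the XML-RPC style of access, the sketch below starts a tiny server with Python's standard xmlrpc modules and queries it from a client in the same process. The method name `files_for_event_type` and the in-memory data are hypothetical; the real LHCb service interface is not described in the slides.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Stand-in for the specialised view; a real service would query the database.
FILE_SUMMARY = [
    {"FILENAME": "a.dst", "EVENTTYPE": "bb", "NBEVENTS": 500},
    {"FILENAME": "b.dst", "EVENTTYPE": "bb", "NBEVENTS": 750},
    {"FILENAME": "c.dst", "EVENTTYPE": "minbias", "NBEVENTS": 900},
]

def files_for_event_type(event_type):
    """Hypothetical remote method: list the file names for one event type."""
    return [row["FILENAME"] for row in FILE_SUMMARY if row["EVENTTYPE"] == event_type]

# Port 0 asks the operating system for any free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(files_for_event_type)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side, roughly as an application such as GANGA might call the service.
with ServerProxy(f"http://127.0.0.1:{port}/") as proxy:
    names = proxy.files_for_event_type("bb")

server.shutdown()
server.server_close()
```

Because XML-RPC only carries simple types (strings, numbers, lists, structs), the client needs no knowledge of the database schema behind the service.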
Collaboration with ARDA
• LHCb has started a collaboration with ARDA:
  – Definition of metadata and understanding of LHCb requirements
  – Elaboration of a new interface for the manipulation of file metadata
  – Possible technology (WSDL)
  – See how this will fit with the already existing LHCb system
• Stress-test the Bookkeeping services, analysing various behaviours:
  – Different numbers of clients
  – Different queries
  – Comparison with direct RPC calls
• Implement the newly defined interface
  – Using the current LHCb File-Metadata DB as back-end
  – Using the technology developed with ARDA
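One possible shape for the stress test with different numbers of clients is sketched below. It is an assumption about how such a test could be written, not the procedure used with ARDA: `fake_query` is a placeholder that merely sleeps, and would be replaced by a real call to the Bookkeeping service.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_query(event_type):
    """Stand-in for a call to the Bookkeeping service (hypothetical)."""
    time.sleep(0.01)  # pretend the service needs about 10 ms
    return event_type

def run_clients(query, n_clients, calls_per_client, argument):
    """Run n_clients concurrent callers; return every call's latency in seconds."""
    def client(client_index):
        latencies = []
        for _ in range(calls_per_client):
            start = time.perf_counter()
            query(argument)
            latencies.append(time.perf_counter() - start)
        return latencies

    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        per_client = list(pool.map(client, range(n_clients)))
    return [t for latencies in per_client for t in latencies]

# Same query, increasing number of simultaneous clients.
results = {n: run_clients(fake_query, n, 5, "bb") for n in (1, 4, 16)}
for n, latencies in results.items():
    mean = sum(latencies) / len(latencies)
    print(f"{n:2d} clients: {len(latencies)} calls, mean latency {mean:.4f} s")
```

Passing a different `query` callable (for example one that talks to the database directly) would give the comparison between the service and direct calls.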
CONCLUSIONS
• The two-schema strategy works well for LHCb, and DC04 proved its flexibility: no changes to the WDB were required even though new data were stored.
• Because of the key-value nature of the WDB, it can easily be adapted for warehousing any data, including that of other experiments.