Virtual Data in CMS Analysis
A.Arbree, P.Avery, D.Bourilkov,
R.Cavanaugh, G.Graham, J.Rodriguez, M.Wilde, Y.Zhao
CMS & GriPhyN
CHEP03, La Jolla, California
March 25, 2003
D.Bourilkov Virtual Data in CMS Analysis 2
We already do this, but manually!
Virtual Data
Webster dictionary: vir·tu·al. Function: adjective. Etymology: Middle English, possessed of certain physical virtues, from Medieval Latin virtualis, from Latin virtus strength, virtue
• Most scientific data are not simple “measurements” but are produced from increasingly complex computations (e.g. reconstructions, calibrations, selections, simulations, fits etc.)
• HEP (and other sciences) increasingly CPU/Data intensive
• Programs are significant community resources (transformations)
• So are the executions of those programs (derivations)
• Management of dataset transformations important!
  • Derivation: Instantiation of a potential data product
  • Provenance: Exact history of any existing data product
Virtual Data Motivations
[Diagram: Transformation, Derivation and Data as nodes, connected by the relations product-of, execution-of and consumed-by/generated-by]
“I’ve detected a muon calibration error and want to know which derived data products need to be recomputed.”
“I’ve found some interesting data, but I need to know exactly what corrections were applied before I can trust it.”
“I want to search a database for 3 rare electron events. If a program that does this analysis exists, I won’t have to write one from scratch.”
“I want to apply a forward jet analysis to 100M events. If the results already exist, I’ll save weeks of computation.”
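The muon calibration question above amounts to a downstream walk over recorded derivations. A minimal sketch in Python, assuming provenance is available as a mapping from derivation name to its input and output files. The derivation and file names are hypothetical and this is not the CHIMERA API.

```python
from collections import deque

# Hypothetical provenance records: derivation name -> (input files, output files).
DERIVATIONS = {
    "calibrate_muons": (["raw_hits"], ["muon_calib"]),
    "reconstruct": (["raw_hits", "muon_calib"], ["reco_events"]),
    "select_dimuon": (["reco_events"], ["dimuon_sample"]),
    "electron_plots": (["electron_sample"], ["electron_hist"]),
}


def products_to_recompute(bad_file, derivations):
    """Return every data product that depends, directly or indirectly, on bad_file."""
    affected = set()
    queue = deque([bad_file])
    while queue:
        current = queue.popleft()
        for inputs, outputs in derivations.values():
            if current in inputs:
                for out in outputs:
                    if out not in affected:
                        affected.add(out)
                        queue.append(out)
    return sorted(affected)


print(products_to_recompute("muon_calib", DERIVATIONS))
# ['dimuon_sample', 'reco_events']
```

Products that never consumed the faulty calibration (here the electron histogram) are left untouched.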
Virtual Data Motivations
• Data track-ability and result audit-ability: “Virtual Logbook”
  • In the nature of science
  • Reproducibility of results
• Tools and data sharing and collaboration (data with “recipe”)
  • Individuals discover other scientists’ work and build from it
  • Different teams can work in a modular, semi-autonomous fashion: reuse previous data/code/results or entire analysis chains
• Repair and correction of data – cf. “make”
• Workflow management, performance optimization: data staged-in from remote site OR re-created locally on demand?
• Transparency with respect to location and existence
Introducing CHIMERA: The GriPhyN Virtual Data System
Virtual Data Language
• textual (concise, for human consumption)
• XML (uses XML schema, for component integration)
Virtual Data Interpreter
• implemented in Java
• Java API and command-line toolkit
Virtual Data Catalog
• tracks data provenance (acts like a metadata repository)
• different back-ends for persistency: PostgreSQL and MySQL databases; file-based (for easy testing)
Virtual Data in CHIMERA
A “function call” paradigm
Virtual data: data objects with a well-defined method of (re)production
Transformation [namespace]::identifier:[version]
• Abstract description of how a script/executable is invoked
• Similar to a "function declaration" in C/C++
Derivation [namespace]::identifier:[version range]
• Invocation of a transformation with specific arguments
• Similar to a "function call" in C/C++
• Can be either past or future
  • a record of how logical files were produced
  • a recipe for creating logical files at some point in the future
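The declaration/call analogy above can be made concrete with two small classes. This is an illustrative Python sketch only: the class and field names are assumptions and do not reflect CHIMERA's Java API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    """Like a function declaration: names an executable and its formal arguments."""
    name: str
    formal_args: tuple
    namespace: str = ""
    version: str = ""


@dataclass
class Derivation:
    """Like a function call: binds the formal arguments to actual values."""
    name: str
    transformation: Transformation
    actual_args: dict
    executed: bool = False  # False: recipe for the future; True: record of the past

    def __post_init__(self):
        # A call must supply every argument the declaration asks for.
        missing = set(self.transformation.formal_args) - set(self.actual_args)
        if missing:
            raise ValueError(f"unbound arguments: {sorted(missing)}")

    def describe(self):
        kind = "record" if self.executed else "recipe"
        return f"{kind} {self.name} -> {self.transformation.name}"


pythia = Transformation("pythia", ("a1", "a2"))
x1 = Derivation("x1", pythia, {"a1": "file1", "a2": "file2"})
print(x1.describe())  # recipe x1 -> pythia
```

Flipping `executed` to `True` turns the same object from a recipe into a provenance record, which is the "past or future" duality of a derivation.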
Virtual Data Language
Build-style recipe
TR pythia( out a2, in a1, none param="160.0" )
{
  argument arg = ${param};
  argument file = ${a1};
  argument file = ${a2};
}
TR cmsim( out a2, in a1[] )
{
  argument files = ${a1};
  argument file = ${a2};
}
DV x1->pythia( a2=@{out:file2}, a1=@{in:file1} );
DV x2->cmsim( a2=@{out:file3}, a1=[@{in:file2}, @{in:cardfile}] );
Make-style recipe
[Diagram: file1 → x1 → file2; file2, cardfile → x2 → file3]
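The dependency between x1 and x2 is not written down anywhere in the VDL: it follows from x2 consuming file2, which x1 produces. A minimal Python sketch of how such an ordering could be inferred, using the standard-library `graphlib` module (Python 3.9+). The dictionary layout is an assumption made for illustration, not CHIMERA's internal representation.

```python
from graphlib import TopologicalSorter

# The two derivations from the slide, reduced to their logical input/output files.
DERIVATIONS = {
    "x1": {"in": ["file1"], "out": ["file2"]},
    "x2": {"in": ["file2", "cardfile"], "out": ["file3"]},
}


def execution_order(derivations):
    """Order derivations so that each runs after the producers of its inputs."""
    # Map every output file to the derivation that produces it.
    producer = {f: name for name, dv in derivations.items() for f in dv["out"]}
    # A derivation depends on the producers of those inputs that are produced at all.
    graph = {
        name: {producer[f] for f in dv["in"] if f in producer}
        for name, dv in derivations.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(DERIVATIONS))  # ['x1', 'x2']
```

Inputs with no producer (file1, cardfile) are treated as pre-existing files.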
Abstract and Concrete DAGs
Abstract DAXs (Virtual Data DAG)
abstract directed acyclic graph with logical names for files/executables (complete build-style recipe as DAX)
– Resource locations unspecified
– File names are logical
– Data destinations unspecified
Concrete DAGs (stuff for DAGMan)
CONDOR style DAG for grid execution (check RC, skip steps, make-style)
– Resource locations determined
– Physical file names specified
– Data delivered to and returned from physical locations
[Diagram: planning chain with the components VDL, VDC, Abs. Plan, DAX, C. Plan., RC, DAG and DAGMan; labels: Logical, Physical, XML]
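The "check RC, skip steps, make-style" behaviour of concrete planning can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the replica catalog (RC) is modelled as a plain dictionary from logical to physical file names, the URLs are hypothetical, and the function is not the actual planner.

```python
# Abstract jobs: name -> (logical input files, logical output files).
ABSTRACT_JOBS = {
    "x1": (["file1"], ["file2"]),
    "x2": (["file2", "cardfile"], ["file3"]),
}

# Hypothetical replica catalog: logical file name -> physical location.
REPLICA_CATALOG = {
    "file1": "gsiftp://localhost/mydir/file1",
    "cardfile": "gsiftp://localhost/mydir/cardfile",
}


def concretize(abstract_jobs, replica_catalog, wanted):
    """Return the jobs that must run to produce `wanted`, in dependency order."""
    producer = {f: job for job, (_, outs) in abstract_jobs.items() for f in outs}
    plan = []

    def need(lfn):
        if lfn in replica_catalog:  # a physical copy exists: skip this branch
            return
        job = producer.get(lfn)
        if job is None:
            raise LookupError(f"no replica and no recipe for {lfn}")
        if job in plan:  # already scheduled
            return
        for dep in abstract_jobs[job][0]:
            need(dep)
        plan.append(job)

    need(wanted)
    return plan


print(concretize(ABSTRACT_JOBS, REPLICA_CATALOG, "file3"))  # ['x1', 'x2']
```

Once file2 is registered in the catalog, the same request yields only `['x2']`, which is the make-style shortcut.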
Nitty-Gritty
Transformation catalog (expects pre-built executables)
#poolname  ltransformation  physical transformation       environment String
local      hw               /bin/echo                     null
local      pythcvs          /workdir/lhc-h-6-cvs          null
local      pythlin          /workdir/lhc-h-6-link         null
local      pythgen          /workdir/lhc-h-6-run          null
local      pythtree         /workdir/h2root.sh            null
local      pythview         /workdir/root.sh              null
local      GriphynRC        /vdshome/bin/replica-catalog  JAVA_HOME=/vdt/jdk1.3;VDS_HOME=/vdshome
local      globus-url-copy  /vdt/bin/globus-url-copy      GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
ufl        hw               /bin/echo                     null
ufl        GriphynRC        /vdshome/bin/replica-catalog  JAVA_HOME=/vdt/jdk1.3.1_04;VDS_HOME=/vdshome
ufl        globus-url-copy  /vdt/bin/globus-url-copy      GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
Pool configuration
#pool  universe  job-manager-string              url-prefix                workdir ...
ufl    vanilla   testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir   /mydir
ufl    standard  testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir   /mydir
ufl    globus    testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir   /mydir
ufl    transfer  testulix/jobmanager             gsiftp://testulix/mydir   /mydir
local  vanilla   localhost/jm-condor             gsiftp://localhost/mydir  /mydir
local  globus    localhost/jm-condor             gsiftp://localhost/mydir  /mydir
local  transfer  localhost/jobmanager            gsiftp://localhost/mydir  /mydir
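A transformation catalog in this layout can be read into a lookup table keyed by (pool, logical transformation). The Python sketch below assumes the format suggested by the listing above: whitespace-separated columns, `#` for comment lines, `null` for an empty environment, and `;` between environment settings. The real file grammar may differ.

```python
CATALOG_TEXT = """\
#poolname ltransformation physical-transformation environment-string
local hw /bin/echo null
local globus-url-copy /vdt/bin/globus-url-copy GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
ufl hw /bin/echo null
"""


def parse_transformation_catalog(text):
    """Map (pool, logical name) -> (physical path, environment dict)."""
    catalog = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        pool, logical, physical, env = line.split(None, 3)
        environment = {}
        if env != "null":
            for pair in env.split(";"):
                key, _, value = pair.partition("=")
                environment[key] = value
        catalog[(pool, logical)] = (physical, environment)
    return catalog


catalog = parse_transformation_catalog(CATALOG_TEXT)
print(catalog[("local", "globus-url-copy")][0])  # /vdt/bin/globus-url-copy
```

Keying on the pool as well as the logical name is what lets the same logical transformation resolve to different executables or environments at different sites.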
Data Analysis in HEP
• Decentralized, “chaotic”
• The system must be flexible enough to accommodate a large user base and use cases that we can’t foresee
• Ability to build scripts/executables “on the fly”, including user supplied code/parameters (possibly linking with preinstalled libraries on the execution sites)
Prototypes
First for SC2002, second for CHEP03
[Diagram: prototype workflow. Labels: CVS, CVS tag, FORTRAN code, C++ code, datacards, libraries version N, compile/link, executable, PYTHIA wrapper, ntuples, h2root, root trees, root wrapper, plots, event displays]
Prototypes
CHIMERA/ROOT prototype for generating events with PYTHIA/CMKIN, histogramming and visualization
[Diagram: tree of derived datasets, each node labelled by its parameters:
mass = 160
mass = 160; decay = WW
mass = 160; decay = ZZ
mass = 160; decay = bb
mass = 160; plot = 1
mass = 160; event = 8
mass = 160; decay = WW; WW → leptons
mass = 160; decay = WW; WW → e μ
mass = 160; decay = WW; plot = 1
mass = 160; decay = WW; event = 8
mass = 160; decay = WW; WW → e μ; plot = 1
mass = 160; decay = WW; WW → e μ; event = 8]
A virtual space of simulated data is created for future use by scientists...
Search for WW decays of the Higgs boson where the Ws decay to electron and muon: mass = 160; decay = WW; WW → e μ
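Such a search over the virtual data space can be pictured as attribute matching on catalog entries. A minimal Python sketch: the entries mirror the node labels of the diagram, while the key name `channel` and its values are hypothetical stand-ins for the W decay mode. This is not the CHIMERA catalog query interface.

```python
# Hypothetical catalog entries, one per node of the virtual data space.
CATALOG = [
    {"mass": 160},
    {"mass": 160, "decay": "WW"},
    {"mass": 160, "decay": "ZZ"},
    {"mass": 160, "decay": "bb"},
    {"mass": 160, "decay": "WW", "channel": "leptons"},
    {"mass": 160, "decay": "WW", "channel": "e mu"},
    {"mass": 160, "decay": "WW", "channel": "e mu", "plot": 1},
    {"mass": 160, "decay": "WW", "channel": "e mu", "event": 8},
]


def search(catalog, **criteria):
    """Return the entries whose attributes match every given criterion."""
    return [
        entry
        for entry in catalog
        if all(entry.get(key) == value for key, value in criteria.items())
    ]


hits = search(CATALOG, mass=160, decay="WW", channel="e mu")
print(len(hits))  # 3
```

The three hits are the dataset itself plus the plot and the event display derived from it.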
Scientist obtains an interesting result and wants to track how it was derived.
Now the scientist wants to dig deeper...
[Diagram: the virtual data space from the previous slides, extended with a new node: mass = 160; decay = WW; WW → e μ; Pt > 20]
...The scientist adds a new derived data branch...
...and continues to investigate!
A Collaborative Data-flow Development Environment: Complex Data Flow and Data Provenance in HEP
[Diagram: data tiers Raw, ESD, AOD, TAG; outputs Plots, Tables, Fits (shown twice) and Comparisons; inputs labelled Real Data and Simulated Data]
History of a Data Analysis (like CVS)
"Check-point" a Data Analysis
Analysis Development Environment
Audit a Data Analysis
Outlook
• Work in progress both on CHIMERA & CMS sides – a “snapshot”
• A CHIMERA/ROOT prototype for building executables “on the fly”, generating events with PYTHIA/CMKIN, plotting and visualization available (CHIMERA is a great integration tool)
• The full CMS Monte Carlo chain is working under CHIMERA (next talk)
• Possible future directions:
  • Workflow management; automatic generation; inheritance …
  • Store metadata about derivations (like annotations) in a searchable catalog
  • Handle Datasets, not just Logical File Names
  • Integration with CLARENS (remote access), with ROOT/PROOF (run in parallel)
A picture is better than 1000 words: Prototype Demo