Virtual Data in CMS Analysis A.Arbree, P.Avery, D.Bourilkov , R.Cavanaugh, G.Graham, J.Rodriguez, M.Wilde, Y.Zhao CMS & GriPhyN CHEP03, La Jolla, California March 25, 2003

Virtual Data in CMS Analysis

A.Arbree, P.Avery, D.Bourilkov,

R.Cavanaugh, G.Graham, J.Rodriguez, M.Wilde, Y.Zhao

CMS & GriPhyN

CHEP03, La Jolla, California

March 25, 2003

D.Bourilkov Virtual Data in CMS Analysis 2

Virtual Data

Webster dictionary: vir·tu·al
Function: adjective
Etymology: Middle English, possessed of certain physical virtues, from Medieval Latin virtualis, from Latin virtus strength, virtue

• Most scientific data are not simple “measurements” but are produced from increasingly complex computations (e.g. reconstructions, calibrations, selections, simulations, fits etc.)

• HEP (and other sciences) increasingly CPU/Data intensive

  • Programs are significant community resources (transformations)

  • So are the executions of those programs (derivations)

• Management of dataset transformations important!

  • Derivation: Instantiation of a potential data product

  • Provenance: Exact history of any existing data product

We already do this, but manually!

Virtual Data Motivations

[Diagram: Transformation, Derivation and Data, linked by the relations product-of, execution-of and consumed-by/generated-by]

“I’ve detected a muon calibration error and want to know which derived data products need to be recomputed.”

“I’ve found some interesting data, but I need to know exactly what corrections were applied before I can trust it.”

“I want to search a database for 3 rare electron events. If a program that does this analysis exists, I won’t have to write one from scratch.”

“I want to apply a forward jet analysis to 100M events. If the results already exist, I’ll save weeks of computation.”

Virtual Data Motivations

• Data track-ability and result audit-ability: “Virtual Logbook”

  • In the nature of science

  • Reproducibility of results

• Tools and data sharing and collaboration (data with “recipe”)

  • Individuals discover other scientists’ work and build from it

  • Different Teams can work in a modular, semi-autonomous fashion: reuse previous data/code/results or entire analysis chains

• Repair and correction of data – cf. “make”

• Workflow management, Performance optimization: data staged-in from remote site OR re-created locally on demand?

• Transparency with respect to location and existence

Introducing CHIMERA: The GriPhyN Virtual Data System

Virtual Data Language

  • textual (concise, for human consumption)

  • XML (uses XML schema, for component integration)

Virtual Data Interpreter

  • implemented in Java

  • Java API and command-line toolkit

Virtual Data Catalog

  • tracks data provenance (acts like a metadata repository)

  • different back-ends for persistency: PostgreSQL and MySQL DB; file based (for easy testing)

Virtual Data in CHIMERA

A “function call” paradigm

Virtual data: data objects with a well defined method of (re)production

Transformation [namespace]::identifier:[version]

• Abstract description of how a script/executable is invoked

• Similar to a "function declaration" in C/C++

Derivation [namespace]::identifier:[version range]

• Invocation of a transformation with specific arguments

• Similar to a "function call" in C/C++

• Can be either past or future

  • a record of how logical files were produced

  • a recipe for creating logical files at some point in the future
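The declaration/call analogy can be made concrete with a small Python sketch. This is only an illustration of the idea, not CHIMERA's API: the class names, fields and the validation rule are assumptions chosen for this example, and the argument names are borrowed from the pythia example on the next slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    """Abstract description of an invocation (plays the role of a declaration)."""
    name: str
    formal_args: tuple  # formal argument names only, no values


@dataclass(frozen=True)
class Derivation:
    """A transformation bound to specific arguments (plays the role of a call)."""
    name: str
    transformation: Transformation
    actual_args: dict  # formal argument name -> logical file name or value

    def __post_init__(self):
        # Hypothetical check: a derivation may only bind declared arguments.
        unknown = set(self.actual_args) - set(self.transformation.formal_args)
        if unknown:
            raise ValueError(f"unknown arguments: {sorted(unknown)}")


# "Declaration": how pythia is invoked in general.
pythia = Transformation("pythia", ("a2", "a1", "param"))

# "Call": one specific use, which may describe a past run or a future one.
x1 = Derivation("x1", pythia, {"a2": "file2", "a1": "file1"})
```

The same `Derivation` object serves as a record if the run already happened and as a recipe if it has not, which is the sense in which a derivation "can be either past or future".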

Virtual Data Language

TR pythia( out a2, in a1, none param="160.0" )
{
  argument arg = ${param};
  argument file = ${a1};
  argument file = ${a2};
}

TR cmsim( out a2, in a1[] )
{
  argument files = ${a1};
  argument file = ${a2};
}

DV x1->pythia( a2=@{out:file2}, a1=@{in:file1});
DV x2->cmsim( a2=@{out:file3}, a1=[@{in:file2}, @{in:cardfile}] );

[Diagram labelled “Build-style recipe” and “Make-style recipe”: file1 → x1 → file2, cardfile → x2 → file3]
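The dependency chain implied by the two DV lines can be recovered mechanically from their inputs and outputs. The following Python sketch shows one way to do that; it is not how CHIMERA is implemented, and the dictionary layout and the function name `execution_order` are assumptions made for illustration. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Derivation name -> (input logical files, output logical files),
# transcribed by hand from the two DV lines.
derivations = {
    "x1": ({"file1"}, {"file2"}),
    "x2": ({"file2", "cardfile"}, {"file3"}),
}


def execution_order(derivations):
    """Order derivations so that every producer runs before its consumers."""
    # Which derivation produces each logical file?
    producer = {
        out: name
        for name, (_, outputs) in derivations.items()
        for out in outputs
    }
    # A derivation depends on the producers of its inputs. Inputs without a
    # producer (file1, cardfile) are assumed to exist already.
    graph = {
        name: {producer[f] for f in inputs if f in producer}
        for name, (inputs, _) in derivations.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(derivations)
print(order)  # ['x1', 'x2']
```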

Abstract and Concrete DAGs

Abstract DAXs (Virtual Data DAG)

abstract directed acyclic graph with logical names for files/executables

(complete build-style recipe as DAX)

– Resource locations unspecified

– File names are logical

– Data destinations unspecified

Concrete DAGs (stuff for DAGMan)

CONDOR style DAG for grid execution

(check RC, skip steps, make-style)

– Resource locations determined

– Physical file names specified

– Data delivered to and returned from physical locations

[Diagram: VDL (XML) → VDC → Abs. Plan → DAX (XML) → C. Plan (with RC) → DAG → DAGMan, going from logical to physical]
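The "check RC, skip steps, make-style" step can be sketched as a backward walk from the requested file: any logical file that already has a replica needs no work, and only the missing ones pull in their producing derivation. The Python below is a minimal illustration under stated assumptions, not the actual concrete planner; the replica catalog is modelled as a plain dictionary, and the physical URLs are invented for the example.

```python
# Derivation name -> (input logical files, output logical files).
derivations = {
    "x1": ({"file1"}, {"file2"}),
    "x2": ({"file2", "cardfile"}, {"file3"}),
}


def concrete_plan(target, derivations, replicas):
    """Return the derivations to run, in order, to materialize `target`.

    `replicas` maps logical file names to physical locations and stands in
    for the replica catalog (RC).
    """
    producer = {
        out: name
        for name, (_, outputs) in derivations.items()
        for out in outputs
    }
    steps = []

    def require(lfn):
        if lfn in replicas:
            return  # a physical copy exists, so this step is skipped
        if lfn not in producer:
            raise LookupError(f"no replica and no recipe for {lfn}")
        name = producer[lfn]
        if name in steps:
            return  # already scheduled
        for dep in sorted(derivations[name][0]):
            require(dep)
        steps.append(name)

    require(target)
    return steps


# Hypothetical replica catalog contents (URLs are illustrative only).
rc = {
    "file1": "gsiftp://localhost/mydir/file1",
    "cardfile": "gsiftp://localhost/mydir/cardfile",
}
full_plan = concrete_plan("file3", derivations, rc)

# If file2 has already been produced, x1 drops out of the plan.
rc_with_file2 = dict(rc, file2="gsiftp://localhost/mydir/file2")
short_plan = concrete_plan("file3", derivations, rc_with_file2)
```

With only the inputs registered, both derivations are scheduled; once `file2` has a replica, only `x2` remains.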

Nitty-Gritty

Transformation catalog (expects pre-built executables)

#poolname  ltransformation  physical transformation       environment String
local      hw               /bin/echo                     null
local      pythcvs          /workdir/lhc-h-6-cvs          null
local      pythlin          /workdir/lhc-h-6-link         null
local      pythgen          /workdir/lhc-h-6-run          null
local      pythtree         /workdir/h2root.sh            null
local      pythview         /workdir/root.sh              null
local      GriphynRC        /vdshome/bin/replica-catalog  JAVA_HOME=/vdt/jdk1.3;VDS_HOME=/vdshome
local      globus-url-copy  /vdt/bin/globus-url-copy      GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
ufl        hw               /bin/echo                     null
ufl        GriphynRC        /vdshome/bin/replica-catalog  JAVA_HOME=/vdt/jdk1.3.1_04;VDS_HOME=/vdshome
ufl        globus-url-copy  /vdt/bin/globus-url-copy      GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib

Pool configuration

#pool  universe  job-manager-string              url-prefix                workdir ...
ufl    vanilla   testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir   /mydir
ufl    standard  testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir   /mydir
ufl    globus    testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir   /mydir
ufl    transfer  testulix/jobmanager             gsiftp://testulix/mydir   /mydir
local  vanilla   localhost/jm-condor             gsiftp://localhost/mydir  /mydir
local  globus    localhost/jm-condor             gsiftp://localhost/mydir  /mydir
local  transfer  localhost/jobmanager            gsiftp://localhost/mydir  /mydir
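A catalog of this shape is essentially a lookup from (pool, logical transformation) to a physical path plus an environment. The Python sketch below parses a few of the rows above to show that mapping. It assumes whitespace-separated columns, `null` for an empty environment and `;`-separated `NAME=value` pairs, as the slide suggests; the real on-disk format and the function name `parse_catalog` are assumptions.

```python
CATALOG = """\
#poolname ltransformation physical-transformation environment
local hw /bin/echo null
local globus-url-copy /vdt/bin/globus-url-copy GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
ufl hw /bin/echo null
"""


def parse_catalog(text):
    """Map (pool, logical transformation) -> (physical path, environment dict)."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        pool, logical, physical, env = line.split(None, 3)
        if env == "null":
            env_vars = {}
        else:
            env_vars = dict(item.split("=", 1) for item in env.split(";"))
        entries[(pool, logical)] = (physical, env_vars)
    return entries


catalog = parse_catalog(CATALOG)
path, env = catalog[("local", "globus-url-copy")]
```

Looking up the same logical name under a different pool (`local` versus `ufl`) is what lets one abstract recipe resolve to site-specific executables.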

Data Analysis in HEP

• Decentralized, “chaotic”

• Flexible enough system: able to accommodate a large user base and use cases that we can’t foresee

• Ability to build scripts/executables “on the fly”, including user supplied code/parameters (possibly linking with preinstalled libraries on the execution sites)

Prototypes

First for SC2002, second for CHEP03

[Diagram of the prototype chain. Labels: CVS, CVS tag, FORTRAN code, C++ code, datacards, libraries version N, compile/link, executable, PYTHIA wrapper, ntuples, h2root, root trees, root wrapper, plots, event displays]

Prototypes

CHIMERA/ROOT prototype for generating events with PYTHIA/CMKIN, histogramming and visualization

[Diagram, repeated on slides 13 to 17: a tree of virtual data products, each node labelled with its metadata:
mass = 160
mass = 160, decay = bb
mass = 160, decay = ZZ
mass = 160, decay = WW
mass = 160, plot = 1
mass = 160, event = 8
mass = 160, decay = WW, plot = 1
mass = 160, decay = WW, event = 8
mass = 160, decay = WW, WW → leptons
mass = 160, decay = WW, WW → e μ
mass = 160, decay = WW, WW → e μ, plot = 1
mass = 160, decay = WW, WW → e μ, event = 8]

A virtual space of simulated data is created for future use by scientists...

Search for WW decays of the Higgs boson where the Ws decay to electron and muon: mass = 160; decay = WW; WW → e μ

Scientist obtains an interesting result and wants to track how it was derived.

Now the scientist wants to dig deeper...

...The scientist adds a new derived data branch (mass = 160, decay = WW, WW → e μ, Pt > 20)...

...and continues to investigate!
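The walkthrough above combines three operations: searching the virtual data space by metadata, tracing how a product was derived, and adding a new branch. A toy Python model of those operations is sketched below. The product identifiers, the dictionary layout and the rule that a product inherits its parent's metadata are all assumptions made for illustration; they are not taken from the prototype.

```python
# Hypothetical catalog: each product records the metadata it adds and the
# product it was derived from.
products = {
    "higgs": {"meta": {"mass": 160}, "parent": None},
    "ww": {"meta": {"decay": "WW"}, "parent": "higgs"},
    "zz": {"meta": {"decay": "ZZ"}, "parent": "higgs"},
    "ww_plot": {"meta": {"plot": 1}, "parent": "ww"},
}


def provenance(pid):
    """Chain of products from the root down to `pid`."""
    chain = []
    while pid is not None:
        chain.append(pid)
        pid = products[pid]["parent"]
    return chain[::-1]


def metadata(pid):
    """Metadata of a product, including what it inherits from its ancestors."""
    merged = {}
    for ancestor in provenance(pid):
        merged.update(products[ancestor]["meta"])
    return merged


def search(**criteria):
    """Identifiers of all products whose metadata match every criterion."""
    return sorted(
        pid for pid in products
        if all(metadata(pid).get(k) == v for k, v in criteria.items())
    )


ww_products = search(mass=160, decay="WW")
history = provenance("ww_plot")

# "Adds a new derived data branch": a cut applied on top of an existing product.
products["ww_pt20"] = {"meta": {"pt_min": 20}, "parent": "ww"}
```

Adding the new branch leaves the existing products untouched, so other users of the catalog still see the same history for the results they already rely on.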

A Collaborative Data-flow Development Environment:

Complex Data Flow and Data Provenance in HEP

[Diagram: Real Data and Simulated Data; processing stages Raw, ESD, AOD, TAG; results Plots, Tables, Fits and Comparisons]

History of a Data Analysis (like CVS)

"Check-point" a Data Analysis

Analysis Development Environment

Audit a Data Analysis

Outlook

• Work in progress both on CHIMERA & CMS sides – a “snapshot”

• A CHIMERA/ROOT prototype for building executables “on the fly”, generating events with PYTHIA/CMKIN, plotting and visualization available (CHIMERA is a great integration tool)

• The full CMS Monte Carlo chain is working under CHIMERA (next talk)

• Possible future directions:

  • Workflow management; automatic generation; inheritance …

  • Store metadata about derivations (like annotations) in a searchable catalog

  • Handle Datasets, not just Logical File Names

  • Integration with CLARENS (remote access), with ROOT/PROOF (run in parallel)

A picture is better than 1000 words: Prototype Demo