Distributed Data Analysis and Tools
CHEP, 21-27 March 2009, Prague
P. Mato / CERN
Distributed Data Analysis is a very wide subject and I don't like catalogue-like talks
Narrowing the scope of the presentation to the perspective of the physicists, discussing issues that affect them directly
My presentation will be LHC-centric, which is very relevant for the current phase we are in now. -- Sorry
Thanks to all the people who have helped me to prepare this presentation
Foreword
The full data processing chain from reconstructed event data up to producing the final plots for publication
Data analysis is an iterative process (a minimal sketch follows this slide)
◦ Reduce data samples to more interesting subsets (selection)
◦ Compute higher-level information, redo some reconstruction, etc.
◦ Calculate statistical entities
Algorithm development is essential in analysis
◦ The ingenuity is materialized in code
Data Analysis
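Purely as an illustration of this iterative pattern (not from the original slides; the event records and cuts are hypothetical), a minimal Python sketch:

    def select(events, cut):
        # reduction step: keep only the interesting subset
        return [e for e in events if cut(e)]

    def summarize(values):
        # compute a statistical entity from the reduced sample
        n = len(values)
        return n, (sum(values) / n if n else 0.0)

    # hypothetical events with a single 'pt' attribute
    events = [{"pt": 5.0}, {"pt": 25.0}, {"pt": 40.0}]
    subset = select(events, lambda e: e["pt"] > 20.0)    # selection
    print(summarize([e["pt"] for e in subset]))          # (2, 32.5)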
The large amount of data to be analyzed and the computing requirements rule out the idea of non-distributed data analysis
The scale of 'distribution' goes from a local cluster to a computing centre or to the whole grid(s)
Distributed analysis complicates the life of the physicists
◦ In addition to the analysis code he/she has to worry about many other technical issues
Some Obvious Facts
LHC Analysis Data Flow
Data is generated at the experiment, processed and distributed worldwide (T1, T2, T3)
The analysis will process, reduce, transform and select parts of the data iteratively until they fit in a single computer
How is this realized?
All elements are there and still valid
◦ Less organized activity (chaotic)
◦ Input data defined by asking questions
◦ Data scattered all over the world
◦ Own algorithms
◦ Data provenance
◦ Software version management
◦ Resource estimation
◦ Interactivity
Advocating for a sophisticated WMS
◦ Common to all VOs
◦ Plugins to VO-specific tools/services
HEPCAL-II† Dreams
[Diagram: a common Workload Management System connecting the Dataset Query, User Algorithms, User Output and Other Services]
† Common use cases for a HEP Common Application Layer for Analysis, LCG-2003
“If there is no special middleware support [for analysis], the job may not benefit from being run in the grid environment, and analysis may even take a step backward from pre-grid days”
Need for a Common Layer
The implementation has evolved into a number of VO-specific "middleware" systems built on a small set of basic services
◦ E.g. DIRAC, PanDA, AliEn, Glide-In
Development of "user-friendly" and "intelligent" interfaces to hide the complexity
◦ E.g. CRAB, Ganga
Not optimal for small VOs that cannot afford to develop specific services/interfaces
◦ Or individuals with special needs
HEPCAL-II Reality
[Layered diagram: (VO-specific) front-end interface → VO-specific WMS, DSC → grid middleware basic services → computing & storage resources]
Specialization of the VOs' frameworks and data models for data analysis to process ESD/AOD
◦ CMS Physics Analysis Toolkit (PAT), ATLAS Analysis Framework, LHCb DaVinci/LoKi/Bender, ALICE Analysis Framework
◦ In some cases selecting a subset of framework libraries
◦ Collaboration-approved analysis algorithms and tools
Other [scripting] languages have a role here
◦ Python is getting very popular in addition to CINT macros
◦ Ideal for prototyping new ideas (see the sketch after this slide)
The user typically develops his/her own algorithm(s) based on these frameworks, but may also want to replace parts of the official release
Analysis Software
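As an illustration of such Python prototyping (not from the slides; assumes ROOT with PyROOT enabled, and the file, tree and branch names are hypothetical):

    import ROOT  # requires a ROOT installation with PyROOT enabled

    f = ROOT.TFile.Open("aod.root")          # hypothetical input file
    tree = f.Get("Events")                   # hypothetical tree name
    h = ROOT.TH1F("h_pt", "muon pT;pT [GeV];entries", 100, 0.0, 100.0)
    for event in tree:                       # PyROOT event loop
        for pt in event.muon_pt:             # hypothetical vector branch
            h.Fill(pt)
    h.Draw()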
Front-End Tools
[Screenshots: the Ganga, ALICE and CRAB front-end tools]
Both Ganga and ALICE provide an interactive shell to configure and automate analysis jobs (Python, CINT)
◦ In addition, Ganga provides a GUI
CRAB has a thin client; most of the work (automation, recovery, monitoring, etc.) is done on a server
◦ In the other cases this functionality is delegated to the VO-specific WMS
Ganga offers a convenient overview of all user jobs (job repository), enabling automation
Both CRAB and Ganga are able to pack local user libraries and environment automatically, making use of the configuration tool's knowledge (see the sketch after this slide)
◦ For ALICE the user provides .par files with the sources
Major Differences
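From the user's side, the packing amounts to shipping local code along with the job; a minimal sketch in Ganga's Python interface (attribute names follow Ganga's GPI, but treat the fragment as illustrative, not a verbatim recipe):

    # Inside a Ganga session (GPI); a sketch, not a verbatim recipe.
    j = Job()
    j.application = Executable(exe='./myAnalysis.sh')
    j.inputsandbox = ['myAnalysis.sh', 'myUserLibs.tar.gz']  # local user code shipped with the job
    j.backend = Local()
    j.submit()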
1. Algorithm development and testing starts locally and small
◦ Single computer → small cluster
2. Grows to a large data and computation task
◦ Large cluster → the Grid
3. Final analysis is again more local and small
◦ Small cluster → single computer
Ideally the analysis activity should be a continuum in terms of tools, software frameworks, models, etc.
◦ LHC experiments are starting to offer this to their physicists
◦ Ganga is a good example: from inside the same session you can run a large data job and do the final analysis with the results
Analysis Activity
The user specifies what data to run the analysis on using VO-specific dataset catalogs
◦ Specification is based on a query
◦ The front-end interfaces provide functionality to facilitate the catalog queries
Each experiment has developed event-tag mechanisms for sparse input data selection
Data is scattered over the world
◦ The computing model and policies of the experiment dictate the placement of data
◦ Read-only data with several replicas
◦ Portions of the data are copied to local clusters (CAF, T3, etc.) for local access
Input Data
Small output data files such as histogram files are returned to the client session (using the sandbox)
◦ Usually limited to a few MB
Large output files are typically put in Storage Elements (e.g. Castor) and registered in the grid file catalogue (e.g. LFC), and can be used as input for other Grid jobs (iterative process)
Tools such as CRAB and Ganga (ATLAS) provide strong links with the VO's Distributed Data Management/Transfer systems (e.g. DQ2, PhEDEx) to place output where the user wants it
Output Data
The goal is to make it easy for physicists: distributed analysis should be as simple as doing it locally
◦ Which is already complicated enough!!
◦ Hiding the technical details is a must
In Ganga, changing the back-end from LSF to DIRAC requires changing one parameter (see the sketch after this slide)
In ALICE, changing from PROOF to AliEn requires changing one name and providing an AliEn plugin configuration
In CRAB, changing from local batch to gLite requires a single parameter change in the configuration file
Submission Transparency
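A sketch of what that one-parameter change looks like in a Ganga session (backend and application names follow Ganga's GPI of the time; illustrative only):

    # Same job definition; only the backend changes.
    j = Job(application=Executable(exe='./runAnalysis.sh'))
    j.backend = LSF()      # run on the local LSF batch system
    j.submit()

    j2 = j.copy()
    j2.backend = Dirac()   # same job, now submitted to the Grid via DIRAC
    j2.submit()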
[Diagram (Andrei Gheata): on the client, MyAnalysis.C calls AM->StartAnalysis("proof"); the AnalysisManager (AM) sends the input chain to the PROOF master, which distributes the work to the workers; each worker runs the registered tasks (task1 … taskN) through AliAnalysisSelector/TSelector (SlaveBegin(), Process(), SlaveTerminate()); the per-worker outputs (O1, O2, … On) are merged and returned to the client's output list in Terminate().]
PROOF Transparency example
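To make the transparency concrete, a PyROOT sketch of running the same TSelector locally and through PROOF (tree, file and selector names are hypothetical):

    import ROOT

    chain = ROOT.TChain("esdTree")          # hypothetical tree name
    chain.Add("data/*.root")                # hypothetical input files

    # Local run: process the chain directly with a TSelector
    chain.Process("MySelector.C+")

    # PROOF run: open a session and route the same chain through it
    ROOT.TProof.Open("lite://")             # PROOF-Lite here; a cluster URL would go in its place
    chain.SetProof()                        # attach the chain to the PROOF session
    chain.Process("MySelector.C+")          # same selector, now run in parallel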
A large variety of front-ends and back-ends
It is great, but it may add confusion and complicate user support
ATLAS Physicist Choices
Distributed analysis relies on the software installed in the remote nodes (e.g. local cluster, Grid)
◦ The experiment's officially released software is taken care of by the VOs
◦ Installation procedures for big VOs are well oiled
◦ Problem for small VOs / individuals
Physicists' add-ons and private analysis algorithms need to be sent along with the job
◦ Every user tool provides some level of support for this
◦ An exact match of the OS version/compiler (platform) is required when sending binaries
The latter imposes strong constraints on the platform uniformity of the different facilities
◦ Local interactive service → local facility → Grid
Managing the Software
CernVM is a Virtual Appliance that provides a complete, portable and easy-to-configure user environment for developing and running analysis locally and on the Grid, independently of the physical software and hardware platform
It comes with a read-only file system (CVMFS) optimized for software distribution
◦ Only a small fraction of the software is actually used (~10%)
◦ Very aggressive local caching, web proxy caches (squids)
◦ Operational in off-line mode
On-demand Install with CernVM
[Diagram: CernVM instances mounting CVMFS, fetching software over LAN/WAN via HTTPS]
The CernVM platform is starting to be used by physicists to develop/test/debug data analysis
◦ With a laptop you carry the complete development environment and the Grid UI with you
◦ Managing all phases of analysis from the same 'window'
Ideally the same environment should be used to execute their jobs in the Grid
◦ Validation with large datasets
◦ Decoupling application software from system software and hardware
Can the existing 'Grid' be adapted to CernVM?
Virtualization Role
Job splitting (parallelization) is essential to be able to analyze large data samples in a limited time
◦ Very long-running jobs are less reliable
Tools such as PROOF split the analysis job dynamically at the sub-file level (packets), offering [quasi-]interactivity to the user
All the other Grid submission tools provide parallelization by splitting the list of input files (a minimal sketch follows this slide)
◦ Sub-jobs are constrained by input data location
The more difficult part is merging the results
◦ Standard automation for the most common cases
◦ User intervention for the more complicated ones
Job Splitting
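A minimal sketch of that file-level splitting, the common denominator of the Grid submission tools (purely illustrative):

    def split_by_files(input_files, files_per_subjob):
        # Partition the input file list into sub-job chunks.
        return [input_files[i:i + files_per_subjob]
                for i in range(0, len(input_files), files_per_subjob)]

    files = ["aod_%03d.root" % i for i in range(10)]   # hypothetical inputs
    subjobs = split_by_files(files, 3)                 # 4 sub-jobs: 3+3+3+1 files
    # each chunk becomes one sub-job; merging the outputs comes afterwards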
The majority of today's computing resources are based on multi-core architectures
◦ Exploiting these multi-core architectures (MT, MP) can optimize the use of resources (memory, I/O)
◦ See V. Innocente's presentation
Submitting a single job per node that utilizes all available cores can be advantageous (see the sketch after this slide)
◦ Efficient in resources, mainly by increasing the fraction of shared memory
◦ Scales down the number of jobs that the WMS needs to handle
Using Multi-Core Architectures
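A sketch of the 'one job per node, all cores' idea using Python's multiprocessing module (illustrative; a real framework would also exploit the memory shared between the forked workers):

    import multiprocessing

    def process_file(filename):
        # placeholder for the real per-file event loop
        return ("done", filename)

    if __name__ == "__main__":
        files = ["aod_%03d.root" % i for i in range(32)]          # hypothetical inputs
        pool = multiprocessing.Pool(multiprocessing.cpu_count())  # one worker per core
        results = pool.map(process_file, files)
        pool.close()
        pool.join()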
Grouping data analyses is a way to optimize when going over a large part of, or the full, dataset
◦ Requires the support of the framework (a model)
◦ …and some discipline
Examples (a sketch of the idea follows this slide):
◦ ALICE is using the AliAnalysisManager framework to optimize the CPU/IO ratio (85% savings reported)
◦ LHCb is grouping pre-selections in their stripping jobs
Analysis Trains
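The core of the train idea is that many tasks share one pass over the data; a minimal Python sketch, loosely modeled on the AliAnalysisManager pattern (all names are illustrative):

    class AnalysisTrain:
        # Run many registered tasks over a single pass of the data.
        def __init__(self):
            self.tasks = []

        def add_task(self, task):
            self.tasks.append(task)

        def run(self, events):
            for event in events:         # the data is read only once...
                for task in self.tasks:  # ...but every "wagon" sees every event
                    task.process(event)

    # train = AnalysisTrain()
    # train.add_task(PtSpectrumTask()); train.add_task(FlowTask())
    # train.run(event_source)   # amortizes the I/O cost across all tasks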
At the time of HEPCAL-II resource estimation was an important issue
◦ How much CPU time would this analysis take, what will be the output data size, etc.
In practice physicists can estimate resources pretty well, since test analyses are performed with small data samples before submitting large jobs (the arithmetic is sketched after this slide)
◦ Proper reporting of the 'cost' of each job in standardized units could facilitate this estimation
◦ In the old times of CERNVM a job summary with the CPU time in 'CERN units' was printed for each job
Resource Estimation
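The extrapolation itself is simple, which is why it works well in practice; a sketch with made-up numbers:

    # Extrapolate from a small test run to the full sample (made-up numbers).
    test_events = 10000           # events in the local test job
    test_cpu_seconds = 250.0      # measured CPU time of the test
    test_output_mb = 12.0         # measured output size of the test

    full_events = 50000000        # events in the full dataset
    scale = float(full_events) / test_events

    print("CPU estimate: %.0f hours" % (test_cpu_seconds * scale / 3600))  # ~347 hours
    print("Output estimate: %.1f GB" % (test_output_mb * scale / 1024))    # ~58.6 GB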
Job failures are very common (e.g. ~45% of the CMS analysis jobs do not terminate successfully)
◦ The reasons are very diverse (data access, stalled jobs, uploading data, application failures, …)
Proper reporting of job failures is essential for diagnosing and handling them efficiently
◦ Detailed monitoring, log files, etc.
Handling failures may imply providing corrections in configurations or code, re-submission, managing site blacklists, etc.
◦ Automated corrective actions can be handled by servers (e.g. CRAB)
◦ Scripting support is available to users (e.g. Ganga, as below)
Handling Job Failures
[1]: jobs.select(status='failed').resubmit()
[2]: jobs.select(name='testjob').kill()
[3]: newjobs = jobs.select(status='new')
[4]: newjobs.select(name='urgent').submit()
Monitoring is essential for the users and also for the administrators
Physicists may use web-based interfaces to find out information about their jobs
◦ Each WMS has developed very complete monitoring tools
◦ The details available are really impressive (e.g. Panda Monitor)
Often the connection with the submission tools is poor
◦ Not well integrated
Monitoring
If the front-end submission tool understands the analysis application [framework] it can become extremely helpful to the users
E.g. the Ganga application component can
◦ Set up the correct environment, collect user-shareable libraries, analyze configuration files and follow dependencies, determine inputs and outputs and register them automatically, etc.
The technical solution to achieve this is to implement 'plugins' for each type of application (sketched after this slide)
Application Awareness
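A hypothetical sketch of the shape such an application plugin interface can take (this is not Ganga's actual class hierarchy, just an illustration of the plugin idea):

    class ApplicationPlugin:
        # Hypothetical interface a front-end could define per application type.
        def prepare(self, job):
            # set up the environment, collect user libraries, follow dependencies
            raise NotImplementedError

        def register_io(self, job):
            # determine inputs and outputs and register them automatically
            raise NotImplementedError

    class DaVinciPlugin(ApplicationPlugin):
        # One concrete plugin per application type (DaVinci, Athena, ...).
        def prepare(self, job):
            pass  # e.g. parse the user's option files and pack the needed libraries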
Fundamentally the way analysis is being done has not changed very much
◦ The initial dream that the Grid would dramatically change the paradigm has not happened
◦ Parts of the analysis with large data jobs will be done in batch and parts will be done more locally and interactively
Each collaboration has developed tools to cope with the large data and computational requirements and to simplify the life of physicists
◦ It turned out that the models/architectures of these tools are very similar, but they are not held in common
◦ The number of users of these tools is increasing rapidly
Summary