DES Data Management
Ray Plante & Joe Mohr (NCSA / UI Astronomy)
For the DES Collaboration: BIRP Meeting, August 12, 2004, Tucson
Fermilab, U Illinois, U Chicago, LBNL, CTIO/NOAO
Data Management
• Operations Sites: CTIO, La Serena, NCSA, Fermilab
• Common access to data by pipelines & users through archive
• Grid-based processing approach for automation and flexibility
• Community-friendly data release strategy
[Diagram: DES data flow. Data acquisition on the mountaintop feeds a short-term archive at La Serena (quick quality assessment, image correction, basic calibration, SNe analysis, with scheduling feedback to the operator). Data are transferred to the long-term archive at NCSA, with (semi-)replicated archives at Fermilab and NOAO. Within a Grid-based environment, ingest engages the single-frame calibration, source extraction, co-add, and science analysis pipelines, and the archive serves both proprietary and public access.]
Overview of Operations
• CTIO (Mountaintop)
  – Quick quality assessment
• La Serena
  – 10% of each night: repeat observations of SNe fields
  – Automated reduction/analysis to find SNe and follow light curves
• NCSA/UofI
  – Main processing facility, leveraging existing NCSA hardware
• Fermilab
  – Survey planning and simulation
  – Augment the hardware environment as needed (e.g., for reprocessing)
• Data are expected to be transferred between sites over the network
Data Release Strategy
• Single-pointing images are automatically released ~1 year after acquisition
  – Level 0: raw, uncalibrated data
  – Level 1: calibrated, single-frame images
• Science products released twice:
  – At the halfway point of the survey
  – One year after the end of the survey
  – Level 2: co-added/mosaiced images
  – Level 3: “pre-science” catalogs (object catalog)
• Science results released upon publication by the science team
  – Level 4: photo-z catalog, cluster catalog, etc. (see the sketch after this list)
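As a purely illustrative sketch (not part of the DES plan), the level numbering and the ~1-year rule for Level 0-1 products could be encoded as below; the levels and timescale come from this slide, while all names and the API are invented:

```python
from datetime import date, timedelta
from enum import IntEnum

# Hypothetical encoding of the release levels named on this slide.
class DataLevel(IntEnum):
    RAW = 0          # Level 0: raw, uncalibrated data
    CALIBRATED = 1   # Level 1: calibrated, single-frame images
    COADD = 2        # Level 2: co-added/mosaiced images
    CATALOG = 3      # Level 3: "pre-science" catalogs (object catalog)
    SCIENCE = 4      # Level 4: photo-z catalog, cluster catalog, etc.

def is_public(level, acquired, today, milestone_reached=False):
    """Level 0-1 products go public ~1 year after acquisition; Level 2+
    releases are tied to survey-wide milestones, not per-image dates."""
    if level <= DataLevel.CALIBRATED:
        return today - acquired >= timedelta(days=365)
    return milestone_reached

print(is_public(DataLevel.RAW, date(2004, 8, 12), date(2005, 9, 1)))  # True
```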
The DES Archive
• Automated ingest of data products
• Common infrastructure for proprietary and public access
  – The difference is a matter of authorization (sketched after this list)
• Management via the Data Model
  – Tracking survey status and available products
  – Monitor data: included among Level 0 products
  – Exposing sufficient metadata for external use
• Interactive Access
  – Primarily for the public
  – Search and retrieval tools
• Leverage existing NCSA archives (BIMA/CARMA, ADIL, Quest, …)
• Programmatic Access
  – Access by DES pipelines
  – External access by Virtual Observatory apps via standard interfaces
• Partial archive replication at partner sites
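A minimal sketch of the "difference is a matter of authorization" point: public and proprietary requests share one access path, and only the authorization check differs. Everything below (class names, the group name, the catalog) is hypothetical, not the actual DES interface:

```python
from dataclasses import dataclass, field

@dataclass
class Product:
    product_id: str
    is_public: bool
    path: str

@dataclass
class User:
    name: str
    groups: set = field(default_factory=set)

def authorize(user, product):
    if product.is_public:
        return True                           # released data: open to anyone
    return user is not None and "des-collaboration" in user.groups

def fetch(user, catalog, product_id):
    product = catalog[product_id]             # same lookup for everyone
    if not authorize(user, product):
        raise PermissionError(f"{product_id} is still proprietary")
    return product.path

# The same call serves collaboration members and, once data are released,
# the public.
catalog = {"des-0001": Product("des-0001", is_public=False,
                               path="/data/des-0001.fits")}
member = User("ana", {"des-collaboration"})
print(fetch(member, catalog, "des-0001"))     # allowed: /data/des-0001.fits
```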
Grid-based Pipeline Framework
• Purpose:
  – Support fully automated processing
  – Provide a platform-independent execution environment
• Example topology #1:
  – La Serena: calibration, SNe analysis
  – NCSA: Level 1-3 processing (including full calibration), photo-z catalog
  – Fermilab: cluster finding
Grid-based Pipeline Framework
• Example topology #2:
  – La Serena: Level 1 (full calibration), SNe analysis
  – NCSA: Level 2-3 processing
  – Fermilab: photo-z catalog, cluster finding
• Example topology #3 (reprocessing):
  – NCSA: Level 1-4 processing
A common execution environment allows processing to be moved easily between sites, as sketched below.
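To make the "easily moved" claim concrete, here is a hypothetical sketch in which a topology is nothing more than a mapping of pipeline stages to sites; the stage and site names follow the slides, while the code itself is invented for illustration:

```python
# Swapping dictionaries re-homes the processing; no stage code changes.
TOPOLOGY_1 = {
    "calibration": "la-serena", "sne-analysis": "la-serena",
    "level1-3": "ncsa", "photo-z": "ncsa",
    "cluster-finding": "fermilab",
}
TOPOLOGY_2 = {
    "calibration": "la-serena", "sne-analysis": "la-serena",
    "level2-3": "ncsa",
    "photo-z": "fermilab", "cluster-finding": "fermilab",
}

def submit(stage: str, topology: dict) -> str:
    site = topology[stage]
    # A real system would hand the stage to the Grid scheduler at `site`;
    # here we just report the routing decision.
    return f"run {stage} at {site}"

print(submit("photo-z", TOPOLOGY_1))  # run photo-z at ncsa
print(submit("photo-z", TOPOLOGY_2))  # run photo-z at fermilab
```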
Enabling Automation
• Critical for handling the large data rate
  – Run all processing as data becomes available
  – Automated release of Level 0-1 data
• Automation based on events that trigger processing (see the sketch after this list)
  – First event: data lands in the La Serena cache
  – An event engages a pipeline on a set of data products
• Techniques for recovery from failure
• Biggest challenge: automated quality assessment
  – Quantify measures of quality
– Flag obvious problems
– Filter down cases requiring human inspection
– Attach quality measures to metadata
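A minimal sketch of the event-trigger idea, assuming a simple in-process registry; the event name and API are illustrative, not the DES design:

```python
from collections import defaultdict

# Pipelines register for named events; firing an event engages every
# registered pipeline on the given set of data products.
_handlers = defaultdict(list)

def on(event_name):
    """Register a pipeline entry point to run when an event fires."""
    def register(fn):
        _handlers[event_name].append(fn)
        return fn
    return register

def fire(event_name, products):
    for handler in _handlers[event_name]:
        try:
            handler(products)
        except Exception as err:
            # Recovery hook: flag the failure for automated retry or human
            # inspection rather than halting the night's processing.
            print(f"{handler.__name__} failed on {products}: {err}")

@on("data-in-la-serena-cache")
def quick_quality_assessment(products):
    print(f"assessing {products}")

fire("data-in-la-serena-cache", ["exp001.fits", "exp002.fits"])
```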
Platform-independent execution environment
• The “application” level should not worry about what machine it is running on (a sketch follows this list):
  – Common way of initiating a pipeline application and passing in its inputs
  – Transparent access to data
  – Common logging
  – Transparent parallelism
  – Common exit strategy
    • Status
    • Declaration of output products
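One hypothetical shape such a contract could take, with every application launched through the same entry point; the class and method names are invented for illustration, not drawn from the DES software:

```python
import logging

class PipelineApp:
    """Common contract: uniform initiation, logical inputs, common logging,
    and a common exit strategy (status plus declared output products)."""

    def __init__(self, name, inputs):
        self.name = name
        self.inputs = inputs                  # logical names, not file paths
        self.outputs = []
        self.log = logging.getLogger(name)    # common logging channel

    def run(self):
        raise NotImplementedError             # each application fills this in

    def declare_output(self, logical_name):
        self.outputs.append(logical_name)     # common exit: declared products

    def execute(self):
        """The one common way any site initiates an application."""
        self.log.info("starting with inputs %s", self.inputs)
        self.run()
        return {"status": "ok", "outputs": self.outputs}

class SingleFrameCalibration(PipelineApp):
    def run(self):
        for frame in self.inputs:
            self.declare_output(frame + ".cal")

result = SingleFrameCalibration("calib", ["exp001"]).execute()
print(result)   # {'status': 'ok', 'outputs': ['exp001.cal']}
```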
Grid-based Capabilities Needed
• Data access through logical identifiers (sketched after this list)
• Automated archiving of processed products
• Workflow management
• Process monitoring, error detection, error recovery
• Transparent support for local authentication/authorization mechanisms
• Grid-based job execution
  – Hides local batch-system complications
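As an illustration of the first capability, a toy resolver that maps logical identifiers to physical replicas; the catalog contents and naming scheme are invented for this sketch:

```python
# Pipeline code asks for a logical name; a replica catalog resolves it to a
# physical location, preferring a local copy.
REPLICA_CATALOG = {
    "des:exp001/raw": {
        "ncsa": "/archive/des/raw/exp001.fits",
        "la-serena": "/cache/exp001.fits",
    },
}

def resolve(logical_id: str, site: str) -> str:
    replicas = REPLICA_CATALOG[logical_id]
    # Prefer a local replica; otherwise fall back to any site that has one,
    # which is where a real system would schedule a transfer.
    return replicas.get(site) or next(iter(replicas.values()))

print(resolve("des:exp001/raw", "la-serena"))   # /cache/exp001.fits
print(resolve("des:exp001/raw", "fermilab"))    # falls back to a remote copy
```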
Leverage existing technologies and experience
• Astronomical pipelines
– Community code: IRAF, SExtractor, …
– NOAO Mosaic Pipeline, OPUS
– BIMA Data Archive and Pipeline
• Real-time data ingest and automated release
• Grid-based image processing on NCSA platforms
  – Quest2 pipeline
    • Deployed the existing pipeline on TeraGrid platforms in ~2 weeks
    • Grid technology used for replicating data between Caltech and NCSA
• NCSA and Fermilab programs for Grid Infrastructure
• Emerging collaboration between NCSA and NOAO
– NCSA to provide data management services to NOAO archives
• NOAO, NCSA, Fermilab are partners in the National Virtual Observatory (NVO)
Software Design & Engineering
• Data Management Steering Group
  – Cross-institution working group handles high-level design, policies & plan
• Design Deliverables
  – Currently underway: detailed DM requirements, high-level design, Work Package definitions
• Design & Development Process
  – Design reviews
  – Coding standards and software reviews
  – Testing framework, Data Challenge definition
  – Reporting and effort tracking
  – Must choose a process weight appropriate for the project
• Process for Pipeline Framework Design
  – Understand constraints from target platforms and existing software technologies
  – Define the reference platform: the environment the software must run in
  – Design the data access, archive, and processing framework
Schedule
Year 1
  Development: software engineering; archive-based collection access; pipeline systems (framework, simulation, single-frame calibration)
  Testing/Data Challenges: testing framework
Year 2
  Development: archive-based collection access; pipeline systems (framework, simulation, single-frame calibration, co-add)
  Testing/Data Challenges: Challenge I, a hand-run test of existing software (calibration through object extraction)
Year 3
  Development: pipeline systems (co-add, object extraction); operations software
  Testing/Data Challenges: Challenge II, an automated test of the archive and pipeline framework (data ingest through single-frame calibration)
End of Year 3
  Testing/Data Challenges: Challenge III, an automated test of chained pipelines on one year's worth of data (data ingest through co-add)
Year 4
  Development: operations software; photo-z catalog
  Testing/Data Challenges: address data challenge issues, rerun as necessary
Year 5
  Operations
Summary
• Data management with strong support for community access
  – Common public/proprietary access infrastructure
– Aggressive release schedule, with fast release of basic data products
– Long-term archive plan
– Building on NOAO-NCSA relationship
• Grid-based pipeline architecture
  – Automation, flexibility
– Key to supporting geographically distributed processing
• Leverage existing software and technologies