DES Data Management
Ray Plante & Joe Mohr (NCSA / UI Astronomy)
For the DES Collaboration: BIRP Meeting, August 12, 2004, Tucson
Fermilab, U Illinois, U Chicago, LBNL, CTIO/NOAO
Data Management
• Operations Sites: CTIO, La Serena, NCSA, Fermilab
• Common access to data by pipelines & users through archive
• Grid-based processing approach for automation and flexibility
• Community-friendly data release strategy
[Diagram: DES data flow. Data acquisition on the mountaintop feeds a short-term archive at La Serena (quick quality assessment, image correction, basic calibration, SNe analysis, with scheduling feedback to the operator). Data are transferred to the long-term archive at NCSA, with (semi-)replicated archives at Fermilab and NOAO. Within a Grid-based environment, ingest engages the single-frame calibration, source extraction, co-add, and science analysis pipelines, and the archive serves both proprietary and public access.]
Overview of Operations
• CTIO (Mountaintop)
  – Quick quality assessment
• La Serena
  – 10% of each night: repeat observations of SNe fields
  – Automated reduction/analysis to find SNe and follow light curves
• NCSA/UofI
  – Main processing facility, leveraging existing NCSA hardware
• Fermilab
  – Survey planning and simulation
  – Augment the hardware environment as needed (e.g., for reprocessing)
• Data are expected to be transferred between sites over the network
Data Release Strategy
• Single-pointing images are automatically released ~1 year after acquisition
  – Level 0: raw, uncalibrated data
  – Level 1: calibrated, single-frame images
• Science products released twice:
  – At the halfway point of the survey
  – One year after the end of the survey
  – Level 2: co-added/mosaiced images
  – Level 3: “pre-science” catalogs (object catalog)
• Science results released upon publication by the science team
  – Level 4: photo-z catalog, cluster catalog, etc. (see the sketch after this list)
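As a purely illustrative sketch (not part of the DES plan), the level numbering and the ~1-year rule for Level 0-1 products could be encoded as below; the levels and timescale come from this slide, while all names and the API are invented:

```python
from datetime import date, timedelta
from enum import IntEnum

# Hypothetical encoding of the release levels named on this slide.
class DataLevel(IntEnum):
    RAW = 0          # Level 0: raw, uncalibrated data
    CALIBRATED = 1   # Level 1: calibrated, single-frame images
    COADD = 2        # Level 2: co-added/mosaiced images
    CATALOG = 3      # Level 3: "pre-science" catalogs (object catalog)
    SCIENCE = 4      # Level 4: photo-z catalog, cluster catalog, etc.

def is_public(level, acquired, today, milestone_reached=False):
    """Level 0-1 products go public ~1 year after acquisition; Level 2+
    releases are tied to survey-wide milestones, not per-image dates."""
    if level <= DataLevel.CALIBRATED:
        return today - acquired >= timedelta(days=365)
    return milestone_reached

print(is_public(DataLevel.RAW, date(2004, 8, 12), date(2005, 9, 1)))  # True
```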
The DES Archive
• Automated ingest of data products
• Common infrastructure for proprietary and public access
  – The difference is a matter of authorization (sketched after this list)
• Management via the Data Model
  – Tracking survey status and available products
  – Monitor data: included among Level 0 products
  – Exposing sufficient metadata for external use
• Interactive Access
  – Primarily for the public
  – Search and retrieval tools
• Leverage existing NCSA archives (BIMA/CARMA, ADIL, Quest, …)
• Programmatic Access
  – Access by DES pipelines
  – External access by Virtual Observatory apps via standard interfaces
• Partial archive replication at partner sites
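A minimal sketch of the "difference is a matter of authorization" point: public and proprietary requests share one access path, and only the authorization check differs. Everything below (class names, the group name, the catalog) is hypothetical, not the actual DES interface:

```python
from dataclasses import dataclass, field

@dataclass
class Product:
    product_id: str
    is_public: bool
    path: str

@dataclass
class User:
    name: str
    groups: set = field(default_factory=set)

def authorize(user, product):
    if product.is_public:
        return True                           # released data: open to anyone
    return user is not None and "des-collaboration" in user.groups

def fetch(user, catalog, product_id):
    product = catalog[product_id]             # same lookup for everyone
    if not authorize(user, product):
        raise PermissionError(f"{product_id} is still proprietary")
    return product.path

# The same call serves collaboration members and, once data are released,
# the public.
catalog = {"des-0001": Product("des-0001", is_public=False,
                               path="/data/des-0001.fits")}
member = User("ana", {"des-collaboration"})
print(fetch(member, catalog, "des-0001"))     # allowed: /data/des-0001.fits
```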
Grid-based Pipeline Framework
• Purpose:
  – Support fully automated processing
  – Provide a platform-independent execution environment
• Example topology #1:
  – La Serena: calibration, SNe analysis
  – NCSA: Level 1-3 processing (including full calibration), photo-z catalog
  – Fermilab: cluster finding
Grid-based Pipeline Framework
• Example topology #2:
  – La Serena: Level 1 (full calibration), SNe analysis
  – NCSA: Level 2-3 processing
  – Fermilab: photo-z catalog, cluster finding
• Example topology #3 (reprocessing):
  – NCSA: Level 1-4 processing
A common execution environment allows processing to be moved easily between sites, as sketched below.
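To make the "easily moved" claim concrete, here is a hypothetical sketch in which a topology is nothing more than a mapping of pipeline stages to sites; the stage and site names follow the slides, while the code itself is invented for illustration:

```python
# Swapping dictionaries re-homes the processing; no stage code changes.
TOPOLOGY_1 = {
    "calibration": "la-serena", "sne-analysis": "la-serena",
    "level1-3": "ncsa", "photo-z": "ncsa",
    "cluster-finding": "fermilab",
}
TOPOLOGY_2 = {
    "calibration": "la-serena", "sne-analysis": "la-serena",
    "level2-3": "ncsa",
    "photo-z": "fermilab", "cluster-finding": "fermilab",
}

def submit(stage: str, topology: dict) -> str:
    site = topology[stage]
    # A real system would hand the stage to the Grid scheduler at `site`;
    # here we just report the routing decision.
    return f"run {stage} at {site}"

print(submit("photo-z", TOPOLOGY_1))  # run photo-z at ncsa
print(submit("photo-z", TOPOLOGY_2))  # run photo-z at fermilab
```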
Enabling Automation
• Critical for handling the large data rate
  – Run all processing as data becomes available
  – Automated release of Level 0-1 data
• Automation based on events that trigger processing (see the sketch after this list)
  – First event: data lands in the La Serena cache
  – An event engages a pipeline on a set of data products
• Techniques for recovery from failure
• Biggest challenge: automated quality assessment
  – Quantify measures of quality
– Flag obvious problems
– Filter down cases requiring human inspection
– Attach quality measures to metadata
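A minimal sketch of the event-trigger idea, assuming a simple in-process registry; the event name and API are illustrative, not the DES design:

```python
from collections import defaultdict

# Pipelines register for named events; firing an event engages every
# registered pipeline on the given set of data products.
_handlers = defaultdict(list)

def on(event_name):
    """Register a pipeline entry point to run when an event fires."""
    def register(fn):
        _handlers[event_name].append(fn)
        return fn
    return register

def fire(event_name, products):
    for handler in _handlers[event_name]:
        try:
            handler(products)
        except Exception as err:
            # Recovery hook: flag the failure for automated retry or human
            # inspection rather than halting the night's processing.
            print(f"{handler.__name__} failed on {products}: {err}")

@on("data-in-la-serena-cache")
def quick_quality_assessment(products):
    print(f"assessing {products}")

fire("data-in-la-serena-cache", ["exp001.fits", "exp002.fits"])
```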
Platform-independent execution environment
• The “application” level should not worry about what machine it is running on (a sketch follows this list):
  – Common way of initiating a pipeline application and passing in its inputs
  – Transparent access to data
  – Common logging
  – Transparent parallelism
  – Common exit strategy
    • Status
    • Declaration of output products
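One hypothetical shape such a contract could take, with every application launched through the same entry point; the class and method names are invented for illustration, not drawn from the DES software:

```python
import logging

class PipelineApp:
    """Common contract: uniform initiation, logical inputs, common logging,
    and a common exit strategy (status plus declared output products)."""

    def __init__(self, name, inputs):
        self.name = name
        self.inputs = inputs                  # logical names, not file paths
        self.outputs = []
        self.log = logging.getLogger(name)    # common logging channel

    def run(self):
        raise NotImplementedError             # each application fills this in

    def declare_output(self, logical_name):
        self.outputs.append(logical_name)     # common exit: declared products

    def execute(self):
        """The one common way any site initiates an application."""
        self.log.info("starting with inputs %s", self.inputs)
        self.run()
        return {"status": "ok", "outputs": self.outputs}

class SingleFrameCalibration(PipelineApp):
    def run(self):
        for frame in self.inputs:
            self.declare_output(frame + ".cal")

result = SingleFrameCalibration("calib", ["exp001"]).execute()
print(result)   # {'status': 'ok', 'outputs': ['exp001.cal']}
```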
Grid-based Capabilities Needed
• Data access through logical identifiers (sketched after this list)
• Automated archiving of processed products
• Workflow management
• Process monitoring, error detection, error recovery
• Transparent support for local authentication/authorization mechanisms
• Grid-based job execution
  – Hides local batch-system complications
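As an illustration of the first capability, a toy resolver that maps logical identifiers to physical replicas; the catalog contents and naming scheme are invented for this sketch:

```python
# Pipeline code asks for a logical name; a replica catalog resolves it to a
# physical location, preferring a local copy.
REPLICA_CATALOG = {
    "des:exp001/raw": {
        "ncsa": "/archive/des/raw/exp001.fits",
        "la-serena": "/cache/exp001.fits",
    },
}

def resolve(logical_id: str, site: str) -> str:
    replicas = REPLICA_CATALOG[logical_id]
    # Prefer a local replica; otherwise fall back to any site that has one,
    # which is where a real system would schedule a transfer.
    return replicas.get(site) or next(iter(replicas.values()))

print(resolve("des:exp001/raw", "la-serena"))   # /cache/exp001.fits
print(resolve("des:exp001/raw", "fermilab"))    # falls back to a remote copy
```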
Leverage existing technologies and experience
• Astronomical pipelines
– Community code: IRAF, SExtractor, …
– NOAO Mosaic Pipeline, OPUS
– BIMA Data Archive and Pipeline
• Real-time data ingest and automated release
• Grid-based image processing on NCSA platforms
  – Quest2 pipeline
    • Deployed the existing pipeline on TeraGrid platforms in ~2 weeks
    • Grid technology used for replicating data between Caltech and NCSA
• NCSA and Fermilab programs for Grid Infrastructure
• Emerging collaboration between NCSA and NOAO
– NCSA to provide data management services to NOAO archives
• NOAO, NCSA, Fermilab are partners in the National Virtual Observatory (NVO)
Software Design & Engineering
• Data Management Steering Group
  – Cross-institution working group handles high-level design, policies & plan
• Design Deliverables
  – Currently underway: detailed DM requirements, high-level design, Work Package definitions
• Design & Development Process
  – Design reviews
  – Coding standards and software reviews
  – Testing framework, Data Challenge definition
  – Reporting and effort tracking
  – Must choose a process weight appropriate for the project
• Process for Pipeline Framework Design
  – Understand constraints from target platforms and existing software technologies
  – Define the reference platform: the environment the software must run in
  – Design the data access, archive, and processing framework
Schedule
Year 1
  Development: software engineering; archive-based collection access; pipeline systems (framework, simulation, single-frame calibration)
  Testing/Data Challenges: testing framework
Year 2
  Development: archive-based collection access; pipeline systems (framework, simulation, single-frame calibration, co-add)
  Testing/Data Challenges: Challenge I, a hand-run test of existing software (calibration through object extraction)
Year 3
  Development: pipeline systems (co-add, object extraction); operations software
  Testing/Data Challenges: Challenge II, an automated test of the archive and pipeline framework (data ingest through single-frame calibration)
End of Year 3
  Testing/Data Challenges: Challenge III, an automated test of chained pipelines on one year's worth of data (data ingest through co-add)
Year 4
  Development: operations software; photo-z catalog
  Testing/Data Challenges: address data challenge issues, rerun as necessary
Year 5
  Operations
Summary
• Data management with strong support for community access
  – Common public/proprietary access infrastructure
– Aggressive release schedule, with fast release of basic data products
– Long-term archive plan
– Building on NOAO-NCSA relationship
• Grid-based pipeline architecture
  – Automation, flexibility
– Key to supporting geographically distributed processing
• Leverage existing software and technologies