DIBBs Brown Dog - nationaldataservice.org

DIBBs Brown Dog An Extensible and Distributed Data Transformation Service

Tabular Data Gap Filling

Climate Modeling Lidar

Flood Plain Analysis River Depth Distribution

River Maturity Stream Detection and Sinuosity

Satellite/Aerial Photos Land Cover/Usage

Water Detection (e.g. Lakes, Retaining Ponds)

Green Infrastructure

Hyperspectral

Radar

Photos

3D Reconstruction

3D Data

Human Preference Modeling

Video

People Detection/Tracking

Large Dynamic Group Behavior

Bee Detection/Tracking

Bee Colony Behavior

Underwater Photos

Color Correction

Image Stitching

Mapping

Event Detection

Species Detection/Counting Reef Changes

Food Supply

Structural Defects

Hazard Modeling

Microscopy Images

Pollen Detection/Classification

Paleoclimate

Evolution Root Tip Tracking

Phenomics

Materials Development

Cell Tracking

Tissue Classification

Renal Failure

Loss of Organ Function

Feedlot Tracking

Disease Detection

Historic Maps

River Meander

Coastline Changes

Documents

NLP

Sentiment Analysis

Regions in Conflict

Handwritten Documents Pre-Digital Datasets

Databases

Web Sites

Publications

Simulations

Ecosystems and Climate Change

M. Dietze, K. McHenry, A. Desai, “Model-data Synthesis and Forecasting Across the Upper Midwest: Partitioning Uncertainty and Environmental Heterogeneity in Ecosystem Carbon,” NSF DBI-1062547, 2011-2014

M. Dietze, K. McHenry, A. Desai, “ABI Development: The PEcAn Project - A Community Platform for Ecological Forecasting,” NSF DBI-1457890, 2015-2019

• Towards regional-scale high resolution estimates of plant life and carbon storage

• Scientific workflow and data assimilation system connecting a variety of models within the Ecology community to a variety of data sources

• Grown to 52 developers over the past 3 years

• NCSA / U. Illinois, BU, Brookhaven National Lab, University of Wisconsin, University of Notre Dame, Utah State, Columbia University, Pacific Northwest National Laboratory, DuPont Pioneer, Exeter College, UK, U. Arizona, Dartmouth College

Ecosystems and Climate Change

• Models: • Ecosystem Demography (ED) • SIPNET • DALEC • …

• Data: • Biofuel Ecophysiological Trait and Yield Database (BETY) • Forest Inventory and Analysis (FIA) • North American Regional Reanalysis (NARR) • North American Carbon Program (NACP) • Food and Agriculture Organization (FAO) • …

Ecosystems and Climate Change

• Data with Unstructured Aspects: • MODIS (Multi-spectral) • Lidar • Palsar (Radar) • Aviris (Airborne Infrared Spectrometer) • Landsat (Images)

• Published results (e.g. tables, figures, plots)

• Currently ingested into BETY by hand

• Settlement Vegetation data • Born Physical

• Paper, Microfiche, Alphanumeric/Color coded on vellum sheets

• Born Digital • PDF, JPEG, GIF, TIFF, XLS, XLSX, CSV, SHP, netCDF, HDF5, XML, GRIB, GRIB2, geoTIFF, DBF, BIL, BIP, ARC, SDTS, SRTM, IMG, UA, LGW, SXW, ODS

• Ad hoc formats: • Spreadsheets • Databases • Services • R Data • Matlab Data

Ecosystems and Climate Change

• Document • Image • Spatial • Tabular • Weather • 3D • Archive, Database, Filesystem, …

“Big Data” • Large quantities of data • Large varieties of data

• “Long-Tail”

Figure: axes labeled “Number of grants” and “Dollars”. Source: http://www.slideshare.net/rheimann04/big-social-data-the-social-turn-in-big-data

The “Long-Tail” of “Big Data”

(Diagram: the data types and applications listed at the start of the deck, from Tabular Data through Simulations.)

The Data

• Diversity of data types • Diversity of file formats

• Ad hoc formats • Obsolete formats • Proprietary formats

• Un-curated data • No metadata • No consistent/useful naming of files/directories

• Unstructured data • Non-text contents

• Potentially large and/or made up of many small files

Processes Over the Data

• Diversity of analyses • Many forms (e.g. scripts, libraries, whole suites, services) • Many languages • Many dependencies

• Leverage towards dealing with unstructured/un-curated data • Analyses churn through data and generate new, often higher level, data • Metadata, data about data

The Problem

• A huge diversity in the data • Types • Formats • Analyses

• A huge diversity of software involved • Scripts • Applications • Libraries • Services

• Dealing with these issues has become part of the scientific workflow: it's time-consuming and redundant, it's difficult, it varies across labs/fields, and it makes reproducibility/reusability difficult!

A Science Driven Data Transformation Service

• Supporting Data Manipulation as a Service • File format conversions • Data set conversions • Database ingestion/dumping • Website scraping

• Supporting Data Analysis as a Service • Low level analyses • Tags and Metadata • Previews • Other derived products

• Relieve the scientific community from having to address this as a first step of their workflows.

Brown Dog

• Data transformations • Conversions and Extractions

• Extensibility • Easy to add new converters/extractors • Encapsulated software & dependencies

https://en.wikipedia.org/wiki/Mongrel

• API • Clients, Scalability, Provenance, Information Loss, Data Movement

• Data Access Proxy (DAP) • An extensible and distributed service for carrying out file format conversions • Move towards an internet/world that is agnostic to file formats • Aid in accessing a file’s contents independent of how it is represented on disk

• Data Tilling Service (DTS) • An extensible and distributed service for the extraction of new data or metadata from a file’s contents • Provide means to query and/or relate collections of data without metadata

• Data Conversion: A transformation on digital data that largely preserves the entirety of the data. Largely reversible.

• Data Extraction: A transformation on digital data which creates new, often higher level, data from the contents of the given data (e.g. tags, signatures). Not reversible.
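The difference between a conversion and an extraction can be illustrated with a small sketch. The functions and sample data below are hypothetical and are not part of Brown Dog: `csv_to_json` and `json_to_csv` stand in for a conversion that round-trips, while `extract_summary` stands in for an extraction whose output cannot be turned back into the input.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Conversion: same records, different representation."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)


def json_to_csv(json_text):
    """Inverse conversion: rebuild the CSV text from the JSON records."""
    rows = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def extract_summary(csv_text):
    """Extraction: derive new, higher-level data (not reversible)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {"columns": list(rows[0].keys()), "row_count": len(rows)}


# Made-up sample data, for illustration only.
original = "site,carbon\nUS-Dk3,12.5\nUS-WCr,9.1\n"
round_trip = json_to_csv(csv_to_json(original))
summary = extract_summary(original)
```

Here the round trip recovers the original text, whereas the summary keeps only column names and a row count. Real conversions are only "largely" reversible: in this sketch, for instance, every value becomes a string in the JSON form.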

Brown Dog

• The Data Access Proxy (DAP) • https://dap.ncsa.illinois.edu/polyglot/api/ • File in, File out

• The Data Tilling Service (DTS) • https://dts.ncsa.illinois.edu/clowder/api/ • File in, JSON out • JSON can contain metadata, tags, signatures, links to derived data products, etc.

Brown Dog

• Services!!! • Programmable interface • Client applications build on top of these services • Backed by computational and storage resources • Place to preserve/reuse software/tools

Clowder

• “Smart Drop Box” • Share, collaborate on datasets • Publishing data • Social curation • Extensible Auto-curation

Architecture

Clowder architecture diagram. Components shown, split between client and server:

• Web Browser, Custom Clients, External Software
• Load balancer (nginx)
• Webapp (Scala/Play), three instances
• Data/Metadata (MongoDB)
• Text Search (Elasticsearch)
• Multimedia Search (Versus)
• Event Bus (RabbitMQ)
• Extractor 1 (Java), Extractor 2 (Python)

Extraction flow diagram. Components: Load Balancer, API Frontend, Job Queue, Extractor, Database. Steps:

1. File
2. Routing
3. File Stored
4. Job Submitted
5. Job Picked Up
6. Read
7. Extract
7.5 Status Updates
8. Write

Log Analysis, Distributed Log
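The numbered extraction flow can be mimicked in a few lines. This is only an in-memory sketch under stated assumptions: a `queue.Queue` stands in for the job queue, a dictionary stands in for the database, and `upload` and `run_extractor` are hypothetical names, not Clowder functions.

```python
import queue

database = {}              # stands in for the data/metadata store
job_queue = queue.Queue()  # stands in for the job queue


def upload(file_id, content):
    """Steps 1-4: receive the file, store it, and submit a job."""
    database[file_id] = {"content": content, "metadata": {},
                         "status": "SUBMITTED"}
    job_queue.put(file_id)


def run_extractor():
    """Steps 5-8: pick up a job, read the file, extract, write back."""
    file_id = job_queue.get()                    # 5. job picked up
    record = database[file_id]
    content = record["content"]                  # 6. read
    record["status"] = "PROCESSING"              # 7.5 status update
    metadata = {"words": len(content.split())}   # 7. extract
    record["metadata"].update(metadata)          # 8. write
    record["status"] = "DONE"
    job_queue.task_done()
    return file_id


upload("file-1", "brown dog tills data")
processed = run_extractor()
```

Because the extractor only talks to the queue and the store, more extractor processes could in principle be added without changing the upload side, which is the property the diagram's event bus is meant to provide.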

Extractions

wordcount.py

Connect (connecting to RabbitMQ):

    extractors.connect_message_bus(extractorName=extractorName,
                                   messageType=messageType,
                                   rabbitmqURL=rabbitmqURL,
                                   rabbitmqExchange=rabbitmqExchange,
                                   processFileFunction=process_file,
                                   checkMessageFunction=check_message)

Work on File:

    def process_file(parameters):
        global extractorName
        inputfile = parameters['inputfile']
        # call actual program
        result = subprocess.check_output(['wc', inputfile], stderr=subprocess.STDOUT)
        (lines, words, characters, filename) = result.split()

Return Metadata:

        extractors.upload_file_metadata(mdata=metadata, parameters=parameters)
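The "Work on File" step can be tried without a message bus. The sketch below is a stand-alone approximation, not the extractor's actual code: `count_file` is a hypothetical helper that computes the same three counts as `wc` in pure Python, and the dictionary keys are chosen here for illustration.

```python
import os
import tempfile


def count_file(path):
    """Return line, word, and byte counts for a file, in the spirit of `wc`."""
    with open(path, "rb") as handle:
        data = handle.read()
    return {
        "lines": data.count(b"\n"),   # wc -l counts newline characters
        "words": len(data.split()),   # whitespace-separated tokens
        "characters": len(data),      # byte count, as plain `wc` reports
    }


# Usage: write a small temporary file and count it.
with tempfile.NamedTemporaryFile("w", newline="\n", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("brown dog\ntills data\n")
    tmp_path = tmp.name

metadata = count_file(tmp_path)
os.remove(tmp_path)
```

Under Python 3, `subprocess.check_output` returns bytes, so the values unpacked from `result.split()` on the slide would need decoding before being used as metadata.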

face.py

    #!/usr/bin/env python
    import pika
    import sys
    import json
    import traceback
    import requests
    import tempfile
    import subprocess
    import os
    import itertools
    import numpy as np
    import cv2
    import time
    import logging
    from config import *
    import pymedici.extractors as extractors

    def main():
        global extractorName, messageType, rabbitmqExchange, rabbitmqURL

        # set logging
        logging.basicConfig(format='%(levelname)-7s : %(name)s - %(message)s', level=logging.WARN)
        …

Polyglot

• Wraps and automates I/O operations within arbitrary software

• Searches for conversion paths across software

• Estimates information loss

• Horizontally scalable
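Searching for a conversion path across software can be pictured as a graph search over formats. The sketch below is an assumption about how such a search might look, not Polyglot's implementation: the `CONVERTERS` registry is hypothetical (the OpenOffice formats are taken from the example script in this deck, while `ImageTool` is made up), and breadth-first search is used simply because it returns a shortest chain.

```python
from collections import deque

# Hypothetical converter registry: name -> (input formats, output formats).
CONVERTERS = {
    "OpenOffice": ({"doc", "odt", "rtf", "txt"},
                   {"doc", "odt", "pdf", "rtf", "txt"}),
    "ImageTool": ({"pdf"}, {"png", "tiff"}),  # made-up entry
}


def find_path(source, target, converters=CONVERTERS):
    """Breadth-first search for a shortest chain of (tool, output format) hops."""
    if source == target:
        return []
    seen = {source}
    frontier = deque([(source, [])])
    while frontier:
        fmt, path = frontier.popleft()
        for name in sorted(converters):
            inputs, outputs = converters[name]
            if fmt not in inputs:
                continue
            for out in sorted(outputs):
                if out in seen:
                    continue
                step = path + [(name, out)]
                if out == target:
                    return step
                seen.add(out)
                frontier.append((out, step))
    return None  # no chain of converters reaches the target format


doc_to_png = find_path("doc", "png")
```

A search that also weighed estimated information loss would need edge weights and a weighted shortest-path algorithm in place of plain breadth-first search.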

Describe (header of a converter script):

    #Application name (Version)
    #File types supported (e.g. document, depth, image, …)
    #Comma separated list of supported input formats
    #Comma separated list of supported output formats

Convert File:

    #Call external application and/or carry out conversion
    …

OpenOffice_convert.ahk

    ;OpenOffice
    ;document
    ;doc, odt, rtf, txt
    ;doc, odt, pdf, rtf, txt

    ;Run program
    Run, "C:\Program Files\OpenOffice.org 3\program\soffice.exe" -headless -norestore "-accept=socket`,host=local…"
    RunWait, "C:\Program Files\OpenOffice.org 3\program\python.exe" "C:\Converters\DocumentConverter.py" "%1%" "%2%"
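The four-line comment header is what lets a wrapper script describe itself. As a rough sketch, assuming the header always has exactly these four lines in this order, a reader for it might look like the following; `parse_header` is a hypothetical helper, not part of Polyglot, and it does not handle the shorter header seen in `_open` scripts.

```python
def parse_header(script_text, comment=";"):
    """Read the four leading comment lines of a converter script."""
    fields = []
    for line in script_text.splitlines():
        line = line.strip()
        if line.startswith("#!"):
            continue  # skip a shebang line, if present
        if not line.startswith(comment):
            break
        fields.append(line[len(comment):].strip())
        if len(fields) == 4:
            break
    if len(fields) < 4:
        raise ValueError("expected four header lines")
    name, kind, inputs, outputs = fields
    return {
        "application": name,
        "type": kind,
        "inputs": [f.strip() for f in inputs.split(",")],
        "outputs": [f.strip() for f in outputs.split(",")],
    }


header = parse_header(
    ";OpenOffice\n"
    ";document\n"
    ";doc, odt, rtf, txt\n"
    ";doc, odt, pdf, rtf, txt\n"
    ";Run program\n"
)
```

Passing `comment="#"` covers scripts such as the R example, whose header uses `#` and follows a shebang line.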

A3DReviewer_open.ahk

    ;Adobe 3D Reviewer (v9)
    ;model
    ;3ds, 3dxml, arc, asm, bdl, catdrawing, catpart, catproduct, catshape, cgr, dae, dlv, exp, hgl, hp, hpgl, hpl, iam, ifc, igs, iges, ipt, jt, kmz, mf1, model, neu, obj, _pd, par, pdf, pkg, plt, prc, prt, prw, psm, pwd, sab, sat, sda, sdac, sdp, sdpc, sds, sdsc, sdw, sdwc, ses, session, sldasm, sldlfp, sldprt, stl, step, stp, u3d, unv, wrl, vrml, x_b, x_t, xas, xpr, xmt, xmt_txt, xv0, xv3

    ;Run program if not already running
    IfWinNotExist, Adobe 3D Reviewer
    {
      Run, C:\Program Files\Adobe\Acrobat 9.0\Acrobat\plug_ins3d\prc\A3DReviewer.exe
      WinWait, Adobe 3D Reviewer
    }

    ;Activate the window
    WinActivate, Adobe 3D Reviewer
    WinWaitActive, Adobe 3D Reviewer

    ;Parse filename root
    arg1 = %1%
    …

PEcAn#ED_convert.R

    #!/usr/bin/Rscript
    #PEcAn
    #data
    #pecan.zip
    #ed.zip
    .libPaths("/home/polyglot/R/library")
    sink(stdout(),type="message")

    # global variables
    overwrite <- TRUE
    verbose <- TRUE

    # get command line arguments
    args <- commandArgs(trailingOnly = TRUE)

    usage <- function(msg) {
      print(msg)
      print(paste0("Usage: ", args[0], " cf-nc_Input_File edOutputDir "))
      print(paste0("Example1: ", args[0], " US-Dk3.pecan.nc US-Dk3.ed.zip [/tmp/watever] "))
      …

API Gateway

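Diagram components: API Gateway, Redis, Crowd, DTS / Clowder, DAP / Polyglot, Versus, DataWolf. Arrows are labeled Request and Response on the client side of the gateway, and Request+ and Response between the gateway and the services behind it.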

API Gateway

Fence

Requests to Fence:

• Get /keys/8d4/token (Headers: Crowd credentials using Basic Auth)
• Get /dap/outputs (Headers: Access token)
• Get /dts/api/extractions/extractors_names (Headers: Access token)

Services behind Fence:

• Crowd: check user credentials
• Redis: add token with TTL
• Polyglot (DAP): Get /outputs (Headers: Polyglot credentials)
• Clowder (DTS): Get /api/extractions/extractors_names (Headers: Clowder credentials)
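The token exchange in this diagram can be sketched in memory. Everything below is an assumption made for illustration: dictionaries stand in for Crowd and Redis, the user and credential values are invented, and `issue_token` and `forward` are hypothetical names rather than Fence functions. The point shown is that clients hold only a short-lived token while the gateway holds the backend credentials.

```python
import secrets
import time

USERS = {"alice": "s3cret"}                      # stands in for Crowd
BACKEND_CREDENTIALS = {"dap": "polyglot-cred",   # held by the gateway only
                       "dts": "clowder-cred"}
_tokens = {}                                     # stands in for Redis: token -> expiry


def issue_token(user, password, ttl=3600, now=None):
    """Check user credentials, then store a new token with a time-to-live."""
    if USERS.get(user) != password:
        raise PermissionError("invalid credentials")
    now = time.time() if now is None else now
    token = secrets.token_hex(16)
    _tokens[token] = now + ttl
    return token


def forward(service, token, now=None):
    """Validate the token and return the credentials used for the backend call."""
    now = time.time() if now is None else now
    expiry = _tokens.get(token)
    if expiry is None or expiry <= now:
        _tokens.pop(token, None)  # drop expired entries
        raise PermissionError("missing or expired token")
    return BACKEND_CREDENTIALS[service]
```

A real deployment would let the key-value store expire tokens itself; the explicit expiry check here only imitates that behavior.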

Support within Data Management Plans

The data analysis/manipulation software developed here will be pushed into the NSF DIBBs: Brown Dog (ACI-1261582) project as data extractors/converters within the DTS and DAP, services providing automatic data annotations/analysis and format conversions as broadly usable internet resources. Brown Dog aims to both provide services and tools to aid in the curation, accessing, and indexing of data as well as to preserve scientific software that might be leveraged for that purpose. As Brown Dog extractors/converters, the capabilities of these tools will be preserved, will take part in an ecosystem of other extraction/conversion tools, and will be leverageable by others within the scientific community, perhaps in very different fields, as well as by the general public.

Milestones

• XSEDE Tutorial • July 18th, Miami • Walk through adding and deploying new tools (i.e. converters, extractors) • Walk through the API and creating a toy client application

• Beta Release • End of this year

Polyglot

Versus Daffodil

http://browndog.ncsa.illinois.edu
