End-to-End eScience

End-to-End eScience: Integrating Query, Workflow, Visualization, and Mashups at an Ocean Observatory

Bill Howe, University of Washington

Harrison Green-Fishback, PSU

David Maier, PSU

Erik Anderson, Utah

Emanuele Santos, Utah

Juliana Freire, Utah

Carlos Scheidegger, Utah

Claudio Silva, Utah

Antonio Baptista, OHSU

Peter Lawson, OSU

Renee Bellinger, OSU

http://dev.pacificfishtrax.org/

04/11/23 Bill Howe, eScience Institute 2

Outline

• eScience
• Brief Demo
• A Domain-Specific Query Algebra
• Mashups

Theory

Experiment

Observation

slide: Ed Lazowska

Theory

Experiment

Observation

Computational Science

eScience


All Science is becoming eScience

Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)

New model: “Download the world” (Data acquired en masse, independent of hypotheses)

But: Acquisition now outpaces analysis

• Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
• Medicine: ubiquitous digital records, MRI, ultrasound
• Oceanography: high-resolution models, cheap sensors, satellites
• Biology: lab automation, high-throughput sequencing

“Increase Data Collection Exponentially in Less Time, with FlowCAM”

Empirical X → Analytical X → Computational X → X-informatics


The long tail is getting fatter:

notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB)

The Long Tail

[Figure: data inventory (y-axis) vs. ordinal position (x-axis)]

Researchers with growing data management challenges but limited resources for cyberinfrastructure

• No dedicated IT staff

• Overreliance on simple tools (e.g., spreadsheets)

[Figure labels: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), Ocean Modelers, Seismologists, Microbiologists, <Spreadsheet users>]

“The future is already here. It’s just not very evenly distributed.” -- William Gibson


eScience Institute at UW

Mission: Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon these techniques and technologies

Strategy:
• Increase the sharing of expertise and facilities
• Bootstrap a cadre of Research Scientists
• Add faculty in key fields
• Make the entire University more effective

Launched July 1 with $1 million in permanent funding from the Washington State Legislature

Sought, and need, $2 million


Facets of Database Research

Query Languages

Storage Management

Visualization; Workflow

Data Integration

Knowledge Extraction, Crawlers

Access Methods

Data Mining, Parallel Programming Models, Provenance

Web Services

complexity-hiding interfaces

My research: customize and optimize for science


The eScience Elephant

eScience

Cloud/Cluster: “Massive data parallelism”

Workflow: “flexibility; web services; integration”

Databases: “query processing; data independence; algebraic optimization; needles in haystacks”

Visualization: “Exploratory science; mapping quantitative data to intuition”

Provenance: “Reproducibility; forensics; sharing/reuse”

Mashups: “Rapid Prototyping; Simplified web programming”


Some eScience Research

Query Algebra for new Data Type

Scientific Workflow Systems

Science Mashups

“Dataspace” systems

[Howe, Freire, Silva, et al. 2008]

[Howe, Green-Fishback, Maier, 2009]

[Howe, Maier, Rayner, Rucker 2008]

[Howe, Maier. 2004, 2005, 2006]

this talk


Outline

• eScience
• Brief Demo
• A Domain-Specific Query Algebra
• Science Mashups


VisTrails for Computation

Spatial Patterns in Fisheries: new techniques, new opportunities for ecosystem-based management

Peter Lawson¹, Lorenzo Cianelli², Bobby Ireland²


Enabling Scientific Discourse between Fishermen and Fisheries Managers





VisTrails for Collaboration

Bill Howe @ CMOP computes salt flux using GridFields

Erik Anderson @ Utah adds vector streamlines and adjusts opacity

Bill Howe @ CMOP adds an isosurface of salinity

Peter Lawson adds discussion of the scientific interpretation


Outline

• eScience
• Brief Demo
• A Domain-Specific Query Algebra
• Mashups


CMOP


Columbia River Estuary

red = high salinity (~34 psu)

blue = fresh water (~0 psu)


Accessing Model Results

• CMOP ocean circulation models run in forecast or hindcast mode
• Models run serially in ~1/5 real time
• On MPICH2, about 10x speedup before overhead dominates
• Forecasts kept for 10 days, hindcasts kept indefinitely (40TB + 25TB/year)
• Access via a GridFields Web Service
• GFServer optimizes and evaluates GF expressions and returns the result


Unstructured Grids

“Unstructured grids” model complex domains at multiple scales simultaneously...

...but complicate processing

Columbia River Estuary

red = high salinity (~34 psu)

blue = fresh water (~0 psu)


“Structured” Grids

Simple: Isomorphic to multidimensional arrays

But: Coastlines are not rectilinear

“Structured grids” do a poor job of modeling complex features and complicate multi-scale analysis.

1) Missing values = wasted effort; higher resolution = wasted effort in areas of low dynamism

2) Data associated with cells at multiple dimensions


Structured grids are easy

The data model (Cartesian products of coordinate variables) immediately implies

• a representation (multidimensional arrays),
• an API (reading and writing subslabs),
• and an efficient implementation (address calculation using array “shape”)


Structured grid example

f(i, j), with coordinate variables x(i) and y(j)

for i in [4:6]:
    for j in [1:4]:
        addr = &f + j*|x| + i

= f[4:6, 1:4]

NetCDF, MATLAB, RasDaMan, SciDB (soon), many more
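The address calculation on this slide can be sketched in Python. This is a minimal illustration, not the API of NetCDF or any other system listed above: the names `flat_offset` and `read_subslab`, the 8 x 5 grid size, and the choice to treat range bounds as half-open are all assumptions made for the example.

```python
def flat_offset(i, j, nx):
    """Offset of f(i, j) in a flat buffer where i (the x index) varies fastest."""
    return j * nx + i


def read_subslab(buf, nx, i_range, j_range):
    """Read f[i0:i1, j0:j1] from a flat buffer using only address arithmetic.

    Bounds are half-open, as in Python slices (an assumption; the slide
    does not say whether its ranges are inclusive).
    """
    (i0, i1), (j0, j1) = i_range, j_range
    return [[buf[flat_offset(i, j, nx)] for j in range(j0, j1)]
            for i in range(i0, i1)]


# Hypothetical 8 x 5 grid: x has 8 coordinates, y has 5.
nx, ny = 8, 5

# Store the value 10*i + j at f(i, j) so that results are easy to read.
buf = [0] * (nx * ny)
for j in range(ny):
    for i in range(nx):
        buf[flat_offset(i, j, nx)] = 10 * i + j

slab = read_subslab(buf, nx, (4, 6), (1, 4))
```

No index structure is consulted: the array shape alone (here, `nx`) determines where every value lives, which is what makes structured grids easy.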


Unstructured Grids

Grid = (E, I)

[Figure: a triangle A with vertices 2, 3, 4 and edges x, y, z]

E0 = {2,3,4}

E1 = {x,y,z}

E2 = {A}

I = {(z,2), (z,4), (A,z), (x,2), (x,3), (A,x), (A,y), (y,4), (y,3)}

…plus the transitive closure
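A minimal Python sketch of this representation follows. It assumes each incidence pair on the slide reads as (cell, lower-dimensional cell on its boundary); the function `transitive_closure` and the variable names are illustrative and are not taken from the GridFields implementation.

```python
# Cells grouped by dimension, as on the slide: nodes, edges, one triangle.
cells = {0: {"2", "3", "4"}, 1: {"x", "y", "z"}, 2: {"A"}}

# Incidence pairs from the slide, read as (cell, cell on its boundary).
incidence = {
    ("z", "2"), ("z", "4"), ("A", "z"),
    ("x", "2"), ("x", "3"), ("A", "x"),
    ("A", "y"), ("y", "4"), ("y", "3"),
}


def transitive_closure(pairs):
    """Return the smallest transitive relation containing `pairs`."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new


closed = transitive_closure(incidence)

# The closure adds the triangle-to-node pairs, such as ("A", "2").
nodes_of_A = {low for (high, low) in closed
              if high == "A" and low in cells[0]}
```

Unlike the structured case, nothing about the shape of the data implies the topology; the incidence relation has to be stored and queried explicitly.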


Subsetting

Full grid: Eastern Pacific

Subset: mouth of Columbia River

color: bathymetry

Washington

Oregon

California


Correctness properties preserved

Grid is well-supported

(no ragged edges)


Subset semantics

[Figure: a grid whose cells are labeled 0 or 1, shown in four panels: Input, Simple, Drop, “Exact”]

Cut everything labeled “0”. What should be kept?


What about Visualization Libs?

• Different C++ classes, each dependent on data characteristics
• Changes to data characteristics require changes to the program
• Logical equivalences obscured
• No data independence

We want:

In VTK: vtkExtractGeometry, vtkThreshold, vtkExtractGrid, vtkExtractVOI, vtkThresholdPoints


GridField Data Model

A GridField with two attributes bound to the 2-cells and four attributes bound to the 0-cells

Attributes bound to the 0-cells:

x      y      salt   temp
13.8   10.6   29.4   12.1
13.9   9.4    29.8   12.5
14.3   9.0    28.0   12.0
13.4   9.0    30.1   13.2

Attributes bound to the 2-cells:

flux   area
11.5   3.3
13.9   5.5
13.1   4.5
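One way to picture this data model in code is a grid whose cells of each dimension carry their own tuples of attributes. The sketch below is an assumption-laden illustration: the `GridField` class, its `bind` method, and the cell identifiers `n0`..`n3` and `t0`..`t2` are hypothetical, and only the attribute names and values come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class GridField:
    """A grid plus tuples of attributes bound to the cells of each dimension."""
    cells: dict                                  # dimension -> list of cell ids
    attrs: dict = field(default_factory=dict)    # dimension -> {cell id: {name: value}}

    def bind(self, dim, names, rows):
        """Attach one row of attribute values to each cell of dimension `dim`."""
        if len(rows) != len(self.cells[dim]):
            raise ValueError("need exactly one row per cell")
        bound = self.attrs.setdefault(dim, {})
        for cell, row in zip(self.cells[dim], rows):
            bound.setdefault(cell, {}).update(zip(names, row))


# Hypothetical cell ids; the attribute values are the ones listed above.
gf = GridField(cells={0: ["n0", "n1", "n2", "n3"], 2: ["t0", "t1", "t2"]})
gf.bind(0, ["x", "y", "salt", "temp"],
        [(13.8, 10.6, 29.4, 12.1), (13.9, 9.4, 29.8, 12.5),
         (14.3, 9.0, 28.0, 12.0), (13.4, 9.0, 30.1, 13.2)])
gf.bind(2, ["flux", "area"], [(11.5, 3.3), (13.9, 5.5), (13.1, 4.5)])
```

Keeping attributes keyed by dimension is what lets node data (salt, temp) and cell data (flux, area) coexist on the same grid.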


GridField Operations

• Lifted set operations: Union, Intersection, Cross Product
• Scan/Bind: Read a grid/attribute
• Restrict: Remove cells that do not satisfy a predicate
• Accrete: Grow a grid by adding neighbors of cells
• Regrid: Map the data of one grid onto another


Usage Example (1)

H = Scan(context, "H")

rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)

The first argument of Restrict is the predicate; the second is the dimension of the cells it applies to.

[Figure: H and rH; color: bathymetry]


Usage Example (2)

H = Scan(context, "H")

rH = Restrict("h<500", 0, H)

[Figure: H and rH; color: bathymetry]
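The behavior of Restrict in these two examples can be approximated with a short Python sketch. It is not the GridFields implementation: the predicate is a Python callable rather than a string, the dictionary layout is invented for the example, and the rule for higher-dimensional cells (drop any cell that lost a node) is only one of the possible semantics raised on the “Subset semantics” slide.

```python
def restrict(predicate, dim, gf):
    """Keep the `dim`-cells whose attributes satisfy `predicate`.

    `gf` is a dict with "cells" (dimension -> {cell id: attribute dict}) and
    "nodes_of" (cell id -> set of node ids) for cells of dimension > 0.
    When nodes are removed, any higher-dimensional cell that lost a node is
    dropped too, so no cell is left without its supporting nodes.
    """
    cells = {k: dict(v) for k, v in gf["cells"].items()}
    cells[dim] = {c: a for c, a in cells[dim].items() if predicate(a)}
    if dim == 0:
        kept = set(cells[0])
        for k in cells:
            if k > 0:
                cells[k] = {c: a for c, a in cells[k].items()
                            if gf["nodes_of"][c] <= kept}
    nodes_of = {c: n for c, n in gf["nodes_of"].items()
                if any(c in cells[k] for k in cells if k > 0)}
    return {"cells": cells, "nodes_of": nodes_of}


# Hypothetical horizontal grid H: four nodes with depth h, two triangles.
H = {
    "cells": {
        0: {"a": {"h": 120.0}, "b": {"h": 450.0},
            "c": {"h": 800.0}, "d": {"h": 300.0}},
        2: {"t1": {}, "t2": {}},
    },
    "nodes_of": {"t1": {"a", "b", "d"}, "t2": {"b", "c", "d"}},
}

# Counterpart of Restrict("h<500", 0, H), with the predicate as a callable.
rH = restrict(lambda attrs: attrs["h"] < 500, 0, H)
```

Here node `c` fails the predicate, so triangle `t2`, which depends on it, is removed as well; the input `H` is left unchanged.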


Longer Example

Inputs: H : (x,y,b) and V : (z)

1) Cross product: H ⊗ V
2) r(z>b): r(H ⊗ V)
3) b(s): b(r(H ⊗ V))
4) r(region): r(b(r(H ⊗ V)))
5) render


Optimization

Before: H(x,y,b) ⊗ V(z), then r(z>b), b(s), r(region)

After: r(x,y) is applied to H(x,y,b) and r(z) is applied to V(z) before the cross product, then r(z>b), b(s)

*Howe, Maier, Algebraic Manipulation of Scientific Datasets. VLDB Journal, 14:4, 2005
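The rewrite on the Optimization slide, where the restrictions on (x,y) and on z are applied to H and V before they are combined, can be checked on toy data. The sketch below models gridfields as plain lists of tuples, omits the bind step b(s), and uses made-up values and predicate bounds; it illustrates the algebraic equivalence only, not the GridFields optimizer.

```python
from itertools import product

# Hypothetical tiny inputs: H carries (x, y, b), V carries (z).
H = [{"x": x, "y": y, "b": b}
     for x, y, b in [(0, 0, 5), (1, 0, 20), (2, 1, 35), (3, 1, 50)]]
V = [{"z": z} for z in (0, 10, 20, 30, 40)]


def cross(left, right):
    """Cross product: merge every tuple of `left` with every tuple of `right`."""
    return [{**l, **r} for l, r in product(left, right)]


def restrict(pred, rel):
    """Keep the tuples of `rel` that satisfy `pred`."""
    return [t for t in rel if pred(t)]


def in_xy(t):
    """Part of the region predicate that mentions only x (made-up bounds)."""
    return 1 <= t["x"] <= 2


def in_z(t):
    """Part of the region predicate that mentions only z (made-up bound)."""
    return t["z"] >= 10


def z_gt_b(t):
    """The slide's r(z>b); it needs attributes from both inputs."""
    return t["z"] > t["b"]


# Original plan: build the full product, then filter.
original = restrict(lambda t: in_xy(t) and in_z(t),
                    restrict(z_gt_b, cross(H, V)))

# Rewritten plan: push the separable parts of the region below the product.
small = cross(restrict(in_xy, H), restrict(in_z, V))
optimized = restrict(z_gt_b, small)
```

Both plans return the same tuples, but the rewritten plan builds a product of 8 tuples instead of 20. Only r(z>b) has to stay above the cross product, because it refers to attributes from both inputs.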


Transect (Vertical Slice)

P


Transect: Bad Plan

Inputs: H(x,y,b), V(z), P

Operators: r(z>b), b(s), regrid onto P ⊗ V

1) Construct full-size 3D grid
2) Construct 2D transect grid
3) Spatial Join 1) with 2)


Transect: Optimized Plan

Inputs: H(x,y,b), V(z), P; transect grid: P ⊗ V

Operators, in order: regrid, b(s), regrid

1) Find 2D cells containing points
2) Create “stacks” of 2D cells carrying data
3) Create 2D transect grid
4) Spatial Join 2) with 3)


1) Find cells containing points in P

2) Construct “stacks” of cells

4) Join 2) with 3)


Transect: Results

[Bar chart: running time in secs (axis 0 to 45) for vtk(3D), interpolate, simple, interp_o, simple_o; 800 MB dataset]

simple = nearest neighbor interpolation

*_o = optimized by restricting to the region of interest


Ongoing work

NSF Cluster Exploratory Award: Where the Ocean Meets the Cloud: Ad Hoc Longitudinal Analysis of Massive Mesh Data

• Partnership between NSF, IBM, Google
• Data-intensive computing: massive queries, not massive simulations

To “Cloud-Enable” GridFields and VisTrails

• Goal: 10+-year climatologies at interactive speeds
• Parallel implementations of GridField operators via Hadoop (and Dryad!)
• Provenance, repeatability, visualization via VisTrails
• Connect rich desktop experience

Co-PIs from University of Utah: Claudio Silva and Juliana Freire


Outline

• eScience
• Brief Demo
• A Domain-Specific Query Algebra
• Scientific Mashups


Why Mashups?

Jim Gray: # of datasets scales as N^2

• Each pairwise comparison generates a new dataset

Corollary: # of apps scales as N^2

• Every pairwise comparison motivates a new mashup

To keep up, we need to entrain new programmers, make existing programmers more productive, or both


Satellite Images + Crime Incidence Reports


Twitter Feed + Flickr Stream


Mashup Frameworks

A bottom-up approach

Start with a GPL, add
• Visual programming
• Interactive type checking

Exploit a corpus of previous examples
• bootstrapping a mashup
• mashup “autocomplete”
• emit warnings





Scientific Mashup Characteristics

• Turn over more data per operation
• Involve subtle visualizations
• Must serve a diverse audience


A Model for Scientific Mashups

The “Data Product” is the currency of scientific communication with the public

Scientists are already adept at crafting them (consider PowerPoint slides and figures)

We take a top-down approach:
• Take a static data product ensemble,
• endow it with interactivity,
• publish it online,
• allow others to repurpose it at runtime


Data Product Ensemble


Mashup


CTD: Conductivity, Temperature, Depth


Sampling


Event Detection: Red Water


CTD Cast


Flowthrough


Mashup



Key Concepts

A mashup is a synchronized ensemble of data products

A data product is a mashable that has been adapted for a particular purpose

A mashable is an arbitrarily-complex computation that returns a relation

An adaptor displays the relation to the user and returns a subset

All adapted mashables accept input. Hence, user controls are modeled as adapted mashables, just like “visual” data products
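These definitions map naturally onto function composition. The following Python sketch is one possible reading, with hypothetical names throughout (`adapt`, `casts`, `first_choice`, and the toy cruise table); it is not the interface of the authors' mashup system.

```python
from typing import Callable, Dict, List

Relation = List[Dict[str, object]]
Env = Dict[str, object]


def adapt(mashable: Callable[[Env], Relation],
          adaptor: Callable[[Relation], Relation]) -> Callable[[Env], Relation]:
    """Combine a mashable with an adaptor to form a data product."""
    def data_product(env: Env) -> Relation:
        relation = mashable(env)     # arbitrary computation returning a relation
        return adaptor(relation)     # display it and return the chosen subset
    return data_product


def casts(env: Env) -> Relation:
    """Hypothetical mashable: the casts recorded for the cruise in `env`."""
    table = [{"cruise": "c1", "cast": 1}, {"cruise": "c1", "cast": 2},
             {"cruise": "c2", "cast": 7}]
    return [row for row in table if row["cruise"] == env["cruise"]]


def first_choice(relation: Relation) -> Relation:
    """Hypothetical adaptor standing in for a drop-down list.

    A real adaptor would render the rows and wait for the user; this one
    simply returns the first row as the selected subset.
    """
    return relation[:1]


cast_picker = adapt(casts, first_choice)
selection = cast_picker({"cruise": "c1"})
```

Because a user control such as `cast_picker` has the same shape as any visual data product (it takes input and returns a subset of a relation), the same composition machinery can wire both kinds together.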


Adapted Mashables


Data Flow Graph


Inferring Data Flow

provides: {ABC}

requires: {AB}


Inferring Data Flow

provides: {AC}

requires: {AB}

provides: {B}


Inferring Data Flow

provides: {AC}

requires: {AB}

underspecified mashup

Solution:
1) use defaults
2) root environment
3) hand-specified parameter


Inferring Data Flow

provides: {AB}

requires: {AB}

provides: {B}

overspecified mashup

Solution: Break ties:
1) Prefer nodes on longer paths
2) Use layout information
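The four “Inferring Data Flow” cases can be reproduced with a small matching routine over the provides and requires sets. This is a hedged sketch: the function `infer_data_flow`, the node names `p`, `q`, and `plot`, and the result format are assumptions, and the tie-breaking and default-value solutions listed on the slides are reported as problems rather than implemented.

```python
def infer_data_flow(nodes):
    """Match each required attribute to the nodes that provide it.

    `nodes` maps a node name to {"provides": set, "requires": set}.
    Returns (edges, underspecified, overspecified), where edges are
    (provider, consumer, attribute) triples for unambiguous matches.
    """
    edges, under, over = [], [], []
    for consumer, spec in nodes.items():
        for attr in sorted(spec["requires"]):
            providers = sorted(name for name, other in nodes.items()
                               if name != consumer and attr in other["provides"])
            if not providers:
                under.append((consumer, attr))             # nobody provides it
            elif len(providers) > 1:
                over.append((consumer, attr, providers))   # a tie to break
            else:
                edges.append((providers[0], consumer, attr))
    return edges, under, over


# Hypothetical consumer that requires {A, B}, as on the slides.
plot = {"provides": set(), "requires": {"A", "B"}}

# provides {AC} and provides {B}: every requirement has exactly one source.
ok = infer_data_flow({"p": {"provides": {"A", "C"}, "requires": set()},
                      "q": {"provides": {"B"}, "requires": set()},
                      "plot": plot})

# provides {AC} only: B has no source, so the mashup is underspecified.
under = infer_data_flow({"p": {"provides": {"A", "C"}, "requires": set()},
                         "plot": plot})

# provides {AB} and provides {B}: B has two sources, so it is overspecified.
over = infer_data_flow({"p": {"provides": {"A", "B"}, "requires": set()},
                        "q": {"provides": {"B"}, "requires": set()},
                        "plot": plot})
```

A fuller version would resolve the `under` entries from defaults, the root environment, or hand-specified parameters, and the `over` entries by path length or layout, as the slides propose.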


Audience-Tailored Mashups

K12 students

Experts


Conclusions and Future Directions

We want to augment scientists, not programmers
• Requires limiting expressiveness -- not yet clear where to draw the line

More work on semi-automatically tailoring a mashup at runtime
• Automatically insert “context products”
  - See salinity, add a salinity colorbar
  - See a time, add a tide chart
  - See a location, add a map
• Re-skin data products: “Dashboard-style” vs. “Wizard-style” apps


http://escience.washington.edu

(retooled website coming soon)


Comparison

System | Data Model | Operations | Services
GPL | * | * | Typing, maybe
Workflow | * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism
Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, data parallelism
MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance
MS Dryad | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
MPI | Arrays/Matrices | 70+ ops | data parallelism, full control


Mashups serve a diverse audience

student

public

scientist


Computational Science

Theory, Experiment, Observation, Simulation (in silico), Analysis (in ferro)

Data acquisition is hypothesis-driven

Data acquisition is technology-driven


Motivation

Explore architectures blending techniques from

• mashups (rapid prototyping),

• visualization (interactivity, richness),

• workflow (data integration, provenance),

• databases (optimization, data independence)

to answer science questions at an Ocean Observatory


Visualization

PLOT3D, GDAL, ShapeFile, OGC, .obj, .vtk, netCDF, HDF5, FITS, others

Optimized for “throwing datasets” and interactivity

Declarative query, interoperability, repeatability generally lacking

Source: MayaVi website

Source: http://pogl.wordpress.com/2007/06/


Workflow

• Emphasis on integration, web services, flexibility
• Unconstrained boxes-and-arrows: any operation on any data type
• Very expressive, but limited opportunities for static reasoning:
  - Type safety
  - Task parallelism
  - Cache safety
  - Optimization via rewrite rules
  - Result size / execution time estimation
  - Transparent data parallelism
  - Platform portability

To move the earth, you need somewhere to stand


Databases

Pre-relational DBMS brittleness: if your data changed, your application broke.

Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.

physical data independence

logical data independence

files and pointers

relations

views

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”

Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independent of physical data representation


Heterogeneity also drives costs

[Figure: # of bytes (y-axis) vs. # of data types (x-axis)]

CERN (~15PB/year; particle interactions)

LSST (~100PB; images, objects)

PanSTARRS (~40PB; images, objects, trajectories)

OOI (~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)

SDSS (~100TB; images, objects)

Biologists (~10TB; sequences, alignments, annotations, BLAST hits, metadata, phylogeny trees)


The eScience Elephant

“Like a snake”

“Like a hand fan”

“Like a wall”

“Like tree trunk”

“Like a spear”

“Like a rope”


Technology

End-to-End eScience