Invited talk at Microsoft Research, Spring 2009
End-to-End eScience: Integrating Query, Workflow, Visualization, and Mashups at an Ocean Observatory
Bill Howe, University of Washington
Harrison Green-Fishback, PSU
David Maier, PSU
Erik Anderson, Utah
Emanuele Santos, Utah
Juliana Freire, Utah
Carlos Scheidegger, Utah
Claudio Silva, Utah
Antonio Baptista, OHSU
Peter Lawson, OSU
Renee Bellinger, OSU
http://dev.pacificfishtrax.org/
04/11/23 Bill Howe, eScience Institute 2
Outline
eScience
Brief Demo
A Domain-Specific Query Algebra
Mashups
Theory
Experiment
Observation
Computational Science
eScience
slide: Ed Lazowska
All Science is becoming eScience
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, independent of hypotheses)
But: Acquisition now outpaces analysis
Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Medicine: ubiquitous digital records, MRI, ultrasound
Oceanography: high-resolution models, cheap sensors, satellites
Biology: lab automation, high-throughput sequencing
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
Empirical X Analytical X Computational X X-informatics
The long tail is getting fatter:
notebooks become spreadsheets (MB),
spreadsheets become databases (GB),
databases become clusters (TB),
clusters become clouds (PB)
The Long Tail
[Figure: data inventory vs. ordinal position. Labeled points: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), Ocean Modelers, Seismologists, Microbiologists, <Spreadsheet users>]
Researchers with growing data management challenges but limited resources for cyberinfrastructure
• No dedicated IT staff
• Overreliance on simple tools (e.g., spreadsheets)
“The future is already here. It’s just not very evenly distributed.” -- William Gibson
eScience Institute at UW
Mission: Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon these techniques and technologies
Strategy:
• Increase the sharing of expertise and facilities
• Bootstrap a cadre of Research Scientists
• Add faculty in key fields
• Make the entire University more effective
Launched July 1 with $1 million in permanent funding from the Washington State Legislature
Sought, and need, $2 million
Web Services
Facets of Database Research
Query Languages
Storage Management
Visualization; Workflow
Data Integration
Knowledge Extraction, Crawlers
Access Methods
Data Mining, Parallel Programming Models, Provenance
complexity-hiding interfaces
My research: customize and optimize for science
The eScience Elephant
eScience
Workflow: “flexibility; web services; integration”
Databases: “query processing; data independence; algebraic optimization; needles in haystacks”
Visualization: “Exploratory science; mapping quantitative data to intuition”
Provenance: “Reproducibility; forensics; sharing/reuse”
Cloud/Cluster: “Massive data parallelism”
Mashups: “Rapid Prototyping; Simplified web programming”
Some eScience Research
Query Algebra for new Data Type
Scientific Workflow Systems
Science Mashups
“Dataspace” systems
[Howe, Freire, Silva, et al. 2008]
[Howe, Green-Fishback, Maier, 2009]
[Howe, Maier, Rayner, Rucker 2008]
[Howe, Maier. 2004, 2005, 2006]
this talk
Outline
eScience
Brief Demo
A Domain-Specific Query Algebra
Science Mashups
VisTrails for Computation
Spatial Patterns in Fisheries: new techniques, new opportunities for ecosystem-based management
Peter Lawson¹, Lorenzo Cianelli², Bobby Ireland²
Enabling Scientific Discourse between Fishermen and Fisheries Managers
VisTrails for Collaboration
Bill Howe @ CMOP computes salt flux using GridFields
Erik Anderson @ Utah adds vector streamlines and adjusts opacity
Bill Howe @ CMOP adds an isosurface of salinity
Peter Lawson adds discussion of the scientific interpretation
Outline
eScience
Brief Demo
A Domain-Specific Query Algebra
Mashups
CMOP
Columbia River Estuary
red = high salinity (~34psu)
blue = fresh water (~0 psu)
Accessing Model Results
CMOP ocean circulation models run in forecast or hindcast mode
• Models run serially in ~1/5 real time
• On MPICH2, about 10x speedup before overhead dominates
• Forecasts kept for 10 days, hindcasts kept indefinitely (40TB + 25TB/year)
Access via a GridFields Web Service
• GFServer optimizes and evaluates GF expressions and returns the result
Unstructured Grids
“unstructured grids” model complex domains at multiple scales simultaneously
red = high salinity (~34psu)
blue = fresh water (~0 psu)
Columbia River Estuary
…but complicate processing
“Structured” Grids
“structured grids” do a poor job of modeling complex features and complicate multi-scale analysis.
But: Coastlines are not rectilinear
[Figure: structured grid over a coastline, with some cells marked x]
1) Missing values = wasted effort
Higher resolution = wasted effort in areas of low dynamism
2) Data associated with cells at multiple dimensions
Simple: Isomorphic to multidimensional arrays
Structured grids are easy
The data model (Cartesian products of coordinate variables)
immediately implies a representation (multidimensional arrays),
an API (reading and writing subslabs),
and an efficient implementation (address calculation using array “shape”)
Structured grid example
f(i, j), with coordinate variables x(i) and y(j)
Reading the subslab f[4:6, 1:4]:
for i in [4:6]:
    for j in [1:4]:
        addr = &f + j*|x| + i
NetCDF, MATLAB, RasDaMan, SciDB (soon), many more
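The address calculation on this slide can be sketched in Python. This is only an illustration under stated assumptions: a flat list stands in for memory, indices are 0-based, slice bounds are half-open, and i varies fastest so that the offset of f(i, j) is j*|x| + i. The names `offset`, `subslab`, `nx`, `ny`, and `flat` are hypothetical.

```python
# Sketch: a structured grid stored as a flat buffer, addressed via its shape.
# Assumption: i varies fastest, so the offset of f(i, j) is j*|x| + i.

def offset(i, j, nx):
    """Flat offset of f(i, j) when the x dimension has nx entries."""
    return j * nx + i

def subslab(flat, nx, i_range, j_range):
    """Read the subslab f[i0:i1, j0:j1] (half-open ranges) from the flat buffer."""
    i0, i1 = i_range
    j0, j1 = j_range
    return [[flat[offset(i, j, nx)] for j in range(j0, j1)]
            for i in range(i0, i1)]

nx, ny = 8, 5
# Store f(i, j) = 10*i + j so that each value reveals its own coordinates.
flat = [0] * (nx * ny)
for j in range(ny):
    for i in range(nx):
        flat[offset(i, j, nx)] = 10 * i + j

slab = subslab(flat, nx, (4, 6), (1, 4))
print(slab)
```

No index structure is needed: the shape alone determines where every value lives, which is what makes structured grids easy.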
Unstructured Grids
Grid: (E, I)
[Figure: triangle A with vertices 2, 3, 4 and edges x, y, z]
E0 = {2, 3, 4}
E1 = {x, y, z}
E2 = {A}
I = { z–2, z–4, A–z, x–2, x–3, A–x, A–y, y–4, y–3 }
…plus the transitive closure
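A minimal sketch of this representation in Python, using the triangle from the slide. It assumes each incidence pair is written as (lower-dimensional cell, higher-dimensional cell); the slide does not fix an orientation, and the names `cells`, `incidence`, and `transitive_closure` are hypothetical.

```python
# Sketch: the triangle grid as cells grouped by dimension plus an incidence relation.
# Assumption: pairs are oriented (lower-dimensional cell, higher-dimensional cell).

cells = {
    0: {"2", "3", "4"},   # E0: vertices
    1: {"x", "y", "z"},   # E1: edges
    2: {"A"},             # E2: the triangle
}

incidence = {
    ("2", "x"), ("3", "x"), ("x", "A"),
    ("3", "y"), ("4", "y"), ("y", "A"),
    ("2", "z"), ("4", "z"), ("z", "A"),
}

def transitive_closure(pairs):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are both present."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new

closed = transitive_closure(incidence)
# After closure, the triangle is directly related to its vertices.
vertices_of_A = sorted(a for (a, b) in closed if b == "A" and a in cells[0])
print(vertices_of_A)
```

Unlike the structured case, the topology has to be stored explicitly, which is the source of the extra processing complexity.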
Subsetting
Full grid: Eastern Pacific
Subset: mouth of Columbia River
color: bathymetry
Washington
Oregon
California
Correctness properties preserved
Grid is well-supported
(no ragged edges)
Subset semantics
[Figure: three panels, Input, Simple Drop, and “Exact”, each showing grid cells labeled 0 or 1]
Cut everything labeled “0”. What should be kept?
What about Visualization Libs?
Different C++ classes, each dependent on data characteristics
Changes to data characteristics require changes to the program
Logical equivalences obscured
No data independence
We want:
in VTK: vtkExtractGeometry, vtkThreshold, vtkExtractGrid, vtkExtractVOI, vtkThresholdPoints
GridField Data Model
A GridField with two attributes bound to the 2-cells and four attributes bound to the 0-cells
Attributes bound to the 0-cells:
x     y     salt  temp
13.8  10.6  29.4  12.1
13.9  9.4   29.8  12.5
14.3  9.0   28.0  12.0
13.4  9.0   30.1  13.2
Attributes bound to the 2-cells:
flux  area
11.5  3.3
13.9  5.5
13.1  4.5
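One way to picture this data model is as a grid plus one attribute table per cell dimension. The Python sketch below is an illustration, not the GridFields implementation: the class name, its methods, and the cell ids are hypothetical, while the attribute values are the ones on the slide.

```python
# Sketch: a gridfield as a grid plus per-dimension attribute tables.
# The class, its methods, and the cell ids are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class GridField:
    cells: dict                                  # dimension -> list of cell ids
    attrs: dict = field(default_factory=dict)    # dimension -> {name -> values}

    def bind(self, dim, name, values):
        """Attach one attribute to the cells of a dimension, one value per cell."""
        if len(values) != len(self.cells[dim]):
            raise ValueError("one value per cell is required")
        self.attrs.setdefault(dim, {})[name] = list(values)

    def row(self, dim, index):
        """Attribute tuple of the index-th cell of the given dimension."""
        return {name: vals[index] for name, vals in self.attrs.get(dim, {}).items()}

gf = GridField(cells={0: ["n0", "n1", "n2", "n3"], 2: ["t0", "t1", "t2"]})
gf.bind(0, "x", [13.8, 13.9, 14.3, 13.4])
gf.bind(0, "y", [10.6, 9.4, 9.0, 9.0])
gf.bind(0, "salt", [29.4, 29.8, 28.0, 30.1])
gf.bind(0, "temp", [12.1, 12.5, 12.0, 13.2])
gf.bind(2, "flux", [11.5, 13.9, 13.1])
gf.bind(2, "area", [3.3, 5.5, 4.5])

print(gf.row(0, 0))
print(gf.row(2, 1))
```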
GridField Operations
Lifted set operations: Union, Intersection, Cross Product
Scan/Bind: Read a grid/attribute
Restrict: Remove cells that do not satisfy a predicate
Accrete: Grow a grid by adding neighbors of cells
Regrid: Map the data of one grid onto another
Usage Example (1)
H = Scan(context, "H")
rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
[Figure: H and rH; callouts mark the first argument of Restrict as the predicate and the second as the dimension]
color: bathymetry
Usage Example (2)
H = Scan(context, "H")
rH = Restrict("h<500", 0, H)
[Figure: H and rH]
color: bathymetry
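The behavior of a Restrict-like operator can be sketched on a toy triangle mesh. This is not the GridFields API: the function `restrict_nodes` is hypothetical, predicates are Python callables rather than strings, and the sketch assumes one particular subset semantics, in which a triangle survives only if all of its vertices survive.

```python
# Sketch of a Restrict-like operator on a toy triangle mesh.
# Assumptions: predicates are callables, and a 2-cell is kept only when all
# of its 0-cells are kept (one possible semantics, chosen for illustration).

def restrict_nodes(nodes, triangles, predicate):
    """Keep the 0-cells satisfying the predicate and the 2-cells they fully support."""
    kept_nodes = {nid: attrs for nid, attrs in nodes.items() if predicate(attrs)}
    kept_triangles = {tid: verts for tid, verts in triangles.items()
                      if all(v in kept_nodes for v in verts)}
    return kept_nodes, kept_triangles

# Four nodes carrying a depth attribute h, and two triangles sharing an edge.
nodes = {
    1: {"x": 0.0, "y": 0.0, "h": 120.0},
    2: {"x": 1.0, "y": 0.0, "h": 300.0},
    3: {"x": 0.0, "y": 1.0, "h": 450.0},
    4: {"x": 1.0, "y": 1.0, "h": 900.0},
}
triangles = {"a": (1, 2, 3), "b": (2, 3, 4)}

r_nodes, r_triangles = restrict_nodes(nodes, triangles, lambda n: n["h"] < 500)
print(sorted(r_nodes), sorted(r_triangles))
```

Dropping triangle "b" together with node 4 keeps the result free of cells that reference missing vertices.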
Longer Example
H : (x,y,b)
V : (z)
H ⊗ V (cross product)
r(z>b): r(H ⊗ V)
b(s): b(r(H ⊗ V))
r(region): r(b(r(H ⊗ V)))
render
Optimization
Original plan: H(x,y,b) ⊗ V(z), then r(z>b), b(s), r(region)
Optimized plan: r(x,y) on H(x,y,b) and r(z) on V(z), then ⊗, r(z>b), b(s)
*Howe, Maier, Algebraic Manipulation of Scientific Datasets. VLDB Journal, 14:4, 2005
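The kind of rewrite involved here, applying a region restriction to each factor before a cross product instead of after it, can be illustrated with a small Python sketch. Grids are simplified to plain lists of tuples, the region bounds are made-up values, and the data-dependent restriction r(z>b) is left out because it relates attributes of both factors.

```python
# Sketch: restricting after a cross product vs. pushing restrictions below it.
# Grids are simplified to lists of tuples; the region bounds are made up.
from itertools import product

H = [(x, y) for x in range(10) for y in range(10)]   # horizontal cells (x, y)
V = list(range(20))                                  # vertical levels z

def in_xy(cell):
    x, y = cell
    return 2 <= x < 5 and 3 <= y < 6

def in_z(z):
    return z < 8

# Plan 1: build the full product, then restrict to the region.
plan1 = [(h, z) for h, z in product(H, V) if in_xy(h) and in_z(z)]

# Plan 2: restrict each factor first, then take the much smaller product.
plan2 = list(product([h for h in H if in_xy(h)], [z for z in V if in_z(z)]))

print(plan1 == plan2, len(H) * len(V), len(plan2))
```

Both plans return the same cells, but the first materializes every cell of the full product while the second only ever builds the cells in the region of interest.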
Transect (Vertical Slice)
P
Transect: Bad Plan
[Plan diagram: operators H(x,y,b), V(z), r(z>b), b(s), P, P ⊗ V, and regrid; ⊗ = cross product]
1) Construct full-size 3D grid
2) Construct 2D transect grid
3) Spatial Join 1) with 2)
Transect: Optimized Plan
[Plan diagram: operators H(x,y,b), P, V(z), P ⊗ V, b(s), and two regrid steps; ⊗ = cross product]
1) Find 2D cells containing points
2) Create “stacks” of 2D cells carrying data
3) Create 2D transect grid
4) Spatial Join 2) with 3)
1) Find cells containing points in P
1) Find cells containing points in P
2) Construct “stacks” of cells
4) Join 2) with 3)
Transect: Results
[Bar chart: running time in secs (axis 0 to 45) for vtk(3D), interpolate, simple, interp_o, simple_o]
800 MB dataset
simple = nearest neighbor interpolation
*_o = optimized by restricting to the region of interest
Ongoing work
NSF Cluster Exploratory Award: Where the Ocean Meets the Cloud: Ad Hoc Longitudinal Analysis of Massive Mesh Data
• Partnership between NSF, IBM, Google
• Data-intensive computing: massive queries, not massive simulations
To “Cloud-Enable” GridFields and VisTrails
• Goal: 10+-year climatologies at interactive speeds
• Parallel implementations of GridField operators via Hadoop (and Dryad!)
• Provenance, repeatability, visualization via VisTrails
• Connect rich desktop experience
Co-PIs from University of Utah: Claudio Silva and Juliana Freire
Outline
eScience
Brief Demo
A Domain-Specific Query Algebra
Scientific Mashups
Why Mashups?
Jim Gray: # of datasets scales as N^2
• Each pairwise comparison generates a new dataset
Corollary: # of apps scales as N^2
• Every pairwise comparison motivates a new mashup
To keep up, we need to entrain new programmers, make existing programmers more productive, or both
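The quadratic growth can be checked with a few lines of Python. The dataset names below are made up for illustration; the count of unordered pairs among N items is N(N-1)/2, which grows as N^2.

```python
# Sketch: counting pairwise comparisons among N datasets.
# The dataset names are hypothetical placeholders.
from itertools import combinations

def pairwise_comparisons(datasets):
    """Every unordered pair of datasets is a candidate comparison."""
    return list(combinations(datasets, 2))

names = ["salinity", "temperature", "tides", "catch"]
pairs = pairwise_comparisons(names)
print(len(pairs))

# The same count in closed form, N*(N-1)/2, for larger N.
for n in (10, 100, 1000):
    print(n, n * (n - 1) // 2)
```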
Satellite Images + Crime Incidence Reports
Twitter Feed + Flickr Stream
Mashup Frameworks
A bottom up approach
Start with a GPL, add
• Visual programming
• Interactive type checking
Exploit a corpus of previous examples
• bootstrapping a mashup
• mashup “autocomplete”
• emit warnings
Scientific Mashup Characteristics
Turn over more data per operation
Involve subtle visualizations
Must serve a diverse audience
A Model for Scientific Mashups
The “Data Product” is the currency of scientific communication with the public
Scientists are already adept at crafting them (consider PowerPoint slides and figures)
We take a top down approach: Take a static data product ensemble, endow it with interactivity, publish it online, allow others to repurpose it at runtime
Data Product Ensemble
Mashup
CTD: Conductivity, Temperature, Depth
Sampling
Event Detection: Red Water
CTD Cast
Flowthrough
Mashup
Mashup
Key Concepts
A mashup is a synchronized ensemble of data products
A data product is a mashable that has been adapted for a particular purpose
A mashable is an arbitrarily-complex computation that returns a relation
An adaptor displays the relation to the user and returns a subset
All adapted mashables accept input
Hence, user controls are modeled as adapted mashables just like “visual” data products
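These definitions can be read as a small set of function signatures. The Python sketch below is one possible reading, not the system described in the talk: every name and the sample CTD-style rows are hypothetical, the adaptor simulates a user selection by taking the first row, and the mashup loop treats synchronization as merging each selection into a shared environment.

```python
# Sketch of the concepts: mashable, adaptor, data product, mashup.
# All names and the sample rows are hypothetical.

def cast_mashable(env):
    """A mashable: some computation that returns a relation (a list of rows)."""
    rows = [
        {"cast": 1, "depth": 5, "salt": 12.0},
        {"cast": 1, "depth": 10, "salt": 25.5},
        {"cast": 2, "depth": 5, "salt": 30.1},
    ]
    # Keep only rows consistent with whatever the environment already fixes.
    return [r for r in rows if all(r.get(k, v) == v for k, v in env.items())]

def first_row_adaptor(relation):
    """An adaptor: displays the relation and returns a subset of it.
    The user's selection is simulated by taking the first row."""
    for row in relation:
        print(row)
    return relation[:1]

def adapt(mashable, adaptor):
    """A data product: a mashable adapted for a particular purpose."""
    def product(env):
        return adaptor(mashable(env))
    return product

def run_mashup(products, env):
    """A mashup: run each product and merge its selection into the shared environment."""
    env = dict(env)
    for product in products:
        for row in product(env):
            env.update(row)
    return env

cast_picker = adapt(cast_mashable, first_row_adaptor)
selected = cast_picker({"cast": 1})
final_env = run_mashup([cast_picker], {"cast": 2})
```

Because every adapted mashable takes an environment and returns a subset of a relation, a drop-down control and a plot share the same interface.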
Adapted Mashables
Data Flow Graph
Inferring Data Flow
provides: {ABC}
requires: {AB}
Inferring Data Flow
provides: {AC}
requires: {AB}
provides: {B}
Inferring Data Flow
provides: {AC}
requires: {AB}
underspecified mashup
Solution:
1) use defaults
2) root environment
3) hand-specified parameter
Inferring Data Flow
provides: {AB}
requires: {AB}
provides: {B}
overspecified mashup
Solution: Break ties:
1) Prefer nodes on longer paths
2) Use layout information
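The provides/requires cases on these slides can be reproduced with a short Python sketch. It is an illustration of the matching idea only: the node names, the `infer_flow` function, and its report format are assumptions, and the tie-breaking and default rules listed as solutions are not implemented.

```python
# Sketch: infer data-flow edges by matching "requires" against "provides".
# Node names and the report format are illustrative assumptions.

def infer_flow(nodes):
    """nodes maps name -> {"provides": set, "requires": set}."""
    edges, missing, ambiguous = [], {}, {}
    for consumer, spec in nodes.items():
        for attr in sorted(spec["requires"]):
            providers = sorted(name for name, other in nodes.items()
                               if name != consumer and attr in other["provides"])
            if not providers:
                missing.setdefault(consumer, []).append(attr)   # underspecified
            elif len(providers) > 1:
                ambiguous[(consumer, attr)] = providers          # overspecified
            else:
                edges.append((providers[0], consumer, attr))
    return edges, missing, ambiguous

def node(provides="", requires=""):
    """Build a node spec from strings of single-letter attribute names."""
    return {"provides": set(provides), "requires": set(requires)}

well = {"p": node("AC"), "q": node("B"), "c": node(requires="AB")}
under = {"p": node("AC"), "c": node(requires="AB")}
over = {"p": node("AB"), "q": node("B"), "c": node(requires="AB")}

print(infer_flow(well))
print(infer_flow(under))
print(infer_flow(over))
```

In the underspecified case the report shows that nothing provides B; in the overspecified case it shows two candidate providers for B, which is where a tie-breaking rule would apply.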
Audience-Tailored Mashups
K12 students
Experts
Conclusions and Future Directions
We want to augment scientists, not programmers
• Requires limiting expressiveness -- not yet clear where to draw the line
More work on semi-automatically tailoring a mashup at runtime
Automatically insert “context products”
• See salinity, add a salinity colorbar
• See a time, add a tide chart
• See a location, add a map
Re-skin data products
• “Dashboard-style” vs. “Wizard-style” apps
http://escience.washington.edu
(retooled website coming soon)
Comparison
System | Data Model | Operations | Services
GPL | * | * | Typing, maybe
Workflow | * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism
Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, data parallelism
MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance
MS Dryad | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
MPI | Arrays/Matrices | 70+ ops | data parallelism, full control
Mashups serve a diverse audience
student
public
scientist
Computational Science
Theory
Experiment
Observation
Simulation (in silico)
Analysis (in ferro)
Data acquisition is hypothesis-driven
Data acquisition is technology-driven
Explore architectures blending techniques from
• mashups (rapid prototyping),
• visualization (interactivity, richness),
• workflow (data integration, provenance),
• databases (optimization, data independence)
to answer science questions at an Ocean Observatory
Motivation
Source: MayaVi website
PLOT3D, GDAL, ShapeFile,
OGC, .obj, .vtk, netCDF, HDF5,
FITS, others
Optimized for “throwing datasets” and interactivity
Declarative query, interoperability, repeatability generally lacking
Source: http://pogl.wordpress.com/2007/06/
Visualization
Workflow
Emphasis on integration, web services, flexibility
Unconstrained boxes-and-arrows
Any operation on any data type
Very expressive, but limited opportunities for static reasoning
• Type safety
• Task parallelism
• Cache safety
• Optimization via rewrite rules
• Result size / execution time estimation
• Transparent data parallelism
• Platform portability
To move the earth, you need somewhere to stand
Databases
Pre-relational DBMS brittleness: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
physical data independence
logical data independence
files and pointers
relations
views
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”
Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independent of physical data representation
Heterogeneity also drives costs
[Figure: # of bytes vs. # of data types]
CERN (~15PB/year; particle interactions)
LSST (~100PB; images, objects)
PanSTARRS (~40PB; images, objects, trajectories)
OOI (~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)
SDSS (~100TB; images, objects)
Biologists (~10TB; sequences, alignments, annotations, BLAST hits, metadata, phylogeny trees)
The eScience Elephant
“Like a snake”
“Like a hand fan”
“Like a wall”
“Like tree trunk”
“Like a spear”
“Like a rope”