I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.
Ian Foster
Computation Institute
Argonne National Lab & University of Chicago
Towards an Open Analytics Environment
2
The Computation Institute
A joint institute of Argonne and the University of Chicago, focused on furthering system-level science via the development and use of advanced computational methods.
Solutions to many grand challenges facing science and society today require the analysis and understanding of entire systems, not just individual components. They require not reductionist approaches but the synthesis of knowledge from multiple levels of a system, whether biological, physical, or social (or all three).
www.ci.uchicago.edu
Faculty, fellows, staff, students, computers, projects.
3
The Good Old Days: Astronomy ~1600
30 years, ? years, 10 years, 6 years, 2 years
4
Astronomy, from 1600 to 2000
Automation: 10^-1 to 10^8 Hz data capture
Community: 10^0 to 10^4 astronomers (10^6 amateur)
Data: 10^6 to 10^15 B aggregate
Computation: 10^-1 to 10^15 Hz peak
Literature: 10^1 to 10^5 pages/year
5
Biomedical Research ~1600
6
Biomedical Research ~2000
...atcgaattccaggcgtcacattctcaattcca...
DNA sequences, alignments
MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT...
Protein sequences
2º structure, 3º structure
Protein-protein interactions
Metabolism pathways
Receptor-ligand, 4º structure
Polymorphism and variants
Genetic variants, individual patients
Epidemiology
Physiology, cellular biology
Biochemistry, neurobiology
Endocrinology, etc.
ESTs, expression patterns, large-scale screens, genetics and maps
Linkage, cytogenetic, clone-based
Scale labels in the figure: >10^6, >10^6, >10^9, >10^6, >10^5, >10^9
From John Wooley
7
Growth of Sequences and Annotations since 1982
Folker Meyer, Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade, CTWatch, August 2006.
8
The Analyst in Denial
“I just need a bigger disk (and workstation)”
9
An Open Analytics Environment
Results out
Data in
Programs & rules in
“No limits”: storage, computing, format, program
Allowing for: versioning, provenance, collaboration, annotation
10
o·pen [oh-puhn] adjective
having the interior immediately accessible
relatively free of obstructions to sight, movement, or internal arrangement
generous, liberal, or bounteous
in operation; live
readily admitting new members
not constipated
11
What Goes In (1)
12
What Goes In (2)
Rules
Workflows
Dryad
MapReduce
Parallel programs
SQL
BPEL
Swift
SCFL
R
MATLAB
Octave
13
How it Cooks
Virtualization: run any program, store any data
Indexing: automated maintenance
Provisioning: policy-driven allocation of resources to competing demands
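One way to picture "policy-driven allocation of resources to competing demands" is a weighted fair-share rule. The Python sketch below is only an illustration under assumptions: the slides do not say which policy the environment uses, and the function name `allocate`, the demand names, and the weights are all hypothetical. Each demand receives capacity in proportion to its weight, is capped at what it asked for, and any surplus is redistributed to the demands that are still unsatisfied.

```python
def allocate(capacity, demands, weights):
    """Weighted fair share with caps (a hypothetical policy, not the talk's).

    capacity: total units available (e.g. cores)
    demands:  {name: units requested}
    weights:  {name: policy weight, > 0}
    Returns {name: units granted}.
    """
    alloc = {name: 0.0 for name in demands}
    active = {name for name, d in demands.items() if d > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_w = sum(weights[n] for n in active)
        # Demands whose weighted share already covers their full request.
        satisfied = {
            n for n in active
            if remaining * weights[n] / total_w >= demands[n]
        }
        if not satisfied:
            # Nobody can be fully served: split what is left by weight.
            for n in active:
                alloc[n] = remaining * weights[n] / total_w
            remaining = 0.0
            break
        # Grant satisfied demands exactly what they asked for,
        # then share the surplus among the rest in the next round.
        for n in satisfied:
            alloc[n] = float(demands[n])
            remaining -= demands[n]
        active -= satisfied
    return alloc
```

For example, with 100 cores, requests of 20, 100, and 100, and weights 1, 1, and 2, the first demand is fully served and the remaining 80 cores are split 1:2 between the other two.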
14
What Comes Out
DataData
15
Analysis as (Collaborative) Process
Transform, Annotate, Search, Add to, Tag, Visualize, Discover, Extend, Group, Share
16
Centralized or Distributed?
Both
17
Towards an Open Analysis Environment: (1) Applications
Astrophysics, cognitive science, East Asian studies, economics, environmental science, epidemiology, genomic medicine, neuroscience, political science, sociology, solid state physics
18
Towards an Open Analysis Environment: (2) Hardware
SiCortex: 6K cores, 6 Top/s
IBM BG/P: 160K cores, 500 Top/s
PADS
10-40 Gbit/s
19
PADS: Petascale Active Data Store
500 TB reliable storage (data & metadata)
180 TB, 180 GB/s, 17 Top/s analysis
Data ingest
Dynamic provisioning
Parallel analysis
Remote access
Offload to remote data centers
Diverse users
Diverse data sources
1000 TB tape backup
20
Towards an Open Analysis Environment: (3) Methods
HPC systems software (MPICH, PVFS, etc.)
Collaborative data tagging (GLOSS)
Data integration (XDTM)
HPC data analytics and visualization
Loosely coupled parallelism (Swift, Hadoop)
Dynamic provisioning (Falkon)
Service authoring (Introduce, caGrid, gRAVI)
Provenance recording and query (Swift)
Service composition and workflow (Taverna)
Virtualization management
Distributed data management (GridFTP, etc.)
21
Tagging & Social Networking
GLOSS: Generalized Labels Over Scientific data Sources
22
./group23
drwxr-xr-x 4 yongzh users 2048 Nov 12 14:15 AA
drwxr-xr-x 4 yongzh users 2048 Nov 11 21:13 CH
drwxr-xr-x 4 yongzh users 2048 Nov 11 16:32 EC
./group23/AA:
drwxr-xr-x 5 yongzh users 2048 Nov 5 12:41 04nov06aa
drwxr-xr-x 4 yongzh users 2048 Dec 6 12:24 11nov06aa
./group23/AA/04nov06aa:
drwxr-xr-x 2 yongzh users 2048 Nov 5 12:52 ANATOMY
drwxr-xr-x 2 yongzh users 49152 Dec 5 11:40 FUNCTIONAL
./group23/AA/04nov06aa/ANATOMY:
-rw-r--r-- 1 yongzh users 348 Nov 5 12:29 coplanar.hdr
-rw-r--r-- 1 yongzh users 16777216 Nov 5 12:29 coplanar.img
./group23/AA/04nov06aa/FUNCTIONAL:
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0001.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0001.img
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0002.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0002.img
-rw-r--r-- 1 yongzh users 496 Nov 15 20:44 bold1_0002.mat
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0003.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0003.img
XDTM: XML Data Typing & Mapping
Logical, Physical
23
fMRI Type Definitions
type Study { Group g[ ]; }
type Group { Subject s[ ]; }
type Subject { Volume anat; Run run[ ]; }
type Run { Volume v[ ]; }
type Volume { Image img; Header hdr; }
type Image {};
type Header {};
type Warp {};
type Air {};
type AirVec { Air a[ ]; }
type NormAnat { Volume anat; Warp aWarp; Volume nHires; }
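The link between the logical types and a physical directory layout can be sketched in Python. This is an illustrative assumption, not the XDTM implementation: it supposes a `study/group/subject/ANATOMY|FUNCTIONAL/file` layout, supposes that functional files are named `<run>_<index>.img` or `.hdr`, and uses Python dataclasses as hypothetical stand-ins for the SwiftScript types (with dictionaries in place of arrays so that sparse indices are easy to handle).

```python
import re
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Dict, Optional


@dataclass
class Volume:
    img: Optional[str] = None  # path to the image file
    hdr: Optional[str] = None  # path to the header file


@dataclass
class Run:
    v: Dict[int, Volume] = field(default_factory=dict)


@dataclass
class Subject:
    anat: Volume = field(default_factory=Volume)
    run: Dict[str, Run] = field(default_factory=dict)


@dataclass
class Group:
    s: Dict[str, Subject] = field(default_factory=dict)


@dataclass
class Study:
    g: Dict[str, Group] = field(default_factory=dict)


# Assumed naming convention for functional volumes: <run>_<index>
FUNC_NAME = re.compile(r"^(?P<run>.+)_(?P<index>\d+)$")


def map_study(paths):
    """Build the logical Study structure from a list of physical file paths."""
    study = Study()
    for p in paths:
        parts = PurePosixPath(p).parts
        if len(parts) != 5:
            continue  # not study/group/subject/kind/file
        _root, group, subject, kind, filename = parts
        stem, _, ext = filename.rpartition(".")
        if ext not in ("img", "hdr"):
            continue  # ignore auxiliary files
        run_match = FUNC_NAME.match(stem) if kind == "FUNCTIONAL" else None
        if kind != "ANATOMY" and run_match is None:
            continue  # unknown directory or unexpected file name
        subj = study.g.setdefault(group, Group()).s.setdefault(subject, Subject())
        if kind == "ANATOMY":
            vol = subj.anat
        else:
            run = subj.run.setdefault(run_match.group("run"), Run())
            vol = run.v.setdefault(int(run_match.group("index")), Volume())
        setattr(vol, ext, p)  # fills Volume.img or Volume.hdr
    return study
```

Given paths such as `group23/AA/04nov06aa/FUNCTIONAL/bold1_0001.img`, the mapper would place the file under group `AA`, subject `04nov06aa`, run `bold1`, volume 1, so analysis code can address data by logical position instead of by file name.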
24
High-Performance Data Analytics
Functional MRI
Ben Clifford, Mihael Hategan, Mike Wilde, Yong Zhao
25
SwiftScript for fMRI Data Analysis
(Run snr) functional ( Run r, NormAnat a, Air shrink ) {
  Run yroRun = reorientRun( r , "y" );
  Run roRun = reorientRun( yroRun , "x" );
  Volume std = roRun[0];
  Run rndr = random_select( roRun, 0.1 );
  AirVec rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
  Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
  Volume meanRand = softmean( reslicedRndr, "y", "null" );
  Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
  Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
  …
}
(Run or) reorientRun (Run ir, string direction) {
  foreach Volume iv, i in ir.v {
    or.v[i] = reorient(iv, direction);
  }
}
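The `foreach` in `reorientRun` processes each volume independently, so the iterations can run concurrently. A rough Python analogue of that pattern, using the standard `concurrent.futures` module, is sketched below. It is not how Swift executes the script: `reorient` here is a hypothetical stand-in that only tags a string, and the names `reorient_run` and `functional_prefix` are invented for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def reorient(volume, direction):
    # Hypothetical stand-in for the real per-volume reorientation tool.
    return f"{volume}:reoriented-{direction}"


def reorient_run(volumes, direction, max_workers=4):
    """Apply reorient to every volume of a run, one task per volume."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so result i matches volume i,
        # mirroring or.v[i] = reorient(iv, direction).
        return list(pool.map(lambda v: reorient(v, direction), volumes))


def functional_prefix(run):
    """First two steps of the functional procedure: reorient in y, then x."""
    yro_run = reorient_run(run, "y")
    return reorient_run(yro_run, "x")
```

The second call consumes the output of the first, which is the kind of data dependency a workflow engine can use to order otherwise independent tasks.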
26
Provenance Data Model
Invocation: dvID, host, start, duration, exitcode, stats
Call: nmspace, name, version
Procedure: nmspace, name, version
FormalArg: argname, type, direction
ActualArg: argname, value
Workflow: wfid, fromDV, toDV
Dataset: nmspace, name
Annotation: object, pred, type/val, user, date
Relationships (with 1 and * cardinalities in the diagram): passes, executes, calls, binds, references, describes, uses, includes
27
Multi-level Scheduling
Specification: SwiftScript, abstract computation, Virtual Data Catalog, SwiftScript Compiler
Scheduling: Execution Engine (Karajan w/ Swift Runtime), Swift runtime callouts, status reporting
Execution: Virtual Node(s), Worker Nodes, launcher, file1, file2, file3, AppF1, AppF2, provenance data, provenance collector
Provisioning: Falkon Resource Provisioner, Amazon EC2
28
DOCK on SiCortex
CPU cores: 5760
Power: 15,000 W
Tasks: 92160
Elapsed time: 12821 sec
Compute time: 1.94 CPU years
(does not include ~800 sec to stage input data)
Ioan Raicu, Zhao Zhang
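The DOCK figures can be cross-checked with a little arithmetic. The Python sketch below is a back-of-the-envelope reading, not a result from the talk: it assumes a 365-day year, and the function name `run_summary` and the derived quantities are introduced here for illustration. Under those assumptions, 5760 cores over 12821 seconds offer about 2.34 CPU years, so 1.94 CPU years of compute corresponds to roughly 83% utilization, about 664 seconds per task on average, and 16 tasks per core.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year


def run_summary(cores, tasks, elapsed_s, compute_cpu_years):
    """Derive simple efficiency figures from the reported run statistics."""
    available_cpu_years = cores * elapsed_s / SECONDS_PER_YEAR
    return {
        # CPU time the machine could have delivered during the run
        "available_cpu_years": available_cpu_years,
        # fraction of that time spent in application compute
        "utilization": compute_cpu_years / available_cpu_years,
        # average compute time per task, in seconds
        "mean_task_s": compute_cpu_years * SECONDS_PER_YEAR / tasks,
        # average number of tasks handled by each core
        "tasks_per_core": tasks / cores,
    }


summary = run_summary(cores=5760, tasks=92160, elapsed_s=12821,
                      compute_cpu_years=1.94)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

The gap between available and used CPU time would cover scheduling overhead, idle cores near the end of the run, and rounding in the reported compute time; the slide does not break this down.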
29
LIGO Gravitational Wave Observatory
>1 Terabyte/day to 8 sites
770 TB replicated to date: >120 million replicas
MTBF = 1 month
Sites labeled on the map: Birmingham, Cardiff, AEI/Golm
Ann Chervenak et al., ISI; Scott Koranda et al., LIGO
30
Lag Plot for Data Transfers to Caltech
Credit: Kevin Flasch, LIGO
31
SIDGrid: B. Bertenthal et al., U.Chicago, IU, UIC
32
Social Informatics Data Grid (SIDgrid)
TeraGrid PADS …
SIDgrid
Collaborative, multi-modal analysis of cognitive science data
Diverse experimental data & metadata
Browse data, Search, Content preview, Transcode, Download, Analyze
33
ELAN
SIDGrid Portal
34
35
A Community Integrated Model for Economic and Resource Trajectories for Humankind (CIM-EARTH)
Dynamics, foresight, uncertainty, resolution, …
Agriculture, transport, taxation, …
Data (global, local, …)
(Super)computers
CIM-EARTH Framework
Community process
Open code, data
36
Alleviating Poverty in Thailand: Modeling Entrepreneurship
Consider only wealth, access to capital
Consider also distance to 6 major cities
Match: High, Low
Rob Townsend, Victor Zhorin, et al.
37
Text Mining
38
GeneWays
Online Journals, Pathways
Screening 250,000 journal articles
2.5M reasoning chains
4M statements
Andrey Rzhetsky et al.
39
Evidence Integration: Genetics & Disease Susceptibility
Identify Genes
Phenotype 1, Phenotype 2, Phenotype 3, Phenotype 4
Predictive Disease Susceptibility
Physiology, Metabolism, Endocrine, Proteome, Immune, Transcriptome, Biomarker Signatures, Morphometrics, Pharmacokinetics, Ethnicity, Environment, Age, Gender
Source: Terry Magnuson
40
James Evans, U.Chicago
Arabidopsis articles
41
An Open Analytics Environment
Results out
Data in
Programs & rules in
“No limits”: storage, computing, format, program
Allowing for: versioning, provenance, collaboration, annotation
42