Swift: implicitly parallel dataflow scripting for clusters, clouds, and multicore systems
www.ci.uchicago.edu/swift
Michael Wilde - [email protected]
When do you need Swift?
Typical application: protein-ligand docking for drug screening.
O(10) proteins implicated in a disease × O(100K) drug candidates → 1M compute jobs → tens of fruitful candidates for wet lab & APS.
(Work of M. Kubal, T. A. Binkowski, and B. Roux)
www.ci.uchicago.edu/swift www.mcs.anl.gov/exm
Numerous many-task applications
A. Simulation of super-cooled glass materials
B. Protein folding using homology-free approaches
C. Climate model analysis and decision making in energy policy
D. Simulation of RNA-protein interaction
E. Multiscale subsurface flow modeling
F. Modeling of power grid for OE applications
A-E have published science results obtained using Swift
[Figure: protein loop modeling (courtesy A. Adhikari): initial, native, and predicted structures for target T0623, 25 residues; accuracy improved from 8.2Å to 6.3Å (excluding tail)]
Language-driven: Swift parallel scripting
Swift runs parallel scripts on a broad range of parallel computing resources.
[Diagram: a Swift script on a submit host (login node, laptop, Linux server), with a data server, drives application programs on clusters, clouds (Amazon EC2, XSEDE, Wispy, …), and systems ranging from 10^15 to 10^18 operations per second.]
Example – MODIS satellite image processing
Input: tiles of earth land cover (forest, ice, water, urban, etc.)
Output: regions with maximal specific land types
[Diagram: MODIS dataset → MODIS analysis script → 5 largest forest land-cover tiles in the processed region]
Goal: Run MODIS processing pipeline in cloud
[Diagram: getLandUse (× 317) → analyzeLandUse → markMap; colorMODIS (× 317) and the selected tiles feed assemble]
The MODIS script is automatically run in parallel: each loop level can process tens to thousands of image files.
MODIS script in Swift: main data flow
foreach g, i in geos {
  land[i] = getLandUse(g, 1);
}
(topSelected, selectedTiles) =
  analyzeLandUse(land, landType, nSelect);
foreach g, i in geos {
  colorImage[i] = colorMODIS(g);
}
gridMap = markMap(topSelected);
montage =
  assemble(selectedTiles, colorImage, webDir);
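As a sketch of how one of these stages could be declared (the executable name and its arguments are assumptions for illustration, not taken from the real MODIS script), each stage is an app( ) function that maps typed files onto a command line:

```
app (File output) getLandUse (File input, int sortfield)
{
  // hypothetical wrapper executable; @input/@output supply the mapped filenames
  getlanduse @input sortfield stdout=@output;
}
```

Because getLandUse is declared this way, Swift can run the 317 calls in the foreach loop above on any available workers, with no explicit parallel code in the script.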
Parallel distributed scripting for clouds
Swift runs parallel scripts on cloud resources provisioned by Nimbus's Phantom service.
[Diagram: a Swift script on a submit host (login node, laptop, Linux server), with a data server, drives work on clouds (Amazon EC2, NSF FutureGrid, Wispy, …) provisioned through Nimbus and Phantom.]
Example of cloud programming: Nimbus-Phantom-Swift on FutureGrid
User provisions 5 nodes with Phantom
– Phantom starts 5 VMs
– Swift worker agents in the VMs contact the Swift coaster service to request work
User starts the Swift application script "MODIS"
– Swift places application jobs on free workers
– Workers pull input data, run the application, and push output data
3 nodes fail and shut down
– Jobs in progress fail; Swift retries them
User adds more nodes with Phantom
– User asks Phantom to increase the node allocation to 12
– New Swift worker agents register; Swift picks up the new workers and runs more jobs in parallel
Workload completes
– Science results are available on the output data server
– The worker infrastructure remains available for new workloads
Programming model: all execution driven by parallel data flow
f() and g() are computed in parallel; myproc() returns r when they are done.
This parallelism is automatic and works recursively throughout the program's call graph.
(int r) myproc (int i)
{
  j = f(i);
  k = g(i);
  r = j + k;
}
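A small sketch of how this rule composes (the loop bound and array are illustrative, not from the talk): every iteration below is independent, so all 100 myproc() calls, and the f()/g() pair inside each one, are eligible to run concurrently.

```
int results[];
foreach i in [1:100] {
  results[i] = myproc(i);   // each call's internal f() and g() also run in parallel
}
```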
Encapsulation enables distributed parallelism
[Diagram: a Swift app( ) function interface definition wraps an application program; typed Swift data objects correspond to the files expected or produced by the application program.]
Encapsulation is the key to transparent distribution, parallelization, and automatic provenance capture.
app( ) functions specify command-line argument passing
To run: psim -s 1ubq.fas -pdb p -t 100.0 -d 25.0 >log
In Swift code:
app (PDB pg, File log) predict (Protein seq, Float temp, Float dt)
{
  psim "-c" "-s" @seq.fasta "-pdb" @pg "-t" temp "-d" dt;
}
Protein p <ext; exec="Pmap", id="1ubq">;
PDB structure;
File log;
(structure, log) = predict(p, 100., 25.);
[Diagram: the Swift app function predict() maps its typed arguments onto the PSim command line: seq (a Fasta file) to -s, temp (100.0) to -t, dt (25.0) to -d, pg to the -pdb output, and log to redirected stdout.]
Large-scale parallelization with simple loops
foreach sim in [1:1000] {
  (structure[sim], log[sim]) = predict(p, 100., 25.);
}
result = analyze(structure);
[Figure: 1000 runs of the "predict" application feed a single analyze() step; example protein targets T1af7, T1r69, T1b72]
Nested parallel prediction loops in Swift
Sweep( )
{
  int nSim = 1000;
  int maxRounds = 3;
  Protein pSet[ ] <ext; exec="Protein.map">;
  float startTemp[ ] = [ 100.0, 200.0 ];
  float delT[ ] = [ 1.0, 1.5, 2.0, 5.0, 10.0 ];
  foreach p, pn in pSet {
    foreach t in startTemp {
      foreach d in delT {
        ItFix(p, nSim, maxRounds, t, d);
      }
    }
  }
}
Sweep();
10 proteins × 1000 simulations × 3 rounds × 2 temps × 5 deltas = 300K tasks
Spatial normalization of a functional run
[Figure: dataset-level workflow and expanded (10-volume) workflow. Dataset-level stages: reorientRun → reorientRun → random_select → alignlinearRun → resliceRun → softmean → alignlinear → combinewarp → reslice_warpRun → strictmean → binarize → gsmoothRun. The expanded graph unfolds these into per-volume tasks such as reorient/01 … reorient/57, alignlinear/03, reslice/04, reslice_warp/22 … reslice_warp/38, softmean/13, combinewarp/21, strictmean/39, binarize/40, and gsmooth/41 … gsmooth/50.]
Complex scripts can be well structured. Programming in the large: an fMRI spatial normalization script example
(Run snr) functional ( Run r, NormAnat a, Air shrink )
{
  Run yroRun = reorientRun( r, "y" );
  Run roRun = reorientRun( yroRun, "x" );
  Volume std = roRun[0];
  Run rndr = random_select( roRun, 0.1 );
  AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
  Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
  Volume meanRand = softmean( reslicedRndr, "y", "null" );
  Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
  Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
  Run nr = reslice_warp_run( boldNormWarp, roRun );
  Volume meanAll = strictmean( nr, "y", "null" );
  Volume boldMask = binarize( meanAll, "y" );
  snr = gsmoothRun( nr, boldMask, "6 6 6" );
}
(Run or) reorientRun ( Run ir, string direction)
{
  foreach Volume iv, i in ir.v {
    or.v[i] = reorient(iv, direction);
  }
}
Dataset mapping example: fMRI datasets
type Study {
  Group g[ ];
}
type Group {
  Subject s[ ];
}
type Subject {
  Volume anat;
  Run run[ ];
}
type Run {
  Volume v[ ];
}
type Volume {
  Image img;
  Header hdr;
}
[Diagram: a mapping function or script connects the on-disk data layout to Swift's in-memory data model.]
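As a hedged sketch (the mapper script name is an assumption; reorientRun is the function defined on the previous slide), once a Study is mapped, nested foreach loops can walk the whole hierarchy, and every run at the leaves is processed in parallel:

```
Study s <ext; exec="study_mapper.sh">;   // hypothetical mapper script
foreach g in s.g {
  foreach subj in g.s {
    foreach r in subj.run {
      Run nr = reorientRun(r, "y");   // all runs across all subjects proceed concurrently
    }
  }
}
```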
Simple campus Swift usage: running locally on a cluster login host
Swift can send work to multiple sites.
[Diagram: a Swift script uses a Condor or GRAM provider and a remote data provider to reach any remote host and multiple remote campus clusters, each with a local scheduler behind GRAM; data moves through a data server (GridFTP, scp, …).]
Flexible worker-node agents for execution and data transport
Main role: efficiently run Swift tasks on allocated compute nodes, local and remote
– Handles short tasks efficiently
– Runs over Grid infrastructure: Condor, GRAM
– Also runs with few infrastructure dependencies
– Can optionally perform data staging
– Provisioning can be automatic or external (manually launched or externally scripted)
Coasters deployment can be automatic or manual
Swift can deploy coasters automatically, or the user can deploy them manually or with a script. Swift provides scripts for clouds, ad-hoc clusters, etc.
[Diagram: in a centralized campus cluster environment, the Swift script's coaster auto-boot mechanism and data provider contact a coaster service on a remote host (via GRAM or SSH); the service dispatches tasks and files (a1, f1) to coaster workers on compute nodes allocated by the cluster scheduler.]
Swift is a parallel scripting system for grids, clouds, and clusters
– for loosely coupled applications: application and utility programs linked by exchanging files
Swift is easy to write: a simple, high-level, C-like functional language
– small Swift scripts can do large-scale work
Swift is easy to run: it contains all services for running Grid workflows in one Java application
– untar and run; it acts as a self-contained Grid client
Swift is fast: it uses the efficient, scalable, and flexible "Karajan" execution engine
– scaling close to 1M tasks; 0.5M in live science work, and growing
Swift usage is growing
– applications in neuroscience, proteomics, molecular dynamics, biochemistry, economics, statistics, and more
Try Swift! www.ci.uchicago.edu/swift and www.mcs.anl.gov/exm
Acknowledgments
Swift is supported in part by NSF grants OCI-1148443 and PHY-636265, NIH DC08638, and the UChicago SCI Program
ExM is supported by the DOE Office of Science, ASCR Division
Structure prediction supported in part by NIH
The Swift team:
– Mihael Hategan, Justin Wozniak, Ketan Maheshwari, Ben Clifford, David Kelly, Jon Monette, Allan Espinosa, Ian Foster, Dan Katz, Ioan Raicu, Sarah Kenny, Mike Wilde, Zhao Zhang, Yong Zhao
GPSI Science portal:
– Mark Hereld, Tom Uram, Wenjun Wu, Mike Papka
Java CoG Kit used by Swift developed by:
– Mihael Hategan, Gregor von Laszewski, and many collaborators
ZeptoOS:
– Kamil Iskra, Kazutomo Yoshii, and Pete Beckman
Scientific application collaborators and usage described in this talk:
– U. Chicago Open Protein Simulator Group (Karl Freed, Tobin Sosnick, Glen Hocky, Joe Debartolo, Aashish Adhikari)
– U. Chicago Radiology and Human Neuroscience Lab (Dr. S. Small)
– SEE/CIM-EARTH: Joshua Elliott, Meredith Franklin, Todd Munson
– ParVis and FOAM: Rob Jacob, Sheri Mickelson (Argonne); John Dennis, Matthew Woitaszek (NCAR)
– PTMap: Yingming Zhao, Yue Chen