Collaborative data-driven science
Alex Szalay
• Started with the SDSS SkyServer
• Built very quickly in 2001
• Goal: instant access to rich content
• Idea: bring the analysis to the data
• Interactive access at the core
• Much of the scientific process is about data
  ◦ Data collection, data cleaning, data archiving, data organization, data publishing, mirroring, data distribution, data analytics, data curation…
• 2012: NSF DIBBs to extend/reengineer SkyServer
Jim Gray
• Interactive science on petascale data
• Sustain and enhance our astronomy effort
• Grow a footprint into new disciplines
• Build scalable open numerical laboratories
• Scale system to several petabytes
• Deep integration with the “Long Tail”
• Use sharable, well-defined building blocks

• HPC is an instrument in its own right
  ◦ Largest simulations approach/exceed petabytes
• Need public access to the best and latest
• Also need ensembles of simulations for UQ
• Creates new challenges
  ◦ How to access the data?
  ◦ What is the data lifecycle?
  ◦ What are the analysis patterns?
  ◦ What architectures can support these?
• On Exascale everything is a Big Data problem
• http://turbulence.pha.jhu.edu/

• Mirror of Millennium Database
Raw data: Particles
FOF groups Subhalos
Density fields
Halo merger trees
Synthetic galaxies
Mock catalogues
Mock images
Hydrostatic and non-hydrostatic simulations of dense waters cascading off a shelf: The East Greenland case. Marcello G. Magaldi, Thomas W. N. Haine

Building a DB of a trillion short reads from Next Gen Sequencing
Daphalapurkar, Brady, Ramesh, Molinari. JMPS (2011)
• Create interactive Numerical Laboratories
• Analysis server-side through web service
• Use virtual sensor metaphor
• Many access patterns are local
• No need to download whole datasets
• Concept very successful in turbulence and cosmological N-body
• turbulence.pha.jhu.edu: 19 trillion points delivered!
• Total science data in SciServer currently ~2.5 PB
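The virtual sensor idea can be illustrated with a toy example. This sketch is not the turbulence service's API: `FieldServer`, `sample`, and the linear test field are hypothetical names chosen for illustration. The gridded field stays on the "server", and the client receives only the interpolated values at the sensor positions it asked for.

```python
from typing import List, Sequence


class FieldServer:
    """Toy stand-in for a server that holds a large 1-D gridded field."""

    def __init__(self, values: Sequence[float], spacing: float) -> None:
        self.values = list(values)
        self.spacing = spacing  # distance between neighbouring grid points

    def sample(self, positions: Sequence[float]) -> List[float]:
        """Return linearly interpolated values at the given sensor positions."""
        out = []
        last = len(self.values) - 1
        for x in positions:
            t = x / self.spacing
            i = min(max(int(t), 0), last - 1)  # left grid index, clamped to the grid
            frac = t - i
            out.append(self.values[i] * (1.0 - frac) + self.values[i + 1] * frac)
        return out


# A field with one hundred thousand grid points stays on the server.
server = FieldServer([2.0 * i for i in range(100_000)], spacing=0.5)

# The client places three virtual sensors and gets three numbers back.
sensors = [0.25, 10.0, 1234.5]
readings = server.sample(sensors)
print(readings)
```

The amount of data returned scales with the number of sensors, not with the size of the stored field.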
• User-written crawlers, inefficient
• Cutouts delivered to users, slow
• Scalability challenge (over 100 TB scales)
• Requests for scripting access
• Need for easy joins with long-tail data
• Still expecting interactive response

• Need to define sharp tradeoffs
  ◦ Data Analytics system is different from supercomputer
  ◦ What is the right balance between I/O and compute?
• Need high bandwidth to large data
  ◦ Computations/visualizations must be on top of the data
  ◦ Must support at least few 100 TB per server
  ◦ Petascale: 3 copies for production (or erasure code?)
  ◦ Wide area data movement/backbone is hard
• Lessons from the database world:
  ◦ It is nontrivial to schedule complex I/O patterns
  ◦ For subsets we must use indexing, cache resilient storage
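The "3 copies or erasure code?" question is partly one of raw-capacity arithmetic. As a worked sketch, the code here compares triple replication with a k+m erasure code. The 10+4 layout and the 1 PB dataset size are illustrative assumptions, not figures from the talk.

```python
def raw_capacity_replication(data_pb: float, copies: int) -> float:
    """Raw storage needed when every byte is stored `copies` times."""
    return data_pb * copies


def raw_capacity_erasure(data_pb: float, k: int, m: int) -> float:
    """Raw storage for a k+m erasure code: k data fragments plus m parity fragments."""
    return data_pb * (k + m) / k


data_pb = 1.0  # assumed dataset size in petabytes
replicated = raw_capacity_replication(data_pb, copies=3)
erasure = raw_capacity_erasure(data_pb, k=10, m=4)  # assumed 10+4 layout

print(f"3 copies:     {replicated:.2f} PB raw, survives loss of 2 copies")
print(f"10+4 erasure: {erasure:.2f} PB raw, survives loss of 4 fragments")
```

Erasure coding needs less raw capacity, but reads after a failure require reconstruction work, which matters when users expect interactive response.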
• Offer more computing resources server side
• Enhanced visualization tools (ParaView)
• Augment and combine SQL queries with easy-to-use scripting tools
• Heavy use of virtual machines/Docker
• Interactive portal via iPython/Matlab/R
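Combining SQL with scripting might look like the following sketch. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the server-side SQL engine. The tables `catalog` and `my_targets`, their columns, and the sample rows are hypothetical. The SQL join does the selection next to the data, and the script only post-processes the small result.

```python
import sqlite3
from statistics import mean

conn = sqlite3.connect(":memory:")  # stand-in for a server-side database
conn.executescript(
    """
    CREATE TABLE catalog (obj_id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL);
    CREATE TABLE my_targets (obj_id INTEGER PRIMARY KEY, label TEXT);
    INSERT INTO catalog VALUES (1, 10.0, -1.0, 17.5), (2, 10.1, -1.1, 18.5),
                               (3, 11.0,  0.5, 21.0), (4, 12.0,  0.7, 16.0);
    INSERT INTO my_targets VALUES (1, 'a'), (2, 'b'), (3, 'c');
    """
)

# SQL step: join the user's small table with the large catalog and filter.
rows = conn.execute(
    """
    SELECT c.obj_id, c.mag
    FROM catalog AS c
    JOIN my_targets AS t ON t.obj_id = c.obj_id
    WHERE c.mag < ?
    ORDER BY c.obj_id
    """,
    (20.0,),
).fetchall()

# Scripting step: post-process the small result set in Python.
selected_ids = [obj_id for obj_id, _ in rows]
mean_mag = mean(mag for _, mag in rows)
print(selected_ids, mean_mag)
conn.close()
```

The same pattern covers joins between a user's own small table and a large hosted catalog: only the matched rows leave the database.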
Mike Rippin April 27, 2016
• Alex Szalay (PI)
• Mike Rippin (PM)
• Ani Thakar
• Jordan Raddick
• Bonnie Souter
• Gerard Lemson
• Jaiwon Kim
• Dmitry Medvedev
• Deoyani Heinis
• Manu Popp
• Victor Paul
• Sue Werner
• Jan Vandenberg
8:30 AM Continental Breakfast & Coffee
9:00 AM Welcome Alex Szalay
9:05 AM SciServer Overview Mike Rippin
9:25 AM Getting Started with SciServer Jordan Raddick
9:40 AM Technical Overview Dmitry Medvedev
10:30 AM Coffee
10:45 AM Demo Notebook #1 Gerard Lemson
12:00 PM Lunch
1:00 PM Astronomy & Cosmology Examples Gerard Lemson
2:30 PM Break
2:45 PM Explore & Customize Participants
3:30 PM Q&A
3:50 PM Closing Remarks Mike Rippin
4:00 PM Adjourn
• Stay in this room all day
• Restrooms
• Coffee and Breaks – morning and afternoon
• Lunch – ‘Working Lunch’ if preferred
• Wrap up – 4pm

• Technology
  ◦ Everyone should be able to connect to WIFI
  ◦ Everyone will create an account
  ◦ Everyone will create a Docker Container
• Workshop running in TEST Environment
• MyDB etc is temporary
• Jupyter Notebooks can be saved and taken away
Participants
• Set up a SciServer Notebook
• Authenticate with the SciServer Login Portal
• Import and query SDSS with CasJobs
• Save your data and graphics locally
• Save your data and graphics on SciDrive
• Save & Retrieve your data in MyDB
• Learn the SciServer API

SciServer Team
• Test the Compute feature set
• Test out the Architecture
• Gain early feedback from participants
• Implement this feedback before live release
We want this to be Interactive…

• Agenda sets the scene
• To start: Structured
  ◦ First example workbooks cover the ‘building blocks’ and will be done in a structured way
• Subsequently: Flexible
  ◦ Notebooks delve deeper into specifics
  ◦ Timing and deviations are fine, Q&A, examples etc
  ◦ Tune to the experiences and needs of participants

Emphasis on PRACTICAL exercises…
Award
• NSF DIBBs (Data Infrastructure Building Blocks)
• 5 years: 2013–2018
• Approx $10M
• Cooperative Agreement

Objectives
• Extend infrastructure for SDSS to support additional Science Domains
• Host and serve petabyte datasets
• Support custom user datasets
• Provide access and query services
• Provide scalable compute services
• Support analyses and datasets too large to handle locally
• Provide collaborative tools for shared analysis

Computations stay CLOSE to the DATA…
Major Components
Core
• Login Portal
• CASJobs
• SciServer Compute
• SciDrive
Applications
• SkyQuery
• SkyServer
• GLUSEEN
• Turbulence

Supporting Technologies
• Microsoft SQL Server
• OpenStack
• Docker
• Jupyter
Timelines
Year 1 (2013-2014): Project Setup, Scoping, Planning, Begin Refactoring, SDSS Unification
Year 2 (2014-2015): Architectural Refactoring – API, Single Sign-on, prototype Compute
Year 3 (2015-2016, NOW): SciServer System Release, Interactive Compute, Scalable Job Management, Basic Dashboard, Initial Collaborative capabilities
Year 4 (2016-2017): Implementation in Science Domains, Educational workbooks
Year 5 (2017-2018): System Scale-out, Data Analytics, Advanced Deployment Scenarios
Timelines – Year 3
Apr 2016
• SciServer System Release
May 2016
• Interactive Compute
• SkyQuery
• GLUSEEN
August 2016
• Prototype Scalable Job Management
• Basic Dashboard
• Initial Collaborative capabilities
October 2016
• Scalable Job Management
• Turbulence
• Cosmology
November 2016
• Project 3 year Review
Jordan Raddick April 27, 2016
• Agenda: www.sciserver.org/outreach/spring-workshop/detailed-agenda
• Documentation and Support (go here now!): www.sciserver.org/outreach/spring-workshop/documentation

Dmitry Medvedev, Johns Hopkins University
“For data analysis, one possibility is to move the data to you, but the other possibility is to move your query to the data… Often it turns out to be more efficient to move the questions than to move the data.”
-- Jim Gray
Helen Shen’s article for Nature - Interactive notebooks: Sharing the code – featured a live demo of IPython notebooks created on-demand using Docker containers, and made a strong case for using IPython notebooks in scientific data analysis.
Interactive Jupyter notebooks hosted inside Docker containers.
Pre-configured images to create new containers from (R, Python, MATLAB, …).
High-bandwidth, low-latency access to other SciServer services and data sources through the notebooks.
Users manage their own containers.
[Figure: virtualization stacks compared, listed bottom to top.
Type 1 Hypervisor: Hardware, Hypervisor, Virtual Machines (each with Operating System, Bins / Libs, Apps).
Type 2 Hypervisor: Hardware, Operating System, Hypervisor, Virtual Machines (each with Operating System, Bins / Libs, Apps).
Linux Containers: Hardware, Operating System, Containers (Bins / Libs, Apps).
Additional labels: CLI, REST API, Dockerfile.]
[Figure: SciServer Compute architecture. Labels: Dashboard WebApp, Registry DB, Identity Service, Other SciServer Components, VM, Proxy, and four application containers (C). Port labels: :433 and :80; :10000 to :10003; :9999 on each of the four containers.]
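The port labels in the figure (:10000 to :10003 on the outside, :9999 on each container) suggest a proxy that gives each container its own external port. The sketch here models only that bookkeeping. `PortRegistry`, the port range, and the container names are assumptions for illustration, not SciServer Compute's actual implementation.

```python
from typing import Dict, Tuple

CONTAINER_PORT = 9999  # port the notebook server listens on inside each container


class PortRegistry:
    """Tracks which external port forwards to which user's container."""

    def __init__(self, first_port: int = 10000, last_port: int = 10003) -> None:
        self.first_port = first_port
        self.last_port = last_port
        self.routes: Dict[int, str] = {}  # external port -> container name

    def allocate(self, container: str) -> Tuple[int, int]:
        """Reserve the lowest free external port and return (external, internal)."""
        for port in range(self.first_port, self.last_port + 1):
            if port not in self.routes:
                self.routes[port] = container
                return port, CONTAINER_PORT
        raise RuntimeError("no free external ports on this VM")

    def release(self, container: str) -> None:
        """Free the external port held by a container that was removed."""
        for port, name in list(self.routes.items()):
            if name == container:
                del self.routes[port]


registry = PortRegistry()
first = registry.allocate("alice-notebook")
second = registry.allocate("bob-notebook")
registry.release("alice-notebook")
third = registry.allocate("carol-notebook")  # reuses the freed port
print(first, second, third)
```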
[Figure: user storage. The User reaches a Notebook Server running in a Docker Container. Under /home/idies/workspace, the persistent directory maps to Persistent Storage and sdss_das maps to SDSS DAS.]
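The storage layout in the figure maps naturally onto Docker bind mounts. The sketch here only assembles a `docker run` command line and does not execute it. The host paths, image name, and container name are hypothetical, and mounting the SDSS data read-only is an assumption. The `-d`, `--name`, `-p`, and `-v` options are standard Docker CLI flags.

```python
import shlex
from typing import List


def notebook_run_command(user: str, image: str, external_port: int) -> List[str]:
    """Build (but do not run) a docker command for one user's notebook container."""
    workspace = "/home/idies/workspace"  # path seen inside the container
    return [
        "docker", "run", "-d",
        "--name", f"{user}-notebook",
        # Forward the external port to the notebook server inside the container.
        "-p", f"{external_port}:9999",
        # Hypothetical host directory holding this user's persistent files.
        "-v", f"/srv/persistent/{user}:{workspace}/persistent",
        # Hypothetical host path for SDSS data, mounted read-only (assumption).
        "-v", f"/srv/sdss_das:{workspace}/sdss_das:ro",
        image,
    ]


cmd = notebook_run_command("alice", "example/python-notebook", 10000)
print(shlex.join(cmd))
```

Keeping user files on a mounted volume means they outlive the container, so a container can be removed and recreated without losing notebooks.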
Run asynchronous non-interactive jobs in separate Docker containers. It’s meant to be more than just Jupyter notebooks!
Create new VM nodes on-demand to accommodate growing number of users.
Provide scratch (temporary) storage space for working with large amounts of data.
Improve resource management.
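The asynchronous job model can be sketched with Python's standard `concurrent.futures` module. Here a thread pool stands in for a pool of Docker containers, and `run_job` is a hypothetical placeholder for launching a non-interactive job. The planned SciServer job system may work quite differently.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict


def run_job(name: str, n: int) -> int:
    """Placeholder workload; a real job would run a script inside its own container."""
    return sum(i * i for i in range(n))


# Each worker thread stands in for one job container.
with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns immediately, so the user is free to keep working.
    futures: Dict[str, Future] = {
        name: pool.submit(run_job, name, n)
        for name, n in [("small", 10), ("large", 1000)]
    }
    # Results are collected later, when the user asks for them.
    results = {name: fut.result() for name, fut in futures.items()}

print(results)
```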