Collaborative data-driven science


Alex Szalay


• Started with the SDSS SkyServer
• Built very quickly in 2001
• Goal: instant access to rich content
• Idea: bring the analysis to the data
• Interactive access at the core
• Much of the scientific process is about data
  ◦ Data collection, data cleaning, data archiving, data organization, data publishing, mirroring, data distribution, data analytics, data curation…
• 2012: NSF DIBBs to extend/reengineer SkyServer

Jim Gray


• Interactive science on petascale data
• Sustain and enhance our astronomy effort
• Grow a footprint into new disciplines
• Build scalable open numerical laboratories
• Scale system to several petabytes
• Deep integration with the “Long Tail”
• Use sharable, well-defined building blocks


• HPC is an instrument in its own right
  ◦ Largest simulations approach/exceed petabytes
• Need public access to the best and latest
• Also need ensembles of simulations for UQ
• Creates new challenges
  ◦ How to access the data?
  ◦ What is the data lifecycle?
  ◦ What are the analysis patterns?
  ◦ What architectures can support these?
• On Exascale everything is a Big Data problem


• http://turbulence.pha.jhu.edu/


• Mirror of Millennium Database

[Figure: data products in the database. Raw data (particles), FOF groups, subhalos, density fields, halo merger trees, synthetic galaxies, mock catalogues, mock images.]


Hydrostatic and non-hydrostatic simulations of dense waters cascading off a shelf: The East Greenland case. Marcello G. Magaldi, Thomas W. N. Haine


Building a DB of a trillion short reads from Next Gen Sequencing


Daphalapurkar, Brady, Ramesh, Molinari. JMPS (2011)


• Create interactive Numerical Laboratories
• Analysis server-side through web service
• Use virtual sensor metaphor
• Many access patterns are local
• No need to download whole datasets
• Concept very successful in turbulence and cosmological N-body
• turbulence.pha.jhu.edu: 19 trillion points delivered!
• Total science data in SciServer currently ~2.5 PB
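The virtual sensor idea can be illustrated with a toy sketch: the caller sends a handful of sample positions and receives only the values at those positions, while the full field stays on the server. Everything here is an assumption made for illustration (the `ToyFieldServer` class, its `sample` method, the analytic field, nearest-node lookup); it is not the interface of the turbulence or cosmology services.

```python
import math


class ToyFieldServer:
    """Stand-in for a service that keeps a large gridded field in place.

    Hypothetical class: it only illustrates the access pattern.
    """

    def __init__(self, n):
        self.n = n
        # A small periodic n x n x n scalar field; imagine it is petabytes.
        self.grid = [
            [
                [math.sin(2 * math.pi * i / n) * math.cos(2 * math.pi * j / n) + k / n
                 for k in range(n)]
                for j in range(n)
            ]
            for i in range(n)
        ]

    def sample(self, points):
        """Return the field value at the grid node nearest to each point.

        Points are (x, y, z) tuples in the unit cube; only the sampled
        values travel back to the caller.
        """
        values = []
        for x, y, z in points:
            i = int(round(x * self.n)) % self.n
            j = int(round(y * self.n)) % self.n
            k = int(round(z * self.n)) % self.n
            values.append(self.grid[i][j][k])
        return values


server = ToyFieldServer(n=8)
sensors = [(0.0, 0.0, 0.0), (0.25, 0.0, 0.5)]  # "virtual sensor" positions
readings = server.sample(sensors)
total_cells = server.n ** 3
print(f"returned {len(readings)} values instead of {total_cells} grid cells")
```

A production service would typically interpolate between nodes rather than snap to the nearest one; the point of the sketch is only that the volume returned scales with the number of sensors, not with the size of the dataset.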


• User-written crawlers, inefficient
• Cutouts delivered to users, slow
• Scalability challenge (over 100 TB scales)
• Requests for scripting access
• Need for easy joins with long-tail data
• Still expecting interactive response


• Need to define sharp tradeoffs
  ◦ Data Analytics system is different from supercomputer
  ◦ What is the right balance between I/O and compute?
• Need high bandwidth to large data
  ◦ Computations/visualizations must be on top of the data
  ◦ Must support at least few 100 TB per server
  ◦ Petascale: 3 copies for production (or erasure code?)
  ◦ Wide area data movement/backbone is hard
• Lessons from the database world:
  ◦ It is nontrivial to schedule complex I/O patterns
  ◦ For subsets we must use indexing, cache resilient storage
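The tradeoffs above lend themselves to back-of-the-envelope arithmetic. The sketch below compares the raw capacity needed for three full copies against an erasure-coded layout, and estimates how long one sequential pass over a server's data takes. The 10+4 shard scheme, the 1 GB/s bandwidth, and the decimal units are illustrative assumptions, not figures from the slides.

```python
def raw_capacity_replication(logical_pb, copies):
    """Raw storage (PB) needed to hold `copies` full replicas."""
    return logical_pb * copies


def raw_capacity_erasure(logical_pb, data_shards, parity_shards):
    """Raw storage (PB) for an erasure code with the given shard counts."""
    return logical_pb * (data_shards + parity_shards) / data_shards


def full_scan_hours(dataset_tb, bandwidth_gb_per_s):
    """Hours to read a dataset once at a sustained sequential bandwidth."""
    seconds = dataset_tb * 1000 / bandwidth_gb_per_s  # 1 TB = 1000 GB
    return seconds / 3600


three_copies = raw_capacity_replication(1.0, copies=3)
erasure_10_4 = raw_capacity_erasure(1.0, data_shards=10, parity_shards=4)
scan_time = full_scan_hours(dataset_tb=100, bandwidth_gb_per_s=1.0)

print(f"1 PB with 3 copies:        {three_copies:.1f} PB raw")
print(f"1 PB with 10+4 erasure:    {erasure_10_4:.1f} PB raw")
print(f"Scan of 100 TB at 1 GB/s:  {scan_time:.1f} hours")
```

Under these assumptions, replication costs roughly twice the raw capacity of the erasure-coded layout, and a single 100 TB scan takes more than a day, which is why per-server bandwidth matters as much as capacity.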


• Offer more computing resources server side
• Enhanced visualization tools (ParaView)
• Augment and combine SQL queries with easy-to-use scripting tools
• Heavy use of virtual machines/Docker
• Interactive portal via iPython/Matlab/R
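Combining SQL with scripting might look like the following sketch, which filters rows in SQL and finishes the analysis in Python. An in-memory SQLite database stands in for the real catalog here; the `galaxy` table, its columns, and the sample rows are made up for illustration and are not the SDSS schema or the CasJobs interface.

```python
import sqlite3
import statistics

# In-memory SQLite stands in for a remote catalog database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE galaxy (objid INTEGER PRIMARY KEY, ra REAL, dec REAL, r_mag REAL)"
)
conn.executemany(
    "INSERT INTO galaxy VALUES (?, ?, ?, ?)",
    [
        (1, 180.0, 0.5, 17.2),
        (2, 180.1, 0.6, 18.9),
        (3, 181.5, -0.2, 16.4),
        (4, 182.0, 1.1, 19.5),
    ],
)

# Step 1: let SQL do the selection close to the data.
rows = conn.execute(
    "SELECT objid, r_mag FROM galaxy WHERE r_mag < ? ORDER BY objid",
    (19.0,),
).fetchall()

# Step 2: continue in the scripting language with the (small) result set.
bright_ids = [objid for objid, _ in rows]
median_r = statistics.median(r for _, r in rows)

print(bright_ids, median_r)
conn.close()
```

The division of labor is the design point: the database handles selection and joins over the large table, and the script only sees the rows it needs.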


Mike Rippin April 27, 2016


• Alex Szalay (PI)
• Mike Rippin (PM)
• Ani Thakar
• Jordan Raddick
• Bonnie Souter
• Gerard Lemson
• Jaiwon Kim
• Dmitry Medvedev
• Deoyani Heinis
• Manu Popp
• Victor Paul
• Sue Werner
• Jan Vandenberg


8:30 AM Continental Breakfast & Coffee

9:00 AM Welcome Alex Szalay

9:05 AM SciServer Overview Mike Rippin

9:25 AM Getting Started with SciServer Jordan Raddick

9:40 AM Technical Overview Dmitry Medvedev

10:30 AM Coffee

10:45 AM Demo Notebook #1 Gerard Lemson

12:00 PM Lunch

1:00 PM Astronomy & Cosmology Examples Gerard Lemson

2:30 PM Break

2:45 PM Explore & Customize Participants

3:30 PM Q&A

3:50 PM Closing Remarks Mike Rippin

4:00 PM Adjourn


• Stay in this room all day
• Restrooms
• Coffee and Breaks – morning and afternoon
• Lunch – ‘Working Lunch’ if preferred
• Wrap up – 4pm


• Technology
  ◦ Everyone should be able to connect to WIFI
  ◦ Everyone will create an account
  ◦ Everyone will create a Docker Container
• Workshop running in TEST Environment
• MyDB etc is temporary
• Jupyter Notebooks can be saved and taken away


Participants

• Set up a SciServer Notebook
• Authenticate with the SciServer Login Portal
• Import and query SDSS with CasJobs
• Save your data and graphics locally
• Save your data and graphics on SciDrive
• Save & Retrieve your data in MyDB
• Learn the SciServer API

SciServer Team

• Test the Compute feature set
• Test out the Architecture
• Gain early feedback from participants
• Implement this feedback before live release

We want this to be Interactive…


• Agenda sets the scene
• To start: Structured
  ◦ First example workbooks cover the ‘building blocks’ and will be done in a structured way
• Subsequently: Flexible
  ◦ Notebooks delve deeper into specifics
  ◦ Timing and deviations are fine, Q&A, examples etc
  ◦ Tune to the experiences and needs of participants

Emphasis on PRACTICAL exercises…


Award

• NSF DIBBs (Data Infrastructure Building Blocks)
• 5 years: 2013–2018
• Approx $10M
• Cooperative Agreement


Objectives

• Extend infrastructure for SDSS to support additional Science Domains
• Host and serve petabyte datasets
• Support custom user datasets
• Provide access and query services
• Provide scalable compute services
• Support analyses and datasets too large to handle locally
• Provide collaborative tools for shared analysis

Computations stay CLOSE to the DATA…


Major Components

Core
• Login Portal
• CASJobs
• SciServer Compute
• SciDrive

Applications
• SkyQuery
• SkyServer
• GLUSEEN
• Turbulence

Supporting Technologies

• Microsoft SQL Server
• OpenStack
• Docker
• Jupyter


Timelines

Year 1 (2013-2014): Project Setup, Scoping, Planning, Begin Refactoring, SDSS Unification

Year 2 (2014-2015): Architectural Refactoring – API, Single Sign-on, prototype Compute

Year 3 (2015-2016, NOW): SciServer System Release, Interactive Compute, Scalable Job Management, Basic Dashboard, Initial Collaborative capabilities

Year 4 (2016-2017): Implementation in Science Domains, Educational workbooks

Year 5 (2017-2018): System Scale out, Data Analytics, Advanced Deployment Scenarios


Timelines – Year 3

Apr 2016
• SciServer System Release

May 2016
• Interactive Compute
• SkyQuery
• Gluseen

August 2016
• Prototype Scalable Job Management
• Basic Dashboard
• Initial Collaborative capabilities

October 2016
• Scalable Job Management
• Turbulence
• Cosmology

November 2016
• Project 3 year Review


Jordan Raddick April 27, 2016


• Agenda: www.sciserver.org/outreach/spring-workshop/detailed-agenda
• Documentation and Support (go here now!): www.sciserver.org/outreach/spring-workshop/documentation


Dmitry Medvedev, Johns Hopkins University


“For data analysis, one possibility is to move the data to you, but the other possibility is to move your query to the data… Often it turns out to be more efficient to move the questions than to move the data.”

-- Jim Gray

Helen Shen’s article for Nature, “Interactive notebooks: Sharing the code”, featured a live demo of IPython notebooks created on demand using Docker containers, and made a strong case for using IPython notebooks in scientific data analysis.


Interactive Jupyter notebooks hosted inside Docker containers.

Pre-configured images to create new containers from (R, Python, MATLAB, …).

High-bandwidth, low-latency access to other SciServer services and data sources through the notebooks.

Users manage their own containers.


[Figure: three virtualization stacks compared. Type 1 Hypervisor: Hardware, Hypervisor, then Virtual Machines, each with its own Operating System, Bins / Libs, and Apps. Type 2 Hypervisor: Hardware, Operating System, Hypervisor, then Virtual Machines, each with its own Operating System, Bins / Libs, and Apps. Linux Containers: Hardware, Operating System, then Containers holding Bins / Libs and Apps. Docker interfaces shown: CLI, REST API, Dockerfile.]


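[Figure: SciServer Compute architecture. Labels: Proxy (ports :433 and :80), Dashboard WebApp, Identity Service, Registry DB, Other SciServer Components, and a VM hosting four application containers (C). Ports :10000, :10001, :10002, and :10003 appear alongside four instances of port :9999, one per container.]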
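One plausible reading of the port labels in the architecture diagram is that every notebook container listens on the same internal port (:9999) and is published on its own host port (:10000, :10001, ...), with the proxy routing each user to the right one. The sketch below shows that bookkeeping under this assumption. The user names, the container name, and the image name `example/jupyter-notebook` are hypothetical, and the command is only assembled, not executed.

```python
CONTAINER_PORT = 9999      # port the notebook server uses inside each container
FIRST_HOST_PORT = 10000    # first host port handed out on the VM


def allocate_host_port(used_ports, first=FIRST_HOST_PORT):
    """Return the lowest host port at or above `first` that is not in use."""
    port = first
    while port in used_ports:
        port += 1
    return port


def docker_run_args(name, image, host_port):
    """Build a `docker run` command that publishes the container port."""
    return [
        "docker", "run", "-d",
        "--name", name,
        "-p", f"{host_port}:{CONTAINER_PORT}",
        image,
    ]


used = set()
routes = {}  # user -> host port; a proxy could consult a table like this
for user in ["alice", "bob", "carol"]:
    port = allocate_host_port(used)
    used.add(port)
    routes[user] = port

cmd = docker_run_args("nb-alice", "example/jupyter-notebook", routes["alice"])
print(routes)
print(" ".join(cmd))
```

Keeping the internal port fixed means one image serves every user; only the host-side mapping differs per container.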


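[Figure: storage layout. The User connects to the Notebook Server running in a Docker Container. Inside the container, /home/idies/workspace contains two directories: persistent (backed by Persistent Storage) and sdss_das (backed by the SDSS DAS).]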
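The directory layout in the storage diagram could be produced with Docker volume mounts, as in the sketch below. The container-side paths come from the slide; the host-side paths under `/srv`, the `user_id` value, and the read-only flag on the SDSS data are assumptions made for illustration.

```python
WORKSPACE = "/home/idies/workspace"  # container path shown on the slide


def volume_args(user_id):
    """Build `docker run` volume flags for one user's container."""
    persistent_host = f"/srv/persistent/{user_id}"  # hypothetical host path
    sdss_host = "/srv/sdss_das"                     # hypothetical host path
    return [
        # Per-user storage that outlives the container.
        "-v", f"{persistent_host}:{WORKSPACE}/persistent",
        # Shared survey data, mounted read-only (assumption).
        "-v", f"{sdss_host}:{WORKSPACE}/sdss_das:ro",
    ]


args = volume_args("u42")
print(" ".join(args))
```

Files written outside the mounted directories live in the container's own filesystem and disappear when the container is removed, which is why only the persistent directory is suitable for work that must be kept.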


Run asynchronous non-interactive jobs in separate Docker containers. It’s meant to be more than just Jupyter notebooks!

Create new VM nodes on-demand to accommodate growing number of users.

Provide scratch (temporary) storage space for working with large amounts of data.

Improve resource management.
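The asynchronous, non-interactive job model can be sketched with a small worker pool: jobs are submitted, run in the background, and their results are collected later. Threads stand in for separate Docker containers here, and `run_job` with its sum-of-squares workload is a hypothetical placeholder; this is not the SciServer job manager.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(job_id, numbers):
    """Placeholder workload; a real job would run inside its own container."""
    return {"job": job_id, "result": sum(x * x for x in numbers)}


jobs = {"a": [1, 2, 3], "b": [4, 5]}

# Submit everything first, then collect: the caller is not blocked per job.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {job_id: pool.submit(run_job, job_id, data)
               for job_id, data in jobs.items()}
    results = {job_id: future.result() for job_id, future in futures.items()}

print(results)
```

The `max_workers` limit plays the role of resource management in this toy version: it caps how many jobs run at once regardless of how many are queued.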
