Environment for training models Dmitry Spodarets AI Rush


Page 1: Environment for training models

Environment for training models

Dmitry Spodarets

AI Rush

Page 2: Environment for training models

Who am I

Dmitry Spodarets

• Founder and CEO at FlyElephant

• PhD candidate at Odessa National University

• Lecturer at Odessa Polytechnic University

• Organizer of technical conferences about AI, BigData, HPC, JS, Web Technologies …

Page 3: Environment for training models

Agenda

• Data Science Tools Survey Results

• Computing resources

• Clouds (AWS & Azure)

• Containers (Docker, Singularity)

• FlyElephant platform for Data Science

Page 4: Environment for training models

Data Science Tools Survey

220 data scientists

Page 5: Environment for training models

Datasets

[Bar chart "Datasets": responses by dataset size, axis scale 0 to 70. Size categories: less than 1 MB; 1.1 to 10 MB; 11 to 100 MB; 101 MB to 1 GB; 1.1 to 10 GB; 11 to 100 GB; 101 GB to 1 TB; 1.1 to 10 TB; 11 to 100 TB; 101 TB to 1 PB; 1.1 to 10 PB; 11 to 100 PB; over 100 PB.]

Page 6: Environment for training models

Tools for collecting data

Python 45

R 26

Spark 18

SQL 15

Excel 13

Kafka 11

Pandas 10

custom 8

Hadoop 5

Numpy 5

SAS 5

Page 7: Environment for training models

Tools for storing data

PostgreSQL 37

CSV 31

MySQL 21

Hadoop 16

Excel 15

HDFS 15

MongoDB 15

MyServer 12

Oracle 11

Hive 8

Page 8: Environment for training models

Programming languages

Python 151

R 88

SQL 37

Java 32

Scala 22

bash 17

C++ 17

JavaScript 15

C# 13

VBA 8

C 6
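The counts above are easier to compare as shares. A minimal sketch, assuming each number is how many of the 220 surveyed data scientists named that language (the slides do not state the unit, and respondents evidently could name more than one); the names `RESPONDENTS`, `language_counts` and `share_of_respondents` are illustrative only:

```python
# Counts copied from the "Programming languages" slide.
# Assumption: each count is the number of respondents (out of 220) naming the language.
RESPONDENTS = 220

language_counts = {
    "Python": 151, "R": 88, "SQL": 37, "Java": 32, "Scala": 22, "bash": 17,
    "C++": 17, "JavaScript": 15, "C#": 13, "VBA": 8, "C": 6,
}


def share_of_respondents(count, total=RESPONDENTS):
    """Return the count as a percentage of all respondents, rounded to 0.1."""
    return round(100.0 * count / total, 1)


for language, count in language_counts.items():
    print(f"{language:<12}{count:>5}{share_of_respondents(count):>8}%")
```

The same helper can be applied to the other survey slides by swapping in their counts.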

Page 9: Environment for training models

Libraries

Pandas 88

Numpy 68

scikit-learn 48

scipy 26

dplyr 20

matplotlib 20

ggplot2 15

keras 14

Spark 13

xgboost 13

Tensorflow 12

Page 10: Environment for training models

Tools for the visualization of data

matplotlib 66

ggplot 40

seaborn 33

Excel 22

Tableau 22

R 19

plotly 13

bokeh 12

d3 11

Page 11: Environment for training models

Clouds

AWS 77

none 41

Azure 25

Google 24

Digital Ocean 9

OpenStack 7

Watson 1

Page 12: Environment for training models

The Jupyter Notebook

Page 13: Environment for training models

Jupyter Lab

Page 14: Environment for training models

Computing resources

Page 15: Environment for training models

Computing resources

Page 16: Environment for training models

Computing resources

NVIDIA DGX-1 Deep Learning Supercomputer: 170 / 3 TFLOPS (GPU FP16 / CPU FP32)

NVIDIA Tesla P100: ~5 TeraFLOPS

~3 TeraFLOPS

Page 17: Environment for training models

Image Training Performance on GoogLeNet

Images trained per second (P100 speedup over K80 in parentheses):

GPUs             Tesla K80   Tesla P100
1 GPU (1.86X)    251.77      467.73
2 GPUs (1.87X)   425.38      791.96
4 GPUs (2.2X)    569.1       1230.63

http://www.nvidia.com/object/caffe-benchmarks.html
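The chart's numbers can be checked with a little arithmetic. A minimal sketch, assuming the six bar labels map to the K80 and P100 series as listed in the code (this reading approximately reproduces the 1.86X / 1.87X / 2.2X captions); `throughput`, `p100_speedup` and `scaling_efficiency` are illustrative names, not part of any NVIDIA tool:

```python
# Images trained per second, read off the slide's bar labels.
# Assumption: the labels map to (K80, P100) for 1, 2 and 4 GPUs as below.
throughput = {
    1: {"K80": 251.77, "P100": 467.73},
    2: {"K80": 425.38, "P100": 791.96},
    4: {"K80": 569.10, "P100": 1230.63},
}


def p100_speedup(gpus):
    """Ratio of P100 to K80 throughput at the same GPU count."""
    row = throughput[gpus]
    return row["P100"] / row["K80"]


def scaling_efficiency(card, gpus):
    """Throughput on `gpus` cards divided by `gpus` times the single-card throughput."""
    return throughput[gpus][card] / (gpus * throughput[1][card])


for gpus in throughput:
    print(f"{gpus} GPU(s): P100/K80 = {p100_speedup(gpus):.2f}x, "
          f"K80 efficiency = {scaling_efficiency('K80', gpus):.0%}, "
          f"P100 efficiency = {scaling_efficiency('P100', gpus):.0%}")
```

Under this reading, four GPUs deliver well under four times the single-GPU throughput on both cards, so adding hardware alone does not give linear speedup.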

Page 18: Environment for training models

1080 vs Titan X vs K80 vs P100

Performance in TFLOPS:

GPU       FP32 (single precision)   FP64 (double precision)
1080      8.8                       0.25
Titan X   10.1                      0.3
K80       8.7                       2.9
P100      10.6                      5.3

http://www.nvidia.com/
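One way to read this comparison is to ask how much throughput each card keeps when moving from single to double precision. A minimal sketch, assuming the chart labels pair up per GPU as listed in the code; `tflops` and `fp64_fraction` are illustrative names:

```python
# Performance in TFLOPS from the slide.
# Assumption: the chart labels pair up as (FP32, FP64) per GPU as listed here.
tflops = {
    "1080":    {"fp32": 8.8,  "fp64": 0.25},
    "Titan X": {"fp32": 10.1, "fp64": 0.3},
    "K80":     {"fp32": 8.7,  "fp64": 2.9},
    "P100":    {"fp32": 10.6, "fp64": 5.3},
}


def fp64_fraction(gpu):
    """Double-precision throughput as a fraction of single-precision throughput."""
    return tflops[gpu]["fp64"] / tflops[gpu]["fp32"]


for gpu in tflops:
    print(f"{gpu:<8} FP64/FP32 = {fp64_fraction(gpu):.3f}")
```

On these figures the P100 keeps half of its throughput in double precision and the K80 a third, while the 1080 and Titan X keep roughly 3%.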

Page 19: Environment for training models

Problem

Effective parallelization of algorithms

Page 20: Environment for training models

NVIDIA Deep Learning SDK

Page 21: Environment for training models

Computing power (Intel)

• Intel Math Kernel Library (Intel MKL): natively supports C, C++ and Fortran development. Cross-language compatible with Java, C#, Python and other languages.

• Intel Data Analytics Acceleration Library (Intel DAAL): includes Python, C++, and Java APIs and connectors to popular data sources including Spark and Hadoop.

• Intel MPI Library: natively supports C, C++ and Fortran development.

Page 22: Environment for training models

Books

Page 23: Environment for training models

Clouds

Page 24: Environment for training models

Clouds

AWS (aws.amazon.com/marketplace/)

• P2-series: 16X K80

• X1-series: 128 vCPU / 1952 GB

• C4-series: 36 vCPU / 60 GB

Azure (azuremarketplace.microsoft.com)

• N-series: 4X K80

• H-Series: 16 vCPU / 224 GB

Page 25: Environment for training models

Azure CLI

1. sudo pip install azure-cli

2. az login

3. az group create --name GroupName --location EastUS

4. az vm create --resource-group GroupName --name MyVM --image Canonical:UbuntuServer:16.04-LTS:latest --size Standard_NC6 --storage-sku Standard_LRS --admin-username user --ssh-key-value ~/.ssh/id_rsa.pub

5. az vm deallocate --resource-group GroupName --name MyVM

6. az vm start --resource-group GroupName --name MyVM

7. az vm list-ip-addresses --resource-group GroupName --name MyVM

8. az vm delete --resource-group GroupName --name MyVM

9. az group delete --name GroupName

Page 26: Environment for training models

Data Science images in Azure Marketplace

Page 27: Environment for training models

Data Science images in AWS Marketplace

Page 28: Environment for training models

Containers

Page 29: Environment for training models

Docker

Page 30: Environment for training models

Docker (Dockerfile)

FROM gcr.io/tensorflow/tensorflow

MAINTAINER Dmitry Spodarets <[email protected]>

RUN apt update && apt -y upgrade && apt -y install git curl wget

CMD /run_jupyter.sh

Page 31: Environment for training models

Docker (build.sh)

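#!/bin/bash

# Build the image from ./<name>, push it to the registry, then remove the local copies.
function docker_build {
    docker build -t "$1" "./$1"
    docker tag "$1" "registry.flyelephant.net/$1"
    docker push "registry.flyelephant.net/$1"
    docker rmi "$1" "registry.flyelephant.net/$1"
}

case $1 in
    all)
        # Build every image named in build.list.
        for i in $(cat build.list); do
            docker_build "$i"
        done
        ;;
    *)
        docker_build "$1"
        ;;
esac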

Page 32: Environment for training models

Docker Hub

https://hub.docker.com/

Page 33: Environment for training models

Docker

1. docker images

2. docker run --memory 512m --cpus="2" --name mycont registry.flyelephant.net/tensorflow

3. docker exec -i -t mycont bash

4. docker ps

5. docker stats

6. docker stop CONTAINER_ID

7. docker start CONTAINER_ID

8. docker rm CONTAINER_ID

Page 34: Environment for training models

Docker Machine

• Amazon Web Services

• Digital Ocean

• Exoscale

• Generic

• Google Compute Engine

• IBM Softlayer

• Microsoft Azure

• Microsoft Hyper-V

• OpenStack

• Oracle

• VirtualBox

• Rackspace

• VMware Fusion

• VMware vCloud Air

• VMware vSphere

docker-machine create --driver azure --azure-subscription-id subscription-id --azure-resource-group resourcename --azure-ssh-user user --azure-size vm-size machine-name

docker-machine ssh machine-name

Page 35: Environment for training models

Singularity

Page 36: Environment for training models

Singularity - Containers for Science

• First public release in April 2016, followed by a massive uptake

• HPC Wire Editor’s choice: Top Technologies to Watch for 2017

• Simple integration with resource managers, InfiniBand, GPUs, MPI, file systems, and supports multiple architectures (x86_64, PPC, ARM, etc.)

• Limits user’s privileges (inside user == outside user)

• No root owned container daemon

• Network images are supported via URIs and all require local caching:

○ docker:// - This will pull a container from Docker Hub

○ http://, https:// - This will pull an image or tarball from the URL, cache and run it

○ shub:// - Pull an image from the Singularity Hub

Page 37: Environment for training models

Singularity - Usage Examples

$ python ./hello.py
Hello World: The Python version is 2.7.5
$ sudo singularity exec --writable /tmp/debian.img apt-get install python
…
$ singularity exec /tmp/debian.img python ./hello.py
Hello World: The Python version is 2.7.12

Webinar "Introduction to Singularity": https://youtu.be/h5rDnCA3NJA
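The hello.py script itself is not shown on the slide. A minimal guess at what such a script could look like, written so that it runs under both the Python 2.7 interpreters seen in the output above and Python 3; the function name `greeting` is illustrative:

```python
# hello.py: hypothetical reconstruction; the slide shows only the script's output.
import platform


def greeting():
    """Build the message, reporting the version of the interpreter running the script."""
    return "Hello World: The Python version is " + platform.python_version()


if __name__ == "__main__":
    print(greeting())
```

Run on the host and then inside the container, the same file reports different versions because each environment supplies its own interpreter.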

Page 38: Environment for training models

Contributors to Singularity

Page 39: Environment for training models

Network Based Computing Lab, Ohio State University

• High-Performance Big Data (HiBD): http://hibd.cse.ohio-state.edu/

• High-Performance Deep Learning (HiDL): http://hidl.cse.ohio-state.edu/

Page 40: Environment for training models

FlyElephant

Page 41: Environment for training models

FlyElephant platform for Data Science

We automate Data Science and help teams to work efficiently.

Computing resources

Ready-computing infrastructure

Collaboration & Sharing

Fast Deployment

Expert Community

Page 42: Environment for training models

Ready-computing infrastructure

Jupyter or other IDE

Automatic running of tasks

Server or Cluster

Page 43: Environment for training models

Our resources

• Public Clouds: Azure & AWS.

• Private cloud based on OpenStack.

• HPC-clusters based on SLURM.

• Docker-clusters based on Swarm / Singularity.

• Tools and languages: R, Python, Java, Scala, C/C++, Julia, OpenFOAM, Octave, PyFR, Scilab, GROMACS, MATLAB, Intel MKL, FlowVision, ANSYS, COMSOL, AVL, Hadoop, Spark, H2O, Anaconda, scikit-learn, Tensorflow, Theano, Caffe, etc.

FlyElephant US 1 Cloud (P100, K80, Titan X, FPGA (Xilinx))

• HPC HUB 1: 80 nodes (2 × Xeon E5-2680v2 (20 cores), 64GB RAM, IB FDR) and 240TB storage.

• HPC HUB 2: 100 nodes (2 × Xeon E5-2670v2 (20 cores), 256GB RAM, IB FDR) and 240TB storage.

• HPC HUB 3: 150 nodes (2 × Xeon E5-2650v2 (16 cores), 128GB RAM, 2 × Tesla K80, IB FDR) and 240TB storage.

Advania, CESGA, TACC (17), HLRS (14), LANL (10)

Page 44: Environment for training models

Dmitry Spodarets

[email protected]