
IT4Innovations national supercomputer center
Branislav Jansík
[email protected]

Mission and Vision

Mission
Our mission is to deliver scientifically excellent and industry-relevant research in the fields of high performance computing and embedded systems. We provide state-of-the-art technology and expertise in high performance computing and embedded systems and make it available to Czech and international research teams from academia and industry.

Vision
To become a top European Centre of Excellence in IT, with emphasis on high performance computing and embedded systems. With our research, know-how and infrastructure we aspire to improve the quality of life, increase the competitiveness of the industrial sector and promote the cross-fertilization of high performance computing, embedded systems and other scientific and technical disciplines.

Salomon

#67 on Top500.org

Salomon cluster HPC infrastructure

Storage
● HOME: 500 TB
● SCRATCH: 1700 TB

Interconnect
● InfiniBand, 7D hypercube
● 56 Gb/s

Compute
● 1008 nodes
● Haswell-EP 2.5 GHz x86-64
● 24 cores, 256 bit FMA instructions
● 128 GB RAM
● 864x Intel Xeon Phi 7120P
● 61 (244) cores, 512 bit FMA
● 16 GB RAM


Salomon 7D hypercube

Accelerated section

18 racks, 432 nodes, 864 Xeon Phi 7120P

Intel Xeon Phi accelerator cards

Xeon Phi 7120P
● x86 architecture, 1.2 TF
● 61 (244) cores, 512 bit FMA
● 16 GB RAM
● 864x Intel Xeon Phi 7120P in Salomon

3.36 TF per accelerated node
● 2x Intel Xeon Phi 7120P: 1.2 TF each
● 2x Intel Xeon (Haswell-EP): 0.48 TF each
● Compare: Los Alamos Blue Mountain (2000), 3.07 TF
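These per-device figures follow from the usual peak estimate cores × clock × FLOP/cycle (double precision): a 12-core Haswell-EP at 2.5 GHz with two 256-bit FMA units delivers 12 × 2.5 GHz × 16 = 480 GFLOPS ≈ 0.48 TF, and a Xeon Phi 7120P with 61 cores at about 1.24 GHz and one 512-bit FMA unit delivers 61 × 1.24 GHz × 16 ≈ 1.2 TF.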

Xeon vs. Xeon Phi

Xeon CPU, 256 bit (Salomon)
● 12 cores
● 16 vector registers, ymm0 to ymm15, 256 bit
● FMA (fused multiply-add: d = a * b + c in one instruction), out of order, dual issue
● Parallelism: 12 × 4 × 4 × 2 × 2 = 768×

Xeon Phi, 512 bit (Salomon)
● 61 cores (122)
● 32 vector registers, zmm0 to zmm31, 512 bit
● FMA, in order, dual issue
● Parallelism: 122 × 4 × 8 × 2 = 7808×
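To make the FMA and SIMD comparison concrete, the loop below is the kind of kernel the compiler turns into fused multiply-adds on both targets: 256-bit ymm FMAs on the Xeon (e.g. icc -fopenmp -xCORE-AVX2) and 512-bit zmm FMAs on the Xeon Phi (icc -fopenmp -mmic). This is a minimal sketch; the file and function names are illustrative, not taken from the slides.

/* fma-triad.c - illustrative kernel: one fused multiply-add per element */
#include <stdio.h>

void fma_triad(int n, double *a, const double *b, const double *c)
{
    /* icc vectorizes this loop; each iteration becomes part of a vector FMA */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] += b[i] * c[i];
}

int main(void)
{
    enum { N = 1 << 20 };
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }
    fma_triad(N, a, b, c);
    printf("a[0] = %.1f\n", a[0]);   /* 1.0 + 2.0 * 0.5 = 2.0 */
    return 0;
}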

Logging in:

● ssh salomon.it4i.cz

● ssh login1.salomon.it4i.cz
● ssh login2.salomon.it4i.cz
● ssh login3.salomon.it4i.cz
● ssh login4.salomon.it4i.cz


GUI:

● ssh -X salomon.it4i.cz
● VNC, docs.it4i.cz


Modules

● Sets up the application paths, library paths and environment variables for a particular application

● Lmod available: https://docs.it4i.cz/software/lmod/

$ module avail

$ module load impi

$ module unload impi

$ module list
$ module purge

$ module whatis impi

Allocation and execution

Resource allocation and execution via the PBS queue system

Job execution
$ qsub -A Project_ID -q queue -l select=x jobscript

$ qsub -A Project_ID -q queue -l select=x -I

● The jobscript is executed on the first node of the allocation

● The jobscript is executed in the HOME directory

● The file $PBS_NODEFILE contains the list of allocated nodes; per-job node-files are also generated in /lscratch/$PBS_JOBID

● Allocated nodes are accessible to the user via ssh

Job submission
Use the qsub command to submit your job to a queue:

$ qsub -A DD-16-1 -q R* -l select=2:accelerator=True,walltime=00:20:00 -I

$ qsub -A DD-16-1 -q R* -l select=2:accelerator=True -I

File systems available to the job:
/home/$USER/
/scratch/work/$USER/
/scratch/temp/
/lscratch/$PBS_JOBID


Job management
● Use the qstat and check-pbs-jobs commands to check job status

$ qstat -a
$ qstat -an
$ qstat -an -u username
$ qstat -f jobid

$ check-pbs-jobs --check-all
$ check-pbs-jobs --print-job-out
$ check-pbs-jobs --ls-lscratch

• Code has to be compiled on a node with MIC / MPSS installed

• Nodes cns577 – cns1008

● To set up the MIC programming environment use:
$ module load intel
$ module load impi

● To get information about the MIC accelerator use:
$ micinfo

MIC programming on Salomon

[jansik@r38u13n982 ~]$ micinfo
MicInfo Utility Log
Created Tue Feb 7 09:15:19 2017

System Info
  HOST OS                  : Linux
  OS Version               : 2.6.32-573.12.1.el6.noc0w.x86_64
  Driver Version           : 3.7.1-1
  MPSS Version             : 3.7.1
  Host Physical Memory     : 128838 MB

Device No: 0, Device Name: mic0

Version
  Flash Version            : 2.1.02.0391
  SMC Firmware Version     : 1.17.6900
  SMC Boot Loader Version  : 1.8.4326
  Coprocessor OS Version   : 2.6.38.8+mpss3.7.1
  Device Serial Number     : ADKC44601828

Native mode example

● Compile on the HOST:
$ icc -mmic -fopenmp vect-add-short.c -o vect-add-mic

● Connect to the MIC:
$ ssh mic0   (or ssh mic1)

● Set up the path to the OpenMP libraries:
$ echo $MIC_LD_LIBRARY_PATH
mic0 $ export LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH

● Set the number of OpenMP threads (1 - 240):
mic0 $ export OMP_NUM_THREADS=240

● Run:
mic0 $ ~/path_to_binary/vect-add-mic
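The source of vect-add-short.c is not reproduced in the slides; a minimal OpenMP vector add along the following lines matches the icc -mmic build above (array size and output are illustrative).

/* vect-add-short.c - sketch of a native-mode OpenMP vector add.
 * Build on the host:  icc -mmic -fopenmp vect-add-short.c -o vect-add-mic
 * Run the binary directly on the coprocessor (ssh mic0).                 */
#include <stdio.h>
#include <omp.h>

#define N (1 << 20)

int main(void)
{
    static float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("threads = %d, c[N-1] = %.1f\n", omp_get_max_threads(), c[N - 1]);
    return 0;
}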

Offload mode example

● Compile on the HOST:
$ icc -fopenmp vect-add.c -o vect-add

● Turn on offload reporting:
$ export OFFLOAD_REPORT=2

● Run:
$ ~/path_to_binary/vect-add
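The corresponding vect-add.c marks the compute region with an Intel offload pragma; a minimal sketch is below (the actual example in the IT4I documentation may differ). With OFFLOAD_REPORT=2 the offload runtime prints the data transfers and timings for the marked region.

/* vect-add.c - sketch of an offload-mode vector add.
 * Build on the host:  icc -fopenmp vect-add.c -o vect-add */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20;
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    float *c = malloc(n * sizeof *c);

    for (int i = 0; i < n; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Copy a and b to the coprocessor, run the loop there, copy c back */
    #pragma offload target(mic) in(a : length(n)) in(b : length(n)) out(c : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    printf("c[n-1] = %.1f\n", c[n - 1]);
    free(a); free(b); free(c);
    return 0;
}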

Some debugging options

-openmp-report[0|1|2] - controls the OpenMP parallelizer diagnostic level

-vec-report[0|1|2] - controls the compiler-based vectorization diagnostic level

MPI example

● Load the Intel MPI module and set up the host environment:
$ module load intel
$ export I_MPI_MIC_POSTFIX=-mic
$ export I_MPI_MIC=1

● Compile on the HOST, for both host and MIC:
$ mpiicc -xHost -o mpi-test mpi-test.c
$ mpiicc -mmic -o mpi-test-mic mpi-test.c

● Set up the environment once and for all:
$ vim ~/.profile
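The mpi-test.c source is not shown in the slides; a minimal sketch that fits the two compile lines above (and could equally serve as the hellompi.x used in the symmetric-mode jobscript later) is:

/* mpi-test.c - sketch of an MPI hello world for host and MIC.
 * Each rank reports where it runs, so in symmetric mode the output shows
 * which ranks landed on the host CPUs and which on mic0 / mic1.          */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}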

PBS generated node-files

● PBS generates a set of node-files
● Host-only node-file:

• /lscratch/${PBS_JOBID}/nodefile-cn-sn

● MIC-only node-file:

• /lscratch/${PBS_JOBID}/nodefile-mic-sn

● Host and MIC node-file:

• /lscratch/${PBS_JOBID}/nodefile-mix-sn

● Each host or accelerator is listed only once per file

● The user has to specify how many processes to execute per node using the "-n" parameter of the mpirun command

MPI example, symmetric mode over multiple nodes

#!/bin/bash

# change to exec directory
cd $PBS_O_WORKDIR || exit

# set environment variables
module load impi
export I_MPI_MIC=1
export I_MPI_MIC_POSTFIX=-mic

mpirun -genv I_MPI_FABRICS shm:dapl \
       -genv I_MPI_DAPL_PROVIDER_LIST ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1 \
       -genv I_MPI_PIN_PROCESSOR_LIST 1 \
       -machinefile /lscratch/$PBS_JOBID/nodefile-mix-sn \
       ./hellompi.x

# exit
exit

Documentation
http://docs.it4i.cz

Software environment
● Programming environment
  ‒ GNU compilers: gfortran, gcc, g++, gdb
  ‒ Intel compilers: ifort, icc, idb
  ‒ PGAS compilers: upc
  ‒ Portland Group compilers: pgc, pgf
  ‒ Interpreters: Perl, Python, Java, Ruby, bash
● HPC libraries: Intel MKL suite, FFTW3, GSL, PETSc, ScaLAPACK, PLASMA and MAGMA
● Communication libraries: bullx MPI, OpenMPI, Intel MPI
● Performance analysis
  ‒ gprof
  ‒ PAPI, Scalasca
  ‒ HPCToolkit, Open|SpeedShop

Debugging and profiling

● GNU GDB, gprof
● Allinea DDT
● Allinea MAP
● Allinea Performance Reports
● RogueWave TotalView
● Vampir

● Profiling with Allinea MAP:
$ module load Forge/5.1-43967
$ map ./myprog.x

Conclusions

Read the documentation, contact support, contact me!

The IT4Innovations supercomputing center is here to run the computers and to assist you in using them.

[email protected]