IT4Innovations National Supercomputing Center
Branislav Jansík, [email protected]
Mission and Vision
Mission
Our mission is to deliver scientifically excellent and industry-relevant research in the fields of high performance computing and embedded systems. We provide state-of-the-art technology and expertise in high performance computing and embedded systems and make it available to Czech and international research teams from academia and industry.
Vision
To become a top European Centre of Excellence in IT with an emphasis on high performance computing and embedded systems. With our research, know-how and infrastructure we aspire to improve the quality of life, to increase the competitiveness of the industrial sector and to promote the cross-fertilization of high performance computing, embedded systems and other scientific and technical disciplines.
Salomon cluster HPC infrastructure

Storage
● HOME: 500 TB
● SCRATCH: 1700 TB

Interconnect
● InfiniBand, 7D hypercube
● 56 Gb/s

Compute
● 1008 nodes
● Haswell-EP 2.5 GHz x86-64
● 24 cores, 256-bit FMA instructions
● 128 GB RAM
● 864× Intel Xeon Phi 7120P
● 61 (244) cores, 512-bit FMA
● 16 GB RAM
Intel Xeon Phi accelerator cards

Xeon Phi 7120P
● x86 architecture, 1.2 TF
● 864× Intel Xeon Phi 7120P
● 61 (244) cores, 512-bit FMA
● 16 GB RAM
Xeon vs. Xeon Phi

Xeon CPU, 256 bit (Salomon)
● 12 cores
● 16 vector registers, ymm0 to ymm15, 256 bit
● FMA, out of order, dual issue

Xeon Phi, 512 bit (Salomon)
● 61 cores (122)
● 32 vector registers, zmm0 to zmm31, 512 bit
● FMA, in order, dual issue

[Diagram: a fused multiply-add, a × b + c = d; the parallelism multiplies out to 768× on the Xeon (12 × 4 × 4 × 2 × 2) and 7808× on the Xeon Phi (122 × 4 × 8 × 2)]
Logging in

● ssh salomon.it4i.cz
● ssh login1.salomon.it4i.cz
● ssh login2.salomon.it4i.cz
● ssh login3.salomon.it4i.cz
● ssh login4.salomon.it4i.cz

GUI:
● ssh -X salomon.it4i.cz
● VNC, see docs.it4i.cz
Modules

● Set up the application paths, library paths and environment variables for a particular application
● Lmod is available: https://docs.it4i.cz/software/lmod/

$ module avail
$ module load impi
$ module unload impi
$ module list
$ module purge
$ module whatis impi
Job execution

$ qsub -A ProjectID -q queue -l select=x jobscript
$ qsub -A ProjectID -q queue -l select=x -I

● The jobscript is executed on the first node of the allocation
● The jobscript is executed in the HOME directory
● The file $PBS_NODEFILE contains the list of allocated nodes; additional node files are in /lscratch/$PBS_JOBID
● Allocated nodes are accessible to the user via ssh
Job submission

Use the qsub command to submit your job to a queue:

$ qsub -A DD-16-1 -q R* \
  -l select=2:accelerator=True,walltime=00:20:00 -I

$ qsub -A DD-16-1 -q R* -l select=2:accelerator=True -I

Storage areas:
/home/$USER
/scratch/work/$USER
/scratch/temp/
/lscratch/$PBS_JOBID
Job management

● Use qstat and check-pbs-jobs to check job status

$ qstat -a
$ qstat -an
$ qstat -an -u username
$ qstat -f jobid

$ check-pbs-jobs --check-all
$ check-pbs-jobs --print-job-out
$ check-pbs-jobs --ls-lscratch
MIC programming on Salomon

● Code has to be compiled on a node with a MIC card and MPSS installed
● Nodes cns577 – cns1008
● To set up the MIC programming environment use:
  $ module load intel
  $ module load impi
● To get information about the MIC accelerator use:
  $ micinfo
[jansik@r38u13n982 ~]$ micinfo
MicInfo Utility Log
Created Tue Feb 7 09:15:19 2017

System Info
  HOST OS              : Linux
  OS Version           : 2.6.32-573.12.1.el6.noc0w.x86_64
  Driver Version       : 3.7.1-1
  MPSS Version         : 3.7.1
  Host Physical Memory : 128838 MB

Device No: 0, Device Name: mic0

Version
  Flash Version           : 2.1.02.0391
  SMC Firmware Version    : 1.17.6900
  SMC Boot Loader Version : 1.8.4326
  Coprocessor OS Version  : 2.6.38.8+mpss3.7.1
  Device Serial Number    : ADKC44601828
Native mode example

● Compile on HOST:
  $ icc -mmic -fopenmp vect-add-short.c -o vect-add-mic
● Connect to the MIC:
  $ ssh mic0 (or ssh mic1)
● Set up the path to the OpenMP libraries:
  $ echo $MIC_LD_LIBRARY_PATH
  mic0 $ export LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH
● Set the number of OpenMP threads (1 – 240):
  mic0 $ export OMP_NUM_THREADS=240
● Run:
  mic0 $ ~/path_to_binary/vect-add-mic
Offload mode example

● Compile on HOST:
  $ icc -fopenmp vect-add.c -o vect-add
● Turn on offload info:
  $ export OFFLOAD_REPORT=2
● Run:
  $ ~/path_to_binary/vect-add
Some debugging options

● -openmp-report[0|1|2] – controls the OpenMP parallelizer diagnostic level
● -vec-report[0|1|2] – controls the compiler-based vectorization diagnostic level
MPI example

● Load the Intel MPI module and set up the host environment:
  $ module load intel
  $ export I_MPI_MIC_POSTFIX=-mic
  $ export I_MPI_MIC=1
● Compile on HOST, for both host and MIC:
  $ mpiicc -xHost -o mpi-test mpi-test.c
  $ mpiicc -mmic -o mpi-test-mic mpi-test.c
● To set up the environment once for all sessions:
  $ vim ~/.profile
PBS generated node-files

● PBS generates a set of node-files
● Host-only node-file:
  /lscratch/${PBS_JOBID}/nodefile-cn-sn
● MIC-only node-file:
  /lscratch/${PBS_JOBID}/nodefile-mic-sn
● Host and MIC node-file:
  /lscratch/${PBS_JOBID}/nodefile-mix-sn
● Each host or accelerator is listed only once per file
● The user has to specify how many processes should be executed per node using the "-n" parameter of the mpirun command
MPI example, symmetric mode over multiple nodes

#!/bin/bash

# change to exec directory
cd $PBS_O_WORKDIR || exit

# set environment variables
module load impi
export I_MPI_MIC=1
export I_MPI_MIC_POSTFIX=-mic

mpirun -genv I_MPI_FABRICS shm:dapl \
       -genv I_MPI_DAPL_PROVIDER_LIST ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1 \
       -genv I_MPI_PIN_PROCESSOR_LIST 1 \
       -machinefile /lscratch/$PBS_JOBID/nodefile-mix-sn \
       ./hellompi.x

exit
Software environment

● Programming environment
  ‒ GNU compilers: gfortran, gcc, g++, gdb
  ‒ Intel compilers: ifort, icc, idb
  ‒ PGAS compilers: upc
  ‒ Portland Group compilers: pgcc, pgf90
  ‒ Interpreters: Perl, Python, Java, Ruby, Bash
● HPC libraries: Intel MKL suite, FFTW3, GSL, PETSc, ScaLAPACK, PLASMA and MAGMA
● Communication libraries: bullx MPI, OpenMPI, Intel MPI
● Performance analysis
  ‒ gprof
  ‒ PAPI, Scalasca
  ‒ HPCToolkit, Open|SpeedShop
Debugging and profiling

● GNU GDB, gprof
● Allinea DDT
● Allinea MAP
● Allinea Performance Reports
● RogueWave TotalView
● Vampir

● Profiling with Allinea MAP:
$ module load Forge/5.1-43967
$ map ./myprog.x
Conclusions
Read the documentation, contact support, contact me!
The IT4Innovations supercomputing center is here to run the computers and to assist you in using them.