
Introduction to Parallel Computing
Intel Math Kernel Library

Huan-Ting Yen, Department of Mathematics, National Taiwan University, 2011/07/22


Parallel Computing

What is parallel computing?

Traditionally, software has been written for serial computation: a problem is broken into a discrete series of instructions that are executed one after another on a single processor.

In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem.

Resource

The compute resources can be:
- A single computer with multiple processors;
- An arbitrary number of computers connected by a network;
- A combination of both.

[Figures: threads mapped onto four cores (core 1 to core 4): one thread per core (thread 1 to thread 4), and several threads sharing the four cores.]

Why use parallel computing?

The primary reasons for using parallel computing:
- Save time (wall clock time)
- Solve larger problems
- Provide concurrency (do many things at the same time)

Other reasons might include:
- Taking advantage of non-local resources
- Cost savings
- Overcoming memory constraints

Amdahl’s Law

The speedup of a parallel program is limited by the amount of serial work it contains.
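Amdahl's law is usually written as S(n) = 1 / ((1 - p) + p/n), where p is the fraction of the run time that can be parallelized and n is the number of processors. The sketch below simply evaluates that formula; the function names are illustrative and do not come from any library.

```c
#include <assert.h>

/* Speedup predicted by Amdahl's law.
 * p: fraction of the run time that can be parallelized (0 <= p <= 1)
 * n: number of processors (n >= 1)                                   */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

/* Upper bound on the speedup as n grows without limit (requires p < 1). */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

With p = 0.9, four processors give a speedup of about 3.08, and no number of processors can push it beyond 10, because the remaining 10% of the work stays serial.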


Flynn’s Taxonomy

Classification for parallel computers and programs:

                 Single Instruction            Multiple Instruction
Single Data      SISD (single-core CPU)        MISD (very rare)
Multiple Data    SIMD (GPU/vector processor)   MIMD (multi-core CPU)

[Figures: block diagrams of the SISD, SIMD, MISD, and MIMD architectures.]

Intel Math Kernel Library

Overview


The Intel® Math Kernel Library (Intel® MKL) provides Fortran routines and functions that perform a wide variety of operations on vectors and matrices including sparse matrices. The library also includes fast Fourier transform (FFT) functions, as well as vector mathematical and vector statistical functions with Fortran and C interfaces.

The versions of Intel MKL intended for Windows* and Linux* operating systems also include ScaLAPACK software and Cluster FFT software for solving respective computational problems on distributed-memory parallel computers.

Intel MKL: Intel Math Kernel Library

Functionality:
- BLAS and Sparse BLAS Routines
- LAPACK Routines: Linear Equations
- LAPACK Routines: Eigenvalue Problems
- ScaLAPACK
- Sparse Solver Routines
- Fast Fourier Transforms
- Cluster Fast Fourier Transforms

System Requirements (Hardware)

Hardware:
- Intel® Core™ processor family
- Intel® Xeon® processor family
- Intel® Pentium® 4 processor family
- Intel® Pentium® III processor
- Intel® Pentium® processor (300 MHz or faster)
- Intel® Celeron® processor
- AMD Athlon* and Opteron* processors

How can you find information about the CPUs?
$ cat /proc/cpuinfo

System Requirements (Software)

The following operating systems are supported:
- Red Hat* Enterprise Linux* 3, 4, 5
- Red Hat* Fedora* 9
- Debian* GNU/Linux 4.0
- Ubuntu* 8.04

How can you find information about the operating system?
$ cat /etc/*release

The following C/C++ and Fortran compilers are supported:
- Intel® Fortran Compiler 10.1 for Linux*
- Intel® C++ Compiler 10.1 for Linux*
- GNU Compiler Collection (gcc, g77, gfortran 4.2.0)

Installing Intel MKL on a Linux* System

Tools & Downloads http://software.intel.com/en-us/ (google “intel software”)

user@host:~/software$ wget "URL"

user@host:~/software$ ll

$ tar -zxvf l_mkl_p_10.2.x.yyy.tar.gz

$ cd l_mkl_p_10.2.x.yyy
$ ./install.sh


Some Examples

Example

Brief examples of:
- BLAS Level 1 Routines (vector-vector operations)
- BLAS Level 2 Routines (matrix-vector operations)
- BLAS Level 3 Routines (matrix-matrix operations)
- Computing the LU factorization of a matrix (LAPACK)
- Solving a linear system (LAPACK)
- Solving an eigenvalue problem (LAPACK)
- Fast Fourier Transforms

Example

Ex1. The complex dot product ( res = sum over i of conj(x[i]) * y[i] )

#include <stdio.h>
#include "mkl_blas.h"
#define N 5

typedef struct {
    double re;
    double im;
} mkl_complex;

int main()
{
    int n, incx = 1, incy = 1, i;
    mkl_complex x[N], y[N], res;
    void zdotc();

    n = N;
    for( i = 0; i < n; i++ ){
        x[i].re = (double)i;
        x[i].im = (double)i * 2.0;
        y[i].re = (double)(n - i);
        y[i].im = (double)i * 2.0;
    }
    zdotc( &res, &n, x, &incx, y, &incy );
    printf( "The complex dot product is: ( %6.2f, %6.2f )\n", res.re, res.im );
    return 0;
}

zdotc computes the dot product of a conjugated vector with another vector.

Description: The routine is declared in
- Fortran 77: mkl_blas.fi
- Fortran 95: blas.f90
- C: mkl_blas.h

Input parameters ( zdotc(&res, &n, x, &incx, y, &incy) )
- n: The length of the two vectors.
- x, y: The input vectors; x is the one that is conjugated.
- incx: Specifies the increment for the elements of x.
- incy: Specifies the increment for the elements of y.

Output parameters ( zdotc(&res, &n, x, &incx, y, &incy) )
- res: The final result.
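To make the operation concrete, here is a plain-C reference version of the conjugated dot product. It is a sketch for checking results by hand, not MKL code: the names cplx and dotc_ref are made up for this illustration, and only positive increments are handled.

```c
#include <assert.h>

typedef struct { double re, im; } cplx;   /* illustrative type, not an MKL type */

/* Returns sum_i conj(x[i*incx]) * y[i*incy] for i = 0 .. n-1.
 * Assumes incx > 0 and incy > 0.                              */
cplx dotc_ref(int n, const cplx *x, int incx, const cplx *y, int incy)
{
    cplx res = { 0.0, 0.0 };
    assert(n >= 0 && incx > 0 && incy > 0);
    for (int i = 0; i < n; i++) {
        const cplx a = x[i * incx];
        const cplx b = y[i * incy];
        /* conj(a) * b = (a.re - i a.im)(b.re + i b.im) */
        res.re += a.re * b.re + a.im * b.im;
        res.im += a.re * b.im - a.im * b.re;
    }
    return res;
}
```

For the Ex1 data (x[i] = i + 2i*j, y[i] = (5 - i) + 2i*j with j the imaginary unit and i = 0..4) this function returns (140, 20), which can be compared with the value printed by the zdotc example.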

Makefile (Sequential)

Test : blas_c

CC = icc
MKL_HOME = /home/opt/intel/mkl/10.2.2.025
MKL_INCLUDE = $(MKL_HOME)/include
MKL_PATH = $(MKL_HOME)/lib/em64t
EXE = blas_c.exe

blas_c:
	$(CC) -o $(EXE) blas_c.c -I$(MKL_INCLUDE) -L$(MKL_PATH) \
	-lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread

Makefile (Parallel)

Test = blas_c

CC = icc
MKL_HOME = /home/opt/intel/mkl/10.2.2.025
MKL_INCLUDE = $(MKL_HOME)/include
MKL_PATH = $(MKL_HOME)/lib/em64t
EXE = blas_c.exe

blas_c:
	$(CC) -o $(EXE) blas_c.c -I$(MKL_INCLUDE) -L$(MKL_PATH) \
	-Wl,--start-group -lmkl_intel_lp64 -lmkl_core \
	-lmkl_intel_thread -Wl,--end-group -liomp5 -lpthread


BLAS Routines

Routine naming conventions: BLAS routine names have the following structure: <character> <name> <mode> ()

The <character> field indicates the data type:
s   real, single precision
c   complex, single precision
d   real, double precision
z   complex, double precision

The <mode> field, where present, gives additional detail about the operation:
c   conjugated vector
u   unconjugated vector
g   Givens rotation

For example, zdotc is z (complex, double precision) + dot (dot product) + c (conjugated vector).


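In BLAS Level 2 and 3, the first two letters of the <name> field indicate the matrix type:
ge  general matrix
gb  general band matrix
sy  symmetric matrix
sb  symmetric band matrix
he  Hermitian matrix
hb  Hermitian band matrix
tr  triangular matrix
tb  triangular band matrix

For example, dgemv is d (real, double precision) + ge (general matrix) + mv (matrix-vector product).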
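The naming scheme can be read mechanically. The sketch below decodes the data-type letter and the two-letter matrix type from a routine name. Both helper functions are hypothetical and exist only for this illustration; they do not handle mixed prefixes such as sc or dz, or names like idamax that start with i.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Describes the <character> prefix of a BLAS name, or returns NULL if unknown. */
const char *blas_type_of(const char *routine)
{
    assert(routine != NULL);
    if (strlen(routine) < 2) return NULL;
    switch (routine[0]) {
    case 's': return "real, single precision";
    case 'd': return "real, double precision";
    case 'c': return "complex, single precision";
    case 'z': return "complex, double precision";
    default:  return NULL;
    }
}

/* Describes the matrix type of a Level 2/3 name, or returns NULL if the
 * two letters after the prefix are not one of the listed codes.         */
const char *blas_matrix_of(const char *routine)
{
    static const char *codes[] = { "ge", "gb", "sy", "sb", "he", "hb", "tr", "tb" };
    static const char *names[] = {
        "general matrix", "general band matrix",
        "symmetric matrix", "symmetric band matrix",
        "Hermitian matrix", "Hermitian band matrix",
        "triangular matrix", "triangular band matrix"
    };
    assert(routine != NULL);
    if (strlen(routine) < 3) return NULL;
    for (size_t i = 0; i < sizeof codes / sizeof codes[0]; i++)
        if (strncmp(routine + 1, codes[i], 2) == 0) return names[i];
    return NULL;
}
```

For instance, blas_type_of("zdotc") yields "complex, double precision" and blas_matrix_of("dgemv") yields "general matrix", while a Level 1 name such as "ddot" has no matrix type and gives NULL.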

BLAS Level 1 Routines

Routine   Data Types            Description
?asum     s, d, sc, dz          Sum of vector magnitudes
?axpy     s, d, c, z            Scalar-vector product added to a vector (y := a*x + y)
?copy     s, d, c, z            Copy vector
?dot      s, d                  Dot product
?dotc     c, z                  Dot product, conjugated
?nrm2     s, d, sc, dz          Vector 2-norm (Euclidean norm)
?rotg     s, d, c, z            Givens rotation of points
?rot      s, d, cs, zd          Plane rotation of points
?scal     s, d, c, z, cs, zd    Vector-scalar product
?swap     s, d, c, z            Vector-vector swap
i?amax    s, d, c, z            Index of the maximum absolute value element of a vector

Example

Ex2-1. Matrix-vector product

#include <stdlib.h>
#include "mkl_blas.h"

int main()
{
    int m, n, incx, incy, lda, idxi, idxj;
    double alpha, beta, *x, *y, *A;
    char trans;

    m = 3; n = 3; incx = 1; incy = 1; lda = m;
    alpha = 1.0; beta = 1.0; trans = 'n';

    x = (double*)malloc(sizeof(double)*n);      /* x has n entries */
    y = (double*)malloc(sizeof(double)*m);      /* y has m entries when trans = 'n' */
    A = (double*)malloc(sizeof(double)*lda*n);

    for( idxi = 0; idxi < n; idxi++ )
        *(x+idxi) = 1.0;
    for( idxi = 0; idxi < m; idxi++ )
        *(y+idxi) = 1.0;

    /* BLAS expects column-major storage: A(i,j) is stored at A[i + j*lda] */
    for( idxj = 0; idxj < n; idxj++ )
        for( idxi = 0; idxi < m; idxi++ )
            *(A+idxi+idxj*lda) = (double)(idxi+1) + idxj;

    dgemv(&trans, &m, &n, &alpha, A, &lda, x, &incx, &beta, y, &incy);

    free(x); free(y); free(A);
    return 0;
}

dgemv computes a matrix-vector product using a general matrix.

Description: The routine is declared in
- Fortran 77: mkl_blas.fi
- Fortran 95: blas.f90
- C: mkl_blas.h

Input parameters ( dgemv(&trans, &m, &n, &alpha, A, &lda, x, &incx, &beta, y, &incy) )
- trans: if trans = 'N' or 'n', then y := alpha*A*x + beta*y;
  if trans = 'T' or 't', then y := alpha*A'*x + beta*y;
  if trans = 'C' or 'c', then y := alpha*conjg(A')*x + beta*y.
- m: The number of rows of the matrix A.
- n: The number of columns of the matrix A.
- lda: The first dimension of A, lda ≥ max(1,m).
- incx: Specifies the increment for the elements of x.
- incy: Specifies the increment for the elements of y.

Output parameters
- y: Updated vector y.
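The following plain-C reference version spells out what the routine computes and how column-major storage with a leading dimension is indexed. It is a sketch for cross-checking small cases, not MKL code: gemv_ref is a made-up name, the increments are assumed to be 1, and only real data is handled (so 'C' behaves like 'T').

```c
#include <assert.h>

/* y := alpha*op(A)*x + beta*y for a column-major m-by-n matrix A,
 * where A(i,j) is stored at a[i + j*lda]. Unit increments assumed. */
void gemv_ref(char trans, int m, int n, double alpha,
              const double *a, int lda,
              const double *x, double beta, double *y)
{
    const int transposed =
        (trans == 'T' || trans == 't' || trans == 'C' || trans == 'c');
    const int leny = transposed ? n : m;   /* length of y */
    const int lenx = transposed ? m : n;   /* length of x */

    assert(m >= 0 && n >= 0 && lda >= (m > 1 ? m : 1));
    for (int i = 0; i < leny; i++) {
        double sum = 0.0;
        for (int j = 0; j < lenx; j++) {
            /* op(A)(i,j) is A(j,i) when transposed, A(i,j) otherwise */
            const double aij = transposed ? a[j + i * lda] : a[i + j * lda];
            sum += aij * x[j];
        }
        y[i] = alpha * sum + beta * y[i];
    }
}
```

With the Ex2-1 data (A(i,j) = i + j + 1 for zero-based i and j, x and y all ones, alpha = beta = 1) the updated y is (7, 10, 13).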

Ex2. Result


BLAS Level 2 Routines

Routine   Data Types    Description
?gemv     s, d, c, z    Matrix-vector product using a general matrix
?gbmv     s, d, c, z    Matrix-vector product using a general band matrix
?symv     s, d          Matrix-vector product using a symmetric matrix
?sbmv     s, d          Matrix-vector product using a symmetric band matrix
?hemv     c, z          Matrix-vector product using a Hermitian matrix
?hbmv     c, z          Matrix-vector product using a Hermitian band matrix
?trmv     s, d, c, z    Matrix-vector product using a triangular matrix
?tbmv     s, d, c, z    Matrix-vector product using a triangular band matrix

Example

Ex3-1. Matrix-Matrix product

#include <stdlib.h>
#include "mkl_blas.h"

int main()
{
    int m, n, k, lda, ldb, ldc, idxi, idxj;
    double alpha, beta, *A, *B, *C;
    char transa, transb;

    m = 3; n = 3; k = 3;
    lda = m; ldb = k; ldc = m;
    alpha = 1.0; beta = 1.0;
    transa = 'n'; transb = 'n';

    A = (double*)malloc(sizeof(double)*lda*k);   /* A is m-by-k */
    B = (double*)malloc(sizeof(double)*ldb*n);   /* B is k-by-n */
    C = (double*)malloc(sizeof(double)*ldc*n);   /* C is m-by-n */

    /* All three matrices are 3-by-3 here, so one loop nest fills them all.
       Column-major storage: element (i,j) is stored at [i + j*ld]. */
    for( idxj = 0; idxj < n; idxj++ )
        for( idxi = 0; idxi < m; idxi++ ){
            *(A+idxi+idxj*lda) = (double)(idxi+1) + idxj;
            *(B+idxi+idxj*ldb) = (double)(idxi+1) + idxj;
            *(C+idxi+idxj*ldc) = (double)(idxi+1) + idxj;
        }

    dgemm(&transa, &transb, &m, &n, &k,
          &alpha, A, &lda, B, &ldb, &beta, C, &ldc);

    free(A); free(B); free(C);
    return 0;
}

dgemm computes C := alpha*op(A)*op(B) + beta*C, where op(X) is X or its transpose, as selected by transa and transb.

Input parameters
- k: The number of columns of the matrix op(A) and the number of rows of the matrix op(B).
- lda: When transa = 'N' or 'n', lda ≥ max(1,m); otherwise lda ≥ max(1,k).
- ldb: When transb = 'N' or 'n', ldb ≥ max(1,k); otherwise ldb ≥ max(1,n).
- ldc: The first dimension of C, ldc ≥ max(1,m).

Output parameters
- C: Overwritten by the m-by-n result matrix.
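A triple loop makes the roles of m, n, k and the three leading dimensions explicit. This is a sketch for the non-transposed case only (transa = transb = 'N'); gemm_ref is a made-up name, and the loop is a reference for checking small results, not a substitute for the optimized routine.

```c
#include <assert.h>

/* C := alpha*A*B + beta*C with column-major storage.
 * A is m-by-k (lda >= m), B is k-by-n (ldb >= k), C is m-by-n (ldc >= m). */
void gemm_ref(int m, int n, int k, double alpha,
              const double *a, int lda,
              const double *b, int ldb,
              double beta, double *c, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    assert(lda >= (m > 1 ? m : 1) && ldb >= (k > 1 ? k : 1) && ldc >= (m > 1 ? m : 1));
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int p = 0; p < k; p++)
                sum += a[i + p * lda] * b[p + j * ldb];   /* A(i,p) * B(p,j) */
            c[i + j * ldc] = alpha * sum + beta * c[i + j * ldc];
        }
    }
}
```

With the Ex3-1 data (A = B = C with entries i + j + 1 for zero-based i and j, alpha = beta = 1), the first row of the result is (15, 22, 29) and the last diagonal entry is 55.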

Ex3. Result

BLAS Level 3 Routines

Routine   Data Types    Description
?gemm     s, d, c, z    Matrix-matrix product of general matrices
?hemm     c, z          Matrix-matrix product of Hermitian matrices
?symm     s, d, c, z    Matrix-matrix product of symmetric matrices
?trmm     s, d, c, z    Matrix-matrix product of triangular matrices

Example

Ex4. LU Factorization

#include <stdlib.h>
#include "mkl_lapack.h"

int main()
{
    int m, n, lda, info, *ipiv;
    double *A;

    m = 3; n = 3; lda = m;
    ipiv = (int*)malloc(sizeof(int)*m);
    A = (double*)malloc(sizeof(double)*lda*n);

    /* Column-major storage: each row below is one column of A */
    *(A+0) =  1; *(A+1) = 2; *(A+2) = 6;
    *(A+3) = -2; *(A+4) = 3; *(A+5) = 5;
    *(A+6) =  4; *(A+7) = 8; *(A+8) = 1;

    dgetrf(&m, &n, A, &lda, ipiv, &info);

    free(ipiv); free(A);
    return 0;
}

?getrf

Description: The routine is declared in
- Fortran 77: mkl_lapack.fi
- Fortran 95: lapack.f90
- C: mkl_lapack.h

Input parameters
- m: The number of rows of the matrix A.
- n: The number of columns of the matrix A.
- lda: The first dimension of A.
- A: Array; REAL for sgetrf, DOUBLE PRECISION for dgetrf, COMPLEX for cgetrf, DOUBLE COMPLEX for zgetrf.

Output parameters
- A: Overwritten by L and U. The unit diagonal elements of L are not stored.
- ipiv: An integer array, dimension at least max(1,min(m,n)). The pivot indices; row i is interchanged with row ipiv(i).
- info: Integer. If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the factorization has been completed, but U is singular (u(i,i) is exactly zero).
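To show how the packed output and the pivot indices fit together, here is a small unblocked LU factorization with partial pivoting for a square matrix. It is a teaching sketch that follows the output conventions described above (L and U packed in A, 1-based ipiv, info > 0 for a zero pivot); lu_factor is a made-up name and this is not how the library implements ?getrf.

```c
#include <assert.h>

/* In-place LU factorization with partial pivoting of a column-major
 * n-by-n matrix. On return the strict lower triangle holds L (unit
 * diagonal not stored) and the upper triangle holds U. ipiv is 1-based.
 * Returns 0 on success, or i > 0 if U(i,i) is exactly zero.            */
int lu_factor(int n, double *a, int lda, int *ipiv)
{
    int info = 0;
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    for (int j = 0; j < n; j++) {
        /* pick the entry of largest magnitude in column j, rows j..n-1 */
        int p = j;
        for (int i = j + 1; i < n; i++) {
            double cand = a[i + j * lda] < 0 ? -a[i + j * lda] : a[i + j * lda];
            double best = a[p + j * lda] < 0 ? -a[p + j * lda] : a[p + j * lda];
            if (cand > best) p = i;
        }
        ipiv[j] = p + 1;
        if (a[p + j * lda] == 0.0) {          /* zero pivot: record and skip */
            if (info == 0) info = j + 1;
            continue;
        }
        if (p != j)                            /* interchange rows j and p */
            for (int c = 0; c < n; c++) {
                double t = a[j + c * lda];
                a[j + c * lda] = a[p + c * lda];
                a[p + c * lda] = t;
            }
        for (int i = j + 1; i < n; i++) {
            a[i + j * lda] /= a[j + j * lda];  /* multiplier, stored as L(i,j) */
            for (int c = j + 1; c < n; c++)
                a[i + c * lda] -= a[i + j * lda] * a[j + c * lda];
        }
    }
    return info;
}
```

For the Ex4 matrix, this sketch selects row 3 as the pivot row in every step, so ipiv = (3, 3, 3), and the diagonal of U is 6, -17/6, 161/17.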

Ex4-1. Result

Ex4-2. Result

LAPACK Computational Routines

Task                                                  general   symmetric    symmetric           triangular
                                                      matrix    indefinite   positive-definite   matrix
Factorize matrix                                      ?getrf    ?sytrf       ?potrf
Solve linear system with a factored matrix            ?getrs    ?sytrs       ?potrs              ?trtrs
Condition number                                      ?gecon    ?sycon       ?pocon              ?trcon
Compute the inverse matrix using the factorization    ?getri    ?sytri       ?potri              ?trtri

LAPACK Routines: Linear Equations

To solve a particular problem, you can call two or more computational routines, or call a corresponding driver routine that combines several tasks in one call. For example, to solve a system of linear equations with a general matrix, call ?getrf (LU factorization) and then ?getrs (computing the solution). Alternatively, use the driver routine ?gesv, which performs all these tasks in one call.
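The second step of the two-call approach can be sketched in plain C: given the packed L and U factors and the pivot indices in the layout that ?getrf produces, apply the row interchanges to b, then do a forward and a backward substitution. The name lu_solve is made up for this illustration, and a single right-hand side and nonsingular U are assumed.

```c
#include <assert.h>

/* Solves A x = b given the packed LU factors of A (column-major, unit
 * lower triangle L below the diagonal, U on and above it) and 1-based
 * pivot indices ipiv. b is overwritten by the solution x.             */
void lu_solve(int n, const double *lu, int lda, const int *ipiv, double *b)
{
    assert(n > 0 && lda >= n);

    /* apply the row interchanges recorded during factorization: b := P b */
    for (int i = 0; i < n; i++) {
        int p = ipiv[i] - 1;
        if (p != i) { double t = b[i]; b[i] = b[p]; b[p] = t; }
    }
    /* forward substitution with unit lower triangular L: L y = P b */
    for (int i = 1; i < n; i++)
        for (int j = 0; j < i; j++)
            b[i] -= lu[i + j * lda] * b[j];
    /* backward substitution with U: U x = y */
    for (int i = n - 1; i >= 0; i--) {
        for (int j = i + 1; j < n; j++)
            b[i] -= lu[i + j * lda] * b[j];
        b[i] /= lu[i + i * lda];
    }
}
```

As a small check, the matrix with rows (4, 3) and (6, 3) factors into ipiv = (2, 2) with packed factors {6, 2/3, 3, 1} in column-major order; solving for b = (10, 12) then gives x = (1, 2).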

Example

Ex5-1. Solve the Linear Equation

#include <stdio.h>
#include <stdlib.h>
#include "mkl_lapack.h"

int main()
{
    int n, nrhs, lda, ldb, info, idxi, idxj, *ipiv;
    double *A, *b;

    n = 3; nrhs = 1; lda = n; ldb = n;
    ipiv = (int*)malloc(sizeof(int)*n);
    A = (double*)malloc(sizeof(double)*lda*n);
    b = (double*)malloc(sizeof(double)*ldb*nrhs);

    /* Note: this A is singular (row 1 + row 3 = 2 * row 2),
       so check info before trusting the solution. */
    for( idxi = 0; idxi < n; idxi++ )
        for( idxj = 0; idxj < n; idxj++ )
            *(A+idxi*n+idxj) = (double)(idxi+1) + idxj;

    *(b+0) = 6;
    *(b+1) = 9;
    *(b+2) = 12;

    dgesv(&n, &nrhs, A, &lda, ipiv, b, &ldb, &info);
    if( info != 0 )
        printf( "dgesv returned info = %d\n", info );

    free(ipiv); free(A); free(b);
    return 0;
}

Input parameters
- nrhs: The number of columns of the right-hand side matrix B (the number of right-hand sides).

Output parameters
- A: Overwritten by the factors L and U from the factorization of A.
- b: Overwritten by the solution matrix X.

Ex5. Result

Example

Ex6-1. Solve the Eigenvalue Equation

#include <stdlib.h>
#include "mkl_lapack.h"

int main()
{
    int n, lda, lwork, ldvl, ldvr, info;
    double *wr, *wi, *A, *work, *vl, *vr;
    char jobvl, jobvr;

    n = 3; lda = n; ldvl = 1; ldvr = n;
    lwork = 4*n;   /* not 3*n, because eigenvectors are requested */
    jobvl = 'N'; jobvr = 'V';

    A    = (double*)malloc(sizeof(double)*n*n);
    wr   = (double*)malloc(sizeof(double)*n);
    wi   = (double*)malloc(sizeof(double)*n);
    vl   = (double*)malloc(sizeof(double)*ldvl*n);
    vr   = (double*)malloc(sizeof(double)*ldvr*n);
    work = (double*)malloc(sizeof(double)*lwork);

    *(A+0) =  2; *(A+1) = -1; *(A+2) =  0;
    *(A+3) = -1; *(A+4) =  2; *(A+5) = -1;
    *(A+6) =  0; *(A+7) = -1; *(A+8) =  2;

    /* wr and wi are already pointers, so they are passed without & */
    dgeev(&jobvl, &jobvr, &n, A, &lda, wr, wi,
          vl, &ldvl, vr, &ldvr, work, &lwork, &info);

    free(A); free(wr); free(wi); free(vl); free(vr); free(work);
    return 0;
}

Input parameters
- jobvl: If jobvl = 'N', the left eigenvectors of A are not computed. If jobvl = 'V', the left eigenvectors of A are computed.
- jobvr: If jobvr = 'N', the right eigenvectors of A are not computed. If jobvr = 'V', the right eigenvectors of A are computed.
- work: A workspace array of dimension max(1, lwork).
- lwork: The dimension of the array work. lwork ≥ max(1,3n); if jobvl = 'V' or jobvr = 'V', lwork ≥ max(1,4n) (for real flavors).
- ldvl, ldvr: The leading dimensions of the output arrays vl and vr, respectively.

Output parameters
- wr, wi: Contain the real and imaginary parts, respectively, of the computed eigenvalues.
- vl, vr: If jobvl = 'V', the left eigenvectors u(j) are stored one after another in the columns of vl, in the same order as their eigenvalues. If jobvl = 'N', vl is not referenced. If the j-th eigenvalue is real, then u(j) = vl(:,j), the j-th column of vl. The right eigenvectors are stored in vr in the same way, controlled by jobvr.
- info: If info = 0, the execution is successful. If info = -i, the i-th parameter had an illegal value. If info = i, the QR algorithm failed to compute all the eigenvalues, and no eigenvectors have been computed.
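One simple way to sanity-check a real eigenpair, such as wr[j] together with column j of vr when wi[j] is zero, is to measure the residual A*v - lambda*v. The helper below is a hypothetical sketch, not part of LAPACK. It needs the original matrix, so keep a copy of A before calling dgeev, which overwrites its input.

```c
#include <assert.h>

/* Returns max_i |(A v)_i - lambda * v_i| for a column-major n-by-n matrix A.
 * A value near zero means (lambda, v) is a good eigenpair of A.            */
double eig_residual(int n, const double *a, int lda, double lambda, const double *v)
{
    double worst = 0.0;
    assert(n > 0 && lda >= n);
    for (int i = 0; i < n; i++) {
        double r = -lambda * v[i];
        for (int j = 0; j < n; j++)
            r += a[i + j * lda] * v[j];        /* adds A(i,j) * v(j) */
        if (r < 0.0) r = -r;
        if (r > worst) worst = r;
    }
    return worst;
}
```

For the Ex6 matrix, lambda = 2 with v = (1, 0, -1) gives a residual of exactly zero. dgeev returns eigenvectors scaled to unit Euclidean norm, so its corresponding column would be this vector divided by the square root of 2, possibly with the opposite sign.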

Ex6. Result

LAPACK Computational Routines

- Orthogonal Factorizations (QR, QZ)
- Singular Value Decomposition
- Symmetric Eigenvalue Problems
- Generalized Symmetric-Definite Eigenvalue Problems
- Nonsymmetric Eigenvalue Problems
- Generalized Nonsymmetric Eigenvalue Problems
- Generalized Singular Value Decomposition

LAPACK Driver Routines

- Linear Least Squares (LLS) Problems
- Generalized LLS Problems
- Symmetric Eigenproblems
- Nonsymmetric Eigenproblems
- Singular Value Decomposition
- Generalized Symmetric Definite Eigenproblems
- Generalized Nonsymmetric Eigenproblems

Example

Five Stage Usage Model for Computing FFT

Allocate a fresh descriptor for the problem with a call to the DftiCreateDescriptor function. (precision, rank, sizes, scaling factor, …)

Optionally adjust the descriptor configuration with a call to the DftiSetValue function.

Commit the descriptor with a call to the DftiCommitDescriptor function.

Compute the transform with a call to the DftiComputeForward/DftiComputeBackward function.

Deallocate the descriptor with a call to the DftiFreeDescriptor function.

Ex7-1. Three-Dimensional Complex FFT

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mkl_dfti.h"

/* Note: 1000^3 complex doubles need about 16 GB per array;
   reduce these sizes to experiment on a small machine. */
#define m 1000
#define n 1000
#define k 1000

typedef struct {
    double re;
    double im;
} mkl_complex;

int main()
{
    size_t idx, total = (size_t)m*n*k;
    double backward_scale;
    MKL_LONG status, length[3];
    mkl_complex *vec_src, *vec_tmp, *vec_dst;
    DFTI_DESCRIPTOR_HANDLE handle = 0;

    vec_src = (mkl_complex*)malloc(sizeof(mkl_complex)*total);
    vec_tmp = (mkl_complex*)malloc(sizeof(mkl_complex)*total);
    vec_dst = (mkl_complex*)malloc(sizeof(mkl_complex)*total);

    length[0] = m; length[1] = n; length[2] = k;

    memset(vec_tmp, 0, sizeof(mkl_complex)*total);
    memset(vec_dst, 0, sizeof(mkl_complex)*total);

    /* constant input: every element is 1 + 0i */
    for( idx = 0; idx < total; idx++ ){
        vec_src[idx].re = 1.0;
        vec_src[idx].im = 0.0;
    }

    status = DftiCreateDescriptor( &handle, DFTI_DOUBLE,
                                   DFTI_COMPLEX, 3, length );
    if( status && !DftiErrorClass(status, DFTI_NO_ERROR) ){
        printf("Error : %s\n", DftiErrorMessage(status));
        printf("TEST FAILED : DftiCreateDescriptor(&handle, ...)\n");
        return 1;
    }

    status = DftiSetValue( handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE );
    status = DftiCommitDescriptor( handle );
    status = DftiComputeForward( handle, vec_src, vec_tmp );

    /* scale the backward transform so that it inverts the forward one */
    backward_scale = 1.0/((double)m*n*k);
    status = DftiSetValue( handle, DFTI_BACKWARD_SCALE, backward_scale );
    status = DftiCommitDescriptor( handle );
    status = DftiComputeBackward( handle, vec_tmp, vec_dst );

    status = DftiFreeDescriptor( &handle );

    free(vec_src); free(vec_tmp); free(vec_dst);
    return 0;
}

FFT Functions

Function Name          Operation
DftiCreateDescriptor   Allocates memory for the descriptor data structure and preliminarily initializes it.
DftiCommitDescriptor   Performs all initialization for the actual FFT computation.
DftiCopyDescriptor     Copies an existing descriptor.
DftiFreeDescriptor     Frees memory allocated for a descriptor.
DftiComputeForward     Computes the forward FFT.
DftiComputeBackward    Computes the backward FFT.
DftiSetValue           Sets one particular configuration parameter with the specified configuration value.
DftiGetValue           Gets the value of one particular configuration parameter.


Reference

- LLNL tutorials web site (https://computing.llnl.gov/tutorials/parallel_comp/)
- Intel® Math Kernel Library Reference Manual (mklman.pdf)
- Intel® Math Kernel Library for the Linux OS User's Guide (userguide.pdf)
