Low hanging fruit...Write faster code Approximate calculations Data layout Compute flow I/O or Compute bound? Profile code (80/20 rule) Low hanging fruit I/O bound? Process parallel

[Flowchart: Write faster code]

• Approximate calculations
• Data layout
• Compute flow
• I/O or compute bound?
• Profile code (80/20 rule)
• Low hanging fruit
• I/O bound?
• Process parallel
• Loop parallel
• Different quantity
• Lower precision
• pthreads, OpenMP
• Refactor
• Modify algorithm
• Get SSD
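The "Profile code (80/20 rule)" step can be sketched with Python's standard `cProfile` module. This is a minimal illustration, not the speaker's setup: the functions `cheap_step`, `expensive_step` and `pipeline` are hypothetical stand-ins for a real workload.

```python
import cProfile
import io
import pstats


def cheap_step(n):
    # Hypothetical inexpensive stage of the workload.
    return sum(range(n))


def expensive_step(n):
    # Hypothetical hotspot: a nested pure-Python loop.
    total = 0
    for i in range(n):
        for j in range(50):
            total += i * j
    return total


def pipeline():
    cheap_step(1_000)
    expensive_step(20_000)


profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Sort by cumulative time so the functions that dominate the run appear first.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

In a report like this, the small number of functions near the top are the candidates worth optimizing first.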

float metadata[N]; float metadata[N];

// parallel vectors

Data not contiguous in memory: memory jumps when accessing data lead to slow distance calculations.

x1 y1 f1 … f1 f1 x2 y2 f2 … f2 f2

Data is contiguous:

x1 y1 i2 j2 … … … … … … … … xn yn

f1 f1 f1 f1 … f2 f2 f2 f2 … f2 f3 f3 f3
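The two layouts can be sketched in NumPy. This is an illustration under assumptions, not the code behind the slides: `N_POINTS`, `N_FEAT`, the record fields and `squared_distances` are hypothetical names, and the interleaved layout is modelled with a structured array.

```python
import numpy as np

N_POINTS = 1000   # hypothetical number of points
N_FEAT = 32       # hypothetical feature-vector length (the slide's N)

rng = np.random.default_rng(0)

# Interleaved layout: every record stores x, y and its own feature vector,
# so consecutive feature vectors are separated by the x, y fields.
record = np.dtype([("x", np.float32), ("y", np.float32),
                   ("metadata", np.float32, (N_FEAT,))])
interleaved = np.zeros(N_POINTS, dtype=record)
interleaved["metadata"] = rng.random((N_POINTS, N_FEAT), dtype=np.float32)

# Parallel vectors: coordinates and features live in separate dense arrays.
xy = np.zeros((N_POINTS, 2), dtype=np.float32)
metadata = np.ascontiguousarray(interleaved["metadata"])


def squared_distances(descriptors, query):
    """Squared Euclidean distance from each row of descriptors to query."""
    diff = descriptors - query
    return np.einsum("ij,ij->i", diff, diff)


query = metadata[0].copy()
d_interleaved = squared_distances(interleaved["metadata"], query)
d_parallel = squared_distances(metadata, query)

print(interleaved["metadata"].flags.c_contiguous)  # strided view into records
print(metadata.flags.c_contiguous)                 # dense block of features
```

Both calls compute the same distances; only the memory access pattern differs, which is the effect the slide attributes the slowdown to.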

Data layout matters

[Figure: slowdown factor (y-axis, 1 to 2.8) versus jump size in bytes (x-axis, 0 to 8192)]
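A rough way to reproduce this kind of measurement is to sum the same number of values while increasing the gap between them. The sketch below is an assumption about how such a test could be set up, not the benchmark used for the figure; `time_strided_sum` and its parameters are hypothetical, and a 4-byte jump stands in for contiguous access.

```python
import time
import numpy as np


def time_strided_sum(jump_bytes, n_items=20_000, repeats=5):
    """Sum n_items float32 values spaced jump_bytes apart; return best time (s)."""
    step = max(jump_bytes // 4, 1)            # a float32 occupies 4 bytes
    buffer = np.ones(n_items * step, dtype=np.float32)
    view = buffer[::step]                     # same element count, larger stride
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        view.sum()
        best = min(best, time.perf_counter() - start)
    return best


jumps = [4, 32, 64, 128, 256, 512, 1024]
baseline = time_strided_sum(jumps[0])
slowdown = {jump: time_strided_sum(jump) / baseline for jump in jumps}

for jump, factor in slowdown.items():
    print(f"{jump:5d} bytes: {factor:.2f}x")
```

The absolute factors depend on the machine's cache sizes and on NumPy's internals, so the numbers will not match the figure.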

Language choice

• prototyping
• shipping
• readability
• time
• existing software
• memory
• power
• speed
• security
• hardware
• dependencies

A simple benchmark

An algorithm that is

• well-understood

• not domain-specific

• computationally intensive

Computing the Cholesky decomposition of A,

A = L L^T,

simplifies the process of solving Ax = b.
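To see why the factorization helps, note that once L is known, Ax = b splits into two triangular systems: L y = b (forward substitution) and L^T x = y (back substitution). A minimal sketch, assuming A is symmetric positive definite; `solve_with_cholesky` is a hypothetical helper name and the 2x2 system is made up for illustration.

```python
import numpy as np


def solve_with_cholesky(A, b):
    """Solve A x = b for symmetric positive definite A via A = L L^T."""
    L = np.linalg.cholesky(A)            # lower-triangular factor
    n = len(b)

    y = np.zeros(n)
    for i in range(n):                   # forward substitution: L y = b
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]

    x = np.zeros(n)
    for i in reversed(range(n)):         # back substitution: L^T x = y
        x[i] = (y[i] - L[i + 1:, i] @ x[i + 1:]) / L[i, i]
    return x


A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
b = np.array([2.0, 1.0])
x = solve_with_cholesky(A, b)
print(x)                                 # approximately [0.5, 0.0]
```

Each substitution costs O(n^2) operations, so after the O(n^3) factorization further right-hand sides are cheap to solve.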

Python

[Diagram: axes Speed and Effort; regions CPU and GPU; groupings "Any algorithm" and "Standard algorithms"]

• Python
• Cython
• NumPy/SciPy
• Numba
• BLAS/LAPACK
• PyCUDA/Scikit-CUDA
• Numba-CUDA

Python Cholesky implementations

[Figure: execution time in ms (y-axis, log scale, ticks from 0.0001 to 10000000) versus matrix size (x-axis, 32 to 4096). Series: Python, NumPy, Numba, np.linalg, sp.linalg, sp.linalg.lapack, skcuda]
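The two ends of this comparison can be sketched as a hand-written loop versus a library call. The pure-Python routine below is a textbook Cholesky-Banachiewicz loop written for illustration; it is not the code that was benchmarked, and `cholesky_pure_python` is a hypothetical name. `np.linalg.cholesky` is the NumPy routine listed as np.linalg.

```python
import math
import numpy as np


def cholesky_pure_python(A):
    """Return the lower-triangular L with A = L L^T, using nested lists only."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)     # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]    # below-diagonal entry
    return L


# Build a small symmetric positive definite test matrix.
rng = np.random.default_rng(0)
M = rng.random((6, 6))
A_demo = M @ M.T + 6 * np.eye(6)

L_loop = np.array(cholesky_pure_python(A_demo.tolist()))
L_numpy = np.linalg.cholesky(A_demo)
print(np.allclose(L_loop, L_numpy))
```

Both versions return the same factor; the difference is that the library call runs in compiled LAPACK code while the loop is interpreted step by step.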

C++

[Diagram: axes Speed and Effort; regions CPU and GPU; groupings "Any algorithm" and "Standard algorithms"]

• C++
• Compiler options
• SIMD
• BLAS/LAPACK
• CUDA
• cuBLAS/cuDNN

C++ Cholesky implementations (M = L L^T)

[Figure: execution time in ms (y-axis, log scale, ticks from 0.01 to 100000) versus matrix size (x-axis, 64 to 4096). Series: CPP-O3, CPP-Fast-Math, BLAS (n=1), AVX, Eigen, LAPACK (n=1), LAPACK, CUDA]

Using CUDA from Python vs C++

[Figure: execution time in ms (y-axis, log scale, ticks from 0.1 to 1000) versus matrix size (x-axis, 4 to 4096). Series: CUDA, CUDA-compute, skcuda, skcuda-compute]

Using domain knowledge

SIMD implementation

• 4x less storage

• 8-12x faster feature computation

• 64x faster feature matching

Image courtesy of scikit-cuda docs
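The slide does not say how the 4x storage saving was obtained. One common way to get exactly that factor, shown here purely as an assumption, is to store features as 8-bit integers instead of 32-bit floats. The array shape and the [0, 1) value range below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature matrix with values in [0, 1), stored as float32.
features = rng.random((1000, 64), dtype=np.float32)

# Quantize to 8 bits: one byte per value instead of four.
quantized = np.round(features * 255).astype(np.uint8)
restored = quantized.astype(np.float32) / 255

storage_ratio = features.nbytes / quantized.nbytes
max_error = float(np.abs(features - restored).max())

print(storage_ratio)   # 4.0
print(max_error)       # bounded by half a quantization step, about 0.002
```

Whether the lost precision is acceptable is a domain decision, which is where the slide's "domain knowledge" comes in.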

C++ optimization cycle

• Find hotspots
• 80/20 rule
• Select candidates for optimization
• Eigen
• BLAS
• LAPACK
• CUDA
• OpenMP
• Loop unrolling
• Code bloat
• Correct instructions
• AVX/SSE/Arm NEON
• Domain knowledge
• Approximations

End of general purpose H/W

Images from company web pages/press releases

Thank you for listening