[Flowchart: optimization workflow. Node labels: Profile code (80/20 rule); I/O or compute bound?; I/O bound?; Get SSD; Low hanging fruit; Process parallel; Loop parallel; pthreads, OpenMP; Write faster code; Data layout; Compute flow; Refactor; Modify algorithm; Approximate calculations; Different quantity; Lower precision]
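The "Profile code (80/20 rule)" step can be sketched with Python's standard-library profiler. This is a minimal illustration, not the speaker's setup: the functions `slow_sum_of_squares`, `fast_constant`, `workload`, and `profile_hotspots` are hypothetical names invented for this example.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    # Deliberately slow pure-Python loop: the expected hotspot.
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_constant(n):
    # Cheap function that should barely register in the profile.
    return n


def workload():
    # Hypothetical program: one expensive call, one cheap call.
    slow_sum_of_squares(200_000)
    fast_constant(10)


def profile_hotspots(func, top=5):
    """Run func under cProfile and return a report of the top entries."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile_hotspots(workload)
```

Printing `report` lists the functions by cumulative time. Whatever dominates that list is the candidate for optimization; the rest can usually be left alone.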

// parallel vectors

float metadata[N]; float metadata[N];

Data not contiguous in memory

Memory jumps when accessing data lead to slow distance calculations

x1 y1 f1 … f1 f1 x2 y2 f2 … f2 f2

x1 y1 x2 y2 … … … … … … … … xn yn

f1 f1 f1 f1 … f2 f2 f2 f2 … f2 f3 f3 f3

Data is contiguous

Data layout matters

[Figure: slowdown factor (1 to 2.8) versus jump size in bytes (0 to 8192)]
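The two layouts can be sketched in Python with the standard `array` module. This is a minimal illustration under assumptions: each point is taken to be stored as x, y, then a fixed number of feature values, and the helper names `split_interleaved` and `jump_size_bytes` are hypothetical. Pure Python will not reproduce the cache effects a compiled language shows; the sketch only shows the reshaping and the size of the jump.

```python
from array import array


def split_interleaved(records, n_features):
    """Split [x, y, f..., x, y, f..., ...] into three contiguous arrays."""
    stride = 2 + n_features  # values per point: x, y, then its features
    if len(records) % stride != 0:
        raise ValueError("record length is not a multiple of the stride")
    xs = records[0::stride]  # every x coordinate, now contiguous
    ys = records[1::stride]  # every y coordinate, now contiguous
    features = array(records.typecode)
    for start in range(0, len(records), stride):
        # Append this point's features directly after the previous point's.
        features.extend(records[start + 2:start + stride])
    return xs, ys, features


def jump_size_bytes(records, n_features):
    """Bytes between the same field of consecutive points when interleaved."""
    return (2 + n_features) * records.itemsize


# Two points with three features each, stored interleaved.
interleaved = array("f", [1, 2, 10, 11, 12, 3, 4, 20, 21, 22])
xs, ys, features = split_interleaved(interleaved, 3)
jump = jump_size_bytes(interleaved, 3)
```

In the interleaved layout the jump grows with the number of features per point. After the split, consecutive values of the same kind are only `itemsize` bytes apart.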

Language choice

prototyping, shipping, readability, time, existing software, memory, power, speed, security, hardware, dependencies

A simple benchmark

An algorithm that is

• well-understood

• not domain-specific

• computationally intensive

Computing the Cholesky decomposition of A,

A = L L^T (L lower triangular),

simplifies the process of solving Ax = b
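As a rough pure-Python sketch of the benchmark algorithm (not necessarily the implementation that was timed), the factorization and the two triangular solves it enables might look like this. It assumes A is symmetric positive definite and stored as a list of lists; the function names `cholesky` and `solve` are chosen for illustration.

```python
import math


def cholesky(a):
    """Return lower-triangular L such that a = L L^T."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of the already computed parts of rows i and j.
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(d)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


def solve(a, b):
    """Solve a x = b using a = L L^T: first L y = b, then L^T x = y."""
    lower = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):  # forward substitution
        y[i] = (b[i] - sum(lower[i][k] * y[k] for k in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution with L^T
        s = sum(lower[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / lower[i][i]
    return x


A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
L = cholesky(A)
x = solve(A, [-20.0, -43.0, 192.0])
```

Once L is known, each solve is two triangular substitutions, so the factorization can be reused for many right-hand sides b.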

[Diagram: Python options. Axes: Speed, Effort. Groupings: CPU, GPU; Any algorithm, Standard algorithms. Labels: Python, Cython, NumPy/SciPy, Numba, BLAS/LAPACK, PyCUDA/Scikit-CUDA, Numba-CUDA]

Python Cholesky implementations

[Figure: execution time (ms, log scale, 0.0001 to 10000000) versus matrix size (32 to 4096). Series: Python, NumPy, Numba, np.linalg, sp.linalg, sp.linalg.lapack, skcuda]

[Diagram: C++ options. Axes: Speed, Effort. Groupings: CPU, GPU; Any algorithm, Standard algorithms. Labels: C++, Compiler options, SIMD, CUDA, BLAS/LAPACK, CUBLAS/CUDNN]

C++ Cholesky implementations (M = L L^T)

[Figure: execution time (ms, log scale, 0.01 to 100000) versus matrix size (64 to 4096). Series: CPP-O3, CPP-Fast-Math, BLAS (n=1), AVX]

C++ Cholesky implementations (M = L L^T)

[Figure: execution time (ms, log scale, 0.01 to 100000) versus matrix size (64 to 4096). Series: CPP-O3, CPP-Fast-Math, BLAS (n=1), AVX, Eigen, LAPACK (n=1), LAPACK, CUDA]

Using CUDA from Python vs C++

[Figure: execution time (ms, log scale, 0.1 to 1000) versus matrix size (4 to 4096). Series: CUDA, CUDA-compute, skcuda, skcuda-compute]

Using domain knowledge

SIMD implementation

• 4x less storage

• 8-12x faster feature computation

• 64x faster feature matching

Image courtesy of scikit-cuda docs

C++ optimization cycle

[Diagram: C++ optimization cycle. Labels: Find hotspots; 80/20 rule; Select candidates for optimization; Eigen; BLAS; LAPACK; CUDA; OpenMP; Loop unrolling; Code bloat; Correct instructions; AVX/SSE/Arm NEON; Domain knowledge; Approximations]

End of general purpose H/W

Images from company web pages/press releases

Thank you for listening
