QUASAR (GPU Programming Language) on GDaaS Accelerates Coding from Months to Days

Source: on-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf


OUTLINE

1. CAUSE
2. THE OFFER
3. HOW DOES IT WORK
4. DEMO
5. RESULTS
6. CONCLUSION

GPUs ARE EVERYWHERE… ALMOST EVERYWHERE

- Each HW platform requires a new implementation
- Long development lead times
- Strong coupling between algorithm development and implementation
- Low-level coding experts are required

OBSERVATION

While breakthrough results are achieved, GPU usage in research is still limited:
- Scientific articles mentioning CUDA: ~90K
- Scientific articles mentioning a specific scripting language: 400K-1900K

A continuous investment in easy access:
- Optimized libraries: cuFFT, cuDNN, cuBLAS, …
- Tools: DIGITS

High-level access increases the potential market by a factor of 7, but is limited to specific applications.

6

On GPU desktop as a service (GDAAS)

HIGH LEVEL

PROGRAMMING

LANGUAGE

IDE

amp RUNTIME

OPTIMIZATION

KNOWLEDGE

BASE amp

LIBRARIES

QUASAR'S VALUE PROPOSITION

Lowering the barrier of entry
- Larger market and faster take-up
- Days instead of months to get started
- Algorithm development decoupled from implementation

Faster development
- Earlier product launch
- Development cycle reduction by a factor of 3 to 10

Efficient code
- Reduced R&D and maintenance costs
- Lines-of-code reduction by a factor of 2 to 3
- Same performance

Future proof
- Early access to the highest performance
- Future-proof code can also target other GPU models

Better products
- Distinctive tools for coding and design, analysis and exploration
- Better algorithms

HOW DOES IT WORK

Y = sum(A + B * C + D)

Code analysis and target-dependent lowering
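For orientation (plain Python, not Quasar), the high-level expression the compiler starts from is just a fused element-wise map followed by a sum:

```python
# Illustrative only: the expression Y = sum(A + B * C + D) that Quasar
# lowers to a GPU kernel, written out naively. C is a scalar; A, B, D
# are equally sized vectors.
def fused_sum(A, B, C, D):
    assert len(A) == len(B) == len(D)
    return sum(a + b * C + d for a, b, d in zip(A, B, D))

print(fused_sum([1.0, 2.0], [3.0, 4.0], 2.0, [5.0, 6.0]))  # 28.0
```

The point of the slides that follow is that this one line, not the kernel, is what the user writes.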

HOW DOES IT WORK

Y = sum(A + B * C + D)

function $out:scalar = __kernel__ kernel$1(A:vec'unchecked, B:vec'unchecked, C:scalar,
        D:vec'unchecked, $datadims:int, blkpos:int, blkdim:int, blkidx:int)
    $bins:vec'unchecked = shared(blkdim)
    $accum0 = 0
    for $m = (blkpos + blkidx*blkdim)..(64*blkdim)..($datadims - 1)
        pos = $m
        $accum0 += A[pos] + B[pos]*C + D[pos]
    end
    $bins[blkpos] = $accum0
    syncthreads
    $bit = 1
    while $bit < blkdim
        if mod(blkpos, 2*$bit) == 0
            $bins[blkpos] = $bins[blkpos] + $bins[blkpos + $bit]
        endif
        syncthreads
        $bit *= 2
        continue
    end
    if blkpos == 0
        $out += $bins[0]
    endif
end
$out = parallel_do([($blksz*[16,4,1]); $blksz], A, B, C, D, numel(A), kernel$1)

Code analysis and target-dependent lowering:
- Automatic generation of a kernel function
- Parallel reduction algorithm using shared memory
- Compile-time handling of boundary checks
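The generated kernel uses the classic shared-memory tree reduction: each thread holds a partial sum in its bin, then partners at doubling strides combine pairs until bin 0 holds the block total. A minimal Python simulation of one block (illustrative, not Quasar or CUDA):

```python
def block_reduce(partials):
    """Simulate the tree reduction one thread block performs in shared memory.

    `partials` plays the role of the shared `bins` array, one partial sum
    per thread; the block size (len(partials)) must be a power of two here.
    """
    bins = list(partials)
    blkdim = len(bins)
    bit = 1
    while bit < blkdim:
        # Threads whose index is a multiple of 2*bit add in their partner's
        # bin, mirroring: if mod(blkpos, 2*bit) == 0 then
        # bins[blkpos] += bins[blkpos + bit]
        for blkpos in range(0, blkdim, 2 * bit):
            bins[blkpos] += bins[blkpos + bit]
        bit *= 2  # double the stride; log2(blkdim) rounds in total
    return bins[0]  # thread 0's bin now holds the block-wide sum

print(block_reduce([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))  # 36.0
```

On the GPU the rounds run concurrently across threads, separated by `syncthreads`; the sequential inner loop here only models who adds what.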

HOW DOES IT WORK

Y = sum(A + B * C + D)

The same Quasar kernel is then lowered to CUDA:

__global__ void kernel(scalar* ret, Vector _PA, Vector _PB, scalar _PC, Vector _PD, int _P_datadims)
{
    shmem shmem;
    shmem_init(&shmem);
    int blkpos = threadIdx.x, blkdim = blockDim.x, blkidx = blockIdx.x;
    Vector bins;
    scalar accum0;
    int m, _Lpos, bit;
    bins = shmem_alloc<scalar>(&shmem, blkdim);
    accum0 = 0.0f;
    for (m = blkpos + blkidx * blkdim; m <= _P_datadims - 1; m += 64 * blkdim)
        accum0 += vector_get_at<scalar>(_PA, m) +
            vector_get_at_checked<scalar>(_PB, _Lpos) * _PC +
            vector_get_at_checked<scalar>(_PD, m);
    vector_set_at<scalar>(bins, blkpos, accum0);
    __syncthreads();
    for (bit = 1; bit < blkdim; bit *= 2)
    {
        if (mod(blkpos, 2 * bit) == 0)
        {
            scalar t05 = vector_get_at_safe<scalar>(bins, blkpos + bit);
            scalar t15 = vector_get_at<scalar>(bins, blkpos);
            vector_set_at<scalar>(bins, blkpos, t15 + t05);
        }
        __syncthreads();
    }
    if (blkpos == 0)
        atomicAdd(ret, vector_get_at<scalar>(bins, 0));
}

- Parallel reduction algorithm using shared memory
- Automatic generation of a kernel function
- Compile-time handling of boundary checks
- Automatic generation of CUDA / OpenCL / C++ code
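Before the tree reduction, each thread first accumulates a grid-stride partial sum (the loop that advances by 64 × blkdim elements in the generated code), so a fixed-size grid covers an input of any length. A small Python rendering of that access pattern (block and grid sizes here are invented for illustration):

```python
def grid_stride_partials(data, blkdim=4, nblocks=2):
    """One partial sum per (block, thread) pair, via a grid-stride loop:
    each thread starts at its global index and then hops a whole grid
    ahead, so together the threads cover every element exactly once."""
    stride = blkdim * nblocks
    partials = []
    for blkidx in range(nblocks):
        for blkpos in range(blkdim):
            m = blkpos + blkidx * blkdim  # this thread's first element
            acc = 0.0
            while m < len(data):
                acc += data[m]
                m += stride               # jump one full grid of threads
            partials.append(acc)
    return partials

data = [float(x) for x in range(20)]
print(sum(grid_stride_partials(data)))  # 190.0
```

The partials then feed the per-block shared-memory reduction shown above, and `atomicAdd` combines the block totals.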

QUASAR'S WORKFLOW

[Diagram: SCRIPTING LANGUAGE -> CODE ANALYSIS & LOWERING -> COMPILATION -> HARDWARE, with INPUT DATA and runtime information feeding back for optimal runtime execution]

Development:
- High-level scripting
- Ideal for rapid prototyping
- Compact, readable code

QUASAR'S WORKFLOW (continued)

Code analysis & lowering:
- Optimization hints
- Automatic detection of parallelism


QUASAR'S WORKFLOW (continued)

Optimal runtime execution: scheduling driven by runtime information — the HW setup, load, and memory state.
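A context-aware runtime of this kind is easy to sketch; the policy below is purely illustrative (the function, thresholds, and device names are invented for this sketch, not Quasar's actual scheduler):

```python
def pick_target(data_bytes, gpu_free_bytes, gpu_busy):
    """Toy context-aware scheduler: run on the GPU only when the task fits
    in free GPU memory and the GPU is not already saturated; otherwise
    fall back to the multi-core CPU. Thresholds are illustrative."""
    if gpu_busy:
        return "cpu"              # GPU already loaded: avoid queueing
    if data_bytes > gpu_free_bytes:
        return "cpu"              # would thrash GPU memory
    if data_bytes < 64 * 1024:
        return "cpu"              # transfer overhead dominates tiny tasks
    return "gpu"

print(pick_target(8 * 1024 * 1024, 2**30, gpu_busy=False))  # gpu
```

The same compiled program can thus land on different devices from run to run, which is what "build once, execute on any system" amounts to.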

QUASAR ON GDaaS: BENEFITS

1. Anyness (any screen, device, GPU power)
2. Hourly model (1-4 GPUs); monthly Quasar licenses
3. Instant app distribution

Today: M60. Coming: multi-GPU.

DEMO

RESULTS

[Chart: Lines of code (0-400) for "Filter, 32 taps, with global memory", "2D spatial filter, 32x32 separable, with global memory", and "Wavelet filter, with global memory" — CUDA-LOC vs QUASAR-LOC]

[Chart: Development time (scale 0-4) for the same three filters — CUDA dev time vs QUASAR dev time]

[Chart: Execution time in ms (0-6000) for the same three filters — CUDA time vs QUASAR time]

Implementation of an MRI reconstruction algorithm in <14 days using QUASAR, versus 3 months using CUDA.

More efficient code and shorter development times, while keeping the same performance.

QUASAR APPLICATIONS

CONCLUSION

High-level scripting language
- Ideal for rapid prototyping
- Fast development
- Maintainable, compact code

Optimal usage of heterogeneous hardware (multi-core, GPUs)
- Context-aware execution
- Build once, execute on any system
- Different hardware, different optimization
- Future-proof code

Better, faster, and smarter development thanks to GDaaS.

www.gdaas.com/quasar
www.gepura.io

Visit us at booth 826. Leave your business card to request your free trial, or go to www.gdaas.com.

Page 2: Quasar on GDaaSon-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf · 250 300 350 400 Filter 32 taps, with global memory 2D spatial filter 32x32 separable, ... using QUASAR

2

OUTLINE

1

2

3

CAUSE

THE

OFFER

HOW DOES

IT WORK

4

5

6

DEMO

RESULTS

CONCLUSION

GPUs are

everywhere

4

Each

HW platform

requires a new

implementation

Long

development

lead times

Strong coupling

between algorithm

development and

implementation

hellip ALMOST EVERYWHERE

Low level

coding experts

are required

5

OBSERVATION

While breakthrough results are achieved still limited

usage in research

mdashScientific articles mentioning CUDA 90K

mdashScientific articles mentioning a specific scripting language

400K-1900K

A continuous investment in easy access

mdashOptimized libraries cuFFTcuDNN cuBLAShellip

mdashTools Digits

High level access increases potential market by a factor 7

Limited to specific applications

6

On GPU desktop as a service (GDAAS)

HIGH LEVEL

PROGRAMMING

LANGUAGE

IDE

amp RUNTIME

OPTIMIZATION

KNOWLEDGE

BASE amp

LIBRARIES

7

rsquoS VALUE PROPOSITION

Larger

market and

faster take

up

Days iso of

months

to get started

Algorithm

development

decoupled from

implementation

Lowering

barrier

of entry

Earlier

product

launch

Development

cycle reduction

by a factor of

3 to 10

Faster development

Reducing

RampD and

maintenance

costs

lines of codes

reduction by a

factor of 2 to 3

Same

performance

Efficient

code

Early access

to highest

performance

Future proof

code can

also target

other GPU

models

Future

proof

Better

products

Distinctive

tools for

coding

and design

analysis and

exploration

Better

algorithms

8

HOW DOES IT WORK

Y = sum(A + B C + D)

Code analysis and target-dependent lowering

9

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

10

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)

shmem shmem

shmem_init(ampshmem)

int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx

Matrix o35 bins

scalar accum0

int m _Lpos bit

bins = shmem_allocltscalargt(ampshmemblkdim)

accum0 = 00f

for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))

accum0 += vector_get_atltscalargt(_PA m) +

vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)

vector_set_atltscalargt(bins blkpos accum0)

__syncthreads()

for (bit = 1 bit lt blkdim bit = 2)

if (mod(blkpos(2 bit)) == 0)

scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)

scalar t15 = vector_get_atltscalargt(bins blkpos)

vector_set_atltscalargt(bins blkpos (t15 + t05))

if (blkpos == 0)

atomicAdd(ret vector_get_atltscalargt(bins 0))

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

Automatic generation of

CUDAOpenCLC++ code

11

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

DEVELOPMENT

Runtime information

High level scripting

Ideal for rapid prototyping

Compact readable code

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

12

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

Optimization

hints

Automatic

detection of

parallelism

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

13

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

14

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime

information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA HARDWARE

COMPILATION

HW-setup load memory state

scheduling

DEVELOPMENT

QUASAR ON

GDaaS

BENEFITS

Anyness (screen

device GPU power)

Hourly model (1-4 GPUs)

Monthly Quasar licenses

Instant app distribution

Today M60

Coming Multi-GPU

1

2

3

DEMO

17

050

100150200250300350400

Filter 32 taps withglobal memory

2D spatial filter32x32 separable

with global memory

Wavelet filter withglobal memory

CUDA-LOC QUASAR-LOC

RESULTS

Lines of code

0

05

1

15

2

25

3

35

4

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-Dev Time QUASAR-Dev time

Development Time

0

1000

2000

3000

4000

5000

6000

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-time (ms) QUASAR-time (ms)

Execution time (ms)

Implementation of MRI

reconstruction algorithm in lt14 days

using QUASAR

versus 3 months using CUDA

More efficient code and shorter

development times while keeping

same performance

18

QUASAR APPLICATIONS

Quasar on GDaaS Accelerates Coding

from Months to Days

19

CONCLUSION

High level scripting language

Ideal for rapid prototyping

Fast development

Maintainable compact code

Optimal usage of heterogeneous hardware (multi-core GPUs)

Context aware execution

Build once execute on any system

Different hardware different optimization

Future proof code

Better faster and smarter development thanks to GDaaS

Quasar on GDaaS Accelerates Coding

from Months to Days

wwwgdaascomquasar

wwwgepuraio

Visit us on booth 826

LEAVE YOUR BUSINESS CARD TO

REQUEST YOUR FREE TRIAL OR

GO TO wwwgdaascom

Page 3: Quasar on GDaaSon-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf · 250 300 350 400 Filter 32 taps, with global memory 2D spatial filter 32x32 separable, ... using QUASAR

GPUs are

everywhere

4

Each

HW platform

requires a new

implementation

Long

development

lead times

Strong coupling

between algorithm

development and

implementation

hellip ALMOST EVERYWHERE

Low level

coding experts

are required

5

OBSERVATION

While breakthrough results are achieved still limited

usage in research

mdashScientific articles mentioning CUDA 90K

mdashScientific articles mentioning a specific scripting language

400K-1900K

A continuous investment in easy access

mdashOptimized libraries cuFFTcuDNN cuBLAShellip

mdashTools Digits

High level access increases potential market by a factor 7

Limited to specific applications

6

On GPU desktop as a service (GDAAS)

HIGH LEVEL

PROGRAMMING

LANGUAGE

IDE

amp RUNTIME

OPTIMIZATION

KNOWLEDGE

BASE amp

LIBRARIES

7

rsquoS VALUE PROPOSITION

Larger

market and

faster take

up

Days iso of

months

to get started

Algorithm

development

decoupled from

implementation

Lowering

barrier

of entry

Earlier

product

launch

Development

cycle reduction

by a factor of

3 to 10

Faster development

Reducing

RampD and

maintenance

costs

lines of codes

reduction by a

factor of 2 to 3

Same

performance

Efficient

code

Early access

to highest

performance

Future proof

code can

also target

other GPU

models

Future

proof

Better

products

Distinctive

tools for

coding

and design

analysis and

exploration

Better

algorithms

8

HOW DOES IT WORK

Y = sum(A + B C + D)

Code analysis and target-dependent lowering

9

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

10

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)

shmem shmem

shmem_init(ampshmem)

int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx

Matrix o35 bins

scalar accum0

int m _Lpos bit

bins = shmem_allocltscalargt(ampshmemblkdim)

accum0 = 00f

for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))

accum0 += vector_get_atltscalargt(_PA m) +

vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)

vector_set_atltscalargt(bins blkpos accum0)

__syncthreads()

for (bit = 1 bit lt blkdim bit = 2)

if (mod(blkpos(2 bit)) == 0)

scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)

scalar t15 = vector_get_atltscalargt(bins blkpos)

vector_set_atltscalargt(bins blkpos (t15 + t05))

if (blkpos == 0)

atomicAdd(ret vector_get_atltscalargt(bins 0))

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

Automatic generation of

CUDAOpenCLC++ code

11

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

DEVELOPMENT

Runtime information

High level scripting

Ideal for rapid prototyping

Compact readable code

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

12

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

Optimization

hints

Automatic

detection of

parallelism

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

13

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

14

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime

information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA HARDWARE

COMPILATION

HW-setup load memory state

scheduling

DEVELOPMENT

QUASAR ON

GDaaS

BENEFITS

Anyness (screen

device GPU power)

Hourly model (1-4 GPUs)

Monthly Quasar licenses

Instant app distribution

Today M60

Coming Multi-GPU

1

2

3

DEMO

17

050

100150200250300350400

Filter 32 taps withglobal memory

2D spatial filter32x32 separable

with global memory

Wavelet filter withglobal memory

CUDA-LOC QUASAR-LOC

RESULTS

Lines of code

0

05

1

15

2

25

3

35

4

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-Dev Time QUASAR-Dev time

Development Time

0

1000

2000

3000

4000

5000

6000

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-time (ms) QUASAR-time (ms)

Execution time (ms)

Implementation of MRI

reconstruction algorithm in lt14 days

using QUASAR

versus 3 months using CUDA

More efficient code and shorter

development times while keeping

same performance

18

QUASAR APPLICATIONS

Quasar on GDaaS Accelerates Coding

from Months to Days

19

CONCLUSION

High level scripting language

Ideal for rapid prototyping

Fast development

Maintainable compact code

Optimal usage of heterogeneous hardware (multi-core GPUs)

Context aware execution

Build once execute on any system

Different hardware different optimization

Future proof code

Better faster and smarter development thanks to GDaaS

Quasar on GDaaS Accelerates Coding

from Months to Days

wwwgdaascomquasar

wwwgepuraio

Visit us on booth 826

LEAVE YOUR BUSINESS CARD TO

REQUEST YOUR FREE TRIAL OR

GO TO wwwgdaascom

Page 4: Quasar on GDaaSon-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf · 250 300 350 400 Filter 32 taps, with global memory 2D spatial filter 32x32 separable, ... using QUASAR

4

Each

HW platform

requires a new

implementation

Long

development

lead times

Strong coupling

between algorithm

development and

implementation

hellip ALMOST EVERYWHERE

Low level

coding experts

are required

5

OBSERVATION

While breakthrough results are achieved still limited

usage in research

mdashScientific articles mentioning CUDA 90K

mdashScientific articles mentioning a specific scripting language

400K-1900K

A continuous investment in easy access

mdashOptimized libraries cuFFTcuDNN cuBLAShellip

mdashTools Digits

High level access increases potential market by a factor 7

Limited to specific applications

6

On GPU desktop as a service (GDAAS)

HIGH LEVEL

PROGRAMMING

LANGUAGE

IDE

amp RUNTIME

OPTIMIZATION

KNOWLEDGE

BASE amp

LIBRARIES

7

rsquoS VALUE PROPOSITION

Larger

market and

faster take

up

Days iso of

months

to get started

Algorithm

development

decoupled from

implementation

Lowering

barrier

of entry

Earlier

product

launch

Development

cycle reduction

by a factor of

3 to 10

Faster development

Reducing

RampD and

maintenance

costs

lines of codes

reduction by a

factor of 2 to 3

Same

performance

Efficient

code

Early access

to highest

performance

Future proof

code can

also target

other GPU

models

Future

proof

Better

products

Distinctive

tools for

coding

and design

analysis and

exploration

Better

algorithms

8

HOW DOES IT WORK

Y = sum(A + B C + D)

Code analysis and target-dependent lowering

9

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

10

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)

shmem shmem

shmem_init(ampshmem)

int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx

Matrix o35 bins

scalar accum0

int m _Lpos bit

bins = shmem_allocltscalargt(ampshmemblkdim)

accum0 = 00f

for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))

accum0 += vector_get_atltscalargt(_PA m) +

vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)

vector_set_atltscalargt(bins blkpos accum0)

__syncthreads()

for (bit = 1 bit lt blkdim bit = 2)

if (mod(blkpos(2 bit)) == 0)

scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)

scalar t15 = vector_get_atltscalargt(bins blkpos)

vector_set_atltscalargt(bins blkpos (t15 + t05))

if (blkpos == 0)

atomicAdd(ret vector_get_atltscalargt(bins 0))

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

Automatic generation of

CUDAOpenCLC++ code

11

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

DEVELOPMENT

Runtime information

High level scripting

Ideal for rapid prototyping

Compact readable code

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

12

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

Optimization

hints

Automatic

detection of

parallelism

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

13

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

14

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime

information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA HARDWARE

COMPILATION

HW-setup load memory state

scheduling

DEVELOPMENT

QUASAR ON

GDaaS

BENEFITS

Anyness (screen

device GPU power)

Hourly model (1-4 GPUs)

Monthly Quasar licenses

Instant app distribution

Today M60

Coming Multi-GPU

1

2

3

DEMO

17

050

100150200250300350400

Filter 32 taps withglobal memory

2D spatial filter32x32 separable

with global memory

Wavelet filter withglobal memory

CUDA-LOC QUASAR-LOC

RESULTS

Lines of code

0

05

1

15

2

25

3

35

4

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-Dev Time QUASAR-Dev time

Development Time

0

1000

2000

3000

4000

5000

6000

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-time (ms) QUASAR-time (ms)

Execution time (ms)

Implementation of MRI

reconstruction algorithm in lt14 days

using QUASAR

versus 3 months using CUDA

More efficient code and shorter

development times while keeping

same performance

18

QUASAR APPLICATIONS

Quasar on GDaaS Accelerates Coding

from Months to Days

19

CONCLUSION

High level scripting language

Ideal for rapid prototyping

Fast development

Maintainable compact code

Optimal usage of heterogeneous hardware (multi-core GPUs)

Context aware execution

Build once execute on any system

Different hardware different optimization

Future proof code

Better faster and smarter development thanks to GDaaS

Quasar on GDaaS Accelerates Coding

from Months to Days

wwwgdaascomquasar

wwwgepuraio

Visit us on booth 826

LEAVE YOUR BUSINESS CARD TO

REQUEST YOUR FREE TRIAL OR

GO TO wwwgdaascom

Page 5: Quasar on GDaaSon-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf · 250 300 350 400 Filter 32 taps, with global memory 2D spatial filter 32x32 separable, ... using QUASAR

5

OBSERVATION

While breakthrough results are achieved still limited

usage in research

mdashScientific articles mentioning CUDA 90K

mdashScientific articles mentioning a specific scripting language

400K-1900K

A continuous investment in easy access

mdashOptimized libraries cuFFTcuDNN cuBLAShellip

mdashTools Digits

High level access increases potential market by a factor 7

Limited to specific applications

6

On GPU desktop as a service (GDAAS)

HIGH LEVEL

PROGRAMMING

LANGUAGE

IDE

amp RUNTIME

OPTIMIZATION

KNOWLEDGE

BASE amp

LIBRARIES

7

rsquoS VALUE PROPOSITION

Larger

market and

faster take

up

Days iso of

months

to get started

Algorithm

development

decoupled from

implementation

Lowering

barrier

of entry

Earlier

product

launch

Development

cycle reduction

by a factor of

3 to 10

Faster development

Reducing

RampD and

maintenance

costs

lines of codes

reduction by a

factor of 2 to 3

Same

performance

Efficient

code

Early access

to highest

performance

Future proof

code can

also target

other GPU

models

Future

proof

Better

products

Distinctive

tools for

coding

and design

analysis and

exploration

Better

algorithms

8

HOW DOES IT WORK

Y = sum(A + B C + D)

Code analysis and target-dependent lowering

9

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

10

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)

shmem shmem

shmem_init(ampshmem)

int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx

Matrix o35 bins

scalar accum0

int m _Lpos bit

bins = shmem_allocltscalargt(ampshmemblkdim)

accum0 = 00f

for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))

accum0 += vector_get_atltscalargt(_PA m) +

vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)

vector_set_atltscalargt(bins blkpos accum0)

__syncthreads()

for (bit = 1 bit lt blkdim bit = 2)

if (mod(blkpos(2 bit)) == 0)

scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)

scalar t15 = vector_get_atltscalargt(bins blkpos)

vector_set_atltscalargt(bins blkpos (t15 + t05))

if (blkpos == 0)

atomicAdd(ret vector_get_atltscalargt(bins 0))

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

Automatic generation of

CUDAOpenCLC++ code

QUASAR'S WORKFLOW

From DEVELOPMENT to OPTIMAL RUNTIME EXECUTION: the SCRIPTING LANGUAGE feeds CODE ANALYSIS & LOWERING, followed by COMPILATION for the target HARDWARE; INPUT DATA and runtime information flow back into the pipeline.

High level scripting: ideal for rapid prototyping, compact readable code.
Compile time: optimization hints and automatic detection of parallelism.
Runtime: HW setup, load and memory state drive the scheduling.
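The runtime scheduling idea (HW setup, load and memory state deciding where code runs) can be illustrated with a toy sketch. Everything here is hypothetical for illustration: the `pick_device` function, the device record fields and the load-based rule are not Quasar's actual scheduler.

```python
def pick_device(devices, job_mem_mb):
    """Toy context-aware scheduler: keep devices with enough free memory
    for the job, then pick the least-loaded one; fall back to the CPU."""
    candidates = [d for d in devices if d["free_mem_mb"] >= job_mem_mb]
    if not candidates:
        return "cpu"  # no GPU fits the job's memory footprint
    return min(candidates, key=lambda d: d["load"])["name"]
```

The same compiled program can thus land on different hardware from run to run, which is the "build once, execute on any system" behaviour described in the conclusion.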

QUASAR ON GDaaS: BENEFITS

Anyness (screen, device, GPU power)
Hourly model (1-4 GPUs)
Monthly Quasar licenses
Instant app distribution
Today: M60. Coming: multi-GPU.

DEMO

RESULTS

Three kernels were implemented in both CUDA and Quasar: a 32-tap filter with global memory, a 32x32 separable 2D spatial filter with global memory, and a wavelet filter with global memory. For each kernel, charts compared lines of code (CUDA-LOC vs QUASAR-LOC), development time (CUDA-Dev time vs QUASAR-Dev time) and execution time in ms (CUDA-time vs QUASAR-time).

Implementation of an MRI reconstruction algorithm took less than 14 days using QUASAR versus 3 months using CUDA: more efficient code and shorter development times while keeping the same performance.

QUASAR APPLICATIONS

Quasar on GDaaS accelerates coding from months to days.

CONCLUSION

High level scripting language: ideal for rapid prototyping, fast development, maintainable compact code.
Optimal usage of heterogeneous hardware (multi-core, GPUs).
Context aware execution: build once, execute on any system; different hardware, different optimization; future proof code.
Better, faster and smarter development thanks to GDaaS.

www.gdaas.com/quasar
www.gepura.io

Visit us at booth 826. Leave your business card to request your free trial, or go to www.gdaas.com.

Page 6: Quasar on GDaaSon-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf · 250 300 350 400 Filter 32 taps, with global memory 2D spatial filter 32x32 separable, ... using QUASAR

6

On GPU desktop as a service (GDAAS)

HIGH LEVEL

PROGRAMMING

LANGUAGE

IDE

amp RUNTIME

OPTIMIZATION

KNOWLEDGE

BASE amp

LIBRARIES

7

rsquoS VALUE PROPOSITION

Larger

market and

faster take

up

Days iso of

months

to get started

Algorithm

development

decoupled from

implementation

Lowering

barrier

of entry

Earlier

product

launch

Development

cycle reduction

by a factor of

3 to 10

Faster development

Reducing

RampD and

maintenance

costs

lines of codes

reduction by a

factor of 2 to 3

Same

performance

Efficient

code

Early access

to highest

performance

Future proof

code can

also target

other GPU

models

Future

proof

Better

products

Distinctive

tools for

coding

and design

analysis and

exploration

Better

algorithms

8

HOW DOES IT WORK

Y = sum(A + B C + D)

Code analysis and target-dependent lowering

9

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

10

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)

shmem shmem

shmem_init(ampshmem)

int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx

Matrix o35 bins

scalar accum0

int m _Lpos bit

bins = shmem_allocltscalargt(ampshmemblkdim)

accum0 = 00f

for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))

accum0 += vector_get_atltscalargt(_PA m) +

vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)

vector_set_atltscalargt(bins blkpos accum0)

__syncthreads()

for (bit = 1 bit lt blkdim bit = 2)

if (mod(blkpos(2 bit)) == 0)

scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)

scalar t15 = vector_get_atltscalargt(bins blkpos)

vector_set_atltscalargt(bins blkpos (t15 + t05))

if (blkpos == 0)

atomicAdd(ret vector_get_atltscalargt(bins 0))

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

Automatic generation of

CUDAOpenCLC++ code

11

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

DEVELOPMENT

Runtime information

High level scripting

Ideal for rapid prototyping

Compact readable code

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

12

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

Optimization

hints

Automatic

detection of

parallelism

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

13

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

14

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime

information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA HARDWARE

COMPILATION

HW-setup load memory state

scheduling

DEVELOPMENT

QUASAR ON

GDaaS

BENEFITS

Anyness (screen

device GPU power)

Hourly model (1-4 GPUs)

Monthly Quasar licenses

Instant app distribution

Today M60

Coming Multi-GPU

1

2

3

DEMO

17

050

100150200250300350400

Filter 32 taps withglobal memory

2D spatial filter32x32 separable

with global memory

Wavelet filter withglobal memory

CUDA-LOC QUASAR-LOC

RESULTS

Lines of code

0

05

1

15

2

25

3

35

4

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-Dev Time QUASAR-Dev time

Development Time

0

1000

2000

3000

4000

5000

6000

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-time (ms) QUASAR-time (ms)

Execution time (ms)

Implementation of MRI

reconstruction algorithm in lt14 days

using QUASAR

versus 3 months using CUDA

More efficient code and shorter

development times while keeping

same performance

18

QUASAR APPLICATIONS

Quasar on GDaaS Accelerates Coding

from Months to Days

19

CONCLUSION

High level scripting language

Ideal for rapid prototyping

Fast development

Maintainable compact code

Optimal usage of heterogeneous hardware (multi-core GPUs)

Context aware execution

Build once execute on any system

Different hardware different optimization

Future proof code

Better faster and smarter development thanks to GDaaS

Quasar on GDaaS Accelerates Coding

from Months to Days

wwwgdaascomquasar

wwwgepuraio

Visit us on booth 826

LEAVE YOUR BUSINESS CARD TO

REQUEST YOUR FREE TRIAL OR

GO TO wwwgdaascom

Page 7: Quasar on GDaaSon-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf · 250 300 350 400 Filter 32 taps, with global memory 2D spatial filter 32x32 separable, ... using QUASAR

7

rsquoS VALUE PROPOSITION

Larger

market and

faster take

up

Days iso of

months

to get started

Algorithm

development

decoupled from

implementation

Lowering

barrier

of entry

Earlier

product

launch

Development

cycle reduction

by a factor of

3 to 10

Faster development

Reducing

RampD and

maintenance

costs

lines of codes

reduction by a

factor of 2 to 3

Same

performance

Efficient

code

Early access

to highest

performance

Future proof

code can

also target

other GPU

models

Future

proof

Better

products

Distinctive

tools for

coding

and design

analysis and

exploration

Better

algorithms

8

HOW DOES IT WORK

Y = sum(A + B C + D)

Code analysis and target-dependent lowering

9

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

10

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)

shmem shmem

shmem_init(ampshmem)

int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx

Matrix o35 bins

scalar accum0

int m _Lpos bit

bins = shmem_allocltscalargt(ampshmemblkdim)

accum0 = 00f

for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))

accum0 += vector_get_atltscalargt(_PA m) +

vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)

vector_set_atltscalargt(bins blkpos accum0)

__syncthreads()

for (bit = 1 bit lt blkdim bit = 2)

if (mod(blkpos(2 bit)) == 0)

scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)

scalar t15 = vector_get_atltscalargt(bins blkpos)

vector_set_atltscalargt(bins blkpos (t15 + t05))

if (blkpos == 0)

atomicAdd(ret vector_get_atltscalargt(bins 0))

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

Automatic generation of

CUDAOpenCLC++ code

11

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

DEVELOPMENT

Runtime information

High level scripting

Ideal for rapid prototyping

Compact readable code

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

12

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

Optimization

hints

Automatic

detection of

parallelism

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

13

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

14

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime

information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA HARDWARE

COMPILATION

HW-setup load memory state

scheduling

DEVELOPMENT

QUASAR ON

GDaaS

BENEFITS

Anyness (screen

device GPU power)

Hourly model (1-4 GPUs)

Monthly Quasar licenses

Instant app distribution

Today M60

Coming Multi-GPU

1

2

3

DEMO

17

050

100150200250300350400

Filter 32 taps withglobal memory

2D spatial filter32x32 separable

with global memory

Wavelet filter withglobal memory

CUDA-LOC QUASAR-LOC

RESULTS

Lines of code

0

05

1

15

2

25

3

35

4

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-Dev Time QUASAR-Dev time

Development Time

0

1000

2000

3000

4000

5000

6000

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-time (ms) QUASAR-time (ms)

Execution time (ms)

Implementation of MRI

reconstruction algorithm in lt14 days

using QUASAR

versus 3 months using CUDA

More efficient code and shorter

development times while keeping

same performance

18

QUASAR APPLICATIONS

Quasar on GDaaS Accelerates Coding

from Months to Days

19

CONCLUSION

High level scripting language

Ideal for rapid prototyping

Fast development

Maintainable compact code

Optimal usage of heterogeneous hardware (multi-core GPUs)

Context aware execution

Build once execute on any system

Different hardware different optimization

Future proof code

Better faster and smarter development thanks to GDaaS

Quasar on GDaaS Accelerates Coding

from Months to Days

wwwgdaascomquasar

wwwgepuraio

Visit us on booth 826

LEAVE YOUR BUSINESS CARD TO

REQUEST YOUR FREE TRIAL OR

GO TO wwwgdaascom

Page 8: Quasar on GDaaSon-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf · 250 300 350 400 Filter 32 taps, with global memory 2D spatial filter 32x32 separable, ... using QUASAR

8

HOW DOES IT WORK

Y = sum(A + B C + D)

Code analysis and target-dependent lowering

9

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

10

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)

shmem shmem

shmem_init(ampshmem)

int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx

Matrix o35 bins

scalar accum0

int m _Lpos bit

bins = shmem_allocltscalargt(ampshmemblkdim)

accum0 = 00f

for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))

accum0 += vector_get_atltscalargt(_PA m) +

vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)

vector_set_atltscalargt(bins blkpos accum0)

__syncthreads()

for (bit = 1 bit lt blkdim bit = 2)

if (mod(blkpos(2 bit)) == 0)

scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)

scalar t15 = vector_get_atltscalargt(bins blkpos)

vector_set_atltscalargt(bins blkpos (t15 + t05))

if (blkpos == 0)

atomicAdd(ret vector_get_atltscalargt(bins 0))

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

Automatic generation of

CUDAOpenCLC++ code

11

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

DEVELOPMENT

Runtime information

High level scripting

Ideal for rapid prototyping

Compact readable code

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

12

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

Optimization

hints

Automatic

detection of

parallelism

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

13

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

14

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime

information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA HARDWARE

COMPILATION

HW-setup load memory state

scheduling

DEVELOPMENT

QUASAR ON

GDaaS

BENEFITS

Anyness (screen

device GPU power)

Hourly model (1-4 GPUs)

Monthly Quasar licenses

Instant app distribution

Today M60

Coming Multi-GPU

1

2

3

DEMO

17

050

100150200250300350400

Filter 32 taps withglobal memory

2D spatial filter32x32 separable

with global memory

Wavelet filter withglobal memory

CUDA-LOC QUASAR-LOC

RESULTS

Lines of code

0

05

1

15

2

25

3

35

4

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-Dev Time QUASAR-Dev time

Development Time

0

1000

2000

3000

4000

5000

6000

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-time (ms) QUASAR-time (ms)

Execution time (ms)

Implementation of MRI

reconstruction algorithm in lt14 days

using QUASAR

versus 3 months using CUDA

More efficient code and shorter

development times while keeping

same performance

18

QUASAR APPLICATIONS

Quasar on GDaaS Accelerates Coding

from Months to Days

19

CONCLUSION

High level scripting language

Ideal for rapid prototyping

Fast development

Maintainable compact code

Optimal usage of heterogeneous hardware (multi-core GPUs)

Context aware execution

Build once execute on any system

Different hardware different optimization

Future proof code

Better faster and smarter development thanks to GDaaS

Quasar on GDaaS Accelerates Coding

from Months to Days

wwwgdaascomquasar

wwwgepuraio

Visit us on booth 826

LEAVE YOUR BUSINESS CARD TO

REQUEST YOUR FREE TRIAL OR

GO TO wwwgdaascom

Page 9: Quasar on GDaaSon-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf · 250 300 350 400 Filter 32 taps, with global memory 2D spatial filter 32x32 separable, ... using QUASAR

9

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

10

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)

shmem shmem

shmem_init(ampshmem)

int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx

Matrix o35 bins

scalar accum0

int m _Lpos bit

bins = shmem_allocltscalargt(ampshmemblkdim)

accum0 = 00f

for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))

accum0 += vector_get_atltscalargt(_PA m) +

vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)

vector_set_atltscalargt(bins blkpos accum0)

__syncthreads()

for (bit = 1 bit lt blkdim bit = 2)

if (mod(blkpos(2 bit)) == 0)

scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)

scalar t15 = vector_get_atltscalargt(bins blkpos)

vector_set_atltscalargt(bins blkpos (t15 + t05))

if (blkpos == 0)

atomicAdd(ret vector_get_atltscalargt(bins 0))

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

Automatic generation of

CUDAOpenCLC++ code

11

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

DEVELOPMENT

Runtime information

High level scripting

Ideal for rapid prototyping

Compact readable code

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

12

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

Optimization

hints

Automatic

detection of

parallelism

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

13

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

14

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime

information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA HARDWARE

COMPILATION

HW-setup load memory state

scheduling

DEVELOPMENT

QUASAR ON

GDaaS

BENEFITS

Anyness (screen

device GPU power)

Hourly model (1-4 GPUs)

Monthly Quasar licenses

Instant app distribution

Today M60

Coming Multi-GPU

1

2

3

DEMO

17

050

100150200250300350400

Filter 32 taps withglobal memory

2D spatial filter32x32 separable

with global memory

Wavelet filter withglobal memory

CUDA-LOC QUASAR-LOC

RESULTS

Lines of code

0

05

1

15

2

25

3

35

4

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-Dev Time QUASAR-Dev time

Development Time

0

1000

2000

3000

4000

5000

6000

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-time (ms) QUASAR-time (ms)

Execution time (ms)

Implementation of MRI

reconstruction algorithm in lt14 days

using QUASAR

versus 3 months using CUDA

More efficient code and shorter

development times while keeping

same performance

18

QUASAR APPLICATIONS

Quasar on GDaaS Accelerates Coding

from Months to Days

19

CONCLUSION

High level scripting language

Ideal for rapid prototyping

Fast development

Maintainable compact code

Optimal usage of heterogeneous hardware (multi-core GPUs)

Context aware execution

Build once execute on any system

Different hardware different optimization

Future proof code

Better faster and smarter development thanks to GDaaS

Quasar on GDaaS Accelerates Coding

from Months to Days

wwwgdaascomquasar

wwwgepuraio

Visit us on booth 826

LEAVE YOUR BUSINESS CARD TO

REQUEST YOUR FREE TRIAL OR

GO TO wwwgdaascom

Page 10: Quasar on GDaaSon-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf · 250 300 350 400 Filter 32 taps, with global memory 2D spatial filter 32x32 separable, ... using QUASAR

10

HOW DOES IT WORK

Y = sum(A + B C + D)

function $outscalar = __kernel__

kernel$1(AveccoluncheckedBvecuncheckedCscalar

Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)

$binsvecunchecked=shared(blkdim)

$accum0=0

for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)

pos=$m

$accum0+=(A[pos]+(B[pos]C)+D[pos])

end

$bins[blkpos]=$accum0

syncthreads

$bit=1

while ($bitltblkdim)

if (mod(blkpos(2$bit))==0)

$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])

endif

syncthreads

$bit=2

continue

end

if (blkpos==0)

$out+=$bins[0]

endif

end

$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)

Code analysis and target-dependent lowering

__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)

shmem shmem

shmem_init(ampshmem)

int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx

Matrix o35 bins

scalar accum0

int m _Lpos bit

bins = shmem_allocltscalargt(ampshmemblkdim)

accum0 = 00f

for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))

accum0 += vector_get_atltscalargt(_PA m) +

vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)

vector_set_atltscalargt(bins blkpos accum0)

__syncthreads()

for (bit = 1 bit lt blkdim bit = 2)

if (mod(blkpos(2 bit)) == 0)

scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)

scalar t15 = vector_get_atltscalargt(bins blkpos)

vector_set_atltscalargt(bins blkpos (t15 + t05))

if (blkpos == 0)

atomicAdd(ret vector_get_atltscalargt(bins 0))

Parallel reduction algorithm using

shared memory

Automatic generation of a kernel

function

Compile-time handling of

boundary checks

Automatic generation of

CUDAOpenCLC++ code

11

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

DEVELOPMENT

Runtime information

High level scripting

Ideal for rapid prototyping

Compact readable code

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

12

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

Optimization

hints

Automatic

detection of

parallelism

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

13

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

14

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime

information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA HARDWARE

COMPILATION

HW-setup load memory state

scheduling

DEVELOPMENT

QUASAR ON

GDaaS

BENEFITS

Anyness (screen

device GPU power)

Hourly model (1-4 GPUs)

Monthly Quasar licenses

Instant app distribution

Today M60

Coming Multi-GPU

1

2

3

DEMO

17

050

100150200250300350400

Filter 32 taps withglobal memory

2D spatial filter32x32 separable

with global memory

Wavelet filter withglobal memory

CUDA-LOC QUASAR-LOC

RESULTS

Lines of code

0

05

1

15

2

25

3

35

4

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-Dev Time QUASAR-Dev time

Development Time

0

1000

2000

3000

4000

5000

6000

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-time (ms) QUASAR-time (ms)

Execution time (ms)

Implementation of MRI

reconstruction algorithm in lt14 days

using QUASAR

versus 3 months using CUDA

More efficient code and shorter

development times while keeping

same performance

18

QUASAR APPLICATIONS

Quasar on GDaaS Accelerates Coding

from Months to Days

19

CONCLUSION

High level scripting language

Ideal for rapid prototyping

Fast development

Maintainable compact code

Optimal usage of heterogeneous hardware (multi-core GPUs)

Context aware execution

Build once execute on any system

Different hardware different optimization

Future proof code

Better faster and smarter development thanks to GDaaS

Quasar on GDaaS Accelerates Coding

from Months to Days

wwwgdaascomquasar

wwwgepuraio

Visit us on booth 826

LEAVE YOUR BUSINESS CARD TO

REQUEST YOUR FREE TRIAL OR

GO TO wwwgdaascom

Page 11: Quasar on GDaaSon-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf · 250 300 350 400 Filter 32 taps, with global memory 2D spatial filter 32x32 separable, ... using QUASAR

11

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

DEVELOPMENT

Runtime information

High level scripting

Ideal for rapid prototyping

Compact readable code

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

12

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

Optimization

hints

Automatic

detection of

parallelism

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

13

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA

COMPILATION

HARDWARE

DEVELOPMENT

14

rsquoS WORKFLOW

OPTIMAL RUNTIME

EXECUTION

Runtime

information

SCRIPTING

LANGUAGE

CODE

ANALYSIS amp

LOWERING

INPUT

DATA HARDWARE

COMPILATION

HW-setup load memory state

scheduling

DEVELOPMENT

QUASAR ON

GDaaS

BENEFITS

Anyness (screen

device GPU power)

Hourly model (1-4 GPUs)

Monthly Quasar licenses

Instant app distribution

Today M60

Coming Multi-GPU

1

2

3

DEMO

17

050

100150200250300350400

Filter 32 taps withglobal memory

2D spatial filter32x32 separable

with global memory

Wavelet filter withglobal memory

CUDA-LOC QUASAR-LOC

RESULTS

Lines of code

0

05

1

15

2

25

3

35

4

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-Dev Time QUASAR-Dev time

Development Time

0

1000

2000

3000

4000

5000

6000

Filter 32 taps withglobal memory

2D spatial filter 32x32separable with global

memory

Wavelet filter withglobal memory

CUDA-time (ms) QUASAR-time (ms)

Execution time (ms)

Implementation of MRI

reconstruction algorithm in lt14 days

using QUASAR

versus 3 months using CUDA

More efficient code and shorter

development times while keeping

same performance

18

QUASAR APPLICATIONS

Quasar on GDaaS Accelerates Coding

from Months to Days

19

CONCLUSION

High level scripting language

Ideal for rapid prototyping

Fast development

Maintainable compact code

Optimal usage of heterogeneous hardware (multi-core GPUs)

Context aware execution

Build once execute on any system

Different hardware different optimization

Future proof code

Better faster and smarter development thanks to GDaaS

Quasar on GDaaS Accelerates Coding

from Months to Days

wwwgdaascomquasar

wwwgepuraio

Visit us on booth 826

LEAVE YOUR BUSINESS CARD TO

REQUEST YOUR FREE TRIAL OR

GO TO wwwgdaascom

Page 12: Quasar on GDaaSon-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf · 250 300 350 400 Filter 32 taps, with global memory 2D spatial filter 32x32 separable, ... using QUASAR

12-14

QUASAR'S WORKFLOW

[Diagram, built up over three slides: DEVELOPMENT — scripting language → code analysis & lowering → compilation → OPTIMAL RUNTIME EXECUTION on the target hardware. The compiler takes optimization hints and performs automatic detection of parallelism; at runtime, execution is steered by the input data and by runtime information (HW setup, load, memory state, scheduling).]
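The "automatic detection of parallelism" step in the workflow above rests on a simple property: when no loop iteration reads what another iteration writes, all iterations can run at once. A minimal sketch of such a dependence-free loop, in plain Python rather than Quasar syntax (the function name is illustrative):

```python
# Each output element depends only on its own index: there are no
# loop-carried dependences, so a compiler like the one described in
# the workflow can map every iteration to its own GPU thread.
# Plain Python sketch, NOT Quasar code.
def saxpy(alpha, x, y):
    out = [0.0] * len(x)
    for i in range(len(x)):        # iterations are independent
        out[i] = alpha * x[i] + y[i]
    return out
```

A loop where iteration i read out[i - 1], by contrast, would carry a dependence and could not be parallelized this way.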

QUASAR ON GDaaS: BENEFITS

Anyness (screen, device, GPU power)

Hourly model (1-4 GPUs)

Monthly Quasar licenses

Instant app distribution

Today: M60; coming: multi-GPU


DEMO
