Parallel Programming - cuvix.co.kr

Parallel Programming
김명신, Technical Evangelist, Microsoft

First

Phase/Trend | Major Constraints | 2x Efficient App Runs…
(1950-90s) Compute-constrained | Processor | 2x compute speed, 2x users
(1995ish-2007ish) Surplus local compute + low UI innovation (e.g., 2nd party LOB client WIMP apps) | Programmer time | n/a
(200x-) Mobile + bigger experiences (e.g., tablet, ‘smartphone’) | Power (battery life), Processor | 2x battery life, 2x compute speed
(2009-) Cloud / datacenter (e.g., Office 365, Shazam, Siri) | Server HW (57%), Power (31%) * | 0.56x nodes, 0.56x power
(2009-) Heterogeneous cores (e.g., Cell, GPGPU) | Power (dark silicon), Processor | 0.5x power envelope, 2x compute speed
(2020ish-) Moore’s End | Processor | 2x compute speed forever

WIMP: Windows, Icon, Menu, Pointing Device

* http://perspectives.mvdirona.com/2010/09/18/OverallDataCenterCosts.aspx

Note: The final four are going to dominate for the rest of our careers.


Distributed Parallel Telco network, Internet, DFS,

Cluster computing, Grid computing

Multi Processor, Multi Core, NUMA

[Figure: five-stage instruction pipeline, //upload.wikimedia.org/wikipedia/commons/2/21/Fivestagespipeline.png]

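Amdahl's law:

$$\text{Speedup} = \frac{1}{(1 - P) + \frac{P}{S}}$$

P: parallel portion

S: speedup of that portion

Example: speeding up 50% of the program by 2x

$$\frac{1}{(1 - 0.5) + \frac{0.5}{2}} = 1.333\ldots$$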

[Figure: Amdahl's law speedup curves, http://upload.wikimedia.org/wikipedia/commons/e/ea/AmdahlsLaw.svg]

Performance Wizard Concurrency Visualizer

How (Old Features)

Multithread Programming

OpenMP

PPL / TPL

How (VS2012 New Features)

Auto-Vectorization

Auto-Parallelization

C++ AMP

const int N = 1000;
float a[N], b[N];
// initialize a[i] = i, b[i] = 100 + i;
for (int i = 0; i < N; ++i)
    a[i] += b[i];

On by default

Uses SSE instructions on Intel / NEON instructions on ARM

Vector registers are named XMM0 to XMM15

Uses the SSE 4.2 instruction set if available

To disable vectorization

#pragma loop(no_vector)

The compiler evaluates the code to find loops that might benefit from parallelization

Use /Qpar

To request parallelization of a specific loop manually

#pragma loop(hint_parallel(n))
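A minimal sketch of how the hint might be applied, assuming a build with `/Qpar` on Visual C++. The function name `scale` and the thread count of 4 are arbitrary choices for illustration. The pragma is guarded by `_MSC_VER` so other compilers simply build the plain serial loop.

```cpp
#include <cstddef>
#include <vector>

// Multiply every element by k. Iterations do not depend on each other,
// so the loop is a reasonable candidate for auto-parallelization.
void scale(std::vector<double>& v, double k) {
    double* p = v.data();
    const int n = static_cast<int>(v.size());
#ifdef _MSC_VER
    // Hint: spread this loop across 4 threads (only honored with /Qpar).
#pragma loop(hint_parallel(4))
#endif
    for (int i = 0; i < n; ++i)
        p[i] *= k;
}
```

The hint is only a request. The compiler still has to prove the loop is safe to split, so a loop with dependencies between iterations would stay serial.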

Accelerated Massive Parallelism

C++, not C

Just one general language extension

Portable, mix & match hardware from any vendor, one exe

General and future-proof

Open specification
