Parallel Programming
김명신, Technical Evangelist, Microsoft
First
Phase/Trend | Major Constraints | 2x Efficient App Runs…
(1950-90s) Compute-constrained | Processor | 2x compute speed, 2x users
(1995ish-2007ish) Surplus local compute + low UI innovation (e.g., 2nd party LOB client WIMP apps) | Programmer time | n/a
(200x-) Mobile + bigger experiences (e.g., tablet, ‘smartphone’) | Power (battery life), Processor | 2x battery life, 2x compute speed
(2009-) Cloud / datacenter (e.g., Office 365, Shazam, Siri) | Server HW (57%), Power (31%) * | 0.56x nodes, 0.56x power
(2009-) Heterogeneous cores (e.g., Cell, GPGPU) | Power (dark silicon), Processor | 0.5x power envelope, 2x compute speed
(2020ish-) Moore’s End | Processor | 2x compute speed forever
* http://perspectives.mvdirona.com/2010/09/18/OverallDataCenterCosts.aspx
WIMP: Windows, Icons, Menus, Pointing device
Note: The final four are going to dominate for the rest of our careers.
Distributed: Telco network, Internet, DFS, cluster computing, grid computing
Parallel: multiprocessor, multicore, NUMA
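Amdahl's law:
\[ \text{Speedup} = \frac{1}{(1 - P) + \frac{P}{S}} \]
P: parallel portion (the fraction of run time that is sped up)
S: speedup of that portion
Example: speeding up 50% of the program by 2x
\[ \frac{1}{(1 - 0.5) + \frac{0.5}{2}} = 1.333\ldots \]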
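The formula above can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name amdahl_speedup is a hypothetical helper chosen for illustration.

```cpp
#include <cassert>

// Hypothetical helper (not from the slides): overall speedup when a
// fraction p of the run time is made s times faster.
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);  // basic precondition check
    return 1.0 / ((1.0 - p) + p / s);
}
```

Calling amdahl_speedup(0.5, 2.0) reproduces the 1.333… figure from the slide. With P = 0.5 the result stays below 1 / (1 - P) = 2 however large S becomes, which is why the serial portion dominates once the parallel portion is fast.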
Performance Wizard, Concurrency Visualizer
How (Old Features)
Multithreaded programming
OpenMP
PPL / TPL
How (VS2012 New Features)
Auto-Vectorization
Auto-Parallelization
C++ AMP
const int N = 1000;
float a[N], b[N];
// initialize a[i] = i, b[i] = 100 + i
for (int i = 0; i < N; ++i)
    a[i] += b[i];
Auto-vectorizer: on by default
Uses SSE instructions on Intel processors and NEON instructions on ARM
SSE vector registers are named XMM0 to XMM15 (XMM0 to XMM7 in 32-bit code)
Uses the SSE4.2 instruction set if available
To disable vectorization for a loop:
#pragma loop(no_vector)
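As an illustration of where the pragma goes, here is a minimal sketch, assuming the Visual C++ compiler. The function name add_without_vectorization is hypothetical, and the _MSC_VER guard is an addition so that other compilers simply skip the pragma.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the slides): adds b[i] into a[i] while
// asking the Visual C++ auto-vectorizer to leave this one loop scalar.
void add_without_vectorization(float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
#ifdef _MSC_VER
    // The pragma applies to the loop that immediately follows it.
#pragma loop(no_vector)
#endif
    for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i];
}
```

The pragma affects only the loop directly after it; other loops in the same file remain candidates for vectorization. Building with /Qvec-report:2 should list which loops were vectorized and why others were not.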
Auto-parallelizer: the compiler evaluates the code to find loops that might benefit from parallelization
Enable with the /Qpar compiler option
To request parallelization of a specific loop manually:
#pragma loop(hint_parallel(n))
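A minimal sketch of the hint in use, assuming Visual C++ with /Qpar. The function name scale_all and the thread count of 4 are arbitrary choices for illustration, and the _MSC_VER guard is an addition so that other compilers skip the pragma.

```cpp
#include <cassert>

// Hypothetical helper (not from the slides): multiplies every element by
// factor. Iterations are independent, so the loop is a candidate for
// auto-parallelization.
void scale_all(float* data, int n, float factor) {
    assert(data != nullptr && n >= 0);
#ifdef _MSC_VER
    // Hint only: suggests splitting the next loop across 4 threads.
    // It has an effect only when the file is compiled with /Qpar.
#pragma loop(hint_parallel(4))
#endif
    for (int i = 0; i < n; ++i)
        data[i] *= factor;
}
```

The pragma is a hint, not a command: the compiler may still decline to parallelize the loop, and the result is the same either way. Building with /Qpar-report:2 should show which loops were parallelized.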
C++ AMP (Accelerated Massive Parallelism)
C++, not C
Just one general language extension
Portable: mix and match hardware from any vendor, one executable
General and future-proof
Open specification