Blazing Fast Windows 8 Apps using Visual C++ Tarek Madkour Group Program Manager – Visual C++, Microsoft Corp.

More info on http://www.techdays.be

Page 1: Blazing Fast Windows 8 Apps using Visual C++

Blazing Fast Windows 8 Apps using Visual C++

Tarek Madkour, Group Program Manager – Visual C++, Microsoft Corp.

Page 2: Blazing Fast Windows 8 Apps using Visual C++

Agenda

Windows 8 Apps

Free performance boost

Squeeze the CPU (PPL)

Smoke the GPU (C++ AMP)

Page 4: Blazing Fast Windows 8 Apps using Visual C++

Windows 8 Apps

New user experience

Touch-friendly

Trust

Battery-power

Fast and fluid

Page 5: Blazing Fast Windows 8 Apps using Visual C++

Windows 8 C++ App Options

XAML-based applications XAML user interface C++ code

DirectX-based applications and games DirectX user interface (D2D or D3D) C++ code

Hybrid XAML and DirectX applications XAML controls mixed with DirectX surfaces C++ code

HTML5 + JavaScript applications HTML5 user interface JS code calling into C++ code

Page 6: Blazing Fast Windows 8 Apps using Visual C++

Demo: Fresh Paint

Page 7: Blazing Fast Windows 8 Apps using Visual C++

Agenda Checkpoint

Windows 8 apps

Free performance boost

Squeeze the CPU (PPL)

Smoke the GPU (C++ AMP)

Page 8: Blazing Fast Windows 8 Apps using Visual C++

Recap of “free” performance

Compilation Unit Optimizations

• /O2 and friends

Whole Program Optimizations

• /GL and /LTCG

Profile Guided Optimization

• /LTCG:PGI and /LTCG:PGO

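[Diagrams: one .cpp → .obj → .exe build pipeline per option above; the Profile Guided Optimization pipeline adds a "Run Training Scenarios" step on the instrumented .exe before the final .exe is produced]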

Page 9: Blazing Fast Windows 8 Apps using Visual C++

More “free” boosts

Automatic vectorization

• Always on in VS2012

• Uses “vector” instructions where possible in loops

• Can run this loop in only 250 iterations, down from 1,000!

[Diagram: SCALAR (1 operation): add r3, r1, r2 computes r3 = r1 + r2. VECTOR (N operations): vadd v3, v1, v2 computes v3 = v1 + v2 across the whole vector length]

for (i = 0; i < 1000; i++) { A[i] = B[i] + C[i]; }
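The 250-iteration figure follows if each vector instruction handles four elements at once, for example four 32-bit floats in a 128-bit register. The lane count of four is an assumption that matches 1,000 / 250; the slide does not state it. A minimal sketch of what the vectorized loop amounts to, written out by hand in scalar C++ (the helper name is hypothetical and the compiler does this for you):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the slides): adds b and c into a, handling
// four elements per loop iteration to mimic one 4-lane vector instruction.
// Returns the number of loop iterations executed.
std::size_t add_in_chunks_of_four(std::vector<float>& a,
                                  const std::vector<float>& b,
                                  const std::vector<float>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    const std::size_t n = a.size();
    std::size_t iterations = 0;
    std::size_t i = 0;
    // Main loop: one "vector add" covers elements i..i+3.
    for (; i + 4 <= n; i += 4) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
        ++iterations;
    }
    // Remainder loop for sizes that are not a multiple of four.
    for (; i < n; ++i) {
        a[i] = b[i] + c[i];
        ++iterations;
    }
    return iterations;
}
```

With three 1,000-element vectors the main loop runs 250 times and the remainder loop not at all.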

Page 10: Blazing Fast Windows 8 Apps using Visual C++

More “free” boosts

Automatic parallelization

• Uses multiple CPU cores

• /Qpar compiler switch

• Can run this loop “vectorized” and on 4 CPU cores in parallel

#pragma loop(hint_parallel(4))
for (i = 0; i < 1000; i++) { A[i] = B[i] + C[i]; }

Page 11: Blazing Fast Windows 8 Apps using Visual C++

Agenda Checkpoint

Windows 8 apps

Free performance boost

Squeeze the CPU (PPL)

Smoke the GPU (C++ AMP)

Page 12: Blazing Fast Windows 8 Apps using Visual C++

Parallel Patterns Library (PPL)

Part of the C++ Runtime
• No new libraries to link in
• Task parallelism
• Parallel algorithms
• Concurrency-safe containers
• Asynchronous agents

Abstracts away the notion of threads
• Tasks are computations that may be run in parallel

Used to express your potential concurrency
• Let the runtime map it to the available concurrency
• Scale from 1 to 256 cores

Page 13: Blazing Fast Windows 8 Apps using Visual C++

parallel_for

parallel_for iterates over a range in parallel

#include <ppl.h>

using namespace concurrency;

parallel_for( 0, 1000, [] (int i) { work(i); });

Page 14: Blazing Fast Windows 8 Apps using Visual C++

parallel_for

• Order of iteration is indeterminate.

• Cores may come and go.

• Ranges may be stolen by newly idle cores.

parallel_for(0, 1000, [] (int i) { work(i);});

[Diagram: the range split across four cores. Core 1: work(0…249), Core 2: work(250…499), Core 3: work(500…749), Core 4: work(750…999)]
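The four-way split in the diagram can be imitated with standard C++ threads. This is a portable stand-in rather than PPL: `chunked_parallel_for` is a hypothetical name, it creates one thread per chunk, and it fixes the split up front, so it shows only the initial division of the range and not the range stealing that lets an idle core take over part of another core's work.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for parallel_for (not PPL): splits [first, last)
// into `workers` contiguous chunks and runs each chunk on its own thread.
void chunked_parallel_for(int first, int last, int workers,
                          const std::function<void(int)>& body) {
    assert(first <= last && workers > 0);
    const long long total = static_cast<long long>(last) - first;
    std::vector<std::thread> threads;
    for (int w = 0; w < workers; ++w) {
        // Chunk w covers [begin, end); the integer math spreads any remainder.
        const int begin = first + static_cast<int>(total * w / workers);
        const int end   = first + static_cast<int>(total * (w + 1) / workers);
        threads.emplace_back([begin, end, &body] {
            for (int i = begin; i < end; ++i) body(i);
        });
    }
    // Block until every chunk is done before returning to the caller.
    for (auto& t : threads) t.join();
}
```

Calling `chunked_parallel_for(0, 1000, 4, work)` hands out the ranges 0…249, 250…499, 500…749 and 750…999. The order in which individual iterations run is still indeterminate, so the body must not depend on it.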

Page 15: Blazing Fast Windows 8 Apps using Visual C++

parallel_for

parallel_for considerations:

• Designed for unbalanced loop bodies

• An idle core can steal a portion of another core’s range of work

• Supports cancellation

• Early exit in search scenarios

For fixed-size loop bodies that don’t need cancellation, use parallel_for_fixed (in the ppl.h that ships with VS2012, the equivalent is passing a static_partitioner to parallel_for).

Page 16: Blazing Fast Windows 8 Apps using Visual C++

parallel_for_each

parallel_for_each iterates over an STL container in parallel

#include <ppl.h>

using namespace concurrency;

vector<int> v = …;

parallel_for_each(v.begin(), v.end(), [] (int i) { work(i); });

Page 17: Blazing Fast Windows 8 Apps using Visual C++

parallel_for_each

Works best with containers that support random-access iterators: std::vector, std::array, std::deque, concurrency::concurrent_vector, …

Works okay, but with higher overhead on containers that support forward (or bi-di) iterators: std::list, std::map, …
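One plausible source of that overhead is the cost of splitting the range. A parallel loop has to cut the container into pieces, and finding a cut point is constant-time with random-access iterators but a linear walk with forward or bidirectional ones. The sketch below shows that difference with standard library calls only; `split_point` is a hypothetical helper and not how PPL is actually implemented.

```cpp
#include <cassert>
#include <iterator>
#include <list>    // only needed for the usage described below
#include <vector>  // only needed for the usage described below

// Hypothetical helper: returns the iterator halfway through [first, last),
// the kind of cut point a divide-and-conquer parallel loop needs.
template <typename Iterator>
Iterator split_point(Iterator first, Iterator last) {
    // std::distance and std::next are O(1) for random-access iterators
    // and walk the range element by element for forward/bidirectional ones.
    const auto length = std::distance(first, last);
    assert(length >= 0);
    return std::next(first, length / 2);
}
```

For a `std::vector<int>` and a `std::list<int>` holding the same eight values, `split_point(c.begin(), c.end())` returns an iterator to the fifth element in both cases; only the cost of getting there differs.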

Page 18: Blazing Fast Windows 8 Apps using Visual C++

parallel_invoke

• Executes function objects in parallel and waits for them to finish

#include <ppl.h>
#include <string>
#include <iostream>
using namespace concurrency;
using namespace std;

template <typename T>
T twice(const T& t) { return t + t; }

int main()
{
    int n = 54;
    double d = 5.6;
    string s = "Hello";
    parallel_invoke(
        [&n] { n = twice(n); },
        [&d] { d = twice(d); },
        [&s] { s = twice(s); }
    );
    cout << n << ' ' << d << ' ' << s << endl;
    return 0;
}

Page 19: Blazing Fast Windows 8 Apps using Visual C++

task<>

• Used to write asynchronous code

• task::then lets you create continuations that get executed when the task finishes

• You need to manage the lifetime of the variables going into a task

#include <ppltasks.h>
#include <iostream>
using namespace concurrency;
using namespace std;

int main()
{
    auto t = create_task([]() -> int { return 42; });

    t.then([](int result) { cout << result << endl; }).wait();
}
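The lifetime bullet deserves a concrete case: if a task captures a local variable by reference and the enclosing scope exits before the task runs, the task reads a dead object. A common remedy is to capture a `std::shared_ptr` by value so the data lives as long as the task does. The sketch below uses `std::async` and `std::future` from the standard library as a portable analogue, not PPL's `create_task`; `sum_async` is a hypothetical name. The same capture-by-value pattern should carry over to a lambda passed to `create_task`.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <numeric>
#include <vector>

// Hypothetical example: starts an asynchronous sum over data that must stay
// alive until the work finishes. The lambda captures the shared_ptr by value,
// so the vector outlives the caller's own copy if needed.
std::future<int> sum_async(std::shared_ptr<const std::vector<int>> data) {
    assert(data != nullptr);
    return std::async(std::launch::async, [data] {
        return std::accumulate(data->begin(), data->end(), 0);
    });
}
```

A caller can create the vector in an inner scope, call `sum_async`, let its own pointer go out of scope, and still read the result from the returned future safely.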

Page 20: Blazing Fast Windows 8 Apps using Visual C++

Concurrent Containers

• Thread-safe, lock-free containers provided:
  concurrent_vector<>
  concurrent_queue<>
  concurrent_unordered_map<>
  concurrent_unordered_multimap<>
  concurrent_unordered_set<>
  concurrent_unordered_multiset<>

• Functionality resembles equivalent containers provided by the STL

• Behavior is more limited to allow concurrency. For example:
  • concurrent_vector can push_back but not insert
  • concurrent_vector can clear but not pop_back or erase

Page 21: Blazing Fast Windows 8 Apps using Visual C++

concurrent_vector<T>

#include <ppl.h>
#include <concurrent_vector.h>

using namespace concurrency;

concurrent_vector<int> carmVec;

parallel_for(2, 5000000, [&carmVec](int i) {
    if (is_carmichael(i))
        carmVec.push_back(i);
});
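The slide calls `is_carmichael` without defining it. One possible implementation, offered as an assumption rather than the speaker's code, uses Korselt's criterion: n is a Carmichael number exactly when it is composite, square-free, and p - 1 divides n - 1 for every prime factor p of n. The function is pure and keeps no shared state, so it is safe to call from many parallel_for iterations at once.

```cpp
#include <cassert>

// Hypothetical implementation of the predicate the slide calls.
// Korselt's criterion: composite, square-free, and (p - 1) divides (n - 1)
// for every prime factor p of n.
bool is_carmichael(int n) {
    if (n < 3 || n % 2 == 0) return false;   // Carmichael numbers are odd
    int remaining = n;
    int prime_factors = 0;
    // Trial division by odd p; smaller factors are already divided out,
    // so any p that divides `remaining` here is prime.
    for (int p = 3; p <= remaining / p; p += 2) {
        if (remaining % p != 0) continue;
        remaining /= p;
        if (remaining % p == 0) return false;      // p*p divides n: not square-free
        if ((n - 1) % (p - 1) != 0) return false;  // Korselt's divisibility test
        ++prime_factors;
    }
    if (remaining > 1) {                           // one prime factor is left over
        if ((n - 1) % (remaining - 1) != 0) return false;
        ++prime_factors;
    }
    return prime_factors >= 2;                     // rules out primes
}
```

The three smallest Carmichael numbers, 561, 1105 and 1729, all pass this test. Because push_back calls arrive from different cores in no fixed order, carmVec ends up unsorted; sort it afterwards if order matters.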

Page 22: Blazing Fast Windows 8 Apps using Visual C++

Agenda Checkpoint

Windows 8 apps

Free performance boost

Squeeze the CPU (PPL)

Smoke the GPU (C++ AMP)

Page 23: Blazing Fast Windows 8 Apps using Visual C++

CPU / GPU Comparison

Page 24: Blazing Fast Windows 8 Apps using Visual C++

What is C++ AMP?

Performance & Productivity

C++ AMP -> C++ Accelerated Massive Parallelism

C++ AMP is
• A programming model for expressing data parallel algorithms
• A way to exploit heterogeneous systems using mainstream tools
• C++ language extensions and a library

C++ AMP delivers performance without compromising productivity

Page 25: Blazing Fast Windows 8 Apps using Visual C++

What is C++ AMP?

C++ AMP gives you…

Productivity
• Simple programming model

Portability
• Run on hardware from NVIDIA, AMD, Intel and ARM*
• Open Specification

Performance
• Power of heterogeneous computing at your hands

Use it to speed up data parallel algorithms

Page 26: Blazing Fast Windows 8 Apps using Visual C++

#include <iostream>

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

Page 27: Blazing Fast Windows 8 Apps using Visual C++

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

amp.h: header for C++ AMP library

concurrency: namespace for library

Page 28: Blazing Fast Windows 8 Apps using Visual C++

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

array_view: wraps the data to operate on the accelerator. array_view variables are captured and the associated data is copied to the accelerator (on demand)

Page 29: Blazing Fast Windows 8 Apps using Visual C++

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    for (int idx = 0; idx < 11; idx++)
    {
        av[idx] += 1;
    }
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

array_view: wraps the data to operate on the accelerator. array_view variables are captured and the associated data is copied to the accelerator (on demand)

Page 30: Blazing Fast Windows 8 Apps using Visual C++

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

parallel_for_each: execute the lambda on the accelerator once per thread

extent: the parallel loop bounds or computation “shape”

index: the thread ID that is running the lambda, used to index into data
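To make the extent/index relationship concrete without an accelerator, the sketch below emulates it on the CPU. It is not C++ AMP: `emulate_parallel_for_each` and `decode_greeting` are hypothetical names, the extent is a plain `int` rather than an `extent<1>`, and the kernel calls run one after another instead of concurrently. What it preserves is the contract that the lambda is invoked exactly once per index in the extent.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical CPU-only emulation (not C++ AMP): calls `kernel` once for each
// index in [0, extent), the way parallel_for_each invokes the lambda once per
// accelerator thread. Here the calls simply run in sequence on the host.
template <typename Kernel>
void emulate_parallel_for_each(int extent, Kernel kernel) {
    assert(extent >= 0);
    for (int idx = 0; idx < extent; ++idx) {
        kernel(idx);  // idx plays the role of index<1>
    }
}

// Applies the slide's "add one to every element" kernel and returns the text.
std::string decode_greeting() {
    std::vector<int> v = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    emulate_parallel_for_each(static_cast<int>(v.size()),
                              [&v](int idx) { v[idx] += 1; });
    return std::string(v.begin(), v.end());
}
```

Each element is shifted up by one character code, so `decode_greeting()` returns "Hello world", the same text the slide's program prints. Because every kernel call touches only its own element, the calls could run in any order or all at once, which is what makes the loop a fit for the accelerator.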

Page 31: Blazing Fast Windows 8 Apps using Visual C++

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

restrict(amp): tells the compiler to check that the code conforms to the C++ subset, and tells the compiler to target the GPU

Page 32: Blazing Fast Windows 8 Apps using Visual C++

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    array_view<int> av(11, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

array_view: automatically copied to accelerator if required

array_view: automatically copied back to host when and if required

Page 33: Blazing Fast Windows 8 Apps using Visual C++

C++ AMP Parallel Debugger

Well known Visual Studio debugging features
• Launch (incl. remote), Attach, Break, Stepping, Breakpoints, DataTips
• Tool windows: Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch

New features (for both CPU and GPU)
• Parallel Stacks window, Parallel Watch window

New GPU-specific
• Emulator, GPU Threads window, race detection
• concurrency::direct3d_printf, _errorf, _abort

Page 34: Blazing Fast Windows 8 Apps using Visual C++
Page 35: Blazing Fast Windows 8 Apps using Visual C++

Demo: Cartoonizer

Linear vs. Parallel vs. AMP

Page 36: Blazing Fast Windows 8 Apps using Visual C++

Summary

• C++ is a great way to create fast and fluid apps for Windows 8

• Get the most out of the compiler’s free optimizations

• Use PPL for concurrent programming

• Use C++ AMP for data parallel algorithms