Intel® Parallel Building Blocks: Quickly Write Parallel Tasks Using Intel® Cilk™ Plus Keywords and Reducers. Brandon Hewitt, Jan. 18, 2011.


Software and Services Group Optimization Notice

Intel® Parallel Building Blocks: Quickly Write Parallel Tasks Using

Intel® Cilk™ Plus Keywords and Reducers

Brandon Hewitt

Jan. 18, 2011


Intel® Cilk™ Plus: One of the Intel® Parallel Building Blocks

• One of the three Intel® Parallel Building Blocks


What is Intel® Cilk™ Plus?

Key Benefits

• Simple syntax which is very easy to learn and use

• Array notation guarantees fast vector code

• Fork/join tasking system is simple to understand and mimics serial behavior

• Low overhead tasks offer scalability to high core counts

• Reducers give better performance than mutex locks and maintain serial semantics

• Mixes with Intel® TBB and Intel® ArBB for a complete task and vector parallel solution

What is it?

• Compiler supported solution offering a tasking system via 3 simple keywords

• Includes array notation to specify vector code

• Reducers - powerful parallel data structures to efficiently prevent races

• Based on 15 years of research at MIT

• Pragmas to force vectorization of loops and attributes to specify functions that can be applied to all elements of arrays


Focusing on Task Parallelism

• The three keywords to implement parallelism

– cilk_spawn, cilk_for, cilk_sync

• Reducers to handle shared data safely

– e.g. reducer_opadd, reducer_ostream

• Composable

– Nest parallel regions, mix with other Intel® PBB

• Serial Semantics

– Parallel code can be understood and executed as an equivalent serial code

– Benefits are improved applicability of industry standard debugging and analysis tools, support for deterministic behavior, and tools that are able to provide strong guarantees of correctness and/or performance
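The serial-semantics point can be made concrete with a small sketch. Assuming the three keywords are stubbed out with the macros below (written for illustration here, not copied from any Intel header), code that uses them compiles with any C compiler and computes the same results as the parallel version. The function names fib and triangular are only examples.

```c
/* Hypothetical stand-ins for the Cilk Plus keywords (an assumption for
   illustration): spawn and sync vanish, cilk_for becomes a plain for. */
#define cilk_spawn
#define cilk_sync
#define cilk_for for

/* Fork/join recursion; with the macros above it is ordinary serial code. */
int fib(int x) {
    int tmp1, tmp2;
    if (x <= 1)
        return x;
    tmp1 = cilk_spawn fib(x - 1);  /* expands to: tmp1 = fib(x - 1); */
    tmp2 = fib(x - 2);
    cilk_sync;                     /* expands to an empty statement */
    return tmp1 + tmp2;
}

/* Loop form; the macro turns cilk_for into a serial for loop. */
long long triangular(int n) {
    long long sum = 0;
    int i;
    cilk_for (i = 1; i <= n; i++)
        sum += i;
    return sum;
}
```

Because the stubbed program is plain C, ordinary debuggers and analysis tools apply to it directly.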


Agenda

• Other technical presentations review Intel® Cilk™ Plus and Intel® PBB at a feature level.

– http://software.intel.com/en-us/articles/intel-software-development-products-technical-presentations/

• In this technical presentation, we’ll take a more in-depth look at the task parallel parts of Cilk Plus using specific example codes

– Syntax

– Debugging

– Using analysis tools

– Fun with Reducers

– And more…


Systems Used for Examples

• In all cases I used Intel® C++ Composer XE for Windows* and Linux* update 1.

• The Inspector example uses Intel® Inspector XE for Windows* initial release.

• The Linux* system is a 4-core Intel® Xeon® X5560 CPU running 64-bit Fedora Core* 9.0.

• The Windows* system is a dual-core Intel® Core™ i5 660 CPU running 64-bit Windows Server 2008* R2 Enterprise.

• All examples shown are provided as-is, and you are encouraged to validate any conclusions yourself.


Using cilk_spawn and cilk_sync

$ cat cilk_spawn_sample.c

#include <stdio.h> // for printf
#include <stdlib.h> // for strtol
#include <cilk/cilk.h> // for cilk keywords
#include <cilk/cilk_api.h> // for cilk functions

int fib(int x) {
    int tmp1, tmp2;
    printf("fib() run by Cilk Plus worker %d\n", __cilkrts_get_worker_number());
    if (x <= 1)
        return x;
    else {
        tmp1 = cilk_spawn fib(x-1);
        tmp2 = fib(x-2);
        cilk_sync;
        return tmp1+tmp2;
    }
}

int main(int argc, char** argv) {
    int input = strtol(argv[1], NULL, 0);
    printf("fib of %d is %d\n", input, fib(input));
    return(0);
}


Using cilk_spawn and cilk_sync

$ ./a.out 1

fib() run by Cilk Plus worker 8

fib of 1 is 1

$ ./a.out 2

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 1

fib of 2 is 1

$ ./a.out 3

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 1

fib of 3 is 2


Using cilk_spawn and cilk_sync

$ ./a.out 6

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 8

fib() run by Cilk Plus worker 2

fib() run by Cilk Plus worker 1

fib() run by Cilk Plus worker 1

fib() run by Cilk Plus worker 1

fib() run by Cilk Plus worker 1

fib() run by Cilk Plus worker 1

fib() run by Cilk Plus worker 0

fib() run by Cilk Plus worker 1

fib() run by Cilk Plus worker 4

fib() run by Cilk Plus worker 4

fib() run by Cilk Plus worker 4

fib() run by Cilk Plus worker 0

fib() run by Cilk Plus worker 0

fib() run by Cilk Plus worker 0

fib() run by Cilk Plus worker 0

fib of 6 is 8


Using cilk_for

$ cat cilk_for_sample.c

#include <stdio.h> // for printf
#include <stdlib.h> // for strtol
#include <cilk/cilk.h> // for cilk keywords
#include <cilk/cilk_api.h> // for cilk functions

int main(int argc, char** argv) {
    int input = strtol(argv[1], NULL, 0);
    int i, tmp = 0;
    cilk_for(i = 1; i <= input; i++) {
        printf("for loop run by Cilk Plus worker %d\n", __cilkrts_get_worker_number());
        tmp += i;
    }
    printf("triangular of %d is %d\n", input, tmp);
    return(0);
}


Using cilk_for

$ ./a.out 1

for loop run by Cilk Plus worker 8

triangular of 1 is 1

$ ./a.out 2

for loop run by Cilk Plus worker 8

for loop run by Cilk Plus worker 8

triangular of 2 is 3

$ ./a.out 3

for loop run by Cilk Plus worker 8

for loop run by Cilk Plus worker 1

for loop run by Cilk Plus worker 1

triangular of 3 is 6

$ ./a.out 4

for loop run by Cilk Plus worker 8

for loop run by Cilk Plus worker 8

for loop run by Cilk Plus worker 8

for loop run by Cilk Plus worker 8

triangular of 4 is 10

$ ./a.out 4

for loop run by Cilk Plus worker 8

for loop run by Cilk Plus worker 8

for loop run by Cilk Plus worker 3

for loop run by Cilk Plus worker 3

triangular of 4 is 10


Uh-oh

• The cilk_for code seems to work reliably for lower numbers.

• But if tmp is made a long int, and we increase the input significantly, we start seeing non-deterministic output:

– Change line “int i, tmp = 0;” to:

int i; //, tmp = 0;

long tmp = 0;

– Change %d in printf to %ld for tmp

– Also remove printf of Cilk Plus worker id to preserve sanity

$ ./a.out 1055555

triangular of 1055555 is 218727597107

$ ./a.out 1055555

triangular of 1055555 is 306057677066

$ ./a.out 1055555

triangular of 1055555 is 257123500732
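One way to tell which of these answers, if any, is right: the sum 1 + 2 + ... + n has the closed form n(n+1)/2. A minimal serial sketch gives a reference value to compare against. The function names are made up for this example.

```c
/* Closed form for 1 + 2 + ... + n. long long keeps 64 bits even on
   platforms where long is only 32 bits. */
long long triangular_closed_form(long long n) {
    return n * (n + 1) / 2;
}

/* Plain serial loop: slower, but free of races, so it agrees with the
   closed form. */
long long triangular_serial(long long n) {
    long long sum = 0;
    long long i;
    for (i = 1; i <= n; i++)
        sum += i;
    return sum;
}
```

For an input of 1055555 both functions return 557098706790, so none of the three runs above produced the correct sum.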


What do we do now?

• Try to debug using the Intel® Parallel Debug Extensions (Windows*) or Intel® Debugger (Linux*)

• Try Intel® Parallel Inspector 2011 or Intel® Inspector XE

• Try Cilkscreen utility


Debug it!

• Need to compile with options:

– /Zi /debug:parallel on Windows*

– -g -debug parallel on Linux*

• Use Intel® Parallel Debug Extensions in the Microsoft Visual Studio* Debugger on Windows*, or

• Use the Intel® Debugger (IDB) on Linux*

• The next example uses IDB. The flow is the same on Windows*, but it uses the GUI instead of text commands.


Using IDB’s Thread Data Sharing Detection

$ icc -O2 -g -debug parallel cilk_for2.c

$ idbc ./a.out

Intel(R) Debugger for applications running on Intel(R) 64, Version 12.0, Build [1.3842.2.154]

------------------

object file name: ./a.out

Reading symbols from a.out...done.

(idb) set args 1055555

(idb) idb sharing on

(idb) run

Starting program: /var/quad/blhewitt/cilk/samples-for-webinar/cilk_for/a.out

[New Thread 140059247015680 (LWP 10832)]

[New Thread 140059247015680 (LWP 10832)]

[New Thread 1090242896 (LWP 10833)]

[New Thread 1111206224 (LWP 10834)]

[New Thread 1121696080 (LWP 10835)]

[New Thread 1132185936 (LWP 10836)]

[New Thread 1142675792 (LWP 10837)]

[New Thread 1153165648 (LWP 10838)]

[New Thread 1163655504 (LWP 10839)]

[New Thread 1174145360 (LWP 10840)]

Data sharing event 1: 0x601320 8 bytes, 4 accesses from 3 threads.

__$U0 (this=0x7f6215f48f80, =1, =362484044) at /var/quad/blhewitt/cilk/samples-for-webinar/cilk_for/cilk_for2.c:16

16 tmp += i;

(idb) idb sharing event expand

Data sharing event 1: 0x601320 8 bytes, 4 accesses from 3 threads.

/var/quad/blhewitt/cilk/samples-for-webinar/cilk_for/cilk_for2.c:16 = 0x400c5b write, Thread 3

/var/quad/blhewitt/cilk/samples-for-webinar/cilk_for/cilk_for2.c:16 = 0x400c5b write, Thread 7

/var/quad/blhewitt/cilk/samples-for-webinar/cilk_for/cilk_for2.c:16 = 0x400c72 read, Thread 7

/var/quad/blhewitt/cilk/samples-for-webinar/cilk_for/cilk_for2.c:16 = 0x400c72 read, Thread 9


What does this mean?

• IDB is telling us that multiple threads are reading and writing to the same memory location (tmp) at the same time.

• We have a data race, which can cause non-deterministic behavior (i.e. behavior that can change depending on the order different threads execute).
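To see how such a race loses updates, note that tmp += i is really a read, an add, and a write. The sketch below replays one unlucky interleaving of two workers by hand, in ordinary serial C. The worker labels and the values are invented for illustration.

```c
/* Replays, step by step, what can happen when two workers execute
   "tmp += i" at the same time without synchronization. */
long lost_update_demo(void) {
    long tmp = 0;

    long a_read = tmp;     /* worker A reads tmp (0) */
    long b_read = tmp;     /* worker B reads tmp (0) before A writes */
    tmp = a_read + 1;      /* worker A writes 0 + 1 */
    tmp = b_read + 2;      /* worker B writes 0 + 2, overwriting A's update */

    return tmp;            /* 2, although 1 + 2 = 3 was intended */
}
```

Which interleaving actually occurs depends on thread timing, which is why the output changes from run to run.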


So how do we fix data races?

• The traditional solution is to use a lock around the accesses to the shared data.

• A lock allows only one thread to access the protected code at a time.

• Intel® TBB provides a mutex locking construct, but since the sample is in C and not C++, we’ll use a POSIX thread mutex lock.


Locking Solution

$ cat cilk_for_with_lock.c

#include <stdio.h> // for printf
#include <stdlib.h> // for strtol
#include <pthread.h> // for lock
#include <cilk/cilk.h> // for cilk keywords
#include <cilk/cilk_api.h> // for cilk functions

pthread_mutex_t lock_sum;

int main(int argc, char** argv) {
    int input = strtol(argv[1], NULL, 0);
    int i;
    long tmp = 0;
    pthread_mutex_init(&lock_sum, NULL);
    cilk_for(i = 1; i <= input; i++) {
        pthread_mutex_lock(&lock_sum);
        tmp += i;
        pthread_mutex_unlock(&lock_sum);
    }
    pthread_mutex_destroy(&lock_sum);
    printf("triangular of %d is %ld\n", input, tmp);
    return(0);
}


Results of Locking

• We now get good, consistent answers

$ ./a.out 1055555

triangular of 1055555 is 557098706790

• But performance suffers as iterations increase

$ time ./a.out 105555555

triangular of 105555555 is 5570987648456790

real 0m9.183s

user 0m7.133s

sys 0m59.067s


Intel® Cilk™ Plus Reducers

• We need a way to protect accesses to shared data that doesn’t suffer from contention and bottlenecks.

• Cilk Plus provides reducers: constructs that give each worker its own view of the shared data; the views are then merged at a cilk_sync.

• The reducer design eliminates lock contention and has other benefits as well.
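The idea behind a reducer can be sketched without the Cilk Plus runtime. In the sketch below each chunk of the iteration range accumulates into its own private view, and the views are combined once all chunks are done. This is a hand-written serial analogue with made-up names (sum_with_views, MAX_VIEWS), not a description of how the runtime implements reducers.

```c
#define MAX_VIEWS 64

/* Sums 1..n using nviews private accumulators and a final merge.
   Each chunk could run on a different worker without locking, because
   no two chunks ever touch the same accumulator. */
long long sum_with_views(long long n, int nviews) {
    long long views[MAX_VIEWS];
    long long chunk, total = 0;
    int v;

    if (nviews < 1) nviews = 1;
    if (nviews > MAX_VIEWS) nviews = MAX_VIEWS;
    chunk = (n + nviews - 1) / nviews;      /* ceiling division */

    for (v = 0; v < nviews; v++) {
        long long lo = (long long)v * chunk + 1;
        long long hi = lo + chunk - 1;
        long long i;
        if (hi > n) hi = n;
        views[v] = 0;                       /* identity value for + */
        for (i = lo; i <= hi; i++)
            views[v] += i;                  /* private update, no contention */
    }

    for (v = 0; v < nviews; v++)            /* merge step */
        total += views[v];
    return total;
}
```

The merge touches each view once, so its cost depends on the number of views rather than on the number of iterations.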


Solution with reducer_opadd in C

$ cat cilk_for_with_reducer.c

#include <stdio.h> // for printf
#include <stdlib.h> // for strtol
#include <cilk/cilk.h> // for cilk keywords
#include <cilk/cilk_api.h> // for cilk functions
#include <cilk/reducer_opadd.h> // for Reducer

int main(int argc, char** argv) {
    int input = strtol(argv[1], NULL, 0);
    int i;
    CILK_C_REDUCER_OPADD(tmp, long, 0);
    CILK_C_REGISTER_REDUCER(tmp);
    cilk_for(i = 1; i <= input; i++) {
        REDUCER_VIEW(tmp) += i;
    }
    printf("triangular of %d is %ld\n", input, tmp.value);
    CILK_C_UNREGISTER_REDUCER(tmp);
    return(0);
}


Results with Reducer

$ icc cilk_for_with_reducer.c

$ time ./a.out 105555555

triangular of 105555555 is 5570987648456790

real 0m0.094s

user 0m0.062s

sys 0m0.103s


Catching Data Races before they Manifest

• Intel provides a couple of tools for detecting data races in Cilk Plus code

• Intel® Parallel Inspector 2011 / Intel® Inspector XE

– Some limitations, including false positives, and potential misses if no steals occur

• Cilkscreen

– Only detects data races (Inspector also detects memory errors)


Using Intel® Inspector XE for Windows*


Start Data Race Detection


Get Results


View Details


Results After Adding Reducer


Another Benefit of Reducers: Serial Semantics

$ cat test_with_lock.cpp

#include <iostream>
#include <cstring>
#include <pthread.h>
#include <cilk/cilk.h>

pthread_mutex_t cout_lock;

int main(int argc, char* argv[]) {
    const int length = std::strlen(argv[1]);
    pthread_mutex_init(&cout_lock, NULL);
    cilk_for(int i = 0; i < length; i++) {
        pthread_mutex_lock(&cout_lock);
        std::cout << argv[1][i];
        pthread_mutex_unlock(&cout_lock);
    }
    std::cout << std::endl;
    pthread_mutex_destroy(&cout_lock);
    return(0);
}


Answer With Pthread Mutexes

$ icc test_with_lock.cpp -lpthread

$ ./a.out "hello world,hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world"

hello world,hello world, hello world, hello wor hello world, hello world, hello wolrldd,hello world, hello world, h,e lhleol lwo helhlo ewlolrol rwdoo,r he llwlolordld,d ,hweo rlllho,e whelllloorlo d ,w ohrellldo, whodrh,l hedl,l oe lwlohrelldol,o h ewl lo wwoororlrdl, ldd,, hhelleol lwoor lwdo,r lhde,l lhoe llwoo rwlodr,l dh,e lhleol lwoo rwlodr,l dh,ello world, hello ello world, hello world, hello world, hello world, hello world, hello worldworld, hello world, hello world, hello world,, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello worldworld, hello world, hello world, hello world,


Now Use an ostream Reducer

$ cat test_with_ostream_reducer.cpp

#include <iostream>
#include <cstring>
#include <cilk/cilk.h>
#include <cilk/reducer_ostream.h>

int main(int argc, char* argv[]) {
    const int length = std::strlen(argv[1]);
    cilk::reducer_ostream cout_reducer(std::cout);
    cilk_for(int i = 0; i < length; i++)
        cout_reducer << argv[1][i];
    std::cout << std::endl;
    return(0);
}


Results Are Always In Order

$ icc test_with_ostream_reducer.cpp

$ ./a.out "hello world,hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world"

hello world,hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world
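Why the output comes out in serial order can be sketched in plain C. Each chunk of the input is written into its own buffer, the chunks are deliberately processed out of order, and the buffers are then concatenated in index order. The function name ordered_copy and the fixed chunk count are assumptions made for this example. It is an analogue of the idea, not the implementation of cilk::reducer_ostream.

```c
#include <stdlib.h>
#include <string.h>

#define NCHUNKS 4

/* Copies src to dst (dst must hold strlen(src) + 1 bytes) by way of
   per-chunk buffers. Returns 0 on success, -1 if allocation fails. */
int ordered_copy(const char *src, char *dst) {
    size_t len = strlen(src);
    size_t chunk = (len + NCHUNKS - 1) / NCHUNKS;   /* ceiling division */
    char *view[NCHUNKS];
    size_t used[NCHUNKS];
    size_t out = 0;
    int c, k;

    /* Fill the views last-to-first to mimic workers finishing out of order. */
    for (c = NCHUNKS - 1; c >= 0; c--) {
        size_t lo = (size_t)c * chunk;
        size_t hi = lo + chunk;
        if (lo > len) lo = len;
        if (hi > len) hi = len;
        used[c] = hi - lo;
        view[c] = (char *)malloc(used[c] + 1);
        if (view[c] == NULL) {
            for (k = c + 1; k < NCHUNKS; k++)
                free(view[k]);
            return -1;
        }
        memcpy(view[c], src + lo, used[c]);
    }

    /* Merge: append the views in index order, which restores serial order. */
    for (c = 0; c < NCHUNKS; c++) {
        memcpy(dst + out, view[c], used[c]);
        out += used[c];
        free(view[c]);
    }
    dst[out] = '\0';
    return 0;
}
```

A lock, by contrast, only guarantees that the writes do not overlap; it says nothing about the order in which they happen.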


Customer Case

[Figure: assembly tree. Root: PickupTruck. Second level: Body, Chassis, Engine, DriveTrain. Third level: Cab, Doors, Flatbed.]

Goal: Find all “collisions” between an assembly and a target object.


First Attempt to Use cilk_for

std::list<Node *> output_list;

void walk(Node &x, Node &target) {
    if (x.is_internal()) {
        // In parallel, traverse tree
        cilk_for(Node::iterator child = x.begin(); child != x.end(); ++child) {
            walk(*child, target);
        }
    } else {
        // At leaf, collect collisions
        if (target.collides_with(x))
            output_list.push_back(&x);
    }
}

Parallel update of list is a Data Race!


Using Locks

std::list<Node *> output_list;

void walk(Node &x, Node &target) {
    if (x.is_internal()) {
        // In parallel, traverse tree
        cilk_for(Node::iterator child = x.begin(); child != x.end(); ++child) {
            walk(*child, target);
        }
    } else {
        // At leaf, collect collisions
        if (target.collides_with(x)) {
            m.lock(); // m is a mutex shared by all workers (declaration not shown)
            output_list.push_back(&x);
            m.unlock();
        }
    }
}

Add lock

• Poor performance

• Order not deterministic


Using STL List Reducer

cilk::reducer_list_append<Node *> output_list;

void walk(Node &x, Node &target) {
    if (x.is_internal()) {
        // In parallel, traverse tree
        cilk_for(Node::iterator child = x.begin(); child != x.end(); ++child) {
            walk(*child, target);
        }
    } else {
        // At leaf, collect collisions
        if (target.collides_with(x))
            output_list.push_back(&x);
    }
}

Change list to hyper-object

• Good performance

• Serial order!


A Look at cilk_for vs. cilk_spawn

• Take the cilk_for example from slide 21

$ cat cilk_for_with_reducer.c

#include <stdio.h> // for printf
#include <stdlib.h> // for strtol
#include <cilk/cilk.h> // for cilk keywords
#include <cilk/cilk_api.h> // for cilk functions
#include <cilk/reducer_opadd.h> // for Reducer

int main(int argc, char** argv) {
    int input = strtol(argv[1], NULL, 0);
    int i;
    CILK_C_REDUCER_OPADD(tmp, long, 0);
    CILK_C_REGISTER_REDUCER(tmp);
    cilk_for(i = 1; i <= input; i++) {
        REDUCER_VIEW(tmp) += i;
    }
    printf("triangular of %d is %ld\n", input, tmp.value);
    CILK_C_UNREGISTER_REDUCER(tmp);
    return(0);
}


cilk_for vs. cilk_spawn

• What if we rewrite the cilk_for to a serial for loop over cilk_spawn function calls?

$ cat cilk_spawn_with_reducer.c

<snip includes>

void foo(long * x, int y) {
    *x += y;
}

int main(int argc, char** argv) {
    int input = strtol(argv[1], NULL, 0);
    int i;
    CILK_C_REDUCER_OPADD(tmp, long, 0);
    CILK_C_REGISTER_REDUCER(tmp);
    for(i = 1; i <= input; i++) {
        cilk_spawn foo(&(REDUCER_VIEW(tmp)), i);
    }
    cilk_sync;
    printf("triangular of %d is %ld\n", input, tmp.value);
    CILK_C_UNREGISTER_REDUCER(tmp);
    return(0);
}


Time taken by cilk_spawn

• Results:

$ time ./a.out 105555555

triangular of 105555555 is 5570987648456790

real 0m4.185s

user 0m10.250s

sys 0m23.050s

• Why is the cilk_spawn version so much slower? Work stealing has significant overhead for light workloads, and here each spawned task performs only a single addition. cilk_for distributes the work better, minimizing steals.
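One way to picture the difference: a cilk_for loop is split by recursive halving of the iteration range, so a steal moves a large block of iterations at once, whereas the spawn-per-iteration loop exposes one tiny task at a time. The sketch below shows the halving scheme in serial C. The function name sum_range and the grain parameter are assumptions for this example, and the comments mark where a parallel version would spawn and sync.

```c
/* Sums the integers in [lo, hi) by recursive halving. Ranges of at most
   grain iterations are handled by an ordinary serial loop. */
long long sum_range(long long lo, long long hi, long long grain) {
    long long mid, left, right, sum = 0;

    if (grain < 1) grain = 1;
    if (hi - lo <= grain) {
        long long i;
        for (i = lo; i < hi; i++)
            sum += i;
        return sum;
    }
    mid = lo + (hi - lo) / 2;
    left = sum_range(lo, mid, grain);   /* a parallel version would spawn this half */
    right = sum_range(mid, hi, grain);  /* and run this half in the meantime */
    /* a parallel version would sync here before combining */
    return left + right;
}
```

With a grain of 1024, for example, each leaf performs about a thousand additions per task instead of one, which amortizes the scheduling cost.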


More Analysis/Debugging/Usability Features

• Serialize the code. Just add /Qcilk-serialize (Windows*) or -cilk-serialize (Linux*)

– Just stubs out cilk_spawn and cilk_sync and replaces cilk_for with for

• Set number of workers explicitly:

– set CILK_NWORKERS=1 (Windows*)

– export CILK_NWORKERS=1 (Linux bash)

– setenv CILK_NWORKERS 1 (Linux cshell)

– __cilkrts_set_param("NWORKERS", "1");


Runtime Functions for Worker Management

• __cilkrts_get_worker_number()

– Returns an integer id specific to the Cilk Plus worker running the code.

• __cilkrts_get_nworkers()

– Returns the number of workers available to handle Cilk Plus tasks. Returns 1 in serial code. Once called, the worker count can’t be changed later.

• __cilkrts_get_total_workers()

– Returns the total number of worker “slots”. The Cilk Plus runtime has an allocation of workers that can well be greater than the number of active workers. You can use this API to replace shared data with an array of shared data specific to each thread and then use __cilkrts_get_worker_number() as an index into the array.
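The per-worker array idea from the last bullet can be sketched as follows. To keep the example runnable without the Cilk Plus runtime, the slot count and the worker id are passed in as plain parameters; in real code they would come from __cilkrts_get_total_workers() and __cilkrts_get_worker_number(). The type and function names (worker_slots and so on) are invented for illustration.

```c
#include <stdlib.h>

/* One accumulator per worker slot. */
typedef struct {
    int nslots;
    long long *slot;
} worker_slots;

/* Allocates nslots zeroed accumulators; returns 0 on success, -1 on failure. */
int worker_slots_init(worker_slots *ws, int nslots) {
    if (nslots < 1) return -1;
    ws->slot = (long long *)calloc((size_t)nslots, sizeof *ws->slot);
    if (ws->slot == NULL) return -1;
    ws->nslots = nslots;
    return 0;
}

/* Called by a worker: touches only that worker's own slot, so no lock
   is needed. worker_id must be in [0, nslots). */
void worker_slots_add(worker_slots *ws, int worker_id, long long value) {
    ws->slot[worker_id] += value;
}

/* Called after all parallel work has finished: combines the slots. */
long long worker_slots_total(const worker_slots *ws) {
    long long total = 0;
    int w;
    for (w = 0; w < ws->nslots; w++)
        total += ws->slot[w];
    return total;
}

/* Releases the accumulators. */
void worker_slots_free(worker_slots *ws) {
    free(ws->slot);
    ws->slot = NULL;
    ws->nslots = 0;
}
```

Unlike a reducer, this approach combines the slots in worker order rather than serial order, so it is only suitable for operations where the order of combination does not matter.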


Compiling Code with Non-Intel Compilers

• If you try to compile Cilk Plus code with, say, gcc, you will get errors:

$ g++ test_with_ostream_reducer.cpp

test_with_ostream_reducer.cpp:3:23: error: cilk/cilk.h: No such file or directory

test_with_ostream_reducer.cpp:4:34: error: cilk/reducer_ostream.h: No such file or directory

test_with_ostream_reducer.cpp: In function “int main(int, char**)”:

test_with_ostream_reducer.cpp:8: error: “cilk” has not been declared

test_with_ostream_reducer.cpp:8: error: expected `;' before “cout_reducer”

test_with_ostream_reducer.cpp:10: error: expected primary-expression before “int”

test_with_ostream_reducer.cpp:10: error: “i” was not declared in this scope

test_with_ostream_reducer.cpp:10: error: expected `;' before “)” token

• Add -I <compiler include/cilk> -include cilk/cilk_stub.h to get a serial version that compiles

– $ g++ -I /opt/intel/Compiler/12.0/108/compilerpro-12.0.1.108/compiler/include -include cilk/cilk_stub.h -g test_with_ostream_reducer.cpp

– For Microsoft, use -I <compiler include\cilk> and /FI cilk\cilk_stub.h


Next Steps

• Try Cilk Plus for yourself

– Download an evaluation of the Intel® C++ Composer XE at http://intel.com/software/products

– Try out the sample codes distributed with the product

– Go to http://cilk.com and check out the content, including the Evaluation Guide and the Cilk Plus specification

– If you’re interested in 1:1 consulting, let us know in the feedback form for this presentation


What we didn’t get to

• Intel® Parallel Building Blocks: Quickly Manipulate Data in Parallel Using Intel® Cilk™ Plus Array Notation/Elemental Functions

– Tuesday, February 1, 2011 9:00 AM - 10:00 AM PST (GMT-8)

– http://software.intel.com/en-us/articles/intel-software-development-products-technical-presentations/

• Mixing Cilk Plus keywords and array notations

– http://software.intel.com/en-us/articles/intel-parallel-building-blocks-getting-started-tutorial-and-hands-on-lab/?wapkw=(PBB+lab)

• Mixing Cilk Plus with Intel® TBB

• Using Cilkview and Cilkscreen

• Using Intel® VTune™ Amplifier XE with Cilk Plus

– http://software.intel.com/en-us/articles/intel-cilk-plus-support-in-intel-parallel-amplifier-2011/?wapkw=(Parallel+Amplifier+and+Cilk+Plus)

• Writing Custom Reducers

– Refer to linear-recurrence sample provided with Intel® Parallel Composer 2011 or C++ Composer XE


Optimization Notice

Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options.” Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.

Notice revision #20101101


• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number


Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Configurations: [See slide 6]. For more information go to http://www.intel.com/performance

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

http://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement/?wapkw=(Samples+Software+License+Agreement)

Intel Xeon, Core, and Cilk Plus are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2011 Intel Corporation. All rights reserved.
