Software and Services Group Optimization Notice
Intel® Parallel Building Blocks: Quickly Write Parallel Tasks Using
Intel® Cilk™ Plus Keywords and Reducers
Brandon Hewitt
Jan. 18, 2011
1
Intel® Cilk™ Plus: One of the Intel® Parallel Building Blocks
• One of the three Intel® Parallel Building Blocks
2
What is Intel® Cilk™ Plus
3
Intel Cilk Plus Key Benefits
• Simple syntax which is very easy to learn and use
• Array notation guarantees fast vector code
• Fork/join tasking system is simple to understand and mimics serial behavior
• Low overhead tasks offer scalability to high core counts
• Reducers give better performance than mutex locks and maintain serial semantics
• Mixes with Intel® TBB and Intel® ArBB for a complete task and vector parallel solution
Intel Cilk Plus
What is it?
• Compiler supported solution offering a tasking system via 3 simple keywords
• Includes array notation to specify vector code
• Reducers - powerful parallel data structures to efficiently prevent races
• Based on 15 years of research at MIT
• Pragmas to force vectorization of loops and attributes to specify functions that can be applied to all elements of arrays
Focusing on Task Parallelism
• The three keywords to implement parallelism
– cilk_spawn, cilk_for, cilk_sync
• Reducers to handle shared data safely
– e.g. reducer_opadd, reducer_ostream
• Composable
– Nest parallel regions, mix with other Intel® PBB
• Serial Semantics
– Parallel code can be understood and executed as an equivalent serial code
– Benefits are improved applicability of industry standard debugging and analysis tools, support for deterministic behavior, and tools that are able to provide strong guarantees of correctness and/or performance
4
Agenda
• Other technical presentations review Intel® Cilk™ Plus and Intel® PBB at a feature level.
– http://software.intel.com/en-us/articles/intel-software-development-products-technical-presentations/
• In this technical presentation, we’ll take a more in-depth look at the task parallel parts of Cilk Plus using specific example codes
– Syntax
– Debugging
– Using analysis tools
– Fun with Reducers
– And more…
5
Systems Used for Examples
• In all cases I used Intel® C++ Composer XE for Windows* and Linux* update 1.
• The Inspector example uses Intel® Inspector XE for Windows* initial release.
• Linux* system is a 4 core Intel® Xeon® X5560 cpu running 64-bit Fedora Core* 9.0
• Windows* system is a dual core Intel® Core™ i5 660 cpu running 64-bit Windows Server 2008* R2 Enterprise.
• All examples shown are provided as-is, and you are encouraged to validate any conclusions yourself.
6
Using cilk_spawn and cilk_sync
$ cat cilk_spawn_sample.c
#include <stdio.h>          // for printf
#include <stdlib.h>         // for strtol
#include <cilk/cilk.h>      // for cilk keywords
#include <cilk/cilk_api.h>  // for cilk functions

int fib(int x) {
    int tmp1, tmp2;
    printf("fib() run by Cilk Plus worker %d\n", __cilkrts_get_worker_number());
    if (x <= 1)
        return x;
    else {
        tmp1 = cilk_spawn fib(x-1);
        tmp2 = fib(x-2);
        cilk_sync;
        return tmp1+tmp2;
    }
}

int main(int argc, char** argv) {
    int input = strtol(argv[1], NULL, 0);
    printf("fib of %d is %d\n", input, fib(input));
    return(0);
}
7
Using cilk_spawn and cilk_sync
$ ./a.out 1
fib() run by Cilk Plus worker 8
fib of 1 is 1
$ ./a.out 2
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 1
fib of 2 is 1
$ ./a.out 3
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 1
fib of 3 is 2
8
Using cilk_spawn and cilk_sync
$ ./a.out 6
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 8
fib() run by Cilk Plus worker 2
fib() run by Cilk Plus worker 1
fib() run by Cilk Plus worker 1
fib() run by Cilk Plus worker 1
fib() run by Cilk Plus worker 1
fib() run by Cilk Plus worker 1
fib() run by Cilk Plus worker 0
fib() run by Cilk Plus worker 1
fib() run by Cilk Plus worker 4
fib() run by Cilk Plus worker 4
fib() run by Cilk Plus worker 4
fib() run by Cilk Plus worker 0
fib() run by Cilk Plus worker 0
fib() run by Cilk Plus worker 0
fib() run by Cilk Plus worker 0
fib of 6 is 8
9
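Because Cilk Plus keeps serial semantics, the number of calls seen in the runs above can be checked against the serial version of fib(), that is, the same function with cilk_spawn and cilk_sync removed. The sketch below is plain C and does not need the Cilk Plus runtime. The counter fib_calls and the helper count_fib_calls are illustrative names that are not part of the original sample.

```c
/* Number of times fib_serial() has been entered (illustrative counter). */
static long fib_calls = 0;

/* Serial version of the sample's fib(): cilk_spawn and cilk_sync removed. */
int fib_serial(int x) {
    fib_calls++;
    if (x <= 1)
        return x;
    return fib_serial(x - 1) + fib_serial(x - 2);
}

/* Hypothetical helper: reset the counter, run fib_serial(x),
   and return how many calls were made in total. */
long count_fib_calls(int x) {
    fib_calls = 0;
    (void)fib_serial(x);
    return fib_calls;
}
```

For inputs 2, 3 and 6 the helper returns 3, 5 and 25, which matches the number of "fib() run by Cilk Plus worker" lines printed in the runs above. Only the worker that executes each call differs from run to run.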
Using cilk_for
$ cat cilk_for_sample.c
#include <stdio.h>          // for printf
#include <stdlib.h>         // for strtol
#include <cilk/cilk.h>      // for cilk keywords
#include <cilk/cilk_api.h>  // for cilk functions

int main(int argc, char** argv) {
    int input = strtol(argv[1], NULL, 0);
    int i, tmp = 0;
    cilk_for(i = 1; i <= input; i++) {
        printf("for loop run by Cilk Plus worker %d\n", __cilkrts_get_worker_number());
        tmp += i;
    }
    printf("triangular of %d is %d\n", input, tmp);
    return(0);
}
10
Using cilk_for
$ ./a.out 1
for loop run by Cilk Plus worker 8
triangular of 1 is 1
$ ./a.out 2
for loop run by Cilk Plus worker 8
for loop run by Cilk Plus worker 8
triangular of 2 is 3
$ ./a.out 3
for loop run by Cilk Plus worker 8
for loop run by Cilk Plus worker 1
for loop run by Cilk Plus worker 1
triangular of 3 is 6
$ ./a.out 4
for loop run by Cilk Plus worker 8
for loop run by Cilk Plus worker 8
for loop run by Cilk Plus worker 8
for loop run by Cilk Plus worker 8
triangular of 4 is 10
$ ./a.out 4
for loop run by Cilk Plus worker 8
for loop run by Cilk Plus worker 8
for loop run by Cilk Plus worker 3
for loop run by Cilk Plus worker 3
triangular of 4 is 10
11
Uh-oh
• The cilk_for code seems to work reliably for lower numbers.
• But if tmp is made a long int, and we increase the input significantly, we start seeing non-deterministic output:
– Change line “int i, tmp = 0;” to:
int i; //, tmp = 0;
long tmp = 0;
– Change %d in printf to %ld for tmp
– Also remove printf of Cilk Plus worker id to preserve sanity
$ ./a.out 1055555
triangular of 1055555 is 218727597107
$ ./a.out 1055555
triangular of 1055555 is 306057677066
$ ./a.out 1055555
triangular of 1055555 is 257123500732
12
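The value the loop is supposed to produce has a closed form, n(n+1)/2, so the three outputs above can be checked without running any parallel code. The sketch below is plain C; the function names triangular and fits_in_int are illustrative and do not appear in the original sample.

```c
#include <limits.h>

/* Closed-form triangular number n(n+1)/2, computed in 64 bits. */
long long triangular(long long n) {
    return n * (n + 1) / 2;
}

/* Returns 1 if the n-th triangular number can be stored in an int. */
int fits_in_int(long long n) {
    return triangular(n) <= INT_MAX;
}
```

triangular(1055555) is 557098706790. That is more than a 32-bit int can hold, which is why tmp has to become a long, and it is larger than all three outputs shown above, which is consistent with additions being lost.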
What do we do now?
• Try to debug using the Intel® Parallel Debug Extensions (Windows*) or Intel® Debugger (Linux*)
• Try Intel® Parallel Inspector 2011 or Intel® Inspector XE
• Try Cilkscreen utility
13
Debug it!
• Need to compile with options:
– /Zi /debug:parallel on Windows*
– -g -debug parallel on Linux*
• Use Intel® Parallel Debug Extensions in Microsoft Visual Studio* Debugger or
• Use Intel® Debugger (IDB) on Linux
• The next example uses IDB; the flow is the same on Windows*, but with a GUI instead of text commands
14
Using IDB’s Thread Data Sharing Detection
$ icc -O2 -g -debug parallel cilk_for2.c
$ idbc ./a.out
Intel(R) Debugger for applications running on Intel(R) 64, Version 12.0, Build [1.3842.2.154]
------------------
object file name: ./a.out
Reading symbols from a.out...done.
(idb) set args 1055555
(idb) idb sharing on
(idb) run
Starting program: /var/quad/blhewitt/cilk/samples-for-webinar/cilk_for/a.out
[New Thread 140059247015680 (LWP 10832)]
[New Thread 140059247015680 (LWP 10832)]
[New Thread 1090242896 (LWP 10833)]
[New Thread 1111206224 (LWP 10834)]
[New Thread 1121696080 (LWP 10835)]
[New Thread 1132185936 (LWP 10836)]
[New Thread 1142675792 (LWP 10837)]
[New Thread 1153165648 (LWP 10838)]
[New Thread 1163655504 (LWP 10839)]
[New Thread 1174145360 (LWP 10840)]
Data sharing event 1: 0x601320 8 bytes, 4 accesses from 3 threads.
__$U0 (this=0x7f6215f48f80, =1, =362484044) at /var/quad/blhewitt/cilk/samples-for-webinar/cilk_for/cilk_for2.c:16
16 tmp += i;
(idb) idb sharing event expand
Data sharing event 1: 0x601320 8 bytes, 4 accesses from 3 threads.
/var/quad/blhewitt/cilk/samples-for-webinar/cilk_for/cilk_for2.c:16 = 0x400c5b write, Thread 3
/var/quad/blhewitt/cilk/samples-for-webinar/cilk_for/cilk_for2.c:16 = 0x400c5b write, Thread 7
/var/quad/blhewitt/cilk/samples-for-webinar/cilk_for/cilk_for2.c:16 = 0x400c72 read, Thread 7
/var/quad/blhewitt/cilk/samples-for-webinar/cilk_for/cilk_for2.c:16 = 0x400c72 read, Thread 9
15
What does this mean?
• IDB is telling us that multiple threads are reading and writing to the same memory location (tmp) at the same time.
• We have a data race, which can cause non-deterministic behavior (i.e. behavior that can change depending on the order different threads execute).
16
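To make the lost update concrete, the sketch below replays one unlucky interleaving by hand in ordinary serial C. It assumes that tmp += i is carried out as a separate load, add and store, which is typical but depends on the compiler and target. The sim_thread type and all function names are hypothetical and exist only for this illustration.

```c
/* One simulated thread: a private register holding the value it loaded. */
typedef struct {
    long reg;
} sim_thread;

/* "tmp += i" split into three separate steps. */
static void sim_load(sim_thread *t, const long *tmp)  { t->reg = *tmp; }
static void sim_add(sim_thread *t, long i)            { t->reg += i; }
static void sim_store(const sim_thread *t, long *tmp) { *tmp = t->reg; }

/* Both threads load before either one stores, so one update is lost. */
long racy_interleaving(long i_a, long i_b) {
    long tmp = 0;
    sim_thread a, b;
    sim_load(&a, &tmp);
    sim_load(&b, &tmp);
    sim_add(&a, i_a);
    sim_add(&b, i_b);
    sim_store(&a, &tmp);
    sim_store(&b, &tmp);   /* overwrites the value thread a just stored */
    return tmp;
}

/* Thread a finishes its load/add/store before thread b starts. */
long serialized_interleaving(long i_a, long i_b) {
    long tmp = 0;
    sim_thread a, b;
    sim_load(&a, &tmp);
    sim_add(&a, i_a);
    sim_store(&a, &tmp);
    sim_load(&b, &tmp);
    sim_add(&b, i_b);
    sim_store(&b, &tmp);
    return tmp;
}
```

With i_a = 1 and i_b = 2 the racy order leaves tmp at 2 while the serialized order gives 3. Which order a real run takes depends on thread timing, which is why the result changes from run to run.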
So how do we fix data races?
• The traditional solution is to use a lock around the accesses to the shared data.
• A lock allows only one thread to access the protected code at a time.
• Intel® TBB provides a mutex locking construct, but since the sample is in C and not C++, we’ll use a POSIX thread mutex lock.
17
Locking Solution
$ cat cilk_for_with_lock.c
#include <stdio.h>          // for printf
#include <stdlib.h>         // for strtol
#include <pthread.h>        // for lock
#include <cilk/cilk.h>      // for cilk keywords
#include <cilk/cilk_api.h>  // for cilk functions

pthread_mutex_t lock_sum;

int main(int argc, char** argv) {
    int input = strtol(argv[1], NULL, 0);
    int i;
    long tmp = 0;
    pthread_mutex_init(&lock_sum, NULL);
    cilk_for(i = 1; i <= input; i++) {
        pthread_mutex_lock(&lock_sum);
        tmp += i;
        pthread_mutex_unlock(&lock_sum);
    }
    pthread_mutex_destroy(&lock_sum);
    printf("triangular of %d is %ld\n", input, tmp);
    return(0);
}
18
Results of Locking
• We now get good, consistent answers
$ ./a.out 1055555
triangular of 1055555 is 557098706790
• But performance suffers as iterations increase
$ time ./a.out 105555555
triangular of 105555555 is 5570987648456790
real 0m9.183s
user 0m7.133s
sys 0m59.067s
19
Intel® Cilk™ Plus Reducers
• Need a way to protect accesses to shared data that doesn’t suffer from contention and bottle-necks.
• Cilk Plus provides reducers: constructs that give each worker its own view of the shared data; the views are then merged at a cilk_sync.
• The reducer design eliminates lock contention and has other benefits as well.
20
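The view-and-merge idea can be sketched without the Cilk Plus runtime. The code below is a serial simulation in plain C, not the runtime's implementation: it assumes a fixed number of views (NUM_VIEWS) and a fixed chunk of iterations per view, whereas the real runtime decides when views are created. The name sum_with_views is illustrative.

```c
#define NUM_VIEWS 4   /* assumed number of workers for this sketch */

/* Sum 1..n using one private accumulator ("view") per simulated worker. */
long long sum_with_views(int n) {
    long long views[NUM_VIEWS] = {0};   /* 0 is the identity value for + */
    long long total = 0;
    int w, i;

    /* Each simulated worker owns one contiguous chunk of the iterations
       and only touches its own view, so no lock is needed. */
    for (w = 0; w < NUM_VIEWS; w++) {
        int lo = 1 + (int)((long long)n * w / NUM_VIEWS);
        int hi = (int)((long long)n * (w + 1) / NUM_VIEWS);
        for (i = lo; i <= hi; i++)
            views[w] += i;
    }

    /* Merge step: combine the views in a fixed order. */
    for (w = 0; w < NUM_VIEWS; w++)
        total += views[w];
    return total;
}
```

Because addition is associative, merging the per-worker partial sums gives the same total as the serial loop, for example 557098706790 for n = 1055555, with no shared variable written inside the loop.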
Solution with reducer_opadd in C
$ cat cilk_for_with_reducer.c
#include <stdio.h>               // for printf
#include <stdlib.h>              // for strtol
#include <cilk/cilk.h>           // for cilk keywords
#include <cilk/cilk_api.h>       // for cilk functions
#include <cilk/reducer_opadd.h>  // for Reducer

int main(int argc, char** argv) {
    int input = strtol(argv[1], NULL, 0);
    int i;
    CILK_C_REDUCER_OPADD(tmp, long, 0);
    CILK_C_REGISTER_REDUCER(tmp);
    cilk_for(i = 1; i <= input; i++) {
        REDUCER_VIEW(tmp) += i;
    }
    printf("triangular of %d is %ld\n", input, tmp.value);
    CILK_C_UNREGISTER_REDUCER(tmp);
    return(0);
}
21
Results with Reducer
$ icc cilk_for_with_reducer.c
$ time ./a.out 105555555
triangular of 105555555 is 5570987648456790
real 0m0.094s
user 0m0.062s
sys 0m0.103s
22
Catching Data Races before they Manifest
• Intel provides a couple of tools for detecting data races in Cilk Plus codes
• Intel® Parallel Inspector 2011 / Intel® Inspector XE
– Some limitations, including false positives, and potential misses if no steals occur
• Cilkscreen
– Only detects data races (Inspector also detects memory errors)
23
Using Intel® Inspector XE for Windows*
24
Start Data Race Detection
25
Get Results
26
View Details
27
Results After Adding Reducer
28
Another Benefit of Reducers: Serial Semantics
$ cat test_with_lock.cpp
#include <iostream>
#include <cstring>
#include <pthread.h>
#include <cilk/cilk.h>

pthread_mutex_t cout_lock;

int main(int argc, char* argv[]) {
    const int length = std::strlen(argv[1]);
    pthread_mutex_init(&cout_lock, NULL);
    cilk_for(int i = 0; i < length; i++) {
        pthread_mutex_lock(&cout_lock);
        std::cout << argv[1][i];
        pthread_mutex_unlock(&cout_lock);
    }
    std::cout << std::endl;
    pthread_mutex_destroy(&cout_lock);
    return(0);
}
29
Answer With Pthread Mutexes
$ icc test_with_lock.cpp -lpthread
$ ./a.out "hello world,hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world"
hello world,hello world, hello world, hello wor hello world, hello world, hello wolrldd,hello world, hello world, h,e lhleol lwo helhlo ewlolrol rwdoo,r he llwlolordld,d ,hweo rlllho,e whelllloorlo d ,w ohrellldo, whodrh,l hedl,l oe lwlohrelldol,o h ewl lo wwoororlrdl, ldd,, hhelleol lwoor lwdo,r lhde,l lhoe llwoo rwlodr,l dh,e lhleol lwoo rwlodr,l dh,ello world, hello ello world, hello world, hello world, hello world, hello world, hello worldworld, hello world, hello world, hello world,, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello worldworld, hello world, hello world, hello world,
30
Now Use an ostream Reducer
$ cat test_with_ostream_reducer.cpp
#include <iostream>
#include <cstring>
#include <cilk/cilk.h>
#include <cilk/reducer_ostream.h>

int main(int argc, char* argv[]) {
    const int length = std::strlen(argv[1]);
    cilk::reducer_ostream cout_reducer(std::cout);
    cilk_for(int i = 0; i < length; i++)
        cout_reducer << argv[1][i];
    std::cout << std::endl;
    return(0);
}
31
Results Are Always In Order
$ icc test_with_ostream_reducer.cpp
$ ./a.out "hello world,hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world"
hello world,hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world, hello world
32
Customer Case
33
Assembly tree (diagram):
– Level 1: PickupTruck
– Level 2: Body, Chassis, Engine, DriveTrain
– Level 3: Cab, Doors, Flatbed
Goal: Find all “collisions” between an assembly and a target object.
First Attempt to Use cilk_for
34
std::list<Node *> output_list;

void walk(Node &x, Node &target) {
    if (x.is_internal()) {
        cilk_for(Node::iterator child = x.begin(); child != x.end(); ++child) {
            walk(*child, target);  // assumes the iterator dereferences to a Node
        }
    } else if (target.collides_with(x)) {
        output_list.push_back(&x);  // the list stores Node pointers
    }
}
Parallel update of list is a data race!
In parallel, traverse tree
At leaf, collect collisions
Using Locks
35
std::list<Node *> output_list;

void walk(Node &x, Node &target) {
    if (x.is_internal()) {
        cilk_for(Node::iterator child = x.begin(); child != x.end(); ++child) {
            walk(*child, target);  // assumes the iterator dereferences to a Node
        }
    } else if (target.collides_with(x)) {
        m.lock();  // m: a mutex declared elsewhere (not shown on the slide)
        output_list.push_back(&x);
        m.unlock();
    }
}
Add lock
• Poor performance
• Order not deterministic
In parallel, traverse tree
At leaf, collect collisions
Using STL List Reducer
36
#include <cilk/reducer_list.h>  // for cilk::reducer_list_append

cilk::reducer_list_append<Node *> output_list;

void walk(Node &x, Node &target) {
    if (x.is_internal()) {
        cilk_for(Node::iterator child = x.begin(); child != x.end(); ++child) {
            walk(*child, target);  // assumes the iterator dereferences to a Node
        }
    } else if (target.collides_with(x)) {
        output_list.push_back(&x);  // each worker appends to its own view
    }
}
Change list to hyper-object
• Good performance. Serial order!
In parallel, traverse tree
At leaf, collect collisions
A Look at cilk_for vs. cilk_spawn
• Take the cilk_for example from slide 21
$ cat cilk_for_with_reducer.c
#include <stdio.h>               // for printf
#include <stdlib.h>              // for strtol
#include <cilk/cilk.h>           // for cilk keywords
#include <cilk/cilk_api.h>       // for cilk functions
#include <cilk/reducer_opadd.h>  // for Reducer

int main(int argc, char** argv) {
    int input = strtol(argv[1], NULL, 0);
    int i;
    CILK_C_REDUCER_OPADD(tmp, long, 0);
    CILK_C_REGISTER_REDUCER(tmp);
    cilk_for(i = 1; i <= input; i++) {
        REDUCER_VIEW(tmp) += i;
    }
    printf("triangular of %d is %ld\n", input, tmp.value);
    CILK_C_UNREGISTER_REDUCER(tmp);
    return(0);
}
37
cilk_for vs. cilk_spawn
• What if we rewrite the cilk_for to a serial for loop over cilk_spawn function calls?
$ cat cilk_spawn_with_reducer.c
<snip includes>

void foo(long * x, int y) {
    *x += y;
}

int main(int argc, char** argv) {
    int input = strtol(argv[1], NULL, 0);
    int i;
    CILK_C_REDUCER_OPADD(tmp, long, 0);
    CILK_C_REGISTER_REDUCER(tmp);
    for(i = 1; i <= input; i++) {
        cilk_spawn foo(&(REDUCER_VIEW(tmp)), i);
    }
    cilk_sync;
    printf("triangular of %d is %ld\n", input, tmp.value);
    CILK_C_UNREGISTER_REDUCER(tmp);
    return(0);
}
38
Time taken by cilk_spawn
• Results:
$ time ./a.out 105555555
triangular of 105555555 is 5570987648456790
real 0m4.185s
user 0m10.250s
sys 0m23.050s
• Why is cilk_spawn so much slower? Work stealing has significant overhead for light workloads. Here each spawned call performs a single addition, so the cost of a steal far outweighs the work it transfers. cilk_for distributes the work better, minimizing steals.
39
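One way to picture why cilk_for needs fewer steals is recursive halving of the iteration range, so that any piece of work handed to another worker is a large chunk rather than a single iteration. The sketch below shows that shape in plain serial C. It is an illustration under stated assumptions, not the runtime's actual code: the slides do not show how cilk_for is implemented, and GRAIN, dc_sum and leaf_chunks are hypothetical names and values.

```c
/* Illustrative grain size: ranges this small are summed with a plain loop. */
#define GRAIN 1024

/* Counts how many leaf ranges have been processed so far. */
static long leaf_chunks = 0;

/* Sum lo..hi inclusive by recursive halving. In a Cilk Plus version,
   the first recursive call would be the natural place for cilk_spawn. */
long long dc_sum(int lo, int hi) {
    if (hi - lo < GRAIN) {
        long long s = 0;
        int i;
        leaf_chunks++;
        for (i = lo; i <= hi; i++)
            s += i;
        return s;
    } else {
        int mid = lo + (hi - lo) / 2;
        long long left = dc_sum(lo, mid);        /* candidate for cilk_spawn */
        long long right = dc_sum(mid + 1, hi);
        return left + right;
    }
}
```

In this shape the spawn points near the top of the recursion each cover a large part of the range, while the serial loop of cilk_spawn calls above offers only one tiny task per spawn.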
More Analysis/Debugging/Usability Features
• Serialize the code. Just add /Qcilk-serialize (Windows*) or -cilk-serialize (Linux*)
– This stubs out cilk_spawn and cilk_sync and replaces cilk_for with for
• Set number of workers explicitly:
– set CILK_NWORKERS=1 (Windows*)
– export CILK_NWORKERS=1 (Linux bash)
– setenv CILK_NWORKERS 1 (Linux csh)
– __cilkrts_set_param("nworkers", "1");
40
Runtime Functions for Worker Management
• __cilkrts_get_worker_number()
– Returns an integer id specific to the Cilk Plus worker running the code.
• __cilkrts_get_nworkers()
– Returns the number of workers available to handle Cilk Plus tasks. Returns 1 in serial code. Once called, the worker count can’t be changed later.
• __cilkrts_get_total_workers()
– Returns the total number of worker “slots”. The Cilk Plus runtime has an allocation of workers that can well be greater than the number of active workers. You can use this API to replace shared data with an array of shared data specific to each thread and then use __cilkrts_get_worker_number() as an index into the array.
41
Compiling Code with Non-Intel Compilers
• If you try to compile Cilk Plus code with, say, gcc, you will get errors:
$ g++ test_with_ostream_reducer.cpp
test_with_ostream_reducer.cpp:3:23: error: cilk/cilk.h: No such file or directory
test_with_ostream_reducer.cpp:4:34: error: cilk/reducer_ostream.h: No such file or directory
test_with_ostream_reducer.cpp: In function “int main(int, char**)”:
test_with_ostream_reducer.cpp:8: error: “cilk” has not been declared
test_with_ostream_reducer.cpp:8: error: expected `;' before “cout_reducer”
test_with_ostream_reducer.cpp:10: error: expected primary-expression before “int”
test_with_ostream_reducer.cpp:10: error: “i” was not declared in this scope
test_with_ostream_reducer.cpp:10: error: expected `;' before “)” token
• Add -I <compiler include/cilk> -include cilk/cilk_stub.h to get a serial version that compiles
– $ g++ -I /opt/intel/Compiler/12.0/108/compilerpro-12.0.1.108/compiler/include -include cilk/cilk_stub.h -g test_with_ostream_reducer.cpp
– For Microsoft, use -I <compiler include\cilk> and /FI cilk\cilk_stub.h
42
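The effect of stubbing out the keywords can be approximated with three macros. The sketch below defines them locally so that it builds with any C compiler and without Cilk Plus headers. The exact contents of cilk/cilk_stub.h are not shown in these slides, so the macro definitions are an assumption about its effect rather than a copy of the header, and the function name triangular_stubbed is illustrative.

```c
/* Assumed stand-ins for what cilk/cilk_stub.h achieves: the spawn and sync
   keywords vanish and cilk_for becomes an ordinary for loop. */
#define cilk_spawn
#define cilk_sync
#define cilk_for for

/* The fib() sample from earlier, minus the worker-number printf. */
int fib(int x) {
    int tmp1, tmp2;
    if (x <= 1)
        return x;
    tmp1 = cilk_spawn fib(x - 1);
    tmp2 = fib(x - 2);
    cilk_sync;
    return tmp1 + tmp2;
}

/* The cilk_for sample as a function. */
long triangular_stubbed(int n) {
    long tmp = 0;
    int i;
    cilk_for (i = 1; i <= n; i++) {
        tmp += i;   /* no race here: the loop runs as a serial for */
    }
    return tmp;
}
```

With the macros in place the code is ordinary serial C, which makes it a convenient baseline for checking results from the parallel build.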
Next Steps
• Try Cilk Plus for yourself
– Download an evaluation of the Intel® C++ Composer XE at http://intel.com/software/products
– Try out the sample codes distributed with the product
– Go to http://cilk.com and check out the content, including the Evaluation Guide and the Cilk Plus specification
– If you’re interested in 1:1 consulting, let us know in the feedback form for this presentation
43
What we didn’t get to
• Intel® Parallel Building Blocks: Quickly Manipulate Data in Parallel Using Intel® Cilk™ Plus Array Notation/Elemental Functions
– Tuesday, February 1, 2011 9:00 AM - 10:00 AM PST (GMT-8)
– http://software.intel.com/en-us/articles/intel-software-development-products-technical-presentations/
• Mixing Cilk Plus keywords and array notations
– http://software.intel.com/en-us/articles/intel-parallel-building-blocks-getting-started-tutorial-and-hands-on-lab/?wapkw=(PBB+lab)
• Mixing Cilk Plus with Intel® TBB
• Using Cilkview and Cilkscreen
• Using Intel® VTune™ Amplifier XE with Cilk Plus
– http://software.intel.com/en-us/articles/intel-cilk-plus-support-in-intel-parallel-amplifier-2011/?wapkw=(Parallel+Amplifier+and+Cilk+Plus)
• Writing Custom Reducers
– Refer to linear-recurrence sample provided with Intel® Parallel Composer 2011 or C++ Composer XE
44
Optimization Notice
Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for
instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not
optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that
are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler
options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User
and Reference Guides” under “Compiler Options." Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler
products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your
code and other factors, you likely will get extra performance on Intel microprocessors.
Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-
Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel®
Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming
SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability,
functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent
optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and
non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet
your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please
let us know if you find we do not.
Notice revision #20101101
45
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number
46
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Configurations: [See slide 6]. For more information go to http://www.intel.com/performance
Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.
http://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement/?wapkw=(Samples+Software+License+Agreement)
Intel Xeon, Core, and Cilk Plus are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2011 Intel Corporation. All rights reserved.
47