Stencil Pattern
Parallel Computing CIS 410/510
Department of Computer and Information Science

Stencil Pattern - University of Oregon
ipcc.cs.uoregon.edu/lectures/lecture-8-stencil.pdf


Page 1

Lecture 8 – Stencil Pattern

Stencil Pattern

Parallel Computing CIS 410/510

Department of Computer and Information Science

Page 2

Outline
• Partitioning
• What is the stencil pattern?
  ◦ Update alternatives
  ◦ 2D Jacobi iteration
  ◦ SOR and Red/Black SOR
• Implementing stencil with shift
• Stencil and cache optimizations
• Stencil and communication optimizations
• Recurrence

2 Introduction to Parallel Computing, University of Oregon, IPCC

Page 3

Partitioning
• Data is divided into:
  ◦ non-overlapping regions (avoids write conflicts and race conditions)
  ◦ equal-sized regions (improves load balancing)
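These two requirements can be sketched in code. A minimal illustration (the `partition` helper and the list-of-sections representation are my own, not from the lecture):

```python
def partition(data, num_parts):
    """Split data into num_parts non-overlapping, (nearly) equal-sized regions.

    Non-overlapping sections avoid write conflicts and race conditions;
    near-equal sizes improve load balancing across workers.
    """
    n = len(data)
    base, extra = divmod(n, num_parts)  # spread any remainder over the first parts
    parts = []
    start = 0
    for i in range(num_parts):
        size = base + (1 if i < extra else 0)
        parts.append(data[start:start + size])
        start += size
    return parts

parts = partition(list(range(12)), 4)
# Every element lands in exactly one section; sizes are [3, 3, 3, 3].
```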


To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s), reviewer(s), Elsevier and typesetter diacriTech. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and is confidential until formal publication.

McCool — e9780124159938 — 2012/6/6 — 23:09 — Page 191 — #191

6.6 Geometric Decomposition and Partition 191

might get expanded into a new string of characters, or deleted, according to a set of rules. Such a pattern is often used in computer graphics modeling, where the general approach is often called data amplification. In fact, this pattern is supported in current graphics rendering pipelines in the form of geometry shaders. To simplify implementation, the number of outputs might be bounded to allow for static memory allocation, and this is, in fact, done in geometry shaders. This pattern can also be used to implement variable-rate output from a map, for example, as required for lossless compression.

As with pack, fusing expand with map makes sense since unnecessary write bandwidth can then be avoided. The expand pattern also corresponds to the use of push_back on C++ STL collections in serial loops.

The implementation of expand is more complex than pack but can follow similar principles. If the scan approach is used, we scan integers representing the number of outputs from each element rather than only zeros and ones. In addition, we should tile the implementation so that, on a single processor, a serial "local expand" is used, which is trivial to implement.
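The scan-based approach just described can be rendered as a sketch; the exclusive scan of per-element output counts gives each element its write offset (the names `expand` and `offsets` are illustrative, not the book's code):

```python
from itertools import accumulate

def expand(elements, f):
    """Expand pattern: each element may produce zero or more outputs.

    f(x) returns the variable-length list of outputs for x. An exclusive
    scan over the output counts tells each element where to write, so the
    writes land in disjoint ranges and could proceed in parallel.
    """
    outputs = [f(x) for x in elements]
    counts = [len(o) for o in outputs]
    # exclusive scan: offsets[i] = counts[0] + ... + counts[i-1]
    offsets = [0] + list(accumulate(counts))[:-1]
    result = [None] * sum(counts)
    for off, out in zip(offsets, outputs):
        result[off:off + len(out)] = out  # disjoint ranges: no write conflicts
    return result

# e.g. duplicate evens, drop odds: expand([1, 2, 3, 4], ...) -> [2, 2, 4, 4]
expand([1, 2, 3, 4], lambda x: [x, x] if x % 2 == 0 else [])
```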

6.6 GEOMETRIC DECOMPOSITION AND PARTITION

A common strategy to parallelize an algorithm is to divide up the computational domain into sections, work on the sections individually, and then combine the results. Most generally, this strategy is known as divide-and-conquer and is also used in the design of recursive serial algorithms. The parallelization of the general form of divide-and-conquer is supported by the fork–join pattern, which is discussed extensively in Chapter 8.

Frequently, the data for a problem can also be subdivided following a divide-and-conquer strategy. This is obvious when the problem itself has a spatially regular organization, such as an image or a regular grid, but it can also apply to more abstract problems such as sorting and graphs. When the subdivision is spatially motivated, it is often also known as geometric decomposition.

As a special case of geometric decomposition, the data is subdivided into uniform non-overlapping sections that cover the domain of computation. We will call this the partition pattern. An example of a partition in 1D is shown in Figure 6.16. The partition pattern can also be applied in higher dimensions, as is shown in Figure 6.17.

The sections of a partition are non-overlapping. This is an important property to avoid write conflicts and race conditions. A partition is often followed by a map over the set of sections, with each instance of the elemental function in the map being given access to one of the sections. In this case, if we ensure that the instance has exclusive access to that section, then within the partition serial scatter patterns, such as random writes, can be used without problems with race conditions. It is also possible to apply the pattern recursively, subdividing a section into subsections for nested parallelism. This can be a good way to map a problem onto hierarchically organized parallel hardware.
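The partition-followed-by-map structure described here can be sketched with one thread per section; a hedged illustration (the helper names and the flat shared output buffer are my own choices):

```python
from threading import Thread

def map_over_partitions(data, out, num_parts, f):
    """Apply f to each element, one thread per non-overlapping section.

    Each thread writes only inside its own section of `out`, so even
    scatter-style serial writes within a section are race-free.
    """
    n = len(data)
    bounds = [n * i // num_parts for i in range(num_parts + 1)]
    def work(lo, hi):
        for i in range(lo, hi):   # exclusive access to out[lo:hi]
            out[i] = f(data[i])
    threads = [Thread(target=work, args=(bounds[k], bounds[k + 1]))
               for k in range(num_parts)]
    for t in threads: t.start()
    for t in threads: t.join()

data = list(range(8))
out = [0] * 8
map_over_partitions(data, out, 4, lambda x: x * x)
# out == [0, 1, 4, 9, 16, 25, 36, 49]
```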

FIGURE 6.16

Partitioning. Data is divided into non-overlapping, equal-sized regions.


McCool — e9780124159938 — 2012/6/6 — 23:09 — Page 192 — #192

192 CHAPTER 6 Data Reorganization

FIGURE 6.17

Partitioning in 2D. The partition pattern can be extended to multiple dimensions.

These diagrams show only the simplest case, where the sections of the partition fit exactly into the domain. In practice, there may be boundary conditions where partial sections are required along the edges. These may need to be treated with special-purpose code, but even in this case the majority of the sections will be regular, which lends itself to vectorization. Ideally, to get good memory behavior and to allow efficient vectorization, we also normally want to partition data, especially for writes, so that it aligns with cache line and vectorization boundaries. You should be aware of how data is actually laid out in memory when partitioning data. For example, in a multidimensional partitioning, typically only one dimension of an array is contiguous in memory, so only this one benefits directly from spatial locality. This is also the only dimension that benefits from alignment with cache lines and vectorization, unless the data will be transposed as part of the computation. Partitioning is related to strip-mining the stencil pattern, which is discussed in Section 7.3.
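The layout caveat can be made concrete for row-major storage, where a 2D array is one flat buffer and only the row dimension is contiguous (a small sketch; the `flat_index` helper is illustrative):

```python
def flat_index(i, j, ncols):
    """Row-major address of element (i, j) in a 2D array stored as one flat buffer."""
    return i * ncols + j

ncols = 4
# Walking along a row touches consecutive addresses (good spatial locality):
row_walk = [flat_index(1, j, ncols) for j in range(ncols)]   # [4, 5, 6, 7]
# Walking down a column strides by ncols (no cache-line reuse between accesses):
col_walk = [flat_index(i, 1, ncols) for i in range(4)]       # [1, 5, 9, 13]
```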

Partitioning can be generalized to another pattern that we will call segmentation. Segmentation still requires non-overlapping sections, but now the sections can vary in size. This is shown in Figure 6.18. Various algorithms have been designed to operate on segmented data, including segmented versions of scan and reduce that can operate on each segment of the array but in a perfectly load-balanced fashion, regardless of the irregularities in the lengths of the segments [BHC+93]. These segmented algorithms can actually be implemented in terms of the normal scan and reduce algorithms by using a suitable combiner function and some auxiliary data. Other algorithms, such as quicksort [Ble90, Ble96], can in turn be implemented in a vectorized fashion with a segmented data structure using these primitives.

In order to represent a segmented collection, additional data is required to keep track of the boundaries between sections. The two most common representations are shown in Figure 6.19. Using an array

[Figure: partition examples in 1D and 2D]

Page 4

Partitioning
• Data is divided into:
  ◦ non-overlapping regions (avoids write conflicts and race conditions)
  ◦ equal-sized regions (improves load balancing)

[Figure: the partition pattern shown in 1D, 2D, and 3D]

Page 5

Outline
• Partitioning
• What is the stencil pattern?
  ◦ Update alternatives
  ◦ 2D Jacobi iteration
  ◦ SOR and Red/Black SOR
• Implementing stencil with shift
• Stencil and cache optimizations
• Stencil and communication optimizations
• Recurrence

Page 6

Stencil Pattern
• A stencil pattern is a map where each output depends on a "neighborhood" of inputs
• These inputs are a set of fixed offsets relative to the output position
• A stencil output is a function of a "neighborhood" of elements in an input collection
  ◦ Applies the stencil to select the inputs
• Data access patterns of stencils are regular
  ◦ The stencil is the "shape" of the "neighborhood"
  ◦ The stencil remains the same across the input
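The definition above, a map whose output depends on a fixed set of offsets, can be sketched as follows (function and parameter names are mine, not the lecture's; treating out-of-range neighbors as a default value is one of several boundary-handling choices):

```python
def apply_stencil(a, offsets, f, default=0.0):
    """Map over `a` where each output is f applied to a fixed-offset neighborhood.

    The stencil (the set of offsets) stays the same for every output
    position; out-of-range neighbors are replaced by `default`.
    """
    n = len(a)
    out = []
    for i in range(n):
        neighborhood = [a[i + d] if 0 <= i + d < n else default for d in offsets]
        out.append(f(neighborhood))
    return out

# A 3-point stencil (offsets -1, 0, +1) that averages the neighborhood:
apply_stencil([1.0, 2.0, 3.0, 4.0], (-1, 0, 1), lambda nb: sum(nb) / len(nb))
```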

Page 7

Serial Stencil Example (part 1)

Page 8

Serial Stencil Example (part 2)


How would we parallelize this?
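The serial code on this slide survives only as an image in this copy. A plausible serial 3-point averaging stencil, offered as a sketch rather than the lecture's actual code:

```python
def serial_stencil(a):
    """Serial 3-point stencil: each output averages a[i-1], a[i], a[i+1].

    Writes go to a separate output array, so every iteration reads only
    the original input and the iterations are independent.
    """
    n = len(a)
    b = a[:]                      # copy; boundary values stay unchanged
    for i in range(1, n - 1):
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return b

serial_stencil([0.0, 3.0, 6.0, 9.0, 0.0])
```

Because each `b[i]` reads only the original `a`, the loop over `i` has no cross-iteration dependences, which is one answer to the slide's question: the iteration range can simply be divided among workers.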

Page 9

What is the stencil pattern?

Page 10

What is the stencil pattern?

Input array

Page 11

What is the stencil pattern?

Function

Page 12

What is the stencil pattern?

Output Array

Page 13

What is the stencil pattern?

This stencil has 3 elements in the neighborhood: i-1, i, i+1

[Figure: cells i-1, i, and i+1 highlighted as the neighborhood around position i]

Page 14

What is the stencil pattern?

Applies some function to them…


[Figure: cells i-1, i, and i+1 highlighted as the neighborhood around position i]

Page 15

What is the stencil pattern?

And outputs to the ith position of the output array

Page 16

Stencil Patterns
• Stencils can operate on one-dimensional and multidimensional data
• Stencil neighborhoods can range from compact to sparse, square to cube, and anything else!
• It is the pattern of the stencil that determines how the stencil operates in an application

Page 17

2-Dimensional Stencils

• 5-point stencil: center cell (P) is used as well
• 9-point stencil: center cell (C) is used as well
• 4-point stencil: center cell (P) is not used

Source: http://en.wikipedia.org/wiki/Stencil_code

Page 18

3-Dimensional Stencils

• 6-point stencil (7-point stencil counting the center cell)
• 24-point stencil (25-point stencil counting the center cell)

Source: http://en.wikipedia.org/wiki/Stencil_code

Page 19

Stencil Example
• Here is our array, A

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

Page 20

Stencil Example
• Here is our array A
• B is the output array
  ◦ Initialize to all 0
• Apply a stencil operation to the inner square of the form:
  B(i,j) = avg( A(i,j), A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1) )

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

What is the stencil?

Page 21

Stencil Pattern Procedure
1) Average all blue squares

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

Page 22

Stencil Pattern Procedure
1) Average all blue squares
2) Store result in B

B:
0 0 0 0
0 0 4.4 0
0 0 0 0
0 0 0 0

Page 23

Stencil Pattern Procedure
1) Average all blue squares
2) Store result in B
3) Repeat 1 and 2 for all green squares

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

Page 24

Practice!

Page 25

Stencil Pattern Practice

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

B:
0 0 0 0
0 0 4.4 0
0 0 0 0
0 0 0 0

Page 26

Stencil Pattern Practice

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

B:
0 0 0 0
0 0 4.4 0
0 0 4.0 0
0 0 0 0

Page 27

Stencil Pattern Practice

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

B:
0 0 0 0
0 3.8 4.4 0
0 0 4.0 0
0 0 0 0

Page 28

Stencil Pattern Practice

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

B:
0 0 0 0
0 3.8 4.4 0
0 3.4 4.0 0
0 0 0 0
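The practice sequence above can be checked in code. A sketch of the separate-output-array update on the same 4x4 array (borders assumed to stay zero; names are mine):

```python
def stencil_avg_out_of_place(a):
    """5-point averaging stencil, writing to a separate output array B.

    Every B(i,j) reads only the original A, so the result does not depend
    on the order in which the inner cells are visited.
    """
    n = len(a)
    b = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            b[i][j] = (a[i][j] + a[i-1][j] + a[i+1][j]
                       + a[i][j-1] + a[i][j+1]) / 5.0
    return b

A = [[0, 0, 0, 0],
     [0, 6, 9, 0],
     [0, 4, 7, 0],
     [0, 0, 0, 0]]
B = stencil_avg_out_of_place(A)
# Inner square of B: [[3.8, 4.4], [3.4, 4.0]], matching the slides.
```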

Page 29

Outline
• Partitioning
• What is the stencil pattern?
  ◦ Update alternatives
  ◦ 2D Jacobi iteration
  ◦ SOR and Red/Black SOR
• Implementing stencil with shift
• Stencil and cache optimizations
• Stencil and communication optimizations
• Recurrence

Page 30

Serial Stencil Example (part 1)

Page 31

Serial Stencil Example (part 2)


How would we parallelize this?

The code writes back to a[i]: updates occur in place!

Page 32

Stencil Pattern with In Place Update

Page 33

Stencil Pattern with In Place Update

Input array

Page 34

Stencil Pattern with In Place Update

Function

Page 35

Stencil Pattern with In Place Update

Input Array !!!

Page 36

Stencil Example
• Here is our array, A

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

Page 37

Stencil Example
• Here is our array A
• Update A in place
• Apply a stencil operation to the inner square of the form:
  A(i,j) = avg( A(i,j), A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1) )

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

What is the stencil?

Page 38

Stencil Pattern Procedure
1) Average all blue squares

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

Page 39

Stencil Pattern Procedure
1) Average all blue squares
2) Store result in red square

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

Page 40

Stencil Pattern Procedure
1) Average all blue squares
2) Store result in red square
3) Repeat 1 and 2 for all green squares

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

Page 41

Practice!

Page 42

Stencil Pattern Practice

A:
0 0 0 0
0 6 9 0
0 4 7 0
0 0 0 0

Page 43

Stencil Pattern Practice

A:
0 0 0 0
0 6 4.4 0
0 4 7 0
0 0 0 0

Page 44

What is the stencil pattern?

A:
0 0 0 0
0 6 4.4 0
0 4 7 0
0 0 0 0

Page 45

What is the stencil pattern?

A:
0 0 0 0
0 6 4.4 0
0 4 3.08 0
0 0 0 0

Page 46

What is the stencil pattern?

A:
0 0 0 0
0 6 4.4 0
0 4 3.08 0
0 0 0 0

Page 47

What is the stencil pattern?

A:
0 0 0 0
0 2.88 4.4 0
0 4 3.08 0
0 0 0 0

Page 48

What is the stencil pattern?

48

0 0

0

0

0 0 0 0

0

0

0 0

2.88

4.4

4

3.08

A

Introduction to Parallel Computing, University of Oregon, IPCC

Page 49: Stencil Pattern - University of Oregonipcc.cs.uoregon.edu/lectures/lecture-8-stencil.pdfLecture 8 – Stencil Pattern Partitioning ! ! Data is divided into non-overlapping regions

Lecture 8 – Stencil Pattern

What is the stencil pattern?

49

0 0

0

0

0 0 0 0

0

0

0 0

2.88 4.4

1.992

3.08

A

Introduction to Parallel Computing, University of Oregon, IPCC

Different Cases

50

Separate output array (every update reads the original input):
  Input: [6 9; 4 7]  →  Output: [3.8 4.4; 3.4 4.0]

Updates occur in place (later updates read earlier results):
  Input: [6 9; 4 7]  →  Output: [2.88 4.4; 1.992 3.08]
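The two cases can be checked directly. Below is a minimal Python sketch (mine, not the lecture's code) applying the slides' update rule, the mean of a cell and its four N/S/E/W neighbors with zeros outside the grid, to the 2×2 example: once with a separate output array, once in place. The in-place version here updates in row-major order, so its numbers differ from the slides' update order.

```python
def neighbors_mean(g, i, j):
    """Mean of g[i][j] and its N, S, E, W neighbors (0 outside the grid)."""
    n = len(g)
    total = g[i][j]
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ii, jj = i + di, j + dj
        total += g[ii][jj] if 0 <= ii < n and 0 <= jj < n else 0
    return total / 5

def update_separate(g):
    """Jacobi-style step: every update reads the original input."""
    return [[neighbors_mean(g, i, j) for j in range(len(g))]
            for i in range(len(g))]

def update_in_place(g):
    """In-place step: later updates see earlier results."""
    g = [row[:] for row in g]
    for i in range(len(g)):
        for j in range(len(g)):
            g[i][j] = neighbors_mean(g, i, j)
    return g

grid = [[6, 9], [4, 7]]
print(update_separate(grid))   # [[3.8, 4.4], [3.4, 4.0]]
print(update_in_place(grid))   # order-dependent: differs from the above
```

The separate-array result matches the slide; the in-place result depends on the traversal order, which is exactly the ambiguity the next slide asks about.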

Which is correct?

51

Input: [6 9; 4 7]  →  In-place output: [2.88 4.4; 1.992 3.08]

Is this output incorrect?

Outline

52

- Partitioning
- What is the stencil pattern?
  - Update alternatives
  - 2D Jacobi iteration
  - SOR and Red/Black SOR
- Implementing stencil with shift
- Stencil and cache optimizations
- Stencil and communication optimizations
- Recurrence

Iterative Codes

53

- Iterative codes are ones that update their data in steps.
  - At each step, a new value of an element is computed using a formula based on other elements.
  - Once all elements are updated, the computation proceeds to the next step or completes.
- Iterative codes are most commonly found in computer simulations of physical systems for scientific and engineering applications:
  - Computational fluid dynamics
  - Electromagnetics modeling
- They are often applied to solve partial differential equations:
  - Jacobi iteration
  - Gauss-Seidel iteration
  - Successive over-relaxation

Iterative Codes and Stencils

54

- Stencils essentially define which elements are used in the update formula.
- Because the data is organized in a regular manner, stencils can be applied across the data uniformly.

Simple 2D Example

55

- Consider the following code:

for k = 1, 1000 {
  for i = 1, N-2 {
    for j = 1, N-2 {
      a[i][j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j]
                        + a[i][j-1] + a[i][j+1])
    }
  }
}

5-point stencil

Do you see anything interesting? How would you parallelize?

2D Jacobi Iteration

56

- Consider a 2D array of elements.
- Initialize each array element to some value.
- At each step, update each array element to the arithmetic mean of its N, S, E, W neighbors.
- Iterate until the array values converge.
- Here we are using a 4-point stencil.
- It is different from before because we want to update all array elements simultaneously … How?

2D Jacobi Iteration

57

- Consider a 2D array of elements.
- Initialize each array element to some value.
- At each step, update each array element to the arithmetic mean of its N, S, E, W neighbors.
- Iterate until the array values converge.

[Figure: heat equation simulation with a 4-point stencil, on a cold/hot color scale, shown at steps 0, 200, 400, 600, 800, 1000]
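As a concrete illustration of the heat-equation picture above, here is a Python sketch of the full 2D Jacobi iteration. It is my own sketch, not the lecture's code: the grid size, the zero "cold" boundary, the "hot" interior, and the convergence tolerance are all assumptions. Two arrays are used so that every update reads only the previous step's values.

```python
def jacobi(grid, steps, tol=1e-6):
    """Repeatedly replace each interior cell with the mean of its
    N, S, E, W neighbors; stop after `steps` or once converged."""
    n = len(grid)
    cur = [row[:] for row in grid]
    for _ in range(steps):
        nxt = [row[:] for row in cur]   # boundary rows/cols stay fixed
        delta = 0.0
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                nxt[i][j] = 0.25 * (cur[i-1][j] + cur[i+1][j]
                                    + cur[i][j-1] + cur[i][j+1])
                delta = max(delta, abs(nxt[i][j] - cur[i][j]))
        cur = nxt
        if delta < tol:                 # values have converged
            break
    return cur

# 6x6 grid: boundary fixed at 0 ("cold"), interior starts at 100 ("hot")
g = [[0.0] * 6 for _ in range(6)]
for i in range(1, 5):
    for j in range(1, 5):
        g[i][j] = 100.0
result = jacobi(g, steps=1000)
```

With an all-zero boundary the interior decays toward zero, mirroring the cooling sequence in the figure.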

Successive Over-Relaxation (SOR)

58

- SOR is an alternate method of solving partial differential equations.
- While the Jacobi iteration scheme is very simple and parallelizable, its slow convergence rate renders it impractical for any "real world" applications.
- One way to speed up the convergence rate is to "over predict" the new solution by linear extrapolation.
- It also allows a method known as Red-Black SOR to be used to enable parallel updates in place.

Red / Black SOR

59–60

[Figure: grid cells colored in a red/black checkerboard; each red cell's stencil touches only black neighbors and vice versa, so all cells of one color can be updated in parallel, in place]
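One way this looks in code is the following Python sketch of Red-Black SOR (my own illustration, not the lecture's code; the relaxation factor `omega` and the grid setup are assumptions). Cells are colored by `(i + j) % 2`; each half-sweep is independent within its color, so both can be updated in place and in parallel.

```python
def red_black_sor_step(g, omega=1.5):
    """One full step: update red cells (i + j even), then black cells.
    Each update over-relaxes past the Gauss-Seidel value."""
    n = len(g)
    for color in (0, 1):                      # red sweep, then black
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 != color:
                    continue
                gs = 0.25 * (g[i-1][j] + g[i+1][j]
                             + g[i][j-1] + g[i][j+1])
                # linear extrapolation: "over predict" the new solution
                g[i][j] = g[i][j] + omega * (gs - g[i][j])
    return g
```

With `omega = 1` this reduces to Gauss-Seidel; values of `omega` between 1 and 2 give the accelerated convergence the slide describes.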

Outline

61

- Partitioning
- What is the stencil pattern?
  - Update alternatives
  - 2D Jacobi iteration
  - SOR and Red/Black SOR
- Implementing stencil with shift
- Stencil and cache optimizations
- Stencil and communication optimizations
- Recurrence

Implementing Stencil with Shift

62

- One possible implementation of the stencil pattern includes shifting the input data.
- For each offset in the stencil, we gather a new input vector by shifting the original input by the offset amount.
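A small Python sketch of this idea (the helper names are hypothetical; the boundary values 0 and 9 match the shift figure on slide 63). For a 1D 3-point stencil we build one shifted copy of the input per offset, then combine the copies elementwise in a single vectorizable pass.

```python
def shifted(a, offset, fill):
    """Copy of `a` shifted by `offset`, padded with `fill` at the ends."""
    n = len(a)
    return [a[i + offset] if 0 <= i + offset < n else fill
            for i in range(n)]

def stencil_via_shift(a, left_fill, right_fill):
    """3-point mean: combine the left-shifted, original, and
    right-shifted views of the same input array."""
    left = shifted(a, -1, left_fill)    # element to the west
    right = shifted(a, +1, right_fill)  # element to the east
    return [(l + c + r) / 3 for l, c, r in zip(left, a, right)]

a = [1, 2, 3, 4, 5, 6, 7, 8]
out = stencil_via_shift(a, left_fill=0, right_fill=9)
# out == [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```

All three views are derived from the same original input array, which is the point of the figure that follows.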

Implementing Stencil with Shift

63

[Figure: the input array 1 2 3 4 5 6 7 8 and one shifted copy per stencil offset, padded with the boundary values 0 and 9. All input arrays are derived from the same original input array.]

Implementing Stencil with Shift

64

- This implementation is only beneficial for one-dimensional stencils or the memory-contiguous dimension of a multidimensional stencil.
- Memory traffic to external memory is not reduced with shifts.
- But shifts allow vectorization of the data reads, which may reduce the total number of instructions.

Contents

65

- Partitioning
- What is the stencil pattern?
  - Update alternatives
  - 2D Jacobi iteration
  - SOR and Red/Black SOR
- Implementing stencil with shift
- Stencil and cache optimizations
- Stencil and communication optimizations
- Recurrence

Stencil and Cache Optimizations

66

- Assuming a 2D array where rows are contiguous in memory…
  - Horizontally related data will tend to belong to the same cache line.
  - Vertical offset accesses will most likely result in cache misses.

Stencil and Cache Optimizations

67

- Assigning rows to cores:
  - Maximizes horizontal data locality.
  - Assuming vertical offsets in the stencil, this will create redundant reads of adjacent rows from each core.
- Assigning columns to cores:
  - Redundantly reads data from the same cache line.
  - Creates false sharing as cores write to the same cache line.

Stencil and Cache Optimizations

68

- Assigning "strips" to each core can be a better solution.
- Strip-mining: an optimization in a stencil computation that groups elements in a way that avoids redundant memory accesses and aligns memory accesses with cache lines.

Stencil and Cache Optimizations

69

- A strip's size is a multiple of a cache line in width, and the height of the 2D array.
- Strip widths are in increments of the cache line size so as to avoid false sharing and redundant reads.
- Each strip is processed serially from top to bottom within each core.
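A sketch of how strips might be carved out and walked (Python, with an assumed cache-line size; this is an illustration, not the lecture's code). Strip boundaries fall on cache-line multiples, so cores never write into the same line, and each strip is processed serially from top to bottom.

```python
CACHE_LINE_ELEMS = 16  # assumption: 64-byte line / 4-byte element

def strip_ranges(width, n_cores):
    """Split the columns into per-core strips whose boundaries are
    aligned to multiples of the cache line."""
    lines = (width + CACHE_LINE_ELEMS - 1) // CACHE_LINE_ELEMS
    per_core = (lines + n_cores - 1) // n_cores
    ranges = []
    for c in range(n_cores):
        lo = c * per_core * CACHE_LINE_ELEMS
        hi = min((c + 1) * per_core * CACHE_LINE_ELEMS, width)
        if lo < hi:
            ranges.append((lo, hi))
    return ranges

def process_strip(a, rows, col_lo, col_hi):
    """Serially walk one strip top to bottom (placeholder update:
    double each cell)."""
    for i in range(rows):
        for j in range(col_lo, col_hi):
            a[i][j] *= 2

strips = strip_ranges(width=100, n_cores=4)
# strips == [(0, 32), (32, 64), (64, 96), (96, 100)]
```

In a real stencil code, each `(lo, hi)` range would be handed to one core.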

Stencil and Cache Optimizations

70

[Figure: 2D array divided into vertical strips, each spanning the height of the array and m*sizeof(cacheLine) wide]

Outline

71

- Partitioning
- What is the stencil pattern?
  - Update alternatives
  - 2D Jacobi iteration
  - SOR and Red/Black SOR
- Implementing stencil with shift
- Stencil and cache optimizations
- Stencil and communication optimizations
- Recurrence

But first… Conway's Game of Life

72

- The Game of Life is a cellular automaton created by John Conway in 1970.
- The evolution of the game is entirely determined by the input state (a zero-player game).
- To play: create an initial state, then observe how the system evolves over successive time steps.

[Figure: 2D landscape]

Conway's Game of Life

73

- Typical rules for the Game of Life:
  - Infinite 2D grid of square cells; each cell is either "alive" or "dead".
  - Each cell interacts with all 8 of its neighbors:
    - Any live cell with < 2 live neighbors dies (under-population).
    - Any live cell with 2 or 3 live neighbors lives to the next generation.
    - Any live cell with > 3 live neighbors dies (overcrowding).
    - Any dead cell with exactly 3 live neighbors becomes a live cell.
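The rules above translate almost directly into a stencil step over an 8-neighbor (Moore) neighborhood. A minimal Python sketch (standard rules; treating cells outside a finite grid as dead is my simplification of the "infinite" grid):

```python
def life_step(grid):
    """One generation of the Game of Life on a 0/1 grid."""
    rows, cols = len(grid), len(grid[0])

    def live_neighbors(i, j):
        count = 0
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                if (di, dj) == (0, 0):
                    continue
                ii, jj = i + di, j + dj
                if 0 <= ii < rows and 0 <= jj < cols:
                    count += grid[ii][jj]
        return count

    nxt = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            n = live_neighbors(i, j)
            if grid[i][j] == 1:
                nxt[i][j] = 1 if n in (2, 3) else 0  # survive or die
            else:
                nxt[i][j] = 1 if n == 3 else 0       # birth
    return nxt

# A "blinker" oscillates between a horizontal and a vertical bar:
blinker = [[0, 0, 0],
           [1, 1, 1],
           [0, 0, 0]]
```

Note that the new generation is written to a separate array, the same separate-output-versus-in-place choice discussed earlier for stencils.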

Conway's Game of Life: Examples

74

[Figure: example Game of Life patterns]

Conway's Game of Life

75

- The Game of Life computation can easily fit into the stencil pattern!
- Each larger, black box is owned by a thread.
- What will happen at the boundaries?

[Figure: grid partitioned into per-thread blocks, one labeled "MY DATA"]

Conway's Game of Life

76

- We need some way to preserve information from the previous iteration without overwriting it.
- Ghost cells are one solution to the boundary and update issues of a stencil computation.
- Each thread keeps a copy of its neighbors' data to use in its local computations.
- These ghost cells must be updated after each iteration of the stencil.
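A Python sketch of the ghost-cell idea (the helper names are hypothetical, and a single process stands in for real threads). Each thread's block is stored with a one-cell ghost border; before each iteration the border is refilled from the neighbors' current data, so local updates never read a neighbor's half-updated values.

```python
def make_padded(block):
    """Wrap a block in a border of ghost cells (initially 0)."""
    cols = len(block[0])
    padded = [[0] * (cols + 2)]
    padded += [[0] + row[:] + [0] for row in block]
    padded += [[0] * (cols + 2)]
    return padded

def refresh_ghosts(padded, north, south, west, east):
    """Copy neighbor edge rows/columns into this block's ghost border.
    `None` means no neighbor on that side (the border stays 0)."""
    if north is not None:
        padded[0][1:-1] = north[-1][:]       # neighbor's bottom row
    if south is not None:
        padded[-1][1:-1] = south[0][:]       # neighbor's top row
    for i, row in enumerate(padded[1:-1], start=1):
        if west is not None:
            row[0] = west[i - 1][-1]         # neighbor's east column
        if east is not None:
            row[-1] = east[i - 1][0]         # neighbor's west column

block = [[1, 2], [3, 4]]
p = make_padded(block)
refresh_ghosts(p, north=None, south=[[9, 9]], west=None, east=[[5], [6]])
```

After the refresh, the stencil can be applied to the interior of `p` using only local reads.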

Conway's Game of Life

77–83

Working with ghost cells (animated across these slides):
- Compute the new value for a cell near the block boundary.
- Five of its eight neighbors already belong to this thread, but three of its neighbors belong to a different thread.
- Before any updates are done in a new iteration, all threads must update their ghost cells.
- Each thread then works on the data it can use, including the ghost cells copied from its neighbors.
- Updated cells are written into the thread's own block.

[Figure: a thread's block labeled "MY DATA" surrounded by its ghost-cell border]

Conway's Game of Life

84

- Things to consider…
  - What might happen to our ghost cells as we increase the number of threads?
    - The ratio of ghost cells to total cells will rapidly increase, causing a greater demand on memory.
  - What would be the benefits of using a larger number of ghost cells per thread? Negatives?
    - In the Game of Life example, we could double or triple our ghost-cell boundary, allowing us to perform several iterations without stopping for a ghost-cell update.

Stencil and Communication Optimizations

85

- When data is distributed, ghost cells must be explicitly communicated among nodes between loop iterations.
- After the first iteration of the stencil computation:
  - PE 0 must request PE 1's and PE 2's stencil results.
  - PE 0 can then perform another iteration of the stencil.

[Figure: a grid of cells distributed across PE 0–PE 3; the darker cells are PE 0's ghost cells]
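A Python sketch of that exchange (the message-passing layer is simulated with a plain dict acting as the network; a real code would use MPI sends and receives). Each PE owns a quadrant and, between iterations, sends its border cells to the neighbors that need them as ghost cells.

```python
network = {}  # (src, dst) -> payload, a stand-in for real messages

def send(src, dst, payload):
    network[(src, dst)] = payload

def recv(src, dst):
    return network.pop((src, dst))

# PE 0 owns the top-left quadrant; PE 1 is to its east, PE 2 to its south.
pe0 = [[1, 2], [3, 4]]
pe1 = [[5, 6], [7, 8]]
pe2 = [[9, 10], [11, 12]]

# Between iterations each PE sends its border to its neighbors...
send(1, 0, [row[0] for row in pe1])   # PE 1's west column
send(2, 0, pe2[0][:])                 # PE 2's north row

# ...and PE 0 receives them as ghost cells before its next iteration.
ghost_east = recv(1, 0)    # [5, 7]
ghost_south = recv(2, 0)   # [9, 10]
```

The same pattern generalizes to all four PEs; only border cells, not whole blocks, cross the network.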

Stencil and Communication Optimizations

86

- It is generally better to replicate ghost cells in each local memory and swap them after each iteration than to share memory.
  - Fine-grained sharing can lead to increased communication cost.

Stencil and Communication Optimizations

87

- Halo: the set of all ghost cells.
- The halo must contain all neighbors needed for one iteration.
- Larger halo (deep halo) trade-off:
  - Fewer communications and more independence, but…
  - More redundant computation and more memory used.
- Latency hiding: compute the interior of the stencil while waiting for ghost cell updates.

Recurrence

88

- What if we have several nested loops with data dependencies between them when doing a stencil computation?

Recurrence

89–90

[Figure: grid of intermediate results, with data dependencies between loops]

Recurrence Parallelization

91

- This can still be parallelized!
- Trick: find a plane that cuts through the grid of intermediate results.
  - Previously computed values lie on one side of the plane.
  - Values still to be computed lie on the other side.
  - Computation proceeds perpendicular to the plane through time (this is known as a sweep).
- This plane is called a separating hyperplane.
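A Python sketch of a hyperplane sweep (the recurrence `v[i][j] = 1 + v[i-1][j] + v[i][j-1]` is my illustrative example, not the lecture's). Each cell depends on its west and north neighbors, so all cells on the same anti-diagonal `i + j = d` are independent of each other; sweeping `d` from 0 upward exposes the parallelism. Here the diagonal is a plain loop, but every iteration of the inner loop could run concurrently.

```python
def sweep(n, m):
    """Compute v[i][j] = 1 + v[i-1][j] + v[i][j-1] (missing neighbors
    count as 0), one separating hyperplane i + j = d at a time."""
    v = [[0] * m for _ in range(n)]
    for d in range(n + m - 1):                    # one hyperplane per step
        for i in range(max(0, d - m + 1), min(n, d + 1)):
            j = d - i                             # all (i, j) with i + j == d
            north = v[i-1][j] if i > 0 else 0
            west = v[i][j-1] if j > 0 else 0
            v[i][j] = 1 + north + west
    return v

grid = sweep(3, 3)
# grid == [[1, 2, 3], [2, 5, 9], [3, 9, 19]]
```

The result is identical to a serial row-major computation; only the order in which independent cells are visited has changed.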

Recurrence

92

[Figure: hyperplanes cutting through the grid of intermediate results]

Recurrence

93

- Same grid of intermediate results.
- Each level corresponds to a loop iteration.
- Computation proceeds downward.

Conclusion

94

- Examined the stencil and recurrence patterns.
  - Both have a regular pattern of communication and data access.
- In both patterns we can convert a set of offset memory accesses to shifts.
- Stencils can use strip-mining to optimize cache use.
- Ghost cells should be considered when stencil data is distributed across different memory spaces.