


Category: Astronomy & Astrophysics poster AS01

Contact name: Long Wang (wangl@sccas.cn)

Weighted essentially non-oscillatory (WENO) is a high-order finite-difference method, based on structured grids, designed for shock capturing. It has been widely used in high-resolution simulations of supersonic flows, typically cosmological hydrodynamics involving both shocks and complicated smooth solution structures. Consider the system of hyperbolic conservation laws, the 3D Euler equations for an inviscid fluid:

What is WENO?

Implementation on CPU/GPU

Results

Speedup of the 128³ WENO scheme on a single GPU:

•  C2075 with Fermi architecture
•  K20m with Kepler architecture

Future work

1.  We can reduce the data-copy time by porting all computations that touch the WENO data to the GPU.

2.  We can hide the MPI communication time by overlapping it with computation. The ghost data can be updated first; then the communication of the ghost data and the update of the remaining data proceed simultaneously.

3.  We can develop a hybrid algorithm in which both CPU and GPU compute simultaneously. In some supercomputers (Titan, Tianhe-1A), one node is equipped with many CPU cores but only one GPU, so it is important to determine how to distribute the computational task between CPU and GPU for good load balancing.
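Item 2 above, overlapping communication with computation, can be sketched in a few lines. This is an illustration, not the poster's implementation: the three callables (`exchange_halo`, `update_interior`, `update_boundary`) are hypothetical stand-ins for the MPI halo exchange and the two phases of the WENO update.

```python
from concurrent.futures import ThreadPoolExecutor

def timestep(exchange_halo, update_interior, update_boundary):
    """Overlap the halo exchange with the update of interior cells,
    which need no ghost data; only the boundary cells wait for
    communication to finish.  All three arguments are hypothetical
    stand-ins for the real MPI and WENO routines."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        comm = pool.submit(exchange_halo)  # e.g. MPI_Sendrecv of ghost data
        update_interior()                  # compute while communication runs
        comm.result()                      # wait until ghost data has arrived
    update_boundary()                      # cells whose stencil needs halos
```

The key ordering guarantee is that `update_boundary` runs only after both the communication and the interior update have completed.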

The high order and high resolution of the WENO scheme call for more computation. Advances in GPUs open new horizons for further accelerating WENO in large-scale cosmological simulations, but there are several challenges:

•  Big data for the 3D problem
•  Double precision for high accuracy
•  More ghost data for higher order
•  More memory accesses for the calculation of the weights

Such high requirements on memory space and bandwidth do not seem well suited to GPUs. Can WENO computations benefit from GPUs? The answer is "YES!". Our implementation has now been applied in "Wigeon", a hybrid cosmological simulation code based on the WENO scheme.

Acceleration of a 3D High-Order Finite-Difference WENO Scheme for Large-Scale Cosmological Simulations on GPU

Chen Meng¹, Long Wang¹, Zongyan Cao¹, XianFeng Ye¹, Long-Long Feng²

Can memory-bound code benefit from the great power of GPUs?

$$U_t = -\frac{\partial f(U)}{\partial x} - \frac{\partial g(U)}{\partial y} - \frac{\partial h(U)}{\partial z} = F(t, U)$$

$$U = \begin{pmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ E \end{pmatrix}, \quad
f(U) = \begin{pmatrix} \rho u \\ \rho u^2 + P \\ \rho u v \\ \rho u w \\ u(E+P) \end{pmatrix}, \quad
g(U) = \begin{pmatrix} \rho v \\ \rho u v \\ \rho v^2 + P \\ \rho v w \\ v(E+P) \end{pmatrix}, \quad
h(U) = \begin{pmatrix} \rho w \\ \rho u w \\ \rho v w \\ \rho w^2 + P \\ w(E+P) \end{pmatrix}$$

We used a 5th-order WENO scheme for the discretization of the fluxes in double precision, for example:

$$\left.\frac{\partial f(u)}{\partial x}\right|_{x = x_j} \approx \frac{1}{\Delta x}\left(\hat{f}_{j+1/2} - \hat{f}_{j-1/2}\right), \qquad \hat{f}_{j+1/2} = w_1 \hat{f}^{(1)}_{j+1/2} + w_2 \hat{f}^{(2)}_{j+1/2} + w_3 \hat{f}^{(3)}_{j+1/2}$$
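For concreteness, the x-direction flux vector of the Euler system can be evaluated as below. This is a minimal NumPy sketch, not the poster's CUDA code; the ideal-gas closure P = (γ−1)(E − ρ|v|²/2) and the value of γ are assumptions, since the poster does not state the equation of state.

```python
import numpy as np

def euler_flux_x(U, gamma=5.0 / 3.0):
    """x-direction flux f(U) of the 3D Euler equations,
    U = (rho, rho*u, rho*v, rho*w, E).  Assumes an ideal-gas
    pressure P = (gamma - 1) * (E - rho*|v|^2 / 2); gamma is an
    assumption, not taken from the poster."""
    rho, mx, my, mz, E = U
    u, v, w = mx / rho, my / rho, mz / rho
    P = (gamma - 1.0) * (E - 0.5 * rho * (u * u + v * v + w * w))
    return np.array([mx, mx * u + P, mx * v, mx * w, u * (E + P)])
```

For a fluid at rest the flux reduces to (0, P, 0, 0, 0), matching the f(U) column above.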

The success of the WENO scheme relies on the design of the nonlinear weights w_i, which are the key to automatically achieving high-order accuracy and the non-oscillatory property near discontinuities. The calculation of the w_i takes most of the hardware resources and time, and it is what makes the WENO computation memory-bound.

Problems  

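The poster does not reproduce the weight formulas, so the following is a sketch of the classic fifth-order WENO-JS construction (Jiang–Shu smoothness indicators and ideal weights) in NumPy. Assuming the poster uses this standard variant is a guess; the point is to show why the weights dominate the cost: each output value needs three smoothness indicators and a normalization over a five-point stencil.

```python
import numpy as np

def weno5_weights(f, eps=1e-6):
    """Nonlinear WENO-JS weights for one five-point stencil
    f = (f_{j-2}, f_{j-1}, f_j, f_{j+1}, f_{j+2})."""
    fm2, fm1, f0, fp1, fp2 = f
    # smoothness indicators beta_k (Jiang & Shu)
    b0 = 13/12 * (fm2 - 2*fm1 + f0)**2 + 1/4 * (fm2 - 4*fm1 + 3*f0)**2
    b1 = 13/12 * (fm1 - 2*f0 + fp1)**2 + 1/4 * (fm1 - fp1)**2
    b2 = 13/12 * (f0 - 2*fp1 + fp2)**2 + 1/4 * (3*f0 - 4*fp1 + fp2)**2
    d = np.array([0.1, 0.6, 0.3])          # ideal (linear) weights
    a = d / (eps + np.array([b0, b1, b2]))**2
    return a / a.sum()

def weno5_flux(f):
    """\\hat f_{j+1/2} = w1 f^(1) + w2 f^(2) + w3 f^(3)."""
    fm2, fm1, f0, fp1, fp2 = f
    q0 = (2*fm2 - 7*fm1 + 11*f0) / 6       # candidate stencil fluxes
    q1 = (-fm1 + 5*f0 + 2*fp1) / 6
    q2 = (2*f0 + 5*fp1 - fp2) / 6
    return weno5_weights(f) @ np.array([q0, q1, q2])
```

On smooth data the weights relax to the ideal values (0.1, 0.6, 0.3), recovering fifth-order accuracy; near a discontinuity the large β_k of crossing stencils drives their weights toward zero.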

•  Decomposition based on MPI
We used a hybrid parallel mode in which CPUs and GPUs co-process. The large computational workload must first be suitably distributed among the processes.

•  Mapping strategy on GPU
The subdomain assigned to each process undergoes a second decomposition on its local GPU. The whole WENO part is split into three kernels for the finite-difference operations along the X, Y, and Z axes. For WENO_fx, each point on the Y-Z plane of the subdomain is assigned to a different thread in a different block and corresponds to the processing of the whole line of points along the X axis. The parallel strategy for WENO_gy and WENO_hz is the same, but along the other axis directions.
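The WENO_fx thread mapping can be mimicked on the CPU as follows. This is an illustrative NumPy sketch, not the CUDA kernel: `line_op` is a hypothetical stand-in for the 1D WENO difference applied to one line of cells.

```python
import numpy as np

def weno_fx(u, line_op):
    """Mimics the WENO_fx kernel mapping: each (j, k) point of the
    Y-Z plane plays the role of one GPU thread and processes the
    whole line of the subdomain along X.  `line_op` is a hypothetical
    stand-in for the 1D WENO difference operator."""
    out = np.empty_like(u)
    ny, nz = u.shape[1], u.shape[2]
    for j in range(ny):        # ~ block index in the CUDA grid
        for k in range(nz):    # ~ thread index within the block
            out[:, j, k] = line_op(u[:, j, k])
    return out
```

WENO_gy and WENO_hz would use the same pattern with the roles of the axes exchanged.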

The WENO scheme requires at least a five-point stencil in each direction; the extra boundary layers this demands from neighboring subdomains are called the ghost data.

We subdivided the domain along all three axial directions, which gives the smallest amount of ghost data at a given scale and the best scalability. The GPU then performs the main WENO computations on each process's subdomain.
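The claim that cutting along all three axes minimizes ghost data can be checked with a small counting sketch. The halo width of 3 layers per face is an assumption (a natural choice for a fifth-order stencil), and the slab-vs-cube comparison below is illustrative, not taken from the poster.

```python
def ghost_cells(shape, width=3):
    """Number of ghost cells around a subdomain with the given interior
    shape, assuming a halo of `width` layers on every face (width=3 is
    an assumed value suiting a fifth-order stencil)."""
    nx, ny, nz = shape
    halo = (nx + 2 * width) * (ny + 2 * width) * (nz + 2 * width)
    return halo - nx * ny * nz

n, p = 128, 8
slab = ghost_cells((n, n, n // p))            # domain cut along one axis only
cube = ghost_cells((n // 2, n // 2, n // 2))  # domain cut along all three axes
```

For the same number of processes, the cube-shaped subdomains carry noticeably fewer ghost cells than slabs, so less data crosses the MPI boundary each step.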

Optimization Technologies

Optimizations target:
•  Memory throughput
   •  Address pattern in global memory
   •  Hierarchical memory
•  Instruction throughput
•  Latency

1.  Initial kernel
2.  Consolidate 'if' statements and unroll small 'for' bodies
3.  Global-memory coalescing by transposing array of structures (AoS) to structure of arrays (SoA)
4.  Use texture memory for read-only data in the WENO computations
5.  Select the optimal block size by multiple runs
6.  Select optimal compiler arguments, such as "-Xptxas -dlcm=cg" to disable the L1 cache
7.  Control register use to reduce register spilling
8.  Split the big kernel into multiple kernels
9.  Use shared memory in the new kernels
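Optimization 3 above (AoS-to-SoA transposition) can be illustrated with a tiny NumPy sketch. The cell count is arbitrary; the point is the memory layout, not the values.

```python
import numpy as np

# AoS: the five conserved variables of each cell stored interleaved,
# one row per cell.  Consecutive GPU threads (consecutive cells) then
# read addresses 5 elements apart -- an uncoalesced pattern.
ncell = 4
aos = np.arange(ncell * 5, dtype=np.float64).reshape(ncell, 5)

# SoA: transpose so each variable is contiguous across cells.
# Consecutive threads now read consecutive addresses, which global
# memory can coalesce into wide transactions.
soa = np.ascontiguousarray(aos.T)
```

In the real code the transpose is done once on the GPU before the WENO kernels run and reversed afterwards, as the workflow diagram later in the poster shows.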

Euler solver based on the WENO scheme on multi-CPU/GPU

28 iterations; CPU: AMD Phenom(tm) 9850; GPU: C2075 with Fermi architecture

•  The strong scaling of 256×256×128
•  The weak scaling of 128×128×128 per process

NP | MPI (s) | MPI+CUDA (s) | MPI efficiency | MPI+CUDA efficiency
---|---------|--------------|----------------|--------------------
 1 | 671.14  | 70.56        |                |
 2 | 340.26  | 41.35        | 98.62%         | 85.32%
 4 | 188.36  | 24.30        | 89.07%         | 72.60%
 8 | 97.69   | 15.56        | 85.87%         | 56.68%
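The efficiency columns are consistent with the usual strong-scaling definition, T(1) / (n · T(n)). A one-line check against two entries of the table:

```python
def strong_scaling_efficiency(t1, tn, n):
    """Strong-scaling parallel efficiency: T(1) / (n * T(n))."""
    return t1 / (n * tn)

# Reproduce two entries from the measured timings above.
mpi_2 = strong_scaling_efficiency(671.14, 340.26, 2)   # ~0.9862 (98.62%)
cuda_8 = strong_scaling_efficiency(70.56, 15.56, 8)    # ~0.5668 (56.68%)
```

The MPI+CUDA efficiency drops faster with process count because the fixed per-step data-copy and communication costs weigh more heavily once the GPU has shrunk the compute time.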

Speedup of the 128³ WENO scheme on a single GPU (F_SpeedUp: C2075/Fermi, K_SpeedUp: K20m/Kepler):

Kernel         | F_SpeedUp | K_SpeedUp
---------------|-----------|----------
weno_Fx        | 15.13     | 25.52
weno_Gy        | 22.14     | 33.21
weno_Hz        | 22.92     | 36.43
weno           | 19.62     | 31.34
weno+data_copy | 12.75     | 17.00

[Figure: "WENO optimization curve on Tesla C2075" — GPU speedup (vertical axis, 0–22) plotted against optimization steps 1–9.]

[Diagram: per-timestep workflow. CPU side: MPI_sendrecv of ghost data, set ghost data, Runge-Kutta stepping. GPU side: bind texture, transpose AoS->SoA, WENO_Fx calculations and write-back, WENO_gy, WENO_hz calculations and write-back, transpose SoA->AoS, unbind texture. Memcpy transfers between CPU and GPU bracket the WENO_fx, WENO_gy, and WENO_hz kernels.]


1.  Supercomputing Center, Computer Network Information Center, Chinese Academy of Sciences
2.  Purple Mountain Observatory, Chinese Academy of Sciences