• Combining the goodness of different types of accelerators: GPU + FPGA
• GPU is still an essential accelerator for simple, large-degree parallelism, providing ~10 TFLOPS of peak performance
• FPGA is a new type of accelerator for application-specific hardware, offering programmability and speedup through pipelined calculation
• FPGA is also good for external communication, with advanced high-speed interconnects of up to 100 Gbps x 4 channels
University of Tsukuba | Center for Computational Sciences
https://www.ccs.tsukuba.ac.jp/
Multi-Hybrid Accelerated Computing Platform
Supercomputer at CCS: Cygnus
OpenCL-ready High Speed FPGA Networking [1]
[Figure: overall network. Albireo and Deneb compute nodes share an IB EDR network (100 Gbps x4/node) with full bisection bandwidth over 4 ports, the ordinary inter-node communication channel for CPU and GPU, which can also be requested from the FPGA. Albireo nodes are additionally connected by an inter-FPGA direct network.]
• Our new supercomputer "Cygnus"
• Operation started in May 2019
• 2x Intel Xeon CPUs, 4x NVIDIA V100 GPUs, 2x Intel Stratix 10 FPGAs
• Deneb: 48 CPU+GPU nodes
• Albireo: 32 CPU+GPU+FPGA nodes with a 2D-torus dedicated network for FPGAs (100 Gbps x4)
Albireo node (x32)
Deneb node (x48)
Specification of Cygnus
Target GPU: NVIDIA Tesla V100 / Target FPGA: Nallatech 520N

Item                      Specification
Peak performance          2.4 PFLOPS DP (GPU: 2.2 PFLOPS, CPU: 0.2 PFLOPS, FPGA: 0.6 PFLOPS SP); enhanced by mixed precision and variable precision on FPGA
# of nodes                80 (32 Albireo (GPU+FPGA) nodes, 48 Deneb (GPU-only) nodes)
Memory                    192 GiB DDR4-2666/node = 256 GB/s; 32 GiB x4 for GPU/node = 3.6 TB/s
CPU / node                Intel Xeon Gold (SKL) x2 sockets
GPU / node                NVIDIA V100 x4 (PCIe)
FPGA / node               Intel Stratix 10 x2 (each with 100 Gbps x4 links/FPGA, x8 links/node)
Global file system        Lustre, RAID6, 2.5 PB
Interconnection network   Mellanox InfiniBand HDR100 x4 (two cables of HDR200/node), 4 TB/s aggregate bandwidth
Programming languages     CPU: C, C++, Fortran, OpenMP; GPU: OpenACC, CUDA; FPGA: OpenCL, Verilog HDL
System vendor             NEC
• FPGA design plan
• Router: mandatory for the dedicated network; forwards packets to their destinations
• User Logic: the OpenCL kernel runs here; inter-FPGA communication can be controlled from the OpenCL kernel
• SL3 (SerialLite III): an Intel FPGA IP including the transceiver modules for inter-FPGA data transfer; users don't need to care about it
[Figure: single-node block diagrams. In an Albireo node (with FPGA), the CPUs, GPUs (two pairs), an FPGA, and two HCAs are connected through PCIe switches; each HCA links to a network switch (100 Gbps x2), and the FPGA additionally joins the inter-FPGA direct network (100 Gbps x4). A Deneb node (without FPGA) is identical except that it has no FPGA.]

Inter-FPGA direct network (only for Albireo nodes): the 64 FPGAs on the Albireo nodes are connected directly in a 2D-torus configuration without any Ethernet switch.
Cluster System with FPGAs
__kernel void sender(__global float* restrict x, int n) {
    for (int i = 0; i < n; i++) {
        float v = x[i];
        write_channel_intel(simple_out, v);
    }
}

__kernel void receiver(__global float* restrict x, int n) {
    for (int i = 0; i < n; i++) {
        float v = read_channel_intel(simple_in);
        x[i] = v;
    }
}
Communication Integrated Reconfigurable CompUting System (CIRCUS)
CIRCUS enables OpenCL code to communicate with FPGAs on other nodes by extending Intel's channel mechanism to external communication. It operates in a pipelined manner, sending and receiving data directly from/to the compute pipeline.
・ I/O Channel: connects OpenCL with peripherals; we used this feature
・ Channel extension: transfers data between kernels directly (low latency and high bandwidth); we can use a multiple-kernel design to exploit spatial parallelism in an FPGA
[Figure: communication without channels vs. communication with channels]
FPGA-based parallel computing with OpenCL needs a communication system suited to OpenCL and Intel FPGAs; we build on the Intel FPGA SDK for OpenCL.
CIRCUS backends: the sender code runs on FPGA1 and the receiver code on FPGA2.
[Figure: pipelined communication experiment with our proposed method. The Recv., Comp., and Send stages overlap in the pipeline, reaching 90.7 Gbps.]
Authentic Radiation Transfer [2]
• Accelerated Radiative transfer on grids using Oct-Tree (ARGOT) has been developed at the Center for Computational Sciences, University of Tsukuba
• ART is one of the algorithms used in ARGOT and the dominant part (90% or more of the computation time) of the ARGOT program
• ART is a ray-tracing-based algorithm: the problem space is divided into meshes and reactions are computed on each mesh
• The memory access pattern depends on the ray direction, so it is not suitable for SIMD architectures
[Figure: ART performance (M mesh/s; higher is better) vs. mesh size from (16,16,16) to (128,128,128) for CPU(14C), CPU(28C), P100(x1), and FPGA.]
Table 2: Resource usage and clock frequency of the implementation.

size             # of PEs   ALMs (%)      Registers (%)   M20K (%)   MLAB     DSP (%)   Freq. [MHz]
(16, 16, 16)     (2, 2, 2)  132,283 31%   267,828 31%     739 27%    14,310   312 21%   193.2
(32, 32, 32)     (2, 2, 2)  169,882 40%   344,447 40%     796 29%    21,100   312 21%   173.8
(64, 64, 64)     (2, 2, 2)  169,549 40%   344,512 40%     796 29%    21,250   312 21%   167.0
(128, 128, 128)  (2, 2, 2)  169,662 40%   344,505 40%     796 29%    21,250   312 21%   170.4
Table 3: Performance comparison between FPGA, CPU and GPU implementations. The unit is M mesh/sec.

Size           CPU(14C)   CPU(28C)   P100     FPGA
(16,16,16)     112.4      77.2       105.3    1282.8
(32,32,32)     158.9      183.4      490.4    1165.2
(64,64,64)     175.0      227.2      1041.4   1111.0
(128,128,128)  95.4       165.0      1116.1   1133.5
per link) multiple interconnection links (up to 4 channels) on it. Additionally, an HLS environment such as OpenCL is provided, and there are several types of research that involve these links in FPGA computing. In [3], Kobayashi et al. show the basic features needed to utilize the high-speed interconnect on an FPGA driven by OpenCL kernels. Therefore, although the performance of our implementation is almost the same as that of an NVIDIA P100 GPU, the overall performance of multiple directly connected FPGA-equipped computation nodes can easily overcome the GPU implementation, which requires host-CPU control and kernel switching for inter-node communication. The networking overhead on FPGAs is much lower than that on GPUs. Improving the current ART implementation with such an interconnection feature on the FPGA is our next step toward high-performance parallel FPGA computing.
8. CONCLUSION
In this paper, we optimized the ART method used in the ARGOT program, which solves a fundamental calculation of radiative transfer in the early-stage universe, on an FPGA using the Intel FPGA SDK for OpenCL. We parallelized the algorithm using the SDK's channel extension within an FPGA. We achieved 4.89 times faster performance than the CPU implementation using OpenMP, as well as almost the same performance as the GPU implementation using CUDA.
Although the performance of the FPGA implementation is comparable to an NVIDIA P100 GPU, there is room for improvement. The most important optimization is resource optimization: if we can implement a larger number of PEs than in the current design, we can improve performance. However, it is difficult for us to reduce the usage of ALMs and registers because we do not describe them directly in OpenCL code. Not only resources but also frequency is important; we suppose an Arria 10 OpenCL design can run at 200 MHz or a higher frequency.
We will implement the network functionality in the ART design to parallelize it across multiple FPGAs. We consider networking using FPGAs an important feature for parallel applications on FPGAs. Although GPUs have higher computational performance (FLOPS) and higher memory bandwidth than FPGAs, I/O including networking is a weak point for GPUs because they are connected to NICs through the PCIe bus. In addition to networking, we will try to run our code on the Stratix 10 FPGA, the next generation of Intel FPGAs. We expect to implement more PEs than on the Arria 10 FPGA because it has 2.2 times more ALM blocks and 3.8 times more DSP blocks.
9. REFERENCES
[1] K. M. Górski, E. Hivon, A. J. Banday, B. D. Wandelt, F. K. Hansen, M. Reinecke, and M. Bartelmann. HEALPix: A framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal, 622(2):759, 2005.
[2] K. Hill, S. Craciun, A. George, and H. Lam. Comparative analysis of OpenCL vs. HDL with image-processing kernels on Stratix-V FPGA. In 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 189-193, July 2015.
[3] R. Kobayashi, Y. Oobata, N. Fujita, Y. Yamaguchi, and T. Boku. OpenCL-ready high speed FPGA network for reconfigurable high performance computing. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2018, pages 192-201, New York, NY, USA, 2018. ACM.
[4] Y. Luo, X. Wen, K. Yoshii, S. Ogrenci-Memik, G. Memik, H. Finkel, and F. Cappello. Evaluating irregular memory access on OpenCL FPGA platforms: A case study with XSBench. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL), pages 1-4, Sept 2017.
[5] T. Okamoto, K. Yoshikawa, and M. Umemura. ARGOT: accelerated radiative transfer on grids using oct-tree. Monthly Notices of the Royal Astronomical Society, 419(4):2855-2866, 2012.
[6] S. Tanaka, K. Yoshikawa, T. Okamoto, and K. Hasegawa. A new ray-tracing scheme for 3D diffuse radiation transfer on highly parallel architectures. Publications of the Astronomical Society of Japan, 67(4):62, 2015.
[7] H. R. Zohouri, N. Maruyama, A. Smith, M. Matsuda, and S. Matsuoka. Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16, pages 35:1-35:12, Piscataway, NJ, USA, 2016. IEEE Press.
[Figure: a (16x16x16) mesh divided into (8x8x8) blocks]
• The problem space is divided into small blocks, e.g. (16, 16, 16) → 8 × (8, 8, 8)
• A PE is assigned to each small block
• PEs are connected to each other by channels (PE: Processing Element, BE: Boundary Element)
• The kernels of the PEs and BEs are started automatically by the autorun attribute, lowering control overhead and resource usage by decreasing the number of host-controlled kernels
4.9x faster than CPU; almost equal performance to GPU
Reference
[1] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, and Taisuke Boku. Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming. 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 479-488, May 2019.
[2] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Yuuma Oobata, Taisuke Boku, Makito Abe, Kohji Yoshikawa, and Masayuki Umemura. Accelerating Space Radiative Transfer on FPGA using OpenCL (accepted). International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2018).
Acknowledgment
This research is part of the project "Development of Computing-Communication Unified Supercomputer in Next Generation" under the MEXT program "Research and Development for Next-Generation Supercomputing Technology". We thank the Intel University Program for providing us with both hardware and software.