8

Automatic TensorCore Scheduling Xiaoyong Liu PAI (Platform of AI) Alibaba Cloud Intelligence Presenting the work of PAI team

Automatic TensorCore SchedulingThe Solution •Generate TensorCorecode automatically ØThread-Level Schedule for CUDA codegen •warp tile shape –(16x16x16) : CUDA9 –(32x8x16,

Download PDF Report

Upload
others
View
6
Download
0

Embed Size (px)

Citation preview

Page 1: Automatic TensorCore SchedulingThe Solution •Generate TensorCorecode automatically ØThread-Level Schedule for CUDA codegen •warp tile shape –(16x16x16) : CUDA9 –(32x8x16,

Automatic TensorCore Scheduling

Xiaoyong Liu

PAI (Platform of AI)Alibaba Cloud Intelligence

Presenting the work of PAI team�

Page 2: Automatic TensorCore SchedulingThe Solution •Generate TensorCorecode automatically ØThread-Level Schedule for CUDA codegen •warp tile shape –(16x16x16) : CUDA9 –(32x8x16,

The Solution• Generate TensorCore code automatically

ØThread-Level Schedule for CUDA codegen

• warp tile shape– (16x16x16) : CUDA9– (32x8x16, 8x32x16) : CUDA10+

ØKind of Auto Tensorization

• IR passes to transform sub-tree to TensorCoreIntrinsics

Page 3: Automatic TensorCore SchedulingThe Solution •Generate TensorCorecode automatically ØThread-Level Schedule for CUDA codegen •warp tile shape –(16x16x16) : CUDA9 –(32x8x16,

PatternMatching

Page 4: Automatic TensorCore SchedulingThe Solution •Generate TensorCorecode automatically ØThread-Level Schedule for CUDA codegen •warp tile shape –(16x16x16) : CUDA9 –(32x8x16,

Matrix Identification

• input matrix attribute: matrix_a / matrix_b, row_major / col_major.

• Retrieve indices of input from ComputeOp : index0, index1

• Compare the indices to the axis/reduce_axis of ComputeOp

Page 5: Automatic TensorCore SchedulingThe Solution •Generate TensorCorecode automatically ØThread-Level Schedule for CUDA codegen •warp tile shape –(16x16x16) : CUDA9 –(32x8x16,

Thread Index Unification

• Thread index inside a warp should be the same for wmma::load/store

Ø threadIdx.x-> 0

Ø threadIdx.y-> threadIdx.y/warpDim.y*warpDim.y

warpDim.y = 32/warpDim.x = 32/blockDim.x

Page 6: Automatic TensorCore SchedulingThe Solution •Generate TensorCorecode automatically ØThread-Level Schedule for CUDA codegen •warp tile shape –(16x16x16) : CUDA9 –(32x8x16,

Loop Scaling

• “wmma::mma_sync(c, a, b, c)” = “c = float(a)*float(b) + c” x (16x16x16/32)

• Find the IterVar to scale according to the access indices of fragment registersfor (int k_inner_inner = 0; k_inner_inner < 16; ++k_inner_inner) {

for (int j_c = 0; j_c < 8; ++j_c) {compute_local[j_c] = (compute_local[j_c] + ((float)(A_shared_local[k_inner_inner] * B_shared_local[((k_inner_inner * 8) + j_c)])));

} }

for (int k_inner_inner = 0; k_inner_inner < 1; ++k_inner_inner) {for (int j_c = 0; j_c < 1; ++j_c) {

wmma::mma_sync(compute_local[0], B_shared_local[0], A_shared_local[0], compute_local[0]);}

}

Page 7: Automatic TensorCore SchedulingThe Solution •Generate TensorCorecode automatically ØThread-Level Schedule for CUDA codegen •warp tile shape –(16x16x16) : CUDA9 –(32x8x16,

Performance Optimization• Same as non-TensorCore CUDA codegen

Ø Auto tune tiling sizes

Ø Vectorized load/store for higher bandwidth utilization

Ø Double buffer to hide memory load latency

Ø Storage align to reduce bank conflicts of shared memory

Ø Virtual threads for data reuse (on-going)

Page 8: Automatic TensorCore SchedulingThe Solution •Generate TensorCorecode automatically ØThread-Level Schedule for CUDA codegen •warp tile shape –(16x16x16) : CUDA9 –(32x8x16,

Comparing with cublas TensorCore

8

INT8/4/1 on T4FP16 on V100

https://docs.tvm.ai/tutorials/optimize/opt_matmul_auto_tensorcore.html

cublas

https://docs.tvm.ai/tutorials/optimize/opt_matmul_auto_tensorcore.html

Fast JIT Code Generation - LLVMllvm.org/devmtg/2014-04/PDFs/LightningTalks/fast-jit-code-generatio… · in terms of compile time/runtime performance. tiny-llvm-codegen ... No performance

Fast JIT Code Generation - LLVMllvm.org/devmtg/2014-04/PDFs/LightningTalks/fast-jit-code-generatio… · in terms of compile time/runtime performance. tiny-llvm-codegen ... No performance

Documents

Deploying Predictive Maintenance Solutions To The Cloud & The … · Explore and automate feature extraction & machine learning tasks Target edge devices through C/C++ codegen Integrate

Deploying Predictive Maintenance Solutions To The Cloud & The … · Explore and automate feature extraction & machine learning tasks Target edge devices through C/C++ codegen Integrate

Documents

NumericalComputation• No external memory (DDR) access is required • Lower resource utilization than DDRaccess without channels with channels (16x16x16) (8x8x8) mesh • Problem

NumericalComputation• No external memory (DDR) access is required • Lower resource utilization than DDRaccess without channels with channels (16x16x16) (8x8x8) mesh • Problem

Documents

NASA JPL Systems Environment · (MMS, TES, Artifactory) • Approach: –Express the REST API endpoints of these servers in OpenAPI standard specification –Use Swagger codegen to

NASA JPL Systems Environment · (MMS, TES, Artifactory) • Approach: –Express the REST API endpoints of these servers in OpenAPI standard specification –Use Swagger codegen to

Documents

Technology

API First Strategy · • ASP.NET Core 1.0 • Go Server • NancyFX 46 Swagger Codegen (Server Stub): 23 • Java JAX-RS 2.0 spec • Java JAX-RS (Java JAX-RS (Jersey) • Java JAX-RS

API First Strategy · • ASP.NET Core 1.0 • Go Server • NancyFX 46 Swagger Codegen (Server Stub): 23 • Java JAX-RS 2.0 spec • Java JAX-RS (Java JAX-RS (Jersey) • Java JAX-RS

Documents

Tutorial: Building a backend in 24 hours · From IR to assembler: codegen pipeline 2. MC 3. Parts of a backend 4. Example step-by-step. The Pipeline IR Passes DAG Combine Legalize

Tutorial: Building a backend in 24 hours · From IR to assembler: codegen pipeline 2. MC 3. Parts of a backend 4. Example step-by-step. The Pipeline IR Passes DAG Combine Legalize

Documents

Modular Codegen - LLVM

Modular Codegen - LLVM

Documents

FOSDEM 2019 Joris Geessels // @jolos · 8 ④Lowering callables, setattr, getattr, … from numba.extending import lower_builtin from numba import cgutils # llvm codegen utils @lower_builtin(MyPoint,

FOSDEM 2019 Joris Geessels // @jolos · 8 ④Lowering callables, setattr, getattr, … from numba.extending import lower_builtin from numba import cgutils # llvm codegen utils @lower_builtin(MyPoint,

Documents

Program generation for schema-based, typed data accessCopyright 2004-present Facebook. All Rights Reserved. Further reading • Introducing "hack-codegen" •

Program generation for schema-based, typed data accessCopyright 2004-present Facebook. All Rights Reserved. Further reading • Introducing "hack-codegen" •

Documents

Future Developments in MPI - dsi.unive.it code from the ... 8x8x8 Blocks 16x16x16 Blocks. ... ♦ Example: MPICH2. University of Chicago Department of Energy MPICH2Authors: William

Future Developments in MPI - dsi.unive.it code from the ... 8x8x8 Blocks 16x16x16 Blocks. ... ♦ Example: MPICH2. University of Chicago Department of Energy MPICH2Authors: William

Documents

Documents

MODELO 200 MODELO 201REFERENCIA 1189 1229avicarton.es/wp-content/uploads/2018/11/catalogo_cajas...MODELO 201REFERENCIA REFERENCIA MEDIDAS 1046 16X11X15 1060 16X16X16 1155 21.5X15.5X16

MODELO 200 MODELO 201REFERENCIA 1189 1229avicarton.es/wp-content/uploads/2018/11/catalogo_cajas...MODELO 201REFERENCIA REFERENCIA MEDIDAS 1046 16X11X15 1060 16X16X16 1155 21.5X15.5X16

Documents

TensorCore Optimized DNN for Efficient Low Latency

TensorCore Optimized DNN for Efficient Low Latency

Documents

Souper-Charging Peepholes with Target Machine Info - LLVM · 2019. 11. 14. · Using Souper Offline 17 Mid-End Optimizations InstCombine Input LLVM IR Backend CodeGen X86 Machine

Souper-Charging Peepholes with Target Machine Info - LLVM · 2019. 11. 14. · Using Souper Offline 17 Mid-End Optimizations InstCombine Input LLVM IR Backend CodeGen X86 Machine

Documents

Multi-Threaded Queries - CMU 15-721...Multi-Threaded Queries S20 15-721 Final Presentation Memory Access & Optimization Parallel Scan Codegen 75% Goal Parallel Scan in C++ DataTable

Multi-Threaded Queries - CMU 15-721...Multi-Threaded Queries S20 15-721 Final Presentation Memory Access & Optimization Parallel Scan Codegen 75% Goal Parallel Scan in C++ DataTable

Documents

Nu verkrijgbaar. Codegen Q3354-CA 460 Watt Midi ATX

Nu verkrijgbaar. Codegen Q3354-CA 460 Watt Midi ATX

Documents

© 2015 The MathWorks, Inc....22 Tips for basic use Add %#codegen directive to your function No need to delete display functions like plot, disp, axis, figure Keep unsupported function

© 2015 The MathWorks, Inc....22 Tips for basic use Add %#codegen directive to your function No need to delete display functions like plot, disp, axis, figure Keep unsupported function

Documents

High Accuracy Low Precision QR Factorization and Least Square … · 2019-12-12 · High Accuracy Low Precision QR Factorization and Least Square Solver on GPU with TensorCore Conference’17,

High Accuracy Low Precision QR Factorization and Least Square … · 2019-12-12 · High Accuracy Low Precision QR Factorization and Least Square Solver on GPU with TensorCore Conference’17,

Documents

NVVM IR Specification 1 - Nvidia...The PTX codegen part of a NVVM compiler needs to know the source language because of the difference in DCI (driver/compiler interface). NVVM IR is

NVVM IR Specification 1 - Nvidia...The PTX codegen part of a NVVM compiler needs to know the source language because of the difference in DCI (driver/compiler interface). NVVM IR is

Documents

Large-Scale Point Cloud Classification Benchmark · 2016. 8. 15. · IGP & CVG, ETH Zürich Timo Hackel, timo.hackel@geod.baug.ethz.ch | 7/6/2016 | 25 Voxelization details 16x16x16

Large-Scale Point Cloud Classification Benchmark · 2016. 8. 15. · IGP & CVG, ETH Zürich Timo Hackel, [email protected] | 7/6/2016 | 25 Voxelization details 16x16x16

Documents

@johannes fiala - Java Forum Stuttgart 2017 (HOW TO) USE ... · PDF file01/03/2017 · Html, Swagger, Yaml, Swift, Tizen, TypescriptAngular, Undertow, ... Swagger-Codegen build integration

@johannes fiala - Java Forum Stuttgart 2017 (HOW TO) USE ... · PDF file01/03/2017 · Html, Swagger, Yaml, Swift, Tizen, TypescriptAngular, Undertow, ... Swagger-Codegen build integration

Documents

POSJETITE NAS ONLINE , Proljetna HIT … · • Integrisana VGA • MidiTower Codegen Qori 3348-A11 500W • Tastatura i optički miš WINSTAR MAX • Intel Core i3 550 3.2GHz •

POSJETITE NAS ONLINE , Proljetna HIT … · • Integrisana VGA • MidiTower Codegen Qori 3348-A11 500W • Tastatura i optički miš WINSTAR MAX • Intel Core i3 550 3.2GHz •

Documents

HLSL to SPIR-V for Vulkan DXC Update...-Google contributing SPIR-V CodeGen since April 2017-Share front-end parsing, HLSL validation-Recommended DXC for HLSL to SPIR-V compilation

HLSL to SPIR-V for Vulkan DXC Update...-Google contributing SPIR-V CodeGen since April 2017-Share front-end parsing, HLSL validation-Recommended DXC for HLSL to SPIR-V compilation

Documents

Architectural-Level Synthesis Giovanni De Micheliusers.ece.utexas.edu/~gerstl/ee382v-ics_f09/lectures/lecture_11.pdf · Architectural-level synthesis ... lex parse optimization codegen

Architectural-Level Synthesis Giovanni De Micheliusers.ece.utexas.edu/~gerstl/ee382v-ics_f09/lectures/lecture_11.pdf · Architectural-level synthesis ... lex parse optimization codegen

Documents

NET Core Libraries Common Language Runtime CodeGen Garbage Collector Security Model Exception Handling Loader & Binder Profiling & Debugging APIs Entity

NET Core Libraries Common Language Runtime CodeGen Garbage Collector Security Model Exception Handling Loader & Binder Profiling & Debugging APIs Entity

Documents

gpucc: An Open-Source GPGPU Compiler - …wujingyue.github.io/docs/gpucc-tutorial.pdf · gpucc: An Open-Source GPGPU Compiler Jingyue Wu, ... PTX assembly IR optimizer NVPTX codegen

gpucc: An Open-Source GPGPU Compiler - …wujingyue.github.io/docs/gpucc-tutorial.pdf · gpucc: An Open-Source GPGPU Compiler Jingyue Wu, ... PTX assembly IR optimizer NVPTX codegen

Documents

Î Ð È ÈÄ Ï Ê ÉÇ Ä É À... · AWS ML services for distributed training Amazon EC2 • Amazon EC2 P3dn.24xlarge for compute (NVIDIA V100 TensorCore GPU) • Elastic Fabric

Î Ð È ÈÄ Ï Ê ÉÇ Ä É À... · AWS ML services for distributed training Amazon EC2 • Amazon EC2 P3dn.24xlarge for compute (NVIDIA V100 TensorCore GPU) • Elastic Fabric

Documents

Building High Performance iPhone Applications with Flash CS5 · 2018-04-11 · AS3 AS3 SWC timeline assets Flash CS5 ABC1 ABC2 timeline assets.swf ABC Compiler LLVM Codegen AOT LLVM

Building High Performance iPhone Applications with Flash CS5 · 2018-04-11 · AS3 AS3 SWC timeline assets Flash CS5 ABC1 ABC2 timeline assets.swf ABC Compiler LLVM Codegen AOT LLVM

Documents

Eclipse Modeling Framework (EMF) und das Graphical Editing ... · Eclipse Modeling Framework (EMF) EMF-Modellimport EMOF und ECore EMF Edit & Codegen Graphical Editing Framework (GEF)

Eclipse Modeling Framework (EMF) und das Graphical Editing ... · Eclipse Modeling Framework (EMF) EMF-Modellimport EMOF und ECore EMF Edit & Codegen Graphical Editing Framework (GEF)

Documents