TVM Stack Overview
Tianqi Chen


Page 1:

TVM Stack Overview
Tianqi Chen

Page 2:

Beginning of TVM Story

Page 3:

Beginning of TVM Story

Accelerator

Page 5:

Beginning of TVM Story

// Pseudo-code for convolution program for the VIA accelerator
// Virtual Thread 0
0x00: LOAD(PARAM[ 0-71]) // LD@TID0
0x01: LOAD(ACTIV[ 0-24]) // LD@TID0
0x02: LOAD(LDBUF[ 0-31]) // LD@TID0
0x03: PUSH(LD->EX) // LD@TID0
0x04: POP (LD->EX) // EX@TID0
0x05: EXE (ACTIV[ 0-24],PARAM[ 0-71],LDBUF[ 0-31],STBUF[ 0- 7]) // EX@TID0
0x06: PUSH(EX->LD) // EX@TID0
0x07: PUSH(EX->ST) // EX@TID0
0x08: POP (EX->ST) // ST@TID0
0x09: STOR(STBUF[ 0- 7]) // ST@TID0
0x0A: PUSH(ST->EX) // ST@TID0
// Virtual Thread 1
0x0B: LOAD(ACTIV[25-50]) // LD@TID1
0x0C: LOAD(LDBUF[32-63]) // LD@TID1
0x0D: PUSH(LD->EX) // LD@TID1
0x0E: POP (LD->EX) // EX@TID1
0x0F: EXE (ACTIV[25-50],PARAM[ 0-71],LDBUF[32-63],STBUF[32-39]) // EX@TID1
0x10: PUSH(EX->LD) // EX@TID1
0x11: PUSH(EX->ST) // EX@TID1
0x12: POP (EX->ST) // ST@TID1
0x13: STOR(STBUF[32-39]) // ST@TID1
0x14: PUSH(ST->EX) // ST@TID1
// Virtual Thread 2
0x15: POP (EX->LD) // LD@TID2
0x16: LOAD(PARAM[ 0-71]) // LD@TID2
0x17: LOAD(ACTIV[ 0-24]) // LD@TID2
0x18: LOAD(LDBUF[ 0-31]) // LD@TID2
0x19: PUSH(LD->EX) // LD@TID2
0x1A: POP (LD->EX) // EX@TID2
0x1B: POP (ST->EX) // EX@TID2
0x1C: EXE (ACTIV[ 0-24],PARAM[ 0-71],LDBUF[ 0-31],STBUF[ 0- 7]) // EX@TID2
0x1D: PUSH(EX->ST) // EX@TID2
0x1E: POP (EX->ST) // ST@TID2
0x1F: STOR(STBUF[ 0- 7]) // ST@TID2
// Virtual Thread 3
0x20: POP (EX->LD) // LD@TID3
0x21: LOAD(ACTIV[25-50]) // LD@TID3
0x22: LOAD(LDBUF[32-63]) // LD@TID3
0x23: PUSH(LD->EX) // LD@TID3
0x24: POP (LD->EX) // EX@TID3
0x25: POP (ST->EX) // EX@TID3
0x26: EXE (ACTIV[25-50],PARAM[ 0-71],LDBUF[32-63],STBUF[32-39]) // EX@TID3
0x27: PUSH(EX->ST) // EX@TID3
0x28: POP (EX->ST) // ST@TID3
0x29: STOR(STBUF[32-39]) // ST@TID3

// Convolution access pattern dictated by micro-coded program.
// Each register index is derived as a 2-D affine function.
// e.g. idx_rf = a_rf*y + b_rf*x + c_rf0, where c_rf0 is specified by
// micro op 0 fields.
for y in [0…i)
  for x in [0…j)
    rf[idx_rf0] += GEVM(act[idx_act0], par[idx_par0])
    rf[idx_rf1] += GEVM(act[idx_act1], par[idx_par1])
    …
    rf[idx_rfn] += GEVM(act[idx_actn], par[idx_parn])

(b) Convolution micro-coded program
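To make the affine indexing concrete, here is a small Python sketch of how a micro-op's fields could generate register indices; the function and field names are illustrative, not part of the accelerator ISA.

# Register index as a 2-D affine function of the loop variables:
# idx = a*y + b*x + c0, with (a, b, c0) taken from the micro-op fields.
def affine_index(a, b, c0, y, x):
    return a * y + b * x + c0

# e.g. a micro op with fields a=4, b=1, c0=0 walks a 4-wide row-major tile
indices = [affine_index(4, 1, 0, y, x) for y in range(2) for x in range(4)]
assert indices == [0, 1, 2, 3, 4, 5, 6, 7]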

// Max-pool, batch normalization and activation function
// access pattern dictated by micro-coded program.
// Each register index is derived as a 2-D affine function.
// e.g. idx_dst = a_dst*y + b_dst*x + c_dst0, where c_dst0 is specified by
// micro op 0 fields.
for y in [0…i)
  for x in [0…j)
    // max pooling
    rf[idx_dst0] = MAX(rf[idx_dst0], rf[idx_src0])
    rf[idx_dst1] = MAX(rf[idx_dst1], rf[idx_src1])
    …
    // batch norm
    rf[idx_dstm]   = MUL(rf[idx_dstm],   rf[idx_srcm])
    rf[idx_dstm+1] = ADD(rf[idx_dstm+1], rf[idx_srcm+1])
    rf[idx_dstm+2] = MUL(rf[idx_dstm+2], rf[idx_srcm+2])
    rf[idx_dstm+3] = ADD(rf[idx_dstm+3], rf[idx_srcm+3])
    …
    // activation
    rf[idx_dstn-1] = RELU(rf[idx_dstn-1], rf[idx_srcn-1])
    rf[idx_dstn]   = RELU(rf[idx_dstn],   rf[idx_srcn])

(c) Max pool, batch norm and activation micro-coded program

(a) Blocked convolution program with multiple thread contexts

Accelerator
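The PUSH/POP pairs in listing (a) are explicit dependence tokens between the load (LD), execute (EX), and store (ST) stages; virtual threads interleave their instructions so every stage's pipeline stays busy. Below is a minimal Python sketch of that token discipline, assuming illustrative queue and stage names rather than any real accelerator API.

# Hedged sketch: dependence queues between pipeline stages, mirroring the
# PUSH(LD->EX)/POP(LD->EX) pairs in the listing above.
from queue import Queue

ld_to_ex = Queue()  # LD->EX tokens
ex_to_st = Queue()  # EX->ST tokens

def load_stage(tid):
    # ... DMA parameters/activations into on-chip buffers ...
    ld_to_ex.put(tid)   # PUSH(LD->EX): signal that the data is ready

def execute_stage(tid):
    ld_to_ex.get()      # POP (LD->EX): block until the load has finished
    # ... run the GEMM/ALU work on the loaded buffers ...
    ex_to_st.put(tid)   # PUSH(EX->ST): signal that results are ready

def store_stage(tid):
    ex_to_st.get()      # POP (EX->ST): block until execute has finished
    # ... DMA results from the store buffer back to DRAM ...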

Page 6:

Existing Approach

Hardware

Frameworks

Page 7:

Existing Approach

High-level data flow graph

Hardware

Frameworks

Page 8:

Existing Approach

High-level data flow graph

Hardware

Primitive Tensor operators such as Conv2D

Frameworks

Page 9:

Existing Approach

High-level data flow graph

Hardware

Primitive Tensor operators such as Conv2D

e.g. cuDNN: offload to a heavily optimized DNN operator library

Frameworks

Page 10:

Existing Approach: Engineer Optimized Tensor Operators

Page 12:

Existing Approach: Engineer Optimized Tensor Operators

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Matmul: Operator Specification
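For context, here is a complete, runnable version of this declaration; the placeholder and reduction-axis boilerplate is what the one-line spec above assumes (a sketch against the 2018-era tvm API).

import tvm

m = n = kdim = 1024
A = tvm.placeholder((kdim, m), name='A')
B = tvm.placeholder((kdim, n), name='B')
k = tvm.reduce_axis((0, kdim), name='k')
C = tvm.compute((m, n),
                lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k),
                name='C')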

Page 14:

Existing Approach: Engineer Optimized Tensor Operators

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Matmul: Operator Specification

for y in range(1024):
    for x in range(1024):
        C[y][x] = 0
        for k in range(1024):
            C[y][x] += A[k][y] * B[k][x]

Vanilla Code

Page 15:

Existing Approach: Engineer Optimized Tensor Operators

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Matmul: Operator Specification

for yo in range(128):
    for xo in range(128):
        C[yo*8:yo*8+8][xo*8:xo*8+8] = 0
        for ko in range(128):
            for yi in range(8):
                for xi in range(8):
                    for ki in range(8):
                        C[yo*8+yi][xo*8+xi] += A[ko*8+ki][yo*8+yi] * B[ko*8+ki][xo*8+xi]

Loop Tiling for Locality
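In TVM this tiled loop nest is not written by hand; it is derived by applying schedule primitives to the same declaration. A hedged sketch, continuing from the declaration above (2018-era API):

s = tvm.create_schedule(C.op)
y, x = s[C].op.axis
yo, xo, yi, xi = s[C].tile(y, x, x_factor=8, y_factor=8)
ko, ki = s[C].split(s[C].op.reduce_axis[0], factor=8)
s[C].reorder(yo, xo, ko, yi, xi, ki)
print(tvm.lower(s, [A, B, C], simple_mode=True))  # inspect the generated loop nest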

Page 16:

Existing Approach: Engineer Optimized Tensor Operators

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Matmul: Operator Specification

inp_buffer AL[8][8], BL[8][8]
acc_buffer CL[8][8]
for yo in range(128):
    for xo in range(128):
        vdla.fill_zero(CL)
        for ko in range(128):
            vdla.dma_copy2d(AL, A[ko*8:ko*8+8][yo*8:yo*8+8])
            vdla.dma_copy2d(BL, B[ko*8:ko*8+8][xo*8:xo*8+8])
            vdla.fused_gemm8x8_add(CL, AL, BL)
        vdla.dma_copy2d(C[yo*8:yo*8+8,xo*8:xo*8+8], CL)

Map to Accelerators

Human exploration of optimized code
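The scoped buffers AL, BL, and CL in the accelerator version correspond to cache-staging schedule primitives; a minimal sketch, assuming illustrative buffer-scope names rather than a specific accelerator's:

s = tvm.create_schedule(C.op)
AL = s.cache_read(A, "local.inp_buffer", [C])  # stage A in an on-chip input buffer
BL = s.cache_read(B, "local.inp_buffer", [C])  # stage B likewise
CL = s.cache_write(C, "local.acc_buffer")      # accumulate in an on-chip buffer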

Page 17:

Limitations of Existing Approach

cuDNN

Frameworks

Page 22:

Limitations of Existing Approach

cuDNN

Frameworks

New operators introduced by operator fusion optimization; potential benefit: 1.5x speedup

Page 25:

Limitations of Existing Approach

cuDNN

Frameworks

New operators introduced by operator fusion optimization; potential benefit: 1.5x speedup

Engineering intensive

Page 26:

Learning-based Learning System

High-level data flow graph and optimizations

Hardware

Frameworks

Page 28:

Hardware-aware Search Space of Optimized Tensor Programs

Learning-based Learning System

High-level data flow graph and optimizations

Hardware

Frameworks

Page 29:

Hardware-aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

Learning-based Learning System

High-level data flow graph and optimizations

Hardware

Frameworks

Page 30:

Hardware-aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

Learning-based Learning System

High-level data flow graph and optimizations

directly generate optimized programs for new operator workloads and hardware

Hardware

Frameworks

Page 31:

Learning-based Learning System

Frameworks

High-level data flow graph and optimizations

Hardware-aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

Hardware

Page 33:

Hardware-aware Search Space

Hardware

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language (Specification)

Page 34:

Hardware-aware Search Space

Hardware

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language (Specification)

Define a search space of hardware-aware mappings from expression to hardware program

Based on Halide’s compute/schedule separation
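The separation means one compute declaration can be paired with many schedules; a minimal sketch, reusing the matmul declaration C from earlier:

s_naive = tvm.create_schedule(C.op)  # default schedule: the vanilla triple loop
s_tiled = tvm.create_schedule(C.op)  # same compute, different schedule
y, x = s_tiled[C].op.axis
s_tiled[C].tile(y, x, x_factor=8, y_factor=8)  # tiled loop nest, identical result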

Page 35:

Hardware-aware Search Space

CPUs
Compute primitives: scalar, vector
Memory subsystem: L1D/L1I, L2, L3 (implicitly managed)
Loop transformations, cache locality, vectorization

Reuse primitives from prior work: Halide, Loopy

Page 36:

Challenge to Support Diverse Hardware Backends

CPUs GPUs TPU-like specialized Accelerators

Page 37:

Hardware-aware Search Space

GPUs
Compute primitives: scalar, vector
Memory subsystem: RF, TX/L1 (per SM), L2 (mixed management)

Page 38:

Hardware-aware Search Space

GPUs
Compute primitives: scalar, vector
Memory subsystem: RF, TX/L1 (per SM), L2 (mixed management)
Shared memory among compute cores

Page 39:

Hardware-aware Search Space

GPUs
Compute primitives: scalar, vector
Memory subsystem: RF, TX/L1 (per SM), L2 (mixed management)
Shared memory among compute cores
Thread cooperation: use of shared memory
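On GPUs these choices surface as thread bindings and explicit staging into shared memory; a hedged sketch against the 2018-era API, reusing the matmul declaration from earlier:

s = tvm.create_schedule(C.op)
AS = s.cache_read(A, "shared", [C])  # cooperative copy of A into shared memory
y, x = s[C].op.axis
yo, xo, yi, xi = s[C].tile(y, x, x_factor=8, y_factor=8)
s[C].bind(yo, tvm.thread_axis("blockIdx.y"))
s[C].bind(xo, tvm.thread_axis("blockIdx.x"))
s[C].bind(yi, tvm.thread_axis("threadIdx.y"))
s[C].bind(xi, tvm.thread_axis("threadIdx.x"))
s[AS].compute_at(s[C], xo)  # load each tile of A once per thread block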

Page 40:

Hardware-aware Search Space

TPU-like Specialized Accelerators
Compute primitives: tensor
Memory subsystem: Unified Buffer, Acc, FIFO (explicitly managed)

Page 42:

Tensorization Challenge

Compute primitives

Page 43:

Tensorization Challenge

Compute primitives

scalar

Page 44:

Tensorization Challenge

Compute primitives

scalar vector

Page 45:

Tensorization Challenge

Compute primitives

scalar vector tensor

Page 46:

Tensorization Challenge

Compute primitives

scalar vector tensor

w, x = t.placeholder((8, 8)), t.placeholder((8, 8))
k = t.reduce_axis((0, 8))
y = t.compute((8, 8), lambda i, j: t.sum(w[i, k] * x[j, k], axis=k))

def gemm_intrin_lower(inputs, outputs):
    ww_ptr = inputs[0].access_ptr("r")
    xx_ptr = inputs[1].access_ptr("r")
    zz_ptr = outputs[0].access_ptr("w")
    compute = t.hardware_intrin("gemm8x8", ww_ptr, xx_ptr, zz_ptr)
    reset = t.hardware_intrin("fill_zero", zz_ptr)
    update = t.hardware_intrin("fuse_gemm8x8_add", ww_ptr, xx_ptr, zz_ptr)
    return compute, reset, update

gemm8x8 = t.decl_tensor_intrin(y.op, gemm_intrin_lower)

declare behavior

lowering rule to generate hardware intrinsics to carry out the computation

Hardware designer: declare tensor instruction interface with Tensor Expression

Page 47:

Tensorization Challenge

Compute primitives

scalar vector tensor

w, x = t.placeholder((8, 8)), t.placeholder((8, 8))
k = t.reduce_axis((0, 8))
y = t.compute((8, 8), lambda i, j: t.sum(w[i, k] * x[j, k], axis=k))

def gemm_intrin_lower(inputs, outputs):
    ww_ptr = inputs[0].access_ptr("r")
    xx_ptr = inputs[1].access_ptr("r")
    zz_ptr = outputs[0].access_ptr("w")
    compute = t.hardware_intrin("gemm8x8", ww_ptr, xx_ptr, zz_ptr)
    reset = t.hardware_intrin("fill_zero", zz_ptr)
    update = t.hardware_intrin("fuse_gemm8x8_add", ww_ptr, xx_ptr, zz_ptr)
    return compute, reset, update

gemm8x8 = t.decl_tensor_intrin(y.op, gemm_intrin_lower)

declare behavior

lowering rule to generate hardware intrinsics to carry out the computation

Hardware designer: declare tensor instruction interface with Tensor Expression

Tensorize: transform the program to use tensor instructions

scalar → tensor
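Applying the declared intrinsic: assuming a larger matmul C declared with the same 8x8 pattern (an assumption for illustration), the schedule tiles it to expose 8x8x8 blocks and tensorizes the inner loop:

s = tvm.create_schedule(C.op)
i, j = s[C].op.axis
io, jo, ii, ji = s[C].tile(i, j, x_factor=8, y_factor=8)
ko, ki = s[C].split(s[C].op.reduce_axis[0], factor=8)
s[C].reorder(io, jo, ko, ii, ji, ki)
s[C].tensorize(ii, gemm8x8)  # the inner 8x8x8 block becomes one hardware instruction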

Page 48:

Hardware-aware Search Space

TPU-like Specialized Accelerators
Compute primitives: tensor
Memory subsystem: Unified Buffer, Acc, FIFO (explicitly managed)

Page 50:

Hardware-aware Search Space

Hardware

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language

Page 51:

Hardware-aware Search Space

Hardware

Loop Transformations

Thread Bindings

Cache Locality

Primitives in prior work: Halide, Loopy

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language

Page 52:

Hardware-aware Search Space

Hardware

Loop Transformations

Thread Bindings

Cache Locality

Primitives in prior work: Halide, Loopy

Thread Cooperation, Tensorization, Latency Hiding …
New primitives for GPUs, and to enable TPU-like accelerators

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language

Page 53:

Hardware-aware Search Space

Hardware

Loop Transformations

Thread Bindings

Cache Locality

Thread Cooperation, Tensorization, Latency Hiding

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language

Page 55:

Hardware-aware Search Space

Hardware

Loop Transformations

Thread Bindings

Cache Locality

Thread Cooperation, Tensorization, Latency Hiding

Billions of possible optimization choices

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language

Page 56:

Learning-based Learning System

Frameworks

High-level data flow graph and optimizations

Hardware-aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

Hardware

Page 58:

Learning-based Learning System

Frameworks

High-level data flow graph and optimizations

Hardware-aware Search Space of Optimized Tensor Programs

AutoTVM: Machine Learning based Program Optimizer

Hardware
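Concretely, the optimizer consumes a templated schedule whose knobs define the search space, and a learned cost model picks which candidates to measure on real hardware. A minimal AutoTVM sketch, assuming the 0.x-era tvm.autotvm API (knob names and trial counts are illustrative):

import tvm
from tvm import autotvm

@autotvm.template
def matmul(N, L, M, dtype):
    A = tvm.placeholder((N, L), name='A', dtype=dtype)
    B = tvm.placeholder((L, M), name='B', dtype=dtype)
    k = tvm.reduce_axis((0, L), name='k')
    C = tvm.compute((N, M),
                    lambda i, j: tvm.sum(A[i, k] * B[k, j], axis=k), name='C')
    s = tvm.create_schedule(C.op)
    y, x = s[C].op.axis
    cfg = autotvm.get_config()
    cfg.define_split("tile_y", y, num_outputs=2)  # tunable tiling knobs
    cfg.define_split("tile_x", x, num_outputs=2)
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, k, yi, xi)
    return s, [A, B, C]

task = autotvm.task.create(matmul, args=(1024, 1024, 1024, 'float32'), target='llvm')
tuner = autotvm.tuner.XGBTuner(task)  # gradient-boosted cost model guides the search
tuner.tune(n_trial=100,
           measure_option=autotvm.measure_option(builder=autotvm.LocalBuilder(),
                                                 runner=autotvm.LocalRunner(number=5)))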

Page 59:

Some Quick Results

Page 60:

End to End Inference Performance (Nvidia Titan X)

[Bar charts: Time (ms) for ResNet-18 and MobileNet, and for LSTM LM, DQN, and DCGAN. Series: Tensorflow-XLA, Apache MXNet, Tensorflow.]

Page 61:

End to End Inference Performance (Nvidia Titan X)

[Bar charts: Time (ms) for ResNet-18 and MobileNet, and for LSTM LM, DQN, and DCGAN. Series: Tensorflow-XLA, Apache MXNet, Tensorflow.]

Backed by cuDNN

Page 62:

End to End Inference Performance (Nvidia Titan X)

[Bar charts: Time (ms) for ResNet-18 and MobileNet, and for LSTM LM, DQN, and DCGAN. Series: Tensorflow-XLA, Apache MXNet, Tensorflow, TVM without graph optimizations.]

Page 63:

End to End Inference Performance (Nvidia Titan X)

[Bar charts: Time (ms) for ResNet-18 and MobileNet, and for LSTM LM, DQN, and DCGAN. Series: Tensorflow-XLA, Apache MXNet, Tensorflow, TVM without graph optimizations, TVM with all optimizations.]

Page 64:

End to End Inference Performance (Nvidia Titan X)

[Bar charts: Time (ms) for ResNet-18 and MobileNet, and for LSTM LM, DQN, and DCGAN. Series: Tensorflow-XLA, Apache MXNet, Tensorflow, TVM without graph optimizations, TVM with all optimizations.]

Competitive on standard models

Page 65:

End to End Inference Performance (Nvidia Titan X)

[Bar charts: Time (ms) for ResNet-18 and MobileNet, and for LSTM LM, DQN, and DCGAN. Series: Tensorflow-XLA, Apache MXNet, Tensorflow, TVM without graph optimizations, TVM with all optimizations.]

Competitive on standard models

Bigger gap on less conventional models

Page 66:

Across Hardware Platforms

[Bar charts: Time (ms) for ResNet-18, MobileNet, and DQN on ARM CPU (A53), series: Tensorflow Lite, TVM w/o graph opt, TVM; and for ResNet-18, MobileNet, and DQN at float32/float16 on ARM GPU (Mali), series: ARMComputeLib, TVM w/o graph opt, TVM.]

Page 67:

Supporting New Specialized Accelerators

Frameworks

High-level data flow graph and optimizations

Hardware-aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

LLVM, CUDA

Page 68:

Supporting New Specialized Accelerators

Frameworks

High-level data flow graph and optimizations

Hardware-aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

LLVM, CUDA
VTA: Open, Customizable Deep Learning Accelerator

Edge FPGA, Data Center FPGA, ASIC

Page 69:

More on the High-Level Optimizations

Frameworks

Hardware-aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

High-level data flow graph and optimizations

LLVM, CUDA
VTA: Open, Customizable Deep Learning Accelerator

Edge FPGA, Data Center FPGA, ASIC

Page 71:

More on the High-Level Optimizations

Frameworks

Hardware-aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

Relay: High-Level Differentiable IR

LLVM, CUDA
VTA: Open, Customizable Deep Learning Accelerator

Edge FPGA, Data Center FPGA, ASIC
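To make "High-Level Differentiable IR" concrete, a small Relay sketch (assumes the tvm.relay API of roughly that era; the two-op graph is illustrative):

from tvm import relay

x = relay.var("x", shape=(1, 784), dtype="float32")
w = relay.var("w", shape=(128, 784), dtype="float32")
y = relay.nn.relu(relay.nn.dense(x, w))  # a tiny dataflow graph: dense + ReLU
func = relay.Function([x, w], y)
print(func)  # text form of the IR, the input to graph-level optimizations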

Page 72:

TVM: Learning-based Deep Learning Compiler

Page 73:

TVM: Learning-based Deep Learning Compiler

High-Level Differentiable IR

Page 74:

TVM: Learning-based Deep Learning Compiler

High-Level Differentiable IR

Tensor Expression IR

Page 75:

TVM: Learning-based Deep Learning Compiler

High-Level Differentiable IR

Tensor Expression IR

LLVM, CUDA, Metal

Page 76:

TVM: Learning-based Deep Learning Compiler

High-Level Differentiable IR

Tensor Expression IR

LLVM, CUDA, Metal
VTA

Edge FPGA

Cloud FPGA, ASIC

Page 77:

TVM: Learning-based Deep Learning Compiler

High-Level Differentiable IR

Tensor Expression IR

LLVM, CUDA, Metal
VTA

Edge FPGA

Cloud FPGA, ASIC

Optimization

AutoTVM

AutoVTA

Hardware Fleet

Page 80:

TVM Open Source Community

• Prefer public archivable discussion

• Open RFC discussion

• Bring in new members by merit

https://docs.tvm.ai/contribute/community.html

Page 81:

TVM: Learning-based Deep Learning Compiler
