Using ML for ML to span the gamut of TinyML hardware
tinyML Summit 2020. Jason Knight, Co-founder and CPO at OctoML
Motivation
Cambrian explosion of models, workloads, and use cases: CNN, GAN, RNN, MLP, DQN.
Rapidly evolving ML software ecosystem.
Silicon scaling limitations (Dennard and Moore).
Cambrian explosion of HW backends; heterogeneous HW.
Growing set of requirements: cost, latency, power, security.
Hard enough on server-class hardware…
But now we want to do it on bare-metal devices?
Bare-metal devices: FPGAs, ARM M-class and RISC-V MCUs, etc.
Programming bare-metal devices is cumbersome.
Running ML on bare-metal hardware is hard.
Starting from an ML model, where do you even begin? One option is to write the ML code from scratch in C.
Going through an ML framework instead is still pretty hard: kernels are slow and operator coverage is incomplete.
So you hand-optimize kernels, and the framework is not much help there. Then come the bugs, so you also end up debugging kernels by hand.
Running ML models, debugging kernels, hand-optimizing kernels: optimizing ML on bare-metal hardware is a long road.
μTVM simplifies ML software on bare-metal: running ML models, debugging kernels, and optimizing kernels happen automatically, instead of by hand on top of an ML framework.
Our Goal
Minimize the effort to run and optimize ML on bare-metal.
So what's our approach?
The TVM in µTVM
How do we get from high-level ML frameworks to the full range of hardware backends, from CPUs and GPUs to FPGAs and ASICs?
TVM bridges that gap: models are lowered through a High-Level Differentiable IR, then a Tensor Expression IR, and finally to backends such as LLVM, CUDA, and VTA (for FPGA and ASIC targets).
Does the last step require hand-optimization? No: AutoTVM handles it.
Where TVM is being used today
Every "Alexa" wake-up today, across all devices, uses a model optimized with TVM.
"[TVM enabled] real-time on mobile CPUs for free... We are excited about the performance TVM achieves." More than 85x speed-up for a speech recognition model.
Bing query understanding: 112ms (TensorFlow) -> 34ms (TVM). QnA bot: 73ms -> 28ms (CPU), 10.1ms -> 5.5ms (GPU).
"TVM is key to ML Access on Hexagon"
Cf: TVM Conference 2019 (Dec), videos and slides available online.
The TVM in µTVM
Many hardware targets already enjoy speedups from TVM through the LLVM, CUDA, and VTA backends...
Except for microcontrollers. Closing that gap is where µTVM slots into the stack.
Today's Talk
The talk walks through the TVM stack in three parts: (1) AutoTVM, (2) µTVM, and (3) AutoTVM on µTVM.
AutoTVM
Given TVM plus a program, AutoTVM runs a learning-driven loop against the hardware backend:
● compile the program
● run it on the hardware backend and measure
● return timing feedback
● improve the implementation, and repeat
In this way AutoTVM automatically adapts to the hardware type by learning.
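As a concrete, hedged illustration, this is roughly what that loop looks like through TVM's Python AutoTVM API on a local, server-class backend; the workload, trial count, and log file name are illustrative, and exact signatures have shifted between TVM releases:

    from tvm import autotvm, relay
    from tvm.relay import testing

    # A small workload whose tunable conv2d tasks we extract.
    resnet, params = testing.resnet.get_workload(num_layers=18, batch_size=1)
    tasks = autotvm.task.extract_from_program(resnet["main"], target="llvm", params=params,
                                              ops=(relay.op.get("nn.conv2d"),))

    # Candidate implementations are compiled locally and timed on the backend.
    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=5, repeat=3))

    for task in tasks:
        tuner = autotvm.tuner.XGBTuner(task)
        # Each trial: compile a candidate, run and measure it on the hardware,
        # feed the timing back, and let the cost model propose a better one.
        tuner.tune(n_trial=200,
                   measure_option=measure_option,
                   callbacks=[autotvm.callback.log_to_file("tuning.log")])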
Today's Talk, part 2: µTVM
µTVM joins LLVM, CUDA, and VTA as a new backend in the TVM stack. This part covers using µTVM, the µTVM runtime, and the graph runtime.
How easy is it to use μTVM?
Bringing up a new device takes two things:
• Implement an interface to:
 - Read and write to device memory
 - Start code execution
• Provide a cross-compiler for the device
ResNet Example in Mainline TVM
image = ...            # a cat photo
resnet, params = get_resnet()
module = graph_runtime_build(resnet, params)
module.run(image)
> 'cat'
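The snippet above is slide pseudocode. As a rough sketch, it maps onto the mainline TVM Python API of that era approximately as follows (relay, relay.testing, and graph_runtime are real TVM modules, but exact signatures have moved between releases, and real image preprocessing is omitted):

    import numpy as np
    import tvm
    from tvm import relay
    from tvm.relay import testing
    from tvm.contrib import graph_runtime

    # "get_resnet()": grab a ResNet workload plus randomly initialized parameters.
    resnet, params = testing.resnet.get_workload(num_layers=18, batch_size=1)

    # "graph_runtime_build()": compile the model and wrap it in the graph runtime.
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(resnet, target="llvm", params=params)
    module = graph_runtime.create(graph, lib, tvm.cpu())
    module.set_input(**params)

    # "module.run(image)": feed a (preprocessed) image and read back the prediction.
    image = np.random.rand(1, 3, 224, 224).astype("float32")  # stand-in for the cat photo
    module.set_input("data", image)
    module.run()
    prediction = module.get_output(0).asnumpy().argmax()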
TVM → µTVM
Two changes take the mainline example to µTVM: wrap the work in a session, and call a different build function.
Mainline TVM:
resnet, params = get_resnet()
module = graph_runtime_build(resnet, params)
µTVM:
resnet, params = get_resnet()
with micro.Session('openocd') as sess:
    module = sess.build(resnet, params)
Today's Talk: the µTVM runtime
µTVM is built from three pieces: a C code generator, a µDevice interface, and the µTVM runtime itself.
Code Generation
In mainline TVM, the Tensor Expression IR is lowered through backends like LLVM and CUDA. Why not use LLVM for microcontrollers as well? Lack of support: LLVM does not cover many bare-metal targets, so µTVM emits plain C instead and hands it to the device's own cross-compiler.
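As a small, hedged illustration of what C emission looks like (this uses mainline TVM's "c" target rather than anything µTVM-specific, and the te.* names reflect roughly the TVM 0.6/0.7 Python API):

    import tvm
    from tvm import te

    # A trivial element-wise add expressed in the Tensor Expression IR.
    n = te.var("n")
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    C = te.compute((n,), lambda i: A[i] + B[i], name="C")
    s = te.create_schedule(C.op)

    # Lower to C source instead of LLVM; the result can be fed to a cross-compiler.
    mod = tvm.build(s, [A, B, C], target="c", name="vector_add")
    print(mod.get_source())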
µDevice Interface
The device interface boils down to three operations: read, write, and execute. How the host reaches the board (sockets, JTAG, ...) doesn't matter.
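A minimal sketch of what such a host-side interface could look like; the class and method names below are hypothetical, chosen only to mirror the read/write/execute idea, and are not the actual µTVM API:

    from abc import ABC, abstractmethod

    class MicroDeviceInterface(ABC):
        """Hypothetical host-side handle to a bare-metal device (via JTAG, a socket, ...)."""

        @abstractmethod
        def read(self, addr: int, nbytes: int) -> bytes:
            """Read nbytes of device memory starting at addr."""

        @abstractmethod
        def write(self, addr: int, data: bytes) -> None:
            """Write data into device memory starting at addr."""

        @abstractmethod
        def execute(self, entry_addr: int) -> None:
            """Start execution at entry_addr and wait until the device signals completion."""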
The Runtime
The μTVM runtime ties the C code generator and the μDevice interface together. Here it is in action, step by step.
The Runtime in Action
1. Set up a read/write/execute communication interface with the board.
2. The C code generator lowers TVM IR to C code (func.c).
3. The device's cross-compiler (e.g. gcc) compiles the generated code into an object file (func.o).
4. μTVM remaps the generated binary (using the ld linker) to custom device memory locations, so that multiple libraries can be loaded.
5. A custom loader places the remapped binary onto the device.
6. Invoking the function on the host transfers the parameters to the device for execution and returns the result (e.g. 'cat').
Today's Talk, part 3: AutoTVM on µTVM
AutoTVM on µTVM
● Same pipeline as usual
● Load kernels into RAM instead of flash
End-to-End CIFAR-10 Evaluation
Replicated an int8-quantized CNN from an ARM Mbed tutorial. The network: Conv2D (3@32x32 -> 32@32x32), MaxPool2D (-> 32@16x16), Conv2D (-> 32@16x16), AvgPool2D (-> 32@8x8), Conv2D (-> 64@8x8), AvgPool2D (-> 64@4x4), Dense (-> 1x10).
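To make the shape progression concrete, below is a rough float32 Relay sketch with the same layer sequence and feature-map sizes. The kernel sizes, padding, pooling windows, and ReLU activations are assumptions chosen to reproduce the sizes above; the real model was int8-quantized, and the relay.* names reflect roughly the TVM 0.6/0.7 API:

    from tvm import relay

    def cifar10_cnn(batch=1):
        # 3@32x32 input image
        data = relay.var("data", shape=(batch, 3, 32, 32))
        # Conv2D -> 32@32x32 (5x5 kernel, pad 2 keeps the spatial size; an assumption)
        x = relay.nn.conv2d(data, relay.var("w1", shape=(32, 3, 5, 5)),
                            padding=(2, 2), channels=32, kernel_size=(5, 5))
        x = relay.nn.relu(x)
        # MaxPool2D -> 32@16x16
        x = relay.nn.max_pool2d(x, pool_size=(2, 2), strides=(2, 2))
        # Conv2D -> 32@16x16
        x = relay.nn.conv2d(x, relay.var("w2", shape=(32, 32, 5, 5)),
                            padding=(2, 2), channels=32, kernel_size=(5, 5))
        x = relay.nn.relu(x)
        # AvgPool2D -> 32@8x8
        x = relay.nn.avg_pool2d(x, pool_size=(2, 2), strides=(2, 2))
        # Conv2D -> 64@8x8
        x = relay.nn.conv2d(x, relay.var("w3", shape=(64, 32, 5, 5)),
                            padding=(2, 2), channels=64, kernel_size=(5, 5))
        x = relay.nn.relu(x)
        # AvgPool2D -> 64@4x4
        x = relay.nn.avg_pool2d(x, pool_size=(2, 2), strides=(2, 2))
        # Dense -> 1x10 class scores
        x = relay.nn.batch_flatten(x)
        x = relay.nn.dense(x, relay.var("wd", shape=(10, 64 * 4 * 4)), units=10)
        return relay.Function(relay.analysis.free_vars(x), x)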
Preliminary CIFAR-10 CNN Results
● Ran on ARM Cortex-M7
● Compared against CMSIS-NN
● Vanilla template
● ~5 hours of tuning
● No vectorization
Preliminary Int-8 Conv2D Results
Compared against arm_convolve_HWC_q7_fast and arm_convolve_HWC_q7_RGB in CMSIS-NN.
Coming Soon - Self-hosted models
Coming Soon - Ultra low bit-width quantization
Riptide: Fast End-to-End Binarized Neural Networks. MLSys 2020 (March 3rd). Joshua Fromm · Meghan Cowan · Matthai Philipose · Luis Ceze · Shwetak Patel. (Benchmark shown: SqueezeNet on Raspberry Pi 3.)
• TVM already supports flexible code generation for a variety of data types
• Natural to combine this with µTVM
• See Josh Fromm's MLSys talk on March 3rd to learn more
Coming soon - µTVM as a service
• OctoML is offering a hosted version of AutoTVM for cloud, mobile, and embedded:
 • Cloud CPU/GPU
 • ARM A-class CPU/GPU
 • ARM M-class microcontrollers
 • More on request
• Trained model in, optimized binary out
• Currently in private beta
The Octomizer: TensorFlow, PyTorch, and ONNX serialized models go in via an API and web UI, auto-tuning runs on OctoML clusters, and optimized deployment artifacts come out.
Status and next steps
• Functional on x86, ARM, and RISC-V
• In-depth writeup early March on our blog, with upstream TVM documentation/tutorial
• Follow us at https://octoml.ai or @octoml
• Interop with the CMSIS-NN library and TF Lite Micro (better together!)
• First-class SIMD vectorization support for ARM schedules (in addition to the current tensorization approach)
Questions or interest? Please reach out: Jason Knight, jknight@octoml.ai
Acknowledgments: Logan Weber (AutoTVM), Joshua Fromm (RipTide)
Thanks!
Copyright Notice
The presentation(s) in this publication comprise the proceedings of tinyML® Summit 2020. The content reflects the opinion of the authors and their respective companies. This version of the presentation may differ from the version that was presented at the tinyML Summit. The inclusion of presentations in this publication does not constitute an endorsement by tinyML Foundation or the sponsors.
There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.
tinyML is a registered trademark of the tinyML Foundation.
www.tinyML.org