Using ML for ML to span the gamut of TinyML hardware
tinyML Summit 2020. Jason Knight, Co-founder and CPO at OctoML
Motivation
Cambrian explosion of models, workloads, and use cases: CNN, GAN, RNN, MLP, DQN.
Rapidly evolving ML software ecosystem.
Silicon scaling limitations (Dennard and Moore).
Cambrian explosion of HW backends; heterogeneous HW.
Growing set of requirements: cost, latency, power, security.
Hard enough on server-class hardware…
But now we want to do it on bare-metal devices?
Bare-metal devices: FPGAs, ARM M-class and RISC-V MCUs, etc.
Programming bare-metal devices is cumbersome.
Running ML on bare-metal hardware is hard.
Starting from an ML model, where do you even begin? One option is to write the ML code from scratch in C.
Going through an ML framework instead is still pretty hard: kernels are slow and operator coverage is incomplete.
So you hand-optimize kernels, and the framework is not much help there. Then come the bugs, so you also end up debugging kernels by hand.
Running ML models, debugging kernels, hand-optimizing kernels: optimizing ML on bare-metal hardware is a long road.
μTVM simplifies ML software on bare-metal: running ML models, debugging kernels, and optimizing kernels happen automatically, instead of by hand on top of an ML framework.
Our Goal
Minimize the effort to run and optimize ML on bare-metal.
So what's our approach?
The TVM in µTVM
How do we get from high-level ML frameworks to the full range of hardware backends, from CPUs and GPUs to FPGAs and ASICs?
TVM bridges that gap: models are lowered through a High-Level Differentiable IR, then a Tensor Expression IR, and finally to backends such as LLVM, CUDA, and VTA (for FPGA and ASIC targets).
Does the last step require hand-optimization? No: AutoTVM handles it.
Where TVM is being used today
Every "Alexa" wake-up today, across all devices, uses a model optimized with TVM.
"[TVM enabled] real-time on mobile CPUs for free... We are excited about the performance TVM achieves." More than 85x speed-up for a speech recognition model.
Bing query understanding: 112ms (TensorFlow) -> 34ms (TVM). QnA bot: 73ms -> 28ms (CPU), 10.1ms -> 5.5ms (GPU).
"TVM is key to ML Access on Hexagon"
Cf: TVM Conference 2019 (Dec), videos and slides available online.
The TVM in µTVM
Many hardware targets already enjoy speedups from TVM through the LLVM, CUDA, and VTA backends...
Except for microcontrollers. Closing that gap is where µTVM slots into the stack.
Today's Talk
The talk walks through the TVM stack in three parts: (1) AutoTVM, (2) µTVM, and (3) AutoTVM on µTVM.
AutoTVM
Given TVM plus a program, AutoTVM runs a learning-driven loop against the hardware backend:
● compile the program
● run it on the hardware backend and measure
● return timing feedback
● improve the implementation, and repeat
In this way AutoTVM automatically adapts to the hardware type by learning.
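As a concrete, hedged illustration, this is roughly what that loop looks like through TVM's Python AutoTVM API on a local, server-class backend; the workload, trial count, and log file name are illustrative, and exact signatures have shifted between TVM releases:

    from tvm import autotvm, relay
    from tvm.relay import testing

    # A small workload whose tunable conv2d tasks we extract.
    resnet, params = testing.resnet.get_workload(num_layers=18, batch_size=1)
    tasks = autotvm.task.extract_from_program(resnet["main"], target="llvm", params=params,
                                              ops=(relay.op.get("nn.conv2d"),))

    # Candidate implementations are compiled locally and timed on the backend.
    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=5, repeat=3))

    for task in tasks:
        tuner = autotvm.tuner.XGBTuner(task)
        # Each trial: compile a candidate, run and measure it on the hardware,
        # feed the timing back, and let the cost model propose a better one.
        tuner.tune(n_trial=200,
                   measure_option=measure_option,
                   callbacks=[autotvm.callback.log_to_file("tuning.log")])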
Today's Talk, part 2: µTVM
µTVM joins LLVM, CUDA, and VTA as a new backend in the TVM stack. This part covers using µTVM, the µTVM runtime, and the graph runtime.
How easy is it to use μTVM?
Bringing up a new device takes two things:
• Implement an interface to:
 - Read and write to device memory
 - Start code execution
• Provide a cross-compiler for the device
ResNet Example in Mainline TVM
image = ...            # a cat photo
resnet, params = get_resnet()
module = graph_runtime_build(resnet, params)
module.run(image)
> 'cat'
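The snippet above is slide pseudocode. As a rough sketch, it maps onto the mainline TVM Python API of that era approximately as follows (relay, relay.testing, and graph_runtime are real TVM modules, but exact signatures have moved between releases, and real image preprocessing is omitted):

    import numpy as np
    import tvm
    from tvm import relay
    from tvm.relay import testing
    from tvm.contrib import graph_runtime

    # "get_resnet()": grab a ResNet workload plus randomly initialized parameters.
    resnet, params = testing.resnet.get_workload(num_layers=18, batch_size=1)

    # "graph_runtime_build()": compile the model and wrap it in the graph runtime.
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(resnet, target="llvm", params=params)
    module = graph_runtime.create(graph, lib, tvm.cpu())
    module.set_input(**params)

    # "module.run(image)": feed a (preprocessed) image and read back the prediction.
    image = np.random.rand(1, 3, 224, 224).astype("float32")  # stand-in for the cat photo
    module.set_input("data", image)
    module.run()
    prediction = module.get_output(0).asnumpy().argmax()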
TVM → µTVM
Two changes take the mainline example to µTVM: wrap the work in a session, and call a different build function.
Mainline TVM:
resnet, params = get_resnet()
module = graph_runtime_build(resnet, params)
µTVM:
resnet, params = get_resnet()
with micro.Session('openocd') as sess:
    module = sess.build(resnet, params)
Today's Talk: the µTVM runtime
µTVM is built from three pieces: a C code generator, a µDevice interface, and the µTVM runtime itself.
Code Generation
In mainline TVM, the Tensor Expression IR is lowered through backends like LLVM and CUDA. Why not use LLVM for microcontrollers as well? Lack of support: LLVM does not cover many bare-metal targets, so µTVM emits plain C instead and hands it to the device's own cross-compiler.
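As a small, hedged illustration of what C emission looks like (this uses mainline TVM's "c" target rather than anything µTVM-specific, and the te.* names reflect roughly the TVM 0.6/0.7 Python API):

    import tvm
    from tvm import te

    # A trivial element-wise add expressed in the Tensor Expression IR.
    n = te.var("n")
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    C = te.compute((n,), lambda i: A[i] + B[i], name="C")
    s = te.create_schedule(C.op)

    # Lower to C source instead of LLVM; the result can be fed to a cross-compiler.
    mod = tvm.build(s, [A, B, C], target="c", name="vector_add")
    print(mod.get_source())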
µDevice Interface
The device interface boils down to three operations: read, write, and execute. How the host reaches the board (sockets, JTAG, ...) doesn't matter.
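A minimal sketch of what such a host-side interface could look like; the class and method names below are hypothetical, chosen only to mirror the read/write/execute idea, and are not the actual µTVM API:

    from abc import ABC, abstractmethod

    class MicroDeviceInterface(ABC):
        """Hypothetical host-side handle to a bare-metal device (via JTAG, a socket, ...)."""

        @abstractmethod
        def read(self, addr: int, nbytes: int) -> bytes:
            """Read nbytes of device memory starting at addr."""

        @abstractmethod
        def write(self, addr: int, data: bytes) -> None:
            """Write data into device memory starting at addr."""

        @abstractmethod
        def execute(self, entry_addr: int) -> None:
            """Start execution at entry_addr and wait until the device signals completion."""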
The Runtime
The μTVM runtime ties the C code generator and the μDevice interface together. Here it is in action, step by step.
The Runtime in Action
1. Set up a read/write/execute communication interface with the board.
2. The C code generator lowers TVM IR to C code (func.c).
3. The device's cross-compiler (e.g. gcc) compiles the generated code into an object file (func.o).
4. μTVM remaps the generated binary (using the ld linker) to custom device memory locations, so that multiple libraries can be loaded.
5. A custom loader places the remapped binary onto the device.
6. Invoking the function on the host transfers the parameters to the device for execution and returns the result (e.g. 'cat').
Today's Talk, part 3: AutoTVM on µTVM
AutoTVM on µTVM
● Same pipeline as usual
● Load kernels into RAM instead of flash
End-to-End CIFAR-10 Evaluation
Replicated an int8-quantized CNN from an ARM Mbed tutorial. The network: Conv2D (3@32x32 -> 32@32x32), MaxPool2D (-> 32@16x16), Conv2D (-> 32@16x16), AvgPool2D (-> 32@8x8), Conv2D (-> 64@8x8), AvgPool2D (-> 64@4x4), Dense (-> 1x10).
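To make the shape progression concrete, below is a rough float32 Relay sketch with the same layer sequence and feature-map sizes. The kernel sizes, padding, pooling windows, and ReLU activations are assumptions chosen to reproduce the sizes above; the real model was int8-quantized, and the relay.* names reflect roughly the TVM 0.6/0.7 API:

    from tvm import relay

    def cifar10_cnn(batch=1):
        # 3@32x32 input image
        data = relay.var("data", shape=(batch, 3, 32, 32))
        # Conv2D -> 32@32x32 (5x5 kernel, pad 2 keeps the spatial size; an assumption)
        x = relay.nn.conv2d(data, relay.var("w1", shape=(32, 3, 5, 5)),
                            padding=(2, 2), channels=32, kernel_size=(5, 5))
        x = relay.nn.relu(x)
        # MaxPool2D -> 32@16x16
        x = relay.nn.max_pool2d(x, pool_size=(2, 2), strides=(2, 2))
        # Conv2D -> 32@16x16
        x = relay.nn.conv2d(x, relay.var("w2", shape=(32, 32, 5, 5)),
                            padding=(2, 2), channels=32, kernel_size=(5, 5))
        x = relay.nn.relu(x)
        # AvgPool2D -> 32@8x8
        x = relay.nn.avg_pool2d(x, pool_size=(2, 2), strides=(2, 2))
        # Conv2D -> 64@8x8
        x = relay.nn.conv2d(x, relay.var("w3", shape=(64, 32, 5, 5)),
                            padding=(2, 2), channels=64, kernel_size=(5, 5))
        x = relay.nn.relu(x)
        # AvgPool2D -> 64@4x4
        x = relay.nn.avg_pool2d(x, pool_size=(2, 2), strides=(2, 2))
        # Dense -> 1x10 class scores
        x = relay.nn.batch_flatten(x)
        x = relay.nn.dense(x, relay.var("wd", shape=(10, 64 * 4 * 4)), units=10)
        return relay.Function(relay.analysis.free_vars(x), x)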
Preliminary CIFAR-10 CNN Results
● Ran on ARM Cortex-M7
● Compared against CMSIS-NN
● Vanilla template
● ~5 hours of tuning
● No vectorization
Preliminary Int-8 Conv2D Results
Compared against arm_convolve_HWC_q7_fast and arm_convolve_HWC_q7_RGB in CMSIS-NN.
Coming Soon - Self-hosted models
Coming Soon - Ultra low bit-width quantization
Riptide: Fast End-to-End Binarized Neural Networks. MLSys 2020 (March 3rd). Joshua Fromm · Meghan Cowan · Matthai Philipose · Luis Ceze · Shwetak Patel. (Benchmark shown: SqueezeNet on Raspberry Pi 3.)
• TVM already supports flexible code generation for a variety of data types
• Natural to combine this with µTVM
• See Josh Fromm's MLSys talk on March 3rd to learn more
Coming soon - µTVM as a service
• OctoML is offering a hosted version of AutoTVM for cloud, mobile, and embedded:
 • Cloud CPU/GPU
 • ARM A-class CPU/GPU
 • ARM M-class microcontrollers
 • More on request
• Trained model in, optimized binary out
• Currently in private beta
The Octomizer: TensorFlow, PyTorch, and ONNX serialized models go in via an API and web UI, auto-tuning runs on OctoML clusters, and optimized deployment artifacts come out.
Status and next steps
• Functional on x86, ARM, and RISC-V
• In-depth writeup early March on our blog, with upstream TVM documentation/tutorial
• Follow us at https://octoml.ai or @octoml
• Interop with the CMSIS-NN library and TF Lite Micro (better together!)
• First-class SIMD vectorization support for ARM schedules (in addition to the current tensorization approach)
Questions or interest? Please reach out: Jason Knight, jknight@octoml.ai
Acknowledgments: Logan Weber (AutoTVM), Joshua Fromm (RipTide)
Thanks!
Copyright Notice
The presentation(s) in this publication comprise the proceedings of tinyML® Summit 2020. The content reflects the opinion of the authors and their respective companies. This version of the presentation may differ from the version that was presented at the tinyML Summit. The inclusion of presentations in this publication does not constitute an endorsement by tinyML Foundation or the sponsors.
There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.
tinyML is a registered trademark of the tinyML Foundation.
www.tinyML.org