31
14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International Solid-State Circuits Conference 1 of 31 A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating Michael Price*, James Glass, Anantha Chandrakasan MIT, Cambridge, MA * now at Analog Devices, Cambridge, MA

A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

Embed Size (px)

Citation preview

Page 1: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 1 of 31

A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models

and Voice-Activated Power Gating

Michael Price*, James Glass, Anantha Chandrakasan

MIT, Cambridge, MA* now at Analog Devices, Cambridge, MA

Page 2: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 2 of 31

Diversification of speech interfaces

Today• Personal assistant (smartphone)• Search (smartphone, PC)

Future• Wearables• Appliances• Robots

Goal:Complement or replace touchscreen interfaces

Page 3: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 3 of 31

Computational demands of ASR(ASR = Automatic speech recognition)

Today’s non-volatile memory:MLC NAND flash

At 100 pJ/bit: 100 MB/s -> 80 mW

• Real-time ASR is now feasible on x86 PCs• ARM SoCs can run with minor performance degradation• But what about everyone else?

• Today’s model: offload to cloud servers• Tomorrow: offload to cloud OR hardware accelerator

System power concerns:• Memory• I/O signaling

If processor efficiency is optimized in isolation, these can exceed core power

Page 4: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 4 of 31

Hardware accelerated speech interface

Primary use cases:1. Low power2. Low system complexity3. Low latency

Page 5: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 5 of 31

Outline1. Introduction

a) Motivation and scopeb) ASR formulation

2. Acoustic modeling with deep neural networks (DNN)a) Performance with limited memory bandwidthb) Parallel architecturec) Details of execution unit design

3. Search (Viterbi) architecture4. Voice activity detection (VAD)

a) System power modelb) Modulation frequency (MF) algorithm and architecture

5. Test chipa) Circuit featuresb) Measured performance

Page 6: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 6 of 31

ASR formulation – HMM

Inference: search using Viterbi algorithm

Page 7: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 7 of 31

ASR formulation – Acoustic model• Training clusters WFST states into “tied states” (senones)• Acoustic model approximates p(y | i)

where i is the tied state index and y is the feature vector (typ. 10—50 dims.)

Example PDFs(MFCC features projected to 2-D)

Models have many parameters – large memory requirement• Size: Typical ASR model is ~50 MB – must be stored off-chip• Bandwidth: Naïve evaluation would require 5 GB/s

(Target for low-power ASR: ~10 MB/s)

Page 8: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 8 of 31

Bandwidth-limited acoustic models

Comparison of frameworks considers:• Quantization• Parallelization• Accuracy

GMM = Gaussian mixture model

SGMM = Subspace GMMDNN = Deep neural network

Page 9: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 9 of 31

Feed-forward neural network (DNN)

Top image is from: Michael A. Nielsen, “Neural Networks and Deep Learning”, Determination Press, 2015.

• Sequence of simplenonlinear transformations

• Used in ASR to evaluate allacoustic likelihoods jointly

• This work: feed-forwardnetworks only(no time dependence)

Page 10: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 10 of 31

DNN evaluator architecture

Page 11: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 11 of 31

Execution unit assignment

Page 12: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 12 of 31

Execution unit design

• Complexity hasbeen minimized:• Executes arithmetic

commands fromsequencer

• Writes results back tolocal SRAM

Page 13: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 13 of 31

Sigmoid approximation

Piecewise (low-order) polynomial fit• Less area than plain LUT• Simpler to evaluate

than high-order fit

Chebyshev approximation• Better accuracy than Taylor series

Page 14: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 14 of 31

Sigmoid approximation

• Polynomial evaluated using Horner’s method• Use unpipelined version to save area• 7-cycle latency; 5.7k gates (mostly the multiplier)

Page 15: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 15 of 31

Search – Viterbi algorithmBaseline architecture

From: M. Price et al., A 6mW 5k-word Speech Recognizer using WFST Models, ISSCC 2014.

Page 16: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 16 of 31

Search – Viterbi algorithmImproved architecture

Merged state lists: 11% area savings

Compression and caching: memory bandwidth reduction

Word lattice:memory bandwidthreduction

Page 17: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 17 of 31

Search – Word lattice

Discard intermediate states between words

From state lattice… … to word lattice

Page 18: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 18 of 31

Search – Word lattice (cont.)

• Light workloads: no external memory writes(pruning keeps word lattice within internal memory)

• Heavy workloads: 8x reduction (writing only word arcs)

Page 19: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 19 of 31

ASR accuracy and speed

At 80 MHz, same search speed as software on 3.7 GHz Xeon

Similar accuracy to software (Kaldi) with 145k word vocab.

Page 20: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 20 of 31

Voice activity detection (VAD)

Use VAD to control power gate for ASR(Automatic wake-up)

Page 21: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 21 of 31

VAD power impacts

Downstream system contribution

Consider typical values:• pVAD < 100 μW• pdownstream > 1 mW• D < 0.05

If pFA is significant, averaged downstream power exceeds VAD power• Optimize for pM and pFA (VAD accuracy)

rather than VAD power

Page 22: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 22 of 31

Modulation frequency VAD

Page 23: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 23 of 31

Modulation frequency VAD (cont.)• MF features fed to NN

classifier

• Small network (e.g. 3x32)is sufficient

• Fewer parameters thanSVM

• NN architecture strippeddown from ASR

• No sparse matrix support• No parallel evaluation• Still quantized

• No external memory• Parameters supplied at

startup over SPI bus

Page 24: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 24 of 31

VAD performance comparison

• Two tasks: Aurora2 (left), Forklift/Switchboard (right—more difficult)• MF algorithm outperforms energy-based (EB) and harmonicity (HM)• Performance improves with more training data

Page 25: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 25 of 31

ASR/VAD interfaces• VAD runs continuously• Supervisory logic places ASR in reset during non-speech input• ASR requests audio samples buffered by VAD after wake-up

Page 26: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 26 of 31

Test chip specifications

Page 27: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 27 of 31

Multivoltage design• Memory tends to require

higher voltage than logic• Separate memory supplies

• I/O level shifters (1.8—2.5 V) cannot handle low logic level

• Intermediate supply for top level (with supervisory logic)

• Relationship between supply voltages is constrained

• Level shifters operate in one direction

Page 28: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 28 of 31

Clock gating

Explicit clock gating complements automatic clock gating

Page 29: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 29 of 31

Test setup

• On-board control of clocks and power supplies• FPGA provides host and memory interfaces• Tested live audio, real-time streaming, batch processing

Page 30: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 30 of 31

Measured performance and scalability

45x range

Note: ASR model sizes and search parameters vary between tasks;ASR and VAD tested independently

Page 31: A Scalable Speech Recognizer with Deep-Neural · PDF file14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating © 2017 IEEE International

14.4: A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Modelsand Voice-Activated Power Gating

© 2017 IEEE International Solid-State Circuits Conference 31 of 31

Conclusions• DNNs can improve performance of HW ASR

• Even with restricted memory bandwidth: < 10 MB/s

• DNN based VAD can be robust and compact• Model stored on-chip (< 12 kB)• Low power (22.3 μW)

• ASR is not only about neural networks• Significant effort required for feature extraction and search• NN architecture developed with knowledge of application

• Combination of algorithm, architecture, and circuittechniques delivers:

• Improved accuracy – fewer word errors• Improved programmability – train using standard tools• Improved scalability – lower power consumption