Power Efficient Computation through Processor & Algorithm Co-Design

Bryan Donoghue

NMI: High Performance Digital Systems & Applications Event

11 September 2017 Sapphyre-P-009 v1.0

Bryan Donoghue Biography:

Bryan Donoghue is Group Leader of the Digital Systems Group at Cambridge Consultants. He has over 20 years’ experience in the field of electronics and chip design at Cambridge Consultants, 3Com Networks and Hewlett Packard Research Laboratories. Bryan holds 15 patents in the fields of wireless communications and ASIC design. His current areas of technical interest are in fully-digital radio design and in processor optimisation for signal processing and machine learning.

Power Efficient Computation

When you don’t care:

– Do you know or care whether your desktop PC consumes 2 W, 20 W or 200 W?

When you do care:

– Battery-powered systems
  – Cell phones
  – Tablets
  – Laptops

– Cooling-constrained systems
  – Cloud data centres

What is driving power-constrained computation?

Wireless modulation standards

– GSM: GMSK

– 3G: CDMA

– LTE: OFDM, 64QAM

Machine Learning

– Cars: latency-sensitive image recognition

– IoT: Tiered wake-up

– Cloud Systems: cooling-constrained massive systems

Computation Systems

Conventional solutions trade flexibility for power-efficiency

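[Figure: efficiency plotted against flexibility, both axes running up to HIGH. Points shown: pure hardware, conventional DSP/GPU and microprocessor, with "Best" and "Worst" labels and a "?" marking an open position.]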

How to improve power efficiency?

Gates = Power

Reduce the ratio of control and datapath logic to computation logic

Processor                    Number of Multipliers   Gate Count   MACs/MegaGate
16*16 Multiply-Accumulator   1                       5K           200
Ceva Teaklite-II             1                       100K         10
ARM Cortex-R7                1                       1350K        0.74
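
The MACs/MegaGate figure is the number of multipliers divided by the gate count in millions. A minimal Python sketch that reproduces the column from the gate counts quoted in the table (the function name `macs_per_megagate` is illustrative and not from the slides):

```python
def macs_per_megagate(multipliers: int, gate_count: float) -> float:
    """Multipliers available per million gates of logic."""
    return multipliers / (gate_count / 1_000_000)


# (number of multipliers, gate count) as quoted on the slide.
processors = {
    "16*16 Multiply-Accumulator": (1, 5_000),
    "Ceva Teaklite-II": (1, 100_000),
    "ARM Cortex-R7": (1, 1_350_000),
}

for name, (multipliers, gates) in processors.items():
    ratio = macs_per_megagate(multipliers, gates)
    print(f"{name}: {ratio:.2f} MACs/MegaGate")
```

The bare multiply-accumulator sets the upper bound; every gate of control and datapath logic added around it lowers the ratio.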

How to improve power efficiency?

Lost Cycles = Power

Computation is memory-access limited

MAC needs 3 memory accesses

– Hardware = 1 cycle

– CPU = 3 to 30 cycles
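
The cost of lost cycles can be put into numbers: at a fixed clock, sustained MAC throughput is the clock rate divided by the cycles spent per MAC. The sketch below uses the cycle counts from the slide; the 100 MHz clock is an arbitrary assumption chosen only for illustration, and the function name is hypothetical.

```python
def macs_per_second(clock_hz: float, cycles_per_mac: float) -> float:
    """Sustained MAC rate when each MAC costs `cycles_per_mac` clock cycles."""
    return clock_hz / cycles_per_mac


CLOCK_HZ = 100e6  # assumed clock rate, not taken from the slides

cases = [
    ("Hardware (1 cycle)", 1),
    ("CPU, best case (3 cycles)", 3),
    ("CPU, worst case (30 cycles)", 30),
]

for label, cycles in cases:
    rate = macs_per_second(CLOCK_HZ, cycles)
    print(f"{label}: {rate / 1e6:.1f} MMAC/s")
```

Under this simple model the CPU delivers between one third and one thirtieth of the hardware MAC rate at the same clock, so it has to run faster, and spend more power, to do the same work.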

Pure Hardware DSP

Comparison with conventional CPU or DSP…

Advantages

– Lowest power

– Lowest silicon area

Disadvantages

– Time-consuming and costly to design in RTL

– Limited flexibility in case of:

• Standard / algorithm change

• RTL error

• Re-use of IP in a new product

How to build Flexible Hardware DSP?

Programmable VLIW DSP Engine

VLIW instruction mini-opcodes control:

– Sequencer (program counter)

– Many DSP modules

– Dynamic data routing

– Access to multiple memories

Advantages

– Low control and data-path overhead

– Choose DSP modules & routing for application
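
The idea of one wide instruction carrying a mini-opcode for each module can be sketched as a simple bit-field packing. Everything below is a hypothetical illustration: the field names, field widths and the convention that 0 means "no operation" are assumptions, not the actual Sapphyre instruction encoding.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (module name, mini-opcode width in bits).
FIELDS: List[Tuple[str, int]] = [
    ("sequencer", 4),
    ("alu", 4),
    ("mac", 3),
    ("indexer", 3),
    ("routing", 6),
    ("memory", 4),
]


def encode(mini_opcodes: Dict[str, int]) -> int:
    """Pack one mini-opcode per module into a single wide instruction word."""
    word = 0
    shift = 0
    for name, width in FIELDS:
        value = mini_opcodes.get(name, 0)  # assumed: 0 = no-op for that module
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} mini-opcode {value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def decode(word: int) -> Dict[str, int]:
    """Split an instruction word back into per-module mini-opcodes."""
    fields = {}
    shift = 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields


word = encode({"mac": 5, "routing": 33})
print(f"instruction word: {word:#08x}")
print(decode(word))
```

Because each module reads only its own field, decoding is little more than wiring, which is one way to keep the control overhead low relative to the computation logic.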

[Figure: block diagram showing program memory, instruction decoder, sequencer, ALU, indexer, MAC, memory interface, I/O registers and module N, connected by data routing, a data bus and an I/O bus.]

Design philosophy

Run it slow(er)

– Short pipelines: efficient loops and low control logic and datapath overhead

– Low-latency access to memory

– Low drive/power gates

Match the mix of modules to your algorithm

Match memory bandwidth to task

– e.g. MAC has 3 memory accesses per cycle

If you want to go faster…

– Add modules e.g. multiple MACs

– Add VLIW cores
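
Matching memory bandwidth to the task can be checked with simple arithmetic. The sketch below takes the slide's figure of 3 memory accesses per MAC per cycle and assumes, purely for illustration, that the requirement scales linearly with the number of MAC modules (no operand sharing); the function names are hypothetical.

```python
ACCESSES_PER_MAC = 3  # from the slide: a MAC has 3 memory accesses per cycle


def required_accesses_per_cycle(num_macs: int) -> int:
    """Memory accesses per cycle needed to keep every MAC busy (linear model)."""
    return num_macs * ACCESSES_PER_MAC


def sustained_macs_per_cycle(num_macs: int, memory_accesses_per_cycle: int) -> float:
    """MACs per cycle actually sustained when memory bandwidth may be the limit."""
    return min(num_macs, memory_accesses_per_cycle / ACCESSES_PER_MAC)


for macs, bandwidth in [(1, 3), (4, 6), (4, 12)]:
    print(
        f"{macs} MAC(s), {bandwidth} accesses/cycle available: "
        f"needs {required_accesses_per_cycle(macs)}, "
        f"sustains {sustained_macs_per_cycle(macs, bandwidth):.1f} MACs/cycle"
    )
```

In this model, adding MAC modules without adding memory bandwidth leaves the extra multipliers idle, which is the imbalance the design philosophy aims to avoid.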

Sapphyre™ VLIW DSP

VLIW instruction controls multiple modules each clock cycle

Modules interconnected by multiplexed data routing

Modules have next-cycle access to multiple memories

Library of modules to suit different algorithms

Balanced cores – you can really use the available processing capacity for processing

[Figure: block diagram showing program memory, instruction decoder, sequencer, ALU, indexer, MAC, memory interface, I/O registers and module N, connected by data routing, a data bus and an I/O bus.]

Module library (from figure): Sequencer, ALU, MAC, Cart2Polar, Constants, Debug monitor, Memory Interface, I/O Registers, Bit operator, Register Bank, Adder, ABS, Sin Cos, Indexer, Min Max, Shifter, Radix FFT Addr, Oscillator, Limiter

Sapphyre™ DSP – Programmers Toolchain

Developing code for Sapphyre™ cores is supported by the Programmers Toolchain, consisting of:

– Macro Assembler

– Export Tool

– Graphical Simulator

– Real-time Debug Monitor

Sapphyre™ DSP – Graphical Simulator

Configurable for:

– DSP Module choice

– Data paths

– New DSP modules

Macro Assembler

Bit & cycle-accurate simulation

Single stepping, breakpoints, register watch windows

Profiling for code efficiency

Sapphyre™ DSP – Real-time debug monitoring output in Silicon

Real-time and non-invasive

Test point monitoring of inputs, configuration and intermediate outputs

Replay and debug in the simulator

Sapphyre™ DSP – Simultaneous Core and Code Development

We develop the DSP application code in parallel with the customised core

Simultaneous development allows quick prototyping of data routing and modules

– Balanced I/O, memory access and processing

– Reduced development time

– Algorithm can be written before ASIC is complete

Reduced ASIC development risk

– ASIC RTL verified against DSP simulator vectors of real application code

– Modest clock speed allows real-time verification of ASIC RTL on FPGA

The resulting Sapphyre™ DSP cores are balanced, efficient designs, tailored to an application but with the flexibility to cope with future expansions

Does it really work?

384 MMAC/s

1 mW typical

$0.03 of silicon

Sapphyre™ Gen 5

Geometry                   40 nm
Clock                      96 MHz
Gates                      116K
Program Memory (typical)   64 KByte
Data Memory (typical)      64 KByte
MMAC/s                     384
Power (mW)                 8 (peak), 1 (average)
Power (uW/MHz)             80 (peak), 10 (average)
Die Area (mm²)             0.06 (core), 0.25 (memory)
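
The headline figures can be cross-checked against each other with a few lines of arithmetic. The sketch below uses only the numbers quoted above. Two assumptions go beyond the slides: the energy-per-MAC estimate takes the peak power to coincide with the peak MAC rate, and the MACs/MegaGate estimate takes MACs per cycle to equal the number of multipliers.

```python
# Figures quoted on the slide for Sapphyre Gen 5.
CLOCK_MHZ = 96
MMAC_PER_S = 384
GATES = 116_000
PEAK_MW = 8
AVERAGE_MW = 1

# MACs completed per clock cycle at the quoted throughput.
macs_per_cycle = MMAC_PER_S / CLOCK_MHZ

# Power normalised to clock rate, in microwatts per MHz.
peak_uw_per_mhz = PEAK_MW * 1000 / CLOCK_MHZ
average_uw_per_mhz = AVERAGE_MW * 1000 / CLOCK_MHZ

# Assumption: peak power is drawn while running at the peak MAC rate.
peak_pj_per_mac = (PEAK_MW * 1e-3) / (MMAC_PER_S * 1e6) * 1e12

# Assumption: MACs per cycle equals the number of multipliers in the core.
macs_per_megagate = macs_per_cycle / (GATES / 1e6)

print(f"MACs per cycle:          {macs_per_cycle:.1f}")
print(f"Peak power density:      {peak_uw_per_mhz:.1f} uW/MHz")
print(f"Average power density:   {average_uw_per_mhz:.1f} uW/MHz")
print(f"Peak energy per MAC:     {peak_pj_per_mac:.1f} pJ")
print(f"MACs/MegaGate (derived): {macs_per_megagate:.1f}")
```

The derived power densities of roughly 83 and 10 uW/MHz are consistent with the rounded 80 and 10 uW/MHz quoted on the slide.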

How does Sapphyre™ VLIW approach compare?

[Figure: bar chart of MACs/MegaGate (axis from 0 to 40) for Sapphyre Gen 3, Sapphyre Gen 5, ARM Cortex-R4, ARM Cortex-R5, ARM Cortex-R7, Ceva Teaklite-II and Ceva Teaklite-III-tl3210.]

VLIW DSP – What applications is it good for?

Low-power audio processing e.g. codecs

Software defined radio

Machine learning inference

Taking cost out of projects - replace dollars of DSP with cents of silicon

CPU hardware accelerators

Cambridge Consultants is part of the Altran group, a global leader in Innovation. www.Altran.com

www.CambridgeConsultants.com

UK USA SINGAPORE JAPAN

Registered No. 1036296 England