Techniques for Low Power Turbo Coding in Software Radio

Joe Antoon, Adam Barnett


Software Defined Radio

• Single transmitter for many protocols
• Protocols completely specified in memory
• Implementation:
  – Microprocessors
  – Field-programmable logic

Why Use Software Radio?

• Wireless protocols are constantly reinvented
  – 5 Wi-Fi protocols
  – 7 Bluetooth protocols
  – Proprietary mouse and keyboard protocols
  – Mobile phone protocol alphabet soup

• Custom DSP logic for each protocol is costly

So Why Not Use Software Radio?

• Requires high-performance processors
• Consumes more power

[Diagram: general-purpose processors (inefficient), application-specific logic (efficient), field-programmable logic (inefficient)]

Turbo Coding

• Channel coding technique
• Throughput nears the theoretical limit
• Great for bandwidth-limited applications
  – CDMA2000
  – WiMAX
  – NASA's MESSENGER probe

Turbo Coding Considerations

• Presents a design trade-off
• Turbo coding is computationally expensive
• But it reduces cost in other areas
  – Bandwidth
  – Transmission power

Reducing Power in Turbo Decoders

• FPGA turbo decoders
  – Use dynamic reconfiguration
• General-processor turbo decoders
  – Use a logarithmic number system

Generic Turbo Encoder

[Diagram: data stream s feeds one component encoder directly (parity p1) and a second component encoder through an interleaver (parity p2); the received streams are r, q1, and q2]

Generic Turbo Decoder

[Diagram: two component decoders exchanging soft information through an interleaver]

Decoder Design Options

• Multiple algorithms can be used to decode
• Maximum A-Posteriori (MAP)
  – Most accurate estimate possible
  – Complex computations required
• Soft-Output Viterbi Algorithm (SOVA)
  – Less accurate
  – Simpler calculations

FPGA Design Options

• Goal: make an adaptive decoder

[Diagram: the decoder takes received data and parity and outputs the original sequence; a tunable parameter trades low power / low accuracy against high power / high accuracy]

Component Encoder

• M blocks are 1-bit registers
• Memory provides the encoder state

[Diagram: two 1-bit registers (M, M) feed a generator function (GF)]

[Trellis diagram: encoder states 00, 01, 10, 11 over time, with transitions labeled by input bits 0 and 1]
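The component encoder above can be sketched in software. This is a minimal sketch of a rate-1/2 recursive systematic encoder with two 1-bit registers; the generator taps used here are illustrative assumptions, not the ones from the slides.

```python
# Sketch of a component encoder with two 1-bit memory registers
# (the two M blocks). The generator-function taps are assumed.
def component_encode(bits):
    m1 = m2 = 0                       # encoder state: two 1-bit registers
    out = []
    for s in bits:
        feedback = s ^ m2             # generator function (assumed taps)
        parity = feedback ^ m1 ^ m2   # parity output (assumed taps)
        out.append((s, parity))       # systematic bit + parity bit
        m1, m2 = feedback, m1         # shift the new value into the registers
    return out
```

Each input bit produces the systematic bit s and one parity bit, matching the s/p1 structure of the generic encoder diagram.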

Viterbi’s Algorithm

• Determine the most likely output
• Simulate the encoder state given the received values

[Trellis diagram: states s0, s1, s2 over time, with received value/parity pairs (r0, p0), (r1, p1), (r2, p2) and decisions d0, d1, d2]

Viterbi’s Algorithm

• Write: compute the branch metric (likelihood)
• Traceback: compute the path metric, output data
• Update: compute the distance between paths

• Rank paths by path metric and choose the best
• For N memory bits:
  – Must calculate 2^(N-1) paths for each state

Adaptive SOVA

• SOVA: inflexible path system scales poorly
• Adaptive SOVA: heuristic
  – Limit to at most M paths
  – Discard a path if its metric falls below threshold T
  – Discard all but the top M paths when too many survive
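The Adaptive SOVA pruning heuristic can be sketched as a small routine. This assumes paths are (metric, bits) pairs and that a higher metric is better; both are modeling assumptions, not details from the slides.

```python
import heapq

# Sketch of Adaptive SOVA path pruning: keep at most M candidate
# paths, and drop any path whose metric falls below threshold T.
def prune_paths(paths, M, T):
    survivors = [p for p in paths if p[0] >= T]   # threshold discard
    if len(survivors) > M:                        # too many: keep the top M
        survivors = heapq.nlargest(M, survivors, key=lambda p: p[0])
    return survivors
```

Both rules cap the work per trellis step, which is what lets the adaptive decoder trade accuracy for power.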

Implementing in Hardware

[Block diagram: inputs q and r feed a Branch Metric Unit, then an Add-Compare-Select unit and Survivor Memory, coordinated by a Control unit]

Implementing in Hardware

• Controller
  – Controls memory
  – Selects paths
• Branch Metric Unit
  – Computes likelihood
  – Considers all possible "next" states
• Add, Compare, Select
  – Appends to the path metric
  – Discards paths
• Survivor Memory
  – Stores / discards path bits

Implementing in Hardware

Add, Compare, Select Unit

[Block diagram: present-state path values and branch values are combined (compute, compare) into next-state path values; the path distance is compared against threshold T to discard paths]

Dynamic Reconfiguration

• Bit Error Rate (BER)
  – Changes with signal strength
  – Changes with the number of paths used

• Change hardware at runtime
  – Weak signal: use many paths, preserve accuracy
  – Strong signal: use few paths, save power
  – Sample the SNR every 250k bits and reconfigure
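The runtime policy above can be sketched as a simple decision rule. The SNR cutoffs and path counts below are illustrative assumptions; only the 250k-bit sampling interval comes from the slides.

```python
# Sketch of the dynamic reconfiguration policy: sample the SNR
# every 250k bits and choose a path budget for the decoder.
SAMPLE_INTERVAL = 250_000  # bits between SNR samples (from the slides)

def choose_num_paths(snr_db):
    if snr_db < 2.0:       # weak signal: many paths, favor accuracy
        return 16
    elif snr_db < 6.0:     # moderate signal: middle ground
        return 8
    else:                  # strong signal: few paths, save power
        return 4
```

In the FPGA design, changing the path budget means loading a different decoder configuration rather than changing a variable.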


Experimental Results

• K (the number of encoder bits) is proportional to average speed and power

Experimental Results

• FPGA decoding has much higher throughput
  – Due to parallelism

Experimental Results

• ASOVA performs worse than commercial cores
• However, it is much better in other metrics
  – Power
  – Memory usage
  – Complexity

Future Work

• Use the present reconfiguration means to design
  – Partial reconfiguration
  – Dynamic voltage scaling

• Compare to power-efficient software methods

Power-Efficient Implementation of a Turbo Decoder in SDR System

• Turbo coding systems are created using one of three general processor types
  – Fixed Point (FXP)
    • Cheapest, simplest to implement, fastest
  – Floating Point (FLP)
    • More precision than fixed point
  – Logarithmic Number System (LNS)
    • Simplifies complex operations
    • Complicates simple add/subtract operations

Logarithmic Number System

• X = {s, x = log_b(|X|)}
  – s = sign bit; the remaining bits hold the number's value
• Example
  – Let b = 2
  – Then the decimal number 8 is represented as log_2(8) = 3
  – Numbers are stored in computer memory in 2's complement form (3 = 01111101) (sign bit = 0)

Why use Logarithmic System?

• Greatly simplifies multiplication, division, roots, and exponents
  – Multiplication simplifies to addition
    • E.g. 8 * 4 = 32; in LNS, 3 + 2 = 5 (2^5 = 32)
  – Division simplifies to subtraction
    • E.g. 8 / 4 = 2; in LNS, 3 – 2 = 1 (2^1 = 2)

Why use Logarithmic System?

• Roots are done as right shifts
  – E.g. sqrt(16) = 4; in LNS, 4 shifted right = 2 (2^2 = 4)
• Exponents are done as left shifts
  – E.g. 8^2 = 64; in LNS, 3 shifted left = 6 (2^6 = 64)
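The four simplified operations above can be sketched together. This is a minimal sketch with base b = 2, using floating-point logs for clarity; real LNS hardware would use fixed-point exponents, where halving and doubling are literal shifts.

```python
from math import log2

# Sketch of LNS arithmetic, base b = 2: a value X is stored as
# x = log2(|X|), so multiply/divide become add/subtract, and
# square root / squaring become shifts of the stored exponent.
def to_lns(v):      return log2(v)
def lns_mul(x, y):  return x + y     # 8 * 4  ->  3 + 2 = 5
def lns_div(x, y):  return x - y     # 8 / 4  ->  3 - 2 = 1
def lns_sqrt(x):    return x / 2     # right shift of the exponent
def lns_sq(x):      return x * 2     # left shift of the exponent
def from_lns(x):    return 2.0 ** x
```

For example, `from_lns(lns_mul(to_lns(8), to_lns(4)))` reproduces 8 * 4 = 32 with a single addition.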

So why not use LNS for all processors?

• Unfortunately, addition and subtraction are greatly complicated in LNS
  – Addition: log_b(|X| + |Y|) = x + log_b(1 + b^z)
  – Subtraction: log_b(|X| – |Y|) = x + log_b(1 – b^z)
  – Where z = y – x
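The LNS addition identity above can be checked directly in a short sketch (base b = 2). The log2(1 + 2^z) correction term is exactly what real LNS hardware reads from a lookup table.

```python
from math import log2

# Sketch of LNS addition: with x = log2(|X|) and y = log2(|Y|),
# log2(|X| + |Y|) = x + log2(1 + 2**z), where z = y - x.
def lns_add(x, y):
    z = y - x
    return x + log2(1.0 + 2.0 ** z)  # correction term: table lookup in hardware
```

For example, adding 8 and 4 in LNS gives `lns_add(3, 2)`, which equals log2(12), as expected.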

• Turbo coding/decoding is computationally intense, requiring more multiplies, divides, roots, and exponentials than adds or subtracts

Turbo Decoder block diagram


• Each bit decision requires a subtraction, a table lookup, and an addition

Proposed new block diagram

• As the difference between e^a and e^b grows, the error between the value stored in the lookup table and the exact computation becomes negligible

• For this simulation, a difference of d > 5 was used
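The thresholded approximation can be sketched as follows. This uses the standard identity max*(a, b) = max(a, b) + ln(1 + e^(-|a-b|)); the code mirrors the hardware behavior of skipping the lookup when the difference exceeds 5, where the correction term is below 0.007.

```python
from math import exp, log

# Sketch of the thresholded max* operation. In hardware, the
# correction term comes from an SRAM lookup table; for d > 5
# the table is bypassed and 0 is added instead.
def max_star(a, b, d_threshold=5.0):
    d = abs(a - b)
    if d > d_threshold:
        return max(a, b)                    # SRAM disabled: add 0
    return max(a, b) + log(1.0 + exp(-d))   # table-lookup path
```

Since ln(1 + e^(-5)) is about 0.0067, the skipped correction changes the metric only negligibly while saving the SRAM read.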

How it works

• For d > 5:
  – The new mux (on the right) ignores the SRAM input and simply adds 0 to the MAX result
  – The pre-decoder circuitry disables the SRAM to conserve power

Comparing the 3 simulations

• Comparisons were made between a 16-bit fixed-point microcontroller, a 16-bit floating-point processor, and a 20-bit LNS processor

• 11 bits would be sufficient for FXP and FLP, but 16-bit processors are much more common

• Similarly, 17 bits would suffice for the LNS processor, but 20-bit is a common width

Power Consumption

Latency

• Recall: Max*(a,b) = ln(e^a+e^b)

Power savings

• The pre-decoder circuitry adds 11.4% power consumption compared to an SRAM read

• So when an SRAM read is required, the modified system uses 111.4% of the unmodified system's power

• However, when the SRAM is blocked, it uses only 11.4% of the power used before

Power savings

• The CACTI simulations for the system reported that the Max* operation accounted for 40% of all operations in the decoder

• The modified Max* operations saved 69% of the power compared to the unmodified system

• This leads to an overall power savings of 69% * 40% = 27.6%

Conclusion

• Turbo codes are computationally intense, requiring more complex operations than simple ones

• LNS processors simplify complex operations at the expense of making adding and subtracting more difficult

Conclusion

• Using an LNS processor with slight modifications can reduce power consumption by 27.6%

• Overall latency is also reduced, due to the ease of complex operations in an LNS processor compared to FXP or FLP processors
