ISSCC 2000, DSP Tutorial. Ingrid Verbauwhede, Chris Nicol

DSP Architectures for Next-Generation Wireless Communications

Chris Nicol
Bell Laboratories Australia, Lucent Technologies
[email protected]

Ingrid Verbauwhede
Department of Electrical Engineering, University of California Los Angeles
[email protected]


Mobile Wireless Trends

[Figure: Subscribers (000), 0 to 1,600,000, versus year, 1995 to 2010. Curves: Global Wireline and Global Wireless.]

Wireless: CAGR 21%, global penetration (2010) 21% (Cellular + PCS + WLAS + Other)
Wireline: CAGR 5%, global penetration (2010) 20%
Global population: 7 billion, CAGR 1995-2010 1.4%

World-wide deployment of mobile communications is exceeding expectations



DSP Evolution and Markets

[Figure: Power (mW/MIPS), log scale from 10K down to 1, versus year, 1980 to 2000. DSPs: DSP-1 ($150), DSP-32C ($250), DSP16A ($15), DSP1600 (<$10), DSP16210. General-purpose processors: M68000 ($200), 80286 ($200), 80386 ($300), Pentium ($300), Pentium (MMX) ($700).]

[Figure: DSP Market. $2B market, 30% growth rate. Wireless $1.01B (cellular infrastructure, mobile handsets, cordless, GPS); Modem $727M (V.34, V.90, xDSL); Consumer & Automotive; Disk $270M; Other. Source: Forward Concepts 1996.]


The DSP Market Splits - and so does this tutorial

Today’s general purpose, assembly coded DSP:
• 100 MOPS
• 250 mW
• $40

Low cost, low power DSPs (Mobile Terminals, Ingrid Verbauwhede):
• 200-1000 MOPS
• < 100 mW
• $10

High performance DSPs (Infrastructure, Chris Nicol):
• 1-10 GOPS
• 1-5 watts
• < $50


Overview
• Introduction
• Low Power DSP Architectures for Handsets
  – Domain Specific Processors
  – DSP Processor Fundamentals
  – Datapath Design, Instruction Set Design
  – Pipeline Control, Memory Architecture, Low Power Design
  – for FIR - Viterbi - speech codec
• High Performance DSP Processors for BTS
  – 2G and 3G Wireless Standards
  – Mobile Wireless Basestation Systems
  – Receiver Algorithms, Smart Antennas
  – Wideband TRX Architectures
  – Convolutional and Turbo coding
• High Performance DSP Architectures for 3G Wireless
  – LU DSP16210, TI ‘C6x, Starcore SC140
  – Future Trends - MIMD DSP


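Domain Specific Processors

[Figure: spectrum of implementation styles, from ASIC through Application Specific, Domain Specific and General DSP to General Purpose. Performance / Power runs from high (ASIC) to low (General Purpose); Programmability runs from none (ASIC) through parameters to high and very high (General Purpose).]

Low power programmable DSP’s for wireless communications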


Domain Specific Processors

Domain specific processors combine:
• High performance
• Low power
• High degree of programmability

Application domains that need it:
• Wireless communications (baseband processing)
• Video processors
• Embedded microcontrollers
• Etc.

The application domain is narrower, hence high volume is needed to compensate for the development cost.


Application domain: wireless communications

[Figure: handset block diagram. RF board: receiver, transmit, synthesize, PA, TCXO. Baseband board: DSP, microprocessor, digital ASIC, analog ASIC, audio codec, external memories, power supply, battery pack, keypad and display.]


Performance requirements: digital cellular phone

[Figure: signal chain, with blocks labelled Communication and Application.]
Receive: RF Receive → Demodulation → Channel decoder → Speech decoder
Send: Speech encoder → Channel encoder → Modulation → RF Send

Goal: Minimum “MIPS” to get the job done.


Note: Definition of MIPS, MOPS

What is inside a MIPS (Million Instructions per Second)?

DSPs use complex instructions: one instruction = 5 operations.
E.g. a Lode instruction: 2 memory operations, 2 address generations and 1 arithmetic operation.

So benchmarks are expressed as the minimum number of operations needed to finish a job, usually quoted in “MIPS”.

Small example: Viterbi butterfly operation in 4 cycles/butterfly
Large example: GSM half rate speech codec in only 12 “MIPS”


Application Domain: compute intensive functions

Source encoder/decoder = speech coders
Advanced vocoders for improved speech quality & higher capacity. Example: ACELP derivatives for GSM and IS136A
• Digital filtering (FIR, IIR)
• Vector quantization, code book search (square distance computation)

Channel encoder/decoder = error correcting
Complex wireless modems:
• Galois field arithmetic
• Convolutional coders based on Viterbi trellis search
• Turbo coders

Modulation/demodulation =
• Receivers based on Maximum Likelihood Sequence Estimation (again requires fast Viterbi butterfly operations)


Compute intensive functions: evolution of DSP’s

Simple FIR example

Square distance

Speed-up of FIR example

Viterbi acceleration

Evolution of DSPs follows these examples



Evolution of DSP processors

Generation       Features                                         Examples
0 (1980)         Von Neumann architecture                         DSP-1 (AT&T)
1 (1982)         Basic Harvard architecture                       TMS320C10 (TI), NEC7720
2 (1986)         1 data/program bus, 1 data bus                   TMS320C25 (TI), DSP16A (AT&T)
3 (1990)         Extra addressing modes, extra functions          TMS320C5x (TI), DSP16xx (AT&T)
4 (1994)         2 data busses, 1 program bus                     TMS320C54x (TI)
5 (1996 – now)   2 data busses, 1 program bus, multiple units     Lucent 16xxx, Atmel Lode, Siemens Carmel


DSP Processor Fundamentals

Processor Components [Skillikorn-88]:
• Data Path Processing Unit
• Interconnect Processing Unit
• Memory Management Unit
• Instruction Processing Unit


Basic Harvard Architecture

[Figure: Program Memory, Data Memory, Instruction Processing Unit, and a Multiply-Accumulate datapath (16 x 16 mpy, ALU).]

Separate data memory from program memory!

Different from a Von Neumann machine: one address bus - one data bus - one memory space


Example 1: TMS320C10 (1982)

[Figure: block diagram. Data RAM (144 x 16), Program ROM (1.5K x 16), CPU with 16-bit T-register, 16 x 16 multiply, 32-bit P-register, 16-bit barrel shifter (L), 32-bit ALU, 32-bit accumulator, Shift L (0,1,4), 2 auxiliary registers, four-level H/W stack, status register; I/O ports (8 x 16); pins D (15-0), A (11-0), PA (7-0) (A 2-0, D 15-0).]

• 160/200 ns instruction cycle time
• 4K word external address reach
• 60 general purpose and DSP specific instructions
• Single cycle multiply
• 16-bit barrel shifter
• External interrupt and polled input pins
• Eight 16-bit I/O ports
• 40-pin DIP/44-pin PLCC

Courtesy: Texas Instruments


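Compute Intensive function 1: FIR

[Figure: 50-tap FIR filter. The input x(n) passes through a chain of Z^-1 delays giving x(n-1) ... x(n-(N-1)); each tap is multiplied by a coefficient c(0) ... c(N-1) and the products are summed to give y(n).]

$$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$$

TMS320C10                          TMS320C25
LTD                                RPTK 49
MPY                                MACD
LTD
MPY
...
100 words program memory           3 words program memory
100 cycles                         53 cycles

LTD = LT + DMOV + APAC; MACD = LT + DMOV + APAC + MPY

Single Cycle Multiply - Accumulate!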


Example 2: Single Cycle MAC

TMS320C2x Multiplier/ALU

[Figure: datapath. Components: data bus and program bus (16 bits each), T register (16), multiplier (16x16), P register (32), left shifter (0-7), left shifter (0-16), two MUXes, arithmetic logic unit (ALU), accumulator register (32) with carry C, 0-16 bit left post-shifter. Internal paths are 32 bits wide.]

• Single cycle 16x16 bit multiply yielding a 32-bit product
• Supports simultaneous program and two data operand acquisition
• Supports simultaneous ALU and multiplier operations
• 0-16 bit left post-shifter

Courtesy: Texas Instruments


Compute Intensive function 1: FIR (cont.)

$$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$$

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);

y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);

y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);

. . .

y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));

One output = 2N reads, N MAC’s, 1 write

Classic Harvard: one output = N cycles
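The operation counts above can be checked with a small model. The following is a minimal Python sketch, not DSP code: the function name `fir_output` and the convention that samples outside the buffer read as zero are assumptions made for illustration.

```python
def fir_output(c, x, n):
    """Return y(n) = sum_i c(i) * x(n - i), plus read and MAC counts.

    Samples outside the buffer are treated as zero (an assumption).
    """
    acc = 0
    reads = 0
    macs = 0
    for i in range(len(c)):
        k = n - i
        sample = x[k] if 0 <= k < len(x) else 0
        reads += 2              # one coefficient read, one data read
        acc += c[i] * sample    # one multiply-accumulate
        macs += 1
    return acc, reads, macs


c = [1, 2, 3]
x = [4, 5, 6, 7]
y2, reads, macs = fir_output(c, x, 2)
print(y2, reads, macs)  # 28 6 3
```

With N = 3 taps the model performs 2N = 6 reads and N = 3 MACs per output, which is the count stated on the slide. On a single-MAC Harvard machine that fetches both operands in one cycle, each loop iteration corresponds to one cycle.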


FIR speed-up

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);

y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);

y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);

. . .


FIR filtering: two outputs in parallel

Two outputs = 4N reads, 2N MACs, 2 writes
Dual MAC architecture with ONLY 2 data busses??

• Run MAC at double frequency: solution by Matsushita
• Read two 32-bit numbers instead of four 16-bit numbers: solution by Lucent 16000 core with dual MAC
• Insert delay register: solution by Atmel’s LODE


Example 3: Lucent DSP16210

Horizontal parallelism, one sample at a time
2G mobile wireless base-stations

[Figure: datapath. XDB(32) and IDB(32) buses; X(32) and Y(32) registers; two 16 x 16 multipliers producing p0 (32) and p1 (32); Shift/Sat. units, ALU, ADD, BMU; accumulator file 8 x 40.]

Inner loop of 32-tap FIR filter:

do 14 { // one instruction !
    a0=a0+p0+p1
    p0=xh*yh p1=xl*yl
    y=*r0++ x=*pt0++
}

• Outer loop: 19 cycles, 38 bytes
• 1 cycle in inner loop
• 5 exec units used in inner loop
• 2 MACs per cycle

Courtesy: Gareth Hughes, Bell Labs Australia
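The single-instruction loop body combines a load, two multiplies and an accumulate that each act on data from a different iteration. The sketch below models that as a three-stage software pipeline in Python. It is an illustration under stated assumptions: the packed operands are represented as `(high, low)` tuples, the function name `pipelined_dual_mac` is hypothetical, and the reading that `do 14` covers the steady-state steps of 16 packed pairs (32 taps) is one plausible interpretation, not something the slide states.

```python
def pipelined_dual_mac(x_pairs, y_pairs):
    """Dot product of packed (high, low) pairs on a modelled 3-stage pipeline.

    Each step issues up to three parallel actions, mirroring the loop body:
    accumulate (a0 = a0 + p0 + p1), multiply (p0 = xh*yh, p1 = xl*yl), load.
    """
    n = len(x_pairs)
    a0 = 0
    products = None    # (p0, p1) produced in the previous step
    operands = None    # (x, y) loaded in the previous step
    steady_steps = 0   # steps in which load, multiply and accumulate all run
    for step in range(n + 2):
        do_load = step < n
        do_mul = operands is not None
        do_acc = products is not None
        if do_load and do_mul and do_acc:
            steady_steps += 1
        if do_acc:
            a0 += products[0] + products[1]
        if do_mul:
            (xh, xl), (yh, yl) = operands
            products = (xh * yh, xl * yl)
        else:
            products = None
        operands = (x_pairs[step], y_pairs[step]) if do_load else None
    return a0, steady_steps


x_pairs = [(i, i + 1) for i in range(16)]   # 16 packed pairs = 32 taps
y_pairs = [(2, 3)] * 16
total, steady = pipelined_dual_mac(x_pairs, y_pairs)
print(total, steady)
```

In this model the 16 pairs need two fill steps and two drain steps around the fully overlapped steps, which leaves 14 steps where all three actions run together.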


FIR on Lode

FIR filter: two outputs in parallel with delay register

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);

y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);

y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);

. . .

Total energy for one output sample:

Energy                        Single MAC   Dual MAC   Dual MAC with REG
No. of MAC operations         N            N          N
No. of memory reads           2N           2N         N
No. of instruction cycles     N            N/2        N/2


FIR on Lode

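Two MAC units with dedicated bus network

[Figure: DB0(16) delivers c(i) and DB1(16) delivers x(n-i) to two MAC units (MAC0, MAC1). One MAC accumulates c(i)x(n-i) into y(n); the other accumulates c(i)x(n-i+1) into y(n+1), with the delayed sample supplied by LREG. Accumulators A0 and A1 hold the results.]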

• DB0 fetches coefficient

• DB1 fetches data

• LREG delays input data

• A0 stores y(n) output

• A1 stores y(n+1) output

Same structure can be used for IIR
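The effect of the delay register on memory traffic can be sketched in Python. This is a behavioural model only: `fir_pair_with_lreg`, `fir_direct` and `sample` are hypothetical names, and out-of-range samples are assumed to read as zero.

```python
def sample(x, k):
    """Read x(k), treating samples outside the buffer as zero (assumption)."""
    return x[k] if 0 <= k < len(x) else 0


def fir_direct(c, x, n):
    """Reference: y(n) = sum_i c(i) * x(n - i)."""
    return sum(c[i] * sample(x, n - i) for i in range(len(c)))


def fir_pair_with_lreg(c, x, n):
    """Compute y(n) and y(n+1) in one pass, reusing each fetched sample.

    a0 accumulates y(n) and a1 accumulates y(n+1); lreg models LREG,
    holding the data sample fetched on the previous step.
    """
    reads = 1
    lreg = sample(x, n + 1)        # preload the newest sample once
    a0 = a1 = 0
    for i in range(len(c)):
        coeff = c[i]               # DB0 fetches the coefficient
        data = sample(x, n - i)    # DB1 fetches the data sample
        reads += 2
        a0 += coeff * data         # contributes c(i) x(n-i) to y(n)
        a1 += coeff * lreg         # contributes c(i) x(n-i+1) to y(n+1)
        lreg = data                # delay the sample by one step
    return a0, a1, reads


c = [1, 2, 3, 4]
x = [5, 6, 7, 8, 9, 10]
y_n, y_n1, reads = fir_pair_with_lreg(c, x, 3)
print(y_n, y_n1, reads)
```

Two outputs cost 2N + 1 reads in this model, so roughly N reads per output, in line with the "Dual MAC with REG" column of the table, while the loop runs N steps for two outputs (N/2 per output).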


Compute Intensive function 2: Viterbi

[Figure: Viterbi butterfly. States i and i + s/2 connect to states 2i and 2i+1 over branches with metrics +a and -a.]

i = state index
s = # of states = 2^(k-1)
w = decoding window

Basic equations:

$$d(2i) = \min\{\, d(i) + a,\; d(i + s/2) - a \,\}$$
$$d(2i+1) = \min\{\, d(i) - a,\; d(i + s/2) + a \,\}$$

IS-95: k = 8, w = 192, corresponds to 2^7 x 192 x (cycles for one ACS)

Basic algorithm in Viterbi channel decoders and MLSE based receivers, modified version in turbo decoders.

Key operation: Add-Compare-Select (ACS)
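The butterfly equations translate directly into an add-compare-select routine. The Python sketch below is illustrative: the names `butterfly` and `trellis_stage`, the tie-breaking rule and the decision-bit convention (0 selects the path from state i, 1 the path from state i + s/2) are assumptions, not taken from any of the processors discussed.

```python
def butterfly(d, i, s, a):
    """One Viterbi butterfly: two add-compare-select operations.

    d holds the s path metrics, a is the branch metric of this butterfly.
    Returns the new metrics for states 2i and 2i+1 and their decision bits.
    """
    upper, lower = d[i], d[i + s // 2]
    # d(2i) = min{ d(i) + a, d(i + s/2) - a }
    c0, c1 = upper + a, lower - a
    m0, bit0 = (c0, 0) if c0 <= c1 else (c1, 1)
    # d(2i+1) = min{ d(i) - a, d(i + s/2) + a }
    e0, e1 = upper - a, lower + a
    m1, bit1 = (e0, 0) if e0 <= e1 else (e1, 1)
    return m0, m1, bit0, bit1


def trellis_stage(d, branch):
    """Update all s states with s/2 butterflies; branch[i] is a for butterfly i."""
    s = len(d)
    new = [0] * s
    bits = [0] * s
    for i in range(s // 2):
        new[2 * i], new[2 * i + 1], bits[2 * i], bits[2 * i + 1] = butterfly(
            d, i, s, branch[i]
        )
    return new, bits


metrics, decisions = trellis_stage([0, 3, 1, 4], [1, -2])
print(metrics, decisions)  # [0, -1, 1, 2] [1, 0, 0, 1]
```

A decoder with s states runs s/2 butterflies per trellis stage, which is why the cycle count per butterfly dominates the decoder benchmarks on the following slides.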


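Viterbi on Lode

Two MAC units & ALU: Add-Compare-Select

• DMAC operates as dual add/subtract unit
• ALU finds minimum
• Shortest distance saved
• Path indicator saved
• 4 cycles / butterfly

[Figure: DB0(16) and DB1(16) supply the branch metrics µ1 and µ2; MAC0 and MAC1 add them to the path metrics Γ1 and Γ2 (accumulators A0, A1); the ALU applies Min() and the result Γ is held in A2/A3; the decision bit goes to memory.]

$$\Gamma = \min[(\Gamma_1 + \mu_1), (\Gamma_2 + \mu_2)]$$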


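Viterbi on TI C54x

ALU and CSSU: Add-Compare-Select

• ALU splits in 16 bit halves
• ACC splits in half
• Shortest distance saved
• CSSU compares halves
• Path indicator saved
• 4 cycles / butterfly

[Figure: DB0(16) and DB1(16) supply µ1 and µ2 (via TREG); the split ALU adds them to Γ1 and Γ2 held in the two halves of the accumulator; the compare unit with MSW/LSW select produces Γ; the decision bit is shifted into the TRN register; the result goes to memory over data bus EB.]

$$\Gamma = \min[(\Gamma_1 + \mu_1), (\Gamma_2 + \mu_2)]$$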


Viterbi on LU DSP16210

GSM (K=5, 16 states)

do 8 {
    a0=a4+y a1=a5-y *r3++=a0h
    a2=a4-y a3=a5+y *r5++=a2h
    a0=cmp1(a1,a0) yh=*r0 r0=r1+j j=k k=*pt1++
    a2=cmp1(a3,a2) a4_5h=*pt0++
}

• Hardware support for Viterbi algorithm:
  – ACS calculations are efficient
  – Minimal overhead
• 4 cycles per butterfly
  – 32 cycles per GSM timeslot.
• Comparison functions store ACS decision bits:

[Figure: successive a0=cmp1(a1,a0) and a2=cmp1(a3,a2) operations shift decision bits into AR0; results written to memory.]

Courtesy: Gareth Hughes, Bell Labs Australia


Square distance on Lode

ALU in parallel with MAC: Sum of square distance

• ALU performs subtraction and absolute value
• MAC performs squaring and accumulation

Vector quantization in vocoders: vector size N = 50, codebook > 1000

$$D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^2$$

[Figure: DB0(16) and DB1(16) deliver x(i) and y(i); the ALU forms the difference; the MAC squares it and accumulates D in A0.]
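The split of work between ALU and MAC can be mirrored in a short Python sketch of the codebook search. The names `square_distance` and `nearest_codeword` are hypothetical, and the exhaustive search is only one way a vocoder might use the distance; it is shown here as an assumption for illustration.

```python
def square_distance(x, y):
    """D = sum_i |x(i) - y(i)|^2, split the way the slide describes."""
    acc = 0
    for xi, yi in zip(x, y):
        diff = abs(xi - yi)    # ALU: subtraction and absolute value
        acc += diff * diff     # MAC: squaring and accumulation
    return acc


def nearest_codeword(x, codebook):
    """Exhaustive search: index and distance of the closest codebook vector."""
    best_index = 0
    best_distance = square_distance(x, codebook[0])
    for index in range(1, len(codebook)):
        distance = square_distance(x, codebook[index])
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


codebook = [[0, 0, 0], [1, 2, 4], [3, 3, 3]]
index, distance = nearest_codeword([1, 2, 3], codebook)
print(index, distance)  # 1 1
```

With vector size N = 50 and more than 1000 codebook entries, each search needs over 50,000 difference-square-accumulate steps, which is the motivation for doing each step in a single instruction.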


Lode Core Architecture


Domain specific instruction set

Basic instruction set for general purpose DSP, e.g. MAC, min, max, etc.

Extra instructions for performance with every new generation, e.g. “square distance and accumulate”:

$$D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^2$$

One 32 bit instruction:

a3 = abs (*r0 - *r1 < asr), a0 = a0 + sqr(a3), r0++, r1++;

Bus network and instruction set design go together

CISC, thus compiler unfriendly


Control & Pipeline for DSP’s

RISC: load/store machine, memory access with load/store instructions (DLX, MIPS, D10V)
Pipeline: Fetch → Decode → Execute → Memory Access → Write Back
(Execute: execution / address generation. Memory Access: memory access / branch.)
Excellent for complex decision making!

DSP: register-memory architecture (TI, Lucent, HX, Lode)
Pipeline: Fetch → Decode → Memory Access → Execute → Write Back
(Memory Access: memory access. Execute: execution.)
Excellent for number crunching!


Pipeline RISC compared to DSP

RISC example:

r0 = *p0; // load data
a0 = a0 + r0; // execute

Pipeline: Fetch → Decode → Execute → Memory Access
Too expensive for DSP

DSP: memory intensive applications
Pipeline: Fetch → Decode → Memory Access → Execute
Penalty: data dependent branch is expensive


Other control features

Hardware looping:
• Because software branch is expensive
• “Zero overhead hardware loops” (for tight FIR loops), hardware supported

Interrupts: hardware with shadow registers for extremely fast context switching.

Special instruction cache:
• Single instruction “repeat” buffer
• Multiple instruction cache: under programmer’s control!
• E.g. Lucent DSP16210: 31 x 32 instruction cache

Predictable worst case execution time!


Low Power DSP’s

DSP 1600 Core (Lucent - 1609 low cost consumer 16-bit)
• 0.35µ 3LM CMOS
• 80 M 16b MAC/s at 3.3V
• 1.4 mW/MHz at 3.3V
• 30 µW stand-by power

C54x 1V DSP (Texas Instruments - ISSCC 1997)
• 0.25µ 3LM CMOS
• 65 M 16b MAC/s at 1.0V
• 0.21 mW/MHz at 1.0V
• 4.0 mW stand-by power
• Dual Vt process


BUT: DSP Software Development

• Complex DSP architecture not amenable to compiler technology
• Algorithms are modeled in high level language (e.g. C++)
• Solutions are implemented and debugged in hand-optimized assembler - large development effort with minimal tool support

[Figure: development flow. HLL algorithmic model → prototype code → production code (hand coded assembler), with an optimize & debug loop.]

• Long, frustrating time to market
• Fragile legacy code

Still used in handhelds, but change in basestations, Part II


Mobile Wireless Evolution

First Generation (Past)
  Service: Mobile telephone service: carphone
  Technology: Analog cellular technology; macrocellular systems
  Standards: NMT, TACS, Analog AMPS

Second Generation (Now)
  Service: Digital voice + messaging/data services; fixed wireless loop
  Technology: Digital cellular technology + IN emergence; microcellular & picocellular: capacity, quality; enhanced cordless technology
  Standards: GSM, IS-54/136 TDMA, IS-95/cdmaOne, PDC, DECT

Third Generation (Year 2000-2005)
  Service: Integrated high quality audio and data; narrowband and broadband multimedia services + IN integration
  Technology: Broader bandwidth; efficient radio transmission; information compression; higher frequency spectrum utilization; IN + network management integration
  Standards: WCDMA, UWC-136 TDMA, cdma2000

Fourth Generation (Year 2010?)
  Service: TelePresencing; education, training and dynamic information access
  Technology: Wireless-wireline and broadband transparency; knowledge-based network operations; unified service network

Global roaming

We are entering the decade of wireless data communications - and World-War 3G


Mobile Data Services

• Carriers invest >$500 per subscriber but subscriber voice calls (and therefore revenues) are reducing.
• Data currently 3% of wireless traffic - projected to >50% by 2005
• Wireless Internet: average internet connection 30 mins
• Text Messaging: saturating 2G voice networks

2.5 Generation Mobile Standards [1]
• GPRS: Packet data over GSM - timeslot multiplexing, multi-slots per user.
• EDGE: 8-PSK modulation + GPRS, 384 kbps max to 1 user.

3G - IMT2000 Proposals
• 144 kbps automobile, 384 kbps pedestrian, 2 Mbps stationary.
• Several proposals - UWC 136 (200 kHz, TDMA, 8-PSK = EDGE).
• UMTS, CDMA-2000 are both CDMA proposals.


Evolution of Mobile Wireless Network Architecture

[Figure: 2G network compared with IP-based 3G network. Elements shown: radio clients, base stations, BSC, MSC, mobile switches, circuit mode servers (voice, low speed data, etc.), packet mode servers (high speed data, multimedia, voice over IP, etc.), wireless control servers (feature control, network management, billing, etc.), network servers, packet connectivity (ATM / IP), PSTN, Internet / advanced services.]

Mobile networks are being upgraded in preparation for the delivery of high speed data services.


Mobile Wireless Infrastructure

[Figures: Macro-cell GSM basestation (6-12 TRX); Micro-cell GSM basestation (2 TRX)]


2G Basestation Baseband Processing

• Multiple DSPs used for baseband processing.
• RISC microcontroller for timing, framing, I/O control
• Software upgradable over the network
• DSPs dominate cost and power consumption

[Figure: Tx/Rx baseband processing board for 2-carrier GSM basestation. Blocks: AFE (x2), Tx and Rx paths, DSPs for channel equalization, channel de/coding and encryption, RAM, I/O, ASIC, RISC microcontroller, T1/E1 interface.]

Future trend - integrate baseband processing - low cost Pico BTS


3G Basestation Baseband Processing

• Increased receiver algorithm sensitivity
• Antenna arrays - smart antennas
• Multi-standard basestations using software radio architecture
• 3G - constraint length 9, rate 1/2 convolutional coding for voice.
• 3G - constraint length 4, Turbo codes for data

Increased DSP performance needed in next-generation basestation

High performance DSPs + custom logic needed for 3G (Viterbi decoding and Turbo decoding)

[Figure: 3G receiver blocks and their implementation]
• Sliding correlator, despreading (ASIC)
• RAKE combiner, reassemble multipath (DSP, ASIC)
• Deinterleaver (DSP)
• Decoder: Viterbi algorithm, Turbo decoding (DSP, ASIC)
• Code tracking, delay-lock-loop (ASIC, DSP)
• Channel estimation (DSP)
• Code generator: channelisation code, scrambling code (ASIC), two instances
• Synchronisation: cell search, slot syn., frame syn. (DSP)
• Path search (ASIC)
• SIR measurement, fast power control (DSP)
• Power control

Courtesy: Bing Xu: Bell Labs Australia


Receiver Algorithms for GSM Basestation

• Enhanced receiver sensitivity
• Larger cells in suburban areas = reduced network cost
• Mobile transmits with less power = increased battery life

Existing receiver: Estimating wireless channel → Equalizing multi-path effects → Channel decoding → Speech decoding

New iterative receiver: Estimating wireless channel, equalizing multi-path effects, channel decoding and speech decoding, with speech statistics; 1.3 dB improvement

Challenge - requires 6x DSP MIPS of existing receiver in basestation

Courtesy: Magnus Sandell: Bell Labs UK


Smart Antennas

[Figure: Omnidirectional cell site; three sector cell site; intelligent antenna cell site]

• A multiple antenna element system
• Combined with a base station architecture and signal processing techniques designed to dynamically select or form the “optimum” beam pattern per user

Increased cost in RF electronics and enhanced DSP requirements.


Fixed Multi-Beam Versus Adaptive Beam

Fixed Multi-Beam: select from, or use, multiple “fixed” antenna beams to optimize performance.

Adaptive Beam: adaptively “weight” and combine multiple antenna elements to optimize performance.

[Figures: each case shows Mobile 1, Mobile 2 and an interferer, with direct and reflected rays.]


Digital Radio Trends - Software Radio

[Figure: multi-standard basestation. Antennas → RF/IF → RF/analog processing (AMP) → A/D → digital processing → network interface → network.]

Antennas, RF/IF: linear amplification, combining
RF/analog processing: higher dynamic range, smaller; amplifiers, mixers, filters . . .
Digital processing: DSPs - higher speed, more powerful; filtering, modulation, demodulation, equalization, rake receiver, correlator, channel coding, encryption, diversity . . .


Wideband Receiver Architecture

[Figure: RF-IF & filter → high speed A/D → digital channeliser → baseband processing for CH1 ... CHM. Spectrum sketches show channels CH1, CH2, CH3 ... CHM at fRF, at fIF, and individual channels CH1 ... CHM at baseband (fBB).]

Increased DSP performance needed for Software Radio


Turbo Codes

• Parallel concatenation of convolutional codes is used to give the codes structure so they can be decoded
• Pseudorandom interleaving is used to give the codes performance which approaches that of random coding
• Resulting encoder structure: two Recursive Systematic Convolutional (RSC) codes

[Figure: the input feeds Encoder #1 directly and Encoder #2 through an interleaver; a MUX combines the encoder outputs into the parity output; the input is also passed through as the systematic output.]

For 3G Wireless (UMTS and CDMA2000)
• Voice service: BER requirement 10^-3
• Data service: BER requirement 10^-5
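The encoder structure (two RSC encoders, one fed through a pseudorandom interleaver) can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the 4-state RSC code with feedback 1 + D + D^2 and feedforward 1 + D^2, the seeded shuffle used as interleaver, the rate-1/3 multiplexing order and the absence of trellis termination. The 3G standards define their own polynomials and interleavers, which are not reproduced here.

```python
import random


def rsc_parity(bits):
    """Parity sequence of a 4-state RSC encoder (assumed polynomials).

    Feedback 1 + D + D^2, feedforward 1 + D^2; the systematic output is
    the input itself.
    """
    s1 = s2 = 0
    parity = []
    for u in bits:
        fb = u ^ s1 ^ s2        # recursive feedback bit
        parity.append(fb ^ s2)  # feedforward taps 1 and D^2
        s1, s2 = fb, s1
    return parity


def make_interleaver(length, seed=0):
    """Pseudorandom permutation of positions (seed is an arbitrary choice)."""
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return order


def turbo_encode(bits, order):
    """Rate-1/3 output: systematic bit, parity 1, parity 2 for each position."""
    parity1 = rsc_parity(bits)
    parity2 = rsc_parity([bits[j] for j in order])
    out = []
    for k in range(len(bits)):
        out += [bits[k], parity1[k], parity2[k]]
    return out


message = [1, 0, 0, 0, 1, 1, 0, 1]
order = make_interleaver(len(message))
codeword = turbo_encode(message, order)
print(codeword)
```

A single 1 at the input produces a parity response that does not die out, which is the recursive property the RSC code relies on; the interleaver makes it unlikely that both encoders see a low-weight pattern at the same time.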


Turbo Decoding

• Key idea: iterative decoding (up to 10 iterations for 3G)
• There is one decoder for each elementary encoder.
• Each decoder estimates the a-posteriori probability (APP) of each data bit.
• The APP’s are used as a priori information by the other decoder.

[Figure: a DeMUX splits the received stream into systematic data and parity data; Decoder #1 and Decoder #2 exchange APP values through interleavers and a deinterleaver; the final output is hard bit decisions.]


Soft-Output Decoding Algorithms

Requirements for Turbo:
– Accept soft inputs in the form of a priori probabilities
– Produce APP estimates of the data.
– “Soft-Input Soft-Output”

Trellis-Based Estimation Algorithms
• Sequence estimation: Viterbi algorithm → SOVA → Improved SOVA
• Symbol-by-symbol estimation: MAP algorithm → log-MAP → max-log-MAP

SOVA and log-MAP use modified Add-Compare-Select operations: they not only select the maximum path metric but also need to keep the difference.

Today’s high-performance DSPs are highly MAC-focussed (for filtering in modem applications). Some DSPs provide hardware support for efficient implementation of Viterbi - none support SOVA or log-MAP.

Iterative channel estimation also uses Soft-Input Soft-Output decoders.


The Maximum A Posteriori (MAP) Algorithm

Log-Likelihood Ratio:

$$L(d) = \ln \frac{\Pr[d=1]}{\Pr[d=0]}$$

$$L(d \mid \mathbf{y}) = \ln \frac{\Pr[d=1 \mid \mathbf{y}]}{\Pr[d=0 \mid \mathbf{y}]} = \ln \frac{p(\mathbf{y} \mid d=1)}{p(\mathbf{y} \mid d=0)} + L(d)$$

• A priori value of Pr[d=1], Pr[d=0]
• Output of decoder contains additional extrinsic information
• The sum of the a priori information and the extrinsic information will be the a priori information for the next stage of decoding, either the 2nd decoder or the 1st decoder in the next iteration

Let 1) u_k be the kth bit of the desired data sequence, 2) y be the observed sequence, 3) the state transition go from state s' at time k-1 to state s at time k. 4) We want to evaluate this LLR for every k:

$$L(u_k \mid \mathbf{y}) = \ln \frac{\Pr[u_k=1 \mid \mathbf{y}]}{\Pr[u_k=0 \mid \mathbf{y}]} = \ln \frac{\sum_{(s',s):\,u_k=1} p(s',s,\mathbf{y})}{\sum_{(s',s):\,u_k=0} p(s',s,\mathbf{y})}$$

Break the probability computation into:

$$p(s',s,\mathbf{y}) = p(s',\mathbf{y}_{j<k}) \cdot p(s,\mathbf{y}_k \mid s') \cdot p(\mathbf{y}_{j>k} \mid s)$$

Alpha: $\alpha_{k-1}(s') = p(s',\mathbf{y}_{j<k})$
Gamma: $\gamma_k(s',s) = p(s,\mathbf{y}_k \mid s')$
Beta: $\beta_k(s) = p(\mathbf{y}_{j>k} \mid s)$


Gamma, Alpha and Beta Calculations

Gamma: Calculated from known bits up to k, needs to be stored

$$\gamma_k(s',s) = p(s,\mathbf{y}_k \mid s') = P(s \mid s') \cdot p(\mathbf{y}_k \mid s',s) = P(u_k) \cdot p(\mathbf{y}_k \mid u_k)$$

where $P(u_k)$ is calculated from the a priori information and $p(\mathbf{y}_k \mid u_k)$ is calculated from the received bits

Alpha: Calculated by a forward recursion through the trellis based on Gamma

$$\alpha_k(s) = \sum_{s'} \gamma_k(s',s) \cdot \alpha_{k-1}(s')$$

Beta: Calculated by a backward recursion from the end of the trellis

$$\beta_{k-1}(s') = \sum_{s} \gamma_k(s',s) \cdot \beta_k(s)$$

[Figure: window algorithm. Alpha, Gamma and Beta are computed over a window of the trellis, with dummy Beta’s used to start the backward recursion.]
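The forward and backward recursions can be sketched in plain Python. This is a probability-domain model for illustration, not a decoder: the function name `forward_backward`, the list layout `gamma[k-1][s_prev][s]` for $\gamma_k(s',s)$, the start in state 0 and the uniform final Beta are all assumptions.

```python
def forward_backward(gamma, num_states):
    """Alpha and Beta recursions over a trellis.

    gamma[k-1][sp][s] holds gamma_k(s', s) for k = 1..K.
    Assumptions: the trellis starts in state 0 and may end in any state.
    """
    K = len(gamma)
    alpha = [[0.0] * num_states for _ in range(K + 1)]
    beta = [[0.0] * num_states for _ in range(K + 1)]
    alpha[0][0] = 1.0
    for s in range(num_states):
        beta[K][s] = 1.0
    # alpha_k(s) = sum over s' of gamma_k(s', s) * alpha_{k-1}(s')
    for k in range(1, K + 1):
        for s in range(num_states):
            alpha[k][s] = sum(
                gamma[k - 1][sp][s] * alpha[k - 1][sp] for sp in range(num_states)
            )
    # beta_{k-1}(s') = sum over s of gamma_k(s', s) * beta_k(s)
    for k in range(K, 0, -1):
        for sp in range(num_states):
            beta[k - 1][sp] = sum(
                gamma[k - 1][sp][s] * beta[k][s] for s in range(num_states)
            )
    return alpha, beta


gamma = [
    [[0.6, 0.1], [0.2, 0.3]],
    [[0.5, 0.4], [0.1, 0.7]],
    [[0.3, 0.3], [0.9, 0.2]],
]
alpha, beta = forward_backward(gamma, 2)
totals = [sum(a * b for a, b in zip(alpha[k], beta[k])) for k in range(len(gamma) + 1)]
print(totals)
```

The sum over s of alpha_k(s) * beta_k(s) is the same at every k, which gives a cheap consistency check on an implementation. Because Beta runs backward from the end, all Gamma values for the block (or window) have to be stored first, as the slide notes.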


Log MAP and MAX-log MAP

Compute logarithms of alpha, beta and gamma, which means we compute:

$$\ln\left(e^{\delta_1} + e^{\delta_2}\right)$$

Log-MAP: $\ln\left(e^{\delta_1} + e^{\delta_2}\right) = \max(\delta_1, \delta_2) + f_c(|\delta_1 - \delta_2|)$, where $f_c$ is the correction function (implemented as a table)

MAX-Log-MAP: $\ln\left(e^{\delta_1} + e^{\delta_2}\right) \approx \max(\delta_1, \delta_2)$

[Figure: BER (10^-1 down to 10^-6) over the range 2.2 to 2.9 on the horizontal axis, for MaxlogAPP and LogAPP.]

MAX-log MAP loses approx. 0.5 dB relative to log MAP.

For log-MAP, a small correction table is needed (approx. 6 non-zero values). The absolute difference is used as the table look-up index. We need the difference!

Courtesy: Bing Xu: Bell Labs Australia
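The relation between the two approximations can be illustrated in Python. The exact correction term below is the standard identity ln(e^a + e^b) = max(a, b) + ln(1 + e^-|a-b|); the table step of 0.5 and the choice of 6 entries are assumptions picked only to match the "approx. 6 non-zero values" remark, and the function names are hypothetical.

```python
import math


def max_log(d1, d2):
    """MAX-log-MAP: drop the correction term."""
    return max(d1, d2)


def log_map_exact(d1, d2):
    """Log-MAP with the exact correction f_c(|d1 - d2|) = ln(1 + e^-|d1 - d2|)."""
    return max(d1, d2) + math.log1p(math.exp(-abs(d1 - d2)))


STEP = 0.5  # assumed quantisation step of the absolute difference
TABLE = [math.log1p(math.exp(-STEP * i)) for i in range(6)]  # 6 non-zero entries


def log_map_table(d1, d2):
    """Log-MAP with the correction read from a small look-up table."""
    index = int(abs(d1 - d2) / STEP)
    correction = TABLE[index] if index < len(TABLE) else 0.0
    return max(d1, d2) + correction


d1, d2 = 1.0, 0.3
reference = math.log(math.exp(d1) + math.exp(d2))
print(reference, log_map_exact(d1, d2), log_map_table(d1, d2), max_log(d1, d2))
```

Both variants need max(d1, d2); only log-MAP also needs |d1 - d2| as the table index, which is why an ACS unit that discards the difference is not sufficient.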


High Performance DSP Requirements

• Very high levels of DSP integer performance
• Scalability to meet wide range of cost, power, performance.
• Large memory and I/O bandwidth.
• Friendly, compiler driven, programming environment.
• Support for complex real-time synchronous applications (latency, predictable throughput, synchronization)
• Cost & power efficient solution.

[Figure: Some DSP Applications. MOPS (10 to 100K, log scale) versus year (1997, 1999, 2001). Applications plotted: V.34, GSM term, K56, PCS term, ADSL 500k, ADSL 6M, 24 ch. modem, DAB rcvr, 16 HR GSM, 1G eth. xcvr, set-top box, MPEG II encode, soft radio, 3-D graphics? Regions labelled "traditional DSP" and "3G Wireless".]


Compiler Driven VLIW

Large orthogonal register set, regular interconnect

[Figure: data memory, register array and interconnect feeding execution units ex1 (alu), ex2 (alu), ex3 (mpy), ex4 (ld/st) ... exn (ld/st).]

Instruction format: cond/branch | ex1 | ex2 | ex3 | ….. | exn

Atomic RISC-like operations => heavily pipelined, high freq. clock


Explicitly Parallel Instruction Computing

Execution Clusters

[Figure: data memory shared by two clusters, each with its own register array and interconnect; execution units ex1 (alu), ex2 (alu), ex3 (ld/st), ex4 (alu), ex5 (mpy), ex6 (ld/st).]

Execution Sets

[Figure: a fetch set is marked with bits (1 1 1 0 1 0 1 0) that delimit the exec. sets within it.]

Predication (guarded) exec.

Format: cond | any instruction
- eliminates branches
- improves compiler efficiency
- removes pipeline bubbles
- fill delayed branch slots with predicated instructions

Instruction modifiers

Format: modifier | instr1 | instr2 | instr3 | instr4
- allows shorter instruction length
- extend register addressing
- predication
- execution set identifier
- looping
- extended operations


Texas Instruments ‘C6201

[Figure: two execution clusters, each with ALU, shift, mpy and add units, attached to Register Bank A (16 x 32) and Register Bank B (16 x 32); instruction dispatch & decode; program memory (16K x 32) with 256-bit fetch; data memory (32K x 16).]

• 8-way VLIW with two execution clusters
• 256 bit (8x32) instruction fetch with variable length execute set
• Each 32 bit instruction individually predicated
• 11 stage pipeline
• 1600 MIPS, 400 MMACs @ 200 MHz


FIR Filter on TI ‘C6x

loop:

ldw .d1t1 *a4++,a5

|| ldw .d2t2 *b4++,b5

||[b0] sub .s2 b0,1,b0

||[b0] b .s1 loop

|| mpy .m1x a5,b5,a6

|| mpyh .m2x a5,b5,b6

|| add .l1 a7,a6,a7

|| add .l2 b7,b6,b7

• Outer loop: 23 cycles, 180 bytes
  – 1 cycle in inner loop
• All 8 exec units used in inner loop - maximum efficiency
  – 2 MACs per cycle

Hand-coded assembly: 32-tap FIR filter

• Assembly syntax more difficult to learn.
• Hard to get full use of all 8 execution units at once.
• Software pipelining difficult to implement, and requires longer prolog/epilog (larger code size).

Courtesy: Gareth Hughes: Bell Labs Australia


Viterbi on TI ‘C6x

LOOP:
  [b1]  b     .s1  LOOP
||[b1]  sub   .s2  b1,1,b1
||[!a2] sth   .d1  b12,*+a6[8]
||[!a2] add   .d2  b0,b14,b14
||      cmpgt .l1  a11,a10,a1
||      cmpgt .l2  b11,b10,b0
||      mpy   .m1x 1,b5,a4

  [a2]  sub   .s1  a2,1,a2
||[!a2] sth   .d1  a12,*a6++
||[a1]  add   .s2  2,b0,b0
||[b0]  mpy   .m2  1,b11,b12
||      mpy   .m1  1,a10,a12
||      sub   .l2x a7,b5,b10
||      ldh   .d2  *++b9,b5

        shl   .s2  b14,2,b14
||[a1]  mpy   .m1  1,a11,a12
||      add   .s1  a7,a4,a10
||      sub   .l1x b13,a4,a11
||      add   .l2  b13,b5,b11
||      mpy   .m2  1,b10,b12
||      ldh   .d2  *b4++[2],a7
||      ldh   .d1  *a5++[2],b13
; end of LOOP

Cycle 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

.D1 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH sd1 STH m[2] STH m[3]

.D2 ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj SUB m LDH sd0 STH m[5] STH m[4]

.M1 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0

.M2 MPY a8 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8

.L1 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 ADD m0 SUB -m0

.L2 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 SUB old SUB -m1 SUB m1 SUB I

.S1 B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k

.S2 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 ADD tr B JLOOP MVK j

Cycle 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

.D1 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH new 8 LDH old0 STH new 0 STH m[0] STH m[1] LDH old1

.D2 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 STH trans STH m[1] STH m[6] LDH old0

.M1 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0

.M2 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8 MPY a8 MPY mj

.L1 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 SUB new ADD old ADD SP

.L2 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8

.S1 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 MVK k

.S2 *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr B JLOOP

Utilization of execution units in Viterbi decoder

• 16-state Viterbi decoder for GSM from TI WWW site: ftp://ftp.ti.com/pub/tms320bbs/c62xfiles/vitgsm.asm
  – 3 cycles per butterfly
  – 32 cycles per GSM timeslot (8 butterflies)
  – MPY instructions used to move data

3-cycle 2-ACS inner loop, x 8


Lucent / Motorola Star*Core SC140

• 6-way VLIW with 128 bit (8x16) instruction fetch
• Prefix instructions for high performance without sacrificing code density
• Each execution set (parallel instructions + prefix) predicated
• 5 stage pipeline
• 1800 MIPS, 1200 MMACs @ 300 MHz

[Figure: program / data memory; program sequencer and instruction dispatcher; address registers (27) with two AAUs; data registers (16) with four MAC/ALU/BFU units.]


Viterbi on Star*Core

• Hardware support for Viterbi algorithm:
  – max2vit instruction.
  – vsl instruction
• 1 cycle per butterfly through software-pipelining
• Decision bits are manually stored using the Viterbi Shift Left (VSL) instruction

GSM (K=5, 16 states), x 4:

[ move.2l (r0)+,d0:d1 move.2l (r1)+,d1:d2 ]
[ add2 d0,d4 sub2 d6,d2
  sub2 d4,d0 add2 d2,d6 ]
[ max2vit d4,d2 max2vit d0,d6 ]
[ vsl.4w d2:d6:d1:d3,(r2)+n0
  vsl.4f d2:d6:d1:d3,(r3)+n0 ]

[Figure: max2vit d4,d2 and max2vit d0,d6 write path metrics to D2 and D6 and decisions to SR; vsl.4w d2:d6:d1:d3,(r2)+n0 shifts the decisions into D1 and D3 and writes the results (path metrics, decisions) to memory.]



Log-MAP on Star*Core

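[Figure: dataflow of the log-MAP butterfly over cycles 1 to 11. Inputs d0: a, d1: b, d6: x. Sums and differences: d0: a+x, d1: b+x, d5: a-x, d4: b-x. Differences d2: d0-d4, d3: d1-d5. Maxima d4: max(d0,d4), d5: max(d1,d5). n0: |d2| and n0: |d3| index the correction table at r6. Corrections are added: d4: d4+d2, d5: d5+d3.]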

move.w (r0)+,d0 move.w (r1)+,d1

add d0,d6,d0 sub d6,d0,d5

sub d6,d1,d4 add d1,d6,d1

max d0,d4 max d1,d5

abs d2 abs d3

sub d0,d4,d2 sub d1,d5,d3

move.l d2,n0

move.l d3,n0 move.w (r6+n0),d2

add d4,d2,d4 move.w (r6+n0),d3

add d5,d3,d5

move.2w d4:d5,(r2)+

This code uses 2 of the 4 ALUs and can be software pipelined to achieve 6 cycles per log-MAP butterfly

Star*Core code for log-MAP butterfly


Parallel DSP Architectures

Arch.      Parallelism
S/scalar   Dynamic instruction level
VLIW       Static instruction level
SIMD       Highly regular, data dependent
MIMD       Task level

[The "Compile?" and "Power?" rating symbols for each architecture are not recoverable.]

MIMD with VLIW / SIMD provides high order parallel execution

The future of high performance DSPs is MIMD


Daytona: A Multiprocessor DSP Architecture

[Figure: chip with programmable processing elements (PE) and a hardware accelerator on a split transaction bus (128 bits), with arbitration and synchronization; an I/O subsystem with buffered I/O and I/O interfaces; external memory.]

• Scalable architecture - multiple programmable DSPs on a single chip
• 1 bus supports different programmable DSPs and microcontrollers


Split Transaction Bus

• Separate address and data busses - each with pipelined protocol
• Address bus (100 MHz); data bus (128 bits, 100 MHz)
• Separate bus arbitration: one round-robin arbiter per bus
• Multiple outstanding transactions - varying size/priority

[Figure: a PE issues addr + ID on the address bus; the memory controller returns data + ID on the data bus; IDs match responses to requests.]


Memory Hierarchy in MIMD DSPs

• Heterogeneous mix of applications
  – Mix of different applications (e.g. equalisation, convolutional decoding)
• Multiple copies of same software - shared memory multiprocessing
  – Multiple copies of 1 application (e.g. odd/even slot channel equalisation)

Flat Memory Architecture vs. Hierarchical Memory Architecture

[Figure: flat: each DSP with its own SRAM, 2 copies of software (inefficient). Hierarchical: each DSP with a cache in front of shared DRAM, 1 copy of software.]


Shared Memory Multiprocessing

• L-1 cache coherency using a snoopy protocol (modified MESI used)
• Access to shared data uses a coherent transaction. Caches “snoop” the address and query their tag RAMs. A cache hit prevents the memory controller from servicing the request.
• 64 semaphores provided for process synchronization

[Diagram labels: DSP access to shared data; Coherent Transaction; Snoop (miss); Snoop (hit); Snoop (miss); Memory Controller]
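The snooping behaviour described on this slide can be illustrated with a small Python sketch: on a coherent read, every other cache checks its tags, and a snoop hit supplies the data so that the memory controller does not service the request. The `Cache` class, the `coherent_read` function and the dictionaries standing in for tag RAMs and memory are hypothetical. The sketch deliberately omits MESI line states, invalidation and write-back, so it is not a model of the modified MESI protocol the slide mentions.

```python
class Cache:
    """A toy L1 cache; the dict stands in for tag RAM plus data RAM."""

    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> data

    def snoop(self, addr):
        """Query the tags for a snooped address; return (hit, data)."""
        if addr in self.lines:
            return True, self.lines[addr]
        return False, None


def coherent_read(requester, caches, memory, addr):
    """Return (data, source) for a read of shared data."""
    # Local hit: no bus transaction is needed.
    if addr in requester.lines:
        return requester.lines[addr], requester.name
    # Coherent transaction: every other cache snoops the address.
    for cache in caches:
        if cache is requester:
            continue
        hit, data = cache.snoop(addr)
        if hit:
            # A snoop hit supplies the data; memory is not accessed.
            requester.lines[addr] = data
            return data, cache.name
    # All snoops missed, so the memory controller services the request.
    data = memory[addr]
    requester.lines[addr] = data
    return data, "memory"


if __name__ == "__main__":
    pes = [Cache("PE0"), Cache("PE1"), Cache("PE2")]
    memory = {0x10: 1, 0x20: 7}
    pes[1].lines[0x10] = 99  # PE1 holds a newer copy than memory
    print(coherent_read(pes[0], pes, memory, 0x10))
    print(coherent_read(pes[2], pes, memory, 0x20))
```

The first read in the demo is served by PE1's cache rather than by the stale value in memory, which is the case the snoop hit exists to handle.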


Daytona Multiprocessor DSP Chip

Bell Laboratories Research Chip for 3G Wireless Base-stations / Head-end xDSL

[Block diagram labels: four PEs, each with a 32-b RISC, a 64-b 4-MAC SIMD DSP and Cache Memory; 128-b Split Transaction Bus; Host Interface; I/O & Memory Controller; Test & JTAG Port; Arbiter; Semaphore]

Chip Characteristics
Tech        0.25 um
Core Area   120 mm²
Speed       100 MHz
Power       4 W

Paper 4.2, ISSCC2000


Photomicrograph of Daytona Test Chip

[Die photo labels: Processing Element (PE) (x3); 8KB Re-configurable Memory; DLL; SPARC; Vector Unit (RVU); BUS INT; HDS; LRU; Split Transaction Bus; I/O Subsystem; Arbiter; Semph]

Paper 4.2, ISSCC2000


Acknowledgements

The following people contributed to the work in this tutorial:

Low Power DSPs for Wireless
• Wanda Gass: Texas Instruments
• Mihran Touriguian: Atmel

High Performance DSPs for Wireless Infrastructure
• Bryan Ackland: Bell Labs US - High Perf. DSP Architecture
• Gareth Hughes: Bell Labs Australia - LU DSP16210, ‘C6x and Starcore benchmarks
• Bing Xu: Bell Labs Australia - SOVA, MAP, LOG-MAP
• Ran-Hong Yan: Bell Labs UK - 3G Wireless
• Daytona Team (J. Williams, K.J. Singh, J. Othmer, B. Ackland): Bell Labs US


References

[1] P. Lapsley, J. Bier, A. Shoham, E. Lee, “DSP Processor Fundamentals,” IEEE Press, New York, 1997.
[2] D. Skillicorn, “A Taxonomy for Computer Architectures,” Computer Magazine, Nov. 1988.
[3] H. Kabuo, M. Okamoto, I. Tanaka, H. Yasoshima, S. Marui, M. Yamasaki, T. Sugimura, K. Ueda, T. Ishikawa, H. Suzuki, R. Asahi, “An 80 MOPS-Peak High-Speed and Low-Power-Consumption 16-b Digital Signal Processor,” IEEE Journal of Solid-State Circuits, Vol. 31, No. 4, pp. 494-503, April 1996.
[4] E. A. Lee, D. G. Messerschmitt, Digital Communication, Boston: Kluwer Academic Publishers, 1988.
[5] W. Lee et al., “A 1V DSP for Wireless Communications,” Proceedings IEEE International Solid-State Circuits Conference, pp. 92-93, February 1997.
[6] S. Lin and J. Costello Jr., Error Control Coding: Fundamentals and Applications, Prentice Hall, New Jersey, 1983.
[7] Lucent 16000, http://www.lucent.com/micro/ or http://www.lucent.dk/micro/dsp16000/
[8] Thomas Parsons, Voice and Speech Processing, McGraw-Hill Book Company, New York, 1987.
[9] TMS320C54x User’s Guide, available from the Texas Instruments Literature Response Center.
[10] I. Verbauwhede, M. Touriguian, “A Low Power DSP Engine for Wireless Communications,” Journal of VLSI Signal Processing 18, pp. 177-186, 1998, Kluwer Academic Publishers.
[11] I. Verbauwhede, M. Touriguian, “Wireless digital signal processors,” Chapter in Digital Signal Processing for Multimedia Systems, Edited by K.K. Parhi, T. Nishitani, Publisher: Marcel Dekker, New York, 1999.
[12] M. Okamoto, K. Stone, T. Sawai, H. Kabuo, S. Marui, M. Yamasaki, Y. Uto, Y. Sugisawa, Y. Sasagawa, T. Ishikawa, H. Suzuki, N. Minamida, R. Yamanaka, K. Ueda, “A High Performance DSP Architecture for Next Generation Mobile Phone Systems,” 1998 IEEE DSP Workshop.
[13] Lode specifications, available from www.atmel.com
[14] M.W. Oliphant, “The Mobile Phone meets the Internet”, IEEE Spectrum, pp. 20-28, Aug. 1999.
[15] L. C. Godara, “Application of Antenna Arrays to Mobile Communications: Part 1”, Proc. IEEE, Vol. 85, No. 7, pp. 1031-1060, July 1997.


References (cont)

[16] G. D. Forney, Jr., “Maximum Likelihood Sequence Estimation of Digital Sequences in the Presence of Intersymbol Interference”, IEEE Trans. Inform. Theory, Vol. IT-18, pp. 363-378, May 1972.
[17] C. Berrou, A. Glavieux, P. Thitimajshima, “Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes (1)”, Proc. ICC’93, May 1993.
[18] J. Hagenauer, P. Hoeher, “A Viterbi Algorithm with Soft-Decision Outputs and its Applications”, Proc. Globecom 89, pp. 47.1.1-47.1.7, Nov. 1989.
[19] L. Bahl, J. Cocke, F. Jelinek, J. Raviv, “Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate”, IEEE Trans. Inform. Theory, Vol. IT-20, pp. 284-287, Mar. 1974.
[20] J. Turley, H. Hakkarainen, “TI’s new ‘C6x DSP Screams at 1600 MIPS”, Microprocessor Report, Vol. 11, No. 2, pp. 14, Feb. 1997.
[21] “Starcore Launched First Architecture”, Microprocessor Report, Vol. 12, No. 14, pp. 22, Oct. 1998.
[22] B. Ackland & P. D’Arcy, “A New Generation of DSP Architectures”, Proc. IEEE CICC99, Paper 25.1.1.
[23] J. Williams, K.J. Singh, C.J. Nicol, B. Ackland, “A 3.2 GOPs Multiprocessor DSP for Communication Applications”, Proc. IEEE ISSCC2000, Paper 4.2.
