ISSCC 2000, DSP Tutorial. Ingrid Verbauwhede, Chris Nicol
DSP Architectures for Next-Generation Wireless Communications
Chris Nicol, Bell Laboratories Australia, Lucent Technologies
Ingrid Verbauwhede, Department of Electrical Engineering, University of California Los Angeles

Mobile Wireless Trends
[Figure: global wireline and global wireless subscribers (in thousands), 1995-2010; vertical axis from 0 to 1,600,000.]
Wireless: CAGR 21%, global penetration (2010) 21% (Cellular + PCS + WLAS + Other)
Wireline: CAGR 5%, global penetration (2010) 20%
Global population: 7 billion, CAGR 1995-2010 1.4%
World-wide deployment of mobile communications is exceeding expectations

DSP Evolution and Markets
[Figure: power (mW/MIP, log scale from 1 to 10K) versus year, 1980-2000.]
DSPs: DSP-1 ($150), DSP-32C ($250), DSP16A ($15), DSP1600 (<$10), DSP16210
Microprocessors: M68000 ($200), 80286 ($200), 80386 ($300), Pentium ($300), Pentium (MMX) ($700)
DSP Market: $2B market, 30% growth rate (Source: Forward Concepts 1996)
• Wireless, $1.01B: cellular infrastructure, mobile handsets, cordless, GPS
• Modem, $727M: V.34, V.90, xDSL
• Consumer & Automotive
• Disk, $270M
• Other

The DSP Market Splits - and so does this tutorial
Today’s general-purpose, assembly-coded DSP: 100 MOPS, 250 mW, $40
• Low-cost, low-power DSPs for mobile terminals (Ingrid Verbauwhede): 200-1000 MOPS, < 100 mW, $10
• High-performance DSPs for infrastructure (Chris Nicol): 1-10 GOPS, 1-5 watts, < $50

Overview
• Introduction
• Low Power DSP Architectures for Handsets
  • Domain Specific Processors
  • DSP Processor Fundamentals
  • Datapath Design, Instruction Set Design
  • Pipeline Control, Memory Architecture, Low Power Design
  • for FIR - Viterbi - speech codec
• High performance DSP Processors for BTS
  • 2G and 3G Wireless Standards
  • Mobile Wireless Basestation Systems
  • Receiver Algorithms, Smart Antennas
  • Wideband TRX Architectures
  • Convolutional and Turbo coding
  • High Performance DSP Architectures for 3G Wireless
  • LU DSP16210, TI ‘C6x, Starcore SC140
  • Future Trends - MIMD DSP

Domain Specific Processors
[Figure: spectrum of processor types, from ASIC (application specific) through domain specific and general DSP to general purpose.]
Performance / Power: high at the ASIC end, low at the general-purpose end
Programmability: none (or parameters only) at the ASIC end, very high at the general-purpose end
Low power programmable DSP’s for wireless communications

Domain Specific Processors
Domain specific processors combine:
• High performance
• Low power
• High degree of programmability
Application domains that need it:
• Wireless communications (baseband processing)
• Video processors
• Embedded microcontrollers
• Etc.
The application domain is narrower, so high volume is needed to compensate for the development cost.

Application domain: wireless communications
[Figure: handset block diagram with keypad and display (“No network”).]
RF board: receiver, transmit, synthesizer, PA, TCXO
Baseband board: digital ASIC, microprocessor, DSP, analog ASIC, audio codec, external memories, power supply, battery pack

Performance requirements: digital cellular phone
Receive path: RF receive, demodulation, channel decoder, speech decoder
Send path: speech encoder, channel encoder, modulation, RF send
The blocks are grouped into communication functions and application functions.
Goal: Minimum “MIPS” to get the job done.

Note: Definition of MIPS, MOPS
What is inside a MIPS (Million Instructions per Second)?
DSPs use complex instructions: one instruction = 5 operations.
E.g. a Lode instruction: 2 memory operations, 2 address generations and 1 arithmetic operation.
So benchmarks are expressed as the minimum number of operations needed to finish a job, usually quoted in “MIPS”.
Small example: Viterbi butterfly operation in 4 cycles/butterfly
Large example: GSM half-rate speech codec in only 12 “MIPS”
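The relation between instruction rate and operation rate described on this slide can be sketched as simple arithmetic. The helper name `mops_from_mips` is hypothetical, and the result is only an upper bound under the assumption that every instruction really performs five operations, which the slide does not claim for whole programs.

```python
def mops_from_mips(mips: float, ops_per_instruction: float) -> float:
    """Convert an instruction rate (MIPS) into an operation rate (MOPS).

    Assumes, for illustration only, that every instruction performs the
    same number of elementary operations.
    """
    return mips * ops_per_instruction


# A Lode-style compound instruction: 2 memory operations,
# 2 address generations and 1 arithmetic operation.
OPS_PER_COMPOUND_INSTRUCTION = 2 + 2 + 1

# Upper bound for a 12 "MIPS" job if every instruction were compound.
upper_bound_mops = mops_from_mips(12, OPS_PER_COMPOUND_INSTRUCTION)
```

The gap between the two numbers is why a MIPS figure from a DSP cannot be compared directly with a MIPS figure from a load/store RISC machine.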

Application Domain: compute intensive functions
Source encoder/decoder = speech coders
Advanced vocoders for improved speech quality & higher capacity. Example: ACELP derivatives for GSM and IS136A
• Digital filtering (FIR, IIR)
• Vector quantization, code book search (square distance computation)
Channel encoder/decoder = error correcting
Complex wireless modems:
• Galois field arithmetic
• Convolutional coders based on Viterbi trellis search
• Turbo coders
Modulation/demodulation =
• Receivers based on Maximum Likelihood Sequence Estimation (again requires fast Viterbi butterfly operations)

Compute intensive functions: evolution of DSP’s
• Simple FIR example
• Square distance
• Speed-up of FIR example
• Viterbi acceleration
Evolution of DSPs follows these examples

Evolution of DSP processors
Generation | Features | Examples
0 (1980) | Von Neumann architecture | DSP-1 (AT&T)
1 (1982) | Basic Harvard architecture | TMS320C10 (TI), NEC7720
2 (1986) | 1 data/program bus, 1 data bus | TMS320C25 (TI), DSP16A (AT&T)
3 (1990) | Extra addressing modes, extra functions | TMS320C5x (TI), DSP16xx (AT&T)
4 (1994) | 2 data busses, 1 program bus | TMS320C54x (TI)
5 (1996 – now) | 2 data busses, 1 program bus, multiple units | Lucent 16xxx, Atmel Lode, Siemens Carmel

DSP Processor Fundamentals
Processor components [Skillikorn-88]:
• Data path processing unit
• Interconnect processing unit
• Memory management unit
• Instruction processing unit

Basic Harvard Architecture
[Figure: program memory, data memory, instruction processing unit, and a multiply-accumulate datapath (16 x 16 mpy, ALU).]
Separate data memory from program memory!
Different from a Von Neumann machine: one address bus - one data bus - one memory space

Example 1: TMS320C10 (1982)
[Block diagram: data RAM (144 x 16), program ROM (1.5K x 16), 16-bit T-register, 16 x 16 multiply, 32-bit P-register, 16-bit barrel shifter (L), 32-bit ALU, 32-bit accumulator, shift L (0,1,4), 2 auxiliary registers, four-level H/W stack, status register, I/O ports (8 x 16); pins D (15-0), A (11-0), PA (7-0) (A 2-0, D 15-0).]
• 160/200 ns instruction cycle time
• 4K word external address reach
• 60 general purpose and DSP specific instructions
• Single cycle multiply
• 16-bit barrel shifter
• External interrupt and polled input pins
• Eight 16-bit I/O ports
• 40-pin DIP/44-pin PLCC
Courtesy: Texas Instruments

Compute Intensive function 1: FIR
[Figure: 50-tap FIR delay line: input x(n), delay elements z^{-1} producing x(n-1) ... x(n-(N-1)), multipliers with coefficients c(0) ... c(N-1), adder chain producing y(n).]
\[ y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i) \]
TMS320C10: LTD / MPY pair repeated for each tap, then APAC (LTD combines LT, DMOV and APAC). 100 words program memory, 100 cycles.
TMS320C25: RPTK 49 followed by MACD (MACD combines LT, DMOV, APAC and MPY). 3 words program memory, 53 cycles.
Single Cycle Multiply - Accumulate!
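As a reference model for the FIR sum on this slide, the following is a minimal Python sketch. The function names `fir_output` and `fir_filter` are illustrative and not taken from the tutorial. The inner loop does one multiply-accumulate per tap, which is the operation a single-cycle MAC instruction such as MACD performs, and the delay-line shift plays the role of the data move.

```python
def fir_output(coeffs, delay_line):
    """One output sample: y(n) = sum over i of c(i) * x(n-i).

    delay_line[i] holds x(n-i), so each loop iteration is one MAC.
    """
    acc = 0
    for c, x in zip(coeffs, delay_line):
        acc += c * x  # multiply-accumulate
    return acc


def fir_filter(coeffs, samples):
    """Run the filter over a block of samples, starting from a zeroed delay line."""
    delay_line = [0] * len(coeffs)
    outputs = []
    for x in samples:
        # Shift the delay line by one position and insert the new sample.
        delay_line = [x] + delay_line[:-1]
        outputs.append(fir_output(coeffs, delay_line))
    return outputs
```

Feeding a unit impulse returns the coefficients themselves, which is a quick sanity check for any FIR implementation.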

Example 2: Single Cycle MAC
TMS320C2x Multiplier/ALU
[Block diagram: data bus (16) and program bus (16); T register (16); multiplier (16x16); P register (32); left shifter (0-16); left shifter (0-7); two MUXes; arithmetic logic unit (ALU); accumulator register (32) with carry (C); 32-bit internal paths.]
• Single-cycle 16x16-bit multiply yielding a 32-bit product
• Supports simultaneous program and two data operand acquisition
• Supports simultaneous ALU and multiplier operations
• 0-16 bit left post-shifter
Courtesy: Texas Instruments

Compute Intensive function 1: FIR (cont.)
[Figure: 50-tap FIR delay line.]
\[ y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i) \]
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));
One output = 2N reads, N MAC’s, 1 write
Classic Harvard: one output = N cycles

FIR speed-up
FIR filtering: two outputs in parallel
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));
Two outputs = 4N reads, 2N MAC’s, 2 writes
Dual MAC architecture with ONLY 2 data busses??
• Run MAC at double frequency: solution by Matsushita
• Read two 32-bit numbers instead of four 16-bit numbers: solution by Lucent 16000 core with dual MAC
• Insert delay register: solution by Atmel’s LODE

Example 3: Lucent DSP16210
Horizontal parallelism, one sample at a time
2G mobile wireless base-stations
[Datapath figure: XDB(32) and IDB(32) buses; X(32) and Y(32) registers; two 16 x 16 multipliers with product registers p0 (32) and p1 (32); shift/saturate units; ALU, ADD and BMU; accumulator file 8 x 40.]
Inner loop of 32-tap FIR filter:
do 14 { //one instruction !
a0=a0+p0+p1
p0=xh*yh p1=xl*yl
y=*r0++ x=*pt0++
}
• Outer loop: 19 cycles, 38 bytes
• 1 cycle in inner loop
• 5 exec units used in inner loop
• 2 MACs per cycle
Courtesy: Gareth Hughes, Bell Labs Australia

FIR on Lode
FIR filter: two outputs in parallel with delay register
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));
Total energy for one output sample:
Energy | Single MAC | Dual MAC | Dual MAC with REG
No. of MAC operations | N | N | N
No. of memory reads | 2N | 2N | N
No. of instruction cycles | N | N/2 | N/2

FIR on Lode
Two MAC units with dedicated bus network
[Figure: DB0(16) and DB1(16) feed MAC0 and MAC1. One MAC accumulates c(i)x(n-i) into A0 as y(n); the other uses the delayed sample from LREG to accumulate c(i)x(n-i+1) into A1 as y(n+1).]
• DB0 fetches coefficient
• DB1 fetches data
• LREG delays input data
• A0 stores y(n) output
• A1 stores y(n+1) output
Same structure can be used for IIR
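The delay-register idea can be modelled in a few lines of Python. This is a behavioural sketch only: the function name `dual_mac_fir` and the read counter are illustrative assumptions, and nothing here reflects the actual Lode instruction encoding. It computes y(n) and y(n+1) together while reading each coefficient and each data sample from memory once.

```python
def dual_mac_fir(coeffs, x, n):
    """Compute y(n) and y(n+1) in one pass with a delay register.

    x is indexable so that x[k] is sample x(k); requires n >= len(coeffs) - 1
    and n + 1 < len(x). Returns (y_n, y_n_plus_1, memory_reads).
    """
    reads = 0
    lreg = x[n + 1]       # prime the delay register with x(n+1)
    reads += 1
    a0 = 0                # accumulates y(n)
    a1 = 0                # accumulates y(n+1)
    for i in range(len(coeffs)):
        c = coeffs[i]     # coefficient fetch (one bus)
        reads += 1
        xi = x[n - i]     # data fetch (other bus)
        reads += 1
        a0 += c * xi      # first MAC:  c(i) * x(n-i)
        a1 += c * lreg    # second MAC: c(i) * x(n-i+1), from the delay register
        lreg = xi         # the delay register now holds x(n-i)
    return a0, a1, reads
```

For N taps the sketch performs 2N + 1 reads for two outputs, about N per output, compared with 2N per output when each output is computed separately. That matches the "Dual MAC with REG" column of the energy table.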

Compute Intensive function 2: Viterbi
Basic algorithm in Viterbi channel decoders and MLSE based receivers, modified version in turbo decoders.
Key operation: Add-Compare-Select (ACS)
[Figure: Viterbi butterfly. States i and i + s/2 connect to states 2i and 2i+1 with branch metrics +a and -a.]
i = state index
s = number of states = 2^{k-1}
w = decoding window
Basic equations:
d(2i) = min { d(i) + a, d(i + s/2) - a }
d(2i + 1) = min { d(i) - a, d(i + s/2) + a }
IS-95: k = 8, w = 192, corresponds to 2^7 x 192 x (cycles for one ACS)
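The two butterfly equations can be written as one trellis-stage update. The sketch below is a plain Python model and not any vendor's implementation. The name `acs_step` and the convention that decision bit 0 means "came from state i" and 1 means "came from state i + s/2" are assumptions made for illustration.

```python
def acs_step(metrics, branch):
    """One trellis stage of add-compare-select over all butterflies.

    metrics[i] is the path metric d(i) for each of the s states.
    branch[i] is the branch metric a for butterfly i (0 <= i < s/2).
    Returns (new_metrics, decisions).
    """
    s = len(metrics)
    half = s // 2
    new_metrics = [0] * s
    decisions = [0] * s
    for i in range(half):
        a = branch[i]
        upper = metrics[i]          # d(i)
        lower = metrics[i + half]   # d(i + s/2)
        candidates = {
            2 * i: (upper + a, lower - a),      # d(2i)
            2 * i + 1: (upper - a, lower + a),  # d(2i + 1)
        }
        for state, (from_upper, from_lower) in candidates.items():
            # Add was done above; now compare and select the smaller metric.
            if from_upper <= from_lower:
                new_metrics[state], decisions[state] = from_upper, 0
            else:
                new_metrics[state], decisions[state] = from_lower, 1
    return new_metrics, decisions
```

Each butterfly costs four additions, two comparisons and two selections, which is why dedicated ACS hardware pays off: the stage is repeated for every state and every received symbol in the decoding window.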

Viterbi on Lode
Two MAC units & ALU: Add-Compare-Select
[Figure: DB0(16) and DB1(16) supply the path metrics Γ1, Γ2 and branch metrics µ1, µ2 to the adders of MAC0 and MAC1 (accumulators A0 and A1); the ALU Min() selects Γ (accumulators A2, A3) and sends the decision bit to memory.]
• DMAC operates as dual add/subtract unit
• ALU finds minimum
• Shortest distance saved
• Path indicator saved
• 4 cycles / butterfly
Γ = min [(Γ1 + µ1), (Γ2 + µ2)]

Viterbi on TI C54x
ALU and CSSU: Add-Compare-Select
[Figure: DB0(16) and DB1(16) and TREG supply µ1, µ2 to the split ALU; the accumulator holds Γ1 and Γ2 in its two halves; the compare unit with MSW/LSW select picks Γ, shifts the decision bit into the TRN register and writes to memory over data bus EB.]
• ALU splits in 16 bit halves
• ACC splits in half
• Shortest distance saved
• CSSU compares halves
• Path indicator saved
• 4 cycles / butterfly
Γ = min [(Γ1 + µ1), (Γ2 + µ2)]

Viterbi on LU DSP16210
GSM (K=5, 16 states)
do 8 {
a0=a4+y a1=a5-y *r3++=a0h
a2=a4-y a3=a5+y *r5++=a2h
a0=cmp1(a1,a0) yh=*r0 r0=r1+j j=k k=*pt1++
a2=cmp1(a3,a2) a4_5h=*pt0++
}
• Hardware support for Viterbi algorithm:
  – ACS calculations are efficient
  – Minimal overhead
• 4 cycles per butterfly
  – 32 cycles per GSM timeslot.
• Comparison functions store ACS decision bits:
[Figure: a0=cmp1(a1,a0) and a2=cmp1(a3,a2) shift decision bits into AR0; results are written to memory.]
Courtesy: Gareth Hughes, Bell Labs Australia

Square distance on Lode
ALU in parallel with MAC: Sum of square distance
Vector quantization in vocoders: vector size N = 50, codebook > 1000
\[ D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^2 \]
[Figure: DB1(16) and DB0(16) supply x(i) and y(i); the ALU subtracts; the MAC squares the difference and accumulates D in A0.]
• ALU performs subtraction and absolute value
• MAC performs squaring and accumulation

Lode Core Architecture
[Figure: Lode core architecture.]

Domain specific instruction set
Basic instruction set for general purpose DSP, e.g. MAC, min, max, etc.
Extra instructions for performance with every new generation, e.g. “square distance and accumulate”:
\[ D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^2 \]
One 32 bit instruction:
a3 = abs (*r0 - *r1 < asr), a0 = a0 + sqr(a3), r0++, r1++;
Bus network and instruction set design go together
CISC, thus compiler unfriendly
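To show what the "square distance and accumulate" instruction replaces, here is a small Python model of the distance computation and the surrounding codebook search. The function names are hypothetical. The two statements in the loop body correspond to the ALU part (subtract, absolute value) and the MAC part (square, accumulate) of the single compound instruction.

```python
def square_distance(x, y):
    """D = sum over i of |x(i) - y(i)|^2 for two equal-length vectors."""
    acc = 0
    for xi, yi in zip(x, y):
        diff = abs(xi - yi)   # ALU: subtraction and absolute value
        acc += diff * diff    # MAC: squaring and accumulation
    return acc


def nearest_codeword(x, codebook):
    """Exhaustive codebook search: index and distance of the closest entry."""
    best_index = 0
    best_dist = square_distance(x, codebook[0])
    for index in range(1, len(codebook)):
        dist = square_distance(x, codebook[index])
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist
```

With the vector size of 50 and a codebook of more than 1000 entries quoted earlier, one search needs over 50,000 loop iterations, so folding the loop body into one instruction has a direct effect on the cycle count.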

Control & Pipeline for DSP’s
RISC: load/store machine, memory access with load/store instructions (DLX, MIPS, D10V)
Pipeline: Fetch | Decode | Execute | Memory Access | Write Back
Execute stage: execution / address generation. Memory access stage: memory access / branch.
Excellent for complex decision making!
DSP: register-memory architecture (TI, Lucent, HX, Lode)
Pipeline: Fetch | Decode | Memory Access | Execute | Write Back
Memory access stage: memory access. Execute stage: execution.
Excellent for number crunching!

Pipeline RISC compared to DSP
Example of a memory intensive DSP operation:
r0 = *p0; // load data
a0 = a0 + r0; // execute
RISC: every instruction flows through Fetch | Decode | Execute | Memory Access, so the load and the dependent add are two separate instructions. Too expensive for DSP.
DSP: every instruction flows through Fetch | Decode | Memory Access | Execute, so the memory operand is read before the execute stage and can be used by the same instruction.
Penalty: data dependent branch is expensive

Other control features
Hardware looping:
• Because software branch is expensive
• “Zero overhead hardware loops” (for tight FIR loops), hardware supported
Interrupts: hardware with shadow registers for extremely fast context switching.
Special instruction cache:
• Single instruction “repeat” buffer
• Multiple instruction cache: under programmer’s control!
• E.g. Lucent DSP16210: 31 x 32 instruction cache
Predictable worst case execution time!

Low Power DSP’s
C54x 1V DSP (Texas Instruments - ISSCC 1997):
• 0.25µ 3LM CMOS, dual Vt process
• 65 M 16b MAC/s at 1.0V
• 0.21 mW/MHz at 1.0V
• 4.0 mW stand-by power
DSP 1600 Core (Lucent - 1609 low cost consumer 16-bit):
• 0.35µ 3LM CMOS
• 80 M 16b MAC/s at 3.3V
• 1.4 mW/MHz at 3.3V
• 30 µW stand-by power

BUT: DSP Software Development
• Complex DSP architecture not amenable to compiler technology
• Algorithms are modeled in high level language (e.g. C++)
• Solutions are implemented and debugged in hand-optimized assembler - large development effort with minimal tool support
Flow: HLL algorithmic model, then prototype code, then production code (hand coded assembler; optimize & debug)
Long, frustrating time to market
Fragile legacy code
Still used in handhelds, but changing in basestations (see Part II)

Mobile Wireless Evolution
First Generation (past)
• Service: mobile telephone service: carphone
• Technology: analog cellular technology; macrocellular systems
• Standards: NMT, TACS, analog AMPS
Second Generation (now)
• Service: digital voice + messaging/data services; fixed wireless loop
• Technology: digital cellular technology + IN emergence; microcellular & picocellular: capacity, quality; enhanced cordless technology
• Standards: GSM, IS-54/136 TDMA, IS-95/cdmaOne, PDC, DECT
Third Generation (year 2000-2005)
• Service: integrated high quality audio and data; narrowband and broadband multimedia services + IN integration
• Technology: Broader Bandwidth Efficient Radio Transmission; information compression; higher frequency spectrum utilization; IN + network management integration
• Standards: WCDMA, UWC-136 TDMA, cdma2000
Fourth Generation (year 2010?)
• Service: telepresencing; education, training and dynamic information access
• Technology: wireless-wireline and broadband transparency; knowledge-based network operations; unified service network
We are entering the decade of wireless data communications - and World-War 3G
Global roaming

Mobile Data Services
• Carriers invest >$500 per subscriber, but subscriber voice calls (and therefore revenues) are falling.
• Data is currently 3% of wireless traffic, projected to exceed 50% by 2005
• Wireless Internet: average internet connection 30 mins
• Text messaging: saturating 2G voice networks
2.5 Generation Mobile Standards [1]
• GPRS: packet data over GSM - timeslot multiplexing, multi-slots per user.
• EDGE: 8-PSK modulation + GPRS, 384 Kbps max to 1 user.
3G - IMT2000 Proposals
• 144 Kbps automobile, 384 Kbps pedestrian, 2 Mbps stationary.
• Several proposals - UWC 136 (200 kHz, TDMA, 8-PSK = EDGE).
• UMTS and CDMA-2000 are both CDMA proposals.

Evolution of Mobile Wireless Network Architecture
[Figure: 2G network compared with IP-based 3G network.]
2G network: base stations, BSC, MSC (mobile switches), network servers, PSTN.
IP-based 3G network: radio clients and base stations on packet connectivity (ATM / IP), connected to the PSTN and to Internet / advanced services, with:
• Circuit mode servers (voice, low speed data, etc.)
• Packet mode servers (high speed data, multimedia, voice over IP, etc.)
• Wireless control servers (feature control, network management, billing, etc.)
Mobile networks are being upgraded in preparation for the delivery of high speed data services.

Mobile Wireless Infrastructure
[Photos: macro-cell GSM basestation (6-12 TRX) and micro-cell GSM basestation (2 TRX).]

2G Basestation Baseband Processing
• Multiple DSPs used for baseband processing.
• RISC microcontroller for timing, framing, I/O control
• Software upgradable over the network
• DSPs dominate cost and power consumption
[Figure: Tx/Rx baseband processing board for 2-carrier GSM basestation: DSPs for channel equalization, channel de/coding and encryption on the Tx and Rx paths, AFEs, RAM, I/O ASIC, and a RISC microcontroller with T1/E1 I/O.]
Future trend - integrate baseband processing - low cost Pico BTS

3G Basestation Baseband Processing
• Increased receiver algorithm sensitivity
• Antenna arrays - smart antennas
• Multi-standard basestations using software radio architecture
• 3G - constraint length 9, rate 1/2 convolutional coding for voice.
• 3G - constraint length 4, Turbo codes for data
Increased DSP performance needed in next-generation basestation
High-performance DSPs + custom logic needed for 3G (Viterbi decoding and Turbo decoding)
[Figure: 3G receiver blocks and where each is implemented.]
• Sliding correlator, despreading (ASIC)
• RAKE combiner, reassemble multipath (DSP, ASIC)
• Deinterleaver (DSP)
• Decoder: Viterbi algorithm, Turbo decoding (DSP, ASIC)
• Code tracking, delay-lock-loop (ASIC, DSP)
• Channel estimation (DSP)
• Code generator: channelisation code, scrambling code (ASIC), two instances
• Synchronisation: cell search, slot sync, frame sync (DSP)
• Path search (ASIC)
• SIR measurement, fast power control (DSP)
• Power control
Courtesy: Bing Xu: Bell Labs Australia

Receiver Algorithms for GSM Basestation
• Enhanced receiver sensitivity
• Larger cells in suburban areas = reduced network cost
• Mobile transmits with less power = increased battery life
Existing receiver: estimating wireless channel, equalizing multi-path effects, channel decoding, speech decoding
New iterative receiver: the same four stages with speech statistics added, 1.3 dB improvement
Challenge - requires 6x DSP MIPS of existing receiver in basestation
Courtesy: Magnus Sandell: Bell Labs UK

Smart Antennas
[Figure: omnidirectional cell site, three-sector cell site, intelligent antenna cell site.]
• A multiple antenna element system
• Combined with a base station architecture and signal processing techniques designed to dynamically select or form the “optimum” beam pattern per user
Increased cost in RF electronics and enhanced DSP requirements.

Fixed Multi-Beam Versus Adaptive Beam
Fixed multi-beam: select from, or use, multiple “fixed” antenna beams to optimize performance.
Adaptive beam: adaptively “weight” and combine multiple antenna elements to optimize performance.
[Figures: each case shows Mobile 1 and Mobile 2 with direct and reflected rays and an interferer.]
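The "weight and combine" step can be illustrated with a very small Python model. Everything specific here is an assumption for illustration and does not come from the tutorial: a uniform linear array with half-wavelength spacing, a narrowband signal, and weights that are simply steered toward a known direction. A truly adaptive system would keep updating the weights from measurements, which this sketch does not do.

```python
import cmath
import math


def steering_vector(n_elements, angle_rad, spacing_wavelengths=0.5):
    """Per-element phase factors for a plane wave arriving from angle_rad
    on a uniform linear array (hypothetical geometry)."""
    return [
        cmath.exp(-2j * math.pi * spacing_wavelengths * m * math.sin(angle_rad))
        for m in range(n_elements)
    ]


def combine(weights, snapshot):
    """Weighted sum of the antenna element samples: y = sum of conj(w_m) * x_m."""
    return sum(w.conjugate() * x for w, x in zip(weights, snapshot))


# Weights steered toward a user at 0.3 rad, normalised to unit gain.
N_ELEMENTS = 4
user_angle = 0.3
weights = [w / N_ELEMENTS for w in steering_vector(N_ELEMENTS, user_angle)]

gain_toward_user = abs(combine(weights, steering_vector(N_ELEMENTS, user_angle)))
gain_toward_interferer = abs(combine(weights, steering_vector(N_ELEMENTS, -0.8)))
```

The combining itself is one complex multiply-accumulate per antenna element per sample, for every user, which gives a feel for the enhanced DSP requirements mentioned for smart antennas.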

Digital Radio Trends - Software Radio
[Figure: multi-standard basestation with antennas, AMP, RF/analog processing, A/D, digital processing, network interface and network.]
Digital processing: DSPs - higher speed, more powerful. Filtering, modulation, demodulation, equalization, rake receiver, correlator, channel coding, encryption, diversity . . .
RF/IF: linear amplification, combining, higher dynamic range, smaller. Amplifiers, mixers, filters . . .

Wideband Receiver Architecture
[Figure: RF-IF & filter, high speed A/D, digital channeliser, then baseband processing per channel (CH1 ... CHM). Spectrum sketches show channels CH1, CH2, CH3 ... CHM at fRF and at fIF, and individual channels CH1 ... CHM at fBB.]
Increased DSP performance needed for Software Radio

Turbo Codes
• Parallel concatenation of convolutional codes is used to give the codes structure so they can be decoded
• Pseudorandom interleaving is used to give the codes performance which approaches that for random coding
• Resulting encoder structure: two Recursive Systematic Convolutional (RSC) codes
[Figure: the input goes directly to the systematic output and to Encoder #1; an interleaver feeds Encoder #2; a MUX combines the two encoder outputs into the parity output.]
For 3G Wireless (UMTS and CDMA2000)
• Voice service: BER requirement 10^-3
• Data service: BER requirement 10^-5
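The parallel-concatenation structure can be sketched in Python as follows. Several details are assumptions made for illustration rather than taken from the slides: the constraint-length-4 generator polynomials (feedback 1 + D^2 + D^3, feedforward 1 + D + D^3), the seeded `random.Random` shuffle standing in for a real pseudorandom interleaver, and the omission of trellis termination and puncturing.

```python
import random


def rsc_parity(bits):
    """Parity stream of a recursive systematic convolutional encoder.

    Assumed generators: feedback 1 + D^2 + D^3, feedforward 1 + D + D^3.
    """
    r1 = r2 = r3 = 0                   # shift register contents (D, D^2, D^3)
    parity = []
    for u in bits:
        a = u ^ r2 ^ r3                # recursive (feedback) bit
        parity.append(a ^ r1 ^ r3)     # feedforward taps
        r1, r2, r3 = a, r1, r2         # shift
    return parity


def turbo_encode(bits, seed=0):
    """Rate-1/3 output: systematic bit, parity 1, parity 2 for each input bit.

    Returns (encoded_bits, permutation); the permutation is the interleaver.
    """
    permutation = list(range(len(bits)))
    random.Random(seed).shuffle(permutation)   # stand-in pseudorandom interleaver
    interleaved = [bits[p] for p in permutation]
    parity1 = rsc_parity(bits)
    parity2 = rsc_parity(interleaved)
    encoded = []
    for s, p1, p2 in zip(bits, parity1, parity2):
        encoded += [s, p1, p2]                 # MUX
    return encoded, permutation
```

The decoder needs the same permutation to line up the two parity streams, which is why the sketch returns it alongside the encoded bits.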

Turbo Decoding
• Key idea: iterative decoding (up to 10 iterations for 3G)
• There is one decoder for each elementary encoder.
• Each decoder estimates the a-posteriori probability (APP) of each data bit.
• The APP’s are used as a priori information by the other decoder.
[Figure: a DeMUX splits the received stream into systematic data and parity data; Decoder #1 and Decoder #2 exchange APP values through interleavers and a deinterleaver; the final output is hard bit decisions.]

Soft-Output Decoding Algorithms
Requirements for Turbo:
– Accept soft inputs in the form of a priori probabilities
– Produce APP estimates of the data.
– “Soft-Input Soft-Output”
Trellis-based estimation algorithms:
• Sequence estimation: Viterbi algorithm, SOVA, improved SOVA
• Symbol-by-symbol estimation: MAP algorithm, log-MAP, max-log-MAP
SOVA and log-MAP use modified Add-Compare-Select operations - not only select the maximum path metric - but also need to keep the difference.
Today’s high-performance DSPs are highly MAC-focussed (for filtering in modem applications). Some DSPs provide hardware support for efficient implementation of Viterbi - none support SOVA or log-MAP
Iterative channel estimation also uses Soft-Input Soft-Output decoders.

The Maximum A Posteriori (MAP) Algorithm
Log-likelihood ratio:
\[ L(d) = \ln\frac{\Pr[d=1]}{\Pr[d=0]} \]
\[ L(d \mid \mathbf{y}) = \ln\frac{\Pr[d=1 \mid \mathbf{y}]}{\Pr[d=0 \mid \mathbf{y}]} = \ln\frac{p(\mathbf{y} \mid d=1)}{p(\mathbf{y} \mid d=0)} + L(d) \]
• A priori value of Pr[d=1], Pr[d=0]
• Output of decoder contains additional extrinsic information
• The sum of the a priori information and the extrinsic information will be the a priori information for the next stage of decoding, whether that is the 2nd decoder or the 1st decoder in the next iteration
\[ L(u_k) = \ln\frac{\Pr[u_k=1 \mid \mathbf{y}]}{\Pr[u_k=0 \mid \mathbf{y}]} = \ln\frac{\sum_{(s',s):\,u_k=1} p(s',s,\mathbf{y})}{\sum_{(s',s):\,u_k=0} p(s',s,\mathbf{y})} \]
where 1) u_k is the kth bit of the desired data sequence, 2) y is the observed sequence, 3) the state transitions from state s’ at time k-1 to state s at time k, and 4) we want to evaluate this LLR for every k.
Break the probability computation into:
\[ p(s',s,\mathbf{y}) = p(s', \mathbf{y}_{j<k}) \cdot p(s, \mathbf{y}_k \mid s') \cdot p(\mathbf{y}_{j>k} \mid s) \]
Alpha: \[ \alpha_{k-1}(s') = p(s', \mathbf{y}_{j<k}) \]
Gamma: \[ \gamma_k(s',s) = p(s, \mathbf{y}_k \mid s') \]
Beta: \[ \beta_k(s) = p(\mathbf{y}_{j>k} \mid s) \]

Gamma, Alpha and Beta Calculations
Gamma: calculated from known bits up to k, needs to be stored
\[ \gamma_k(s',s) = p(s, \mathbf{y}_k \mid s') = P(s \mid s') \cdot p(\mathbf{y}_k \mid s', s) = P(u_k) \cdot p(\mathbf{y}_k \mid u_k) \]
where $P(u_k)$ is calculated from the a priori information and $p(\mathbf{y}_k \mid u_k)$ is calculated from the received bits
Alpha: calculated by a forward recursion through the trellis based on Gamma
\[ \alpha_k(s) = \sum_{s'} \gamma_k(s',s) \cdot \alpha_{k-1}(s') \]
Beta: calculated by a backward recursion from the end of the trellis
\[ \beta_{k-1}(s') = \sum_{s} \gamma_k(s',s) \cdot \beta_k(s) \]
[Figure: window algorithm, showing the Alpha, Gamma and Beta computations over the trellis with dummy Beta’s.]
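The forward and backward recursions can be checked on a toy trellis. The Python sketch below works in the probability domain with made-up gamma values. The function name `forward_backward`, the two-state trellis and the numbers are assumptions for illustration only. In the code, `gamma[k][sp][s]` stands for the transition metric from state `sp` to state `s` at step k+1, since Python lists are zero-based.

```python
def forward_backward(gamma, alpha_start, beta_end):
    """Alpha (forward) and beta (backward) recursions over a trellis.

    gamma[k][sp][s]: transition metric from state sp to state s at step k.
    Returns (alpha, beta), each a list of per-state lists with len(gamma)+1 entries.
    """
    n_steps = len(gamma)
    n_states = len(alpha_start)

    # Forward: alpha_k(s) = sum over s' of gamma_k(s', s) * alpha_{k-1}(s')
    alpha = [list(alpha_start)]
    for k in range(n_steps):
        prev = alpha[-1]
        alpha.append([
            sum(gamma[k][sp][s] * prev[sp] for sp in range(n_states))
            for s in range(n_states)
        ])

    # Backward: beta_{k-1}(s') = sum over s of gamma_k(s', s) * beta_k(s)
    beta = [None] * (n_steps + 1)
    beta[n_steps] = list(beta_end)
    for k in range(n_steps - 1, -1, -1):
        beta[k] = [
            sum(gamma[k][sp][s] * beta[k + 1][s] for s in range(n_states))
            for sp in range(n_states)
        ]
    return alpha, beta


# Toy two-state trellis with three steps (values are arbitrary).
toy_gamma = [
    [[0.5, 0.2], [0.1, 0.4]],
    [[0.3, 0.3], [0.6, 0.1]],
    [[0.2, 0.7], [0.5, 0.5]],
]
alpha, beta = forward_backward(toy_gamma, [1.0, 0.0], [1.0, 1.0])
totals = [sum(a * b for a, b in zip(alpha[k], beta[k])) for k in range(len(alpha))]
```

A useful consistency check is that the sum of alpha times beta over the states is the same at every trellis step, because it equals the total probability of the observed sequence. Note that the backward pass needs every gamma value again, which is the storage requirement mentioned for Gamma.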

Log MAP and MAX-log MAP
Compute logarithms of alpha, beta and gamma, which means we compute:
\[ \ln\left(e^{\delta_1} + e^{\delta_2}\right) \]
Log-MAP, where $f_c$ is the correction function (implemented as a table):
\[ \ln\left(e^{\delta_1} + e^{\delta_2}\right) = \max(\delta_1, \delta_2) + f_c(|\delta_1 - \delta_2|) \]
MAX-Log-MAP:
\[ \ln\left(e^{\delta_1} + e^{\delta_2}\right) \approx \max(\delta_1, \delta_2) \]
[Figure: BER (10^-1 down to 10^-6) for MaxlogAPP and LogAPP, horizontal axis from 2.2 to 2.9.]
MAX-log MAP loses approximately 0.5 dB relative to log MAP.
For log-MAP, a small correction table is needed (approx. 6 non-zero values). The absolute difference is used as the table look-up index. We need the difference!
Courtesy: Bing Xu: Bell Labs Australia
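The difference between the two approximations is easy to see in code. The sketch below uses the exact correction term ln(1 + e^(-|δ1 - δ2|)), which is the standard identity behind log-MAP. The table step of 0.5 and the table length of 8 entries are arbitrary assumptions for illustration and are not the table used in the tutorial.

```python
import math


def max_log(d1, d2):
    """MAX-log-MAP: drop the correction term."""
    return max(d1, d2)


def correction(diff):
    """Exact correction term f_c(|diff|) = ln(1 + exp(-|diff|))."""
    return math.log1p(math.exp(-abs(diff)))


def log_map(d1, d2):
    """Log-MAP: exact value of ln(exp(d1) + exp(d2))."""
    return max(d1, d2) + correction(d1 - d2)


# Hypothetical small look-up table indexed by the quantised absolute difference.
TABLE_STEP = 0.5
CORRECTION_TABLE = [round(correction(i * TABLE_STEP), 3) for i in range(8)]


def log_map_table(d1, d2):
    """Log-MAP with the correction read from the small table."""
    index = int(abs(d1 - d2) / TABLE_STEP)
    if index < len(CORRECTION_TABLE):
        return max(d1, d2) + CORRECTION_TABLE[index]
    return max(d1, d2)  # correction is negligible for large differences
```

Both table-based log-MAP and SOVA need the absolute difference between the two candidate metrics as well as their maximum, which a plain Viterbi compare-select unit discards.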

High Performance DSP Requirements
• Very high levels of DSP integer performance
• Scalability to meet wide range of cost, power, performance.
• Large memory and I/O bandwidth.
• Friendly, compiler driven, programming environment.
• Support for complex real-time synchronous applications (latency, predictable throughput, synchronization)
• Cost & power efficient solution.
[Figure: Some DSP Applications. MOPS (log scale, 10 to 100K) versus year (1997, 1999, 2001), with regions labelled traditional DSP and 3G Wireless. Applications shown: V.34, K56, GSM term, PCS term, ADSL 500k, ADSL 6M, 24 ch. modem, DAB rcvr, 16 HR GSM, 1G eth. xcvr, set-top box, MPEGII encode, soft radio, 3-D graphics?]

Compiler Driven VLIW
Large orthogonal register set, regular interconnect
[Figure: data memory, register array and interconnect feeding execution units ex1 (alu), ex2 (alu), ex3 (mpy), ex4 (ld/st) ... exn (ld/st).]
Instruction format: cond/branch | ex1 | ex2 | ex3 | ... | exn
Atomic RISC-like operations => heavily pipelined, high freq. clock

Explicitly Parallel Instruction Computing
Execution clusters
[Figure: data memory shared by two clusters, each with its own register array and interconnect; execution units ex1 (alu), ex2 (alu), ex3 (ld/st), ex4 (alu), ex5 (mpy), ex6 (ld/st).]
Execution sets
[Figure: a fetch set is divided into execution sets, marked by one bit per instruction (1 1 1 0 1 0 1 0).]

Explicitly Parallel Instruction Computing
Predicated (guarded) execution: cond | any instruction
- eliminates branches - improves compiler efficiency
- eliminates branches - removes pipeline bubbles
- fill delayed branch slots with predicated instructions
Instruction modifiers: modifier | instr1 | instr2 | instr3 | instr4
- allows shorter instruction length
- extend register addressing
- predication
- execution set identifier
- looping
- extended operations

Texas Instruments ‘C6201
[Figure: program memory (16K x 32) with a 256-bit fetch into instruction dispatch & decode; two clusters, each with ALU, shift, mpy and add units, on register bank A (16 x 32) and register bank B (16 x 32); data memory (32K x 16).]
• 8-way VLIW with two execution clusters
• 256 bit (8x32) instruction fetch with variable length execute set
• Each 32 bit instruction individually predicated
• 11 stage pipeline
• 1600 MIPS, 400 MMACs @ 200 MHz

FIR Filter on TI ‘C6x
Hand-coded assembly: 32-tap FIR filter
loop:
ldw .d1t1 *a4++,a5
|| ldw .d2t2 *b4++,b5
||[b0] sub .s2 b0,1,b0
||[b0] b .s1 loop
|| mpy .m1x a5,b5,a6
|| mpyh .m2x a5,b5,b6
|| add .l1 a7,a6,a7
|| add .l2 b7,b6,b7
• Outer loop: 23 cycles, 180 bytes
  – 1 cycle in inner loop
• All 8 exec units used in inner loop - maximum efficiency
  – 2 MACs per cycle
Assembly syntax more difficult to learn.
Hard to get full use of all 8 execution units at once.
Software pipelining difficult to implement, and requires longer prolog/epilog (larger code size).
Courtesy: Gareth Hughes: Bell Labs Australia

Viterbi on TI ‘C6x
LOOP:
 [b1] b .s1 LOOP
||[b1] sub .s2 b1,1,b1
||[!a2] sth .d1 b12,*+a6[8]
||[!a2] add .d2 b0,b14,b14
|| cmpgt .l1 a11,a10,a1
|| cmpgt .l2 b11,b10,b0
|| mpy .m1x 1,b5,a4

 [a2] sub .s1 a2,1,a2
||[!a2] sth .d1 a12,*a6++
||[a1] add .s2 2,b0,b0
||[b0] mpy .m2 1,b11,b12
|| mpy .m1 1,a10,a12
|| sub .l2x a7,b5,b10
|| ldh .d2 *++b9,b5

 shl .s2 b14,2,b14
||[a1] mpy .m1 1,a11,a12
|| add .s1 a7,a4,a10
|| sub .l1x b13,a4,a11
|| add .l2 b13,b5,b11
|| mpy .m2 1,b10,b12
|| ldh .d2 *b4++[2],a7
|| ldh .d1 *a5++[2],b13
; end of LOOP
Utilization of execution units in Viterbi decoder
[Table: cycle-by-cycle schedule for cycles 0-31, one row per execution unit (.D1, .D2, .M1, .M2, .L1, .L2, .S1, .S2). The 3-cycle, 2-ACS inner loop repeats 8 times (STH/LDH/ADD on the .D units, MPY moves on the .M units, CMPGT/ADD/SUB on the .L units, B/ADD/SUB/SHL on the .S units), followed by the loop epilog.]
• 16-state Viterbi decoder for GSM from TI WWW site: ftp://ftp.ti.com/pub/tms320bbs/c62xfiles/vitgsm.asm
  – 3 cycles per butterfly
  – 32 cycles per GSM timeslot (8 butterflies)
  – MPY instructions used to move data

Lucent / Motorola Star*Core SC140
[Figure: program / data memory; program sequencer and instruction dispatcher; address registers (27) with two AAUs; data registers (16) with four MAC/ALU/BFU units.]
• 6-way VLIW with 128 bit (8x16) instruction fetch
• Prefix instructions for high performance without sacrificing code density
• Each execution set (parallel instructions + prefix) predicated
• 5 stage pipeline
• 1800 MIPS, 1200 MMACs @ 300 MHz
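The headline MIPS and MMAC figures for the two VLIW cores follow from issue width, multiplier count and clock rate. The small Python sketch below reproduces that arithmetic. The helper names are hypothetical, and the multiplier counts (two mpy units on the ‘C6201, four MAC units on the SC140) are read from the block diagrams in this tutorial.

```python
def peak_mips(issue_width: int, clock_mhz: int) -> int:
    """Peak instruction rate when every issue slot is filled each cycle."""
    return issue_width * clock_mhz


def peak_mmacs(mac_units: int, clock_mhz: int) -> int:
    """Peak multiply-accumulate rate when every multiplier is busy each cycle."""
    return mac_units * clock_mhz


# TI 'C6201: 8-way VLIW, two multipliers, 200 MHz.
c6201 = (peak_mips(8, 200), peak_mmacs(2, 200))

# Star*Core SC140: 6-way VLIW, four MAC units, 300 MHz.
sc140 = (peak_mips(6, 300), peak_mmacs(4, 300))
```

These are peak numbers. Sustained rates depend on how many execution slots the compiler or the assembly programmer manages to fill in each execution set.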
31
61ISSCC 2000, DSP Tutorial. Ingrid Verbauwhede, Chris Nicol
Viterbi on Star*Core
• Hardware support for Viterbi algorithm:
– max2vit instruction
– vsl instruction
• 1 cycle per butterfly through software-pipelining
• Decision bits are manually stored using the Viterbi Shift Left (VSL) instruction:
GSM (K=5, 16 states)
[ move.2l (r0)+,d0:d1   move.2l (r1)+,d1:d2 ]
[ add2 d0,d4   sub2 d6,d2
  sub2 d4,d0   add2 d2,d6 ]
[ max2vit d4,d2   max2vit d0,d6 ]
[ vsl.4w d2:d6:d1:d3,(r2)+n0
  vsl.4f d2:d6:d1:d3,(r3)+n0 ]
[Figure: data flow of max2vit d4,d2 / max2vit d0,d6 and vsl.4w d2:d6:d1:d3,(r2)+n0. Labels: SR, D1, D3, D2, D6, path metrics, decisions, results written to memory, x 4]
Courtesy: Gareth Hughes: Bell Labs Australia
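The idea of collecting decision bits by shifting them into a register can be sketched as follows. This does not model the exact semantics of `max2vit` or `vsl`; the function names, the 16-bit width and the bit ordering (newest decision in the least significant bit) are assumptions chosen for the illustration.

```python
def shift_in_decisions(word, decisions, width=16):
    """Shift decision bits into `word`, newest bit in the LSB.

    Bits shifted past `width` are discarded, as in a fixed-width register.
    """
    mask = (1 << width) - 1
    for d in decisions:
        word = ((word << 1) | (d & 1)) & mask
    return word


def unpack_decisions(word, count):
    """Recover the last `count` decision bits, oldest first."""
    return [(word >> (count - 1 - i)) & 1 for i in range(count)]
```

Packing the decisions this way lets one store write many survivor decisions at once, which is what keeps the inner loop short.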
Log-MAP on Star*Core
Star*Core code for log-MAP Butterfly
Cycle 1:   move.w (r0)+,d0      move.w (r1)+,d1
Cycle 2:   add d0,d6,d0         sub d6,d0,d5
Cycle 3:   sub d6,d1,d4         add d1,d6,d1
Cycle 4:   sub d0,d4,d2         sub d1,d5,d3
Cycle 5:   max d0,d4            max d1,d5
Cycle 6:   abs d2               abs d3
Cycle 7:   move.l d2,n0
Cycle 8:   move.l d3,n0         move.w (r6+n0),d2
Cycle 9:   add d4,d2,d4         move.w (r6+n0),d3
Cycle 10:  add d5,d3,d5
Cycle 11:  move.2w d4:d5,(r2)+
[Figure: data flow. Inputs d0: a, d1: b, d6: x. Sums and differences d0: a+x, d1: b+x, d5: a-x, d4: b-x. Differences d2: d0-d4, d3: d1-d5. max: d4: max(d0,d4), d5: max(d1,d5). n0: |d2| and n0: |d3| index the table at r6, giving d2 and d3. Results d4: d4+d2, d5: d5+d3.]
This code uses 2 of the 4 ALUs and can be software pipelined to achieve 6 cycles per LOG-MAP Butterfly
Courtesy: Gareth Hughes: Bell Labs Australia
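The data flow on this slide (max, absolute difference, table lookup, add) is the usual log-MAP "max*" operation, max(p, q) + ln(1 + e^-|p-q|), with the correction term read from a small table. The sketch below is a floating-point illustration under assumed parameters: the table size, the quantisation step and the function names are not taken from the slide.

```python
import math


def build_correction_table(size=8, step=0.5):
    """table[i] = ln(1 + exp(-i * step)); size and step are assumptions."""
    return [math.log1p(math.exp(-i * step)) for i in range(size)]


def max_star(p, q, table, step=0.5):
    """max(p, q) plus a table-based correction indexed by |p - q|."""
    idx = int(abs(p - q) / step)
    # Beyond the table the correction is treated as zero.
    corr = table[idx] if idx < len(table) else 0.0
    return max(p, q) + corr


def logmap_butterfly(a, b, x, table, step=0.5):
    """Mirror the slide's data flow.

    d4 = max*(a + x, b - x) and d5 = max*(b + x, a - x).
    """
    d4 = max_star(a + x, b - x, table, step)
    d5 = max_star(b + x, a - x, table, step)
    return d4, d5
```

Dropping the table lookup reduces this to the max-log-MAP approximation, which is the Viterbi-style compare-select.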
Parallel DSP Architectures
Arch.      Parallelism                      Compile?   Power?
S/scalar   Dynamic instruction level
VLIW       Static instruction level
SIMD       Highly regular, data dependent
MIMD       Task level
MIMD with VLIW / SIMD provides high order parallel execution
The future of high performance DSPs is MIMD
Daytona: A Multiprocessor DSP Architecture
[Block diagram: programmable Processing Elements (PE) and a hardware accelerator on one chip, connected by a 128-bit split transaction bus with arbitration and synchronization; an I/O subsystem with buffered I/O, I/O interfaces and external memory]
• Scalable architecture: multiple programmable DSPs on a single chip
• 1 bus supports different programmable DSPs and microcontrollers
Split Transaction Bus
[Block diagram: PE and Memory Controller connected by an Address Bus (100 MHz) carrying addr + ID and a Data Bus (128 bits, 100 MHz) carrying data + ID, each bus with its own round-robin arbiter]
• Separate address and data busses, each with pipelined protocol
• Separate bus arbitration
• Multiple outstanding transactions, varying size/priority
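A split transaction bus decouples the address phase from the data phase, so a slow access does not block a fast one. The toy model below is a sketch under simplifying assumptions: only the address bus uses the round-robin arbiter, the data bus returns at most one ready transaction per cycle in order of readiness, and the class and function names are hypothetical.

```python
from collections import deque


class RoundRobinArbiter:
    """Grant the bus to requesters in rotating priority order."""

    def __init__(self, n):
        self.n = n
        self.last = n - 1  # requester 0 has first priority initially

    def grant(self, requests):
        """requests: set of requester indices. Returns the winner or None."""
        for offset in range(1, self.n + 1):
            cand = (self.last + offset) % self.n
            if cand in requests:
                self.last = cand
                return cand
        return None


def simulate(pending, latency, n_pe):
    """Run reads over a split bus.

    pending: dict pe -> deque of addresses to read.
    latency: dict addr -> cycles between address phase and data ready.
    Returns a list of (cycle, txn_id, pe, addr) in data-return order.
    """
    arbiter = RoundRobinArbiter(n_pe)
    in_flight = []  # (ready_cycle, txn_id, pe, addr)
    completed = []
    next_id = 0
    cycle = 0
    while any(pending.values()) or in_flight:
        # Address phase: one granted request per cycle, tagged with an ID.
        pe = arbiter.grant({p for p, q in pending.items() if q})
        if pe is not None:
            addr = pending[pe].popleft()
            in_flight.append((cycle + latency[addr], next_id, pe, addr))
            next_id += 1
        # Data phase: return one ready transaction; the ID identifies it.
        ready = [t for t in in_flight if t[0] <= cycle]
        if ready:
            txn = min(ready)
            in_flight.remove(txn)
            completed.append((cycle, txn[1], txn[2], txn[3]))
        cycle += 1
    return completed
```

With one slow and one fast read outstanding, the fast read completes first even though it was issued second; the transaction ID is what lets the requester match data to its request.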
Memory Hierarchy in MIMD DSPs
• Heterogeneous mix of applications
– Mix of different applications (e.g. equalisation, convolutional decoding)
• Multiple copies of same software: shared memory multiprocessing
– Multiple copies of 1 application (e.g. odd/even slot channel equalisation)
Flat Memory Architecture vs. Hierarchical Memory Architecture
– Flat: each DSP has its own SRAM, 2 copies of software (inefficient)
– Hierarchical: each DSP has a cache in front of shared DRAM, 1 copy of software
Shared Memory Multiprocessing
64 Semaphores provided for process synchronization
[Diagram: four DSPs with caches on the bus; one DSP issues a coherent transaction to access shared data; the other caches snoop (miss, hit, miss); Memory Controller]
• Access to shared data uses a coherent transaction.
• Caches “snoop” the address and query their tag RAMs.
• A cache hit prevents the memory controller from servicing the request.
L-1 cache coherency using a snoopy protocol (modified MESI used)
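The snooping behaviour described on this slide can be sketched with a toy model. It deliberately omits the MESI line states and write handling, and the `Cache` class and `coherent_read` function are hypothetical names: the only behaviour shown is that a snoop hit in another cache supplies the data instead of the memory controller.

```python
class Cache:
    """A minimal cache: a dict stands in for tag RAM plus data RAM."""

    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> data

    def snoop(self, addr):
        """Query the tags for an address seen on the bus."""
        return addr in self.lines


def coherent_read(requester, caches, memory, addr):
    """Read shared data with a coherent transaction.

    Every other cache snoops the address. On a hit, that cache supplies
    the data and the memory controller does not service the request.
    Returns (data, source_name).
    """
    for cache in caches:
        if cache is not requester and cache.snoop(addr):
            data = cache.lines[addr]
            requester.lines[addr] = data
            return data, cache.name
    data = memory[addr]  # all snoops missed: memory controller responds
    requester.lines[addr] = data
    return data, "memory"
```

In the example a value held only in one cache (for instance a modified line) is returned from that cache, so the requester never sees the stale copy in memory.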
Daytona Multiprocessor DSP Chip
Bell Laboratories Research Chip for 3G Wireless Base-stations / Head-end xDSL
[Block diagram: four processing elements, each with a 64-b 4-MAC SIMD DSP, a 32-b RISC and cache memory, on a 128-b Split Transaction Bus with Arbiter, Semaphore, Host Interface, I/O & Memory Controller, and Test & JTAG Port]
Chip Characteristics
Core Area   120 mm²
Speed       100 MHz
Power       4 W
Tech        0.25 µm
Paper 4.2, ISSCC2000
Photomicrograph of Daytona Test Chip
[Die photo labels: Processing Element (PE) x 3, 8KB Re-configurable Memory, DLL, SPARC, Vector Unit (RVU), BUS INT, HDS, LRU, Split Transaction Bus, I/O Subsystem, Arbiter, Semph]
Paper 4.2, ISSCC2000
Acknowledgements
The following people contributed to the work in this tutorial:
Low Power DSPs for Wireless
• Wanda Gass: Texas Instruments
• Mihran Touriguian: Atmel
High Performance DSPs for Wireless Infrastructure
• Bryan Ackland: Bell Labs US - High Perf. DSP Architecture
• Gareth Hughes: Bell Labs Australia - LU DSP16210, ‘C6x and Starcore benchmarks
• Bing Xu: Bell Labs Australia - SOVA, MAP, LOG-MAP
• Ran-Hong Yan: Bell Labs UK - 3G Wireless
• Daytona Team: (J Williams, K.J. Singh, J. Othmer, B. Ackland), Bell Labs US.
References
[1] P. Lapsley, J. Bier, A. Shoham, E. Lee, “DSP Processor Fundamentals,” IEEE Press, New York, 1997.
[2] D. Skillicorn, “A Taxonomy for Computer Architectures,” Computer Magazine, Nov. 1988.
[3] H. Kabuo, M. Okamoto, I. Tanaka, H. Yasoshima, S. Marui, M. Yamasaki, T. Sugimura, K. Ueda, T. Ishikawa, H. Suzuki, R. Asahi, “An 80 MOPS-Peak High-Speed and Low-Power-Consumption 16-b Digital Signal Processor,” IEEE Journal of Solid-State Circuits, Vol. 31, No. 4, April 1996, pp. 494-503.
[4] E. A. Lee, D. G. Messerschmitt, Digital Communication, Boston: Kluwer Academic Publishers, 1988.
[5] W. Lee et al., “A 1V DSP for Wireless Communications,” Proceedings IEEE International Solid-State Circuits Conference, pp. 92-93, February 1997.
[6] S. Lin and J. Costello Jr., Error Control Coding: Fundamentals and Applications, Prentice Hall, New Jersey, 1983.
[7] Lucent 16000, http://www.lucent.com/micro/ or http://www.lucent.dk/micro/dsp16000/
[8] Thomas Parsons, Voice and Speech Processing, McGraw-Hill Book Company, New York, 1987.
[9] TMS320C54x User’s Guide, available from the Texas Instruments Literature Response Center.
[10] I. Verbauwhede, M. Touriguian, “A Low Power DSP Engine for Wireless Communications,” Journal of VLSI Signal Processing 18, pp. 177-186, 1998, Kluwer Academic Publishers.
[11] I. Verbauwhede, M. Touriguian, “Wireless digital signal processors,” chapter in Digital Signal Processing for Multimedia Systems, edited by K.K. Parhi, T. Nishitani, Marcel Dekker, New York, 1999.
[12] M. Okamoto, K. Stone, T. Sawai, H. Kabuo, S. Marui, M. Yamasaki, Y. Uto, Y. Sugisawa, Y. Sasagawa, T. Ishikawa, H. Suzuki, N. Minamida, R. Yamanaka, K. Ueda, “A High Performance DSP Architecture for Next Generation Mobile Phone Systems,” 1998 IEEE DSP Workshop.
[13] Lode specifications, available from www.atmel.com
[14] M.W. Oliphant, “The Mobile Phone meets the Internet,” IEEE Spectrum, pp. 20-28, Aug. 1999.
[15] L. C. Godara, “Application of Antenna Arrays to Mobile Communications: Part 1,” Proc. IEEE, Vol. 85, No. 7, pp. 1031-1060, July 1997.
[16] G. D. Forney, Jr., “Maximum Likelihood Sequence Estimation of Digital Sequences in the Presence of Intersymbol Interference,” IEEE Trans. Inform. Theory, Vol. IT-18, pp. 363-378, May 1972.
[17] C. Berrou, A. Glavieux, P. Thitimajshima, “Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes (1),” Proc. ICC’93, May 1993.
[18] J. Hagenauer, P. Hoeher, “A Viterbi Algorithm with Soft-Decision Outputs and its Applications,” Proc. Globecom 89, Nov. 1989, pp. 47.1.1-47.1.7.
[19] L. Bahl, J. Cocke, F. Jelinek, J. Raviv, “Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate,” IEEE Trans. Inform. Theory, Vol. IT-20, pp. 284-287, Mar. 1974.
[20] J. Turley, H. Hakkarainen, “TI’s new ‘C6x DSP Screams at 1600 MIPS,” Microprocessor Report, Vol. 11, No. 2, pp. 14, Feb. 1997.
[21] “Starcore Launched First Architecture,” Microprocessor Report, Vol. 12, No. 14, pp. 22, Oct. 1998.
[22] B. Ackland & P. D’Arcy, “A New Generation of DSP Architectures,” Proc. IEEE CICC99, Paper 25.1.1.
[23] J. Williams, K.J. Singh, C.J. Nicol, B. Ackland, “A 3.2 GOPs Multiprocessor DSP for Communication Applications,” Proc. IEEE ISSCC2000, Paper 4.2.