
Page 1: SWAPs:  Re-thinking mobile and base-station architectures

RICE UNIVERSITY

SWAPs: Re-thinking mobile and base-station architectures

Sridhar Rajagopal

VLSI Signal Processing Group
Center for Multimedia Communication
Department of Electrical and Computer Engineering
Rice University, Houston TX 77005

March 23, 2003

This work has been supported in part by Nokia, TI, TATP and NSF

Page 2: SWAPs:  Re-thinking mobile and base-station architectures

Future wireless devices:

High data rate mobile devices with multimedia

Seamless connection across environments and standards

Use the fastest and cheapest available service

[Figure labels: Bluetooth/Home Networks; Wireless Cellular; Wireless LAN]

Page 3: SWAPs:  Re-thinking mobile and base-station architectures

Aim of the talk

How do I build such a device?
    Challenges
    Constraints
    Solutions

Page 4: SWAPs:  Re-thinking mobile and base-station architectures

Trend comparisons

            | Past                 | Current                | Future
------------+----------------------+------------------------+------------------------
Year        | 1990’s               | 2002-2005              | 2006+
Function    | Voice                | Data                   | Multimedia
Data rates  | 10’s of Kbps         | 100’s of Kbps (10x)    | 10’s of Mbps (10-100x)
Complexity  | KOPs                 | MOPs (1000x)           | GOPs (1000x)
Power       | < 500 mW             | < 500 mW               | < 500 mW
Antennas    | Single               | Single                 | Multiple
Standard    | GSM (Europe),        | GSM/TDMA/CDMA          | GSM/TDMA/CDMA/EDGE/
            | CDMA (Qualcomm),     | on same device         | Wireless LAN/Bluetooth
            | TDMA (Nokia)         |                        | on same device
            | (different devices)  |                        |

Page 5: SWAPs:  Re-thinking mobile and base-station architectures

Change in flexibility requirements

Physical Layer: maximum change (needs to support multiple environments, algorithms and standards)

MAC Layer

Network Layer

Application Layer: no change (already flexible)

Page 6: SWAPs:  Re-thinking mobile and base-station architectures

Summary of Challenges for

Sophisticated algorithms (GOPs of computation)
    10’s of Mbps, < 500 mW

Flexibility required at physical layer
    Multiple algorithms, multiple standards, multiple environments

What we would also like:
    Time to market
    Rapid evaluation and implementation
    Scalable architecture design methodologies

Page 7: SWAPs:  Re-thinking mobile and base-station architectures

Physical layer of a receiver

[Block diagram: Antenna -> RF Front-end -> Baseband processing (Channel estimation, Detection, Decoding) -> Higher (MAC/Network/OS) Layers]

Receiver more complex than transmitter

Page 8: SWAPs:  Re-thinking mobile and base-station architectures

Physical layer architecture

[Block diagram labels: Analog RF; Digital Baseband; DSP; ASICs; controller; Analog Baseband; Audio; A/D; D/A]

Source: M. L. McMahan, "Evolving Cellular Handset Architectures but a Continuing, Insatiable Desire for DSP MIPs", TI Report SPRA650, March 2000

Page 9: SWAPs:  Re-thinking mobile and base-station architectures

Architecture trade-offs

Past: more DSP + less ASIC

Current “proposed” solutions: less DSP + more ASICs
    Reason: DSPs not powerful enough

Can’t we build better DSPs?

[Figure: ASIC solutions, intermediate solutions and programmable solutions placed on two axes, Area-Time-Power Performance and Flexibility]

Page 10: SWAPs:  Re-thinking mobile and base-station architectures

Can this methodology scale for

Baseband increasingly important for real-time and power

Need much more flexibility
    Environment-specific sophisticated algorithms
    Cannot keep adding co-processors -> lose flexibility of a programmable solution

1 Mbps with 100 MHz processor
    100 cycles per bit to do all your work (GOPs/bit)

Power consumption with bigger color displays, video and more complex algorithms
    May have only ~100 mW for baseband
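The cycle budget quoted above is a simple ratio of clock rate to bit rate. A minimal sketch of that arithmetic follows. The function name `cycles_per_bit` is made up for illustration, and the 10 Mbps case is only an assumed example of the "10's of Mbps" future rate from the trend table.

```python
def cycles_per_bit(clock_hz: float, bit_rate_bps: float) -> float:
    """Processor cycles available to process each received bit."""
    return clock_hz / bit_rate_bps


# Figures quoted on the slide: 100 MHz processor, 1 Mbps data rate.
today = cycles_per_bit(100e6, 1e6)

# Assumed example of a future rate (10 Mbps) on the same 100 MHz processor.
future = cycles_per_bit(100e6, 10e6)

print(f"{today:.0f} cycles/bit at 1 Mbps, {future:.0f} cycles/bit at 10 Mbps")
```

At a fixed clock, every 10x increase in data rate removes 10x of the per-bit budget, which is the scaling concern this slide raises.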

Page 11: SWAPs:  Re-thinking mobile and base-station architectures


Motivation

Now that we know the challenges and constraints,

Design me

Page 12: SWAPs:  Re-thinking mobile and base-station architectures

design

How do we choose
    the right algorithms?
    the right amount of flexibility?

Do we build DSPs, ASICs, heterogeneous, reconfigurable?
    If ASICs, how to build better ASICs?
    If programmable, how to build better DSPs?
    If both, how do we mix them better?

Answers dependent on
    level of flexibility needed
    area-time-power architecture tradeoffs

Page 13: SWAPs:  Re-thinking mobile and base-station architectures

My contributions

“Low-complexity” algorithms for wireless:
    Parallel, fixed point algorithms for multiuser estimation and detection

ASIC design for wireless using computer arithmetic techniques:
    Dynamic truncation using on-line arithmetic

Programmable architecture design for wireless:
    Scalable Wireless Application-specific Processors (SWAPs)

Page 14: SWAPs:  Re-thinking mobile and base-station architectures

Programmable architectures

Current DSPs
    Not enough functional units (FUs) for GOPs of computation

Cannot extend to more FUs
    Limited Instruction Level Parallelism (ILP)
    Limited Subword Parallelism (SP)
    Cannot support more registers (register area increases quadratically with FUs)
    Compilers: difficult to find ILP as FUs increase

Page 15: SWAPs:  Re-thinking mobile and base-station architectures

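Solution

Exploit data parallelism (DP)
    Lots available in wireless algorithms

Example:

    int   a[1024], b[1024], c[1024];   // 32 bits
    short d[1024], e[1024], f[1024];   // 16 bits, packed

    for (int i = 0; i < 1024; i++)     // DP: loop iterations are independent
    {
        a[i] = b[i] + c[i];            // ILP: independent of the next statement
        d[i] = e[i] + f[i];            // SP: 16-bit operands packed into one word
    }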
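The two less familiar forms of parallelism in the example can be sketched in a few lines. This is an illustration under stated assumptions, not the processor's actual instruction set. `packed_add16` emulates subword parallelism by adding two 16-bit lanes held in one 32-bit word, and `split_across_clusters` shows one possible way (round-robin, an assumption) to spread independent loop iterations over clusters. All names are hypothetical.

```python
MASK_LOW15 = 0x7FFF7FFF  # low 15 bits of each 16-bit lane
MASK_TOP = 0x80008000    # top bit of each 16-bit lane


def pack16(hi: int, lo: int) -> int:
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack16(word: int) -> tuple:
    """Split a 32-bit word back into its two 16-bit lanes."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


def packed_add16(x: int, y: int) -> int:
    """SP: add both 16-bit lanes at once; carries never cross lanes."""
    low = (x & MASK_LOW15) + (y & MASK_LOW15)  # add without the top bits
    return low ^ ((x ^ y) & MASK_TOP)          # fold the top bits back in


def split_across_clusters(n_iterations: int, n_clusters: int) -> list:
    """DP: give loop iteration i to cluster i % n_clusters."""
    return [list(range(c, n_iterations, n_clusters)) for c in range(n_clusters)]


print(unpack16(packed_add16(pack16(1, 2), pack16(3, 4))))
print(split_across_clusters(8, 4))
```

Each lane wraps modulo 2^16 independently, which is the behaviour packed 16-bit addition needs.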

Page 16: SWAPs:  Re-thinking mobile and base-station architectures

DSP vs. SWAPs

[Figure: DSP (1 cluster): internal memory and one cluster of adders and multipliers (+++***), exploiting ILP. SWAPs (max. clusters): a stream register file and many such clusters, exploiting ILP within each cluster and DP across clusters.]

Page 17: SWAPs:  Re-thinking mobile and base-station architectures

Builds on the Imagine media processor

[Block diagram of the Imagine Stream Processor: Host Processor; Stream Controller; Microcontroller; Network Interface (to Network); Stream Register File; ALU Cluster 0 to ALU Cluster 7; Streaming Memory System with four SDRAMs]

Page 18: SWAPs:  Re-thinking mobile and base-station architectures

SWAPs trade-offs

Same internal memory size as DSPs
    Dependent on application, not architecture

Needs more area to support more functional units
    Area is less of a constraint than power

Varying levels of DP in applications
    Needs reconfiguration!!
    Need to turn off unused clusters

More parallelism -> lower clock frequency -> lower voltage -> low power (CV^2 f + leakage) in spite of larger area
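The last point can be checked with the usual first-order power model, P = C V^2 f + leakage. The sketch below is illustrative only. The capacitance values, the assumption that capacitance grows linearly with the number of clusters, and the 400 MHz / 100 MHz operating points are all assumptions rather than figures from the talk. The 1.5 V and 1.2 V supply values are the two voltages quoted later in the power slides.

```python
def active_power(c_eff_farads: float, vdd_volts: float, freq_hz: float,
                 leakage_watts: float = 0.0) -> float:
    """First-order CMOS power model: P = C * V^2 * f + leakage."""
    return c_eff_farads * vdd_volts ** 2 * freq_hz + leakage_watts


# Assumed narrow design: 1 unit of capacitance, 1.5 V, 400 MHz.
narrow = active_power(1e-9, 1.5, 400e6)

# Assumed wide design: 4x the capacitance (4x the clusters), but a quarter
# of the clock rate for the same throughput, which permits 1.2 V.
wide = active_power(4e-9, 1.2, 100e6)

print(f"narrow: {narrow:.3f} W, wide: {wide:.3f} W, ratio: {wide / narrow:.2f}")
```

Under these assumptions the C and f factors cancel, and the saving comes entirely from the squared voltage term, (1.2/1.5)^2 = 0.64. Leakage, which grows with area, is left at zero here and would reduce the benefit.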

Page 19: SWAPs:  Re-thinking mobile and base-station architectures

Design methodology

[Flowchart with numbered steps 1 to 8. Boxes: Chain of receiver algorithms; Low “complexity”, parallel, fixed point; High level language implementation; Programmable implementation; Modular programmable architecture design; ASIC implementation; FPGA, customized, reconfigurable, heterogeneous implementations. Edge labels: "Area-Time-Power specs: no"; "learn". Annotations: "Example: Pentium, DSP, SWAPs"; "Example: H-SWAPs".]

Page 20: SWAPs:  Re-thinking mobile and base-station architectures

Choosing the right algorithms: theory

Algorithm research:
    Spectral efficiency
    Low power (RF)

Metrics:
    Bit error rate
    Frame error rate

[Plot: Bit Error Rate (10^-8 to 10^0) versus Signal to Noise Ratio, with curves labelled Past, Current, Future and Theory]

Page 21: SWAPs:  Re-thinking mobile and base-station architectures

Choosing right algorithms: practice

Refine candidates from theory (using linear algebra / opt.)
    lower “complexity”, parallel, fixed-point

Optimization:
    Area: A
    Time: B
    Power: A
    Energy: A/B

Multi-parameter optimization?

“Complexity”: #operations of equivalent type

[Bar chart: Execution Time (0 to 80) for Original, Candidate A and Candidate B, grouped as Complexity and Complexity/Parallelism]

Page 22: SWAPs:  Re-thinking mobile and base-station architectures

Example: Parallel Viterbi Decoding

Add-Compare-Select (ACS): trellis interconnect
    Re-order for exploiting DP
    Parallelism depends on constraint length (#states)

Conventional Traceback – sequential
    Use Register Exchange (RE) for parallel solution

Exploiting DP in a programmable architecture implies:
    Re-order ACS
    Re-order RE
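A small software model may help make ACS and register exchange concrete. The sketch below is not the SWAP kernel from the talk. It assumes a rate 1/2, constraint length 3 code with generators (7, 5) octal, hard decisions, and a zero-terminated message. All function names are hypothetical. The loop over states `t` is the part that is independent per state and could be spread across clusters.

```python
def parity(x: int) -> int:
    """Parity (XOR of all bits) of x."""
    return bin(x).count("1") & 1


def encode(bits, K=3, gens=(0o7, 0o5)):
    """Rate-1/len(gens) convolutional encoder starting in state 0."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state          # newest bit in the top position
        out.extend(parity(reg & g) for g in gens)
        state = reg >> 1
    return out


def viterbi_register_exchange(received, K=3, gens=(0o7, 0o5)):
    """Hard-decision Viterbi decoder; survivors kept by register exchange."""
    n_states = 1 << (K - 1)
    metric = [0.0] + [float("inf")] * (n_states - 1)  # encoder starts in state 0
    registers = [[] for _ in range(n_states)]          # decoded bits per state
    n_out = len(gens)
    for i in range(0, len(received), n_out):
        symbol = received[i:i + n_out]
        new_metric, new_registers = [], []
        for t in range(n_states):                      # independent per state (DP)
            bit = t >> (K - 2)                         # input bit that leads to t
            candidates = []
            for low in (0, 1):                         # the two predecessor states
                s = ((t << 1) & (n_states - 1)) | low
                reg = (bit << (K - 1)) | s
                dist = sum(parity(reg & g) != r for g, r in zip(gens, symbol))
                candidates.append((metric[s] + dist, s))   # add
            best_metric, best_prev = min(candidates)        # compare, select
            new_metric.append(best_metric)
            # Register exchange: copy the winner's register and append the bit,
            # so no sequential traceback is needed at the end.
            new_registers.append(registers[best_prev] + [bit])
        metric, registers = new_metric, new_registers
    return registers[0]                                # zero-terminated message


message = [1, 0, 1, 1, 0, 0, 1] + [0, 0]               # K-1 tail zeros
coded = encode(message)
coded[5] ^= 1                                           # inject one channel error
print(viterbi_register_exchange(coded) == message)
```

Register exchange trades memory traffic for parallelism: every state copies a whole survivor register each step, but all copies can proceed at once, whereas traceback walks the trellis backwards one step at a time.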

Page 23: SWAPs:  Re-thinking mobile and base-station architectures

SWAP design

Decide how many clusters
    Exploit DP
    Look at the for loop () count

Decide what to put within each cluster
    Maximize ILP with high functional unit efficiency
    Search design space

See how it meets time-area-power constraints
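The "search design space" step can be pictured as a sweep over adder and multiplier counts, recording the schedule length and functional-unit utilization for each configuration. The sketch below is a deliberately crude stand-in for a real kernel scheduler. It assumes all operations are independent and single-cycle, and the operation counts (60 additions, 20 multiplications) are invented for illustration.

```python
from math import ceil


def schedule(n_adds: int, n_muls: int, adders: int, multipliers: int):
    """Return (cycles, adder utilization %, multiplier utilization %).

    Assumes independent single-cycle operations, so the schedule length is
    set by whichever unit type is the bottleneck.
    """
    cycles = max(ceil(n_adds / adders), ceil(n_muls / multipliers))
    add_util = 100.0 * n_adds / (adders * cycles)
    mul_util = 100.0 * n_muls / (multipliers * cycles)
    return cycles, add_util, mul_util


def explore(n_adds: int, n_muls: int, max_units: int = 5) -> dict:
    """Sweep 1..max_units adders and multipliers for one kernel."""
    return {
        (a, m): schedule(n_adds, n_muls, a, m)
        for a in range(1, max_units + 1)
        for m in range(1, max_units + 1)
    }


results = explore(n_adds=60, n_muls=20)  # hypothetical kernel operation mix
for (a, m), (cycles, add_util, mul_util) in sorted(results.items()):
    print(f"{a} adders, {m} multipliers: {cycles} cycles, "
          f"utilization (+,*) = ({add_util:.0f},{mul_util:.0f})")
```

A real exploration would also account for data dependencies, operation latencies and register pressure, so its utilization figures would be lower than this idealized model reports.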

Page 24: SWAPs:  Re-thinking mobile and base-station architectures

What goes inside a cluster?

[3-D plot: Auto-exploration of adders and multipliers for kernel "acskc". Axes: #Adders (1 to 5), #Multipliers (1 to 5), instruction count (40 to 160). Each point is annotated with FU utilization (+,*), for example (43,58) and (85,11).]

Page 25: SWAPs:  Re-thinking mobile and base-station architectures

Re-ordering for parallel Viterbi

[Figure: 16-state trellis with states X(0) to X(15). Panel a: Trellis. Panel b: Shuffled Trellis, with one column of states re-ordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15).]
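One way to see why an even-then-odd re-ordering helps is to look at which old states feed each new state. The sketch below assumes a particular state numbering in which new state t is reached from old states 2j and 2j+1, where j = t mod N/2. This is a common convention but not necessarily the one used in the figure, and the function names are hypothetical. After the shuffle, the two predecessors of every state sit at the same offset in two contiguous halves, so ACS becomes an element-wise operation on two vectors.

```python
def predecessors(t: int, n_states: int) -> tuple:
    """Old states that feed new state t (assumed numbering convention)."""
    j = t % (n_states // 2)
    return 2 * j, 2 * j + 1


def shuffle(values: list) -> list:
    """Re-order as all even-indexed entries followed by all odd-indexed ones."""
    return values[0::2] + values[1::2]


n_states = 16
order = shuffle(list(range(n_states)))
half = n_states // 2
print(order)

for t in range(n_states):
    j = t % half
    # After the shuffle, both predecessors of t are found at offset j.
    print(t, (order[j], order[half + j]), predecessors(t, n_states))
```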

Page 26: SWAPs:  Re-thinking mobile and base-station architectures

Viterbi reconfiguration

Packet 1: Constraint length 7 (16 clusters)

Packet 2: Constraint length 9 (64 clusters)

Packet 3: Constraint length 5 (4 clusters)

[Figure labels: DP; Can be turned OFF]
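The cluster counts on this slide are consistent with 2^(K-1) trellis states spread at four states per cluster. That ratio is inferred here from the three quoted figures, not stated in the talk. A short sketch under that assumption, with hypothetical function names:

```python
def viterbi_states(constraint_length: int) -> int:
    """Number of trellis states for constraint length K: 2^(K-1)."""
    return 1 << (constraint_length - 1)


def clusters_needed(constraint_length: int, states_per_cluster: int = 4) -> int:
    """Clusters used when each cluster handles a fixed number of states."""
    return max(1, viterbi_states(constraint_length) // states_per_cluster)


def cluster_usage(constraint_length: int, total_clusters: int = 64,
                  states_per_cluster: int = 4) -> tuple:
    """Return (active clusters, clusters that can be turned off)."""
    used = clusters_needed(constraint_length, states_per_cluster)
    if used > total_clusters:
        raise ValueError("not enough clusters for this constraint length")
    return used, total_clusters - used


for k in (7, 9, 5):  # the three packets on the slide
    print(k, cluster_usage(k))
```

With 64 clusters available, the constraint length 5 packet leaves 60 clusters idle, which is where the power saving from turning clusters off comes from.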

Page 27: SWAPs:  Re-thinking mobile and base-station architectures

How to reconfigure?

Move data to appropriate clusters and turn off unused clusters and SRF
    Significant loss in performance
    Maximum power savings

Use Conditional Streams
    Cannot turn off SRF, comm, scratchpad in clusters
    Minimal loss in performance

Use mux-demux buffers
    Can turn off clusters entirely – more power savings
    Minimal loss in performance

Page 28: SWAPs:  Re-thinking mobile and base-station architectures

64-bit Packet 1: Rate ½, Constraint Length 7

64-bit Packet 2: Rate ½, Constraint Length 9

64-bit Packet 3: Rate ½, Constraint Length 5

[Figure labels: Kernels (Computation); Memory accesses]

Page 29: SWAPs:  Re-thinking mobile and base-station architectures

Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz

[Log-log plot: frequency needed to attain real-time (in MHz, 10^0 to 10^3) versus number of clusters (10^0 to 10^2). Curves: Actual K = 9; Actual K = 7; Actual K = 5. Legend: Regular code; Reconfigurable code.]

Page 30: SWAPs:  Re-thinking mobile and base-station architectures

Viterbi decoding: Execution time

[Log-log plot (axes 10^0 to 10^2 and 10^0 to 10^3). Labels: Ideal; DSP C64x (w/o co-proc); 128 KHz (1 bit/cycle); Virtex II FPGA*; Actual K = 9; Actual K = 7; Actual K = 5]

[Diagram labels: DSP (RE); SWAP; ASIC/FPGA – Real-time performance; DP; Task Pipelining; Dedicated interconnect]

*M. Vaya, J. R. Cavallaro, "VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2003, Hong Kong

Page 31: SWAPs:  Re-thinking mobile and base-station architectures

Salient features of this solution

Any constraint length
    10 MHz at 128 Kbps (handset)

Same code for all constraint lengths
    no need to re-compile or load another code
    as long as parallelism/cluster ratio is constant

Exploiting parallelism for real-time:
    Instruction Level Parallelism (DSP)
    Subword Parallelism (DSP)
    Data Parallelism (Imagine)
    Dynamic Cluster Scaling (SWAP)

Power savings due to dynamic cluster scaling

Page 32: SWAPs:  Re-thinking mobile and base-station architectures

Expected SWAP power numbers

Viterbi decoding

64 clusters and 1 multiplier per cluster:
    Process: 0.13 micron
    Voltage: 1.5 V (to min. leakage when not active)
    R-T Frequency: f ~ 10 MHz
    Peak Active Power: ~16 mW/MHz (11 mW/MHz if 1.2 V)
    Area: ~53.7 mm^2

10 MHz, 128 Kbps
    ~160 (110) mW for K = 9
    ~53.33 (36.7) mW for K = 7
    ~26.67 (12.5) mW for K = 5

ASICs: ~10-100 W

*Brucek Khailany et al., "Exploring the VLSI Scalability of Stream Processors", Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164
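The K = 9 figures follow directly from the quoted power density: peak active power in mW/MHz multiplied by the real-time clock frequency in MHz. A minimal sketch of that arithmetic, with hypothetical function names. The V^2 scaling check at the end is an added observation under the first-order assumption that active power scales with the square of the supply voltage.

```python
def active_power_mw(mw_per_mhz: float, freq_mhz: float) -> float:
    """Active power in mW from a power density (mW/MHz) and a clock (MHz)."""
    return mw_per_mhz * freq_mhz


def scale_with_voltage(mw_per_mhz: float, v_from: float, v_to: float) -> float:
    """Rescale a power density assuming it is proportional to V^2."""
    return mw_per_mhz * (v_to / v_from) ** 2


# Quoted figures: 64 clusters, 10 MHz real-time clock, K = 9.
print(active_power_mw(16, 10))   # at 1.5 V
print(active_power_mw(11, 10))   # at 1.2 V

# Pure V^2 scaling of 16 mW/MHz from 1.5 V to 1.2 V.
print(round(scale_with_voltage(16, 1.5, 1.2), 2))
```

Pure V^2 scaling gives about 10.2 mW/MHz, close to the 11 mW/MHz quoted for 1.2 V.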

Page 33: SWAPs:  Re-thinking mobile and base-station architectures

Problems

Suitable for handsets? - Not yet!
    Still too general
    Not low power enough!!!

No special customization for the application
    Except for a fixed-point architecture
    Generic instruction set
    Generic ALUs (though they can be powered down)
    Generic inter-cluster communication network

Suitable for base-stations?
    Why not? Power is not a primary constraint.

Page 34: SWAPs:  Re-thinking mobile and base-station architectures

Multiuser Estimation-Detection+Decoding

[Log-log plot: frequency needed to attain real-time (in MHz, 10^1 to 10^5) versus number of clusters (10^0 to 10^2). Curves: FAST; MEDIUM; SLOW. Annotations: 32-user 3G base-station; Hand-set.]

Real-time target: 128 Kbps per user

Page 35: SWAPs:  Re-thinking mobile and base-station architectures

Expected power numbers

32 user base-station with 3 multipliers per cluster and 64 clusters:
    Process: 0.13 micron
    Voltage: 1.2 V (always active, leakage less important)
    R-T Frequency: f ~ 1 GHz
    Peak Active Power: ~19.88 mW/MHz (increased *)
    Area: ~93.4 mm^2

Total base-station power consumption:
    ~19.88 W at 1 GHz for 32 users at 128 Kbps/user

Page 36: SWAPs:  Re-thinking mobile and base-station architectures

H-SWAPs

Trade Data Parallelism for Task Pipelining

Customize each SWAPlet

[Figure: SWAPs (max. clusters and reconfigure): internal memory and many identical clusters (+++***), exploiting DP. SWAPlet (limit clusters): a few clusters with limited DP. H-SWAPs (collection of customized SWAPlets): several SWAPlets with different functional-unit mixes (+++*, ++*, ++++), each with limited DP.]

Page 37: SWAPs:  Re-thinking mobile and base-station architectures

Viterbi decoding

Survivor management – serial
    Finding parallel solution for SWAPs – expensive
    > 50% of execution time: overhead
    Serial solution now possible with H-SWAPs

[Figure: H-SWAPs for Viterbi decoding. ACS unit: four ACS+ clusters with limited DP. Traceback unit: TBU.]

Page 38: SWAPs:  Re-thinking mobile and base-station architectures

Potential advantages

[Diagram, SWAPs: DSP (RE) -> (DP) -> SWAP -> (Task Pipelining, Dedicated interconnect) -> ASIC/FPGA – Real-time performance]

[Diagram, H-SWAPs: DSP (RE) -> (Partial DP + Task Pipelining, Application-specific units) -> H-SWAP -> (Dedicated interconnect) -> ASIC/FPGA – Real-time performance]

[Axis: Performance]

Page 39: SWAPs:  Re-thinking mobile and base-station architectures


Current research

How to trade-off task vs. data parallelism?

Evaluation of specialized inter-cluster communication

Integrating specialized arithmetic units (ACS, on-line)

Area-Time-Power efficiency of Handset SWAPs

Learning to migrate from H-SWAPs to SWAPs

Scale to future systems!!

Page 40: SWAPs:  Re-thinking mobile and base-station architectures

Future research: efficient algorithms

[Block diagram labels: Multipath Channel; Equalizer; MRC; Decoder; Detector; Demodulator; Non-Coherent or Partially Coherent STC; Beamforming; Coherent STC; Channel Estimator; CSIR; Finite Feedback Channel; Turbo Equalizer]

Page 41: SWAPs:  Re-thinking mobile and base-station architectures

Future research: architectures

Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs

Some other potential applications
    Image processing:
        Cameras: variety of compression algorithms
    Biomedical applications:
        Hearing aids: DSP running on body heat*
    Sensor networks

*Quote: Gene Frantz, TI Fellow

Page 42: SWAPs:  Re-thinking mobile and base-station architectures

Conclusions

Exciting times for wireless algorithm and architecture research
    More complex algorithms
    Higher data rates – meet real-time requirements
    Lower power
    Low area

Seek to design flexible architectures
    learn from ASIC solutions

Inter-disciplinary research needed:
    Computer architecture, VLSI, wireless communications, computer arithmetic, compilers