Massively Parallel Signal Processing for Wireless Communication Systems

Michael Wu, Guohui Wang, Joseph R. Cavallaro

Department of ECE, Rice University


3/20/2013

[Figure: information bits are sent as a transmitted signal over the Internet; the received signal is converted back to information bits]

Used in many standards:

Heavy computations on RX side:

MIMO Detector: decouples the streams to provide estimates of the transmitted bits.

Channel Decoder: correct errors using redundant data.

[Figure: block diagram. Source → Channel Encoder → MIMO Modulator → RF TX → Channel → RF RX → MIMO Detector → Channel Decoder → Sink]

Wireless Communication Systems (cont.)


Related Research Work

MIMO detector: [SAMOS 2010, IEEE-TVT 2012]

Turbo decoder: [CASES 2010, VLSID 2012]

LDPC decoder: [ASILOMAR 2008, TPDS 2010, JSPS 2010, SASP 2011]

SDR systems: [IEEE Comm. Mag 2010, ISRSP 2011]

Our previous work

MIMO detector: [SiPS 2009, Asilomar 2009, JSPS 2011, SIPS 2012]

Turbo decoder: [SiPS 2010, JSPS 2011]

LDPC decoder: [SASP 2011, Asilomar 2011]

NB-LDPC decoder: [Asilomar 2012]

Massively parallel implementations

Massively parallel implementations:

Tailored algorithms to improve efficiency.

Results:

Achieve high throughput (faster than existing work)

Very flexible, can be a good platform for SDR systems.

[Figure: MIMO Detector and LDPC Decoder]

Outline

MIMO soft detection algorithm on GPU

Introduction to MIMO detection

Kernel mapping

Optimization techniques

Experiment results

Multi-standard LDPC Decoder on GPU

Introduction to LDPC algorithm

Kernel mapping


Experiment results

Modulation

Encode data in amplitude and phase of a sinusoid

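Higher modulation order: more data per symbol

[Figure: a two-point constellation labeled 0 and 1, and a four-point constellation labeled 11, 00, 10, 01, mapping the bit stream …0111011 to symbols]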
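The mapping from bits to constellation points can be sketched as follows. This is a minimal illustration, not the authors' modulator: the symbol positions in `BPSK` and `QPSK` are assumptions (the slide only shows the bit labels), and `modulate` is a hypothetical helper name.

```python
# Hypothetical bit-to-symbol tables. The bit labels follow the slide, but the
# exact constellation positions are an assumption made for illustration.
BPSK = {(0,): -1 + 0j, (1,): 1 + 0j}
QPSK = {(0, 0): -1 - 1j, (0, 1): -1 + 1j, (1, 0): 1 - 1j, (1, 1): 1 + 1j}


def modulate(bits, table):
    """Group bits into chunks of log2(M) bits and look each chunk up."""
    k = len(next(iter(table)))  # bits carried by one symbol
    if len(bits) % k:
        raise ValueError("bit count must be a multiple of bits per symbol")
    return [table[tuple(bits[i:i + k])] for i in range(0, len(bits), k)]


bits = [0, 1, 1, 1, 0, 1, 1, 0]
print(len(modulate(bits, BPSK)))  # 8 symbols, one bit each
print(len(modulate(bits, QPSK)))  # 4 symbols, two bits each
```

The same eight bits need half as many symbols with the four-point table, which is the "more data per symbol" point made above.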

MIMO System Model

Spatial Multiplexing: ↑ throughput by transmitting multiple streams

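Receiver: the transmitted streams interfere with each other

\[
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \end{bmatrix}
=
\begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}
+
\begin{bmatrix} n_0 \\ n_1 \\ n_2 \end{bmatrix},
\qquad y = Hx + n
\]

[Figure: three bit streams (…0111011, …0101011, …0101111) are transmitted as x0, x1, x2 through the channel H and received as y0, y1, y2]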

MIMO-OFDM

Break a wideband signal into many independent subcarriers

Perform MIMO detection independently many times, one per subcarrier

Many subcarriers for many wireless standards.

LTE 20 MHz subframe: 14 × 1200 subcarriers

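MIMO Detection

Probability of a path x is inversely proportional to the distance \( d(\hat{y}, x) = \lVert \hat{y} - Rx \rVert^2 = \sum_i d_i \)

Probabilities of all paths are used to generate bit probability values.

4x4 64QAM: \( 64^4 = 16{,}777{,}216 \) paths

\[
y = Hx + n: \qquad
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \end{bmatrix}
=
\begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}
+
\begin{bmatrix} n_0 \\ n_1 \\ n_2 \end{bmatrix}
\]

\[
\hat{y} = Rx + \hat{n}: \qquad
\begin{bmatrix} \hat{y}_0 \\ \hat{y}_1 \\ \hat{y}_2 \end{bmatrix}
=
\begin{bmatrix} r_{00} & r_{01} & r_{02} \\ 0 & r_{11} & r_{12} \\ 0 & 0 & r_{22} \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}
+
\begin{bmatrix} \hat{n}_0 \\ \hat{n}_1 \\ \hat{n}_2 \end{bmatrix}
\]

\[
\begin{aligned}
d_2 &= \lvert \hat{y}_2 - r_{22}x_2 \rvert^2 \\
d_1 &= \lvert \hat{y}_1 - r_{12}x_2 - r_{11}x_1 \rvert^2 \\
d_0 &= \lvert \hat{y}_0 - r_{02}x_2 - r_{01}x_1 - r_{00}x_0 \rvert^2
\end{aligned}
\]

[Figure: Search Space for 3x3 BPSK. A binary tree over x2, x1, x0 with branches labeled + and −]

Example path \( x_2 = -1,\ x_1 = 1,\ x_0 = -1 \):

\[
\begin{aligned}
d_2 &= \lvert \hat{y}_2 - r_{22}(-1) \rvert^2 \\
d_1 &= \lvert \hat{y}_1 - r_{12}(-1) - r_{11}(1) \rvert^2 \\
d_0 &= \lvert \hat{y}_0 - r_{02}(-1) - r_{01}(1) - r_{00}(-1) \rvert^2
\end{aligned}
\]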
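To make the path distance concrete, the sketch below evaluates \( \sum_i d_i \) for every leaf of the 3x3 BPSK tree and keeps the closest path. It is an illustrative exhaustive search with made-up values for `R`, not the detector used in the talk; `path_distance` and `exhaustive_search` are hypothetical names.

```python
from itertools import product


def path_distance(y_hat, R, x):
    """Sum of d_i = |y_hat[i] - sum_{j>=i} R[i][j] * x[j]|^2 over all levels."""
    n = len(x)
    total = 0.0
    for i in reversed(range(n)):  # start at the last antenna (tree root)
        residual = y_hat[i] - sum(R[i][j] * x[j] for j in range(i, n))
        total += abs(residual) ** 2
    return total


def exhaustive_search(y_hat, R, alphabet=(-1, 1)):
    """Visit every leaf of the tree and return (best_path, best_distance)."""
    n = len(y_hat)
    scored = ((x, path_distance(y_hat, R, x)) for x in product(alphabet, repeat=n))
    return min(scored, key=lambda item: item[1])


# Made-up upper-triangular R and a noiseless observation of x = (-1, 1, -1).
R = [[2.0, 0.5, -0.3],
     [0.0, 1.5, 0.4],
     [0.0, 0.0, 1.0]]
x_true = (-1, 1, -1)
y_hat = [sum(R[i][j] * x_true[j] for j in range(3)) for i in range(3)]
print(exhaustive_search(y_hat, R))
```

With BPSK and three antennas there are only 2^3 = 8 leaves; the 64^4 count quoted above is why the full search is not practical for 4x4 64QAM.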

SSFE Detector

Selective Spanning with Fast Enumeration (SSFE)*

Real value decomposition

Data parallel deterministic search

Generate M likely paths which are used to generate bit probability values

1st level: enumerate all modulation points.

Subsequent levels: depth-first search, pick the best outgoing node

[Figure: search tree for 2x2 QPSK after real value decomposition, with levels 1st antenna Real, 1st antenna Imag, 2nd antenna Real, 2nd antenna Imag and branches labeled + and −]

SSFE Detector: Node Expansion

Find the best node x0 that minimizes the cumulative distance

Pick the constellation point closest to the zero forcing solution.

Inputs:

• P: [x2=1, x1=1]

• Channel gains r: [r00 r01 r02]

• Received signal: ŷ0

Zero forcing solution:

\[ \hat{x}_0 = \frac{1}{r_{00}} \left( \hat{y}_0 - r_{02}x_2 - r_{01}x_1 \right) \]

Schnorr-Euchner enumeration

64 QAM example: \( \hat{x}_0 = 3.9 \) on the axis −7 −5 −3 −1 1 3 5 7

[Figure: search tree with the parent path P and the candidate nodes for x0]
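The node expansion above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `PAM8` assumes the per-dimension levels of 64QAM are the odd integers from −7 to 7 as on the slide's axis, and `zero_forcing`, `slice_to_pam`, and `se_order` are hypothetical helper names. Sorting by distance is used here as a simple stand-in that yields the same visiting order as a Schnorr-Euchner zigzag around the estimate.

```python
PAM8 = (-7, -5, -3, -1, 1, 3, 5, 7)  # assumed per-dimension levels of 64QAM


def zero_forcing(y0, r00, r01, r02, x1, x2):
    """Cancel the already-decided symbols x1, x2 and divide by r00."""
    return (y0 - r02 * x2 - r01 * x1) / r00


def slice_to_pam(value, levels=8):
    """Round to the nearest odd integer, clamped to the constellation edge."""
    nearest = 2 * round((value - 1) / 2) + 1
    limit = levels - 1
    return max(-limit, min(limit, nearest))


def se_order(estimate, alphabet=PAM8):
    """Candidates ordered by increasing distance from the estimate."""
    return sorted(alphabet, key=lambda p: abs(p - estimate))


print(slice_to_pam(3.9))   # 3
print(se_order(3.9)[:4])   # [3, 5, 1, 7]
```

Because the closest point is found by rounding rather than by scanning the constellation, the cost of one expansion does not grow with the modulation order.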

GPU Implementation

Search algorithm maps well onto GPU

Data parallel with no sharing

Each search path is independent

Efficient node expansion

complexity doesn’t depend on modulation

Modest storage requirement

M threads per detection

GPU Implementation of SSFE

// enumerate a modulation point for the 1st antenna (real and imaginary parts)

path[0] = mod_table[tid % 8];

path[1] = mod_table[tid / 8];

dist += calc_dist(y, r, path[0]);

dist += calc_dist(y, r, path[1]);

// depth-first search over the remaining levels

for i = 2 : ntx

    // compute the partial Euclidean distance from the levels already decided

    ped = 0;

    for j = 0 : i-1

        ped += calc_dist(y, r, path[j]);

    // find the best outgoing path for level i

    path[i] = SE_expand(dist, r);

    dist = update_dist(dist, ped);

One thread block handles one subcarrier

Spawns 1 thread per modulation point (M threads)

Completely unrolled inner and outer loops

Path stored in registers

Demodulator + soft estimate computation not shown

Result

4x4 16QAM: 940 Mbps

4x4 64QAM: 480 Mbps
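As a CPU-side reference for the search described above, the sketch below runs the same idea sequentially: the top levels are fully enumerated (one loop iteration stands in for one GPU thread) and every remaining level takes the single closest point. It assumes a real-valued upper-triangular `R` and the PAM-8 levels of 64QAM; `ssfe` and `nearest_point` are hypothetical names, and soft-output generation is omitted, as on the slide.

```python
from itertools import product

PAM8 = (-7, -5, -3, -1, 1, 3, 5, 7)  # assumed per-dimension levels of 64QAM


def nearest_point(value, alphabet):
    return min(alphabet, key=lambda p: abs(p - value))


def ssfe(y_hat, R, alphabet=PAM8, full_levels=2):
    """Return one (distance, path) candidate per fully enumerated head."""
    n = len(y_hat)
    candidates = []
    for head in product(alphabet, repeat=full_levels):  # one "thread" each
        x = [0] * n
        x[n - full_levels:] = head
        dist = 0.0
        for i in reversed(range(n)):
            interference = sum(R[i][j] * x[j] for j in range(i + 1, n))
            if i < n - full_levels:  # greedy level: pick the closest point
                x[i] = nearest_point((y_hat[i] - interference) / R[i][i], alphabet)
            dist += (y_hat[i] - interference - R[i][i] * x[i]) ** 2
        candidates.append((dist, tuple(x)))
    return candidates


# Made-up 4-level example with a noiseless observation.
R = [[2.0, 0.3, -0.1, 0.2],
     [0.0, 1.8, 0.4, -0.3],
     [0.0, 0.0, 1.5, 0.1],
     [0.0, 0.0, 0.0, 1.2]]
x_true = (3, -5, 1, 7)
y_hat = [sum(R[i][j] * x_true[j] for j in range(4)) for i in range(4)]
paths = ssfe(y_hat, R)
print(len(paths), min(paths)[1])
```

Each head is processed without reference to any other, which is the "data parallel with no sharing" property that lets one thread own one path.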

N-Way MIMO Detector

Duplicate search block depending on FER requirement.

Add permute block which enforces a detection order

Example: N = 2, two search blocks

Larger lists, NM candidates

[Figure: y, H feed two parallel chains, each Permute → RVD-QR → SSFE, with detection orders (X0 X1 X2) and (X1 X2 X0)]

Q. Qi, C. Chakrabarti, Parallel high throughput soft-output sphere decoder (SiPS 2010)

M. Wu, C. Dick, J. R. Cavallaro, Improving MIMO sphere detection through antenna detection order scheduling (SDR Forum 2011)

GPU Implementation of N-Way MIMO Detector

Duplicate threads to improve accuracy of the detection algorithm

Divide a thread block into N subsections

Each subsection consists of M threads and operates on a different channel permutation

Each subsection performs SSFE detection independently

Generates a candidate list of size NM.

[Figure: single search with parallelism M versus N-way search with parallelism NM]

N-Way MIMO FER Performance

Soft Output detectors + Rate 1/2 WiMAX LDPC code, Rayleigh fading channel.

1 outer iteration + 20 inner iterations with early termination

Compared to soft-output K-best and exhaustive (MAP) detector.

[Figure: FER curves, panels 4x4 16QAM and 4x4 64QAM, with arrows marking the "better" direction]

N-Way MIMO Detector Throughput

• GK104, 1536 cores @ 915 MHz, 256-bit GDDR5 @ 6 Gbps

• 8192 subcarriers, kernel time only

Throughput (Mbps)   N=1   N=2   N=3   N=4
16QAM               945   476   316   225
64QAM               480   239   151   111

N-Way MIMO Detector Throughput vs Workload

• Kernel time only

[Figure: throughput (Mbps) versus number of subcarriers (0 to 8192) for N=1, N=2, N=3, N=4; panels 16 QAM and 64 QAM]

Performance Comparison

FPFSD*

N=4: 4 parallel detectors with different permutations

Differences: a) operates in the complex domain; b) one kernel for search + one kernel for soft-output generation

Fermi, 448 cores @ 1150 MHz, 320-bit GDDR5 @ 3 Gbps

Number of       16QAM (Mbps)          64QAM (Mbps)
subcarriers     FPFSD*   Ours N=4     FPFSD*   Ours N=4
150*7           72.41    232.27       18.6     67.55
300*7           82.56    246.60       18.59    69.12
600*7           90.44    256.83       17.9     69.53
900*7           91.39    256.22       17.4     70.05
1200*7          92.31    256.88       17.2     69.97

*Sandra Roger et al., Fully Parallel GPU Implementation of a Fixed-Complexity Soft-Output MIMO Detector (IEEE TVT 2012)


N-Way QR Decomposition

Divide a thread block into N subsections

Each subsection of N threads operates on a different channel permutation

Performs modified Gram-Schmidt QR on an extended matrix [H|Y]

Generates R and ŷ

QR decomposition time for 8192 symbols (kernel time only)

        N=1        N=2        N=3        N=4
4x4     0.201 ms   0.350 ms   0.551 ms   0.675 ms

290.3 Mbps @ 64QAM
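The extended-matrix idea can be sketched as follows: running modified Gram-Schmidt on [H | y] produces R and ŷ in the same pass, because y is projected onto each orthonormal column as it is formed. This is a sequential, real-valued sketch for a single channel matrix, assuming H has full column rank; `mgs_qr_extended` is a hypothetical name and the GPU thread mapping is not modeled.

```python
def mgs_qr_extended(H, y):
    """Modified Gram-Schmidt on [H | y]; returns (R, y_hat) with y_hat = Q^T y."""
    n_rows, n_cols = len(H), len(H[0])
    # Columns of the extended matrix; the last column is y.
    cols = [[H[i][j] for i in range(n_rows)] for j in range(n_cols)] + [list(y)]
    R = [[0.0] * n_cols for _ in range(n_cols)]
    y_hat = [0.0] * n_cols
    for k in range(n_cols):
        norm = sum(v * v for v in cols[k]) ** 0.5
        R[k][k] = norm
        q = [v / norm for v in cols[k]]
        # Remove the q component from every later column, including y.
        for j in range(k + 1, n_cols + 1):
            proj = sum(a * b for a, b in zip(q, cols[j]))
            cols[j] = [b - proj * a for a, b in zip(q, cols[j])]
            if j < n_cols:
                R[k][j] = proj
            else:
                y_hat[k] = proj
    return R, y_hat


# Made-up example: y = H x with x = (1, -1, 1), so R x should equal y_hat.
H = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 4.0]]
x = [1, -1, 1]
y = [1.0, -1.0, 3.0]
R, y_hat = mgs_qr_extended(H, y)
print(R[2][0], R[2][1])  # entries below the diagonal stay 0.0
```

Treating y as one more column avoids forming Q explicitly, which is presumably why the extended matrix is used.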

Complete Design

Complete design, QR + MIMO Detection

Also includes PCIe transfer time

• 8192 subcarriers

[Figure: bar chart of throughput (Mbps) for 16QAM and 64QAM with N=1, N=2, N=3, N=4; bar labels as extracted: 196, 173, 130, 124, 89, 85, 79, 64]

Complete Design

Transfer time doesn’t depend on N (number of parallel searches) or M (modulation size)

Transfer time can be hidden

QR depends only on N

Detection depends on N and M

[Figure: stacked time (ms) for Transport, QR, and Detection at N=1, N=2, N=3, N=4; panels 4x4 64QAM and 4x4 16QAM]

Outline

Multi-standard LDPC Decoder on GPU

Introduction to LDPC algorithm

Kernel mapping

Experiment results

Channel Coding

Linear block codes

Encoding: \( x \cdot G = c \)

K-bit x encoded into N-bit codeword c (K < N)

Generator matrix G

Parity check matrix H: \( H \cdot c^T = 0 \) (\( G \cdot H^T = 0 \))
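A small worked instance of these relations, using the (7,4) Hamming code as an assumed example (the slides do not name a specific code here). `encode` computes x · G and `syndrome` computes H · c^T, both modulo 2; the names are hypothetical.

```python
# Systematic (7,4) Hamming code: G = [I4 | P], H = [P^T | I3] (illustrative).
G = [[1, 0, 0, 0, 1, 1, 0],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]
H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]


def encode(x):
    """c = x . G over GF(2): K = 4 message bits become N = 7 code bits."""
    return [sum(x[k] * G[k][n] for k in range(len(G))) % 2
            for n in range(len(G[0]))]


def syndrome(c):
    """H . c^T over GF(2); all zeros means every parity check is satisfied."""
    return [sum(h * b for h, b in zip(row, c)) % 2 for row in H]


c = encode([1, 0, 1, 1])
print(c, syndrome(c))
```

Flipping any single bit of `c` makes the syndrome non-zero, which is the property the decoder and the early-termination check rely on.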

Low-density parity-check (LDPC) codes

Error-correction codes

Provides near-capacity error-correcting performance

Application of LDPC codes

Wireless communication

IEEE 802.16m WiMax

IEEE 802.11n, 802.11ac WiFi

10 Gbps Ethernet communication

IEEE 802.3an

Digital broadcast: DVB-S2

High speed magnetic storage device

Satellite communication

Challenges of decoder design

High throughput requirement

Multi-standard support

Flexibility and scalability

[Figure: Turbo codes and LDPC codes]

LDPC Codes

LDPC codes are linear block codes defined by sparse matrices H

Codeword c should satisfy the parity-check equations: \( H \cdot c^T = 0 \)

Belief propagation decoding algorithm

Tanner graph

[Figure: sparse matrix H and its Tanner graph, with check nodes CN0 to CN3 and variable nodes VN0 to VN7]

LDPC Decoding: Belief propagation decoding

Bit stream: 0, 1, 1, 0, 1, 0, 0, 1, …

Modulation

Modulated symbol: −1, 1, 1, −1, 1, −1, −1, 1, …

Wireless channel

Received symbol: −1.3, 0.8, 1.1, −0.7, 0.5, −1.2, −0.9, 1.1, …

Probability(cn=0 | received) vs. Probability(cn=1 | received)

• \( H \cdot c^T = 0 \)

• Complexity ~ \( O(N^3) \)

[Figure: Tanner graph with check nodes CN0 to CN3]

Belief propagation decoding

[Flowchart: Initialization → Check Node Processing (Rmn) → Variable Node Processing (Qmn) → repeat while i < max_iter → Finish decoding → Done]

Belief propagation decoding algorithm: initialization

[Figure: wireless receiver. The Detector/Demodulator passes probability values to the Channel Decoder, which outputs the decoded bit stream. Tanner graph with check nodes CN0 to CN3, annotated with the bit probability values]

Belief propagation decoding algorithm: CNP

[Figure: decoder pipeline Init decoder → Check Node Processing → Variable Node Processing → Hard Decision → Decoded Bits, with messages R, Q, and L. In the Tanner graph, check node CN0 sends R00 to VN0, using the messages Q10, Q20, Q70 from VN1, VN2, VN7]

Belief propagation decoding algorithm: CNP

[Figure: same decoder pipeline. Check node CN0 sends R01 to VN1, using the messages Q00, Q20, Q70 from VN0, VN2, VN7]

Belief propagation decoding algorithm: VNP

[Figure: same decoder pipeline. Variable node VN0 is connected to CN0 and CN2; the messages shown are R00 and Q20]

Belief propagation decoding algorithm: hard decision

[Figure: same decoder pipeline. Hard Decision Block: test L > 0 and output xn=1 or xn=0]
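The iteration described on the preceding slides (check node processing, variable node processing, hard decision, parity check) can be sketched as below. Several choices here are assumptions rather than the authors' design: the check-node rule is the min-sum approximation, the input is a log-likelihood ratio where positive values favor bit 0, and the test matrix is the small (7,4) Hamming parity-check matrix rather than a real LDPC code. `min_sum_decode` is a hypothetical name.

```python
HAMMING_H = [[1, 1, 0, 1, 1, 0, 0],
             [1, 0, 1, 1, 0, 1, 0],
             [0, 1, 1, 1, 0, 0, 1]]


def min_sum_decode(H, llr, max_iter=10):
    """Flooding min-sum decoder. Assumes llr[n] = log(P(c_n=0) / P(c_n=1))."""
    M, N = len(H), len(llr)
    rows = [[n for n in range(N) if H[m][n]] for m in range(M)]
    R = {(m, n): 0.0 for m in range(M) for n in rows[m]}  # check-to-variable
    bits = [int(v < 0) for v in llr]
    for _ in range(max_iter):
        # Posterior L per bit; the variable-to-check message is L - R.
        L = [llr[n] + sum(R[m, n] for m in range(M) if (m, n) in R)
             for n in range(N)]
        new_R = {}
        for m in range(M):  # check node processing
            for n in rows[m]:
                others = [L[k] - R[m, k] for k in rows[m] if k != n]
                sign = -1.0 if sum(q < 0 for q in others) % 2 else 1.0
                new_R[m, n] = sign * min(abs(q) for q in others)
        R = new_R
        # Variable node processing followed by the hard decision.
        L = [llr[n] + sum(R[m, n] for m in range(M) if (m, n) in R)
             for n in range(N)]
        bits = [int(v < 0) for v in L]
        # Stop once every parity check is satisfied.
        if all(sum(bits[n] for n in rows[m]) % 2 == 0 for m in range(M)):
            break
    return bits


# All-zero codeword with one unreliable, wrongly signed input.
print(min_sum_decode(HAMMING_H, [-1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]))
```

Every check row, and every bit column, is processed without reference to the others inside one half-iteration, which is the independence the GPU kernels exploit.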

Early Termination

Early termination (ET)

Avoid unnecessary computations when codeword converges

Widely used in low power decoding architecture.

ET for LDPC decoder

Use parity check equation: \( H \cdot c^T = 0 \)

Use massive threads to perform parity check

Why GPU for LDPC Decoding?

LDPC Decoding                                   GPU
Highly parallel algorithm: no dependency for    SIMT parallel architecture
computations among rows (or columns)
High complexity iterative algorithm             Enough workload to fully occupy the GPU’s computing resources
Clear algorithm structure                       Partition tasks into kernel functions

Partition the LDPC Decoding Task

Initialization

Start decoding iterations

Check node processing (Kernel 1)

Variable node processing (Kernel 2)

Early termination (ET) check (Kernel 3); go back if the ET condition is not met

Make hard decision

[Figure: flowchart marking which steps run as host code and which run as computation kernels]

CUDA Kernel 1: Check Node Processing

One thread block processes one row of sub-matrices

Each thread block contains 81 threads; each thread processes one row of the H matrix (one check node).

[Figure: 12 thread blocks × 81 threads/TB = 972 threads]

* 802.11n (1944, 972) LDPC code

CUDA Kernel 2: Variable Node Processing

One thread block processes one column of sub-matrices.

Use 1944 threads to run concurrently.

[Figure: 1944 threads, organized in thread blocks, update the variable node messages and the probability value memory]

CUDA Kernel 3: Parallel Early Termination

[Figure: M threads each compute one element of \( b = H \cdot c^T \); after a barrier synchronization, check the equation b[0], b[1], …, b[M-1] = 0]

Decoding Algorithm Optimization

Loosely coupled algorithm

Don't store qmn in the memory.

Before computing rmn, recover qmn first.

Good for CUDA implementation

Reduce the device memory storage

Reduce number of memory operations

Forward-backward check node update

For one row with ωr non-zero elements, we need to traverse the row ωr times.

Use forward-backward algorithms

Reduce number of operations: M ωr(ωr-2) → M (3ωr-2)

For example, with M=2000 and ωr=7, this removes roughly 50% of the operations.

[Figure: before: store r and store q; after: store r and compute q]
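With ωr = 7, the two operation counts above are 7 × 5 = 35 and 3 × 7 − 2 = 19 per row, a reduction of 1 − 19/35 ≈ 46%. The sketch below shows the forward-backward idea for a single check node: one forward pass and one backward pass build partial results, and each outgoing message combines the two partials on either side of its own edge. The min-sum combining rule is an assumption (the slides do not state which check-node rule is used), and all function names are hypothetical.

```python
def check_node_naive(q):
    """For each edge, rescan all the other edges of the row."""
    out = []
    for i in range(len(q)):
        others = q[:i] + q[i + 1:]
        sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
        out.append(sign * min(abs(v) for v in others))
    return out


def combine(a, b):
    """Min-sum combining: product of signs times the smaller magnitude."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def check_node_forward_backward(q):
    """Same outputs as the naive version, with work linear in len(q)."""
    w = len(q)                # requires w >= 2
    fwd = [q[0]] * w          # fwd[i] combines q[0..i]
    bwd = [q[-1]] * w         # bwd[i] combines q[i..w-1]
    for i in range(1, w):
        fwd[i] = combine(fwd[i - 1], q[i])
    for i in range(w - 2, -1, -1):
        bwd[i] = combine(bwd[i + 1], q[i])
    out = [bwd[1]]
    out += [combine(fwd[i - 1], bwd[i + 1]) for i in range(1, w - 1)]
    out.append(fwd[w - 2])
    return out


q = [1.5, -0.5, 2.0, -3.0, 0.7, 4.0, -1.1]
print(check_node_forward_backward(q) == check_node_naive(q))  # True
```

The naive version does work proportional to ωr² per row, while the forward-backward version does three linear passes, which is where the saving comes from.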

Optimization: multi-codeword decoding

Utilize the 2-D thread block structure

Reduce divergent branches

Take advantage of constant memory

Good flexibility and scalability

Optimization – Efficient Storage

Memory optimization

Constant memory: increase throughput by 8%

Compact representation

[Figure: base parity-check matrix made of cyclically shifted identity sub-matrices (I57, I3, I30, I62, …, I0), reduced by horizontal compression and vertical compression into the H_kernel1 and H_kernel2 matrices]

struct h_element {

    byte x;

    byte y;

    byte shift_value;

    byte valid;

};
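One way to read the compact representation is sketched below. The field meanings are assumptions: `x` and `y` are taken to be the block-row and block-column of a sub-matrix, `shift_value` its cyclic shift, and `valid` a padding flag. The base matrix is a made-up fragment, `compress_rows` and `column_of_one` are hypothetical helpers, and the shift direction in `column_of_one` is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HElement:           # mirrors the fields of the slide's struct h_element
    x: int                # assumed: block-row index in the base matrix
    y: int                # assumed: block-column index in the base matrix
    shift_value: int      # cyclic shift of the Z x Z identity sub-matrix
    valid: int            # 1 for a real sub-matrix, 0 for padding


def compress_rows(base):
    """Horizontal compression: keep only the non-empty entries of each row."""
    width = max(sum(s >= 0 for s in row) for row in base)
    table = []
    for x, row in enumerate(base):
        entries = [HElement(x, y, s, 1) for y, s in enumerate(row) if s >= 0]
        entries += [HElement(x, 0, 0, 0)] * (width - len(entries))  # padding
        table.append(entries)
    return table


def column_of_one(elem, local_row, z):
    """Column of the single 1 in one row of the shifted identity block."""
    return elem.y * z + (local_row + elem.shift_value) % z


# Made-up 2 x 3 base matrix; -1 marks an all-zero sub-matrix.
base = [[57, -1, 3],
        [-1, 0, 30]]
table = compress_rows(base)
print(table[0][0], column_of_one(table[0][0], 80, 81))
```

Storing only shift values keeps the whole description small enough for constant memory, and a fixed row width gives every thread the same loop bounds.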

Optimization – Memory Coalescing

Coalescing device memory access

Compact format of Rmn and ∆mn (check node message)

Writing compressed Rmn and ∆mn matrices column-wise → coalesced memory access (20% throughput improvement)

[Figure: one column of Rmn and ∆mn]

Experimental Results: LDPC Decoding Throughput

Code type               # of iterations   Decoding time (ms)   Decoding throughput (Mbps)
802.11n WiFi, N=1944    5                 26.0                 74.7
                        10                49.8                 39
                        15                71.5                 27.2
802.16m WiMAX, N=2304   5                 24.0                 96.1
                        10                43.0                 52.31
                        15                64.0                 36

* Host PC: Intel i5-750 quad-core CPU @ 2.67 GHz, 8 GB DDR3 memory

* GTX 470 Fermi GPU

Experiment results: Early Termination Throughput

Adaptive ET scheme:

Low SNR: ET off

High SNR: ET on

Increase throughput for high SNR

Throughput vs. Workload

WiMax code, 2304 bits, rate ½ code

At first, throughput increases almost linearly as the workload increases

After a certain point, throughput stops increasing because the threads occupy all the SMs in the GPU.

G. Wang et al., GPGPU Accelerated Scalable Parallel Decoding of LDPC Codes, ASILOMAR Conference 2011.

Comparison with Recently Published Work

Work                                 Code                              Normalized throughput (# of iterations = 10)
Park et al. [Journal on WCN 2011]    18000 bits                        2.809 Mbps
Yau et al. [ICACT 2011]              1/2 CMMB codes, 9126 bits         2.046 Mbps
Zhao et al. [ICA3PP 2011]            4058 bits QC-LDPC                 1.067 Mbps
Abburi [VLSID 2011]                  2034 bits WiMax                   40 Mbps
Kennedy [Journal on WCN 2012]        2034 bits WiMax                   32.9 Mbps
Kang [ICC 2012]                      2048 bits, R=0.89                 24.09 Mbps
Our work (results on GTX 470)*       2304 bits WiMax                   52.31 Mbps

* G. Wang et al., A Massively Parallel Implementation of QC-LDPC Decoder on GPU, IEEE SASP 2011.

Beyond Binary LDPC Codes – GF(q) Nonbinary LDPC

[Figure: forward computation and backward computation inside one work group. There are N work groups of q threads each; the arrays F0(0) … F0(q-1), F1(0) … F1(q-1), F2(0) … F2(q-1), F3(0) … F3(q-1) are computed in sequence, separated by a barrier local memory sync. Example matrix entries: 4 2 7 3, 3 1 5 4, 6 2 6, 4 3 7 1]

G. Wang et al., Parallel Nonbinary LDPC Decoding on GPU, ASILOMAR Conference 2012.

Conclusion

Massively parallel implementations of a MIMO detector and an LDPC decoder on GPU

Tailor your algorithm

Tweak algorithm to improve efficiency

Results:

Achieve high throughput

Faster than existing work

Very flexible, can be a good platform for SDR systems

Future work

Improving performance on Kepler

GPU accelerated SDR systems

Links

Guohui Wang: www.GuohuiWang.com

Michael Wu: http://www.ruf.rice.edu/~mbw2/

Acknowledgement

Research supported by US National Science Foundation under grants CNS-1265332, ECCS-1232274, EECS-0925942 and CNS-0923479.

Equipment donations generously provided by NVIDIA.
