Massively Parallel Signal Processing for
Wireless Communication Systems
Michael Wu, Guohui Wang, Joseph R. Cavallaro
Department of ECE, Rice University
Wireless Communication Systems
3/20/2013
[Figure: wireless link to the Internet. Information bits → transmitted signal → received signal → information bits]
Used in many standards:
Heavy computations on RX side:
MIMO Detector: decouple streams to provide estimates of the tx bits.
Channel Decoder: correct errors using redundant data.
[Figure: transceiver block diagram. Transmitter: Source → Channel Encoder → MIMO Modulator → RF TX → Channel. Receiver: RF RX → MIMO Detector → Channel Decoder → Sink]
Wireless Communication Systems (cont.)
Related Research Work
MIMO detector: [SAMOS 2010, IEEE-TVT 2012]
Turbo decoder: [CASES 2010, VLSID 2012]
LDPC decoder: [ASILOMAR 2008, TPDS 2010, JSPS 2010, SASP 2011]
SDR systems: [IEEE Comm. Mag 2010, ISRSP 2011]
Our previous work
MIMO detector: [SiPS 2009, Asilomar 2009, JSPS 2011, SiPS 2012]
Turbo decoder: [SiPS 2010, JSPS 2011]
LDPC decoder: [SASP 2011, Asilomar 2011]
NB-LDPC decoder: [Asilomar 2012]
Massively parallel implementations
Massively parallel implementations:
Tailored algorithms to improve efficiency.
Results:
Achieve high throughput (faster than existing work).
Very flexible; can be a good platform for SDR systems.
[Figure: MIMO detector and LDPC decoder]
Outline
MIMO soft detection algorithm on GPU
Introduction to MIMO detection
Kernel mapping
Optimization techniques
Experiment results
Multi-standard LDPC Decoder on GPU
Introduction to LDPC algorithm
Kernel mapping
Optimization techniques
Experiment results
Modulation
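Encode data in the amplitude and phase of a sinusoid
Higher modulation order → more data per symbol
[Figure: two-point constellation (labels 0, 1) and four-point constellation (labels 11, 00, 10, 01) mapping the bit stream …0111011 to symbols]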
MIMO System Model
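Spatial multiplexing: ↑ throughput by transmitting multiple streams
Receiver: transmitted streams interfere with each other
$$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} n_0 \\ n_1 \\ n_2 \end{bmatrix}$$
$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}$$
[Figure: three bit streams (…0111011, …0101011, …0101111) are sent as x0, x1, x2 through the channel H and received as y0, y1, y2]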
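A minimal sketch of the system model y = Hx + n, to make the stream interference concrete. The channel values, the noise-free setting, and the helper name `mimo_receive` are hypothetical and chosen only for illustration.

```python
def mimo_receive(H, x, n):
    """Return y = H x + n for a channel matrix H given as a list of rows."""
    return [sum(h * xj for h, xj in zip(row, x)) + ni
            for row, ni in zip(H, n)]

# Hypothetical 3x3 real-valued channel and BPSK symbols (+1 / -1).
H = [[1.0, 0.5, 0.2],
     [0.3, 1.0, 0.4],
     [0.1, 0.6, 1.0]]
x = [1, -1, 1]
n = [0.0, 0.0, 0.0]  # noise-free, so only the interference is visible

y = mimo_receive(H, x, n)
# Every entry of y mixes all three transmitted streams,
# which is why the receiver needs a MIMO detector.
```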
MIMO-OFDM
Break a wideband signal into many independent subcarriers
Perform MIMO detection independently many times, once per subcarrier
Many wireless standards use many subcarriers.
LTE 20 MHz subframe: 14*1200 subcarriers (14 OFDM symbols × 1200 subcarriers)
MIMO Detection
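Probability of a path, $\mathbf{x}$, is inversely proportional to $d(\hat{\mathbf{y}}, \mathbf{x}) = \|\hat{\mathbf{y}} - \mathbf{R}\mathbf{x}\|^2 = \sum_i d_i$
Probabilities of all paths are used to generate bit probability values.
4x4 64QAM: $64^4 = 16{,}777{,}216$ paths
$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}: \quad \begin{bmatrix} y_0 \\ y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} n_0 \\ n_1 \\ n_2 \end{bmatrix}$$
With the QR decomposition $\mathbf{H} = \mathbf{Q}\mathbf{R}$ and $\hat{\mathbf{y}} = \mathbf{Q}^H\mathbf{y}$:
$$\hat{\mathbf{y}} = \mathbf{R}\mathbf{x} + \hat{\mathbf{n}}: \quad \begin{bmatrix} \hat{y}_0 \\ \hat{y}_1 \\ \hat{y}_2 \end{bmatrix} = \begin{bmatrix} r_{00} & r_{01} & r_{02} \\ 0 & r_{11} & r_{12} \\ 0 & 0 & r_{22} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} \hat{n}_0 \\ \hat{n}_1 \\ \hat{n}_2 \end{bmatrix}$$
$$d_2 = \|\hat{y}_2 - r_{22}x_2\|^2$$
$$d_1 = \|\hat{y}_1 - r_{12}x_2 - r_{11}x_1\|^2$$
$$d_0 = \|\hat{y}_0 - r_{02}x_2 - r_{01}x_1 - r_{00}x_0\|^2$$
Search Space for 3x3 BPSK
[Figure: search tree with levels x2, x1, x0; each node branches into + and -]
Example path (x2 = -1, x1 = 1, x0 = -1):
$$d_2 = \|\hat{y}_2 - r_{22}(-1)\|^2$$
$$d_1 = \|\hat{y}_1 - r_{12}(-1) - r_{11}(1)\|^2$$
$$d_0 = \|\hat{y}_0 - r_{02}(-1) - r_{01}(1) - r_{00}(-1)\|^2$$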
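The per-level distances can be sketched in Python as follows. This is an illustration under assumptions: R is taken as real-valued, the example matrix is made up, and `path_distance` is a hypothetical helper name. It also enumerates the 3x3 BPSK search space exhaustively, which is only feasible because that space has 2^3 = 8 paths.

```python
from itertools import product

def path_distance(R, y_hat, x):
    """Sum of d_i = |y_hat[i] - sum_{j >= i} R[i][j] * x[j]|^2, last level first."""
    total = 0.0
    for i in reversed(range(len(x))):
        partial = sum(R[i][j] * x[j] for j in range(i, len(x)))
        total += abs(y_hat[i] - partial) ** 2
    return total

# Size of the exhaustive search space for 4x4 64QAM.
num_paths = 64 ** 4

# Hypothetical upper-triangular R and a noise-free observation of x_true.
R = [[2.0, 0.5, 0.1],
     [0.0, 1.5, 0.3],
     [0.0, 0.0, 1.0]]
x_true = [-1, 1, -1]
y_hat = [sum(R[i][j] * x_true[j] for j in range(3)) for i in range(3)]

# Exhaustive search over all 8 BPSK paths.
best = min(product([-1, 1], repeat=3), key=lambda x: path_distance(R, y_hat, x))
```

Exhaustive enumeration does not scale to 64QAM, which motivates the reduced search described next in the slides.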
SSFE Detector
Selective Spanning with Fast Enumeration (SSFE)*
Real value decomposition
Data parallel deterministic search
Generate M likely paths, which are used to generate bit probability values
1st level: enumerate all modulation points.
Subsequent levels: depth-first search, pick the best outgoing node.
[Figure: search tree for 2x2 QPSK with levels 1st antenna real, 1st antenna imaginary, 2nd antenna real, 2nd antenna imaginary]
SSFE Detector: Node Expansion
Find the best node x0 that minimizes the cumulative distance
Pick the constellation point closest to the zero-forcing solution.
Inputs:
• P: [x2=1, x1=1]
• Channel gains: r: [r00 r01 r02]
• Received signal: ŷ0
Zero-forcing solution:
$$\hat{x}_0 = \frac{1}{r_{00}}\left(\hat{y}_0 - r_{02}x_2 - r_{01}x_1\right)$$
Schnorr-Euchner enumeration
64QAM example: $\hat{x}_0 = 3.9$ on the real axis with levels -7 -5 -3 -1 1 3 5 7
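A small sketch of the node expansion step, assuming real-valued quantities. The function names `zero_forcing` and `se_order` are hypothetical. Sorting by distance is used here only to show the Schnorr-Euchner visiting order; a fast-enumeration implementation would compute the nearest point directly rather than sort.

```python
def zero_forcing(y0, r, decided):
    """Zero-forcing estimate of x0.

    r = [r00, r01, r02, ...]; decided = [x1, x2, ...] are the symbols already chosen.
    """
    interference = sum(rj * xj for rj, xj in zip(r[1:], decided))
    return (y0 - interference) / r[0]

def se_order(estimate, levels):
    """Constellation levels ordered by distance to the estimate (Schnorr-Euchner order)."""
    return sorted(levels, key=lambda p: abs(p - estimate))

PAM8 = [-7, -5, -3, -1, 1, 3, 5, 7]  # one real dimension of 64QAM

order = se_order(3.9, PAM8)
best_node = order[0]  # 3 is the level closest to 3.9
```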
GPU Implementation
Search algorithm maps well onto GPU
Data parallel with no sharing
Each search path is independent
Efficient node expansion
complexity doesn’t depend on modulation
Modest storage requirement
M threads per detection
GPU Implementation of SSFE
//enumerate a modulation point for the 1st antenna (real and imaginary parts)
path[0] = mod_table[tid%8];
path[1] = mod_table[tid/8];
dist = 0;
dist += calc_dist(y, r, path[0]);
dist += calc_dist(y, r, path[1]);
//depth-first search over the remaining levels
for i = 2:ntx
    //compute partial Euclidean distance from the levels already decided
    ped = 0;
    for j = 0:i-1
        ped += calc_dist(y, r, path[j]);
    //find best outgoing path for level i
    path[i] = SE_expand(dist, r);
    dist = update_dist(dist, ped);
One thread block handles one subcarrier
Spawns 1 thread per modulation point (M threads)
Completely unrolled inner and outer loops
Path stored in registers
Demodulator + soft estimate computation not shown
Result
4x4 16QAM: 940 Mbps
4x4 64QAM: 480 Mbps
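The search can be sketched sequentially in Python, with each value of `tid` playing the role of one GPU thread. This is an illustrative reconstruction, not the authors' kernel: it assumes a real-valued upper-triangular R (after real value decomposition), 64QAM with 8 levels per real dimension, and detection from the last row upward. All function names are hypothetical.

```python
PAM = [-7, -5, -3, -1, 1, 3, 5, 7]  # 8 real levels per dimension of 64QAM

def slice_to_pam(value):
    """Nearest PAM level, computed directly instead of comparing all levels."""
    level = 2 * round((value - 1) / 2) + 1
    return max(PAM[0], min(PAM[-1], level))

def ssfe_path(tid, R, y_hat):
    """One search path; mirrors the work of a single thread."""
    n = len(y_hat)
    x = [0] * n
    # Top two levels: enumerate every (real, imag) point of the first detected antenna.
    x[n - 1] = PAM[tid % 8]
    x[n - 2] = PAM[tid // 8]
    # Remaining levels: pick the point closest to the zero-forcing estimate.
    for i in range(n - 3, -1, -1):
        interference = sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = slice_to_pam((y_hat[i] - interference) / R[i][i])
    dist = sum((y_hat[i] - sum(R[i][j] * x[j] for j in range(i, n))) ** 2
               for i in range(n))
    return dist, x

def ssfe_detect(R, y_hat):
    """Run all M = 64 independent paths and return the candidate list."""
    return [ssfe_path(tid, R, y_hat) for tid in range(64)]

# Hypothetical 4x4 real-valued example (a 2x2 complex system after decomposition).
R = [[2.0, 0.3, -0.2, 0.1],
     [0.0, 1.8, 0.4, -0.3],
     [0.0, 0.0, 1.5, 0.2],
     [0.0, 0.0, 0.0, 1.2]]
x_true = [3, -5, 1, 7]
y_hat = [sum(R[i][j] * x_true[j] for j in range(4)) for i in range(4)]
candidates = ssfe_detect(R, y_hat)
best = min(candidates, key=lambda c: c[0])
```

Because no path reads another path's state, the 64 calls map directly onto 64 threads with no sharing.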
N-Way MIMO Detector
Duplicate search block depending on FER requirement.
Add permute block which enforces a detection order
Example: N = 2, two search blocks
Larger lists, NM candidates
[Figure: block diagram for N = 2. Inputs y, H feed two parallel branches, each Permute → RVD-QR → SSFE, with detection orders X0 X1 X2 and X1 X2 X0]
Q. Qi, C. Chakrabarti, Parallel high throughput soft-output sphere decoder (SiPS 10)
M. Wu, C. Dick, J. R. Cavallaro, Improving MIMO sphere detection through antenna detection order scheduling (SDR Forum 11)
GPU Implementation of N-Way MIMO Detector
Duplicate threads to improve accuracy of the detection algorithm
Divide a thread block into N subsections
Each subsection consists of M threads and operates on a different channel permutation
Performs SSFE detection independently
Generates a candidate list of size NM.
[Figure: parallelism M for a single search versus parallelism NM for the N-way search]
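A sketch of how a thread index inside a block of N*M threads could be split into a subsection and a modulation point. The cyclic-shift permutation rule is an assumption inferred from the N = 2 example (X0 X1 X2 and X1 X2 X0); the slides do not state the general rule, and both function names are hypothetical.

```python
def thread_assignment(thread_idx, M):
    """Map a thread index in a block of N*M threads to (subsection, modulation point)."""
    return thread_idx // M, thread_idx % M

def cyclic_orders(num_streams, N):
    """Assumed permutation rule: N cyclic shifts of the detection order."""
    base = list(range(num_streams))
    return [base[k:] + base[:k] for k in range(N)]

orders = cyclic_orders(3, 2)                 # detection orders for N = 2
subsection, point = thread_assignment(70, 64)  # thread 70 with M = 64
```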
N-Way MIMO FER Performance
Soft-output detectors + rate 1/2 WiMAX LDPC code, Rayleigh fading channel.
1 outer iteration + 20 inner iterations with early termination
Compared to soft-output K-best and exhaustive (MAP) detectors.
[Figure: FER performance plots for 4x4 16QAM and 4x4 64QAM; arrows mark the "better" direction]
N-Way MIMO Detector Throughput
• GK104, 1536 CUDA cores @ 915 MHz, 256-bit GDDR5 @ 6 Gbps
• 8192 subcarriers, kernel time only
Throughput (Mbps):
        N=1   N=2   N=3   N=4
16QAM   945   476   316   225
64QAM   480   239   151   111
N-Way MIMO Detector Throughput vs Workload
• GK104, 1536 CUDA cores @ 915 MHz, 256-bit GDDR5 @ 6 Gbps
• Kernel time only
[Figure: throughput (Mbps) versus number of subcarriers (0 to 8192) for N=1, N=2, N=3, N=4; panels: 16 QAM and 64 QAM]
Performance Comparison
FPFSD*
N=4: four parallel detectors with different permutations
Differences: a) operates in complex domain; b) one kernel for search + one kernel for soft-output generation
Fermi, 448 cores @ 1150 MHz, 320-bit GDDR5 @ 3 Gbps
Throughput (Mbps):
Number of subcarriers   16QAM FPFSD*   16QAM ours, N=4   64QAM FPFSD*   64QAM ours, N=4
150*7                   72.41          232.27            18.6           67.55
300*7                   82.56          246.60            18.59          69.12
600*7                   90.44          256.83            17.9           69.53
900*7                   91.39          256.22            17.4           70.05
1200*7                  92.31          256.88            17.2           69.97
* Sandra Roger et al., Fully Parallel GPU Implementation of a Fixed-Complexity Soft-Output MIMO Detector (IEEE TVT 2012)
N-Way QR Decomposition
Divide a thread block into N subsections
Each subsection of N threads operates on a different channel permutation
Performs modified Gram-Schmidt QR on an extended matrix [𝐻|𝑌]
Generates 𝑅 and ŷ
QR decomposition time for 8192 symbols (4x4):
N=1        N=2        N=3        N=4
0.201 ms   0.350 ms   0.551 ms   0.675 ms
290.3 Mbps @ 64QAM
• GK104, 1536 CUDA cores @ 915 MHz, 256-bit GDDR5 @ 6 Gbps
• Kernel time only
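Modified Gram-Schmidt on the extended matrix can be sketched as below. Appending y as an extra column yields ŷ = Qᵀy as a by-product, so Q never has to be stored. This is a sequential, real-valued illustration with a made-up channel; the function name `mgs_qr_extended` is hypothetical, and the per-permutation threading of the GPU version is not shown.

```python
import math

def mgs_qr_extended(H, y):
    """Modified Gram-Schmidt on [H | y].

    Returns (R, y_hat) with R upper triangular and y_hat = Q^T y.
    H is a list of rows and must have full column rank.
    """
    m, n = len(H), len(H[0])
    # Work column-wise; the last column is y.
    cols = [[H[i][j] for i in range(m)] for j in range(n)] + [list(y)]
    R = [[0.0] * n for _ in range(n)]
    y_hat = [0.0] * n
    for k in range(n):
        norm = math.sqrt(sum(v * v for v in cols[k]))
        R[k][k] = norm
        q = [v / norm for v in cols[k]]
        # Remove the q component from every remaining column, including y.
        for j in range(k + 1, n + 1):
            proj = sum(qv * cv for qv, cv in zip(q, cols[j]))
            cols[j] = [cv - proj * qv for cv, qv in zip(cols[j], q)]
            if j < n:
                R[k][j] = proj
            else:
                y_hat[k] = proj
    return R, y_hat

# Hypothetical noise-free example: y = H x.
H = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 4.0]]
x = [1, -1, 1]
y = [sum(H[i][j] * x[j] for j in range(3)) for i in range(3)]
R, y_hat = mgs_qr_extended(H, y)
```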
Complete Design
Complete design: QR + MIMO detection
Also includes PCIe transfer time
• GK104, 1536 CUDA cores @ 915 MHz, 256-bit GDDR5 @ 6 Gbps
• 8192 subcarriers
[Figure: bar chart of throughput (Mbps) for 16QAM and 64QAM with N=1 to N=4; bar labels as extracted: 196, 173, 130, 124, 89, 85, 79, 64]
Complete Design
Transfer time doesn't depend on N (number of parallel searches) or M (modulation size)
Transfer time can be hidden
QR depends only on N
Detection depends on N and M
[Figure: stacked bar charts of time (ms) split into Detection, QR, and Transport for N=1 to N=4; panels: 4x4 64QAM and 4x4 16QAM]
Outline
Multi-standard LDPC Decoder on GPU
Introduction to LDPC algorithm
Kernel mapping
Optimization techniques
Experiment results
Channel Coding
Linear block codes
Encoding: 𝒙 ∙ 𝑮 = 𝒄
K-bit 𝒙 encoded into N-bit codeword 𝒄 (K<N)
Generator matrix 𝑮
Parity check matrix 𝑯: 𝑯 ∙ 𝒄^𝑻 = 𝟎 (𝑮 ∙ 𝑯^𝑻 = 𝟎)
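The encoding and parity-check relations can be illustrated with a small linear block code. The (7,4) Hamming code below is chosen only because it is small enough to read; it is not an LDPC code and does not appear in the slides. The helper names `encode` and `syndrome` are hypothetical.

```python
def encode(x, G):
    """c = x · G over GF(2)."""
    return [sum(xi * g for xi, g in zip(x, col)) % 2 for col in zip(*G)]

def syndrome(H, c):
    """H · c^T over GF(2); all zeros for a valid codeword."""
    return [sum(h * ci for h, ci in zip(row, c)) % 2 for row in H]

# Systematic (7,4) Hamming code: G = [I | P], H = [P^T | I].
G = [[1, 0, 0, 0, 1, 1, 0],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]
H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]

c = encode([1, 0, 1, 1], G)   # K = 4 information bits -> N = 7 code bits
corrupted = list(c)
corrupted[2] ^= 1             # a single bit error breaks the parity checks
```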
Low-density parity-check (LDPC) codes
Error-correction codes
Provides near-capacity error-correcting performance
[Figure: Turbo codes and LDPC codes]
Application of LDPC codes
Wireless communication
IEEE 802.16m WiMAX
IEEE 802.11n, 802.11ac WiFi
10 Gbps Ethernet communication
IEEE 802.3an
Digital broadcast: DVB-S2
High speed magnetic storage device
Satellite communication
Challenges of decoder design
High throughput requirement
Multi-standard support
Flexibility and scalability
LDPC Codes
LDPC codes are linear block codes defined by sparse matrices 𝑯
Codeword 𝒄 should satisfy the parity-check equations: 𝑯 · 𝒄^𝑇 = 0
Belief propagation decoding algorithm
[Figure: sparse matrix H and its Tanner graph with check nodes CN0 to CN3 and variable nodes VN0 to VN7]
LDPC Decoding: Belief propagation decoding
Bit stream: 0, 1, 1, 0, 1, 0, 0, 1, …
Modulation
Modulated symbol: -1, 1, 1, -1, 1, -1, -1, 1, …
Wireless channel
Received symbol: -1.3, 0.8, 1.1, -0.7, 0.5, -1.2, -0.9, 1.1, …
Probability(cn=0 | received) vs. Probability(cn=1 | received)
• 𝑯 · 𝒄^𝑇 = 0
• Complexity ~O(N^3)
[Figure: Tanner graph with check nodes CN0 to CN3 and variable nodes VN0 to VN7]
Belief propagation decoding
[Figure: flowchart. Initialization → Check Node Processing (Rmn) → Variable Node Processing (Qmn) → repeat while i < max_iter → Finish decoding → Done]
Belief propagation decoding algorithm: initialization
[Figure: wireless receiver. Detector/Demodulator → probability values → Channel Decoder → decoded bit stream]
[Figure: Tanner graph; the bit probability values initialize the variable nodes VN0 to VN7]
Belief propagation decoding algorithm: CNP
[Figure: decoder loop. Init decoder → Check Node Processing → Variable Node Processing → Hard Decision → Decoded Bits, with messages R, Q, and L]
[Figure: check node CN0 computes R00 for VN0 from Q10, Q20, Q70 received from VN1, VN2, VN7]
[Figure: check node CN0 computes R01 for VN1 from Q00, Q20, Q70 received from VN0, VN2, VN7]
Belief propagation decoding algorithm: VNP
[Figure: variable node VN0 computes Q20 for CN2 from R00 received from CN0]
Belief propagation decoding algorithm: hard decision
[Figure: hard decision block. Test L > 0 to choose between xn=1 and xn=0]
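The steps above (initialization, check node processing, variable node processing, hard decision) can be sketched as a compact flooding decoder. Several choices here are assumptions rather than the slides' method: it uses the min-sum approximation of belief propagation, it takes llr[n] = log(P(c_n = 0) / P(c_n = 1)) so that a negative value decides bit 1 (the slides may use the opposite sign convention), and it runs on a tiny (7,4) Hamming parity-check matrix instead of a real LDPC code.

```python
def min_sum_decode(H, llr, max_iter=20):
    """Flooding min-sum decoder with early termination."""
    rows = [[n for n, h in enumerate(row) if h] for row in H]
    # Initialization: every variable-to-check message Q_mn starts at the channel LLR.
    Q = {(m, n): llr[n] for m, cols in enumerate(rows) for n in cols}
    bits = [1 if v < 0 else 0 for v in llr]
    for _ in range(max_iter):
        # Check node processing: R_mn uses all other Q_mn' on the same row.
        R = {}
        for m, cols in enumerate(rows):
            for n in cols:
                others = [Q[m, k] for k in cols if k != n]
                sign = -1 if sum(v < 0 for v in others) % 2 else 1
                R[m, n] = sign * min(abs(v) for v in others)
        # Variable node processing: total belief L_n and extrinsic Q_mn.
        total = list(llr)
        for (m, n), r in R.items():
            total[n] += r
        for (m, n), r in R.items():
            Q[m, n] = total[n] - r
        # Hard decision, then stop once H · c^T = 0 holds.
        bits = [1 if v < 0 else 0 for v in total]
        if all(sum(bits[n] for n in cols) % 2 == 0 for cols in rows):
            break
    return bits

H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]
# Codeword 1011010 sent; bit 0 arrives with the wrong sign and low reliability.
llr = [1.0, 2.0, -2.0, -2.0, 2.0, -2.0, 2.0]
decoded = min_sum_decode(H, llr)
```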
Early Termination
Early termination (ET)
Avoid unnecessary computations when the codeword converges
Widely used in low-power decoding architectures.
ET for LDPC decoder
Use the parity-check equation: H · c^T = 0
Use massive threads to perform the parity check
Why GPU for LDPC Decoding?
LDPC decoding → GPU
Highly parallel algorithm (no dependency for computations among rows or columns) → SIMT parallel architecture
High-complexity iterative algorithm → enough workload to fully occupy the GPU's computing resources
Clear algorithm structure → partition tasks into kernel functions
Partition the LDPC Decoding Task
Initialization (host code)
Start decoding iterations
Check node processing (Kernel 1)
Variable node processing (Kernel 2)
Early termination (ET) check (Kernel 3); go back if the ET condition is not met
Make hard decision (host code)
CUDA Kernel 1: Check Node Processing
One thread block processes one row of sub-matrices.
Each thread block contains 81 threads; each thread processes one row of the H matrix (one check node).
12 thread blocks × 81 threads per thread block = 972 threads
* 802.11n (1944, 972) LDPC code
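The launch geometry follows from the code parameters. The sketch below works it out for the 802.11n (1944, 972) code with 81x81 sub-matrices; the function name `check_node_index` and the linear index formula are assumptions about how a thread would locate its row.

```python
Z = 81                    # sub-matrix size for the 802.11n (1944, 972) code
BLOCK_ROWS = 972 // Z     # rows of sub-matrices -> thread blocks in the check node kernel
BLOCK_COLS = 1944 // Z    # columns of sub-matrices in H

def check_node_index(block_idx, thread_idx):
    """Row of H (check node) handled by one thread of the check node kernel."""
    return block_idx * Z + thread_idx

# Every check node is covered exactly once.
covered = sorted(check_node_index(b, t)
                 for b in range(BLOCK_ROWS) for t in range(Z))
```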
CUDA Kernel 2: Variable Node Processing
One thread block processes one column of sub-matrices.
Use 1944 threads to run concurrently.
[Figure: 1944 threads, grouped into thread blocks, read the probability value memory and update the variable node messages]
CUDA Kernel 3: Parallel Early Termination
[Figure: M threads compute b = H · c^T, one parity-check equation per thread; after a barrier synchronization, check whether b[0], b[1], …, b[M-1] are all 0]
Decoding Algorithm Optimization
Loosely coupled algorithm
Don't store qmn in memory.
Before computing rmn, recover qmn first.
Good for CUDA implementation
Reduce the device memory storage
Reduce the number of memory operations
[Figure: before: store r, store q; after: store r, compute q]
Forward-backward check node update
For one row with ωr non-zero elements, we need to traverse the row ωr times.
Use the forward-backward algorithm
Reduce the number of operations: M ωr(ωr-2) → M (3ωr-2)
For example, M=2000, ωr=7 (35 → 19 operations per row): reduces operations by ~50%.
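The forward-backward idea can be sketched for a single check node row. The min-sum combination rule is an assumption (the slides do not say which check node update is used), and the function names are hypothetical. The naive version re-traverses the row for every output; the forward-backward version makes one forward pass, one backward pass, and one merge pass.

```python
def check_node_naive(q):
    """For each position, combine all other entries by re-traversing the row."""
    out = []
    for i in range(len(q)):
        others = q[:i] + q[i + 1:]
        sign = -1 if sum(v < 0 for v in others) % 2 else 1
        out.append(sign * min(abs(v) for v in others))
    return out

def combine(a, b):
    """Min-sum combination of two messages."""
    sign = -1 if (a < 0) != (b < 0) else 1
    return sign * min(abs(a), abs(b))

def check_node_forward_backward(q):
    """Same outputs from prefix (forward) and suffix (backward) combinations."""
    w = len(q)                      # requires w >= 2
    fwd = [q[0]] * w                # fwd[i] combines q[0..i]
    bwd = [q[-1]] * w               # bwd[i] combines q[i..w-1]
    for i in range(1, w):
        fwd[i] = combine(fwd[i - 1], q[i])
    for i in range(w - 2, -1, -1):
        bwd[i] = combine(bwd[i + 1], q[i])
    middle = [combine(fwd[i - 1], bwd[i + 1]) for i in range(1, w - 1)]
    return [bwd[1]] + middle + [fwd[w - 2]]

q = [1.5, -2.0, 0.5, 3.0, -1.0, 2.5, -0.7]   # hypothetical row with 7 non-zero entries
w = len(q)
ops_naive, ops_fb = w * (w - 2), 3 * w - 2   # per-row counts quoted in the slides
```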
Optimization: multi-codeword decoding
Utilize the 2-D thread block structure
Reduce divergent branches
Take advantage of constant memory
Good flexibility and scalability
Optimization – Efficient Storage
Memory optimization
Constant memory: increases throughput by 8%
Compact representation
[Figure: H matrix built from cyclically shifted identity sub-matrices (entries such as I57, I3, I30); horizontal compression and vertical compression produce the H_kernel1 and H_kernel2 matrices]
struct h_element {
    byte x;
    byte y;
    byte shift_value;
    byte valid;
};
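A sketch of how such a compact element could be expanded back into positions of the full H matrix. The slides do not define the fields, so the meanings below are assumptions: x and y as block-row and block-column indices, shift_value as the cyclic shift of the identity sub-matrix, and valid as a padding flag. The shift direction and the name `connected_column` are also assumptions.

```python
from dataclasses import dataclass

@dataclass
class HElement:
    x: int            # assumed: block-row index of the sub-matrix
    y: int            # assumed: block-column index of the sub-matrix
    shift_value: int  # cyclic shift of the identity sub-matrix
    valid: int        # assumed: 1 for a real entry, 0 for padding

def connected_column(elem, local_row, Z):
    """Column of the full H matrix that holds the 1 in `local_row` of this sub-matrix.

    Assumes row r of the shifted identity has its 1 at column (r + shift) mod Z.
    """
    return elem.y * Z + (local_row + elem.shift_value) % Z

elem = HElement(x=0, y=1, shift_value=57, valid=1)
col_first = connected_column(elem, 0, 81)    # row 0 of the sub-matrix
col_wrapped = connected_column(elem, 30, 81) # the shift wraps around the sub-matrix
```

Storing four small fields per non-zero sub-matrix instead of a Z x Z block is what keeps the description small enough for constant memory.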
Optimization – Memory Coalescing
Coalescing device memory access
Compact format of Rmn and ∆mn (check node messages)
Writing compressed Rmn and ∆mn matrices column-wise → coalesced memory access (20% throughput improvement)
[Figure: one column of Rmn and ∆mn]
Experimental Results: LDPC Decoding Throughput
Code type               # of iterations   Decoding time (ms)   Decoding throughput (Mbps)
802.11n WiFi, N=1944    5                 26.0                 74.7
802.11n WiFi, N=1944    10                49.8                 39
802.11n WiFi, N=1944    15                71.5                 27.2
802.16m WiMAX, N=2304   5                 24.0                 96.1
802.16m WiMAX, N=2304   10                43.0                 52.31
802.16m WiMAX, N=2304   15                64.0                 36
* Host PC: Intel i5-750 quad-core CPU @ 2.67 GHz, 8 GB DDR3 memory
* GTX 470 Fermi GPU
Experiment results: Early Termination Throughput
Adaptive ET scheme:
Low SNR: ET off
High SNR: ET on
Increases throughput for high SNR
Throughput VS Workload
WiMAX code, 2304 bits, rate 1/2
At first, throughput increases almost linearly as the workload increases
After a certain point, throughput stops increasing because the threads occupy all the SMs in the GPU.
G. Wang et al., GPGPU Accelerated Scalable Parallel Decoding of LDPC Codes, ASILOMAR Conference 2011.
Comparison with Recently Published Work
Work                                Code length                  Normalized throughput (# of iterations = 10)
Park et al. [Journal on WCN 2011]   18000 bits                   2.809 Mbps
Yau et al. [ICACT 2011]             9126 bits, 1/2 CMMB codes    2.046 Mbps
Zhao et al. [ICA3PP 2011]           4058 bits, QC-LDPC           1.067 Mbps
Abburi [VLSID 2011]                 2034 bits, WiMAX             40 Mbps
Kennedy [Journal on WCN 2012]       2034 bits, WiMAX             32.9 Mbps
Kang [ICC 2012]                     2048 bits, R=0.89            24.09 Mbps
Our work (results on GTX 470)*      2304 bits, WiMAX             52.31 Mbps
* G. Wang et al., A Massively Parallel Implementation of QC-LDPC Decoder on GPU, IEEE SASP 2011.
Beyond Binary LDPC Codes – GF(q) Nonbinary LDPC
[Figure: forward and backward computation inside one work group. q threads compute F0(0..q-1), F1(0..q-1), F2(0..q-1), F3(0..q-1) in turn, with a barrier local memory sync between steps; N work groups in total]
G. Wang et al, Parallel Nonbinary LDPC Decoding on GPU, ASILOMAR Conference 2012.
Conclusion
Massively parallel implementations of a MIMO detector and an LDPC decoder on GPU
Tailor your algorithm
Tweak algorithm to improve efficiency
Results:
Achieve high throughput
Faster than existing work
Very flexible; can be a good platform for SDR systems
Future work
Improving performance on Kepler
GPU-accelerated SDR systems
Links
Guohui Wang: www.GuohuiWang.com
Michael Wu: http://www.ruf.rice.edu/~mbw2/
Acknowledgement
Research supported by the US National Science Foundation under grants CNS-1265332, ECCS-1232274, EECS-0925942 and CNS-0923479.
Equipment donations generously provided by NVIDIA.