April 2020 Elena Agostini and Joseph Boccuzzi

BUILDING O-RAN BASED HIGH PERFORMANCE 5G RAN SYSTEMS WITH NVIDIA GPU AND MELLANOX NIC

In this webinar, we will walk through the NVIDIA Aerial solution and its O-RAN implementation. NVIDIA Aerial is a set of SDKs that enables GPU-accelerated, software-defined 5G wireless RANs. Today, NVIDIA Aerial provides two critical SDKs: cuVNF and cuBB. These SDKs can be combined to implement a software-accelerated physical layer on the O-DU that communicates, through a Fronthaul I/O interface, with a set of radio heads to send, receive, and process 5G packets using GPUs. We'll show our implementation of the Fronthaul I/O interface that enables O-RAN communication with a radio unit, and give an overview of the most challenging issues we faced in differentiating between hardware-accelerated and software-accelerated features.

Elena Agostini, Software Engineer, NVIDIA

Joseph Boccuzzi, Principal 5G Architect, NVIDIA

AERIAL: 5G vRAN BASEBAND PROCESSING

THREE REVOLUTIONS HAPPENING

5G: 5G will deliver 1000X better bandwidth and 10X lower latency than 4G.

IoT: IoT devices are projected to grow to >150B by 2025 and >1T by 2035.

AI: By 2025, AI at the edge has a potential total economic impact of up to $11T/year.

A flexible and scalable network is needed:

• SW defined
• Open & standards based solutions
• VNF/CNF based deployments

OPEN 5G vRAN DEPLOYMENTS
5G + AI + Edge Compute

[Figure: GPU based, SW defined 5G platform spanning the Access Network and the Core Network. Labels: RU, Fronthaul, vDU, vCU, Midhaul, Backhaul, vUPF, vAMF, MEC, AI, DataCenter, Edge Cloud, Regional Cloud, Core Cloud. The legend marks the virtualized/containerized elements.]

Edge Compute Benefits:

• Lower BW
• Reduced Latency
• Improved Reliability
• Increased Privacy
• Extends Application Space

5G vRAN AND EDGE COMPUTING ARCHITECTURE
Open & Standards based solution enables Edge Computing

[Figure: Impact of CUPS. RU, DU, CU, UPF, MEC, AMF, SMF and Near-RT RIC placed across the Edge Cloud, Regional Cloud and Core Cloud, connected by the Fronthaul, F1, E2, N2, N3, N4 and N6 interfaces.]

DU = Distributed Unit
CU = Centralized Unit
UPF = User Plane Function
AMF = Access & Mobility Mgmt. Function
SMF = Session Mgmt. Function
MEC = Multi-Access Edge Compute
RU = Radio Unit
RIC = RAN Intelligent Controller
CUPS = Control and User Plane Separation

FLEXIBLE AND SCALABLE 5G NETWORK
SW Defined 5G solution enables Network Slicing Service

[Figure: GPU based solution. RU, DU, CU, UPF, MEC, AMF, SMF and Near-RT RIC placed across the Edge Cloud, Regional Cloud and Core Cloud, connected by the Fronthaul, F1, E2, N2, N3, N4 and N6 interfaces, and carrying three network slices: uR-LLC, mMTC and eMBB. Example services shown: Smart City, HD Video/Gaming, Remote Control.]

mMTC = Massive Machine Type Comm.
eMBB = Enhanced Mobile Broadband
uR-LLC = Ultra Reliable – Low Latency Comm.

NVIDIA AERIAL SDK
Highest Performance 5G Software-Defined Radio

[Figure: Aerial (cuBB and cuVNF) on a CPU, GPU and Mellanox NIC. Labels: DPDK, GPU Direct RDMA.]

• Rich CUDA programmable environment
• High performance & energy efficient
• Scalable to mmWave & massive MIMO
• 100% SW Defined
• One architecture from edge to cloud
• Commercial off-the-shelf (COTS)
• Open & Standards based

cuBB = CUDA Baseband
cuVNF = CUDA Virtualized Network Function
CUDA = Compute Unified Device Architecture

NVIDIA'S NEXT GENERATION COTS SOLUTION
5G demands a Flexible & Scalable SW Defined Network

[Figure: CU/DU server between the CN and the RU. Labels: CPU with DDR (CPU: L2+ Functionality), GPU with DDR (GPU: L1 Functionality), PCIe links, BackHaul, FrontHaul.]

• GPU performs computationally intensive applications very well.
• Inline functionality eliminates the need to move data back and forth.
• GPU SW acceleration scales with 5G deployment complexity.
• Provides a wide variety of application re-use.
• Supports “speed-of-light” development & innovation (Specification Releases, ML-based algorithms).

NVIDIA AERIAL STACK

AERIAL includes two SDKs:

• cuVNF delivers low-latency GPU Direct packet I/O to GPU memory.
• cuBB offers accelerated 5G baseband processing that’s been highly tuned for NVIDIA GPUs.

[Figure: Aerial stack layers, top to bottom]
cuBB: cuPHY; FH I/O lib (data placement, O-RAN formatting)
cuVNF platform features: GPU DPDK, GPU Direct RDMA, Header/Data Split, O-RAN flow identification, Accurate Packet Scheduling
Toolkit & Drivers: CUDA Toolkit, Mellanox OFED, nv_peer_mem
HW: CPU, GPU, Mellanox NIC

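AERIAL BASED 5G O-RAN DEPLOYMENT
O-RAN FH Split Option 7.2

[Figure: end-to-end chain with labels App, UE, RU (RF, PHY-L, FH), CU/DU (FH, PHY-U, L2+), N2/N3, CN, N6, App. RU and CU/DU together form the 5G gNB. Aerial is marked on the CU/DU side together with CRC, FEC and the O-RAN FH.]

CU/DU processing blocks, in processing order:
PDSCH: TrBlk CRC, CB Seg + CRC, LDPC Encode, Rate Match, Scram, Mod, Layer Map, PreCode, RE Map, IQ Comp, FH
PUSCH: FH, IQ De-Comp, RE De-Map, Chan Est, EQ, De-Mod, De-Scram, De-Rate Match, LDPC Decode, CB Con + CRC, TrBlk CRC

RU processing blocks shown: FH, IQ De-Comp, IQ Comp, PreCode + BF, DBF, iFFT, FFT, CP+, CP-, MIMO, DAC, ADC, ABF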

AERIAL INTEGRATION: END-TO-END NSA SYSTEM

E2E Integration:

[Figure: Core Network (EPC, SGi), S1 BackHaul, 5G gNB with CU (PDCP) and DU (L2+ with MAC and RLC, FAPI, cuPHY-U, O-RAN C/U, Sync and Mgnt planes, NIC, PTP4L, PHC2SYS), IP Switch with PTP GrandMaster, and UE-EM (PHY-L with O-RAN C/U, Sync and Mgnt planes, UE Stack, NIC). PTP Timing / PTP Sync is distributed to both ends of the fronthaul.]

UE-EM = RU + UE

NVIDIA AERIAL cuPHY
Aerial SDK (Alpha Release)

[Figure: Location within 5G gNB. Labels: L2+, cuPHY, O-RAN Front Haul, CPU, GPU, NIC.]

DL (Tx):
• SS Block (P/S-SS, PBCH): P/S-SS, Polar Encode, Scrambling, DMRS Gen, Modulation
• PDCCH: Polar Encode, Modulation, DMRS Gen
• PDSCH: TrBlk CRC, CB Seg + CRC, LDPC Encode, Rate Match, Scram, Mod, Layer Map, PreCode, RE Map

UL (Rx):
• PUSCH: RE De-Map, Chan Est, EQ, De-Mod, De-Scram, De-Rate Match, LDPC Decode, CB Con + CRC, TrBlk CRC
• PUCCH (blocks shown): Chan Est, Matched Filter, Detector, CC removal

AERIAL SDK: ALPHA RELEASE
Key Feature listing

PDSCH
o Layers supported = 8 SU-MIMO, 16 MU-MIMO

PDCCH
o DCI 0_0 & 1_1, DMRS generation, time-frequency mapping

SS-Block (PBCH + PSS + SSS)
o PSS/SSS generation, DMRS generation, time-frequency mapping

PUSCH
o Layers supported = 4 SU-MIMO, 8 MU-MIMO

PUCCH
o Format 1, Multiplexing

HARQ support

Carrier BW = 20 MHz & 100 MHz

TDD/FDD

SCS supported = 15 kHz & 30 kHz

Multi-user Support

O-RAN Front Haul (split 7.2) support

All DL/UL modulations supported

5G FEC Processing
o LDPC encoding/decoding
o Rate Matching, De-Rate Matching
o Scrambling, Descrambling
o CB/TB CRC
o Polar Encoding
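To make the "CB/TB CRC" item concrete, here is a minimal CPU-side Python sketch of CRC attachment and checking. It is not cuPHY code; it assumes the CRC24A generator polynomial that 3GPP TS 38.212 defines for larger transport blocks, and the helper names `crc24a`, `attach_crc` and `check_crc` are hypothetical.

```python
# Assumption: gCRC24A from 3GPP TS 38.212, written without its implicit x^24 term.
CRC24A_POLY = 0x864CFB


def crc24a(data: bytes) -> int:
    """Bitwise, MSB-first CRC-24 with zero initial value and no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 16                      # feed the next byte into the top of the register
        for _ in range(8):
            crc <<= 1
            if crc & 0x1000000:                # a bit fell out of the 24-bit register
                crc ^= 0x1000000 | CRC24A_POLY
    return crc


def attach_crc(transport_block: bytes) -> bytes:
    """Append the 24 parity bits to the block (transmit side)."""
    return transport_block + crc24a(transport_block).to_bytes(3, "big")


def check_crc(block_with_crc: bytes) -> bool:
    """A block that still carries a valid CRC has a zero remainder (receive side)."""
    return crc24a(block_with_crc) == 0
```

On the receive side, a failed check of this kind is what marks a block as being in error when measuring BLER.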

https://developer.nvidia.com/aerial-sdk

AERIAL SDK CUSTOMER CAPABILITIES
What can you do with Aerial SDK?

• Uplink and Downlink test cases are provided.
• The user can collect performance benchmarks such as block latency and uplink BLER.
• Example CUDA implementations of PHY signal processing blocks are provided.

https://developer.nvidia.com/aerial-sdk

Poll Question #1

In which area do you expect the application of AI/ML to have a significant impact?

• Physical Layer
• Layer 2+ (e.g. MAC, RLC, PDCP, SDAP)
• Network Management

cuVNF: TECHNOLOGY, LIBRARY, FEATURES

cuVNF
Overview

• The NVIDIA cuVNF SDK provides a set of network libraries and features that send and receive packets directly to and from GPU memory using GPUDirect-capable network interface cards (NICs), such as Mellanox NICs.
• The SDK package is based on DPDK 19.11 with:
  • NVIDIA API extensions to send/receive packets using GPU memory (GPU DPDK)
  • GDRCopy: required to let the CPU access any GPU memory area
  • Testpmd app with NVIDIA extensions to benchmark traffic forwarding with GPU memory
  • l2fwd app with NVIDIA extensions as an example of:
    • How to use the NVIDIA API to send/receive packets to and from GPU memory
    • Different techniques to interact with packets in GPU memory

NVIDIA AERIAL STACK
cuBB & cuVNF

[Figure: Aerial stack. cuBB: cuPHY and FH I/O lib (data placement, O-RAN formatting). cuVNF platform features: GPU DPDK, GPU Direct RDMA, Header/Data Split, O-RAN flow identification, Accurate Packet Scheduling. Toolkit & Drivers: CUDA Toolkit, Mellanox OFED, nv_peer_mem. HW: CPU, GPU, Mellanox NIC.]

GPUDIRECT RDMA
In a nutshell

• 3rd party PCIe devices (e.g. a network card) can directly read/write GPU memory
• GPU and external device must be under the same PCIe root complex
• No unnecessary system memory copies and CPU overhead
• cudaMalloc(gpu_buffer) + MPI_Send(gpu_buffer)
• https://docs.nvidia.com/cuda/gpudirect-rdma/index.html

GPUDIRECT RDMA
System topology

[Figure: two GPUDirect RDMA system topologies, labeled "Best" and "Good".]

Mellanox module: https://github.com/Mellanox/nv_peer_memory

DPDK
Overview

Data Plane Development Kit:

• A set of data plane libraries and network interface controller drivers for fast packet processing
• From user space, an application can talk directly to the NIC, avoiding OS procedures (and their latencies)
• Mempool: contiguous system memory area which holds a list of mbufs (packet buffers)
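As a rough illustration of the mempool idea, the following Python sketch models one contiguous allocation carved into fixed-size mbufs with a free list. It is a toy model only: the class `Mempool` and its methods are hypothetical names and do not correspond to the DPDK API.

```python
class Mempool:
    """Toy model of a mempool: one contiguous buffer carved into fixed-size mbufs."""

    def __init__(self, n_mbufs: int, mbuf_size: int):
        self.mbuf_size = mbuf_size
        self.memory = bytearray(n_mbufs * mbuf_size)  # single contiguous allocation
        self.free = list(range(n_mbufs))              # indices of mbufs not in use

    def alloc(self) -> int:
        """Take one mbuf out of the pool and return its index."""
        if not self.free:
            raise MemoryError("mempool exhausted")
        return self.free.pop()

    def release(self, idx: int) -> None:
        """Give an mbuf back to the pool."""
        self.free.append(idx)

    def view(self, idx: int) -> memoryview:
        """Writable window onto the bytes of one mbuf, without copying."""
        start = idx * self.mbuf_size
        return memoryview(self.memory)[start:start + self.mbuf_size]


pool = Mempool(n_mbufs=4, mbuf_size=2048)
rx = pool.alloc()
pool.view(rx)[:5] = b"hello"   # a NIC would DMA packet bytes here
pool.release(rx)
```

Because every mbuf is a fixed offset into one preallocated region, allocation on the fast path is just an index pop, with no per-packet memory allocation.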

GPU DPDK
Recipe: DPDK & NVIDIA & Mellanox

GPUDirect RDMA: NVIDIA GPU + Mellanox NIC
+ NVIDIA API to allocate mbufs content in GPU memory
+ DPDK 19.11
= GPU DPDK

Works with both GPUDirect RDMA HW topologies

Header/Data split feature:

• The same network packet is split into two mbufs from different mempools (first A bytes in the first mempool, remaining B bytes in the second mempool)
• Useful to receive the packet header on the CPU and the packet payload on the GPU
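The header/data split described above can be sketched in a few lines of Python. This is a software illustration under stated assumptions, not the NIC feature itself: the `Mbuf` type, the `"cpu"`/`"gpu"` pool labels and the function `header_data_split` are hypothetical, and no GPU memory is involved.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Mbuf:
    pool: str    # label of the mempool holding these bytes: "cpu" or "gpu"
    data: bytes


def header_data_split(packet: bytes, header_len: int) -> Tuple[Mbuf, Mbuf]:
    """Split one packet into two segments: the first A bytes and the remaining B bytes."""
    if header_len > len(packet):
        raise ValueError("packet shorter than the requested header length")
    header = Mbuf("cpu", packet[:header_len])    # headers stay where the CPU can parse them
    payload = Mbuf("gpu", packet[header_len:])   # payload goes where the GPU will process it
    return header, payload


hdr, pay = header_data_split(bytes(range(64)), header_len=22)
```

The two segments together still make up the original packet; only their placement differs.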

GPU DPDK
Dealing with GPU memory: ordinary CUDA kernel

Ordinary CUDA kernel: launch a CUDA kernel after receiving packets

• Pros:
  • GPU resources used only when needed
  • No GPU memory consistency problem
  • Possible overlap between GPU processing and network activity
• Cons:
  • High response latency
  • CUDA kernel launch latency for each new set of packets

GPU DPDK
Dealing with GPU memory: persistent kernel

Persistent CUDA kernel: pre-launch a CUDA kernel that polls a memory area, waiting for new packets

• Pros:
  • Low response latency
  • Avoids CUDA kernel launch latency for each RX set of packets
  • Possible overlap between GPU processing and network activity
• Cons:
  • GPU resources held by the persistent kernel during polling
  • CPU-GPU communication mechanism via polling/flags update
  • GPU memory consistency problem
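The polling/flags mechanism can be mimicked on the CPU with a worker thread standing in for the persistent kernel. This is only an analogy, written in Python with hypothetical names (`persistent_worker`, `run`, the `FREE`/`READY`/`DONE` states); it shows the handshake, not real GPU behavior or timing.

```python
import threading
import time

FREE, READY, DONE = 0, 1, 2   # per-slot status flags shared by producer and worker


def persistent_worker(flags, packets, results, stop):
    """Stand-in for a persistent kernel: started once, then polls flags for new work."""
    slot = 0
    while not stop.is_set():
        if flags[slot] == READY:
            results[slot] = packets[slot][::-1]   # placeholder "processing"
            flags[slot] = DONE                    # signal completion back to the CPU
            slot = (slot + 1) % len(flags)
        else:
            time.sleep(0)                         # yield; a GPU kernel would keep spinning


def run(batches):
    """Feed batches to the pre-launched worker and wait for all results."""
    if not batches:
        return []
    n = len(batches)
    flags, packets, results = [FREE] * n, [None] * n, [None] * n
    stop = threading.Event()
    worker = threading.Thread(target=persistent_worker,
                              args=(flags, packets, results, stop))
    worker.start()                                # launch cost is paid once, not per batch
    for i, batch in enumerate(batches):
        packets[i] = batch                        # write the data first ...
        flags[i] = READY                          # ... then publish the flag
    while any(f != DONE for f in flags):          # CPU side polls for completion
        time.sleep(0)
    stop.set()
    worker.join()
    return results
```

The "write data, then set the flag" ordering is the part that becomes the memory consistency problem on a real GPU, where the kernel must be guaranteed to see the packet bytes before it sees the flag.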

GPU DPDK – L2FWD
Overview

L2fwd-nv:

• Basic l2fwd example powered with NVIDIA extensions
• Showcase of interaction with packets in GPU memory (ordinary vs persistent CUDA kernel)
• Trivial workload: swap MAC addresses of each Rx packet

Testpmd:

• Default DPDK application for network benchmarks
• Used as packet generator
• Tx throughput 100Gbps
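The trivial l2fwd-nv workload, swapping the MAC addresses of each received packet, looks like this as a CPU-side Python sketch. The function name `swap_mac` is hypothetical; in l2fwd-nv the equivalent work runs in a CUDA kernel on packets held in GPU memory.

```python
def swap_mac(frame: bytes) -> bytes:
    """Return an Ethernet frame with destination and source MAC addresses exchanged."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    dst, src = frame[0:6], frame[6:12]   # Ethernet header: dst MAC, src MAC, EtherType
    return src + dst + frame[12:]


frame = bytes.fromhex("ffffffffffff" "001122334455" "0800") + b"payload"
forwarded = swap_mac(frame)
```

Because the per-packet work is this small, the benchmark mostly measures packet I/O and kernel launch or polling overhead rather than compute.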

GPU DPDK – L2FWD
Performance comparison

• GPU memory vs CPU memory
• GPU processing vs CPU processing
• Persistent kernel shows 10% better performance
  • L2FWD has trivial compute
  • Significantly more complex to use
• Regular kernels are flexible and can give similar performance
  • Latencies get overlapped with larger workloads
• System HW:
  • Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (Skylake)
  • NVIDIA GPU V100
  • Mellanox ConnectX-5, 100Gbps
  • PCIe bridge Broadcom PLX Technology 9797

AERIAL SDK 5G vRAN

E2E INTEGRATION
Overview

O-RAN: standard that defines the communication protocol between the Distributed Unit and the Radio Unit

• Distributed Unit (O-DU): a logical node hosting RLC/MAC/High-PHY layers based on a lower layer functional split
• Radio Unit (O-RU): a logical node hosting Low-PHY layer and RF processing based on a lower layer functional split
• Central Unit (O-CU): a logical node hosting PDCP, RRC and other control functions. At present, NVIDIA uses an external 3rd party stack vendor for this component

Two types of O-DU <--> O-RU interactions:

• Uplink: O-RU --> O-DU
• Downlink: O-DU --> O-RU

Communication planes:

• C-plane: configures how to process the data packets of the next time slot
• U-plane: data packets
• M-plane: network setup and management
• S-plane: network synchronization

[Figure: end-to-end setup across L1, L2 and L3. Labels: EPC, SGi, S1-U, 5G gNB with CU (PDCP) and DU (MAC, RLC, FAPI, cuPHY-U, O-RAN C/U, Sync and Mgnt planes, PTP4L, PHC2SYS), IP Switch, PTP Grand Master, PTP Synch, UE-EM (RU with PHY-L and O-RAN C/U, Sync and Mgnt planes, UE Stack).]

O-RAN FH COMPLIANT 5G L1 INTERFACE
Aerial SDK components

Components:

• cuVNF to send/receive packets from/into GPU memory
• O-RAN flow identification: GPU DPDK + Mellanox FW to identify different RX queues for O-RAN packets based on header values
• cuBB for L1 PHY processing on the GPU (cuPHY)
• O-RAN: C-plane + U-plane

Result: O-DU (or 5G gNB) able to communicate with O-RAN compliant O-RU(s)
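To show what "identify RX queues based on header values" can mean, here is a software sketch in Python of one possible matching rule. In Aerial this steering is done by GPU DPDK together with the Mellanox firmware, not by host code; the sketch assumes untagged Ethernet frames carrying eCPRI (EtherType 0xAEFE), with eCPRI message type 0 for U-plane IQ data and type 2 for C-plane real-time control. The names `classify`, `rx_queue` and the queue table are hypothetical.

```python
import struct

ETHERTYPE_ECPRI = 0xAEFE
ECPRI_IQ_DATA = 0       # eCPRI message type 0: IQ data (O-RAN U-plane)
ECPRI_RT_CONTROL = 2    # eCPRI message type 2: real-time control (O-RAN C-plane)


def classify(frame: bytes):
    """Return (plane, eaxc_id) for an untagged eCPRI frame, or None for other traffic."""
    if len(frame) < 14 + 8:                      # Ethernet header + eCPRI header and IDs
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_ECPRI:
        return None
    # eCPRI: version/flags, message type, payload size, then flow ID and sequence ID
    _flags, msg_type, _size, eaxc_id, _seq = struct.unpack_from("!BBHHH", frame, 14)
    if msg_type == ECPRI_IQ_DATA:
        return ("U-plane", eaxc_id)
    if msg_type == ECPRI_RT_CONTROL:
        return ("C-plane", eaxc_id)
    return None


def rx_queue(frame: bytes, queue_of_flow: dict, default_queue: int = 0) -> int:
    """Pick the RX queue configured for this flow, falling back to a default queue."""
    return queue_of_flow.get(classify(frame), default_queue)


queues = {("U-plane", 1): 1, ("U-plane", 2): 2}   # hypothetical: one queue per antenna flow
```

Steering each flow to its own queue lets the U-plane payloads of different antenna streams land in separate GPU buffers without the CPU inspecting every packet.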

Uplink

Uplink procedure:

• O-DU sends configuration to O-RU (C-plane packets)
• O-RU replies with client’s data (U-plane)
• cuBB data placement: order U-plane PRBs into a single buffer for cuPHY
• O-DU L1 processing: PUSCH on the GPU
• O-DU forwards PUSCH output to upper layers
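The data placement step of the uplink procedure can be pictured with a small Python sketch: U-plane sections may arrive in any order, and each is copied to its slot position so that the PHY sees one contiguous buffer. This is an illustration only, not the cuBB implementation. It assumes uncompressed 16-bit IQ samples (48 bytes per PRB) and a symbol-major layout; `place_prbs` and the `(symbol, start_prb, payload)` tuple are hypothetical.

```python
PRB_BYTES = 12 * 2 * 2   # 12 subcarriers x (I, Q) x 16-bit samples, assuming no IQ compression


def place_prbs(sections, n_symbols: int, n_prbs: int) -> bytearray:
    """Copy U-plane sections, in any arrival order, into one symbol-major buffer."""
    buf = bytearray(n_symbols * n_prbs * PRB_BYTES)
    for symbol, start_prb, payload in sections:
        if len(payload) % PRB_BYTES:
            raise ValueError("payload is not a whole number of PRBs")
        num_prb = len(payload) // PRB_BYTES
        if symbol >= n_symbols or start_prb + num_prb > n_prbs:
            raise ValueError("section falls outside the slot buffer")
        offset = (symbol * n_prbs + start_prb) * PRB_BYTES
        buf[offset:offset + len(payload)] = payload   # same-length slice write, buffer size unchanged
    return buf


a, b, c = b"\xaa" * PRB_BYTES, b"\xbb" * PRB_BYTES, b"\xcc" * (2 * PRB_BYTES)
slot = place_prbs([(1, 0, c), (0, 1, b), (0, 0, a)], n_symbols=2, n_prbs=2)
```

Indexing by symbol and starting PRB makes the result independent of packet arrival order, which is what lets PUSCH processing start from a single ordered buffer.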

Downlink

Downlink procedure:

• O-DU sends configuration to O-RU (C-plane packets)
• O-DU sets configuration parameters to cuBB
• O-DU L1 processing on GPU: PDSCH, PDCCH, PBCH
• O-DU sends data (U-plane)
• O-RU receives the data

NVIDIA’s AERIAL Solution

AERIAL delivers the industry's highest-performance software-defined 5G vRAN.

• World’s First Fully SW Defined BBU
  • Industry can innovate at a faster pace.
• Highest Performing & Scalable Cloud-Native Architecture
  • Significant Capacity and Power Efficiency gains.
• Fastest PHY Processing
• Efficient COTS based Platform for Edge Cloud
  • Improves utilization.

[Figure labels: AERIAL 5G, cuBB, cuVNF]

Poll Question #2

Are you familiar with GPUDirect RDMA?

• Yes
• No

Thank You
