Join the Conversation #OpenPOWERSummit Introduction to the OpenCAPI Interface Brian Allison, STSM OpenCAPI Technology and Enablement



Topics

OpenCAPI Protocol Stack

OpenCAPI Reference Design Overview

OpenCAPI TLx Reference Code

OpenCAPI TLx-AFU Interface and Snippets

OpenCAPI Reference Design Cards

OpenCAPI Reference AFUs

OpenCAPI Performance

OpenCAPI Roadmap

Transaction Layer (TL) specifies the control and response packets between a host and an endpoint OpenCAPI device

TL: On the host side, the transaction layer converts:

Host-specific protocol requests into transaction-layer-defined commands (downbound)

TLx commands into host-specific protocol requests (upbound)

Responses to endpoint-initiated commands

Data Link layer supports a 25.78125 Gbps serial data rate per lane connecting a processor to an accelerator device: DL and DLx

TLx: On the endpoint OpenCAPI device, the transaction layer converts:

AFU protocol requests into transaction layer commands

TL commands into AFU protocol requests.

Responses to host initiated commands

OpenCAPI Protocol Stack

[Figure: OpenCAPI protocol stack. Host processor: host bus protocol layer, TL, TL Frame/Parser, DL, PHY. OpenCAPI device: PHYX, DLX, TLX Frame/Parser, TLX, AFU protocol layer, AFU. Interfaces labeled in the figure: host fabric bus, host bus interface, OpenCAPI packets, DL packet (format), DL packet, serial link, DLX packet, DLX packet (format), AFU packets, AFU protocol stack interface.]

[Figure: Host CPU, TL, DL, PHY, PHYX, DLX, TLX, AFU. config_write and config_read go to CFG; all other TL commands go to the AFU.]

OpenCAPI FPGA Reference Design Overview

Xilinx-based Verilog design for UltraScale+ FPGAs

Contains PHY configuration (PHYx), Transaction Layer (TLx), Data Link Layer (DLx) and Config Core (CFG)

25.78125 GT/s x8 link per PHYx/TLx/DLx/CFG using Xilinx GTY PHY

Tightly integrated with the DLx logic

400MHz @ 64B dataflow (P9 Nest runs 1600MHz @ 16B)

Vivado toolchain flow with tcl script project creation

Currently using internal github for project repository

Made available to NDA customers

Will be available to OpenCAPI consortium members via consortium github
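The clock and width figures on this slide imply matching internal bandwidths, which a few lines of arithmetic make explicit. This is a minimal sketch: the function names are illustrative, and the 64b/66b line-encoding factor is an assumption that the slide does not state.

```python
def dataflow_gbytes_per_s(clock_mhz: float, width_bytes: int) -> float:
    """Bytes moved per second by a datapath of the given width and clock, in GB/s."""
    return clock_mhz * 1e6 * width_bytes / 1e9


def link_raw_gbytes_per_s(gt_per_s: float, lanes: int) -> float:
    """Raw serial link rate in GB/s, before any line-encoding overhead."""
    return gt_per_s * lanes / 8


fpga = dataflow_gbytes_per_s(400, 64)     # FPGA reference design dataflow
nest = dataflow_gbytes_per_s(1600, 16)    # P9 Nest dataflow
raw = link_raw_gbytes_per_s(25.78125, 8)  # x8 link at 25.78125 GT/s

# Assumption: 64b/66b line encoding, which would leave 64/66 of the raw rate.
payload = raw * 64 / 66

print(f"FPGA dataflow {fpga} GB/s, P9 Nest {nest} GB/s, link payload {payload} GB/s")
```

Under these assumptions the 400 MHz, 64-byte FPGA datapath and the 1600 MHz, 16-byte Nest datapath carry the same bandwidth, slightly above what the x8 link can deliver.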


OpenCAPI TLx Reference Code

Config accesses hidden from AFU and sent directly to CFG Core by the TLx

Don’t want the AFU to have the complexity nor the ability to “brick” a card

TLx interfaces to AFU via low level Transaction Layer Protocol (think parallel interface(s))

Interface specification defined in the TLX 3.0 Reference Design Specification

TLx Parser receives and decodes host-initiated, TL-architected OpenCAPI packets

Separate Command & Response Interface for the separate Virtual Channels

• Can send 1 command per cycle on each interface to the AFU

Separate Data interface – Commands not presented to AFU until data on the link is received for that command

TLx Framer receives AFU commands and responses and packetizes them into efficient OpenCAPI TLx Architected packets to send to the host

Separate Command & Response Interface for separate Virtual Channels

Can receive 1 command per cycle on each interface
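The config-hiding rule on this slide (config accesses go to the CFG core, everything else reaches the AFU) can be pictured as a simple dispatch. The sketch below is a toy software model, not the reference RTL: the class name and the use of command-name strings are assumptions made for illustration, and real opcodes are defined in the OpenCAPI 3.0 TL Specification.

```python
from dataclasses import dataclass, field

# Command names used only for this sketch; the parser in the reference
# design works on opcodes, not strings.
CONFIG_COMMANDS = {"config_read", "config_write"}


@dataclass
class TlxParserModel:
    """Toy model of the routing rule: config accesses go to CFG, the rest to the AFU."""
    to_cfg: list = field(default_factory=list)
    to_afu: list = field(default_factory=list)

    def dispatch(self, command: str) -> str:
        if command in CONFIG_COMMANDS:
            self.to_cfg.append(command)   # hidden from the AFU
            return "CFG"
        self.to_afu.append(command)       # all other TL commands
        return "AFU"


parser = TlxParserModel()
routes = [parser.dispatch(c) for c in ("config_read", "pr_rd_mem", "config_write")]
print(routes)
```

Keeping the routing decision inside the TLx means an AFU never observes a config command, which is what prevents a faulty AFU from bricking the card.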


OpenCAPI TLx – AFU Interface

Individual TL packet contents are driven or received by the AFU

TLx only parses and packs the contents from/to the link packets into interface fields

No knowledge of location of fields within packets is necessary by the AFU

No knowledge of template usage is necessary by the AFU

TLx has no intelligent logic for architected sequences of flows –

• AFU must perform the proper sequences and follow the architecture

Credit based interface to the AFU
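As a rough illustration of what a credit-based interface implies, the sketch below models a sender that may only issue a command while it holds a credit and gets credits back as the receiver frees buffer space. The class, method names, and initial credit count are hypothetical; the actual credit signals are defined in the TLX 3.0 Reference Design Specification.

```python
class CreditedInterface:
    """Toy sender-side credit counter for one command interface."""

    def __init__(self, initial_credits: int):
        self.credits = initial_credits

    def try_send(self) -> bool:
        """Send one command if a credit is available; consume the credit."""
        if self.credits == 0:
            return False          # receiver has no free slot; sender must wait
        self.credits -= 1
        return True

    def return_credit(self, count: int = 1) -> None:
        """Receiver frees buffer slots and hands the credits back."""
        self.credits += count


iface = CreditedInterface(initial_credits=2)
sent = [iface.try_send() for _ in range(3)]   # third attempt finds no credit
iface.return_credit()
sent_after_return = iface.try_send()
```

The point of the scheme is that the sender never needs a backpressure signal in the same cycle: it simply stops when its local count reaches zero.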

Host to Accelerator Command Snippet

Signal Name | Bits | Source | Description
tlx_afu_cmd_valid | 1 | TLX | Command Valid. The remaining signals in this table are valid coincident with the assertion of tlx_afu_cmd_valid.
tlx_afu_cmd_opcode | 8 | TLX | Command Opcode. Note: see the OpenCAPI 3.0 TL Specification for valid opcodes.
tlx_afu_cmd_capptag | 16 | TLX | Unique handle specifying the host CAPP and command instance. Provided by the CAPP requesting command services of the TL.
tlx_afu_cmd_pa | 64 | TLX | Physical Address
tlx_afu_cmd_dl | 2 | TLX | Command Data Length. Encodings: 2'b00 Reserved; 2'b01 64 Bytes; 2'b10 128 Bytes; 2'b11 256 Bytes
tlx_afu_cmd_pl | 3 | TLX | Partial Length. Encodings: 3'b000 1 Byte; 3'b001 2 Bytes; 3'b010 4 Bytes; 3'b011 8 Bytes; 3'b100 16 Bytes; 3'b101 32 Bytes; 3'b110-3'b111 Reserved
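The data length and partial length encodings in the table above can be decoded with two lookup tables. This is a minimal sketch for reading traces or writing a software model; the dictionary and function names are illustrative and not part of the reference design.

```python
# Encodings copied from the table above; None marks a reserved encoding.
CMD_DL_BYTES = {0b00: None, 0b01: 64, 0b10: 128, 0b11: 256}
CMD_PL_BYTES = {0b000: 1, 0b001: 2, 0b010: 4, 0b011: 8,
                0b100: 16, 0b101: 32, 0b110: None, 0b111: None}


def decode_length(encoding: int, table: dict) -> int:
    """Return the size in bytes for an encoding, rejecting reserved values."""
    size = table.get(encoding)
    if size is None:
        raise ValueError(f"reserved or out-of-range encoding: {encoding:#b}")
    return size


full_line = decode_length(0b10, CMD_DL_BYTES)    # tlx_afu_cmd_dl value
partial = decode_length(0b101, CMD_PL_BYTES)     # tlx_afu_cmd_pl value
```

Note that the non-reserved partial length encodings are simply powers of two: encoding n maps to 2**n bytes.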

Host to AFU Data Snippet

Signal Name | Bits | Source | Description
tlx_afu_cmd_data_valid | 1 | TLX | Command Data Valid. Valid data is present.
tlx_afu_cmd_data_bus | 512 | TLX | Command Data Bus.
tlx_afu_cmd_data_bdi | 1 | TLX | Bad Data Indicator. If asserted, indicates the data received during the same cycle has an error and cannot be trusted.
afu_tlx_cmd_rd_req | 1 | AFU | AFU requests host command data known to be available.
afu_tlx_cmd_rd_cnt | 3 | AFU | AFU specifies the number of data packets it will accept. Encodings: 3'b000 512 Bytes; 3'b001 64 Bytes; 3'b010 128 Bytes; 3'b011 256 Bytes; 3'b100 192 Bytes; 3'b101 320 Bytes; 3'b110 384 Bytes; 3'b111 448 Bytes

Note: '001', '010', and '011' were set to match the data length encoding.
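Because the read-count encoding is not monotonic, a lookup table is the safest way to handle it in a software model. The sketch below converts an afu_tlx_cmd_rd_cnt value to a number of data bus transfers, assuming each data packet corresponds to one transfer on the 512-bit bus; the names are illustrative.

```python
# afu_tlx_cmd_rd_cnt encodings from the table above, in bytes.
RD_CNT_BYTES = {0b000: 512, 0b001: 64, 0b010: 128, 0b011: 256,
                0b100: 192, 0b101: 320, 0b110: 384, 0b111: 448}

BUS_BYTES = 512 // 8  # tlx_afu_cmd_data_bus is 512 bits wide, so 64 bytes per transfer


def rd_cnt_to_beats(encoding: int) -> int:
    """Number of 64-byte data bus transfers requested by a 3-bit rd_cnt value."""
    return RD_CNT_BYTES[encoding & 0b111] // BUS_BYTES


beats = {enc: rd_cnt_to_beats(enc) for enc in RD_CNT_BYTES}
```

Every count from one to eight transfers is reachable, with 3'b000 standing for the maximum of eight.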

AFU Initiated Command and Data Snippet

Signal Name | Bits | Source | Description
afu_tlx_cmd_valid | 1 | AFU | Indicates that a valid AP command has arrived from the AFU to the TLX. Any command field that pertains to the arriving opcode should contain valid information at this time. Other command fields are undefined and may contain garbage.
afu_tlx_cmd_opcode | 8 | AFU | AP Command Opcode (see TL Specification)
afu_tlx_cmd_actag | 12 | AFU | Address Context tag (see TL Specification)
afu_tlx_cmd_ea_or_obj | 68 | AFU | Effective Address/Object Handle (see TL Specification)
afu_tlx_cmd_afutag | 16 | AFU | AFU Tag (see TL Specification)
afu_tlx_cmd_be | 64 | AFU | Byte enable (see TL Specification)
afu_tlx_cmd_bdf | 16 | AFU | Bus Device Function (see TL Specification)
afu_tlx_cmd_pasid | 20 | AFU | User process ID (see TL Specification)
afu_tlx_cdata_valid | 1 | AFU | AP Command Data Valid. Indicates that a valid packet of command immediate data has arrived at the TLX. The data bus and the bdi bit contain valid information.
afu_tlx_cdata_bus | 512 | AFU | AP Command Data Bus.
afu_tlx_cdata_bdi | 1 | AFU | Bad Data Indicator. Indicates that the AP command data packet is bad.
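When modeling AFU-initiated commands in software, the bit widths in the table above are worth checking before values are driven onto the interface. The sketch below does only that; the function name and the field values are placeholders for illustration, not real opcodes or tags.

```python
# Field widths in bits, taken from the table above.
AFU_CMD_WIDTHS = {"opcode": 8, "actag": 12, "ea_or_obj": 68, "afutag": 16,
                  "be": 64, "bdf": 16, "pasid": 20}


def check_afu_cmd(**fields: int) -> dict:
    """Validate that each supplied field value fits its signal width."""
    for name, value in fields.items():
        width = AFU_CMD_WIDTHS[name]          # KeyError for unknown fields
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value:#x} does not fit in {width} bits")
    return fields


# Placeholder values chosen only to exercise the width check.
cmd = check_afu_cmd(opcode=0x10, actag=0x001, ea_or_obj=0x1000, afutag=0x0042)
```

Fields that do not pertain to the opcode are simply left out here, mirroring the note that such fields are undefined on the interface.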

OpenCAPI Reference Design Cards

Initial work done on Xilinx VU3P FPGA with Alpha Data 9V3 card

Currently using Vivado 2018.2, but floorplan snapshot below is from 2017.1

Images also created and tested on KU15P FPGA (Mellanox Innova-2)

Work is ongoing with Xilinx ZU19P FPGA

Next generation images to be created on

Nallatech 250SOC

Alpha Data 9H7 (VU37P) and 9H3 (VU33P)

VU3P Resources | CLB Flip-Flops | LUT as Logic | LUT Memory | Block RAM Tile
DLx | 9392/788160 (1.19%) | 19026/394080 (4.82%) | 0/197280 (0%) | 7.5/720 (1.0%)
TLx | 13806/788160 (1.75%) | 8463/394080 (2.14%) | 2156/197280 (1.09%) | 0/720 (0%)
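The two rows of the resource table can be added to estimate the combined DLx plus TLx footprint on the VU3P. This is plain arithmetic on the figures above; the dictionary keys are illustrative, and the sum ignores any other logic (PHY, CFG, AFU) in a full image.

```python
# (used, available) per resource, copied from the table above.
DLX = {"ff": (9392, 788160), "lut_logic": (19026, 394080),
       "lut_mem": (0, 197280), "bram": (7.5, 720)}
TLX = {"ff": (13806, 788160), "lut_logic": (8463, 394080),
       "lut_mem": (2156, 197280), "bram": (0, 720)}


def combined_percent(resource: str) -> float:
    """Percentage of the device used by DLx and TLx together for one resource."""
    used = DLX[resource][0] + TLX[resource][0]
    available = DLX[resource][1]
    return 100.0 * used / available


summary = {name: round(combined_percent(name), 2) for name in DLX}
print(summary)
```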

OpenCAPI 3.0 Reference AFUs

MemCopy

The MemCopy example is a data mover from source address -> destination address using Virtual Addressing and includes these features

• Work queue for each context which can be configured to do copy commands, interrupts, translation touch, wake host thread (all command types for host validation)

• Configuration and MMIO Register Space

• acTag Table used for Bus/Device/Function and Process ID identification

• 512 processes/contexts and configurable up to 32 engines supporting up to 2K transfers using 64B, 128B, or 256B operations

Memory Home Agent (LPC)

The Memory Home Agent example implements memory off the endpoint OpenCAPI accelerator to act as a coherent extension to the host processor memory

The Memory Home Agent example includes these features

• Configuration and MMIO Register Space

• Individual and pipelined operation for memory loads and stores

• Interrupts, with error details reported to software through MMIO registers

• Sparse Address Mapping feature to extend 1 MB of real space to 4 TB of address space

OpenCAPI 3.0 Reference AFUs

AFP

Main performance AFU

Single process programmed to do streaming reads, streaming writes or a mix

Data is not checked – purely for bandwidth and latency testing

Interrupt and Wake Host Thread latency counters

Ping-Pong latency test added (MMIO to AFP->DMA store to memory)

CAPI and OpenCAPI Performance

Test platform: Power 8/9 CPU, Xilinx KU60/VU3P FPGA

Operation | CAPI 1.0 | CAPI 2.0 | OpenCAPI 3.0
Link | PCIe Gen3 x8 | PCIe Gen4 x8 | 25 Gb/s x8
Measured bandwidth at | 8 Gb/s | 16 Gb/s | 25 Gb/s
128B DMA Read | 3.81 GB/s | 12.57 GB/s | 22.1 GB/s
128B DMA Write | 4.16 GB/s | 11.85 GB/s | 21.6 GB/s
256B DMA Read | N/A | 13.94 GB/s | 22.1 GB/s
256B DMA Write | N/A | 14.04 GB/s | 22.0 GB/s
Notes | First introduction in 2013 | 2nd generation | Open architecture with a clean slate focused on bandwidth and latency
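One way to compare the three generations is to divide each measured DMA read bandwidth by the nominal x8 link rate. The sketch below does this with the figures from the performance slide; it deliberately ignores line-encoding and protocol overhead, so the ratios are rough indicators rather than exact link efficiencies, and all names are illustrative.

```python
LANES = 8

# Nominal per-lane rate (Gb/s) and best measured DMA read bandwidth (GB/s),
# both taken from the performance figures above.
LINKS = {
    "CAPI 1.0":     {"gbps_per_lane": 8,  "read_gbytes": 3.81},
    "CAPI 2.0":     {"gbps_per_lane": 16, "read_gbytes": 13.94},
    "OpenCAPI 3.0": {"gbps_per_lane": 25, "read_gbytes": 22.1},
}


def nominal_gbytes_per_s(gbps_per_lane: float, lanes: int = LANES) -> float:
    """Nominal link bandwidth in GB/s, ignoring encoding and protocol overhead."""
    return gbps_per_lane * lanes / 8


efficiency = {
    name: link["read_gbytes"] / nominal_gbytes_per_s(link["gbps_per_lane"])
    for name, link in LINKS.items()
}

for name, value in efficiency.items():
    print(f"{name}: {value:.1%} of nominal link rate")
```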

Latency Ping-Pong Test

• Simple workload created to simulate communication between system and attached FPGA

• Bus traffic recorded with protocol analyzer and PowerBus traces

• Response times and statistics calculated

OpenCAPI Link (host side: TL, DL, PHY; FPGA side: TLx, DLx, PHYx)

Host Code

1. Copy 512B from cache to FPGA

2. Poll on incoming 128B cache injection

3. Reset poll location

4. Repeat

FPGA Code

1. Poll on 512B received from host

2. Reset poll location

3. DMA write 128B for cache injection

4. Repeat

PCIe Link (host side: PCIe Stack; FPGA side: FPGA PCIe HIP*)

Host Code

1. Copy 512B from cache to FPGA

2. Poll on incoming 128B cache injection

3. Reset poll location

4. Repeat

FPGA Code

1. Poll on 512B received from host

2. Reset poll location

3. DMA write 128B for cache injection

4. Repeat

* HIP refers to hardened IP
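The host and FPGA loops listed for the ping-pong test can be mimicked with a single-threaded toy model, which makes the ordering of the steps explicit. This is only a sketch of the control flow: the class and its mailbox attributes are hypothetical, nothing here touches real hardware, and no timing is measured.

```python
class PingPongModel:
    """Single-threaded toy model of the host/FPGA ping-pong loop."""

    def __init__(self):
        self.fpga_mailbox = None   # location the FPGA polls for the 512B write
        self.host_mailbox = None   # location the host polls for the 128B injection
        self.round_trips = 0

    def host_send(self):
        self.fpga_mailbox = bytes(512)          # host step 1: copy 512B to the FPGA

    def fpga_step(self) -> bool:
        if self.fpga_mailbox is None:           # FPGA step 1: poll on 512B from host
            return False
        self.fpga_mailbox = None                # FPGA step 2: reset poll location
        self.host_mailbox = bytes(128)          # FPGA step 3: DMA write 128B to the host
        return True

    def host_poll(self) -> bool:
        if self.host_mailbox is None:           # host step 2: poll on incoming 128B
            return False
        self.host_mailbox = None                # host step 3: reset poll location
        self.round_trips += 1
        return True


model = PingPongModel()
for _ in range(4):                              # step 4 on both sides: repeat
    model.host_send()
    assert model.fpga_step()
    assert model.host_poll()
```

In the real test each round trip is what the protocol analyzer and PowerBus traces time; the model only shows why exactly one 512B write and one 128B write occur per iteration.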

Latency Test Results

Link | Host | FPGA | Host-side latency | Jitter | FPGA-side stack | Total latency
OpenCAPI (host stack: TL, DL, PHY) | P9 OpenCAPI, 3.9GHz Core, 2.4GHz Nest | Xilinx FPGA VU3P | 298ns‡ | 2ns | TLx, DLx, PHYx (80ns‖) | 378ns†
PCIe G4 (host stack: PCIe Stack) | P9 PCIe Gen4 | Xilinx FPGA VU3P | est. <337ns | not given | Xilinx PCIe HIP (218ns¶) | est. <555ns§
PCIe G3 (host stack: PCIe Stack) | P9 PCIe Gen3, 3.9GHz Core, 2.4GHz Nest | Altera FPGA Stratix V | 337ns | 7ns | Altera PCIe HIP (400ns¶) | 737ns§
PCIe G3 (host stack: PCIe Stack) | Kaby Lake PCIe Gen3*, 3.9GHz Core, 2.4GHz Nest | Altera FPGA Stratix V | 376ns | 31ns | Altera PCIe HIP (400ns¶) | 776ns§

* Intel Core i7 7700 Quad-Core 3.6GHz (4.2GHz TurboBoost)

† Derived from round-trip time minus simulated FPGA app time

‡ Derived from round-trip time minus simulated FPGA app time and simulated FPGA TLx/DLx/PHYx time

§ Derived from measured CPU turnaround time plus vendor provided HIP latency

‖ Derived from simulation

¶ Vendor provided latency statistic
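Each total latency in the results is the sum of a host-side figure and an FPGA-side stack figure, as the footnotes describe. The short sketch below recomputes the totals and expresses each as a multiple of the OpenCAPI result; the dictionary labels are shorthand introduced here.

```python
OPENCAPI = "OpenCAPI (P9, VU3P)"

# (host-side latency, FPGA-side stack latency) in ns, from the results above.
# The PCIe Gen4 host-side figure is an upper-bound estimate in the source.
RESULTS = {
    OPENCAPI:                           (298, 80),
    "PCIe Gen4 (P9, VU3P)":             (337, 218),
    "PCIe Gen3 (P9, Stratix V)":        (337, 400),
    "PCIe Gen3 (Kaby Lake, Stratix V)": (376, 400),
}

totals = {name: host + fpga for name, (host, fpga) in RESULTS.items()}
ratio_vs_opencapi = {name: total / totals[OPENCAPI] for name, total in totals.items()}

for name, total in totals.items():
    print(f"{name}: {total} ns ({ratio_vs_opencapi[name]:.2f}x OpenCAPI)")
```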

Roadmap - OpenCAPI 4.0 (P9’/Axone)

Adds posted DMA Store operations with Address Translation Cache

New AFU validation/reference design needs

Address Translation Cache in MemCopy (in development)

Storage Class Memory Development

Technology previews being developed on OpenCAPI 3.0

Table of Enablement Deliveries

Item | Delivery Name | Where to Obtain | Available When
OpenCAPI 3.0 TLx and DLx Reference Xilinx FPGA Designs (RTL and Specifications) | <snapshot>.tar.gz | Enablement WG | Today
Xilinx Vivado Project Build with Memcopy Exerciser | Vivado Project Flow | Enablement WG | Today
Device Discovery and Configuration Specification and RTL | OpenCAPI 3.0 Configuration Sub-System Reference Design Specification | Enablement WG Causeway | Today
AFU Interface Specification | TLX 3.0 Reference Design.pdf | Enablement WG Causeway | Today
25Gbps PHY Signal Specification | OC PHY 25G Specification | PHY Signalling WG Causeway | Today
25Gbps PHY Mechanical Specification | 25Gbps Interface Mechanical Spec | PHY Mechanical WG Causeway | Today
OpenCAPI Simulation Environment (OCSE) | ocse-<version>.tar.gz; OpenCAPIDemokit.pdf | Enablement WG | Today; Today
Memcopy and Memory Home Agent Exercisers | MCP3 and LPC <snapshot>.tar.gz | Enablement WG | Today
Reference Driver Available | LIBOCXL | Ubuntu 18.04; GitHub | Today