25
On Demand Paging for User-level Networking Liran Liss Mellanox Technologies

Liran Liss Mellanox Technologies

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Liran Liss Mellanox Technologies

On Demand Paging for User-level Networking Liran Liss Mellanox Technologies

Page 2: Liran Liss Mellanox Technologies

Agenda

• Memory registration • RDMA programming challenges • On Demand Paging (ODP) • Page pre-fetching • Initial testing • Future work • Conclusions

© 2013 Mellanox Technologies 2

Page 3: Liran Liss Mellanox Technologies

Memory Registration

• Apps register Memory Regions (MRs) for IO – Referenced memory

must be part of process address space at registration time

– Memory key returned to identify the MR

• Registration operation – Pins down the MR – Hands off the virtual to

physical mapping to HW

© 2013 Mellanox Technologies 3

Application U

ser-space driver stack

HW

MR SQ

CQ

Key

RQ

Kernel driver stack

Page 4: Liran Liss Mellanox Technologies

Memory Registration – continued • Fast path

– Applications post IO operations directly to HCA

– HCA accesses memory using the translations referenced by the memory key

© 2013 Mellanox Technologies 4

Application U

ser-space driver stack

HW

MR SQ

CQ

RQ

Kernel driver stack

Key

Wow !!! But…

Page 5: Liran Liss Mellanox Technologies

Challenges

• Size of registered memory must fit physical memory • Applications must have memory locking privileges • Continuously synchronizing the translation tables

between the address space and the HCA is hard – Address space changes (malloc, mmap, stack) – NUMA migration – fork()

• Registration is a costly operation – No locality of reference

© 2013 Mellanox Technologies 5

Page 6: Liran Liss Mellanox Technologies

Achieving High Performance

• Requires careful design • Dynamic registration

– Naïve approach induces significant overheads – Pin-down cache logic is complex and not complete

• Pinned bounce buffers – Application level memory management – Copying adds overhead

© 2013 Mellanox Technologies 6

Page 7: Liran Liss Mellanox Technologies

On Demand Paging

• MR pages are never pinned by the OS – Paged in when HCA needs them – Paged out when reclaimed by the OS

• HCA translation tables may contain non-present pages – Initially, a new MR is created with non-present pages – Virtual memory mappings don’t necessarily exist

© 2013 Mellanox Technologies 7

Page 8: Liran Liss Mellanox Technologies

Semantics

• ODP memory registration – Specify IBV_ACCESS_ON_DEMAND access flag

• Work request processing – WQEs in HW ownership must reference mapped memory

• From Post_Send()/Recv() until PollCQ() – RDMA operations must target mapped memory – Access attempts to unmapped memory trigger an error

• Transport – RC semantics unchanged – UD responder drops packets while page faults are resolved

• Standard semantics cannot be achieved unless wire is back-pressured

© 2013 Mellanox Technologies 8

Page 9: Liran Liss Mellanox Technologies

Advantages

• Simplified programming – MPI rendezvous without

dynamic registrations – No dedicated buffer pools to

manage

• Practically unlimited memory registrations – No special privileges are

required

• Physical memory optimized to hold working set

© 2013 Mellanox Technologies 9

SendSomething() { char buf[SIZE]; WQE wqe; ... FillBuf(buf); wqe.sge[0].addr = buf; wqe.sge[0].length = SIZE; wqe.sge[0].lkey = STACK_KEY; ... Post_Send(wqe); while (!PollCQ()); }

Page 10: Liran Liss Mellanox Technologies

Design

• Kernel only – Transparent to applications

• Generic code (ib_core) tasks – Manage page invalidations

• Register for MMU notifier calls • Provide context for invalidations • Locate intersection between page invalidations and

MRs

– Support page faults • Synchronize between invalidations and page faults • Page-in user pages and map to dma

© 2013 Mellanox Technologies 10

Page 11: Liran Liss Mellanox Technologies

Design – continued

• Driver code (mlx4_core/ib) tasks – Process page faults

• Catch and classify HW page faults • Provide context for page faults

– Per-QP work_struct for requester/responder

– Handle HW page invalidations

© 2013 Mellanox Technologies 11

Page 12: Liran Liss Mellanox Technologies

Data Structures

© 2013 Mellanox Technologies 12

ib_core

mlx4_core / mlx4_ib

Key MR tree

Per user virt_addrumem

interval tree

umem Page list DMA list

mlx4_ib_mr Translation

table QP Requestor ctx Responder ctx

mmu_notifier handlers

MR

HCA

Page 13: Liran Liss Mellanox Technologies

Page-in Flow

© 2013 Mellanox Technologies 13

ib_core

mlx4_core / mlx4_ib

Key MR tree

Per user virt_addrumem

interval tree

umem Page list DMA list

mlx4_ib_mr Translation

table QP Requestor ctx Responder ctx

mmu_notifier handlers

MR

HCA 1. Page

fault event

2. Look up MR

3. Request pages

4. Get pages + map to

DMA

5. Update HW mappings 6. Resume

QP

Page 14: Liran Liss Mellanox Technologies

Invalidation Flow

© 2013 Mellanox Technologies 14

ib_core

mlx4_core / mlx4_ib

Key MR tree

Per user virt_addrumem

interval tree

umem Page list DMA list

mlx4_ib_mr Translation

table QP Requestor ctx Responder ctx

mmu_notifier handlers

MR

HCA

1. Page invalidation

2. Look up intersecting

MRs 5. Acknowledge

invalidation

3. Request invalidation

4. Flush HW caches

6. Unmap DMA and return

Page 15: Liran Liss Mellanox Technologies

Page Pre-fetching

• New Verb for pre-fetching pages • Uses

– Warming up new memory mappings – MPI rendezvous optimization – UD responder optimization

© 2013 Mellanox Technologies 15

Page 16: Liran Liss Mellanox Technologies

Initial Testing

• ODP support – Implemented all RC transport flows for IB and RoCE

• Excluding SRQ and memory windows – UD over IB and RoCE – Raw Ethernet QP

• Inter-operability – Latency of non-ODP applications running concurrently hardly

affected – Mixed requestors/responders also work well

• Native performance for memory-resident ODP pages • Page-in performance

– 4K page fault takes approximately 135us – 4M page fault takes approximately 1ms

© 2013 Mellanox Technologies 16

Page 17: Liran Liss Mellanox Technologies

Execution Time Breakdown (Send Requestor)

© 2013 Mellanox Technologies 17

2%

36%

4% 5% 16%

37%

0%

Schedule in

WQE read

Get User Pages

PTE update

TLB flush

QP resume

Misc

0%

7%

83%

1%

2%

7%

0%

Schedule in

WQE read

Get User Pages

PTE update

TBL flush

QP resume

Misc

4K Page fault (135us total time) 4M Page fault (1ms total time)

Page 18: Liran Liss Mellanox Technologies

Future work: Huge MR Support • Support MRs in the size of TBs • Implicit ODP

– Register complete application address space up-front – Effectively eliminate memory registration

• Meta-data size must be a function of currently mapped memory instead of MR size – Applies to all data structures (IB core, driver, and HW)

• Memory Windows (MWs) become the main vehicle for controlling access rights

© 2013 Mellanox Technologies 18

Page 19: Liran Liss Mellanox Technologies

Future Work: Improve OS integration • Update PTE accessed/dirty bits according to IO

accesses • Page invalidation batching

– Page eviction in the swapper – NUMA migration process

• Extend ODP to guest physicalmachine translations for virtualization

© 2013 Mellanox Technologies 19

Page 20: Liran Liss Mellanox Technologies

Conclusions

• RDMA performance is great – But requires careful design

• ODP simplifies RDMA programming and deployment – Moves memory management to the OS – Lifts memory-pinning limits

• ODP does not sacrifice performance or interoperability

• ODP eliminates memory registrations!!! – Coming up soon

© 2013 Mellanox Technologies 20

Page 21: Liran Liss Mellanox Technologies

Thank You

© 2013 Mellanox Technologies

Page 22: Liran Liss Mellanox Technologies

Backup

© 2013 Mellanox Technologies

Page 23: Liran Liss Mellanox Technologies

Concurrent Page Faults

• Each QP has at most 2 concurrent page faults – Requestor – Responder

• Faulting QP temporarily suspended until fault is resolved by SW – Even if another QP satisfies the fault in the meantime – Required for correct completion semantics

© 2013 Mellanox Technologies 23

Page 24: Liran Liss Mellanox Technologies

Page-in / Invalidation Races

• Invalidations may race with page faults – HW will complete all in-flight memory accesses to an

invalidated range before completing the invalidation – New accesses will trigger a page fault normally

• Page-in requests are not serviced while handling mmu_notifier invalidations – QP is resumed without updating the page tables – HW will retry access optionally triggering another

page fault – Simplifies the code considerably

© 2013 Mellanox Technologies 24

Page 25: Liran Liss Mellanox Technologies

Forward Progress

• Challenge – Single MTU-sized packet may refer to multiple S/G entries

in WQEs – Single RDMA-W transaction may span multiple pages

• Forward progress generally not guaranteed – Pages are not pinned inherent race with page

invalidation – Not any different than CPU accesses

• Alleviate by paging-in multiple pages at once – Read multiple SGEs in WQE page faults – Pre-fetch large consecutive ranges in RDMA faults

• Not an issue in practice

© 2013 Mellanox Technologies 25