Transcript
Page 1: Driver Scalability Davis Walker Principal Development Lead Windows Kernel dwalker@microsoft.com
Page 2:

Driver Scalability

Davis Walker
Principal Development Lead
Windows Kernel
dwalker@microsoft.com

Page 3:

Agenda

• Locality for Driver Scalability
• Case Studies
  • Storport NUMA I/O
  • Load Balancing and NDIS Receive-Side Scaling (RSS)
• Analysis Tools
• Complementary Windows Support
  • NUMA and locality APIs
  • Processor Groups

Page 4:

Overview

• Systems are getting larger and more parallel
• To scale, devices must increase throughput with system size
• This requires efficient use of processors, memory, and I/O links
• Measurable performance differences are available to drivers that focus on scalability techniques
• This talk is targeted at driver writers who need to push a lot of data
  • Server storage
  • High-speed networking
  • And so forth

Page 5:

Hardware Evolution

• Industry is moving toward distributed processing complexes
  • CPUs share internal caches between cores
  • CPU and memory form a “node”, connected via links to other nodes
  • I/O is sometimes associated with a node, sometimes not
• Accessing “close” hardware vs. “far” hardware is measurably faster
  • “Close” operations are vastly cheaper
    • Cache lookups vs. memory lookups
    • Local-node memory lookups vs. remote-node memory lookups
  • Links between units have fixed throughput and can become saturated

Page 6:

NUMA Nodes

[Diagram: two NUMA nodes joined by a node interconnect. One node holds P1 (Cache1) and P2 (Cache2) behind shared node cache(s), with MemA and DiskB attached; the other holds P3 (Cache3) and P4 (Cache4), with MemB and DiskA attached.]

Page 7:

Drivers Can Optimize for This Environment

• Optimize for end-to-end processing on “close” hardware
  • If data travels long distances, scalability is impacted
• Define discrete blocks of work that can be processed in parallel
  • Understand system topology
  • Understand the software components involved in I/O processing
  • Process these blocks in distinct sections of hardware
  • Minimize interaction between processing blocks
• Locality is important for:
  • DMA buffers
  • Buffer copies for data transformation
  • Processors that execute the various phases of driver processing

Page 8:

Case Study 1: Storport NUMA I/O

• Changes in Windows Server 2008 that improve the data locality of storage transactions
• Includes:
  • Per-I/O interrupt redirection (with appropriate hardware support)
  • NUMA-aware control structure allocation
  • I/O concurrency improvements through lock breakup
• Target: ~10% reduction in CPU utilization on I/O-intensive workloads
• Used as an example here
  • Pointer to implementation details on the Resources slide
  • Different I/O models have different data flows; this case study is designed to demonstrate the analysis and thought process

Page 9:

Storage I/O Sequence of Steps

• Application issues a file read or write
• Ends up as a storage adapter DMA transaction
  • Potentially into the application buffer, more likely into an intermediate buffer
• Interrupt is generated after the I/O completes
  • Interrupt source is cleared in the device
  • Control structures associated with the request are updated
  • DPC is queued to process the data
• DPC performs the bulk of the processing
  • Flushes DMA adapters (potentially copying data into the user buffer)
  • Potentially performs data transformation
  • Each component (adapter, file systems, etc.) processes its control structures
• Application thread is notified of completion

Page 10:

NUMA I/O Concept

• Process requests vertically on the processor initiating the request
  • ISR/DPC happen on the initiating processor
• Optimizes for:
  • Buffers and control structures allocated on the local node
  • Control structures accessed both at initiate and complete time
• Tries to avoid:
  • Saturating links by pulling data across boundaries
  • Interrupting processing that occurs in parallel to this request

Page 11:

Windows Server 2003 Disk Read

[Diagram: the NUMA-node topology from Page 6, annotated with the steps below. P3 is the I/O initiator; the ISR and DPC run on P2; processors are locked out for I/O initiation while a request is being issued.]

0. DiskB statically affinitized to P2 when initialized (random)
1. P3 selects buffer: MemA
2. P3 starts I/O: fill buffer from DiskB
3. DiskB DMA triggers invalidate(s)
4. Buffer written to MemA (or node cache)
5. HW interrupt and ISR: DiskB → P2
6. P2 executes DPC (by default)
   • Completion processing accesses control state in Cache3
   • Data may be pulled into Cache2
7. Originating thread alerted (APC or synch I/O): P2 → P3
   • May require an inter-processor interrupt
8. Data must be in Cache3 to use

Page 12:

Windows Server 2008 Disk Read (NUMA I/O HBA)

[Diagram: the same topology, annotated with the steps below; the ISR and DPC now run on the initiating processor’s node.]

Performance optimizations:
1. Concurrent I/O initiation
2. Interrupt the I/O-initiating processor
3. Execute the DPC on the I/O-initiating processor

Steps:
1. P3 selects buffer: MemB
2. P3 starts I/O: fill buffer from DiskB
3. Buffer written to MemB (or node cache)
4. HW interrupt and ISR: DiskB → P3 or P4
5. P3 executes DPC
   • Control state is hot in Cache3
6. Originating thread likely running locally
7. Data exists in Cache3 for application use

Page 13:

How NUMA I/O Works: Interrupt Processing

• MSI-X hardware has one message pointed at every node (at minimum)
• Device supports sending interrupts to the requesting processor
• Beneficial cache effects
  • XRB/SRB are accessed by the miniport on initiation and in the ISR
    • Cache residency is generally very good
  • If thread-initiated asynch I/O, chances are good that the thread is still running
    • Less chance of cache pollution of unrelated threads
• Means that the DPC will also occur on the initiating processor

Page 14:

How NUMA I/O Works: DPC Processing

• DPC typically contains the bulk of I/O completion work
  • Control structure and queue processing by a stack of drivers
  • Data transformation for compressed files, anti-virus, etc.
  • Double-buffering if the device cannot DMA to all of memory
• Control processing
  • Tends to be symmetric between initiation and completion
  • Cache considerations: same memory touched on initiation/completion
  • Node considerations: drivers may have NUMA-aware buffer allocation
  • Balancing considerations: components may keep per-node lists
    • Random completions can cause “sloshing”
• Data transformation
  • Not typically part of the high-end I/O flow, but something to keep in mind
  • Copies across a link increase latency and impact performance of the whole system

Page 15:

Case Study 2: NDIS Receive-Side Scaling (RSS)

• Reacting to application locality doesn’t work in all cases
  • Inbound network traffic
    • A data queue might contain data for more than one application
    • Data transformations might occur before the application is known
• In this case scalability can be improved by load balancing
  • Split work into blocks
  • Process the blocks independently
• Locality still matters
  • Ensure that processing of one block does not interfere with another
  • Interrupt, DPC, and worker processing can still happen vertically

Page 16:

How RSS Works: Multiple Queues

• Hardware divides received packets across multiple receive queues
  • Each receive queue collects packets for one or more TCP connections
  • Target queue selection ensures in-order processing of each connection
• Packet processing occurs vertically
  • Each receive queue interrupts a different processor using MSI-X
  • Most network processing is done in a DPC on the interrupting processor
• RSS processor selection parameters
  • Speed of the NIC
  • The processor topology of the system
  • The “distance” of the NIC to NUMA nodes (if known)
  • Administrative settings

Page 17:

How RSS Works: Scalability Improvements

• Cache benefits result from accessing:
  • The same data structures repeatedly on the same processors
  • The packet header by multiple layers of the stack on the same processor
• Partitioning benefits
  • TCP connection structures are partitioned to match the queues
• Processor load is monitored at run time
  • Adjustments are made to avoid over-utilization of a processor
    • TCP connections are moved across queues
    • Processors are changed by re-targeting MSI-X vectors

Page 18:

Locality Counter-Example: Power Management

• When a driver enforces locality it can limit flexibility
  • E.g., the OS’s ability to migrate work for system efficiency
• Consider power management
  • Work consolidation might be best on a lightly loaded system
    • Use a small set of hardware; allow the rest to idle to sleep
    • This may provide the best throughput-per-watt and power scalability
  • Aggressive load balancing by the driver can limit the OS
• A balance between these considerations is required

Page 19:

The Balance

• Aligning chunks of work to NUMA nodes provides a good balance
  • Provides OS flexibility within a node, but keeps work partitioned
  • A good, static default
    • No resource-management facilities exist for migrating work based on system conditions
    • Load-monitoring systems like NDIS’s are very complicated
• Or, react to the apps/drivers sending you requests, so that you move gracefully with system changes around you, as in NUMA I/O

Page 20:

Windows Performance Toolkit—Performance Analyzer

• Based on Event Tracing for Windows (ETW) instrumentation
  • Instrumentation built into the retail Windows operating system
  • Provides coverage of kernel-level activity
• Controls, decodes, and displays kernel ETW events:
  • process lifetime, thread lifetime, image lifetime, sample profile, context switch, DPC, ISR, driver delay
  • disk I/O, file I/O, registry, hard fault, page fault, virtual allocation, heap, TCP/UDP

Page 21:

Scalability Analysis Using Performance Analyzer

• Can do scalability analysis on a per-driver basis
  • Total DPC/ISR time per processor
  • Wait classification (new in Windows 7), showing:
    • Which functions in the driver cause the calling threads to sleep
    • Who wakes up sleeping threads
  • IRP tracing
  • And much more
• The tool can be downloaded; location on the Resources slide
  • MSDN has comprehensive documentation

Page 22:

Performance Analyzer Screen Shot—DPC/ISR Time

Page 23:

Resource Monitor—Per-Node CPU Activity

Page 24:

Windows 7 Improvements: NUMA and Topology

• Understanding topology is necessary for locality optimizations
• User-mode NUMA APIs have existed for some time
• Symmetric APIs coming for kernel mode
  • KeQueryLogicalProcessorRelationship
    • Cache, package, node relationships
    • “Distances” between nodes
  • Current node information
    • KeQueryNodeActiveAffinity
    • KeGetCurrentNodeNumber
  • Device proximity
    • IoGetDeviceNumaNode

Page 25:

Windows 7 Improvements: Processor Groups

• KAFFINITY limits a system to 64 processors
  • KAFFINITY is a bitmap: easy and useful to manipulate
  • But an increasingly problematic limitation
• Extensibility through processor “groups” of up to 64 processors
  • Groups are static, defined by keeping NUMA nodes together
  • Within a group, traditional processor-management techniques apply
• Allows support for arbitrary numbers of processors; a given Windows release will support what can be validated

Page 26:

Processor Groups: Locality Leads to Scalability

• The philosophy: to scale to hundreds of processors, work must be partitioned with topology in mind
• Most work happens within a single group
  • Threads are scheduled within a group unless migrated explicitly
  • Processes run in a single group by default
  • A single interrupt can only be attached to processors within a group
• Other groups can be used intentionally
  • A process can migrate threads between groups explicitly
  • MSI-X messages can each be attached to an independent group

Page 27:

Processor Group APIs

• Any kernel routine using processor numbers or affinity is changing
• New representations:
  • GROUP_AFFINITY: group number + KAFFINITY
  • PROCESSOR_NUMBER: group number + group-relative number
• Example APIs that are changing:
  • KeGetCurrentProcessorNumber
  • KeQueryActiveProcessorCount
  • KeSetSystemAffinityThreadEx
  • KeSetTargetProcessorDpc
• New mechanisms to extend drivers past the first group
  • Registry key to indicate support for interrupts past the first group
• Complete details available in the white paper

Page 28:

Processor Groups: Do You Need to Care?

• Use of per-processor data structures?
  • Possible functional issues, discussed on the next slide
• High-performance device?
  • Multiple MSI-X messages to be spread across the system
  • System-wide load-balancing requirements
  • Likely to be limited to the first group without changes
• System management utility?
  • Likely not to comprehend all processors in the system without changes

Page 29:

Processor Groups: Compatibility Risk

• Drivers run in the context of many apps
  • Must accept requests across all processors
  • Unmodified APIs return per-group information
• Example issue:
  • Per-processor data structure accessed in a DPC
  • Because it is per-processor and at DISPATCH_LEVEL, it is self-synchronized
  • Two processors in separate groups now map to the same data structure
  • Possible list corruption, etc.
• Drivers expecting to run on very large systems will need to be updated, and will need to test in this environment
  • Multi-group testing is available on any MP machine

Page 30:

Related Sessions

Session                                                                | Day / Time
Implementing Efficient RSS Capable Hardware and Drivers for Windows 7  | Tues. 1:30-2:30
Network Power Management in Windows 7                                  | Tues. 5:15-6:15
Storport Drivers from the Ground Up                                    | Tues. 8:30-9:30 and Wed. 9:45-10:45
Driver Scalability                                                     | Tues. 11-12

Page 31:

Resources

• Windows Performance Tools Kit
  http://www.microsoft.com/whdc/system/sysperf/perftools.mspx
• NUMA I/O presentation (WinHEC 2007)
  http://download.microsoft.com/download/a/f/d/afdfd50d-6eb9-425e-84e1-b4085a80e34e/SVR-T332_WH07.pptx
• “Scalable Networking with RSS” on the WHDC Web site
  http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx
