Driver Scalability
Davis Walker
Principal Development Lead
Windows [email protected]
Agenda
• Locality for Driver Scalability
• Case Studies
  • Storport NUMA I/O
  • Load Balancing and NDIS Receive-Side Scaling (RSS)
• Analysis Tools
• Complementary Windows Support
  • NUMA and locality APIs
  • Processor Groups
Overview
• Systems are getting larger and more parallel
• To scale, devices must increase throughput with system size
• This requires efficient use of processors, memory, and I/O links
• Measurable performance differences are available to drivers that focus on scalability techniques
• This talk is targeted to driver writers who need to push a lot of data
  • Server storage
  • High-speed networking
  • And so forth
Hardware Evolution
• Industry is moving towards distributed processing complexes
  • CPUs share internal caches between cores
  • CPU and memory are a “node”, connected via links to other nodes
  • I/O is sometimes associated with a node, sometimes not
• Accessing “close” hardware is measurably faster than accessing “far” hardware
  • “Close” operations are vastly cheaper
    • Cache lookups vs. memory lookups
    • Local node memory lookups vs. remote node memory lookups
  • Links between units have fixed throughput and can become saturated
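[Figure: NUMA nodes. Processors P1 to P4, each with its own cache (Cache1 to Cache4) plus shared cache(s); memory MemA and MemB; disks DiskA and DiskB; the nodes are joined by a node interconnect.]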
Drivers Can Optimize for This Environment
• Optimize for end-to-end processing on “close” hardware
  • If data travels long distances, scalability is impacted
• Define discrete blocks of work that can be processed in parallel
  • Understand system topology
  • Understand software components involved in I/O processing
  • Process these blocks in distinct sections of hardware
  • Minimize interaction between processing blocks
• Locality is important for:
  • DMA buffers
  • Buffer copies for data transformation
  • Processors that execute various phases of driver processing
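As a minimal sketch of the "DMA buffers" point, the following C models one buffer pool per NUMA node and takes a buffer from the pool local to the initiating processor, falling back to other nodes only when the local pool is empty. Everything here is hypothetical and for illustration only: `node_pool`, `node_of_cpu`, `take_buffer`, and the fixed topology constants are not Windows APIs, and a real driver would query the topology rather than assume it.

```c
#include <assert.h>

#define MAX_NODES 4      /* hypothetical limit for this sketch */
#define CPUS_PER_NODE 4  /* hypothetical: equally sized nodes, numbered in order */

/* One pool of DMA buffers per NUMA node (only a free count is modelled). */
struct node_pool {
    unsigned free_buffers;
};

/* Map a processor index to its node under the assumptions above. */
unsigned node_of_cpu(unsigned cpu)
{
    assert(cpu < MAX_NODES * CPUS_PER_NODE);
    return cpu / CPUS_PER_NODE;
}

/*
 * Take a buffer for an I/O initiated on `cpu`.
 * Prefers the initiating processor's own node and moves on to the
 * following nodes only when the local pool is empty.
 * Returns the node the buffer came from, or -1 if every pool is empty.
 */
int take_buffer(struct node_pool pools[MAX_NODES], unsigned cpu)
{
    unsigned local = node_of_cpu(cpu);
    for (unsigned i = 0; i < MAX_NODES; i++) {
        unsigned node = (local + i) % MAX_NODES;
        if (pools[node].free_buffers > 0) {
            pools[node].free_buffers--;
            return (int)node;
        }
    }
    return -1;
}
```

The fallback order is a design choice of the sketch; a driver that knows node distances could prefer the nearest non-empty node instead.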
Case Study 1: Storport NUMA I/O
• Improvements in Windows Server 2008 to the data locality of storage transactions
• Includes:
  • Per-I/O interrupt redirection (with appropriate hardware support)
  • NUMA-aware control structure allocation
  • I/O concurrency improvements through lock breakup
• Target ~10% reduction in CPU utilization on I/O intensive workloads
• Used as an example here
  • Pointer to implementation details on Resources slide
  • Different I/O models have different data flows; this case study is designed to demonstrate the analysis and thought process
Storage I/O Sequence of Steps
• Application issues a file read or write
  • Ends up as a storage adapter DMA transaction
  • Potentially to the application buffer, more likely to an intermediate buffer
• Interrupt is generated after the I/O completes
  • Interrupt source is cleared in the device
  • Control structures associated with the request are updated
  • DPC is queued to process the data
• DPC performs the bulk of the processing
  • Flushes DMA adapters (potentially copying data into the user buffer)
  • Potentially performs data transformation
  • Each component (adapter, file systems, etc.) processes control structures
• Application thread is notified of completion
NUMA I/O Concept
• Process requests vertically on the processor initiating the request
  • ISR/DPC happen on the initiating processor
• Optimizes for:
  • Buffers and control structures allocated on the local node
  • Control structures accessed both at initiate and complete time
• Tries to avoid:
  • Saturating links by pulling data across boundaries
  • Interrupting processing that occurs in parallel to this request
Windows Server 2003 Disk Read
0. DiskB statically affinitized to P2 when initialized (random)
1. P3 selects buffer: MemA
2. P3 starts I/O: fill buffer from DiskB
3. DiskB DMA triggers Invalidate(s)
4. Buffer written to MemA (or Node Cache)
5. HW Interrupt and ISR: DiskB → P2
6. P2 executes DPC (by default)
  • Completion processing accesses control state in Cache3
  • Data may be pulled into Cache2
7. Originating thread alerted (APC or synch I/O): P2 → P3
  • May require InterProc Interrupt
8. Data must be in Cache3 to use
[Figure: the numbered steps overlaid on the NUMA node diagram (P1 to P4 with Cache1 to Cache4, MemA, MemB, DiskA, DiskB, node interconnect). Callouts: I/O Initiator, I/O Buffer Home, ISR, DPC; the label “Locked out for I/O Initiation” appears twice.]
Windows Server 2008 Disk Read (NUMA I/O HBA)
Performance Optimizations:
1. Concurrent I/O initiation
2. Interrupt I/O-initiating processor
3. Execute DPC on I/O-initiating processor
Steps:
1. P3 selects buffer: MemB
2. P3 starts I/O: fill buffer from DiskB
3. Buffer written to MemB (or Node Cache)
4. HW Interrupt and ISR: DiskB → P3 or P4
5. P3 executes DPC
  • Control state is hot in Cache3
6. Originating thread likely running locally
7. Data exists in Cache3 for application use
[Figure: the numbered steps overlaid on the NUMA node diagram (P1 to P4 with Cache1 to Cache4, MemA, MemB, DiskA, DiskB, node interconnect). Callouts: I/O Initiator, ISR, DPC.]
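The two disk-read walkthroughs can be compared with a toy tally of how many phases of one request run away from the initiating node. This is only a sketch under stated assumptions: the `io_flow` structure and `remote_touches` function are hypothetical, the model counts nodes rather than individual caches, and it assumes P1/P2 sit with MemA on one node and P3/P4 with MemB on the other, which the slides imply but do not state.

```c
#include <assert.h>

/* Node on which each phase of one disk read runs or lives (hypothetical model). */
struct io_flow {
    unsigned initiator_node; /* node of the processor that issued the I/O */
    unsigned buffer_node;    /* node whose memory holds the I/O buffer */
    unsigned isr_node;       /* node of the processor that takes the interrupt */
    unsigned dpc_node;       /* node of the processor that runs the DPC */
};

/* Count the phases that have to reach across the node interconnect. */
unsigned remote_touches(const struct io_flow *f)
{
    unsigned n = 0;
    assert(f != 0);
    /* The application later reads the data from a remote buffer. */
    n += (f->buffer_node != f->initiator_node);
    /* The ISR touches request state that the initiator created. */
    n += (f->isr_node != f->initiator_node);
    /* The DPC touches control state that is hot in the initiator's cache. */
    n += (f->dpc_node != f->initiator_node);
    /* The completion has to be signalled back to the initiating processor. */
    n += (f->dpc_node != f->initiator_node);
    return n;
}
```

With node 0 holding P1/P2/MemA and node 1 holding P3/P4/MemB, the Windows Server 2003 flow is `{1, 0, 0, 0}` and the NUMA I/O flow is `{1, 1, 1, 1}`; the tally is a way to reason about the diagrams, not a performance measurement.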
How NUMA I/O Works: Interrupt Processing
• MSI-X hardware has one message pointed at every node (at minimum)
• Device supports sending interrupts to requesting processor
• Beneficial cache effects
  • XRB/SRB are accessed by miniport on initiation and in the ISR
  • Cache residency is generally very good
• For thread-initiated asynchronous I/O, chances are good that the thread is still running
  • Less chance of polluting the caches of unrelated threads
• Means that DPC will also occur on initiating processor
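The "one message pointed at every node" idea can be sketched as a lookup from the initiating processor to the MSI-X message whose target is that processor's node. The structure and function names below are hypothetical, and the fixed `CPUS_PER_NODE` mapping is an assumption of the sketch rather than how a miniport actually learns the topology.

```c
#include <assert.h>

#define CPUS_PER_NODE 2 /* hypothetical: equally sized nodes, numbered in order */

/* One MSI-X message and the node its interrupt is pointed at. */
struct msix_message {
    unsigned id;          /* message number programmed into the device */
    unsigned target_node; /* node whose processors receive this message */
};

/*
 * Pick the message to attach to a request initiated on `initiating_cpu`.
 * Prefers a message targeted at the initiator's node; falls back to the
 * first message when the device has none for that node.
 */
unsigned message_for_initiator(const struct msix_message *msgs,
                               unsigned count,
                               unsigned initiating_cpu)
{
    unsigned node = initiating_cpu / CPUS_PER_NODE;
    assert(count > 0);
    for (unsigned i = 0; i < count; i++) {
        if (msgs[i].target_node == node) {
            return msgs[i].id;
        }
    }
    return msgs[0].id;
}
```

Because the ISR then runs on the initiator's node, the DPC it queues lands there too, which is the effect the last bullet describes.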
How NUMA I/O Works: DPC Processing
• DPC typically contains the bulk of I/O completion work
  • Control structure and queue processing by a stack of drivers
  • Data transformation for compressed files, anti-virus, etc.
  • Double-buffering if device cannot DMA to all of memory
• Control processing
  • Tends to be symmetric between initiation and completion
  • Cache considerations: same memory touched on initiation/completion
  • Node considerations: drivers may have NUMA-aware buffer allocation
  • Balancing considerations: components may keep per-node lists
    • Random completions can cause “sloshing”
• Data transformation
  • Not typically part of high-end I/O flow, but something to keep in mind
  • Copies across a link increase latency, impact perf of the whole system
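The "sloshing" effect on per-node lists can be illustrated with a small model: a request takes an entry from the list of the node where it starts and returns it to the list of the node where it completes. The types and functions are hypothetical; only free counts are tracked, not real list entries.

```c
#include <assert.h>

#define NODES 2 /* hypothetical two-node system */

/* Free-entry count of a per-node lookaside-style list (model only). */
struct node_lists {
    int free_count[NODES];
};

/* Initiation: take an entry from the initiating node's list. */
void start_request(struct node_lists *l, unsigned node)
{
    assert(node < NODES);
    assert(l->free_count[node] > 0);
    l->free_count[node]--;
}

/* Completion: the entry is returned to the list of the completing node. */
void complete_request(struct node_lists *l, unsigned node)
{
    assert(node < NODES);
    l->free_count[node]++;
}

/* Difference between the fullest and the emptiest list. */
int imbalance(const struct node_lists *l)
{
    int lo = l->free_count[0], hi = l->free_count[0];
    for (unsigned n = 1; n < NODES; n++) {
        if (l->free_count[n] < lo) lo = l->free_count[n];
        if (l->free_count[n] > hi) hi = l->free_count[n];
    }
    return hi - lo;
}
```

When every request completes on the node that started it, the lists stay level; when completions land elsewhere (for example, always on node 0), entries drift from one node's list to the other's and must eventually be moved back across the link.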
Case Study 2: NDIS Receive-Side Scaling (RSS)
• Reacting to application locality doesn’t work in all cases
  • Inbound network traffic
  • A data queue might contain data for more than one application
  • Data transformations might occur before the application is known
• In this case scalability can be improved by load balancing
  • Split work into blocks
  • Process the blocks independently
• Locality still matters
  • Ensure that processing of one block does not interfere with another
  • Interrupt, DPC, and worker processing can still happen vertically
How RSS Works: Multiple Queues
• Hardware divides received packets across multiple receive queues
  • Each receive queue collects packets for one or more TCP connections
  • Target queue selection ensures in-order processing of each connection
• Packet processing occurs vertically
  • Each receive queue interrupts a different processor using MSI-X
  • Most network processing done in a DPC on the interrupting processor
• RSS processor selection parameters
  • Speed of the NIC
  • The processor topology of the system
  • The “distance” of the NIC to NUMA nodes (if known)
  • Administrative settings
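Queue selection that keeps each connection in order can be sketched as a hash over the connection's addresses and ports followed by a table lookup. The slides do not name the hash; NDIS RSS is documented as using a Toeplitz hash with an indirection table, so that is what this sketch implements. The function names, the table layout, and any key passed in are illustrative assumptions, not the NDIS interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash: for every set bit of the input (most significant bit
 * first), XOR in the 32-bit window of the key that starts at that bit.
 * The key must be at least 4 bytes longer than the input.
 */
uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                       const uint8_t *data, size_t data_len)
{
    assert(key_len >= data_len + 4);
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8) | (uint32_t)key[3];
    uint32_t hash = 0;
    for (size_t i = 0; i < data_len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit)) {
                hash ^= window;
            }
            /* Slide the window one bit further along the key. */
            window = (window << 1) | (uint32_t)((key[i + 4] >> bit) & 1u);
        }
    }
    return hash;
}

/*
 * Map a hash to a receive queue through an indirection table.
 * table_size must be a power of two; the low hash bits index the table.
 */
unsigned queue_for_hash(uint32_t hash, const uint8_t *table, unsigned table_size)
{
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    return table[hash & (table_size - 1)];
}
```

Every packet of a connection carries the same addresses and ports, so it hashes to the same table entry and therefore the same queue and processor, which is what preserves in-order processing.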
How RSS Works: Scalability Improvements
• Cache benefits result from accessing:
  • The same data structures repeatedly on the same processors
  • The packet header by multiple layers of the stack on the same processor
• Partitioning benefits
  • TCP connection structures are partitioned to match the queues
• Processor load is monitored at run time
  • Adjustments are made to avoid over-utilization of a processor
  • TCP connections are moved across queues
  • Processors are changed by re-targeting MSI-X vectors
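One way to picture "TCP connections are moved across queues" is a rebalancing step that remaps a single bucket of a hash-to-queue table from the busiest queue to the least loaded one. This is a sketch of the idea only: the table size, the threshold policy, and the `rebalance_once` function are hypothetical and much simpler than the load monitoring NDIS performs.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 8 /* hypothetical number of hash buckets */

/*
 * Move one bucket from the busiest queue to the least loaded queue when
 * their load differs by more than `threshold`.
 * table[b] is the queue currently serving bucket b; load[q] is the
 * observed load of queue q in arbitrary units.
 * Returns the index of the bucket that moved, or -1 if nothing changed.
 */
int rebalance_once(uint8_t table[TABLE_SIZE], const unsigned *load,
                   unsigned queue_count, unsigned threshold)
{
    unsigned busiest = 0, idlest = 0;
    assert(queue_count > 0);
    for (unsigned q = 1; q < queue_count; q++) {
        if (load[q] > load[busiest]) busiest = q;
        if (load[q] < load[idlest]) idlest = q;
    }
    if (load[busiest] - load[idlest] <= threshold) {
        return -1; /* balanced enough: leave connections where they are */
    }
    for (int b = 0; b < TABLE_SIZE; b++) {
        if (table[b] == busiest) {
            table[b] = (uint8_t)idlest;
            return b;
        }
    }
    return -1; /* busiest queue owns no bucket in this table */
}
```

Moving one bucket at a time keeps most connections on the processor where their state is already cached, so rebalancing costs locality only for the connections that actually move.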
Locality Counter-Example: Power Management
• When a driver enforces locality it can limit flexibility
  • E.g. the OS’s ability to migrate work for system efficiency
• Consider power management
  • Work consolidation might be best on a lightly loaded system
  • Use a small set of hardware, allow the rest to go idle and sleep
  • This may provide the best throughput-per-watt and power scalability
  • Aggressive load balancing by the driver can limit the OS
• A balance between these considerations is required
The Balance
• Aligning chunks of work to NUMA nodes provides a good balance
• Provides OS flexibility within a node, but keeps work partitioned
• A good, static default
  • No resource management facilities exist for migrating work based on system conditions
  • Load monitoring systems like NDIS are very complicated
• Or, react to the apps/drivers sending you requests, so that you move gracefully with system changes around you, as in NUMA I/O
Windows Performance Toolkit—Performance Analyzer
• Based on Event Tracing for Windows (ETW) instrumentation
  • Instrumentation built into the retail Windows operating system
  • Provides coverage of kernel-level activity
• Controlling, decoding, and displaying ETW events in kernel
  – process lifetime
  – thread lifetime
  – image lifetime
  – sample profile
  – context switch
  – DPC
  – ISR
  – driver delay
  – disk I/O
  – file I/O
  – registry
  – hardfault
  – pagefault
  – virtual allocation
  – heap
  – TCP/UDP
Scalability Analysis Using Performance Analyzer
• Can do scalability analysis on a per-driver basis
  • Total DPC/ISR time per processor
  • Wait classification (new in Windows 7), showing:
    • Which functions in the driver cause the calling threads to sleep
    • Who wakes up sleeping threads
  • IRP tracing
  • And much more
• The tool can be downloaded; location on Resources slide
• MSDN has comprehensive documentation
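The "total DPC/ISR time per processor" view amounts to summing event durations by processor and looking for skew. The sketch below does that over an in-memory array; the `dpc_sample` record is a hypothetical stand-in for decoded trace events and is not the format the Performance Analyzer uses.

```c
#include <assert.h>
#include <stddef.h>

/* One decoded DPC or ISR event (hypothetical record layout). */
struct dpc_sample {
    unsigned cpu;         /* processor the routine ran on */
    unsigned duration_us; /* time spent in the routine, in microseconds */
};

/* Sum the time spent in DPC/ISR routines on each processor. */
void total_per_cpu(const struct dpc_sample *samples, size_t count,
                   unsigned long *totals, unsigned cpu_count)
{
    for (unsigned c = 0; c < cpu_count; c++) {
        totals[c] = 0;
    }
    for (size_t i = 0; i < count; i++) {
        assert(samples[i].cpu < cpu_count);
        totals[samples[i].cpu] += samples[i].duration_us;
    }
}

/* Return the processor with the largest total (lowest index on ties). */
unsigned busiest_cpu(const unsigned long *totals, unsigned cpu_count)
{
    unsigned best = 0;
    assert(cpu_count > 0);
    for (unsigned c = 1; c < cpu_count; c++) {
        if (totals[c] > totals[best]) best = c;
    }
    return best;
}
```

A driver whose totals pile up on a single processor while the others sit near zero is a candidate for the locality and load-balancing techniques from the case studies.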
Performance Analyzer Screen Shot—DPC/ISR Time
Resource Monitor—Per-Node CPU Activity
Windows 7 Improvements: NUMA and Topology
• Understanding topology is necessary for locality optimizations
• User-mode NUMA APIs have existed for some time
• Symmetric APIs coming for kernel-mode
  • KeQueryLogicalProcessorRelationship
    • Cache, Package, Node relationships
    • “Distances” between nodes
  • Current node information
    • KeQueryNodeActiveAffinity
    • KeGetCurrentNodeNumber
  • Device proximity
    • IoGetDeviceNumaNode
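Once a driver knows its device's node and the node "distances", choosing where to run work reduces to a nearest-neighbour lookup. The sketch below uses a hard-coded distance matrix and a hypothetical `closest_candidate_node` helper; in a real driver the inputs would come from the kernel routines listed above, whose exact output formats are not modelled here.

```c
#include <assert.h>
#include <limits.h>

/*
 * Pick the candidate node closest to the device.
 * dist is a node_count x node_count matrix stored row by row, where
 * dist[a * node_count + b] is the relative cost of node a reaching node b.
 * candidate[n] is nonzero when node n is allowed to take the work.
 * Returns the chosen node, or -1 when no node is a candidate.
 */
int closest_candidate_node(const unsigned *dist, unsigned node_count,
                           unsigned device_node,
                           const unsigned char *candidate)
{
    int best = -1;
    unsigned best_dist = UINT_MAX;
    assert(device_node < node_count);
    for (unsigned n = 0; n < node_count; n++) {
        unsigned d = dist[device_node * node_count + n];
        if (candidate[n] && d < best_dist) {
            best = (int)n;
            best_dist = d;
        }
    }
    return best;
}
```

For a three-node system with distances `{10,20,30, 20,10,20, 30,20,10}` and a device on node 0, the device's own node wins when it is a candidate and node 1 is the next choice when it is not.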
Windows 7 Improvements: Processor Groups
• KAFFINITY limits a system to 64 processors
  • KAFFINITY is a bitmap: easy and useful to manipulate
  • But increasingly a problematic limitation
• Extensibility through processor “groups” of up to 64 processors
  • Groups are static, defined by keeping NUMA nodes together
  • Within a group, traditional processor management techniques apply
• Allows support for arbitrary numbers of processors. A given Windows release will support what can be validated.
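The group idea can be modelled as a pair: a group number plus a 64-bit mask that is only meaningful inside that group. The structures below mirror that shape but are hypothetical stand-ins, not the Windows definitions, and the sketch assumes every group is completely filled with 64 processors, which real systems need not satisfy because groups follow NUMA node boundaries.

```c
#include <assert.h>
#include <stdint.h>

#define GROUP_SIZE 64u /* a 64-bit bitmap covers at most 64 processors */

/* Hypothetical stand-ins for a group-relative processor number and mask. */
struct processor_number {
    uint16_t group;  /* which group the processor belongs to */
    uint8_t number;  /* index within the group, 0..63 */
};

struct group_affinity {
    uint16_t group;  /* group the mask applies to */
    uint64_t mask;   /* one bit per processor in that group */
};

/* Split a flat processor index, assuming fully populated groups. */
struct processor_number processor_from_index(unsigned index)
{
    struct processor_number p;
    p.group = (uint16_t)(index / GROUP_SIZE);
    p.number = (uint8_t)(index % GROUP_SIZE);
    return p;
}

/* Build a single-processor affinity for the given processor. */
struct group_affinity affinity_for(struct processor_number p)
{
    struct group_affinity a;
    assert(p.number < GROUP_SIZE);
    a.group = p.group;
    a.mask = (uint64_t)1 << p.number;
    return a;
}
```

Under these assumptions, flat processor 70 is processor 6 of group 1, and its mask has only bit 6 set; the same mask in group 0 names a different processor, which is why a bare bitmap is no longer enough.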
Processor Groups: Locality Leads to Scalability
• The philosophy: in order to scale to hundreds of processors, work must be partitioned with topology in mind
• Most work happens within a single group
  • Threads are scheduled within a group unless migrated explicitly
  • Processes run in a single group by default
  • A single interrupt can only be attached to processors within a group
• Other groups can be used intentionally
  • A process can migrate threads between groups explicitly
  • MSI-X messages can each be attached to an independent group
Processor Group APIs
• Any kernel routine using processor numbers or affinity is changing
• New representations:
  • GROUP_AFFINITY: group number + KAFFINITY
  • PROCESSOR_NUMBER: group number + group-relative number
• Example APIs that are changing
  • KeGetCurrentProcessorNumber
  • KeQueryActiveProcessorCount
  • KeSetSystemAffinityThreadEx
  • KeSetTargetProcessorDpc
• New mechanisms to extend drivers past the first group
  • Registry key to indicate support for interrupts past first group
• Complete details available in white paper
Processor Groups: Do You Need to Care?
• Use of per-processor data structures?
  • Possible functional issues, discussed on the next slide
• High performance device
  • Multiple MSI-X messages to be spread across the system
  • System-wide load balancing requirements
  • Likely to be limited to the first group without changes
• System management utility
  • Likely not to comprehend all processors in the system without changes
Processor Groups: Compatibility Risk
• Drivers run in the context of many apps
  • Must accept requests across all processors
  • Unmodified APIs return per-group information
• Example issue:
  • Per-processor data structure accessed in a DPC
  • Because it is per-processor, at DISPATCH_LEVEL, it is self-synchronized
  • Two processors in separate groups now map to the same data structure
  • Possible list corruption, etc.
• Drivers expecting to run on very large systems will need to be updated, and will need to test in this environment
• Multi-group testing available on any MP machine
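The example issue comes down to which number is used to index the per-processor array. The sketch below contrasts a group-relative index with a system-wide one; the structure, constants, and functions are hypothetical, and it assumes fully populated groups of 64. In a real driver the system-wide index would come from the group-aware kernel routines (for example `KeGetCurrentProcessorNumberEx`) rather than from arithmetic like this.

```c
#include <assert.h>

#define GROUP_SIZE 64u  /* hypothetical: every group fully populated */
#define GROUP_COUNT 2u  /* hypothetical two-group system */

/* Hypothetical group-relative processor identity. */
struct processor_number {
    unsigned group;   /* group the processor belongs to */
    unsigned number;  /* index within that group */
};

/* What a group-unaware driver effectively uses: the group is ignored. */
unsigned group_relative_slot(struct processor_number p)
{
    return p.number;
}

/* A slot that is unique across the whole system. */
unsigned system_wide_slot(struct processor_number p)
{
    assert(p.group < GROUP_COUNT && p.number < GROUP_SIZE);
    return p.group * GROUP_SIZE + p.number;
}
```

Processor 3 of group 0 and processor 3 of group 1 share a `group_relative_slot`, so two DPCs running at the same time can update the same "per-processor" list without any lock; with `system_wide_slot` they get slots 3 and 67 and the self-synchronization assumption holds again, provided the array is sized for all processors.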
Related Sessions
Session                                                                  Day / Time
Implementing Efficient RSS Capable Hardware and Drivers for Windows 7    Tues. 1:30-2:30
Network Power Management in Windows 7                                    Tues. 5:15-6:15
Storport Drivers from the Ground Up                                      Tues. 8:30-9:30 and Wed. 9:45-10:45
Driver Scalability                                                       Tues. 11-12
Resources
• Windows Performance Tools Kit
  http://www.microsoft.com/whdc/system/sysperf/perftools.mspx
• NUMA I/O presentation (WinHEC 2007)
  http://download.microsoft.com/download/a/f/d/afdfd50d-6eb9-425e-84e1-b4085a80e34e/SVR-T332_WH07.pptx
• “Scalable Networking with RSS” on the WHDC Web site
  http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx