Driver Scalability Davis Walker Principal Development Lead Windows Kernel dwalker@

  • View

  • Download

Embed Size (px)

Text of Driver Scalability Davis Walker Principal Development Lead Windows Kernel dwalker@

1Driver ScalabilityDavis WalkerPrincipal Development LeadWindows Kerneldwalker@microsoft.com2AgendaLocality for Driver ScalabilityCase StudiesStorport NUMA I/OLoad Balancing and NDIS Receive-Side Scaling (RSS)Analysis ToolsComplementary Windows SupportNUMA and locality APIsProcessor Groups3OverviewSystems are getting larger and more parallelTo scale, devices must increase throughput with system sizeThis requires efficient use of processors, memory, and I/O linksMeasurable performance differences are available to drivers that focus on scalability techniquesThis talk is targeted to driver writers who need to push a lot of dataServer storageHigh-speed networkingAnd so forth4Hardware EvolutionIndustry is moving towards distributed processing complexesCPUs share internal caches between coresCPU and memory are a node, connected via links to other nodesI/O is sometimes associated with a node, sometimes notAccessing close hardware vs. far hardware is measurably fasterClose operations are vastly cheaperCache lookups vs. memory lookupsLocal node memory lookups vs. remote node memory lookupsLinks between units have fixed throughput and can become saturated5P1Cache1MemANode InterconnectMemBDiskAP3Cache3P4Cache4Cache(s)P2Cache2DiskBNUMANodes6Drivers Can Optimize for This EnvironmentOptimize for end-to-end processing on close hardwareIf data travels long distances, scalability is impactedDefine discrete blocks of work that can be processed in parallelUnderstand system topologyUnderstand software components involved in I/O processingProcess these blocks in distinct sections of hardwareMinimize interaction between processing blocksLocality is important for:DMA buffersBuffer copies for data transformationProcessors that execute various phases of driver processing7Case Study 1: Storport NUMA I/OImprovements to Windows Server 2008 to improve data locality of storage transactionsIncludes:Per-I/O interrupt redirection (with appropriate hardware support)NUMA-aware control structure allocationI/O concurrency improvements through lock breakupTarget ~10% reduction in CPU utilization on I/O intensive workloadsUsed as an example herePointer to implementation details on Resources slideDifferent I/O models have different data flowsthis case study is designed to demonstrate the analysis and thought process8Storage I/O Sequence of StepsApplication issues a file read or writeEnds up as a storage adapter DMA transactionPotentially in the application buffer, more likely to an intermediate bufferInterrupt is generated after the I/O completesInterrupt source is cleared in the deviceControl structures associated with the request are updatedDPC is queued to process the dataDPC performs the bulk of the processingFlushes DMA adapters (potentially copying data into the user buffer)Potentially performs data transformationEach component (adapter, file systems, etc.) processes control structuresApplication thread is notified of completion9NUMA I/O ConceptProcess requests vertically on the processor initiating the requestISR/DPC happen on the initiating processorOptimizes for:Buffers and control structures allocated on the local nodeControl structures accessed both at initiate and complete timeTries to avoid:Saturating links by pulling data across boundariesInterrupting processing that occurs in parallel to this request10P1Cache1MemANode InterconnectMemBDiskAP3Cache3P4Cache40.DiskB statically affinitized to P2 when initialized (random) P3 selects buffer: MemAP3 starts I/O: fill buffer from DiskBDiskB DMA triggers Invalidate(s)Buffer written to MemA (or Node Cache)HW Interrupt and ISR: DiskB P2

P2 executes DPC (by default) Completion processing accesses control state in Cache3Data may be pulled into Cache2Originating thread alerted (APC or synch I/O): P2 P3May require InterProc InterruptData must be in Cache3 to useWindows Server 2003 Disk ReadCache(s)(0)(3)(4)(6)(7)I/O InitiatorISR(1) I/O Buffer HomeDPC(2)(6)(5)P2Cache2DiskB(8)Locked out for I/O InitiationLocked out for I/O Initiation11P1Cache1MemANode InterconnectMemBDiskAP3Cache3P4Cache4Performance Optimizations: Concurrent I/O initiationInterrupt I/O-initiating processorExecute DPC on I/O-initiating processorP3 selects buffer: MemBP3 starts I/O: fill buffer from DiskBBuffer written to MemB (or Node Cache)HW Interrupt and ISR: DiskB P3 or P4 P3 executes DPCControl state is hot in Cache3Originating thread likely running locallyData exists in Cache3 for application use

Windows Server 2008 Disk Read (NUMA I/O HBA)Cache(s)(3)(3)I/O InitiatorISRDPC(2)P2Cache2DiskBISR(2)12How NUMA I/O Works: Interrupt ProcessingMSI-X hardware has one message pointed at every node (at minimum)Device supports sending interrupts to requesting processorBeneficial cache effectsXRB/SRB are accessed by miniport on initiation and in the ISRCache residency is generally very goodIf thread-initiated asynch I/O, chances are good that it is still runningLess chance of cache pollution of unrelated threadsMeans that DPC will also occur on initiating processor13How NUMA I/O Works: DPC ProcessingDPC typically contains the bulk of I/O completion workControl structure and queue processing by a stack of driversData transformation for compressed files, anti-virus, etcDouble-buffering if device cannot DMA to all of memoryControl processing Tends to be symmetric between initiation and completionCache considerationssame memory touched on initiation/completionNode considerationsdrivers may have NUMA-aware buffer allocationBalancing considerationscomponents may keep per-node listsRandom completions can cause sloshingData transformationNot typically part of high-end I/O flow, but something to keep in mindCopies across a link increase latency, impact perf of the whole system14Case Study 2: NDIS Receive-Side Scaling (RSS)Reacting to application locality doesnt work in all casesInbound network trafficA data queue might contain data for more than one applicationData transformations might occur before the application is knownIn this case scalability can be improved by load balancingSplit work into blocksProcess the blocks independentlyLocality still mattersEnsure that processing of one block does not interfere with anotherInterrupt, DPC, and worker processing can still happen vertically15How RSS Works: Multiple QueuesHardware divides received packets across multiple receive queuesEach receive queue collects packets for one or more TCP connectionsTarget queue selection ensures in-order processing of each connectionPacket processing occurs verticallyEach receive queue interrupts a different processor using MSI-XMost network processing done in a DPC on the interrupting processorRSS processor selection parametersSpeed of the NICThe processor topology of the systemThe distance of the NIC to NUMA nodes (if known)Administrative settings16How RSS Works: Scalability ImprovementsCache benefits result from accessing:The same data structures repeatedly on the same processorsThe packet header by multiple layers of the stack on the same processorPartitioning benefitsTCP connections structures are partitioned to match the queues Processor load is monitored at run time Adjustments are made to avoid over-utilization of a processorTCP connections are moved across queuesProcessors are changed by re-targeting MSI-X vectors

17Locality Counter-Example: Power ManagementWhen a driver enforces locality it can limit flexibilityE.g. the OSs ability to migrate work for system efficiencyConsider power managementWork consolidation might be best on a lightly loaded systemUse a small set of hardware, allow the rest to idle to sleepThis may provide the best throughput-per-watt and power scalabilityAggressive load balancing by the driver can limit the OSA balance between these considerations is required18The BalanceAligning chunks of work to NUMA nodes provides a good balanceProvides OS flexibility within a node, but keeps work partitionedA good, static defaultNo resource management facilities exist for migrating work based on system conditionsLoad monitoring systems like NDIS are very complicatedOr, react to the apps/drivers sending you requests, so that you move gracefully with system changes around you, as in NUMA I/O

19Windows Performance ToolkitPerformance AnalyzerBased on Event Tracing for Windows (ETW) instrumentationInstrumentation built into the retail Windows operating systemProvides coverage of kernel-level activityControlling, decoding, and displaying ETW events in kernel

process lifetimethread lifetimeimage lifetimesample profilecontext switchDPCISRdriver delay

disk I/Ofile I/Oregistryhardfaultpagefaultvirtual allocationheapTCP/UDP

20Scalability Analysis Using Performance AnalyzerCan do scalability analysis on a per-driver basisTotal DPC/ISR time per processorWait classification (new in Windows 7), showing:Which functions in the driver cause the calling threads to sleepWho wakes up sleeping threadsIRP tracingAnd much moreThe tool can be downloadedlocation on Resources slideMSDN has comprehensive documentation

21Performance Analyzer Screen ShotDPC/ISR Time

22Resource MonitorPer-Node CPU Activity

23Windows 7 Improvements: NUMA and TopologyUnderstanding topology is necessary for locality optimizationsUser-mode NUMA APIs have existed for some timeSymmetric APIs coming for kernel-modeKeQueryLogicalProcessorRelationshipCache, Package, Node relationshipsDistances between nodesCurrent node informationKeQueryNodeActiveAffinityKeGetCurrentNodeNumberDevice proximityIoGetDeviceNumaNode24Windows 7 Improvements: Processor GroupsKAFFINITY limits a system to 64 processorsKAFFINITY is a bitmapeasy and useful to manipulateBut increasingly a problematic limitationExtensibility through processor groups of up to 64 processorsGroups are staticdefined by keeping NUMA nodes togetherWithin a group, traditional processor management techniques applyAllows support for arbitrary numbers of processors. A give