
HPE Integrity Superdome X system architecture and RAS

A new level of x86 compute for mission-critical environments

Technical white paper


Contents

Introduction
Performance and capacities
Reliability, availability, serviceability
Manageability
System architecture
BL920s Gen9 Server Blade
BL920s Gen9 Server Blade Management
New sx3000 Crossbar Fabric Module
Superdome Onboard Administrator
RAS
Fault management strategy
Firmware first
RAS differentiators
Management
Built-in management components
Additional management resources
Conclusion
Resources, contacts, and additional links


Introduction Businesses are looking to enterprise applications, databases, the web, and social media as sources of competitive advantage. As customers’ expectations rise, the infrastructure that powers these applications becomes increasingly business-critical. As the workload for these applications grows, IT organizations face an important decision: whether to scale out by adding new servers to support the workloads, or to scale up by migrating the workloads to more powerful servers.

Scale-up represents an attractive alternative for a variety of reasons:

• Operational cost reductions because of reduced server management staffing requirements

• Reduced power and cooling costs

• Reduced software licensing costs

• Reduced IT infrastructure costs

Scale-up can also yield improved performance, including increased resource utilization, improved application performance, less unplanned downtime, and extended data center life. International Data Corporation (IDC) Business Value research with scores of IT organizations found that this type of scale-up consolidation, combined with the powerful effects of virtualization, resulted in savings of over 35 percent. IDC has identified SAP®, Oracle, Microsoft SQL Server, and custom-developed application environments as prime candidates for scale-up.

HPE Integrity Superdome X offers a scale-up solution for data centers with growing mission-critical workloads. Integrity Superdome X blends trusted Integrity Superdome reliability with a standard x86-based design. This document describes the Integrity Superdome X architecture and explains its significant benefits in performance, manageability, and reliability for your mission-critical environment.

Figure 1. HPE Integrity Superdome X system

Integrity Superdome X leverages a number of components from the HPE Integrity Superdome 2 enclosure as well as components from the HPE ProLiant BladeSystem c7000 enclosure. The Onboard Administrators, Global Partition Service Modules, the enclosure DVD module, and the mechanical enclosure itself are all reused from HPE Integrity Superdome 2. Other components, like hot-swappable bulk power supplies, enclosure fans, and even I/O interconnect modules, are interchangeable with c7000 systems, allowing a reduction in on-site spare component inventory within environments that have a mixture of system types.

[Figure 1 callouts: c-Class 2450W power supplies (12), DVD module, air intake plenum, Insight Display, BL920s Server Blades (8), air exhaust plenum, 4x Xbar fabric modules, Global Partition Service Modules, interconnect modules, Superdome Onboard Administrator, AC input module (2), active cool fans (15); front and rear views]


Performance and capacities The Integrity Superdome X Server is designed to achieve a new level of x86 computational capability and availability, allowing Intel® Xeon® processors to be utilized for true mission-critical workloads without concerns about system downtime. With 16 sockets for Intel Xeon E7 v4 processors, the Integrity Superdome X scales up to 384 processor cores and supports up to 384 DIMMs for a maximum memory footprint of 24 TB using 64 GB DDR4 LR-DIMMs. On-blade I/O provides a system total of 16 FlexLOMs and 24 mezzanine cards in a fully provisioned 8-blade system.
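These headline figures follow directly from the blade-level numbers. As a quick check, here is the arithmetic as an illustrative Python sketch, using only quantities quoted in this paper:

```python
# Superdome X capacity arithmetic, from figures quoted in this paper.
BLADES = 8              # BL920s Gen9 Server Blades per enclosure
SOCKETS_PER_BLADE = 2   # Intel Xeon E7 v4 processors per blade
CORES_PER_SOCKET = 24   # top-bin E7 v4 core count
DIMMS_PER_BLADE = 48    # DDR4 DIMM slots per blade
DIMM_GB = 64            # 64 GB DDR4 LR-DIMMs

sockets = BLADES * SOCKETS_PER_BLADE    # 16 sockets
cores = sockets * CORES_PER_SOCKET      # 384 cores
dimms = BLADES * DIMMS_PER_BLADE        # 384 DIMMs
memory_tb = dimms * DIMM_GB / 1024      # 24.0 TB

print(sockets, cores, dimms, memory_tb)  # -> 16 384 384 24.0
```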

The compute power of the Superdome X platform yields ground-breaking performance. Comparison testing revealed Superdome X as:

• #1 overall performance @ 10000 GB scale factor on non-clustered TPC-H benchmark1

• #1 overall 8-socket price/performance @ 10000 GB scale factor on non-clustered TPC-H benchmark1

• x86 16P world record on the SPECjbb2015-MultiJVM benchmark, with #1 max-jOPS and critical-jOPS results2

• Highest performing 16-processor result on the two-tier SAP® Sales and Distribution (SD) standard application benchmark: 100,000 SAP benchmark users and 545,780 SAPS3

Reliability, availability, serviceability Reliability, availability, and serviceability are collectively known as “RAS,” and RAS is one of the top reasons customers deploy mission-critical workloads on HPE Superdome servers. Robust RAS has always been a designed-in philosophy for Superdome servers, and Superdome X builds on that strategy with RAS strengths such as:

• Full implementation of Intel Xeon E7 v4 RAS capabilities

• Fault-tolerant crossbar (Xbar) design offering:

– Hard partitioning for a standard x86 platform for reliability

– Fully fault-resilient fabric (traffic is automatically re-routed around any failed links)

– Passive midplanes with end-to-end retry and link failover

• Onboard Administrator with built-in Analysis Engine providing error correction, self-healing, and advanced diagnostics

• Hot-swappable power and cooling components

A detailed discussion of RAS is provided in the RAS section of this document.

Manageability As part of the HPE Mission-Critical Converged Infrastructure, Integrity Superdome X includes built-in management components that work with a variety of resources to enable a comprehensive management strategy.

Management components and resources include:

• Superdome Onboard Administrator (SD OA)

• Insight Display

1 TPC-H is a trademark of the Transaction Processing Performance Council. TPC-H results show the HPE Integrity Superdome X with a result of 780,346.9 QphH @ 10000 GB and $2.27/QphH @ 10000 GB with system availability as of February 3, 2016; see tpc.org/3317. The TPC believes that comparisons of TPC-H results published with different scale factors are misleading and discourages such comparisons.

2 SPEC, the SPEC logo, and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). All rights reserved; see spec.org. Results as of February 22, 2016.

3 SAP HANA, SAP NetWeaver, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other countries. HPE received certification on February 15, 2016, from SAP SE of the results of Superdome X on the two-tier SAP® SD standard application benchmark performed in Houston on February 9, 2016. The Integrity Superdome X achieved the leadership 16-processor result on the two-tier SAP Sales and Distribution (SD) standard application benchmark of 100,000 SAP benchmark users and 545,780 SAPS. To achieve this result, the Integrity Superdome X used 16 Intel® Xeon® Processors E7-8890 v3 at 2.5 GHz and 4 TB of memory running Microsoft Windows Server 2012 R2 Datacenter Edition, Microsoft SQL Server 2014, and SAP enhancement package 5 for the SAP ERP application 6.0. See sap.com/benchmark for up-to-date information.


• HPE Insight Remote Support

• HPE Smart Update Manager

The SD OA and Insight Display are both similar to those used in the c-Class enclosure, offering a common management “look and feel” for health alerts and monitoring, automated power and cooling management, role-based security, and other features. The LCD display on the outside rack door allows IT personnel to view the complex name, overall complex health, power, and ambient air temperature at a single glance without opening the rack door. HPE resources such as HPE Insight Remote Support and HPE Smart Update Manager offer remote, single-point management of Superdome X servers.

A detailed discussion of Superdome X management capabilities is provided in the Management section of this document.

System architecture HPE Integrity Superdome X combines the best of x86 architectures and the sx3000 chipset to provide a performance-scalable server with the RAS features necessary to operate in mission-critical environments. Figure 2 shows the system architecture of the HPE Integrity Superdome X enclosure and the BL920s Gen9 Server Blade. Each enclosure consists of up to eight BL920s Gen9 Server Blades, one upper midplane that provides support for the four sx3000 Xbar Fabric Modules, and one lower midplane that interfaces to I/O interconnect modules that plug into bays in the rear of the enclosure. Each enclosure also includes a shared DVD module and two Global Partition Service Modules (GPSMs) that are used for server management and global clock sourcing.

Figure 2. HPE Integrity Superdome X enclosure architecture


BL920s Gen9 Server Blade Each BL920s Gen9 Server Blade accommodates two Intel Xeon E7 4800/8800 v4 or v3 series processors that provide up to 24 cores per processor socket. With Hyper-Threading enabled, a 48-core BL920s Gen9 Server Blade provides 96 logical processors. The two processors are connected via a single 8.0 GT/s Intel QuickPath Interconnect (QPI) link. Each processor is connected to an HPE custom-designed eXternal Node Controller 2 (XNC2) ASIC via two more QPI links for sending remote data traffic to the Xbar. The Intel Xeon E7 v4 processors offer new memory capabilities as well as reliability, availability, and serviceability (RAS) enhancements over earlier-generation processors.

Intel Xeon E7 v4 processor advantages include:

• Up to 24 cores per processor

• Up to 48 threads per processor with Intel Hyper-Threading Technology

• Three QuickPath Interconnect links that run at link speeds up to 8.0 GT/s

• Up to 60 MB of shared L3 cache per processor

• Integrated memory controllers that:

– Connect directly to Intel’s scalable memory buffer (SMB) chips via Scalable Memory Interconnect 2 (SMI 2) links

– Provide up to 24 DIMMs per processor

eXternal Node Controller The eXternal Node Controller (XNC2) is a new HPE custom-designed ASIC that interfaces Xeon E7 v4 processors to the sx3000 Xbar originally designed for Superdome 2. The XNC2 provides four QPI links to interface with two Xeon E7 v4 processors on one side, and provides up to eight fabric links on the other side to connect to other blades in the Superdome X enclosure across the upper midplane and sx3000 Xbar Fabric Modules (XFMs).

The XNC2 replaces the two sx3000 Agent ASICs utilized in Superdome 2, and provides the following features:

• Physical address support for up to 64 TB of main memory

• Large Remote Tag Cache for scalable coherency

• Link-level retry, link-width reduction, and end-to-end retry to provide fault tolerance from fabric errors

• 5 GT/s fabric links for maximum Xbar bandwidth

• Management link for reading/writing ASIC registers to perform real-time error analysis

Memory subsystem Each BL920s Gen9 Blade has 48 DDR4 DIMM slots that accommodate 64 GB LR-DIMMs for a maximum per-blade capacity of 3.0 TB. This gives the HPE Integrity Superdome X a total memory capacity of 24 TB of main memory to support the most intensive in-memory applications. A diagram of the memory subsystem architecture is shown in Figure 3.


Figure 3. BL920s Gen9 Server Blade Memory Subsystem

As illustrated in Figure 3, the memory subsystem includes the following features:

• Each Xeon E7 v4 processor provides two fully independent integrated Memory Controllers (MCs)

• Each memory controller provides two fully independent memory channels

• Each memory channel connects to its own Scalable Memory Buffer chip and six DDR4 DIMM slots

Since these memory channels are fully independent, they can all run simultaneously at DRAM data transfer rates up to 1600 MT/s to provide the HPE Integrity Superdome X with as much as 1 TB/s of memory bandwidth in a fully populated enclosure. Such memory capacity and performance levels are necessary to keep all 384 cores of Xeon E7 v4 processing power running the most demanding workloads.
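Multiplying out this topology shows where the 384-DIMM total comes from and what a single DDR4 bus contributes. The sketch below is illustrative Python; note that the aggregate ~1 TB/s figure quoted above depends on how the memory buffers' DRAM-side buses are counted, which this paper does not break down:

```python
# Memory topology arithmetic, from the description above.
SOCKETS = 16
MCS_PER_SOCKET = 2        # integrated memory controllers per processor
CHANNELS_PER_MC = 2       # independent channels per memory controller
DIMMS_PER_CHANNEL = 6     # DDR4 slots behind each Scalable Memory Buffer

channels = SOCKETS * MCS_PER_SOCKET * CHANNELS_PER_MC  # 64 channels
dimms = channels * DIMMS_PER_CHANNEL                   # 384 DIMMs

# Peak rate of one 8-byte-wide DDR4 bus at 1600 MT/s:
bus_gb_s = 1600e6 * 8 / 1e9                            # 12.8 GB/s

print(channels, dimms, bus_gb_s)  # -> 64 384 12.8
```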

The I/O subsystem Achieving excellent system performance requires an appropriate balance among processing power, memory capacity and performance, crossbar interconnectivity, and system I/O capability. Each BL920s Gen9 Server Blade provides the necessary I/O capabilities to maintain that balance through a flexible, customer-configurable I/O design offering two Flexible LAN-on-Motherboard (FlexLOM) devices and three mezzanine card slots. These five configurable component slots connect directly to the E7 v4 processors through multi-channel PCIe Gen3 links and are capable of providing a maximum I/O bandwidth of up to 100 GB/s per blade. FlexLOM technology and mezzanine cards put the I/O interface on daughter cards, allowing you to choose a specific communication technology while keeping the interfaces closely coupled to the main system architecture. Figure 4 shows a diagram of the BL920s Gen9 Server Blade I/O subsystem’s architecture.


Figure 4. BL920s Gen9 Server Blade I/O Subsystem

As illustrated in Figure 4, the I/O subsystem includes the following features:

• Each E7 v4 processor connects directly to downstream I/O slots with no Southbridge chip to add latency or limit bandwidth as in previous generation designs

• Each FlexLOM slot provides two Ethernet or FlexFabric ports via a full-speed x8 PCIe Gen3 link

• Mezzanine slot 1 (Type A) provides four ports of I/O connectivity via a x8 PCIe Gen3 link (16 GB/s)

• Mezzanine slots 2/3 (Type B) provide four ports of I/O connectivity via a x16 PCIe Gen3 link each (32 GB/s)

• All I/O card slots are connected to the outside world via Interconnect Modules (ICMs) located in the rear of the HPE Integrity Superdome X enclosure

All blades have identical port mapping to the ICM bays, enabling you to build enclosure-internal network interconnections between hard partitions (HPE nPars) without any external cabling. Matching such extensive I/O capabilities to the processing power and memory capacity allows you to maintain performance when scaling from 2 to 16 sockets; the sketch below works through the per-blade link arithmetic.
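The per-slot figures above are rounded bidirectional numbers for PCIe Gen3 links. The illustrative Python sketch below works through the raw link arithmetic; the resulting total comes out somewhat above the quoted 100 GB/s per blade, which is evidently a more conservative system-level figure:

```python
# Per-blade PCIe Gen3 link arithmetic (illustrative). Gen3 runs at
# 8 GT/s with 128b/130b encoding, i.e. ~0.985 GB/s per lane per direction.
GB_PER_LANE_EACH_WAY = 8e9 * (128 / 130) / 8 / 1e9

def link_bw(lanes):
    """Bidirectional bandwidth of a PCIe Gen3 link, in GB/s."""
    return 2 * lanes * GB_PER_LANE_EACH_WAY

# Two FlexLOMs (x8), mezzanine 1 (x8), mezzanines 2 and 3 (x16 each):
slots = [8, 8, 8, 16, 16]
total = sum(link_bw(lanes) for lanes in slots)

print(round(link_bw(8), 1), round(link_bw(16), 1), round(total, 1))
# -> 15.8 (x8), 31.5 (x16), 110.3 GB/s raw per blade
```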

BL920s Gen9 Server Blade Management Each blade includes a Platform Controller Hub (PCH) chip, a Platform Dependent Hardware (PDH) controller, and an Integrated Lights Out (iLO) management engine to provide all the features required for HPE Integrity Superdome X server management. The PCH chip provides initial reset and real-time clock functionality. The PDH controller provides the bulk of the hard partitioning capabilities and error handling. The iLO 4 hardware and firmware provide remote server management capabilities over an Ethernet management network. The PDH controller and iLO processor of each blade interface directly with the SD Onboard Administrator to provide the processing power needed to manage a large and flexible server like the HPE Integrity Superdome X.


New sx3000 Crossbar Fabric Module The sx3000 Crossbar (Xbar) Fabric Module (XFM) has been redesigned to increase blade crossbar bandwidth by 33% (from ~75 GB/s to ~100 GB/s of bidirectional bandwidth per blade). The sx3000 XFM is the heart of the system fabric and is based on the HPE custom-designed sx3000 Xbar ASIC. Each XFM connects to all BL920s Gen9 Server Blades installed in the system enclosure, and brings the differentiating capability of hard partitioning (HPE nPars) to the x86 server space. In addition, the sx3000 Xbar provides a high-bandwidth, low-latency coherent path between all the processors defined within each HPE nPar. Hard partitioning provides firewalls with true electrical isolation for fault tolerance, allowing each nPar to be serviced independently and without interruption to other nPars residing within the enclosure. Each 20-port, non-blocking Xbar ASIC now has 16 connections (previously 12) to and from the blades installed in the enclosure, and there are four XFMs in a 16-socket system. In addition to interconnecting blades, the sx3000 chipset is designed with advanced RAS features that can tolerate the loss of channels, links, or even an entire XFM and still keep all the nPars defined within the system running.

The sx3000 chipset maintains system cache coherency with a Remote Ownership Tag Cache using on-chip SRAM. By tracking remote ownership for the cache memory that the blade hosts, the scalability of the Xeon E7 v4 processors can be extended without compromising the latency of socket-local or buddy-socket memory accesses. This coherency scheme is a critical factor in HPE Integrity Superdome X’s ability to achieve near-linear scaling from 2 sockets all the way up to 8 sockets. Typical glue-less architectures do not have this “directory” cache, and their performance scaling is limited by broadcast snooping; the sketch below illustrates the difference.
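The difference between broadcast snooping and a directory lookup is easy to see in miniature. The sketch below is illustrative Python only, not the sx3000 coherency protocol; it simply contrasts asking every socket with consulting a remote-ownership record:

```python
# Illustrative contrast: broadcast snooping vs. a directory lookup.
SOCKETS = range(16)

def broadcast_snoop(line, caches):
    # Glue-less designs ask every socket whether it holds the line;
    # snoop traffic grows with socket count and limits scaling.
    return [s for s in SOCKETS if line in caches.get(s, set())]

def directory_lookup(line, directory):
    # A remote ownership tag cache records which socket owns the line,
    # so at most one socket needs to be consulted.
    owner = directory.get(line)       # None -> no remote copy exists
    return [owner] if owner is not None else []

caches = {3: {0x1000}, 7: {0x2000}}   # socket -> lines held remotely
directory = {0x1000: 3, 0x2000: 7}    # line -> remote owner

assert broadcast_snoop(0x1000, caches) == directory_lookup(0x1000, directory)
```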

Superdome Onboard Administrator Integrity Superdome X is managed through the Superdome Onboard Administrator (SD OA). Based on the c-Class Onboard Administrator, the SD OA has expanded functionality, including the ability to manage system partitioning and to track component inventory and health. While each blade has its own Integrated Lights Out (iLO) processor, the SD OA collectively manages all blades with the aid of each blade’s PDH controller, avoiding the need to “drill down” when managing individual nodes.

The SD OA’s built-in Analysis Engine constantly analyzes all hardware to detect faults, predict failures, and initiate automatic recovery actions as well as notifications to administrators and to HPE Insight Remote Support. HPE Insight Remote Support connects to the Superdome X OS and the SD OA to monitor for problems and assist with troubleshooting. It works with the Analysis Engine and can connect to the HPE back end for automatic notification to HPE Support of any problem with the system.

RAS HPE Integrity Superdome X servers offer RAS features in key hardware subsystems—processor, memory, and I/O—and provide the ideal foundation for mission-critical Linux® operating environments. HPE mission-critical Superdome X running the Linux operating environment reflects the growing emphasis on availability, ensuring that your business is always on through a layered approach that offers application, file system, and operating system protection. Together, the Superdome X infrastructure and the Linux operating environment provide a comprehensive RAS strategy that covers all layers, from application to hardware.

Fault management strategy Superdome X servers fully realize HPE’s design strategy for systems handling mission-critical workloads, which is to implement, when applicable, a four-stage RAS strategy of detection, logging, analysis, and repair (Figure 5).

Figure 5. HPE hardware RAS strategy

This strategy keeps customer workloads up and their data available even in the presence of faults. In the rare event of an unrecoverable fault, the strategy still provides detection and containment to prevent corrupt data from reaching the network and permanent storage.


When faults do occur that require support action, accurate diagnosis of the fault is critical to determining what is wrong and how to fix it right the first time. Below are some of the design keys for the diagnostic abilities built into every Superdome X server:

• Minimize time to repair

• Capture enough data to diagnose failures the first time

• Allow system to run after failure for complete error logging

• Ability to diagnose all system components (software, firmware, and hardware) via complete error logging

• Field Replaceable Unit (FRU) level granularity for repair

• Component level granularity for self-healing

Firmware first Part of HPE Integrity Superdome X’s comprehensive strategy for fault management is “Firmware First” problem diagnosis. With Firmware First, firmware with detailed knowledge of the Superdome X system is first on the scene of a problem, to quickly and accurately determine what is wrong and how to fix it. The Enhanced Machine Check Architecture (eMCA) of Intel Xeon E7 processors gives firmware a first look at error logs, so firmware can diagnose problems and take appropriate actions for the platform before the OS and higher-level software become involved. Firmware First covers both correctable and uncorrectable errors, and it gives firmware the ability to collect error data and diagnose faults even when the system processors have limited functionality. Firmware First enables many platform-specific actions for faults, including predictive fault analysis for system memory, CPUs, I/O, and interconnect.
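The ordering Firmware First establishes, with firmware logging and acting before the OS is told, can be sketched as a short control flow. This is illustrative Python; every function name is a descriptive placeholder, not an actual firmware interface:

```python
# Illustrative Firmware First flow (placeholder names throughout).
def read_emca_error_logs(event):
    # eMCA gives platform firmware the first look at the raw error logs.
    return {"source": event, "severity": "correctable"}

def analyze(log):
    # Platform-aware diagnosis happens before the OS sees anything.
    return {"action": "predictive-deconfigure", "target": log["source"]}

def take_platform_actions(diagnosis):
    print("platform action:", diagnosis["action"], "on", diagnosis["target"])

def notify_os(diagnosis):
    print("OS notified after firmware handling:", diagnosis)

def handle_machine_check(event):
    diagnosis = analyze(read_emca_error_logs(event))
    take_platform_actions(diagnosis)   # firmware acts first...
    notify_os(diagnosis)               # ...then higher layers are told

handle_machine_check("DIMM 12")
```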

RAS differentiators While features such as hot-swap n+1 power supplies and single-/multi-bit memory error correction have become common in the industry, a number of RAS differentiators set Superdome X servers apart from other industry-standard servers. Superdome X servers offer several types of RAS differentiators:

• Self-Healing capabilities

• Processor RAS

• Memory RAS

• Platform RAS

• Application RAS

• OS RAS

Self-healing When faults do occur, Superdome X provides several mechanisms to react so that unplanned downtime is avoided. Primary means of downtime avoidance include disabling failed or failing components during boot and deactivating failed or failing components during run-time. Taking failed or failing hardware offline allows the system to remain running with healthy hardware until the system can be serviced. Such self-healing capabilities avoid unplanned system downtime.

Deconfiguration of failed or failing components Superdome X provides the ability to deconfigure components so that any single hardware fault can be tolerated.

• Memory DIMM and CPU core deconfiguration: Reactive and predictive fault analysis allows for deconfiguration of failed or failing memory DIMMs and CPU cores so that the system can remain available with only healthy memory DIMMs and CPU cores in use.

• Blade deconfiguration: Serious faults on Superdome X blade hardware can be proactively dealt with through blade deconfiguration capabilities. Any multi-blade configurations of Superdome X can survive blade hardware faults by deconfiguring the faulty blade and allowing the remaining blades to boot with healthy hardware.


Run-time deactivation of components before failure Faults in many areas of Superdome X servers result in run-time deactivation of resources to avoid continued use of failing components. This level of self-healing incurs zero system downtime and allows repair actions to be deferred to the next planned downtime event. System interconnects and the memory subsystem provide self-healing capabilities with deactivation of failing resources when needed:

• System Crossbar Fabric self-healing with link width reduction, online port deactivation, and alternate routing for fabric connections

• QuickPath Interconnect (QPI) link width reduction at run-time

• Memory Interconnect link width reduction at run-time

• Enhanced Memory Double Device Data Correction (DDDC) to tolerate 2 failed devices on a DIMM

Processor RAS Superdome X servers use Intel Xeon E7 v4 processors. These processors include extensive capabilities for detecting, correcting, and reporting hard and soft errors. Because these RAS capabilities require firmware support from the platform, they are often not supported in other industry-standard servers. HPE Integrity Superdome X implements all of the RAS functionality provided in the Xeon E7 v4 series processors, including:

• Corrupt data containment

• PCIe Live error containment

• Viral error containment

• Processor interconnect fault resiliency

• Advanced MCA recovery

Corrupt Data Containment Superdome X servers with Intel Xeon E7 v4 processors enable Corrupt Data Containment mode, which provides detection and possible recovery of uncorrectable errors. When Corrupt Data Containment mode is enabled, the producer of the uncorrected data does not signal a Machine Check Exception. Instead, the corrupted data is flagged with an “Error Containment” bit. Once the consumer of the data receives the data with the “Error Containment” bit set, the error is signaled and handled by firmware and the operating system. Several recovery flows are possible, including Uncorrected No Action (UCNA), Software Recovery Action Optional (SRAO), and Software Recovery Action Required (SRAR). The mission-critical Superdome X infrastructure and the Linux operating environment support all of these Corrupt Data error flows and provide end-to-end hardware/firmware/software error recovery where possible.

PCIe Live Error Recovery Containment Uncorrectable errors in a server’s PCIe subsystem can propagate to other components, resulting in a crash of the partition, if not the entire server. To minimize this risk in Superdome X servers, HPE implemented specific firmware features leveraging Intel’s Live Error Recovery (LER) mechanism, which traps errors at a root port to prevent error propagation. LER containment allows the platform to detect a subset of Advanced Error Reporting (AER) and proprietary PCIe errors in the inbound and outbound PCIe path. When a PCIe error occurs, LER contains the error by stopping I/O transfers, preventing corrupted data from reaching the network and/or permanent storage. LER containment also avoids propagation of the error and an immediate crash of the machine. In parallel with this error containment, HPE firmware is informed, and in turn the OS and upper-layer device drivers are made aware of the error. HPE’s contributions to the Advanced Error Reporting PCIe implementation allow Linux to better report the details of such errors in the Linux syslog files and to cooperate with device drivers to resume from recoverable PCIe errors. See the video demo at vrp.glb.itcs.hpe.com/SDP/Content/ContentDetails.aspx?ID=4376 for details on Superdome X Live Error Recovery containment, recovery, and detailed error reporting. Superdome X’s innovative solution for Live Error Recovery is not available on typical Xeon processor-based systems.

Viral Error Containment Superdome X servers further expand protection of customer data from corruption by enabling Viral Error Containment (VEC) mode in the E7 processor and the scalable server chipset. While Corrupt Data Containment mode enables containment for data errors, VEC mode does the same for address, control, or miscellaneous fatal errors. The goal of VEC mode is to contain the error and prevent it from being committed to the network and/or permanent storage. VEC mode takes additional steps in hardware to detect and signal errors beyond those that impact a single data packet. When the system enters VEC mode, all transactions that are possibly corrupted are marked as contaminated so that no corrupt data can reach permanent storage.


Processor interconnect fault resiliency All processor interconnects, including QuickPath Interconnect (QPI), the Memory Interconnect, and PCIe, have extensive cyclic redundancy checks (CRCs) to correct data communication errors on the respective busses. They also have self-healing mechanisms that allow for continued operation through a hard failure such as a failed link.

With QPI and memory link self-healing, full-width links will automatically be reduced to half-width when persistent errors are recognized on the QPI or memory link. This capability means that operation can continue until repairs can be made. PCIe links also support width reduction and bandwidth reduction when full width/full speed operation is not possible.

Advanced MCA Recovery Advanced MCA Recovery is a technology that combines processor, firmware, and operating system features. It allows errors that cannot be corrected within the hardware alone to be optionally recovered from by the operating system. Without MCA recovery, the system would be forced into a crash. With MCA recovery, the operating system examines the error, determines whether it is contained to an application, a thread, or an OS instance, and then determines how to react to that error.

Xeon E7 v4 processors expand upon previous Xeon E7 processor capabilities for advanced error recovery. They now provide the ability to recover from uncorrectable memory errors in the instruction and data execution path (Software Recovery Action Required, or SRAR) in addition to handling non-execution-path uncorrectable memory errors (Software Recovery Action Optional, or SRAO). Expanding on E7 processor memory error recovery, including SAP HANA application recovery (Intel, 2011), HPE has done extensive development and testing of execution-path recovery. See a demo of this feature at: vrp.glb.itcs.hpe.com/SDP/Content/ContentDetails.aspx?ID=4407

When certain uncorrectable errors are detected, the processor interrupts the OS or virtual machine and passes the address of the error to it. The OS resets the error condition, marks the defective location as bad so that it will not be used again, and continues operation.
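As a rough illustration of the recovery classes described above, the following sketch (illustrative Python, not actual kernel code) shows the kind of decision an operating system makes when Advanced MCA Recovery hands it an error:

```python
# Illustrative OS-side MCA recovery decision using the error classes
# described in this paper (UCNA, SRAO, SRAR).
def recover(error):
    if error["class"] == "UCNA":
        return "log and continue"                 # no action required
    if error["class"] == "SRAO":
        return "retire the page and continue"     # action optional
    if error["class"] == "SRAR":
        if error["scope"] == "application":
            return "kill only the affected process"  # error contained
        return "crash the OS instance"            # cannot be contained
    return "crash the OS instance"

print(recover({"class": "SRAR", "scope": "application"}))
```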

Memory RAS Main memory failures have been a significant cause of hardware downtime. Superdome X servers use several technologies to enhance the reliability of memory, including proactive memory scrubbing and enhanced Double Device Data Correction (DDDC) +1. Additionally, HPE Smart Memory DIMMs are qualified to provide both performance and quality.

Proactive memory scrubbing To better protect memory, systems such as Integrity Superdome 2 and Integrity Superdome X implement a memory scrubber, a hardware function that actively scans through memory looking for errors. When an error is discovered, the scrubber rewrites the corrected data back into memory, finding and removing memory errors before they can accumulate.
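Conceptually, the scrubber behaves like the loop below. This is an illustrative Python sketch only; the real scrubber is a hardware function, and ECC correction is faked here with a stored reference copy:

```python
# Illustrative scrub loop: scan memory, rewrite corrected data.
memory = {0x0: 0xAB, 0x1: 0xCD}   # address -> stored data
golden = dict(memory)             # stands in for ECC-corrected values

memory[0x1] ^= 0x01               # inject a transient single-bit flip

for addr in memory:               # scan through memory
    corrected = golden[addr]      # ECC corrects the value on read
    if memory[addr] != corrected:
        memory[addr] = corrected  # write the corrected data back

assert memory == golden           # error removed before it can accumulate
```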

Enhanced Double Device Data Correction (DDDC) +1 The industry standard for memory protection is a single-error-correcting, double-error-detecting (SECDED) code for data errors. Additionally, many servers on the market provide Single Device Data Correction, also known as Chip Sparing or Chipkill.

Single-device correction capabilities protect the system from any single-bit data errors within a memory device, whether they originate from a transient event such as a radiation strike or from persistent faults such as a bad dynamic random access memory (DRAM) device. However, Single-Chip Sparing will generally not protect the system from a failed DRAM combined with an additional single-bit error; though detected, such errors will cause a system crash.

Combined with memory scrubbing, ECC prevents multiple-bit transient errors from accumulating. However, persistent errors can still put the memory at risk of multiple-bit errors.

Double Device Data Correction (DDDC) +1 in Superdome X servers addresses this problem. DDDC +1 technology determines when the first DRAM in a rank has failed, corrects the data, and maps that DRAM out of use by moving its data to spare bits in the rank. Once this is done, single-device correction is still available for the corrected rank. Thus, a total of two entire DRAMs in a rank of dual in-line memory modules (DIMMs) can fail and the memory is still protected with ECC. The system is essentially tolerant of a DRAM failure on every DIMM while still maintaining ECC protection. “+1” correction provides an additional layer of protection for single-bit errors even in the presence of two entire device failures.


DDDC +1 drastically improves system uptime because fewer failed DIMMs need to be replaced. This technology delivers up to a 17x reduction in DIMM replacements versus systems that use only Single-Chip Sparing technologies. Furthermore, DDDC +1 significantly reduces the chance of memory-related crashes compared to systems that have only Single-Chip Sparing capabilities.

Although DDDC +1 is based on an Intel® Xeon® E7 processor feature, Superdome X has enhanced the feature with specific firmware and hardware algorithms. HPE enhanced DDDC +1 provides a memory RAS improvement over the Intel base code and reduces memory outage rates by 33% to 95% over standard x86 offerings.

New Memory RAS introduced with Intel® Xeon® processor E7 v3 E7 v3 (and v4) processors and their DDR4 memory subsystem provide two new memory RAS features not available in previous E7 versions. These features are described below.

DRAM Bank Sparing To better target the most likely memory failure modes at the DRAM level, DRAM Bank Sparing provides the ability to move data away from a faulty bank. DRAM Bank Sparing is automatically enabled as part of HPE enhanced DDDC +1 and provides up to 33% more error resiliency compared to E7 v2 enhanced DDDC +1.

DDR4 Command/Address Parity Error Retry DDR4’s Command/Address bus is parity protected, and the E7 v4 and v3 integrated Memory Controller and Memory Buffer provide detection and logging of parity errors. In previous E7 platforms, all Command/Address bus parity errors were fatal events that caused an OS crash. The v4 and v3 Memory Controller helps prevent crashes by automatically retrying any transaction that reports a parity error. Command/Address Parity Error Retry, HPE enhanced DDDC +1 (with Bank Sparing), and memory interconnect self-healing work in harmony to provide resiliency to errors across all memory interfaces and components.

Platform RAS The Integrity Superdome X offers built-in platform RAS features including clock redundancy, System Fabric RAS, a fault-tolerant fabric, and partitioning with error isolation.

Clock redundancy The fully redundant clock distribution circuit of Superdome X extends from the clock source through the distribution to the blade itself. The system clocks are driven by two fully redundant Hardware Reference Oscillators (HSOs), which support automatic, “glitch-free” failover and reconfiguration and are hot-pluggable under all system operating conditions (Figure 6).

During normal operation, the system selects one of the two HSOs as the source of clocks for the platform. If only one HSO is installed, its output is used (assuming it is of valid amplitude). If both HSOs are plugged in and both outputs are valid, one of the two is selected by the clock switch logic on the blade. If one of the HSO outputs fails to have the correct amplitude, the clock switch logic uses the valid HSO as the source of clocks and sends an alarm to the system indicating which HSO failed. A green LED is lit on the good HSO and a yellow LED on the failed HSO. The failed clock source can then be repaired through a hot-plug operation.

Figure 6. RAS clock failover
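The selection behavior just described amounts to a small decision rule. The sketch below is illustrative Python, not the actual clock-switch circuit:

```python
# Illustrative HSO selection: prefer the current source, fail over to
# the other oscillator if an output amplitude is invalid, raise alarms.
def select_clock(hso_valid, preferred="A"):
    if hso_valid.get(preferred):
        good = preferred
    else:  # glitch-free failover to any remaining valid oscillator
        good = next((k for k, ok in hso_valid.items() if ok), None)
    alarms = [k for k, ok in hso_valid.items() if not ok]  # yellow LED
    return good, alarms

print(select_clock({"A": True, "B": True}))    # -> ('A', [])
print(select_clock({"A": False, "B": True}))   # -> ('B', ['A'])
```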


System Fabric RAS HPE Integrity Superdome X inherits its crossbar interconnect from Integrity Superdome 2 and implements the HPE Mission-Critical infrastructure strategy of attacking IT sprawl with a standards-based modular architecture. The heart of the Superdome X architecture is the fault-tolerant HPE Crossbar Fabric, consisting of passive midplanes with end-to-end retry and link failover functionality. HPE’s innovative scalable enterprise system chipset includes extensive self-healing, error-detection, and error-correction capabilities.

Fault-tolerant fabric Superdome X sets the industry standard for fault-tolerant fabric resiliency. The basics of the fabric are high-bandwidth links providing multiple paths, and a packet-based transport layer that guarantees delivery of packets through the fabric. The physical links contain availability features such as link-width reduction, which essentially allows individual wires or I/O pads on devices to fail; the links are then reconfigured to eliminate the bad wire. Strong CRCs are used to guarantee data integrity.

Beyond the reliability of the links themselves, the next stage of defense is end-to-end retry. When a packet is transported, the receiver of the packet is required to send acknowledgement back to the transmitter. If there is no acknowledgement, the packet is retransmitted over a different path to the receiver. Thus, end-to-end retry guarantees reliable communication for any disruption or failure in the communication path including bad cables and chips.
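A minimal sketch of that acknowledge-or-retransmit behavior follows (illustrative Python; the path names are hypothetical, and the real transport is implemented in the sx3000 hardware):

```python
# Illustrative end-to-end retry over redundant fabric paths.
def send_with_retry(packet, paths, transmit):
    """Try each path until the receiver acknowledges the packet."""
    for path in paths:
        if transmit(packet, path):   # True = acknowledgement received
            return path              # delivery succeeded via this path
    raise RuntimeError("no healthy path to receiver")

# Suppose the path through one XFM has failed; retry reroutes traffic.
healthy = {"xfm3"}
used = send_with_retry("pkt", ["xfm2", "xfm3"],
                       lambda pkt, path: path in healthy)
print(used)  # -> xfm3
```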

The system crossbar provides unprecedented containment between partitions. High reliability for single-partition systems is accomplished through high-grade parts for the crossbar chipset and fault-tolerant communication paths between Superdome X blades and I/O. Furthermore, unlike other systems with partitioning, HPE provides specific hardware dedicated to guarding partitions from errant transactions generated on failing partitions.

Partitioning and Error Isolation Resiliency is a prerequisite for true hard partitions. HPE nPartitioning (nPar) is a hard partition technology providing electrical isolation, enabling you to configure a single blade-based server as one large server or as multiple, smaller, independent servers. Each HPE nPar has its own independent processors, memory, and I/O resources of the blades that make up the partition. Resources may be removed from one partition and added to another by using commands that are part of the System Management interface, without having to manipulate the hardware physically.

Many systems use a shared backplane, where all blades are competing for the same electrical bus (Figure 7A). This raises the potential for a number of shared failure modes. For example, high queuing delays and saturation of the shared backplane limit performance scaling. On the Superdome X system (Figure 7B), the fault-tolerant Xbar fabric logically separates the physical partitions, providing performance and isolation for a more reliable and scalable system.

Figure 7. Hard partition error containment


Application level RAS HPE Serviceguard for Linux (SGLX) monitors the availability and accessibility of your critical IT services including databases, standard applications, and custom applications. Those applications—and everything they rely upon to do their job—are meticulously monitored for any fault in hardware, software, operating system, virtualization, storage, or network. When a failure or threshold violation is detected, HPE SGLX automatically and transparently resumes your normal operations in mere seconds by restarting the service in the right way and in the right place to enable improved performance.

HPE Serviceguard Storage Management Suite (SMS) also allows you to use a clustered file system to achieve the highest levels of availability and outstanding performance for applications, with the benefit of improved manageability and scalability. Furthermore, you can extend the comprehensive protection of HPE Serviceguard for Linux beyond the walls of your data center. HPE Serviceguard Metrocluster for Linux and HPE Serviceguard Continentalclusters for Linux offer robust disaster recovery mechanisms for geographically dispersed clusters, enabling your business to remain online even after a catastrophic event.

OS level RAS The HPE mission-critical Superdome X environment provides an unparalleled set of features to detect and recover from faults. Many years of collaboration among the processor, firmware, OS, and application design teams have led to the delivery of several advanced error recovery capabilities. Specifics of OS error recovery for memory and PCIe faults are described in the RAS differentiators section.

Management The Integrity Superdome X offers extensive management capabilities through both built-in management components and additional management resources.

Built-in management components The Integrity Superdome X offers the following built-in management components:

• Superdome Onboard Administrator (SD OA)

• Integrated Lights Out (iLO)

• Insight Display

SD OA The SD OA offers a built-in, always-available platform and partition management system. While based on the c-Class Onboard Administrator, the SD OA has expanded functionality such as the ability to manage partitions, gather detailed knowledge of component inventory and health, and evaluate system faults with an Analysis Engine.

The SD OA provides a user-friendly experience (Figure 8) and makes managing the Superdome X much easier by centralizing the control and building the management into the hardware and firmware of the system.


Figure 8. SD OA Complex Overview GUI screen

The SD OA provides the following key features:

• Error Analysis Engine

• Onboard Firmware Manager

• Onboard Partition Manager

• Choice of user interfaces:

– Command Line Interface (CLI) for easy scripting and power user convenience

– Graphical User Interface (GUI) for intuitive operation

• A console for each partition

The Error Analysis Engine is constantly analyzing all hardware for faults. Based on detected errors, the Analysis Engine can predict failures and initiate automatic recovery actions as well as notifications to Administrators and to HPE Insight Remote Support.

Onboard Firmware Manager can scan a partition and report components with incompatible firmware versions. Firmware mismatched because of part replacements or system upgrades, whether on a single HPE nPar or across an entire Superdome X, can be updated to a consistent level with just a click of a button. Partitions with consistent firmware levels are fully validated by the system developers for the most reliable operation. Superdome firmware is managed as one version of “complex firmware” plus one version of nPartition firmware for each nPartition, similar to a server BIOS. The complex firmware covers all the infrastructure components of the enclosure, including the SD OA, the iLOs, and the auxiliary management processors that maintain fabric and enclosure health. Having a single installation and version that is upgradable without affecting the state of running nPartitions greatly simplifies firmware management and enhances platform reliability.

Onboard Partition Manager (Figure 9) is implemented entirely in firmware. There are no dependencies on additional software tools and no need for an external management station or special hypervisor to build your desired partition configuration. The result is faster and easier partition configuration and partition start/stop.


Figure 9. SD OA Modify nPartition GUI screen

Onboard Partition Manager can utilize “ParSpecs,” which are a way to save, create, and build partitions from resource definitions. ParSpec definitions allow you to have overlapping resources as long as the partitions booted do not all claim the same resources at the same time. One way to use ParSpecs is to create one set for “end of month” jobs and another set for “daily work.” ParSpec commands are built into the SD OA CLI; the sketch after this paragraph illustrates the overlap rule.
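The overlap rule can be expressed as a simple resource check. The sketch below is illustrative Python with hypothetical partition names and blade numbers; it is not ParSpec syntax or SD OA CLI output:

```python
# Illustrative ParSpec overlap rule: saved definitions may share blades,
# but partitions booted at the same time may not claim the same blade.
def can_boot_together(parspecs):
    claimed = set()
    for name, blades in parspecs.items():
        if claimed & blades:
            return False      # two active partitions claim a blade
        claimed |= blades
    return True

daily_work = {"npar1": {1, 2}, "npar2": {3, 4}}
end_of_month = {"npar_batch": {1, 2, 3, 4}}   # overlaps daily_work

print(can_boot_together(daily_work))                      # -> True
print(can_boot_together({**daily_work, **end_of_month}))  # -> False
```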

The rich set of capabilities available through the SD OA’s secure network interface enables data-center-level management tools such as HPE Insight Remote Support and HPE Insight Online to add remote management capabilities.

Insight Display The Insight Display (Figure 10) provides a graphical representation of the physical configuration of the Superdome X enclosure. The Insight Display indicates when any device, configuration, power, or cooling errors are detected. A display highlighted in green indicates no errors; a display highlighted in amber indicates that an error has been detected.

Figure 10. SD OA Insight Display


iLO on Superdome Each blade on the Superdome platform has an iLO management engine that drives the virtual media and virtual KVM features. One iLO is active for any given nPartition, with its programmatic LAN interfaces in use but not its web GUI. The flexible, aggregated nature of Superdome nPars and the SD OA allows the SD OA to provide all the management interaction necessary for working with the servers created within the Superdome enclosure. The SD OA has full inventory and status information by nPartition as well as by blade, making it possible to launch the console and virtual media directly from the SD OA for any particular nPar, using the iLO management engine on the blade.

Additional management resources Additional management resources such as HPE Insight Remote Support, HPE Insight Online, and HPE Smart Update Manager offer efficient and comprehensive monitoring and control of the Superdome X from virtually anywhere.

HPE Insight Remote Support HPE Insight Remote Support connects to the Superdome X OS and the SD OA to monitor for problems and assist with troubleshooting. It works with the embedded Analysis Engine in the SD OA and can connect to the HPE back end for automatic notification to HPE Support of any problem with the system. Various support contract levels are available for Superdome X. For more information, go to the HPE Insight Remote Support website.

HPE Insight Online HPE Insight Online provides one-stop, secure access to the information you need to support Superdome X with standard warranty and contract services. It is a new addition to the HPE Support Center portal for IT staff who deploy, manage, and support systems. Through the HPE Support Center, Insight Online can automatically display devices remotely monitored by HPE. It provides the ability to easily track service events and support cases, view device configurations, and proactively monitor your HPE contracts and warranties. This allows your staff or an HPE Authorized Services Partner to be more efficient in supporting your HPE environment, from anywhere and at any time. HPE Insight Online also provides online access to reports provided by HPE Proactive Care services. The embedded management capabilities built into the Superdome X server are designed to integrate seamlessly with Insight Online and Insight Remote Support 7.0 (and later).

HPE Smart Update Manager HPE Smart Update Manager (SUM) is HPE’s firmware management and update tool for enterprise environments. It can remotely update all HPE blade firmware as well as firmware for other HPE enterprise products. HPE SUM recommends which firmware needs updating, performs dependency checking, installs updates in the correct order, and provides reporting capabilities through an easy-to-use web interface, with both CLI and GUI available.



Conclusion Through our very close association with Intel during processor development, we have enabled Superdome X to fully exploit the performance and RAS functionality built into the CPU. Our Platinum membership in the Linux Foundation yields a high degree of kernel enablement to ensure scalability, reliability, and performance at the OS level. And the long-standing partnership and technical collaboration between Hewlett Packard Enterprise and Microsoft has resulted in a combined Superdome X and Windows solution delivering the scalability and reliability necessary to address the enterprise’s most demanding mission-critical requirements. The result is the ground-breaking performance, robust RAS, and flexible manageability that set Integrity Superdome X apart as the x86 scale-up solution for mission-critical environments.

Resources, contacts, and additional links Integrity Superdome X information hpe.com/servers/superdomex

HPE Insight Remote Support hpe.com/services/getconnected

HPE Insight Online hp.com/us/en/products/servers/solutions.html?compURI=1487547#

HPE Smart Update Manager hpe.com/info/hpsum

Learn more at hpe.com/info/superdomex
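© Copyright 2015–2016 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice. The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein.

Microsoft, Windows and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. SAP and SAP HANA are registered trademarks of SAP SE in Germany and other countries. Oracle is a registered trademark of Oracle and/or its affiliates. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. Intel, Pentium, Intel Inside and the Intel Inside logo are trademarks of Intel Corporation in the U.S. and other countries.

4AA5-6824ENW, June 2016, Rev. 2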