
WHITE PAPER | AUGUST 2015

Replacing the Black-Box: Commercial Off the Shelf (COTS) Virtualization
High Capacity Security, Media and Analytics Solutions Using High Performance Virtual Machines (HPVMs) on Industry Standard Servers

Overview

Black-box or "appliance" products are prolific in industry, and for good reason: they perform. Black boxes provide the tightly coupled functionality, processing, and throughput needed for demanding applications in small, efficient 1U and 2U form factors. The 1U "pizza box" form factor is especially popular for low SWaP (size, weight, and power) applications, and installation is simple and efficient.

At the same time, the disadvantages of the black box approach are also well known:

• Proprietary, closed architectures make it difficult or impossible to add custom features

• Expensive – oftentimes very expensive

• Upgrades may be slow in coming

• End-users are restricted and vendor dependent

• Warranties and RMA policies not on par with major server vendors

This white paper discusses how, using Gazoo’s High Performance Virtual Machines (HPVMs) running on industry standard server architectures, vendors can develop far more powerful applications without the need for fixed-function "black box" appliances.

High Performance Virtual Machines provide unprecedented scales of efficiency. By utilizing hundreds of compute-intensive multicore processors in a single server, Gazoo has changed the game. The result is an order-of-magnitude increase in the number of virtual machines per server, in machine compute power, and in teraflops-per-watt realized across the enterprise.

Contents

Introduction pg. 2

Hardware Requirements pg. 2

Architecture & Data Flow

• Accelerator Only pg. 3

• 3rd Party & Encryption pg. 4

• Security Accelerator pg. 5

• Accelerator + HT pg. 6

Programming Interface pg. 7

CPU vs GPU Overview pg. 8

ISS Example – HP DL380 pg. 9

ISS Capacity Table pg. 10

ISS Example – XstreamCore pg. 11

Summary pg. 12

High Performance Computing


Introduction

For a given application, what if the black box could be replaced by a server? Typically a replacement is possible if a server is custom engineered, and there are many "high performance server" vendors who are more than willing to provide customized hardware.

But what if the black box can be replaced entirely by an industry standard server (ISS) from HP, Dell, Supermicro or others? What if you could go online, order a 1U or 2U server, a few off-the-shelf cards, add state-of-the-art software and achieve "black box" performance? All without violating any ISS vendor reliability or MTBF warranties?

Thanks to the evolution of high performance computing (HPC) and the proliferation of very stable, off-the-shelf, multicore CPU hardware, Gazoo has made this possible. This white paper discusses how to leverage current hardware and server infrastructure to deliver HPC supporting virtually any compute intensive application.

Hardware/Software Requirements

The following are typical hardware and software requirements for HPC in an Industry Standard Server:

• Minimum of four (4) x86 cores. A typical motherboard might be an 8-core Sandy Bridge

• One or more CIM® (Compute Intensive Multicore) accelerators featuring 32 or 64 C66x cores per x8 PCIe card

• Optionally (depending on throughput requirements), one or more high speed network adapters, for example the Mellanox ConnectX-3 card (x8 PCIe)

• Intel® DPDK (Data Plane Development Kit) software (open source)

• Gazoo® advanced software supporting API and OpenMP programming interfaces

Server Architecture and Data Flow: Accelerator Only

Below is a server software architecture block diagram showing a dual Sandy Bridge box (16 total physical cores) and from one to six (6) CIM® accelerator cards (32 to 384 total C66x cores per server).

The diagram's annotations note that control plane software handles signaling, session monitoring and statistics, and resource allocation, while data plane software is dedicated to high performance, outside of virtualization scope, and may include encryption, routing, and load balancing. Each C66x accelerator card combines C66x cores (8 cores and 2 GB DDR3 memory per device) with a voice/video framework; up to six cards are supported (3 per PCIe riser) at 50 W per card. The data plane path runs from the 1 GbE network interfaces through DPDK on dedicated x86 cores and across PCIe to the accelerator cards, while the control plane path stays within Linux.

Figure 1 - Server Architecture and Data Flow, Accelerator Only


In the above diagram, note that network I/O on the accelerator is not used. This is appropriate for applications where data plane throughput and latency, while important, are not the crucial factors, for example media applications such as VoIP, video content delivery, and video analytics. In these applications network I/O is handled by x86 data plane cores, which sit outside the Linux "Petri dish", i.e. outside the Linux (or other general-purpose OS) environment and its virtual machines (VMs). Control plane flow stays within the Linux environment.
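To illustrate what runs on those dedicated x86 data plane cores, below is a minimal sketch of a DPDK poll-mode receive loop. It is not Gazoo's actual data plane code: port and queue initialization (rte_eth_dev_configure and related calls) are omitted, and the per-packet handoff to the accelerator is left as a placeholder comment.

```c
#include <stdlib.h>
#include <stdint.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

int main(int argc, char **argv)
{
    /* Initialize the DPDK Environment Abstraction Layer */
    if (rte_eal_init(argc, argv) < 0)
        exit(EXIT_FAILURE);

    struct rte_mbuf *pkts[BURST_SIZE];
    uint16_t port_id = 0;   /* assumes port 0 was configured earlier */

    /* Poll-mode loop: no interrupts, no kernel network stack */
    for (;;) {
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ... hand the packet to a CIM accelerator card over PCIe,
                   or process it in place on this core ... */
            rte_pktmbuf_free(pkts[i]);
        }
    }
    return 0;
}
```

Because the loop polls rather than waits on interrupts, a data plane core runs at 100% utilization by design; this is the usual DPDK trade of one dedicated core for deterministic packet latency.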

In applications requiring high network throughput and low latency, the network I/O on the accelerator can be used. In these applications signal processing code is offloaded to the accelerator and frees up x86 data plane cores, while control plane data stays within the customary Linux environment. With a higher percentage of application code running on CIM® cores, the issue of programming interface – API or OpenMP – must be carefully considered. This is discussed further below.

Server Architecture - Accelerator + High Throughput

Below is a server software architecture block diagram showing a dual Sandy Bridge box (16 total physical cores), from one to five (5) CIM® accelerator cards (32 to 320 total C66x cores per server), and a 40 GbE network adapter card.

Figure 2 - Server Architecture, Accelerator + High Throughput Network Adapter


Custom and Third Party Data/Control Plane Application Environment

Below is a software architecture block diagram showing a modular adaptation of HPC using custom and/or 3rd party applications and analytics. In this approach, integrated applications use 2 GB of P-space and D-space provided natively by the x86 data plane core environment, scheduled by simple configurations based on application requirements, scaling needs, and run-time priority. One or multiple applications can run on a single core as required. One use case example is high-speed encryption: identity management can be enforced and secured within each core, along with inline encrypt/decrypt capabilities using traditional PKI encryption methodologies and on-board algorithms or other plug-in software designed for advanced encryption (see the sketch after Figure 3). Resource loading, other load balancing capabilities, and network analysis are additional third party application examples.

Figure 3 - Server Architecture, Accelerator + Custom and Third Party Data/Control Plane Application Environment
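To make the inline encrypt/decrypt idea concrete, here is a minimal sketch of encrypting a packet payload in place on a data plane core. It uses the widely available OpenSSL EVP interface purely for illustration; it is not Gazoo's API, and key/IV handling (the PKI exchange and per-session context management mentioned above) is omitted.

```c
#include <stdint.h>
#include <openssl/evp.h>

/* Encrypt a packet payload in place with AES-128-CTR.
   Returns the number of bytes written, or -1 on error. */
int encrypt_payload(const uint8_t key[16], const uint8_t iv[16],
                    uint8_t *buf, int len)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int outl = 0, tmpl = 0, ok = 0;

    if (ctx &&
        EVP_EncryptInit_ex(ctx, EVP_aes_128_ctr(), NULL, key, iv) &&
        EVP_EncryptUpdate(ctx, buf, &outl, buf, len) &&
        EVP_EncryptFinal_ex(ctx, buf + outl, &tmpl))
        ok = 1;

    EVP_CIPHER_CTX_free(ctx);
    return ok ? outl + tmpl : -1;
}
```

CTR mode is a stream cipher, so ciphertext length equals plaintext length and the buffer can be encrypted in place, which suits fixed-size packet buffers on a data plane core.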


Security Accelerator

The security accelerator provides wire-speed processing of 1-Gbps Ethernet traffic for IPsec, SRTP, and 3GPP air interface security protocols. It operates at the packet level, with each packet's associated security context being one of these three types. The security accelerator is coupled with the network coprocessor: it receives a packet descriptor containing the security context in the buffer descriptor, with the data to be encrypted or decrypted in the linked buffer descriptor.

Features provided by the security accelerator and its hardware modules:

– Encryption and Decryption Engine
› 3DES CBC cipher
› AES CTR cipher
› AES CBC cipher
› AES F8 cipher
› AES XCBC authentication
› CCM cipher
› DES CBC cipher
› GCM cipher

– Authentication Engine, providing hardware modules to support keyed (HMAC) and non-keyed hash calculations
› CMAC authentication
› GMAC authentication
› HMAC MD5 authentication
› HMAC SHA1 authentication
› HMAC SHA2 224 authentication
› HMAC SHA2 256 authentication

– Air Cipher Engine
› AES CTR cipher
› AES CMAC authentication
› Kasumi F8 cipher
› Snow3G F8 cipher
› Kasumi F9 authentication

– Programmable Header Parsing module
› PDSP-based header processing engine for packet parsing, algorithm control, and decode
› Carries out protocol-related packet header and trailer processing

– Null cipher and null authentication support for debugging

– True random number generator
› True (not pseudo) random number generation
› FIPS 140-1 compliant (on KeyStone II)
› Non-deterministic noise source for generating keys, IVs, etc.

– Public key accelerator
› High performance public key engine for large vector math operations
› Supports modulus sizes up to 4096 bits

– Context cache module to automatically fetch security contexts

Protocol stack features provided:

– IPsec protocol stack
› Transport mode for both AH and ESP processing
› Tunnel mode for both AH and ESP processing
› Full header parsing and padding checks
› Constructs the initialization vector from the header
› Anti-replay support
› True 64 KB packet processing

– SRTP protocol stack
› F8 mode of processing
› Replay protection
› True 64 KB packet processing

– 3GPP protocol stack (wireless air cipher standards)
› AES counter
› ECSD A5/3 key generation
› GEA3 (GPRS) key generation
› GSM A5/3 key generation
› Kasumi F8
› Snow3G
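The descriptor arrangement described above can be pictured roughly as follows. This is a hypothetical layout for illustration only; the actual KeyStone descriptor formats differ and are defined by the hardware documentation.

```c
#include <stdint.h>

/* Hypothetical illustration: one security context per flow,
   referenced from the packet descriptor the accelerator receives. */
enum sec_proto { SEC_IPSEC, SEC_SRTP, SEC_3GPP_AIR };

struct buf_desc {
    uint8_t  *data;          /* payload to encrypt or decrypt */
    uint32_t  len;
};

struct sec_context {
    enum sec_proto proto;    /* which of the three protocol types */
    uint8_t  cipher_key[32]; /* e.g. for AES CTR or Kasumi F8 */
    uint8_t  auth_key[32];   /* e.g. for HMAC SHA2 256 or CMAC */
    uint8_t  iv[16];
    uint32_t replay_window;  /* anti-replay state */
};

struct pkt_desc {
    struct sec_context *ctx;     /* security context in the descriptor */
    struct buf_desc    *payload; /* linked buffer descriptor with data */
};
```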


Server Data Flow - Accelerator + High Throughput

Below is a server data flow diagram showing a dual Sandy Bridge box (16 total physical cores), from one to five (5) CIM® accelerator cards (32 to 320 total C66x cores per server), and a 40 GbE network adapter card.

Figure 4 - Server Data Flow, Accelerator + High Throughput Network Adapter

Note again the optional data flow paths, through the x86 data plane cores or through accelerator I/O. In this case these paths can be used to augment data plane throughput. Control flow also has some flexibility: for example, if CIM® cores are operating as Hadoop worker nodes, it may be desirable to route some control flow through the high speed network adapter, one example being a network file system (NFS) using AoE (ATA over Ethernet).



Programming Interface

CIM® accelerators offer two types of programming interface:

• API

• OpenMP

The type of interface used depends on the nature of application code. For applications containing relatively short, identifiable sections of compute-intensive code, an OpenMP interface may be suitable.

For applications where (i) entire, complex processes must be offloaded to the accelerator, or (ii) where network I/O must be “right at the network edge”, and not subject to virtualization or other system performance constraints, an API interface, or some combination of API + OpenMP, may be more suitable.

OpenMP offers an easy-to-use programming interface, and is especially effective with multicore accelerators, where source code must be "partitioned" among many heterogeneous CPU cores within the same system. Below is a source code example showing MPEG2 to H.264 transcoding inside OpenMP pragmas.
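The original example was an image and did not survive extraction; below is a minimal sketch in its spirit, offloading one transcode job via OpenMP 4.0 target directives, with hypothetical mpeg2_decode and h264_encode entry points standing in for the actual codec library calls.

```c
#include <stddef.h>

/* Hypothetical codec entry points: stand-ins for the actual
   library calls, which the original figure did not preserve. */
#pragma omp declare target
extern size_t mpeg2_decode(const unsigned char *in, size_t in_len,
                           unsigned char *yuv, size_t yuv_max);
extern size_t h264_encode(const unsigned char *yuv, size_t yuv_len,
                          unsigned char *out, size_t out_max);
#pragma omp end declare target

/* Offload one MPEG2 -> H.264 transcode job to the accelerator
   using OpenMP 4.0 target directives. */
size_t transcode_frame(const unsigned char *in, size_t in_len,
                       unsigned char *yuv, size_t yuv_max,
                       unsigned char *out, size_t out_max)
{
    size_t out_len = 0;

    #pragma omp target map(to: in[0:in_len])      \
                       map(from: out[0:out_max])   \
                       map(alloc: yuv[0:yuv_max])  \
                       map(tofrom: out_len)
    {
        /* decode to raw YUV frames, then re-encode as H.264 */
        size_t yuv_len = mpeg2_decode(in, in_len, yuv, yuv_max);
        if (yuv_len > 0)
            out_len = h264_encode(yuv, yuv_len, out, out_max);
    }
    return out_len;
}
```

Note how the map clauses make the host/accelerator data movement explicit: compressed input moves to the device, encoded output moves back, and the intermediate YUV buffer is allocated on the device only.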

Figure 5 – Video Transcoding Source Code Example Using OpenMP


CPU vs GPU Overview

CPU and GPU devices are constructed from fundamentally different chip architectures. Each is very good at certain things and not so good at others, and these strengths and weaknesses are mostly opposites, or complementary. In general:

CPUs tend to be good at complex algorithms that require random memory access, non-uniform treatment of data sets, unpredictable decision paths, and interface to peripherals (network I/O, PCIe, USB, etc).

GPUs tend to be good at well-defined algorithms that operate uniformly on large data sets, accurate and very fast math, and graphics applications of all types.

Neither CPUs nor GPUs provide a panacea for complex computing, as neither is fundamentally superior to the other (in direct contrast to prevailing marketing hype). People tend to forget that top global semiconductor manufacturers are all on the same technology curve -- which of course makes sense, as they all use state-of-the-art semiconductor manufacturing technology. If you look closely at the two key factors that form the basis for Moore's Law, performance and power consumption (the chip metric is GFlops/Watt), there is very little difference between Intel, Nvidia, Texas Instruments, Xilinx, etc. What you will find are differences in corporate practice and marketing culture, ingrained over very long periods of time -- 30 years or more -- that make one manufacturer or another more adept at serving certain market segments, with advantages (or disadvantages!) in package size, memory bandwidth, on-chip integrated peripherals, programmability, etc.

In comparing the two accelerator diagrams described below, some obvious differences and similarities stand out:

GPU cores are sometimes called "CUDA cores", in reference to Nvidia's programming language. It's not easy to compare CPU and GPU cores (apples and oranges). Maybe the easiest way to think about it is (i) for any given math calculation, a GPU core can almost always do it much faster, and (ii) GPU cores do not run arbitrary C/C++, Java, or Python code, so they're not programmable in the conventional sense

A GPU accelerator can bring far more processing force to bear on massively parallel problems. Examples include graphics, bitcoin mining, climate simulations, DNA sequencing -- any problem where the data set can be subdivided into "regions", such that the results of one region do not depend on others

A CPU accelerator typically has its own NIC (or more than one), which can provide advantages in reduced latency and "data localization" -- bringing the compute cores closer to the data. Onboard NICs are typically not found on GPU accelerators as GPU cores are not designed to run device drivers, TCP/IP stack, etc.

Both types of accelerators take full advantage of high performance PCIe interfaces found in modern servers, including multiple PCIe slots and risers, accessibility to DPDK cores, and excellent software support in Linux

[Diagram: c66x multicore CPU accelerator, shown for a CIM-64 card with 8 cores per CPU and 2 GB memory per CPU. All CPU cores have NIC access.]

[Diagram: GPU accelerator, shown for a Kepler K80 card with 13 Streaming Multiprocessors (SMs) per GPU, 192 CUDA cores per SM, and 12 GB memory per GPU. A GPU can have literally 1000s of "CUDA cores".]


ISS Example – HP DL380

HP ProLiant DL380 series servers are economical, workhorse machines. In a 2U configuration with dual Sandy Bridge CPUs (16 cores), these servers provide a cost-effective mix of performance, efficiency, and relatively small size. However, they present limitations for the use of PCIe cards:

• Riser board power consumption is limited to 150 W

• Riser cards are designed for single-slot PCIe cards (i.e. the height of the card is about 0.65", or the width of a standard PCIe slot)

These limitations tend to rule out GPU boards, which, because of their two-slot width and 300 to 400 W power consumption, would void the MTBF warranty of the server. Gazoo CIM® accelerators, however, operate very well within these constraints.

Using voice and video (media) algorithms as a metric, below are some example performance figures for an HP DL380p Gen8 server, configured as follows:

• Dual Sandy Bridge (16 cores, 2.2 GHz clock rate)

• Dual riser cards (3 slots each)

• Three (3) CIM® accelerator cards (x8 PCIe, 32 total CIM® cores @ 1.25 GHz per core, 2 GByte DDR3 memory per core)

• Two (2) CIM® accelerator cards (x8 PCIe, 64 total CIM® cores @ 1.25 GHz per core, 2 GByte DDR3 memory per core)

• 32 GByte motherboard memory

256 total physical compute cores

Figure 6 - HP DL380 Server with low SWaP CIM® Accelerator Cards


ISS Capacity Table

CIM™ 64C Accelerator Voice / Video Capacity

CPU / Accelerator Type: x86¹ (16 cores @ 3.00 GHz) and CIM² (64 cores @ 1.25 GHz); framework overhead 30%.

Video:

| Codec | Bitrate | Fps | Encode, Decode, or Both | Encode cores³ | Decode cores³ | Capacity (streams) |
|---|---|---|---|---|---|---|
| H.264 720p BP | 1 Mbps | 15 | E | 2 | 1 | 36 |
| VP8 720p | 1 Mbps | 15 | E | 4 | 1 | 18 |
| MPEG2 720p BP | 4 Mbps | 15 | E | 2 | | 36 |
| H.264 1080p BP | 1 Mbps | 15 | E | 2 | 1 | 36 |
| VP8 1080p | 1 Mbps | 15 | E | 4 | 1 | 18 |
| MPEG2 1080p BP | 4 Mbps | 15 | E | 2 | | 36 |
| H.264 CIF MP | 500 kbps | 15 | E | 2 | 1 | 36 |
| H.264 QCIF MP | 250 kbps | 15 | E | 1 | 1 | 72 |

Speech:

| Codec | Encode, Decode, or Both | Capacity (sessions) |
|---|---|---|
| G.711 | B | 94916 |
| AMR-NB | B | 8576 |
| AMR-WB | B | 3620 |
| EVRC | B | 5204 |
| G.722 | B | 8777 |
| G.722.1 (16 kHz Fs) | B | 10566 |
| G.723.1A | B | 17554 |
| G.729AB | B | 11042 |
| GSM FR | B | 47458 |
| GSM HR | B | 7800 |
| iLBC | B | 6460 |

¹ Intel Sandy Bridge  ² Texas Instruments C66x  ³ Dedicated cores required due to optimized algorithm

Figure 7 – HP DL380p Gen8 Server Performance Figures for High Capacity Media Application

The above figures were measured on an HP DL380 configured as specified above. Average throughput across the PCIe bus was about 50 Mbps per CIM® core, which is typical for voice applications.
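As a rough sanity check on the bus budget (an estimate assuming the 50 Mbps average holds per core under full load, with Gen3 signaling on the x8 slot): 64 cores per card × 50 Mbps per core ≈ 3.2 Gbps per card, a small fraction of the roughly 63 Gbps raw bandwidth of an x8 PCIe Gen3 link, so the PCIe interface is not the limiting factor at these capacities.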


Gazoo – XstreamCore

XstreamCore is different from competing solutions:
- Not constrained to a proprietary architecture
- Uses standard PCIe cards for microserver and media processing resources, which can run on any server
- Offers both PCI Express and Ethernet communications infrastructure
- Unmatched density per RU for processing and media applications: 15 slots in 3U

• Container for cold-swappable compute and I/O modules

• Reliability with redundant/hot-swappable power and cooling

• Designed for NEBS certification

• 15 PCIe slots to create application-specific appliances

Configuration options include:
- Intel Xeon-D MicroServer cards
- Gazoo accelerator cards
- 100GE I/O with load balancer
- All standard 3rd party PCIe cards

Innovative Express Fabric:
- Virtual functions (VFs) are assigned to virtual machines (VMs) on several CPUs
- Ability to share I/O between VMs
- 25-100 Gbps per card via PCIe Gen3

Direct I/O:
- Shared MAC with 2x 10GBaseT at the chassis front
- Optional Ethernet switch with 1GE ports to server cards and two 10GBaseT at the chassis front

736 total physical compute cores

Figure 8 - XstreamCore Server with low SWaP CIM® Accelerator Cards

XstreamCore PCIe chassis configurations are available in densities up to 15,000 physical cores in a 21U rack.


Summary

Gazoo provides hardware and software that adds hundreds of compute cores to industry standard servers. This functionality covers a broad range of leading business applications, including the following:

• Artificial Intelligence

• Image Analytics

• Data Analytics

• Virtual Desktop Infrastructure (VDI)

• Media Transcoding

Today, the HPC industry is transforming from traditional hardware-based deployment models to software-based and, ultimately, cloud-based models. By deploying virtualized solutions, operators have the opportunity to gain valuable experience in the technology and processes related to virtualization, while reaping the benefits of improved service agility, deployment flexibility, and reduced CAPEX. Gazoo’s virtualized portfolio of solutions leverages NFV and SDN technologies for deployments on cloud-based infrastructure.

Our patented High Performance Virtual Machines (HPVMs) allow more complex applications to be virtualized and moved to standard server architectures, and their compute intensive functions accelerated.

The inspiration for this white paper is derived from actual customer situations where this approach has been used to replace one or more costly black boxes in a customer rack. Specific case study information is available under NDA.

At Gazoo we invent, perfect, patent, and license software solutions that accelerate High Performance Computing (HPC).

Gazoo, partnered with Signalogic®, has worked closely with TI for 25+ years and is an authorized TI Design Network member. TI's global network provides turnkey products and services, system modules, embedded software, and development tools that help customers accelerate development and reduce time-to-market.

Gazoo, partnered with Signalogic®, is a General Member of the Intel® Internet of Things Solutions Alliance. Intel and the 250+ global member companies of the Alliance provide the hardware, software, firmware, tools, and systems integration that developers need to take a leading role in the rise of the Internet of Things. Learn more at: iotsolutionsalliance.intel.com/member-roster/signalogic-inc

Gazoo, partnered with Signalogic®, is an HP AllianceOne Partner. AllianceOne is a worldwide program composed of hundreds of ISVs, IHVs, consultants, SIs, service providers and OEMs who develop market-leading solutions running on key HP technologies and platforms.

3891 S. Traditions Dr. College Station, TX 77845 USA

979-220-7753

www.gazoohpc.com | [email protected]

© 2015 Gazoo, Inc.

High Performance Computing