ProOnE: A General-Purpose Protocol Onload Engine for Multi- and Many-Core Architectures
P. Lai, P. Balaji, R. Thakur and D. K. Panda
Computer Science and Engineering, Ohio State University
Mathematics and Computer Science, Argonne National Laboratory
Hardware Offload Engines
• Specialized engines
  – Widely used in High-End Computing (HEC) systems for accelerating task processing
  – Built for specific purposes
    • Not easily extendable or programmable
    • Serve a small niche of applications
• Trends in HEC systems: increasing size and complexity
  – Difficult for hardware to deal with this complexity
    • Fault tolerance (understanding application and system requirements is complex and too environment-specific)
    • Multi-path communication (the optimal solution is NP-complete)
Pavan Balaji, Argonne National Laboratory
ISC (06/23/2009)
General Purpose Processors
• Multi- and many-core processors
  – Quad- and hex-core processors are commodity components today
  – Intel Larrabee will have 16 cores; Intel Terascale will have 80 cores
  – Simultaneous Multi-Threading (SMT, or Hyperthreading) is becoming common
• Expected future
  – Each physical node will have a massive number of processing elements (Terascale on the chip)
[Figure: future multicore systems]
Benefits of Multi-core Architectures
• Cost
  – Cheap: Moore's law will continue to drive costs down
  – HEC is a small market; multi-cores != HEC
• Flexibility
  – General purpose processing units
  – A huge number of tools already exist to program and utilize them (e.g., debuggers, performance measurement tools)
  – Extremely flexible and extendable
Multi-core vs. Hardware Accelerators
• Will multi-core architectures eradicate hardware accelerators?
  – Unlikely: hardware accelerators have their own benefits
  – Hardware accelerators provide two advantages:
    • More processing power and better on-board memory bandwidth
    • Dedicated processing capabilities
      – They run compute kernels in a dedicated manner
      – They do not deal with shared-mode processing like CPUs
  – But more complex machines need dedicated processing for more things
    • More powerful hardware offload techniques are possible, but with diminishing returns!
ProOnE: General Purpose Onload Engine
• Hybrid hardware-software engines
  – Utilize hardware offload engines for the low-hanging fruit
    • Where the return on investment is maximum
  – Utilize software "onload" engines for more complex tasks
    • Can imitate some of the benefits of hardware offload engines, such as dedicated processing
• Basic idea of ProOnE
  – Dedicate a small subset of processing elements on a multi-core architecture for "protocol onloading"
  – Different capabilities are added to ProOnE as plugins
  – Applications interact with ProOnE for dedicated processing
ProOnE: Basic Overview
• ProOnE does not try to take over tasks performed well by hardware offload engines
• It only supplements their capabilities
[Figure: ProOnE sits between software middleware (e.g., MPI) and the network layer, whether that is basic network hardware, a network with basic acceleration, or a network with advanced acceleration]
Presentation Layout
• Introduction and Motivation
• ProOnE: A General Purpose Protocol Onload Engine
• Experimental Results and Analysis
• Conclusions and Future Work
ProOnE Infrastructure
• Basic idea:
  – Onload a part of the work to the ProOnE daemons
  – Application processes and ProOnE daemons communicate through intra-node and inter-node communication
[Figure: ProOnE daemons and application processes on Node 1 and Node 2 communicate through shared memory within a node and over the network between nodes]
Design Components
• Intra-node communication
  – Each ProOnE process allocates a shared memory segment for communication with the application processes
  – Application processes use this shared memory to communicate requests, completions, signals, etc., to and from ProOnE
    • Queues and hash functions are used to manage the shared memory
• Inter-node communication
  – Utilize any network features that are available
  – Build logical all-to-all connectivity on the available communication infrastructure
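The intra-node channel described above (request queues in shared memory, hash-based lookup) can be sketched as follows; the class, field names, and queue layout are illustrative assumptions, not the actual ProOnE data structures, and plain Python containers stand in for the shared memory segment.

```python
from collections import deque

class OnloadChannel:
    """Illustrative model of ProOnE's intra-node channel: application
    processes enqueue requests, the daemon drains them and posts
    completion notifications, with a hash table for fast lookup."""

    def __init__(self):
        self.requests = deque()      # stands in for the shared-memory request queue
        self.completions = deque()   # completion notifications back to the app
        self.by_tag = {}             # hash: (src, tag) -> request, for matching

    def post_request(self, src, tag, payload):
        """Called by an application process to hand work to the daemon."""
        req = {"src": src, "tag": tag, "payload": payload}
        self.requests.append(req)
        self.by_tag[(src, tag)] = req

    def daemon_progress(self):
        """One progress step of the onload daemon: drain pending requests
        and post a CMPLT notification for each."""
        while self.requests:
            req = self.requests.popleft()
            self.by_tag.pop((req["src"], req["tag"]), None)
            self.completions.append(("CMPLT", req["src"], req["tag"]))

ch = OnloadChannel()
ch.post_request(src=0, tag=7, payload=b"data")
ch.daemon_progress()
print(ch.completions[0])  # ('CMPLT', 0, 7)
```

In the real design the daemon polls these queues continuously on its dedicated core, which is what removes the progress delay discussed later.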
Design Components (contd.)
• Initialization infrastructure
  – ProOnE processes
    • Create their own shared memory regions
    • Attach to the shared memory created by other ProOnE processes
    • Connect to remote ProOnE processes
    • Listen for and accept connections from other processes (ProOnE processes and application processes)
  – Application processes
    • Attach to ProOnE shared memory segments
    • Connect to remote ProOnE processes
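The listen/accept and connect steps above can be sketched with ordinary TCP sockets; the use of TCP for the control channel and the `proone_init` helper are assumptions for illustration (the slides do not specify the transport), and the shared-memory setup is elided.

```python
import socket

def proone_init(port):
    """Illustrative init: a ProOnE daemon listens for connections from
    application processes and peer daemons (the 'listen and accept'
    step above). Shared-memory creation/attachment is elided."""
    srv = socket.socket()              # TCP is an assumption here
    srv.bind(("127.0.0.1", port))
    srv.listen()
    return srv

srv = proone_init(0)                   # port 0: let the OS pick a free port
port = srv.getsockname()[1]

# an "application process" connects to the daemon (the 'connect' step)
cli = socket.socket()
cli.connect(("127.0.0.1", port))
conn, _ = srv.accept()

cli.sendall(b"SEND request")           # e.g., forward a request to the daemon
print(conn.recv(64))                   # b'SEND request'
conn.close(); cli.close(); srv.close()
```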
ProOnE Capabilities
• Can "onload" tasks for any component in the application executable:
  – The application itself
    • Useful for progress management (master-worker models) and work-stealing based applications (e.g., mpiBLAST)
  – Communication middleware
    • The MPI stack can use ProOnE to offload tasks that benefit from dedicated processing
    • Having to assume that the CPU is shared with other processes makes many things inefficient
  – Network protocol stacks
    • E.g., Greg Regnier's TCP onload engine
Case Study: MPI Rendezvous Protocol
• Used for large messages to avoid copies and/or large unexpected data
[Figure: the sender's MPI_Send issues an RTS to the receiver; the receiver's MPI_Recv replies with a CTS; the sender then transfers the DATA]
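The three-step handshake in the figure can be modeled as a tiny state machine; the message names follow the slides (RTS, CTS, DATA), while the function and its arguments are illustrative assumptions.

```python
def rendezvous(recv_posted):
    """Toy model of the MPI rendezvous handshake: the sender announces a
    large message with an RTS; the receiver grants a CTS only once its
    receive is posted (so the payload never arrives unexpected); the
    data then moves without intermediate copies."""
    trace = ["RTS"]        # sender -> receiver: request to send
    if not recv_posted:
        return trace       # CTS is withheld until MPI_Recv is posted
    trace.append("CTS")    # receiver -> sender: clear to send
    trace.append("DATA")   # sender -> receiver: the large payload
    return trace

print(rendezvous(recv_posted=True))   # ['RTS', 'CTS', 'DATA']
print(rendezvous(recv_posted=False))  # ['RTS'] -- handshake stalls
```

The stalled case is exactly the progress problem the next slide discusses: nothing moves until the receiver gets around to noticing the RTS.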
Issues with MPI Rendezvous
• Communication progress in MPI rendezvous
  – Delay in detecting control messages
  – Hardware support for one-sided communication is useful, but not perfect
    • Many cases are expensive to implement in hardware
  – Bad computation/communication overlap!
[Figure: the sender's MPI_Isend issues an RTS, but the receiver is computing after its MPI_Irecv, so the CTS is delayed until the receiver's MPI_Wait; the data transfer then stalls in the sender's MPI_Wait, and there is no overlap]
MPI Rendezvous with ProOnE
• ProOnE is a dedicated processing engine
  – No delay caused by the receiver being "busy" with something else
• Communication progress is similar to before
  – Completion notifications are included as appropriate
[Figure: the MPI sender and receiver post SEND/RECV requests into their node's shared memory (Shmem 0 and Shmem 1); the ProOnE daemons (ProOnE 0 and ProOnE 1) read the requests, try to match each incoming RTS against a posted RECV request, exchange RTS (no match) or CTS (matched) and DATA on the processes' behalf, and post CMPLT notifications; both the receiver-arrives-earlier and receiver-arrives-later cases are shown]
Design Issues and Solutions
• MPICH2's internal message matching uses the three-tuple (src, comm, tag)
  – Issue: out-of-order messages (even when the communication sub-system is ordered)
  – Solution: add a sequence number to the matching tuple
• Example: MPI Rank 0 issues Send1 (src=0, tag=0, comm=0, len=1M) and Send2 (src=0, tag=0, comm=0, len=1K); MPI Rank 1 posts Recv1 (len=1M) and Recv2 (len=1K) with the same tuple. If the sends are processed out of order, Recv1 can end up matching the wrong send, since the tuple alone cannot tell the two messages apart.
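The sequence-number fix above can be sketched as a reorder buffer keyed on (src, comm, seq); the class and its fields are illustrative assumptions, not MPICH2's actual matching structures.

```python
class SeqMatcher:
    """Toy in-order delivery: two sends with identical (src, tag, comm)
    are indistinguishable, so a per-sender sequence number is added to
    the matching tuple and arrivals are delivered in sequence order."""

    def __init__(self):
        self.expected = {}   # (src, comm) -> next expected sequence number
        self.pending = {}    # out-of-order arrivals keyed by (src, comm, seq)

    def arrive(self, src, comm, tag, seq, payload):
        """Buffer an arrival; deliver every message that is now in order."""
        delivered = []
        self.pending[(src, comm, seq)] = (tag, payload)
        exp = self.expected.get((src, comm), 0)
        while (src, comm, exp) in self.pending:
            delivered.append(self.pending.pop((src, comm, exp)))
            exp += 1
        self.expected[(src, comm)] = exp
        return delivered

m = SeqMatcher()
print(m.arrive(0, 0, 0, seq=1, payload="1K"))  # [] -- held until seq 0 arrives
print(m.arrive(0, 0, 0, seq=0, payload="1M"))  # delivers 1M then 1K, in order
```

Without the `seq` key, the 1K message arriving first would have matched the receive posted for the 1M message, which is exactly the mismatch in the example above.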
Design Issues and Solutions (contd.)
• ProOnE is "shared" between all processes on the node
  – It cannot distinguish requests to different ranks
  – Solution: add a destination rank to the matching tuple
• Shared memory lock contention
  – Shared memory is divided into three segments: (1) SEND/RECV requests, (2) RTS messages and (3) CMPLT notifications
  – (1) and (2) use the same lock to avoid mismatches
• Memory mapping
  – ProOnE needs access to application memory to avoid extra copies
  – Kernel direct-copy is used to achieve this (KNEM, LiMIC)
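The lock layout above can be sketched as follows; the segment names match the slide, while the class, lists, and lock granularity are illustrative assumptions.

```python
import threading

class ShmSegments:
    """Toy model of the three shared-memory segments: SEND/RECV requests
    and RTS messages share one lock, so a request and an incoming RTS
    can never be matched against an inconsistent view; completions get
    their own lock to reduce contention on the matching path."""

    def __init__(self):
        self.match_lock = threading.Lock()   # guards segments (1) and (2)
        self.cmplt_lock = threading.Lock()   # guards segment (3)
        self.requests, self.rts, self.cmplt = [], [], []

    def post_request(self, req):
        with self.match_lock:                # same lock as RTS: no mismatches
            self.requests.append(req)

    def post_rts(self, rts):
        with self.match_lock:
            self.rts.append(rts)

    def post_cmplt(self, note):
        with self.cmplt_lock:                # independent lock: completion
            self.cmplt.append(note)          # traffic does not block matching

s = ShmSegments()
s.post_request("RECV from rank 0")
s.post_rts("RTS from rank 1")
s.post_cmplt("CMPLT for rank 0")
```

Splitting the completion segment onto its own lock is the contention optimization; folding (1) and (2) under one lock is the correctness constraint.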
Presentation Layout
• Introduction and Motivation
• ProOnE: A General Purpose Protocol Onload Engine
• Experimental Results and Analysis
• Conclusions and Future Work
Experimental Setup
• Experimental testbed
  – Dual quad-core Xeon processors
  – 4GB DDR SDRAM
  – Linux kernel 2.6.9.34
  – Nodes connected with Mellanox InfiniBand DDR adapters
• Experimental design
  – Use one core on each node to run ProOnE
  – Compare the performance of ProOnE-enabled MPICH2 with vanilla MPICH2, marked as "ProOnE" and "Original"
Overview of the MPICH2 Software Stack
• High-performance and widely portable MPI
  – Supports MPI-1, MPI-2 and MPI-2.1
  – Supports multiple network stacks (TCP, MX, GM)
  – Commercial support by many vendors
    • IBM (integrated stack distributed by Argonne)
    • Microsoft, Intel (in the process of integrating their stacks)
  – Used by many derivative implementations
    • MVAPICH2, BG, Intel-MPI, MS-MPI, SC-MPI, Cray, Myricom
• MPICH2 and its derivatives support many Top500 systems
  – Estimated at more than 90%
  – Available with many software distributions
  – Integrated with ROMIO MPI-IO and the MPE profiling library
Computation/Communication Overlap
• Sender overlap benchmark:
    Sender:   MPI_Isend(array); Compute(); MPI_Wait();
    Receiver: MPI_Recv(array);
• Receiver overlap benchmark:
    Receiver: MPI_Irecv(array); Compute(); MPI_Wait();
    Sender:   MPI_Send(array);
• Computation/communication ratio: W/T, where W is the time spent in Compute() and T is the total time from the nonblocking call to the completion of MPI_Wait
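The W/T metric above can be made concrete with a small helper; the function name and the timing numbers below are invented for illustration.

```python
def comp_comm_ratio(w, t):
    """Computation/communication ratio from the benchmark above: W is the
    time spent in Compute(), T the total time from the nonblocking call
    to the completion of MPI_Wait. With perfect overlap, T shrinks toward
    max(W, transfer time), so W/T approaches 1 when the transfer is
    fully hidden behind computation."""
    return w / t

# invented numbers: 8 ms of computation; 10 ms total when the 2 ms
# transfer is serialized after computation, 8 ms when fully hidden
print(comp_comm_ratio(8.0, 10.0))  # 0.8 -- no overlap
print(comp_comm_ratio(8.0, 8.0))   # 1.0 -- full overlap
```

This is the quantity plotted on the y-axis of the overlap figures that follow.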
Sender-side Overlap
[Figure: overlap ratio (0-1.2) vs. computation time, for ProOnE and Original; one panel for 1MB messages (computation 1-9 ms) and one for 256KB messages (computation 100-800 us)]
Receiver-side Overlap
[Figure: overlap ratio vs. computation time (ms), for ProOnE and Original; one panel for 1MB messages (computation 1-12 ms) and one for 256KB messages]
Impact of Process Skew
[Figure: receiver-side overlap (1MB): total time at sender (ms) vs. computation (5-50 msec), for ProOnE]
• Benchmark:
    Process 0:
      for many loops:
        Irecv(1MB, rank 1)
        Send(1MB, rank 1)
        Computation()
        Wait()
    Process 1:
      Irecv()
      for many loops:
        Computation()
        Wait()
        Irecv(1MB, rank 0)
        Send(1MB, rank 0)
Sandia Application Availability Benchmark
[Figure: two panels vs. message size (1 byte to 1M): messaging overhead (us, up to ~3500) and application availability (%), for ProOnE and Original]
Matrix Multiplication Performance
[Figure: execution time (ms) for ProOnE and Original; one panel with varying problem size (128x128, 256x256, 512x512) and one with varying system configuration (16x1, 8x2, 4x4; #nodes x #cores)]
2D Jacobi Sweep Performance
[Figure: time for WaitAll (us) vs. boundary data size (8K-64K), for ProOnE and Original]
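For context, one serial step of the 2D Jacobi sweep being benchmarked looks like this; the MPI halo exchange (Isend/Irecv of boundary rows, completed by the MPI_Waitall the figure times) is elided, and the grid size and boundary values are illustrative.

```python
def jacobi_step(grid):
    """One 2D Jacobi sweep: each interior point becomes the average of
    its four neighbors. In the distributed version, the boundary rows
    and columns are exchanged with neighbor ranks before this update,
    and MPI_Waitall completes that exchange -- the cost the slide plots
    against boundary data size."""
    n = len(grid)
    new = [row[:] for row in grid]           # boundaries stay fixed
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return new

g = [[0.0] * 4 for _ in range(4)]
g[0] = [1.0] * 4                             # hot top boundary
g1 = jacobi_step(g)
print(g1[1][1])  # 0.25
```

Since each rank only needs its neighbors' boundary strips, the exchanged volume grows with the boundary size, which is why the benchmark varies exactly that parameter.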
Presentation Layout
• Introduction and Motivation
• ProOnE: A General Purpose Protocol Onload Engine
• Experimental Results and Analysis
• Conclusions and Future Work
Concluding Remarks and Future Work
• Proposed, designed and evaluated a general-purpose Protocol Onload Engine (ProOnE)
  – Utilizes a small subset of the cores to onload complex tasks
• Presented a detailed design of the ProOnE infrastructure
• Onloaded the MPI rendezvous protocol as a case study
  – ProOnE-enabled MPI provides significant performance benefits for benchmarks as well as applications
• Future work:
  – Study performance and scalability on large-scale systems
  – Onload other complex tasks, including application kernels
Acknowledgments
• The Argonne part of the work was funded by:
  – NSF Computing Processes and Artifacts (CPA)
  – DOE ASCR
  – DOE Parallel Programming Models
• The Ohio State part of the work was funded by:
  – NSF Computing Processes and Artifacts (CPA)
  – DOE Parallel Programming Models
Thank You!
Contacts:
{laipi, panda} @ cse.ohio-state.edu
{balaji, thakur} @ mcs.anl.gov
Web Links:
MPICH2: http://www.mcs.anl.gov/research/projects/mpich2
NBCLab: http://nowlab.cse.ohio-state.edu