A Survey on in-a-box parallel computing and its implications on system software research


Changwoo Min (multics69@gmail.com)

Motivation

"Technology ratios matter." – Jim Gray

"In the face of such '10X' forces, you can lose control of your destiny." – Andrew S. Grove

What are the implications of the multicore evolution for system software researchers?

Survey Scope and Strategy

[Figure: survey scope – the full stack on a machine with multicore CPUs and GPGPUs: virtual machine monitor, operating system, system library, parallel programming model, parallel middleware, and parallel applications]

Contents

Background

Parallel Programming Model and Productivity Tools

Optimization of System Software

Supporting GPU in a Virtualized Environment

Utilizing GPU in Middleware

Conclusion

Background

Why multicore?

Multicore CPU

Power wall

ILP (instruction-level parallelism) wall

Memory wall

Wire delay

GPGPU (General-Purpose computing on a Graphics Processing Unit)

A GPU traditionally handles computation only for computer graphics.

GPGPU adds the following to the rendering pipeline:

programmable stages

higher precision arithmetic

Use stream processing on non-graphics data.

Architecture of GPGPU core

Parallel Programming Model and Productivity Tools

OpenMP

A parallel programming API for shared-memory multiprocessing in C, C++, and Fortran

Uses a language extension – "#pragma omp"

Needs compiler support

OpenMP (cont’d)

Fork-and-join model

Bounded parallel loop, reduction

Task-creation-and-join model

Unbounded loop, recursive algorithm, producer/consumer
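
A minimal sketch of the two models above in OpenMP C (my own illustration, not from the slides), assuming a compiler with OpenMP support (e.g. gcc or clang with -fopenmp): a bounded parallel loop with a reduction for fork-and-join, and a recursive algorithm expressed with tasks for task-creation-and-join.

    #include <stdio.h>

    /* Fork-and-join: bounded parallel loop with a reduction. */
    static long sum_squares(int n) {
        long sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += (long)i * i;
        return sum;
    }

    /* Task-creation-and-join: recursive algorithm as tasks. */
    static long fib(int n) {
        if (n < 2) return n;
        long a, b;
        #pragma omp task shared(a)
        a = fib(n - 1);
        #pragma omp task shared(b)
        b = fib(n - 2);
        #pragma omp taskwait          /* join the two child tasks */
        return a + b;
    }

    int main(void) {
        long f = 0;
        printf("sum = %ld\n", sum_squares(1000));
        #pragma omp parallel          /* fork a team of threads ... */
        #pragma omp single            /* ... but let one thread spawn the task tree */
        f = fib(20);
        printf("fib = %ld\n", f);
        return 0;
    }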

Intel TBB (Threading Building Blocks)

Similar to OpenMP

API for shared memory multiprocessing

Fork-and-join

parallel-for, parallel-reduce

Task-creation-and-join

Task scheduler

Different from OpenMP

C++ template library

Concurrent container classes

Hash map, vector, queue

Various synchronization mechanisms

mutex, spin lock, …

Atomic types, atomic operations

Scalable memory allocator
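
A small C++ sketch of the corresponding TBB pieces (illustrative only; assumes the classic tbb headers and linking with -ltbb): parallel_reduce for fork-and-join, and concurrent_hash_map as a concurrent container.

    #include <tbb/parallel_reduce.h>
    #include <tbb/blocked_range.h>
    #include <tbb/concurrent_hash_map.h>
    #include <cstdio>

    int main() {
        // Fork-and-join: parallel-reduce over a blocked range of indices.
        long sum = tbb::parallel_reduce(
            tbb::blocked_range<int>(0, 1000), 0L,
            [](const tbb::blocked_range<int>& r, long partial) {
                for (int i = r.begin(); i != r.end(); ++i) partial += (long)i * i;
                return partial;
            },
            [](long a, long b) { return a + b; });

        // Concurrent container: a hash map that is safe to update from many tasks.
        tbb::concurrent_hash_map<int, long> table;
        {
            tbb::concurrent_hash_map<int, long>::accessor slot;
            table.insert(slot, 42);      // bucket stays locked while the accessor lives
            slot->second = sum;
        }

        std::printf("sum = %ld\n", sum);
        return 0;
    }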

Nvidia CUDA (Compute Unified Device Architecture)

CUDA

The computing engine in Nvidia GPUs

A programming framework for Nvidia GPUs

Uses CUDA-extended C

declspecs, keywords, intrinsics, runtime API, function launch, …

[Figures: CUDA-extended C; compiling CUDA code; processing flow on CUDA]

Nvidia CUDA (cont’d)

[Figures: execution model; kernel memory access]
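
A standard vector-add style sketch (not from the slides) of CUDA-extended C and the copy-launch-copy processing flow, assuming the CUDA runtime API and nvcc:

    #include <cstdio>
    #include <cuda_runtime.h>

    // The __global__ declspec marks a kernel; built-in variables give each
    // thread its coordinates, so one thread handles one element.
    __global__ void vec_add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *ha = new float[n], *hb = new float[n], *hc = new float[n];
        for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

        // Processing flow on CUDA: copy in, launch the kernel, copy out.
        float *da, *db, *dc;
        cudaMalloc((void**)&da, bytes);
        cudaMalloc((void**)&db, bytes);
        cudaMalloc((void**)&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        vec_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);   // function launch

        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
        std::printf("c[0] = %.1f\n", hc[0]);

        cudaFree(da); cudaFree(db); cudaFree(dc);
        delete[] ha; delete[] hb; delete[] hc;
        return 0;
    }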

OpenCL (Open Computing Language)

CPU/GPU heterogeneous computing framework

Standardized by the Khronos Group

[Figures: OpenCL memory model; CUDA and OpenCL example]

Lithe: Enabling Efficient Composition of Parallel Libraries

Who?

ParLab, UC Berkeley, HotPar’09

Problem

Composing parallel libraries causes performance anomalies

Lithe: Enabling Efficient Composition of Parallel Libraries (cont'd)

Solution

Virtualized threads are bad for parallel libraries.

Harts

Unvirtualized hardware thread context

Sharing harts

Lithe

Cooperative hierarchical scheduler framework for harts
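
The composition anomaly is easy to reproduce: each library sizes its thread pool to the whole machine, so nesting one inside another oversubscribes the hardware. A tiny illustration of the problem (my own, not from the paper), nesting an OpenMP region inside a TBB parallel loop:

    #include <tbb/parallel_for.h>

    void inner_library_work(int /*item*/) {
        // The inner library forks its own team sized to the whole machine ...
        #pragma omp parallel
        {
            // ... so with P cores and up to P outer TBB workers, roughly P*P
            // OS threads now compete for only P hardware thread contexts (harts).
        }
    }

    int main() {
        // The outer library also sizes itself to the whole machine.
        tbb::parallel_for(0, 64, [](int i) { inner_library_work(i); });
        return 0;
    }

Lithe avoids this by having the libraries' schedulers cooperatively hand unvirtualized harts to each other instead of each creating its own virtualized OS threads.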

Concurrency bug detection: DataCollider

Who?

Microsoft Research, OSDI’10

Problem

Detecting data-race bugs is difficult.

For a large system such as the Windows kernel, runtime overhead is critical.

Solution

Sampling using code breakpoints

When a code breakpoint is trapped:

Set a data breakpoint on its operand

Sleep for a while

If the data has changed, it could be a data race.
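
A user-level, conceptual analogue of that check (the real tool arms hardware code and data breakpoints inside the Windows kernel; nothing below is DataCollider's actual code):

    #include <chrono>
    #include <cstring>
    #include <thread>

    // At a randomly sampled memory access: remember the value, hold the thread
    // briefly while "watching" the location, then re-read it. A change means
    // another thread wrote the same location concurrently - a potential race.
    bool sampled_access_races(const void* addr, size_t size) {
        unsigned char before[8], after[8];
        if (size > sizeof(before)) return false;
        std::memcpy(before, addr, size);                            // value at the trap
        std::this_thread::sleep_for(std::chrono::milliseconds(5));  // "sleep for a while"
        std::memcpy(after, addr, size);
        return std::memcmp(before, after, size) != 0;               // changed => report
    }

Because only sampled accesses pay this cost, the runtime overhead can be kept low enough for a system as large as the Windows kernel.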

Concurrency bug detection: SyncFinder

Who?

UC San Diego, OSDI ’10

Problem

How to find ad-hoc synchronization

Solution

Formalize patterns of ad-hoc synchronization

Detect such patterns using LLVM
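
For concreteness, the kind of ad-hoc synchronization SyncFinder looks for is a hand-rolled wait loop whose exit condition depends on a shared variable written by another thread (illustration only):

    // Shared flag used for synchronization without any lock, condition variable,
    // or atomic - which is exactly what makes it "ad hoc" (and, strictly
    // speaking, a data race).
    int work_done = 0;

    void consumer() {
        while (work_done == 0) { /* spin */ }   // ad-hoc wait loop: the reading side
        /* ... consume the produced data ... */
    }

    void producer() {
        /* ... produce the data ... */
        work_done = 1;                          // the writing side of the ad-hoc sync
    }

SyncFinder formalizes this loop/exit-condition/remote-write pattern and detects instances of it by static analysis over LLVM bitcode.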

Optimization of System Software

Memory Allocation: Hoard

Who?

UT, ASPLOS’00

Problem

The memory allocator is a performance bottleneck in a multiprocessor environment.

Lock contention, false sharing, blowup

Allocator-induced false sharing

Memory Allocation: Hoard (cont’d)

Solution

Per-processor heaps to reduce lock contention and false sharing

Global heap

Borrow memory from the global heap to grow a per-processor heap

Return memory to the global heap if there is too much free memory in a per-processor heap
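
A minimal sketch of that structure for a single size class (my own simplification: thread-local heaps stand in for per-processor heaps, and superblocks and cross-thread frees are omitted):

    #include <cstdlib>
    #include <mutex>
    #include <vector>

    constexpr size_t kBlock = 64;             // one size class, for illustration
    constexpr size_t kTooMuchFree = 128;      // "too much free memory" threshold
    constexpr int    kBatch = 32;             // blocks moved per rebalance

    struct GlobalHeap {                       // shared; touched only to rebalance
        std::mutex lock;
        std::vector<void*> blocks;
    } g_heap;

    struct LocalHeap {                        // one per thread: no lock, no sharing
        std::vector<void*> blocks;

        void* alloc() {
            if (blocks.empty()) borrow_from_global();
            void* p = blocks.back();
            blocks.pop_back();
            return p;
        }
        void release(void* p) {
            blocks.push_back(p);
            if (blocks.size() > kTooMuchFree) return_to_global();
        }
        void borrow_from_global() {
            std::lock_guard<std::mutex> g(g_heap.lock);
            for (int i = 0; i < kBatch && !g_heap.blocks.empty(); i++) {
                blocks.push_back(g_heap.blocks.back());
                g_heap.blocks.pop_back();
            }
            while (blocks.size() < (size_t)kBatch)     // global heap exhausted:
                blocks.push_back(std::malloc(kBlock)); // get fresh memory
        }
        void return_to_global() {
            std::lock_guard<std::mutex> g(g_heap.lock);
            for (int i = 0; i < kBatch; i++) {
                g_heap.blocks.push_back(blocks.back());
                blocks.pop_back();
            }
        }
    };

    thread_local LocalHeap t_heap;

    void* my_malloc()      { return t_heap.alloc(); }
    void  my_free(void* p) { t_heap.release(p); }

Because different threads' blocks come from different heaps, allocator-induced false sharing is also avoided.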

Memory Allocation: Xmalloc

Who?

UIUC, ICCIT’10

Problem

Scalable malloc for CUDA, where hundreds of threads run concurrently.

Solution

Memory allocation coalescing
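
A hedged sketch of what coalescing means on the device side (an illustration of the idea only, not XMalloc's lock-free design): one representative thread allocates on behalf of a whole thread block, and the other threads carve out private slices, so hundreds of malloc calls collapse into one.

    // Requires device-side malloc/free (compute capability >= 2.0).
    __global__ void coalesced_alloc_demo(int bytes_per_thread) {
        __shared__ char* block_buf;

        if (threadIdx.x == 0)       // one allocation on behalf of the whole block
            block_buf = (char*)malloc((size_t)bytes_per_thread * blockDim.x);
        __syncthreads();
        if (block_buf == nullptr) return;

        // Each thread uses its private slice of the shared allocation.
        char* mine = block_buf + (size_t)threadIdx.x * bytes_per_thread;
        mine[0] = (char)threadIdx.x;

        __syncthreads();
        if (threadIdx.x == 0) free(block_buf);   // one free per block
    }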

System Call: FlexSC

Who?

University of Toronto, OSDI’10

Problem

The negative performance impact of system calls is huge.

Direct cost + indirect cost

Solution

Batching and asynchronous (exception-less) system calls
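
The mechanism can be pictured as a page of system call entries shared between user and kernel threads: user code fills entries and keeps running, while kernel-side syscall threads execute them in batches and post results. The layout below is purely illustrative; the field names and sizes are my assumptions, not FlexSC's actual ABI.

    #include <cstdint>

    enum class EntryState : uint32_t { Free, Submitted, Busy, Done };

    struct SyscallEntry {
        volatile EntryState state;  // polled by both sides: no mode switch per call
        uint32_t number;            // system call number
        uint64_t args[6];           // arguments, written by the user thread
        int64_t  ret;               // result, written by the kernel-side thread
    };

    struct SyscallPage {
        SyscallEntry entries[64];   // a batch of in-flight, asynchronous calls
    };

Batching amortizes the direct cost of mode switches and also reduces the indirect cost of polluting processor structures such as caches and TLBs on every call.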

Revisiting OS Architecture

Multikernel

Who?

ETH Zurich, Microsoft Research Cambridge, SOSP’09

Problem

System diversity

It is no longer acceptable (or useful) to tune a general-purpose OS design for a particular hardware model.

Multikernel (cont’d)

Problem (cont’d)

The interconnect matters

Core diversity

Programmable NICs

GPU

FPGA in CPU sockets

[Figures: 8-socket Nehalem; on-chip interconnects]

[Figure: SHM vs. message passing – the SHM cost appears as stalled cycles (no locking!)]

Multikernel (cont’d)

Solution

Today's computer is already a distributed system. Why isn't your OS?

Barrelfish

Implementation of the multikernel approach

Message passing, shared nothing, replica maintenance
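
The communication style can be sketched as a single-producer/single-consumer channel of cache-line-sized messages between two cores, with per-core state replicas kept consistent by such messages instead of shared data and locks. This is only an illustration of the idea, not Barrelfish's actual inter-core transport:

    #include <atomic>
    #include <cstdint>

    struct alignas(64) Message {            // one message per cache line
        uint64_t op;
        uint64_t payload[6];
        std::atomic<uint64_t> seq{0};       // written last: publishes the message
    };

    struct Channel {                        // one direction, one sender, one receiver
        static constexpr size_t kSlots = 64;
        Message ring[kSlots];
        uint64_t send_next = 0;             // touched only by the sending core
        uint64_t recv_next = 0;             // touched only by the receiving core

        void send(uint64_t op, const uint64_t* payload) {   // (full-ring check omitted)
            Message& m = ring[send_next % kSlots];
            m.op = op;
            for (int i = 0; i < 6; i++) m.payload[i] = payload[i];
            m.seq.store(++send_next, std::memory_order_release);
        }
        bool try_recv(uint64_t& op, uint64_t* payload) {
            Message& m = ring[recv_next % kSlots];
            if (m.seq.load(std::memory_order_acquire) != recv_next + 1)
                return false;               // nothing new yet
            op = m.op;
            for (int i = 0; i < 6; i++) payload[i] = m.payload[i];
            ++recv_next;
            return true;
        }
    };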

An Analysis of Linux Scalability to Many Cores

Who?

MIT CSAIL, OSDI’10

Problem

If so, is a traditional OS such as Linux scalable enough on many cores?

Solution

Tested Linux scalability with 7 applications on a 48-core Intel machine

No fundamental kernel scalability problems up to 48 cores

Patches totaling 3002 lines of code

Sloppy counters: replicated reference counters
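
A small user-level sketch of a sloppy counter (thread-local state stands in for per-core state here; the kernel version also has to reconcile the spare references when the counted object is torn down):

    #include <atomic>

    constexpr long kBatch = 16;                 // spare references taken per refill

    std::atomic<long> central{0};               // shared; an upper bound on live refs
    thread_local long spare = 0;                // references pre-charged to this thread

    void counter_get() {                        // take one reference
        if (spare == 0) {
            central.fetch_add(kBatch);          // the only shared write, amortized
            spare = kBatch;
        }
        --spare;
    }

    void counter_put() {                        // drop one reference
        ++spare;
        if (spare > 2 * kBatch) {               // holding too many spares: give back
            central.fetch_sub(kBatch);
            spare -= kBatch;
        }
    }

Most get/put pairs touch only the thread-local count, so the cache line holding the shared counter stops bouncing between cores.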

Supporting GPU in a Virtualized Environment

HyVM (Hybrid Virtual Machines)

Who?

Georgia Tech

Problem

Asymmetries in performance, memory and cache

Functional differences

Multiple accelerators

Vector processor

Floating point

Additional instructions for acceleration

Solution

heterogeneity- and asymmetry-aware hypervisors

HyVM (cont’d)

Solution (cont’d)

[Figures: HyVM architecture; GViM GPU virtualization architecture; memory management in GViM; Harmony CPU/GPU co-scheduling]

VMGL (Virtualizing OpenGL)

Who?

University of Toronto, VEE’07

Problem

How to support OpenGL in a virtual machine environment

Solution

Forward OpenGL commands to the driver domain

Utilizing GPU in Middleware

StoreGPU

Who?

University of British Columbia, HPDC'10

Problem

In CAS (Content-Addressable Storage), how to minimize the hash calculation cost

Solution

Offloading to GPU

StoreGPU Architecture
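
The offloading pattern is one GPU thread per data chunk, each producing a content fingerprint. In the sketch below (mine, not StoreGPU's code), a toy 64-bit FNV-1a hash stands in for the cryptographic hashes (e.g. MD5/SHA-1) a CAS system would actually use.

    #include <cstdint>

    // One thread hashes one fixed-size chunk; thousands of chunks are hashed
    // in parallel after the data is copied to the GPU.
    __global__ void hash_chunks(const uint8_t* data, size_t chunk_size,
                                int num_chunks, uint64_t* fingerprints) {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c >= num_chunks) return;

        const uint8_t* chunk = data + (size_t)c * chunk_size;
        uint64_t h = 0xcbf29ce484222325ull;        // FNV-1a offset basis
        for (size_t i = 0; i < chunk_size; i++) {
            h ^= chunk[i];
            h *= 0x100000001b3ull;                 // FNV-1a prime
        }
        fingerprints[c] = h;                       // content address of chunk c
    }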

PacketShader

Who?

KAIST, SIGCOMM’10, NSDI’11

Problem

How to boost the performance of a software router

Solution

Offload stateless (parallelizable) packet processing to GPU

[Figures: PacketShader architecture; basic workflow of PacketShader]
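
The stateless, data-parallel part maps naturally to one GPU thread per packet over a large batch. The sketch below (my illustration, not PacketShader's code) recomputes the IPv4 header checksum for every packet in a batch that has been copied to the GPU.

    #include <cstdint>

    __global__ void ipv4_checksum(uint8_t* pkts, int pkt_stride, int num_pkts) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= num_pkts) return;

        uint8_t* ip = pkts + (size_t)p * pkt_stride;    // IPv4 header of packet p
        int hdr_bytes = (ip[0] & 0x0F) * 4;             // IHL field, in bytes

        uint32_t sum = 0;
        for (int i = 0; i < hdr_bytes; i += 2) {
            if (i == 10) continue;                      // skip the checksum field
            sum += ((uint32_t)ip[i] << 8) | ip[i + 1];  // 16-bit big-endian words
        }
        while (sum >> 16) sum = (sum & 0xFFFF) + (sum >> 16);  // fold carries

        uint16_t csum = (uint16_t)~sum;
        ip[10] = (uint8_t)(csum >> 8);                  // write back in network order
        ip[11] = (uint8_t)(csum & 0xFF);
    }

The throughput in the paper comes from processing packets in large batches so that PCIe copies and GPU computation overlap with packet RX/TX.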

Conclusion
