31
System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

Embed Size (px)

Citation preview

Page 1: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

System Software for Big Data ComputingCho-Li WangThe University of Hong Kong

Page 2: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

2CS Gideon-II & CC MDRP Clusters

HKU High-Performance Computing Lab. Total # of cores: 3004 CPU + 5376 GPU

coresRAM Size: 8.34 TBDisk storage: 130 TB Peak computing power: 27.05 TFlops

0

5

10

15

20

25

30

35

2007. 7 2009 2010 2011. 1

2007. 7200920102011. 12.6T 3.1T

31.45TFlops (X12 in 3.5 years)

20T

GPU-Cluster (Nvidia M2050, “Tianhe-1a”): 7.62 Tflops

Page 3: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

Big Data: The "3Vs" Model

• High Volume (amount of data) • High Velocity (speed of data in and out)• High Variety (range of data types and sources)

2.5 x 1018

2010: 800,000 petabytes (would fill a stack of DVDs reaching from the earth to the moon and back)

By 2020, that pile of DVDs would stretch half way to Mars.

Page 4: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

Our Research

Page 5: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

(1) Heterogeneous Manycore Computing (CPUs+ GUPs)

JAPONICA : Java with Auto-Parallelization ON GraphIcs Coprocessing Architecture

Page 6: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

6Heterogeneous Manycore Architecture

GPU

CPUs

Page 7: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

New GPU & Coprocessors

7

Vendor ModelLaunch

DateFab. (nm)

#Accelerator Cores (Max.)

GPU Clock (MHz)

TDP (watts)

MemoryBandwidth

(GB/s)Programming

ModelRemarks

Intel

Sandy Bridge

2011Q1 3212 HD graphics

3000 EUs (8 threads/EU)

850 – 1350 95

L3: 8MBSys mem (DDR3)

21

OpenCLBandwidth is system

DDR3 memory bandwidthIvy

Bridge2012Q2 22

16 HD graphics 4000 EUs (8 threads/EU)

650 – 1150 77

L3: 8MBSys mem (DDR3)

25.6

Xeon Phi

2012H2 2260 x86 cores(with a 512-bit vector unit)

600-1100 300 8GB

GDDR5 320OpenMP#, OpenCL*,

OpenACC%

Less sensitive to branch divergent workloads

AMD

Brazos 2.0

2012Q2 40 80 Evergreen shader cores 488-680 18

L2: 1MBSys mem (DDR3)

21OpenCL, C+

+AMPTrinity 2012Q2 32

128-384 Northern

Islands cores723-800 17-100

L2: 4MBSys mem(DDR3)

25 APU

Nvidia

Fermi 2010Q1 40512 Cuda

cores(16 SMs)

1300 238L1: 48KB

L2: 768KB6GB

148CUDA, OpenCL,

OpenACCKepler(GK110)

2012Q4 28 2880 Cuda cores 836/876 300 6GB

GDDR5 288.5 3X Perf/Watt, Dynamic Parallelism, HyperQ

Page 8: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

• 18,688 AMD Opteron 6274 16-core CPUs (32GB DDR3) .

• 18,688 Nvidia Tesla K20X GPUs

• Total RAM size: over 710 TB

• Total Storage: 10 PB.

• Peak Performance: 27 Petaflop/s o GPU: CPU = 1.311 TF/s: 0.141 TF/s = 9.3 :

1

• Linpack: 17.59 Petaflop/s

• Power Consumption: 8.2 MW

8

#1 in Top500 (11/2012):Titan @ Oak Ridge National Lab.

NVIDIA Tesla K20X (Kepler GK110) GPU: 2688 CUDA cores

Titan compute board: 4 AMD Opteron + 4 NVIDIA Tesla K20X GPUs

Page 9: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

Design Challenge: GPU Can’t Handle Dynamic Loops

99

Dynamic loops

for(i=0; i<N; i++){ C[i] = A[i] + B[i];}

for(i=0; i<N; i++){ A[ w[i] ] = 3 * A[ r[i] ];}

GPU = SIMD/Vector Data Dependency Issues (RAW, WAW) Solutions?

Static loops

Non-deterministic data dependencies inhibit exploitation of inherent parallelism; only DO-ALL loops or embarrassingly parallel workload gets admitted to GPUs.

Page 10: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

Dynamic loops are common in scientific and engineering applications

10Source: Z. Shen, Z. Li, and P. Yew, "An Empirical Study on Array Subscripts and Data Dependencies"

Page 11: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

GPU-TLS : Thread-level Speculation on GPU• Incremental parallelization

o sliding window style execution.

• Efficient dependency checking schemes

• Deferred updateo Speculative updates are stored in the write buffer of

each thread until the commit time.

• 3 phases of execution

11

intra-thread RAWvalid inter-thread RAW in GPU

GPU: lock-step execution in the same warp (32 threads per warp).

true inter-thread RAW

Page 12: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

JAPONICA : Profile-Guided Work Dispatching

12

2880 cores

64 x86 cores

8 high-speed x86 cores

Dynamic Profiling

HighMedium

Low/None

Highly parallel

Multi-core CPU

Massively parallelParallel

Many-core coprocessors

Scheduler

Dependency density

Inter-iteration dependence: -- Read-After-Write (RAW) -- Write-After-Read (WAR)-- Write-After-Write (WAW)

Page 13: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

JAPONICA : System ArchitectureSequential Java Code with user

annotation

JavaRStatic Dep.

Analysis

Code Translatio

n

Uncertain

Dependency Density Analysis

Intra-warp Dep.

Check

Inter-warp Dep.

Check

DO-ALL Parallelizer Speculator

GPUCPUCommunicatio

n

CPU-Multi

threads

GPU-Many

threads

GPU-TLS Privatization

Task Sharing

CPU queue: low, high, 0

GPU queue: low, 00 : CPU multithreads + GPU

Low DD : CPU+GPU-TLS

High DD : CPU single core

Profiling Results

CUDA kernels with GPU-TLS| Privatization & CPU Single-thread

CUDA kernels & CPU Multi-threads

No dependence RAW WAW/WAR

13

Profiler (on GPU)

Program Dependence Graph (PDG)one loop

Task Stealing

Task Scheduler : CPU-GPU Co-Scheduling Assign the tasks

among CPU & GPU according to their dependency density (DD)

Page 14: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

(2) Crocodiles: Cloud Runtime with Object Coherence On Dynamic tILES”

Page 15: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

“General Purpose” Manycore

Tile-based architecture: Cores are connected through a 2D network-on-a-chip

Page 16: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

鳄鱼 @ HKU (01/2013-12/2015)

• Crocodiles: Cloud Runtime with Object Coherence On Dynamic tILES for future 1000-core tiled processors”

16

Mem

ory

Con

troll

er

Mem

ory

Con

troll

er

PCI-EPCI-E

ZONE 2

ZONE 1

ZONE 3

ZONE 4

PCI-EPCI-E

Gb

EG

bE

DRAM ControllerDRAM Controller

Memory ControllerMemory Controller

GbEGbE

GbEGbEP

CI-

EP

CI-

E

Mem

ory

Con

troll

er

Mem

ory

Con

troll

er

Gb

EG

bE

PC

I-E

PC

I-E

RAM

RAM

RAM

RAM

Page 17: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

• Dynamic Zoning o Multi-tenant Cloud Architecture Partition varies

over time, mimic “Data center on a Chip”.o Performance isolationo On-demand scaling. o Power efficiency (high flops/watt).

Page 18: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

Design Challenge: “Off-chip Memory Wall” Problem

– DRAM performance (latency) improved slowly over the past 40 years.

(a) Gap of DRAM Density & Speed (b) DRAM Latency Not Improved

Memory density has doubled nearly every two years, while performance has improved slowly (e.g. still 100+ of core clock cycles per memory access)

Page 19: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

• Physical memory allocation performance sorted by function. As more cores are added more processing time is spent contending for locks.

Lock Contention in Multicore System

Lock Contention

Page 20: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

Exim on Linux

collapse

Kernel CPU time (milliseconds/message)

Page 21: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

Challenges and Potential Solutions

21

• Cache-aware designo Data Locality/Working Set getting critical! o Compiler or runtime techniques to improve data reuse

• Stop multitasking o Context switching breaks data locality o Time Sharing Space Sharing

马其顿方阵众核操作系统 : Next-generation Operating System for 1000-core processor

Page 22: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

Thanks!

C.L. Wang’s webpage: http://www.cs.hku.hk/~clwang/

For more information:

Page 23: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

http://i.cs.hku.hk/~clwang/recruit2012.htm

Page 24: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

Multi-granularity Computation Migration

24

WAVNet Desktop Cloud

G-JavaMPI

JESSICA2

Fine

Coarse

Small Large

SOD

Granularity

System scale(Size of state)

Page 25: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

WAVNet: Live VM Migration over WAN

25

A P2P Cloud with Live VM Migration over WAN “Virtualized LAN” over the Internet”

High penetration via NAT hole punching Establish direct host-to-host connection Free from proxies, able to traverse most NATs

VM

VM

Key Members

Zheming Xu, Sheng Di, Weida Zhang, Luwei Cheng, and Cho-Li Wang, WAVNet: Wide-Area Network Virtualization Technique for Virtual Private Cloud, 2011 International Conference on Parallel Processing (ICPP2011)

Page 26: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

26

WAVNet: Experiments at Pacific Rim Areas

北京高能物理所 IHEP, Beijing

深圳先进院 (SIAT)

香港大学 (HKU)

中央研究院(Sinica, Taiwan)

静宜大学 (Providence University)

SDSC, San Diego

日本产业技术综合研究所 (AIST, Japan)

26

Page 27: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

27 27

Thread Migration

JESSICA2JVM

A Multithreaded Java Program

JESSICA2JVM

JESSICA2JVM

JESSICA2JVM

JESSICA2JVM

JESSICA2JVM

Master Worker Worker

Worker

JIT Compiler Mode

Portable Java Frame

Java

Enabled

Single

System

Image

Computing

Architecture

JESSICA2: Distributed Java Virtual Machine

Page 28: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

History and Roadmap of JESSICA Project

• JESSICA V1.0 (1996-1999)– Execution mode: Interpreter Mode– JVM kernel modification (Kaffe JVM)– Global heap: built on top of TreadMarks (Lazy

Release Consistency + homeless)• JESSICA V2.0 (2000-2006)

– Execution mode: JIT-Compiler Mode– JVM kernel modification– Lazy release consistency + migrating-home

protocol• JESSICA V3.0 (2008~2010)

– Built above JVM (via JVMTI)– Support Large Object Space

• JESSICA v.4 (2010~)– Japonica : Automatic loop parallization and

speculative execution on GPU and multicore CPU– TrC-DC : a software transactional memory

system on cluster with distributed clocks (not discussed)

28

King Tin LAM,

Ricky MaKinson Chan

Chenggang Zhang

Past Members

J1 and J2 received a total of 1107 source code downloads

Page 29: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

29

Mobile node

Program Counter

Method Area

Heap Area

Stack frame A

Method Area

Heap Area Rebuilt

Stack frame A

Stack frame A

Method

Cloud node

objects

Local variables

Local variables

Local variables

Stack frame B

Object (Pre-)fetching

Program Counter

Program Counter

Stack-on-Demand (SOD)

Page 30: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

Elastic Execution Model via SOD

30

(a) “Remote Method Call”

(b) Mimic thread migration

(c) “Task Roaming”: like a mobile agent roaming over

the network or workflow

With such flexible or composable execution paths, SOD enables agile and elastic exploitation of distributed resources (storage), a Big Data Solution !

Lightweight, Portable, Adaptable

Page 31: System Software for Big Data Computing Cho-Li Wang The University of Hong Kong

Xen VM

JVM

Xen-aware host OS

guest OS

Xen VM

JVM

guest OS

Desktop PC

Overloaded

Load balancer

Load balancer

Cloud service provider

Thread migration

(JESSICA2)

Internet

Live migration

Load balancer

Load balancer

comm.

JVM

Stack-on-demand (SOD)

Mobile client

iOS

Stack segments

Partial Heap

Method Area

Code

Small footprint

Stacks Heap

JVM process

Method Area

Code

… …

Multi-thread Java process

trigger live migration

duplicate VM instances for scaling

eXCloud : Integrated Solution for Multi-granularity

MigrationRicky K. K. Ma, King Tin Lam, Cho-Li Wang, "eXCloud: Transparent Runtime Support for Scaling Mobile Applications," 2011 IEEE International Conference on Cloud and Service Computing (CSC2011),. (Best Paper Award)