
Parallel Computer Architecture Concepts

TDDD93 Lecture 1

Christoph Kessler

PELAB / IDA Linköping university

Sweden

2017

2

Outline

Lecture 1: Parallel Computer Architecture Concepts
• Parallel computer, multiprocessor, multicomputer
• SIMD vs. MIMD execution
• Shared memory vs. distributed memory architecture
• Interconnection networks
• Parallel architecture design concepts
  - Instruction-level parallelism
  - Hardware multithreading
  - Multi-core and many-core
  - Accelerators and heterogeneous systems
  - Clusters
• Implications for programming and algorithm design

3

Traditional Use of Parallel Computing: Large-Scale HPC Applications

NSC Triolith

• High Performance Computing (HPC)
  - Much computational work (in FLOPs, floating-point operations)
  - Often, large data sets
  - E.g. climate simulations, particle physics, engineering, sequence matching or protein docking in bioinformatics, …
• Single-CPU computers and even today’s multicore processors cannot provide such massive computation power
• Aggregate LOTS of computers → Clusters
  - Need scalable parallel algorithms
  - Need to exploit multiple levels of parallelism

4

Traditional Use of Parallel Computing in HPC

5

Example: Weather Forecast

Per cell:
• Air pressure
• Temperature
• Humidity
• Sun radiation
• Wind direction
• Wind velocity
• …

(very simplified…)

• 3D space discretization (cells)
• Time discretization (steps)
• Start from current observations (sent from weather stations)
• Simulation step by evaluating weather model equations (see the sketch below)
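
To make the time-stepping structure concrete, here is a minimal C++ sketch (an illustration only, not from the lecture; the grid sizes, the single field temp and the neighbor-averaging update are hypothetical stand-ins for the real model variables and equations):

    #include <vector>
    #include <utility>

    // One scalar field (say, temperature) on an NX x NY x NZ grid of cells,
    // stored as a flat vector; idx() maps 3D cell coordinates to an index.
    const int NX = 64, NY = 64, NZ = 32, STEPS = 100;
    inline int idx(int x, int y, int z) { return (z * NY + y) * NX + x; }

    void simulate(std::vector<double>& temp) {
        std::vector<double> next(temp);
        for (int t = 0; t < STEPS; ++t) {            // time discretization (steps)
            next = temp;                             // keep boundary cells fixed (simplified)
            for (int z = 1; z < NZ - 1; ++z)         // 3D space discretization (cells)
                for (int y = 1; y < NY - 1; ++y)
                    for (int x = 1; x < NX - 1; ++x)
                        // placeholder "model equation": average of the six neighbor cells
                        next[idx(x,y,z)] = ( temp[idx(x-1,y,z)] + temp[idx(x+1,y,z)]
                                           + temp[idx(x,y-1,z)] + temp[idx(x,y+1,z)]
                                           + temp[idx(x,y,z-1)] + temp[idx(x,y,z+1)] ) / 6.0;
            std::swap(temp, next);                   // advance to the next time step
        }
    }

Each cell update needs only its own neighbors, so the work of one time step can be split across the cells of the grid – exactly the kind of regular data parallelism that the rest of the lecture is about.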

6

More Recent Use of Parallel Computing: Big-Data Analytics Applications

• Big Data Analytics
  - Data access intensive (disk I/O, memory accesses)
    - Typically, very large data sets (GB … TB … PB … EB …)
  - Also some computational work for combining/aggregating data
  - E.g. data center applications, business analytics, click stream analysis, scientific data analysis, machine learning, …
  - Soft real-time requirements on interactive queries
• Single-CPU and multicore processors cannot provide such massive computation power and I/O bandwidth+capacity
• Aggregate LOTS of computers → Clusters
  - Need scalable parallel algorithms
  - Need to exploit multiple levels of parallelism
  - Fault tolerance

NSC Triolith


7

HPC vs Big-Data Computing

• Both need parallel computing
• Same kind of hardware – clusters of (multicore) servers
• Same OS family (Linux)
• Different programming models, languages, and tools

HPC stack: HPC application; HPC prog. languages: Fortran, C/C++ (Python); Scientific computing libraries: BLAS, …; Par. programming models: MPI, OpenMP, … (a minimal MPI sketch follows below); OS: Linux; HW: Cluster.

Big-Data stack: Big-Data application; Big-Data prog. languages: Java, Scala, Python, …; Big-data storage/access: HDFS, …; Par. programming models: MapReduce, Spark, …; OS: Linux; HW: Cluster.

→ Let us start with the common basis: parallel computer architecture
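
To make the MPI layer of the HPC stack concrete, here is a minimal message-passing sketch in C/C++ (a generic example, not course code; it only assumes a standard MPI installation, e.g. compiled with mpicc and started with mpirun):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);                  // start the MPI runtime on all processes
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process's id
        MPI_Comm_size(MPI_COMM_WORLD, &size);    // total number of processes in the job

        int token = 42;
        if (rank == 0 && size > 1) {             // process 0 sends a message to process 1
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %d from rank 0\n", token);
        }
        MPI_Finalize();
        return 0;
    }

Each process runs the same program on its own node (or core) with its own memory; all data exchange is explicit, which is what distinguishes this model from the shared-memory programming discussed later.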

8

Parallel Computer

9

Parallel Computing for HPC Applications

• High Performance Computing
  - Much computational work (in FLOPs, floating-point operations)
  - Often, large data sets
  - E.g. climate simulations, particle physics, engineering, sequence matching or protein docking in bioinformatics, …
• Single-CPU computers and even today’s multicore processors cannot provide such massive computation power
• Aggregate LOTS of computers → Clusters
  - Need scalable algorithms
  - Need to exploit multiple levels of parallelism

NSC Triolith

10

Parallel Computer Architecture Concepts

Classification of parallel computer architectures:
• by control structure
• by memory organization
  - in particular, distributed memory vs. shared memory
• by interconnection network topology

11

Classification by Control Structure

(Figure: control-structure diagrams; one with a single operation unit op, one with several operation units op1 … op4.)

12

Classification by Memory Organization

Most common today in HPC and Data centers:

Hybrid Memory System
• Cluster (distributed memory) of hundreds or thousands of shared-memory servers, each containing one or several multi-core CPUs

NSC Triolith

Distributed memory: e.g. a (traditional) HPC cluster. Shared memory: e.g. a multiprocessor (SMP) or a computer with a standard multicore CPU.


13

Hybrid (Distributed + Shared) Memory

(Figure: hybrid memory organization – each node has its own memory M; the nodes are connected by a network. A programming sketch for such a machine follows below.)
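
A common (hedged, illustrative) way to program such a hybrid machine is to combine the two HPC models listed earlier: MPI processes across the distributed-memory nodes, OpenMP threads within each shared-memory node. A minimal sketch:

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);                    // typically one MPI process per cluster node
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local_sum = 0.0;
        #pragma omp parallel for reduction(+: local_sum)   // threads share this node's memory
        for (int i = 0; i < 1000000; ++i)
            local_sum += 1.0 / (1.0 + i);          // some per-node work

        double global_sum = 0.0;                   // combine the node results by message passing
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) std::printf("sum = %f\n", global_sum);

        MPI_Finalize();
        return 0;
    }

Shared memory is used inside a node, message passing between nodes – mirroring the hybrid memory organization of the machine.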

14

Interconnection Networks (1)

• Network = physical interconnection medium (wires, switches) + communication protocol
  (a) connecting cluster nodes with each other (DMS)
  (b) connecting processors with memory modules (SMS)

Classification
• Direct / static interconnection networks
  - connecting nodes directly to each other
  - Hardware routers (communication coprocessors) can be used to offload processors from most communication work
• Switched / dynamic interconnection networks

(Figure: node consisting of a processor P and a hardware router R.)

15

Interconnection Networks (2): Simple Topologies

(Figures: simple topologies of processor nodes P, among them a fully connected network.)

16

Interconnection Networks (3): Fat-Tree Network

• Tree network extended for higher bandwidth (more switches, more links) closer to the root
  - avoids bandwidth bottleneck
• Example: InfiniBand network (www.mellanox.com)

17

More about Interconnection Networks

• Hypercube, Crossbar, Butterfly, Hybrid networks, … → TDDC78
• Switching and routing algorithms
• Discussion of interconnection network properties (a worked example follows below)
  - Cost (#switches, #lines)
  - Scalability (asymptotically, cost grows not much faster than #nodes)
  - Node degree
  - Longest path (→ latency)
  - Accumulated bandwidth
  - Fault tolerance (worst-case impact of node or switch failure)
  - …
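
As a worked example of these properties (an illustration, not from the slides): a hypercube with p = 2^d nodes has node degree d = log2 p, longest path (diameter) log2 p, and (p/2)·log2 p links; for p = 64 that is degree 6, diameter 6 and 192 links. An 8×8 2D mesh with the same 64 nodes needs only 112 links and node degree at most 4, but its diameter grows to 2·(sqrt(p) − 1) = 14, illustrating the cost vs. latency trade-off behind the properties listed above.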

18

Instruction-Level Parallelism (1): Pipelined Execution Units


19

SIMD Computing with Pipelined Vector Units

• e.g., vector supercomputers: Cray (1970s, 1980s), Fujitsu, …

20

Instruction-Level Parallelism (2): VLIW and Superscalar

• Multiple functional units in parallel
• 2 main paradigms:
  - VLIW (very long instruction word) architecture
    - Parallelism is explicit, programmer/compiler-managed (hard)
  - Superscalar architecture
    - Sequential instruction stream
    - Hardware-managed dispatch
    - power + area overhead
• ILP in applications is limited
  - typically < 3…4 instructions can be issued simultaneously
  - due to control and data dependences in applications (see the sketch below)
• Solution: multithread the application and the processor
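
A small C++ illustration of why data dependences limit ILP (a generic example, not from the slides): in the first loop every addition needs the result of the previous one, forming a serial chain that a superscalar core cannot overlap; in the second, the four partial sums are independent, so several additions can be in flight at once.

    double sum_dependent(const double* a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += a[i];              // each add depends on the previous value of s
        return s;
    }

    // Assumes n is a multiple of 4, for brevity.
    double sum_independent(const double* a, int n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < n; i += 4) {
            s0 += a[i];             // the four accumulators have no dependences
            s1 += a[i + 1];         // on each other, so a superscalar (or VLIW)
            s2 += a[i + 2];         // core can issue several adds per cycle
            s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }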

21

Hardware Multithreading

(Figure: hardware-multithreaded processors P; a stall, e.g. due to a data dependence, is hidden by switching to another hardware thread.)

22

SIMD Instructions

• “Single Instruction stream, Multiple Data streams”
  - single thread of control flow
  - restricted form of data parallelism
    - apply the same primitive operation (a single instruction) in parallel to multiple data elements stored contiguously
  - SIMD units use long “vector registers”
    - each holding multiple data elements
• Common today
  - MMX, SSE, SSE2, SSE3, …
  - AltiVec, VMX, SPU, …
• Performance boost for operations on shorter data types
• Area- and energy-efficient
• Code to be rewritten (SIMDized) by programmer or compiler (see the sketch below)
• Does not help (much) for memory bandwidth

(Figure: a SIMD unit applying one operation op to all elements of a “vector register”.)
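
As a concrete (hedged) example of SIMDized code, here is a sketch using x86 AVX intrinsics, a later member of the MMX/SSE family named above; one instruction adds eight floats held in a 256-bit vector register:

    #include <immintrin.h>   // x86 AVX intrinsics (requires an AVX-capable CPU and -mavx)

    void add_simd(const float* a, const float* b, float* c, int n) {
        int i = 0;
        for (; i + 8 <= n; i += 8) {                        // 8 contiguous floats per step
            __m256 va = _mm256_loadu_ps(a + i);             // load into a vector register
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb)); // one SIMD add for 8 elements
        }
        for (; i < n; ++i)                                  // scalar tail for the remainder
            c[i] = a[i] + b[i];
    }

A vectorizing compiler can often generate the same code from the plain scalar loop, which is the "by compiler" path mentioned above.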

23

The Memory Wall

• Performance gap CPU – memory
• Memory hierarchy
• Increasing cache sizes shows diminishing returns
  - Costs power and chip area
    - GPUs spend the area instead on many simple cores with little memory
  - Relies on good data locality in the application
• What if there is no / little data locality?
  - Irregular applications, e.g. sorting, searching, optimization, …
• Solution: spread out / overlap memory access delay
  - Programmer/compiler: prefetching (see the sketch below), on-chip pipelining, SW-managed on-chip buffers
  - Generally: hardware multithreading, again!
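
As one (hedged) example of programmer-directed prefetching, the __builtin_prefetch intrinsic available in GCC and Clang can request data for a later iteration while the current elements are being processed, overlapping memory access delay with computation; the look-ahead distance of 16 elements below is an arbitrary illustrative choice:

    double sum_with_prefetch(const double* a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16]);  // ask for a future cache line early
            s += a[i];                           // work on the current element meanwhile
        }
        return s;
    }

For a regular loop like this the hardware prefetcher usually does the job by itself; explicit prefetching pays off mainly for the irregular access patterns mentioned above.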

24

Moore’s Law (since 1965)

Exponential increase in transistor density

Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović


25

The Power Issue

• Power = static (leakage) power + dynamic (switching) power
• Dynamic power ~ Voltage² · Clock frequency,
  where clock frequency is approx. ~ voltage
  → Dynamic power ~ Frequency³ (a small worked example follows below)
• Total power ~ #processors
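
A quick consequence of this approximation (illustrative arithmetic, not from the slides): replacing one core running at frequency f by two cores at f/2 keeps the same aggregate instruction throughput, but the dynamic power drops to roughly 2 · (1/2)³ = 1/4 of the original, assuming the work parallelizes perfectly and ignoring static power. This is the arithmetic behind the shift to multicore on the following slides.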

26

Moore’s Law vs. Clock Frequency

• #Transistors / mm² still growing exponentially according to Moore’s Law

• Clock speed flattening out

(Graph annotation: since about 2003, clock speeds have leveled off at roughly 3 GHz.)

More transistors + Limited frequency ⇒ More cores

27

Moore’s Law: Still running after 50 years…


28

Conclusion: Moore’s Law Continues, But ...

(Chart: "Single-processor Performance Scaling" – log2 speedup over time across the 90 nm, 65 nm, 45 nm, 32 nm and 22 nm technology generations; contributions labeled device speed, pipelining, RISC/CISC CPI and parallelism; limits labeled "Clock rate" and "RISC ILP"; historical throughput increase 55%/year, with an assumed 17%/year increase still possible. Source: Doug Burger, UT Austin 2005.)

29

Solution for CPU Design: Multicore + Multithreading

• Single-thread performance does not improve any more
  - ILP wall
  - Memory wall
  - Power wall
• But we can put more cores on a chip
  - And hardware-multithread the cores to hide memory latency
  - All major chip manufacturers produce multicore CPUs today

30

Main features of a multicore system

• There are multiple computational cores on the same chip.
• The cores might have (small) private on-chip memory modules and/or access to on-chip memory shared by several cores.
• The cores have access to a common off-chip main memory.
• There is a way by which these cores communicate with each other and/or with the environment.


31

Standard CPU Multicore Designs

• Standard desktop/server CPUs have a few cores with shared off-chip main memory
  - On-chip cache (typically 2 levels)
    - L1 cache mostly core-private
    - L2 cache often shared by groups of cores
  - Memory access interface shared by all or groups of cores
• Caching → multiple copies of the same data item
• Writing to one copy (only) causes inconsistency
• Shared memory coherence mechanism to enforce automatic updating or invalidation of all copies around
→ More about shared-memory architecture, caches, data locality, consistency issues and coherence protocols in TDDC78/TDDD56

(Diagram: four cores, each with a private L1$; pairs of cores share an L2$; an interconnect / memory interface connects them to off-chip main memory (DRAM).)

32

Some early dual-core CPUs (2004/2005)

(Diagrams: block diagrams of three early dual-core CPUs – AMD Opteron dual-core (2005), IBM Power5 (2004, with SMT), and Intel Xeon dual-core (2005) – each showing two cores P0/P1 with L1 instruction and data caches, L2 cache(s), a memory controller and main memory.)

$ = “cache”, L1$ = “level-1 instruction cache”, D1$ = “level-1 data cache”, L2$ = “level-2 cache” (unified)

33

SUN/Oracle SPARC T Niagara (8 cores)

(Diagram: Sun UltraSPARC “Niagara” – 8 cores P0 … P7, each with L1 instruction and data caches, a shared L2$, and four memory controllers to main memory.)

• Niagara T1 (2005): 8 cores, 32 HW threads
• Niagara T2 (2008): 8 cores, 64 HW threads
• Niagara T3 (2010): 16 cores, 128 HW threads
• T5 (2012): 16 cores, 128 HW threads

34

SUN / Oracle SPARC-T5 (2012)

• 28 nm process, 16 cores × 8 HW threads
• L3 cache on-chip
• On-die accelerators for common encryption algorithms

35

Scaling Up: Network-on-Chip

• Cache-coherent shared memory (hardware-controlled) does not scale well to many cores
  - power- and area-hungry
  - signal latency across the whole chip
  - not well predictable access times
• Idea: NCC-NUMA – non-cache-coherent, non-uniform memory access
  - physically distributed on-chip [cache] memory
  - on-chip network connecting PEs or coherent “tiles” of PEs
  - global shared address space,
    but software responsible for maintaining coherence
• Examples:
  - STI Cell/B.E.
  - Tilera TILE64
  - Intel SCC, Kalray MPPA

36

Example: Cell/B.E. (IBM/Sony/Toshiba 2006)

• An on-chip network (four parallel unidirectional rings) interconnects the master core, the slave cores and the main memory interface
• LS = local on-chip memory, PPE = master, SPE = slave


37

Towards Many-Core CPUs...

• Intel, second-generation many-core research processor: 48-core (in-order x86) SCC “Single-chip Cloud Computer”, 2010
  - No longer fully cache-coherent over the entire chip
  - MPI-like message passing over 2D mesh network on chip

Source: Intel

38

Towards Many-Core Architectures

• Tilera TILE64 (2007): 64 cores, 8×8 2D-mesh on-chip network
• 1 tile: VLIW processor + cache + router

(Image, simplified: grid of tiles, each with a processor P, cache C and router R; memory controllers and I/O interfaces at the edges of the mesh.)

39

Kalray MPPA-256

• 16 tiles with 16 VLIW compute cores each, plus 1 control core per tile
• Message passing network on chip
• Virtually unlimited array extension by clustering several chips
• 28 nm CMOS technology
• Low power dissipation, typ. 5 W

Image source: Kalray

40

Intel Xeon Phi (since late 2012)

• Up to 61 cores, 244 HW threads, 1.2 Tflops peak performance
• Simpler x86 (Pentium) cores (× 4 HW threads each), with 512-bit wide SIMD vector registers
• Can also be used as a coprocessor, instead of a GPU

41

”General-purpose” GPUs

• Main GPU providers for laptop/desktop: Nvidia, AMD (ATI), Intel
• Example: NVIDIA’s 10-series GPU (Tesla, 2008) has 240 cores
• Each core has a
  - Floating-point / integer unit
  - Logic unit
  - Move, compare unit
  - Branch unit
• Cores managed by thread manager
  - Thread manager can spawn and manage 30,000+ threads
  - Zero-overhead thread switching

Nvidia Tesla C1060: 933 GFlops

Source: NVidia

(Images removed)

42

Nvidia Fermi (2010): 512 cores

(Diagram, from chip to core:
 - 1 Fermi C2050 GPU contains multiple “shared-memory multiprocessors” (SMs) sharing an L2 cache;
 - 1 SM contains I-cache, scheduler, dispatch, register file, 32 streaming processors (cores), load/store units, special function units, and 64 KB configurable L1 cache / shared memory;
 - 1 streaming processor (SP) contains an FPU and an integer unit.)


43

GPU Architecture Paradigm

• Optimized for high throughput
  - In theory, ~10x to ~100x higher throughput than a CPU is possible
• Massive hardware multithreading hides memory access latency
• Massive parallelism
• GPUs are good at data-parallel computations
  - multiple threads executing the same instruction on different data, preferably located adjacently in memory

44

The future will be heterogeneous!

Need 2 kinds of cores – often on the same chip:
• For non-parallelizable code:
  Parallelism only from running several serial applications simultaneously on different cores (e.g. on a desktop: word processor, email, virus scanner, … not much more)
  → Few (ca. 4-8) “fat” cores (power-hungry, area-costly, caches, out-of-order issue, …) for high single-thread performance
• For well-parallelizable code (or on cloud servers):
  → run on hundreds of simple cores (power- and area-efficient, GPU-/SCC-like)

45

Heterogeneous / Hybrid Multi-/Manycore

Key concept: master-slave parallelism, offloading

• A general-purpose CPU (master) processor controls execution of slave processors by submitting tasks to them and transferring operand data to the slaves’ local memory
  → The master offloads computation to the slaves (see the sketch below)
• Slaves are often optimized for heavy throughput computing
  - The master could do something else while waiting for the result, or switch to a power-saving mode
• Master and slave cores might reside on the same chip (e.g., Cell/B.E.) or on different chips (e.g., most GPU-based systems today)
• Slaves might have access to off-chip main memory (e.g., Cell) or not (e.g., today’s GPUs)
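
One portable way to express this offloading pattern in C++ is OpenMP target offloading (a hedged sketch, not the course’s own example; it assumes OpenMP 4.x or later with a compiler configured for an accelerator target):

    #include <vector>

    // The master (CPU) maps a and b to the device's local memory, offloads the
    // data-parallel loop to the slave/accelerator, and maps the result back.
    void offload_add(const std::vector<float>& a, const std::vector<float>& b,
                     std::vector<float>& c) {
        int n = static_cast<int>(a.size());
        const float* pa = a.data();
        const float* pb = b.data();
        float* pc = c.data();
        #pragma omp target teams distribute parallel for \
                map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
        for (int i = 0; i < n; ++i)
            pc[i] = pa[i] + pb[i];   // executed by many device threads
    }

The map clauses make the operand-data transfer explicit; an asynchronous variant (e.g. with nowait) would let the master do other work while waiting, as described above.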

46

Heterogeneous / Hybrid Multi-/Manycore Systems

• Cell/B.E.
• GPU-based system:

(Diagram: a CPU with main memory and a GPU with device memory; the CPU offloads heavy computation to the GPU, with data transfer between main memory and device memory.)

47

Multi-GPU Systems

• Connect one or a few general-purpose (CPU) multicore processors with shared off-chip memory to several GPUs
• Increasingly popular in high-performance computing
  - Cost- and (quite) energy-effective if the offloaded computation fits the GPU architecture well

(Diagram: multicore CPUs with L2 caches and shared main memory (DRAM), connected to several GPUs.)

48

Reconfigurable Computing Units

• FPGA – Field-Programmable Gate Array

"Altera StratixIVGX FPGA" by Altera Corp. Licensed under CC BY 3.0 via Wikimedia Commons


49

Example: Beowulf-class PC Clusters

with off-the-shelf CPUs (Xeon, Opteron, …)

50

Cluster Example: Triolith (NSC, 2012 / 2013)

A so-called capability cluster (fast network for parallel applications, not just for lots of independent sequential jobs)

1200 HP SL230 servers (compute nodes), each equipped with 2 Intel E5-2660 (2.2 GHz Sandy Bridge) processors with 8 cores each

→ 19200 cores in total

→ Theoretical peak performance of 338 Tflop/s

Mellanox InfiniBand network (fat-tree topology)

NSC Triolith

51

The Challenge

• Today, basically all computers are parallel computers!
  - Single-thread performance stagnating
  - Dozens of cores and hundreds of HW threads available per server
  - May even be heterogeneous (core types, accelerators)
  - Data locality matters
  - Large clusters for HPC and data centers require message passing
• Utilizing more than one CPU core requires thread-level parallelism (see the sketch below)
• One of the biggest software challenges: exploiting parallelism
  - Need LOTS of (mostly independent) tasks to keep cores/HW threads busy and to overlap waiting times (cache misses, I/O accesses)
  - All application areas, not only traditional HPC
    - General-purpose, data mining, graphics, games, embedded, DSP, …
  - Affects HW/SW system architecture, programming languages, algorithms, data structures, …
  - Parallel programming is more error-prone (deadlocks, races, further sources of inefficiencies)
    - and thus more expensive and time-consuming
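
On a shared-memory server, thread-level parallelism can be expressed directly with C++ standard threads. A minimal sketch (generic, not course code) that splits a loop range into one chunk per hardware thread:

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Apply f(i) for all i in [0, n), using all available hardware threads.
    template <class Func>
    void parallel_for(int n, Func f) {
        unsigned p = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < p; ++t) {
            int begin = static_cast<int>(static_cast<long long>(n) * t / p);
            int end   = static_cast<int>(static_cast<long long>(n) * (t + 1) / p);
            workers.emplace_back([=] {           // each thread works on its own chunk
                for (int i = begin; i < end; ++i) f(i);
            });
        }
        for (auto& w : workers) w.join();        // wait for all threads to finish
    }

Used e.g. as parallel_for(n, [&](int i){ c[i] = a[i] + b[i]; }); the hard part in practice is finding enough independent work and keeping each thread’s data local, as the slide points out.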

52

Can’t the compiler fix it for us?

• Automatic parallelization?
  - at compile time:
    - Requires static analysis – not effective for pointer-based languages
    - needs programmer hints / rewriting ...
    - ok for a few benign special cases: (Fortran) loop SIMDization, extraction of instruction-level parallelism, …
  - at run time (e.g. speculative multithreading):
    - High overheads, not scalable
• More about parallelizing compilers in TDDD56 + TDDC78

53

And worse yet,

• A lot of variation/choices in hardware
  - Many will have performance implications
  - No standard parallel programming model
    - portability issue
• Understanding the hardware will make it easier to get high performance out of programs
  - Performance-aware programming gets more important also for single-threaded code
  - Adaptation leads to the portability issue again
• How to write future-proof parallel programs?

54

The Challenge

• Bad news 1: Many programmers (also less skilled ones) will need to use parallel programming in the future
• Bad news 2: There will be no single uniform parallel programming model, as we were used to in the old sequential times
  → Several competing general-purpose and domain-specific languages and their concepts will co-exist


55

What we learned in the past…

• Sequential von Neumann model: programming, algorithms, data structures, complexity
  - Sequential / few-threaded languages: C/C++, Java, ... not designed for exploiting massive parallelism

(Graph: time vs. problem size, sequential complexity T(n) = O(n log n).)

56

… and what we need now

• Parallel programming!
  - Parallel algorithms and data structures
  - Analysis / cost model: parallel time, work, cost; scalability (a worked example follows below)
  - Performance-awareness: data locality, load balancing, communication

(Graph: time vs. problem size and number of processing units used, parallel complexity T(n,p) = O((n log n)/p + log p).)
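
As a worked example of such a cost model (an illustration, not from the slides), consider summing n numbers with p processing units: each unit first adds its n/p local elements, then the p partial sums are combined in a tree of depth log2 p, giving T(n,p) = O(n/p + log p). The OpenMP reduction below expresses the same computation; how the partial results are combined is left to the runtime:

    // Parallel sum of n numbers: O(n/p) local work per thread plus the
    // cost of combining the p partial sums (at best O(log p)).
    double parallel_sum(const double* a, long n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+: sum)
        for (long i = 0; i < n; ++i)
            sum += a[i];
        return sum;
    }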

Questions?