CPE 779 Parallel Computing - Spring 2012

http://www.ju.edu.jo/sites/Academic/abusufah/Material/cpe779_spr12/index.html

Lecture 1: Introduction

Walid Abu-Sufah

University of Jordan

Acknowledgment: Collaboration

This course is being offered in collaboration with

• The IMPACT research group at the University of Illinois http://impact.crhc.illinois.edu/

• The Computation-based Science and Technology Research Center (CSTRC) of the Cyprus Institute http://cstrc.cyi.ac.cy/


Acknowledgment: Slides

Some of the slides used in this course are based on slides by

• Jim Demmel, University of California at Berkeley & Horst Simon, Lawrence Berkeley National Lab (LBNL) http://www.cs.berkeley.edu/~demmel/cs267_Spr12/

• Wen-mei Hwu of the University of Illinois and David Kirk, NVIDIA Corporation http://courses.engr.illinois.edu/ece408/ece408_syll.html

• Kathy Yelick, University of California at Berkeley http://www.cs.berkeley.edu/~yelick/cs194f07


Course Motivation

In the last few years:
• Conventional sequential processors cannot get any faster
• Previously, clock speed doubled every 18 months
• All computers will be parallel
• >>> All programs will have to become parallel programs
• Especially programs that need to run faster


Course Motivation (continued)

There will be a huge change in the entire computing industry

• Until now, the industry has sold new computers by running users' existing programs faster, without requiring any reprogramming.

• Multi-core and many-core chips have started a revolution in the software industry


Course Motivation (continued)

Large research efforts to address this issue are underway
• Computer companies: Intel, Microsoft, Nvidia, IBM, etc.
• Parallel programming is a concern for the entire computing industry.
• Universities
• Berkeley's ParLab (2008: $20 million grant)


Course Goals

Part 1 (~4 weeks)
• Focus on the techniques that are most appropriate for multicore programming and on using parallelism to improve program performance. Topics include
• performance analysis and tuning
• data techniques
• shared data structures
• load balancing and task parallelism
• synchronization


Course Goals (continued - I)

Part 2 (~12 weeks)
• Learn how to program massively parallel processors and achieve
• high performance
• functionality and maintainability
• scalability across future generations
• Acquire the technical knowledge required to achieve the above goals
• principles and patterns of parallel algorithms
• processor architecture features and constraints
• programming API, tools and techniques


Outline of rest of lecture

• Why powerful computers must use parallel processors

• Examples of Computational Science and Engineering (CSE) problems which require powerful computers

• Why writing (fast) parallel programs is hard

• Principles of parallel computing performance

• Structure of the course


What is Parallel Computing?

• Parallel computing: using multiple processors in parallel to solve problems (execute applications) more quickly than with a single processor (a minimal code sketch follows below)
• Examples of parallel machines:
• A cluster computer that combines multiple PCs with a high speed network
• A shared memory multiprocessor (SMP*) that connects multiple processors to a single memory system
• A Chip Multi-Processor (CMP) that contains multiple processors (called cores) on a single chip
• Concurrent execution comes from the desire for performance
• * Technically, SMP stands for "Symmetric Multi-Processor"
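As a concrete illustration (not from the original slides), here is a minimal sketch of the "multiple processors on one problem" idea using OpenMP in C; the array size and the work done per element are arbitrary choices:

#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Each available core executes a slice of the iteration space. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[%d] = %.1f (up to %d threads)\n", N - 1, c[N - 1], omp_get_max_threads());
    return 0;
}

Compiled with, e.g., gcc -fopenmp, the loop iterations are split across however many cores the machine has.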

Units of Measure

• High Performance Computing (HPC) units are:
• Flop: floating point operation
• Flop/s: floating point operations per second
• Bytes: size of data (a double precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions, ...
• Mega: Mflop/s = 10^6 flop/sec; Mbyte = 2^20 = 1048576 ~ 10^6 bytes
• Giga: Gflop/s = 10^9 flop/sec; Gbyte = 2^30 ~ 10^9 bytes
• Tera: Tflop/s = 10^12 flop/sec; Tbyte = 2^40 ~ 10^12 bytes
• Peta: Pflop/s = 10^15 flop/sec; Pbyte = 2^50 ~ 10^15 bytes
• Exa: Eflop/s = 10^18 flop/sec; Ebyte = 2^60 ~ 10^18 bytes
• Zetta: Zflop/s = 10^21 flop/sec; Zbyte = 2^70 ~ 10^21 bytes
• Yotta: Yflop/s = 10^24 flop/sec; Ybyte = 2^80 ~ 10^24 bytes

• Current fastest (public) machine ~ 11 Pflop/s (a quick worked example follows below)
• Up-to-date list at www.top500.org
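To make these units concrete, a back-of-the-envelope calculation (the matrix dimension n is an arbitrary example; the 10.51 Pflop/s rate is the fastest machine's Linpack figure quoted later in this lecture): a dense n-by-n matrix-matrix multiply costs about 2n^3 flops.

#include <stdio.h>

int main(void) {
    double n     = 1.0e6;              /* hypothetical matrix dimension            */
    double flops = 2.0 * n * n * n;    /* ~2n^3 flops for a dense matrix multiply  */
    double rate  = 10.51e15;           /* 10.51 Pflop/s (fastest machine, 2012)    */
    printf("%.2e flops at %.2e flop/s -> %.1f seconds (~%.1f minutes)\n",
           flops, rate, flops / rate, flops / rate / 60.0);
    return 0;
}

So even at more than ten petaflop/s, this single operation takes a few minutes.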


Why powerful computers are parallel


Technology Trends: Microprocessor Capacity

2X transistors/Chip Every 1.5 years

Called “Moore’s Law”

Microprocessors have become smaller, denser, and more powerful.

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

Slide source: Jack Dongarra


Microprocessor Transistors / Clock (1970-2000)


Impact of Device Shrinkage

• What happens when the feature size (transistor size) shrinks by a factor of x?
• Clock rate goes up by x because wires are shorter
• actually less than x, because of power consumption
• Transistors per unit area go up by x^2
• Die size also tends to increase
• typically by another factor of ~x
• Raw computing power of the chip goes up by ~x^4 !
• typically a factor of x^3 is devoted to either on-chip
• parallelism: hidden parallelism such as ILP
• locality: caches
• So most programs ran x^3 times faster, without changing them (a quick numeric check follows below)
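A quick numeric check of these idealized scaling claims (the shrink factor used here is a made-up example value):

#include <stdio.h>

int main(void) {
    double x = 1.4;   /* example feature-size shrink factor for one process generation */
    printf("clock rate:           ~%.1fx (wires are shorter)\n", x);
    printf("transistors per area: ~%.1fx (area scales as x^2)\n", x * x);
    printf("raw chip capability:  ~%.1fx (clock x transistors x larger die)\n",
           x * x * x * x);
    return 0;
}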


Power Density Limits Serial Performance

[Figure: power density (W/cm^2) of Intel microprocessors from the 4004 through the 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6, plotted on a log scale from 1970 to 2010. Extrapolated power density approaches that of a hot plate, a nuclear reactor, a rocket nozzle, and eventually the Sun's surface. Source: Patrick Gelsinger and Shekhar Borkar, Intel]

Scaling clock speed (business as usual) will not work

• High performance serial processors waste power
- Speculation, dynamic dependence checking, etc. burn power
- Implicit parallelism discovery
• More transistors, but not faster serial processors
• Concurrent systems are more power efficient
- Dynamic power is proportional to V^2 f C
- Increasing frequency (f) also requires increasing supply voltage (V), a roughly cubic effect on power
- Increasing the number of cores increases capacitance (C), but only linearly
- Save power by lowering the clock speed (a rough numeric sketch follows below)
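A rough illustration of that argument, using the P ~ C V^2 f relation above and the simplifying assumption (mine, not the slide's) that supply voltage scales with frequency:

#include <stdio.h>

/* Relative dynamic power, P ~ C * V^2 * f, assuming V scales with f. */
static double rel_power(double cores, double freq) {
    double capacitance = cores;   /* C grows roughly linearly with core count     */
    double voltage     = freq;    /* simplifying assumption: V proportional to f  */
    return capacitance * voltage * voltage * freq;
}

int main(void) {
    /* One core at full clock vs. two cores at 80% clock:
       ~1.6x the peak throughput for roughly the same power budget. */
    printf("1 core  at 1.0x clock: relative power %.2f\n", rel_power(1.0, 1.0));
    printf("2 cores at 0.8x clock: relative power %.2f\n", rel_power(2.0, 0.8));
    return 0;
}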


Revolution in Processors

• Chip density continues to increase ~2x every 2 years
• Clock speed is not increasing
• The number of processor cores may double instead
• Power is under control, no longer growing


Parallelism in 2012?

• These arguments are no longer theoretical
• All major processor vendors are producing multicore chips
• Every machine will soon be a parallel machine
• To keep doubling performance, parallelism must double
• Which (commercial) applications can use this parallelism?
• Do they have to be rewritten from scratch?
• Will all programmers have to be parallel programmers?
• New software model needed
• Try to hide complexity from most programmers, eventually
• In the meantime, need to understand it
• Computer industry is betting on this big change, but does not have all the answers
• Berkeley ParLab established to work on this


Memory is Not Keeping Pace

Technology trends work against constant or increasing memory per core
• Memory density is doubling every three years; processor logic doubles every two
• Storage costs (dollars/Mbyte) are dropping only gradually compared to logic costs

Source: David Turek, IBM

Cost of Computation vs. Memory

Question: Can you double concurrency without doubling memory? (a toy model follows below)
• Strong scaling: fixed problem size, increase the number of processors
• Weak scaling: grow the problem size proportionally to the number of processors

Source: IBM
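A toy model of that question (the baseline memory footprint is a made-up number): under strong scaling the memory needed per processor shrinks as processors are added, while under weak scaling it stays constant because the problem grows too.

#include <stdio.h>

int main(void) {
    double base_mem_gb = 64.0;   /* memory footprint of the baseline problem (made up) */
    for (int p = 1; p <= 16; p *= 2) {
        double strong = base_mem_gb / p;   /* same problem spread over p processors        */
        double weak   = base_mem_gb;       /* problem grows with p: constant per processor */
        printf("P = %2d   strong scaling: %5.1f GB/proc   weak scaling: %5.1f GB/proc\n",
               p, strong, weak);
    }
    return 0;
}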


The TOP500 Project

• Lists the 500 most powerful computers in the world
• Yardstick: Rmax of Linpack
• Solve Ax = b for a dense, random matrix
• Dominated by dense matrix-matrix multiplication
• Updated twice a year:
• ISC'xy in June in Germany
• SCxy in November in the U.S.
• All information is available at the TOP500 web site: www.top500.org

38th List: The TOP10

Rank / Site / Manufacturer / Computer / Country / Cores / Rmax [Pflop/s] / Power [MW]

1. RIKEN Advanced Institute for Computational Science, Japan. Fujitsu K Computer (SPARC64 VIIIfx 2.0 GHz, Tofu interconnect): 705,024 cores, 10.51 Pflop/s, 12.66 MW
2. National SuperComputer Center in Tianjin, China. NUDT Tianhe-1A (NUDT TH MPP, Xeon 6C, NVidia, FT-1000 8C): 186,368 cores, 2.566 Pflop/s, 4.04 MW
3. Oak Ridge National Laboratory, USA. Cray Jaguar (Cray XT5, HC 2.6 GHz): 224,162 cores, 1.759 Pflop/s, 6.95 MW
4. National Supercomputing Centre in Shenzhen, China. Dawning Nebulae (TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU): 120,640 cores, 1.271 Pflop/s, 2.58 MW
5. GSIC, Tokyo Institute of Technology, Japan. NEC/HP TSUBAME-2 (HP ProLiant, Xeon 6C, NVidia, Linux/Windows): 73,278 cores, 1.192 Pflop/s, 1.40 MW
6. DOE/NNSA/LANL/SNL, USA. Cray Cielo (Cray XE6, 8C 2.4 GHz): 142,272 cores, 1.110 Pflop/s, 3.98 MW
7. NASA/Ames Research Center/NAS, USA. SGI Pleiades (SGI Altix ICE 8200EX/8400EX): 111,104 cores, 1.088 Pflop/s, 4.10 MW
8. DOE/SC/LBNL/NERSC, USA. Cray Hopper (Cray XE6, 6C 2.1 GHz): 153,408 cores, 1.054 Pflop/s, 2.91 MW
9. Commissariat a l'Energie Atomique (CEA), France. Bull Tera 100 (Bull bullx super-node S6010/S6030): 138,368 cores, 1.050 Pflop/s, 4.59 MW
10. DOE/NNSA/LANL, USA. IBM Roadrunner (BladeCenter QS22/LS21): 122,400 cores, 1.042 Pflop/s, 2.34 MW

Performance Development

[Figure: TOP500 performance development on a log scale from 100 Mflop/s to 100 Pflop/s. Since the first list, the #1 system (N=1) has grown from 59.7 GFlop/s to 10.51 PFlop/s, the #500 system (N=500) from 400 MFlop/s to 50.9 TFlop/s, and the sum of all 500 systems (SUM) from 1.17 TFlop/s to 74.2 PFlop/s.]

Projected Performance Development

[Figure: projected TOP500 performance development for SUM, N=1, and N=500 on a log scale from 100 Mflop/s to 1 Eflop/s.]

Core Count


Moore’s Law reinterpreted

• Number of cores per chip can double every two years

• Clock speed will not increase (possibly decrease)

• Need to deal with systems with millions of concurrent threads

• Need to deal with inter-chip parallelism as well as intra-chip parallelism



Outline

• Why powerful computers must be parallel processors

• Large CSE problems require powerful computers

• Why writing (fast) parallel programs is hard

• Structure of the course


Drivers for Change

• Continued exponential increase in computational power
• Can simulate what theory and experiment can't do

• Continued exponential increase in experimental data

• Moore’s Law applies to sensors too

• Need to analyze all that data


Simulation: The Third Pillar of Science

• Traditional scientific and engineering method:
(1) Do theory or paper design
(2) Perform experiments or build a system
• Limitations:
- Too difficult: build large wind tunnels
- Too expensive: build a throw-away passenger jet
- Too slow: wait for climate or galactic evolution
- Too dangerous: weapons, drug design, climate experimentation
• Computational science and engineering paradigm:
(3) Use computers to simulate and analyze the phenomenon
• Based on known physical laws and efficient numerical methods
• Analyze simulation results with computational tools and methods beyond what is possible manually

[Diagram: Simulation as the third pillar alongside Theory and Experiment]


Data Driven Science

• Scientific data sets are growing exponentially
- The ability to generate data is exceeding our ability to store and analyze it
- Simulation systems and some observational devices grow in capability with Moore's Law
• Petabyte (PB) data sets will soon be common:
• Climate modeling: estimates for the next IPCC assessment are in the tens of petabytes
• Genome: JGI alone will have 0.5 petabyte of data this year, doubling each year
• Particle physics: the LHC is projected to produce 16 petabytes of data per year
• Astrophysics: LSST and others will produce 5 petabytes/year (via a 3.2 gigapixel camera)
• Create scientific communities with "Science Gateways" to the data


Some Particularly Challenging Computations

• Science
• Global climate modeling
• Biology: genomics, protein folding, drug design
• Astrophysical modeling
• Computational chemistry
• Computational material sciences and nanosciences
• Engineering
• Semiconductor design
• Earthquake and structural modeling
• Computational fluid dynamics (airplane design)
• Combustion (engine design)
• Crash simulation
• Business
• Financial and economic modeling
• Transaction processing, web services and search engines
• Defense
• Nuclear weapons: test by simulation
• Cryptography


Economic Impact of HPC

• Airlines:
• System-wide logistics optimization systems run on parallel systems
• Savings: approx. $100 million per airline per year
• Automotive design:
• Major automotive companies use large systems (500+ CPUs) for CAD-CAM, crash testing, structural integrity and aerodynamics
• One company has a 500+ CPU parallel system
• Savings: approx. $1 billion per company per year
• Semiconductor industry:
• Semiconductor firms use large systems (500+ CPUs) for device electronics simulation and logic validation
• Savings: approx. $1 billion per company per year
• Energy:
• Computational modeling has improved the performance of current nuclear power plants, equivalent to building two new power plants


$5B World Market in Technical Computing in 2004

[Figure: breakdown of the technical computing market, 1998-2003, by application area: chemical engineering, biosciences, classified defense, digital content creation and distribution, economics/financial, electrical design/engineering analysis, geoscience and geo-engineering, imaging, mechanical design and drafting, mechanical design/engineering analysis, scientific research and R&D, simulation, technical management and support, and other. Source: IDC 2004, from NRC Future of Supercomputing Report]


Why writing (fast) parallel programs is hard


Principles of Parallel Computing

• Finding enough parallelism (Amdahl's Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling

All of these make parallel programming harder than sequential programming.


“Automatic” Parallelism in Modern Machines

• Bit level parallelism
• within floating point operations, etc.
• Instruction level parallelism (ILP)
• multiple instructions execute per clock cycle
• Memory system parallelism
• overlap of memory operations with computation
• OS parallelism
• multiple jobs run in parallel on commodity SMPs

There are limits to all of these; for very high performance, the user must identify, schedule and coordinate parallel tasks.


Finding Enough Parallelism: Amdahl’s Law

T1 = execution time using 1 processor (serial execution time)

Tp = execution time using P processors

S = serial fraction of the computation (i.e. the fraction that can only be executed on 1 processor)

C = fraction of the computation that can be executed by P processors

Then S + C = 1 and

Tp = S*T1 + (C*T1)/P = (S + C/P)*T1

Speedup = Ψ(P) = T1/Tp = 1/(S + C/P) <= 1/S

• Maximum speedup (i.e. as P approaches infinity) is 1/S; e.g. S = 0.05 gives a maximum speedup of 20
• Currently the fastest machine has 705K processors; the 2nd fastest has ~186K processors + GPUs
• Even if the parallel part speeds up perfectly, performance is limited by the sequential part (a small numeric example follows below)
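A small calculator for this bound, using the slide's 5% serial fraction (the processor counts are illustrative, with the last matching the fastest machine's core count):

#include <stdio.h>

/* Amdahl's law: speedup(P) = 1 / (S + (1 - S) / P) */
static double speedup(double serial_fraction, double procs) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / procs);
}

int main(void) {
    double s = 0.05;                              /* 5% of the work is inherently serial */
    double procs[] = { 16.0, 1024.0, 705024.0 };  /* illustrative processor counts       */
    for (int i = 0; i < 3; i++)
        printf("P = %8.0f   speedup = %6.2f   (bound 1/S = %.0f)\n",
               procs[i], speedup(s, procs[i]), 1.0 / s);
    return 0;
}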


Speedup Barriers: (a) Overhead of Parallelism

• Given enough parallel work, overhead is a big barrier to getting the desired speedup
• Parallelism overheads include:
• cost of starting a thread or process
• cost of communicating shared data
• cost of synchronizing
• extra (redundant) computation
• Each of these can be in the range of milliseconds (= millions of flops) on some systems
• Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work (a toy cost model follows below)
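A toy cost model for that tradeoff (all constants are invented for illustration): total time is roughly work/P plus the per-task overhead times the number of tasks per processor, so finer-grained decompositions pay more overhead.

#include <stdio.h>

int main(void) {
    double total_work = 1.0e9;   /* total flops to perform (made up)            */
    double flop_rate  = 1.0e9;   /* flop/s per processor (made up)              */
    double overhead_s = 1.0e-3;  /* per-task startup/sync cost, ~1 ms as above  */
    double procs      = 64.0;

    for (double tasks = 64.0; tasks <= 1.0e6; tasks *= 100.0) {
        double compute = total_work / flop_rate / procs;   /* ideal parallel compute time */
        double over    = overhead_s * tasks / procs;       /* overhead paid per processor */
        printf("%8.0f tasks: %.4f s compute + %8.3f s overhead\n", tasks, compute, over);
    }
    return 0;
}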

Speedup Barriers: (b) Working on Non-Local Data

• Large memories are slow; fast memories are small
• Parallel processors, collectively, have a large, fast cache
• the slow accesses to "remote" data are what we call "communication"
• The algorithm should do most of its work on local data

[Diagram: conventional storage hierarchy. Each processor has its own cache and L2 cache in front of an L3 cache and memory, with potential interconnects between the memory modules of different processors.]


Processor-DRAM Gap (latency)

[Figure: processor vs. DRAM performance on a log scale, 1980-2000. Processor performance ("Moore's Law") improves ~60% per year while DRAM latency improves ~7% per year, so the processor-memory performance gap grows about 50% per year.]

Goal: find algorithms that minimize communication, not necessarily arithmetic


Speedup Barriers: (c) Load Imbalance

• Load imbalance occurs when some processors in the system are idle due to
• insufficient parallelism (during that phase)
• unequal task sizes
• The algorithm needs to balance the load (a small illustration follows below)
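A small illustration of why unequal task sizes hurt (the per-processor costs are made up): the parallel finish time is set by the most heavily loaded processor, not by the average.

#include <stdio.h>

int main(void) {
    /* The same 100 units of work assigned to 4 processors in two different ways. */
    double imbalanced[4] = { 10.0, 10.0, 10.0, 70.0 };  /* one processor got the big tasks */
    double balanced[4]   = { 25.0, 25.0, 25.0, 25.0 };  /* same total work, spread evenly  */

    double max_i = 0.0, max_b = 0.0;
    for (int p = 0; p < 4; p++) {
        if (imbalanced[p] > max_i) max_i = imbalanced[p];
        if (balanced[p]   > max_b) max_b = balanced[p];
    }
    printf("imbalanced finish time: %.0f   balanced finish time: %.0f\n", max_i, max_b);
    return 0;
}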


Outline

• Why powerful computers must be parallel processors

• Large CSE problems require powerful computers

• Why writing (fast) parallel programs is hard

• Structure of the course


Instructor (Sections 2 & 3)

• Instructor: Dr. Walid Abu-Sufah
• Office: CPE 10
• Email: [email protected]
• Office Hours: Monday 11-12, Tuesday 12-1, and by appointment
• Course web site: http://www1.ju.edu.jo/ecourse/abusufah/cpe779_spr12/index.html


Prerequisite

• CPE 432: Computer Design, and general C programming skills


Grading Policy

• Programming Assignments: 25%
• Demo/knowledge: 25%
• Functionality and Performance: 40%
• Report: 35%
• Project: 35%
• Design Document: 25%
• Project Presentation: 25%
• Demo/Functionality/Performance/Report: 50%
• Midterm: 15%
• Final: 25%


Bonus Days

• Each of you gets five bonus days

• A bonus day is a no-questions-asked one-day extension that can be used on most assignments

• You can’t turn in multiple versions of a team assignment on different days; all of you must combine individual bonus days into one team bonus day.

• You can use multiple bonus days on the same assignment

• Weekends/holidays don’t count for the number of days of extension (Thursday-Sunday is one day extension)

• Intended to cover illnesses, just needing more time, etc.


Using Bonus Days

• Bonus days are automatically applied to late projects
• The penalty for being late beyond bonus days is 10% of the possible points per day, again counting only business days
• Things you can't use bonus days on:
• Final project design documents, final project presentations, final project demo, exams


Academic Honesty

• You are allowed and encouraged to discuss assignments with other students in the class. Getting verbal advice/help from people who’ve already taken the course is also fine.

• Any reference to assignments from web postings is unacceptable

• Any copying of non-trivial code is unacceptable
• Non-trivial = more than a line or so
• This includes reading someone else's code and then going off to write your own


Academic Honesty (cont.)

• Giving/receiving help on an exam is unacceptable
• Penalties for academic dishonesty:
• Zero on the assignment for the first occurrence
• Automatic failure of the course for repeat offenses


Team Projects

• Work can be divided up between team members in any way that works for you

• However, each team member will demo the final checkpoint of each project individually and will get a separate demo grade
• This will include questions on the entire design
• Rationale: if you don't know enough about the whole design to answer questions on it, you aren't involved enough in the project


Text/Notes

1. D. Kirk and W. Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," Morgan Kaufmann Publishers, 2010, ISBN 978-0123814722

2. Cleve B. Moler, Numerical Computing with MATLAB, Society for Industrial and Applied Mathematics (January 1, 2004). Available for individual download at http://www.mathworks.com/moler/chapters.html

3. NVIDIA, NVidia CUDA C Programming Guide, version 4.0, NVidia, 2011 (reference book)

4. Lecture notes will be posted at the class web site


Rough List of Topics

• Basics of computer architecture, memory hierarchies, performance
• Parallel programming models and machines
• Shared memory and multithreading (OpenMP)
• Distributed memory and message passing (MPI)


Rough List of Topics (continued)

• Programming NVIDIA processors using CUDA
• Introduction to CUDA C
• CUDA parallel execution model with Fermi updates
• CUDA memory model with Fermi updates
• Tiled matrix-matrix multiplication
• Debugging and profiling; introduction to convolution
• Convolution, constant memory and constant caching


Rough List of Topics (continued)

• Programming NVIDIA processors using CUDA (continued)
• Tiled 2D convolution
• Parallel computation patterns: reduction trees
• Memory bandwidth
• Parallel computation patterns: prefix sum (scan)
• Floating point considerations
• Atomic operations and histogramming
• Data transfers and GMAC
• Multi-GPU programming in CUDA and GMAC
• MPI and CUDA programming


Rough List of Topics (continued)

• Selected numerical computing topics (with MATLAB)
• Linear equations
• Eigenvalues
