
Programmability and Portability Problems? Time for Hardware Upgrades


Page 1: Programmability and Portability Problems?  Time for Hardware Upgrades

Programmability and Portability Problems? Time for Hardware Upgrades

Uzi Vishkin

~2003: Wall Street-traded companies gave up the safety of the only paradigm that worked for them for parallel computing.

Yet to see: an easy-to-program, fast, general-purpose many-core computer for single-task completion time.

Page 2: Programmability and Portability Problems?  Time for Hardware Upgrades

2009: Develop application SW in 2009 for the many-cores of the 2010s, or wait?

Portability/investment questions:
- Will 2009 code be supported in the 2010s?
- Development hours in 2009 vs. the 2010s?
- Maintenance in the 2010s?
- Performance in the 2010s?

Good news: vendors open up to ~40 years of parallel computing.

Also SW to match vendors’ HW (2009 acquisitions). Also: new starts

However, they picked the wrong part: parallel architectures are a disaster area for programmability. In any case, their programming is too constrained. Contrast with general-purpose serial computing, which “set the serial programmer free”. The current direction drags general-purpose computing into an unsuccessful paradigm.

My main point: we need to reproduce the serial success story for many-core computing.

The business food chain: SW developers serve customers, NOT machines. If HW developers do not get used to the idea of serving SW developers, guess what will happen to the customers of their HW.

Page 3: Programmability and Portability Problems?  Time for Hardware Upgrades

Technical points. Will overview/note:
- What does it mean to “set free” parallel algorithmic thinking?
- Architecture functions/abilities that achieve that
- HW features supporting them

Vendors must provide such functions. Simple way: just add these features

Page 4: Programmability and Portability Problems?  Time for Hardware Upgrades

Example of a HW feature: Prefix-Sum

• 1500 cars enter a gas station with 1000 pumps.

• Direct, in unit time, a car to EVERY pump.

• Direct, in unit time, a car to EVERY pump that becomes available.

Proposed HW solution: prefix-sum functional unit. (HW enhancement of Fetch&Add)

SPAA’97 + US Patent

Page 5: Programmability and Portability Problems?  Time for Hardware Upgrades

Objective for the programmer’s model

• Emerging: not sure what the model will be, but analysis should be work-depth. Why not design for your analysis? (like serial)

• [SV82] conjectured that the rest (the full PRAM algorithm) is just a matter of skill.

• Lots of evidence that this “work-depth methodology” works. Used as the framework in PRAM algorithms textbooks: JaJa-92, Keller-Kessler-Traeff-01.

• The only really successful parallel algorithmic theory. A latent, though not widespread, knowledge base.

[Diagram: Serial paradigm vs. natural (parallel) paradigm, plotting #ops per time step. Serial: one op per step, so Time = Work (Work = total #ops). Parallel: at each step, do whatever you could do in parallel assuming unlimited hardware, so Time << Work.]

Page 6: Programmability and Portability Problems?  Time for Hardware Upgrades

Hardware prototypes of PRAM-On-Chip

XMT big idea in a nutshell: design for work-depth.
1) 1 operation now; any #ops in the next time unit.
2) No need to program for locality, beyond use of local thread variables, post work-depth.
3) Enough interconnection-network bandwidth.

64-core, 75MHz FPGA prototype [SPAA’07, Computing Frontiers’08]. Original explicit multi-threaded (XMT) architecture [SPAA’98] (Cray started to use the name “XMT” 7+ years later).

Interconnection network for 128 cores: 9mm x 5mm, IBM 90nm process. 400 MHz prototype [HotInterconnects’07].

Same design as the 64-core FPGA: 10mm x 10mm, IBM 90nm process. 150 MHz prototype.

The design scales to 1000+ cores on-chip

Page 7: Programmability and Portability Problems?  Time for Hardware Upgrades

XMT: A PRAM-On-Chip Vision

• IF you could program a current manycore: great speedups. XMT: fix the IF.
• XMT: designed from the ground up to address that, for on-chip parallelism. Unlike matching current HW.
• Today’s position: replicate functions.
• Tested HW & SW prototypes. Software release of the full XMT environment.
• SPAA’09: ~10X relative to Intel Core 2 Duo.
• For more info: Google “XMT”.

Page 8: Programmability and Portability Problems?  Time for Hardware Upgrades

Programmer’s Model: Workflow Function

• Arbitrary CRCW work-depth algorithm.
– Reason about correctness & complexity in the synchronous model.
• SPMD reduced synchrony
– Main construct: spawn-join block. Can start any number of processes at once. Threads advance at their own speed, not in lockstep.
– Prefix-sum (ps). Independence of order semantics (IOS).
– Establish correctness & complexity by relating to the WD analyses.
– Circumvents “The problem with threads”, e.g., [Lee].
• Tune (compiler or expert programmer): (i) length of the sequence of round trips to memory, (ii) QRQW, (iii) WD. [VCL07]
• Trial & error contrast: similar start, then: while insufficient inter-thread bandwidth do { rethink algorithm to take better advantage of cache }.

[Diagram: execution alternates serial segments with parallel spawn-join blocks.]

Page 9: Programmability and Portability Problems?  Time for Hardware Upgrades

Ease of Programming

• Benchmark: can any CS major program your manycore? You cannot really avoid it. Teachability demonstrated so far:
– To a freshman class with 11 non-CS students. Some programming assignments: merge-sort, integer-sort & sample-sort.

Other teachers:
– Magnet HS teacher. Downloaded the simulator, assignments, and class notes from the XMT page. Self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, and analyze: ability to anticipate performance (as in serial). Can do not just embarrassingly parallel problems. Also teaches OpenMP, MPI, CUDA. Look up the keynote at CS4HS’09@CMU + the interview with the teacher.
– High school & middle school students (some 10 years old) from underrepresented groups, taught by a HS math teacher.

Page 10: Programmability and Portability Problems?  Time for Hardware Upgrades

Conclusion

• XMT provides a viable answer to the biggest challenges for the field:
– Ease of programming
– Scalability (up & down)
– Facilitates code portability

• Preliminary evaluation shows good results for the XMT architecture versus a state-of-the-art Intel Core 2 platform.

• ICPP’08 paper compares with GPUs.

• Easy to build. One student completed the hardware design and an FPGA-based XMT computer in slightly more than two years: short time to market, low implementation cost.

Replicate functions, perhaps by replicating solutions

Page 11: Programmability and Portability Problems?  Time for Hardware Upgrades

Software release

Allows you to use your own computer for programming in an XMT environment and experimenting with it, including:
a) A cycle-accurate simulator of the XMT machine
b) A compiler from XMTC to that machine

Also provided: extensive material for teaching or self-studying parallelism, including
(i) Tutorial + manual for XMTC (150 pages)
(ii) Class notes on parallel algorithms (100 pages)
(iii) Video recording of the 9/15/07 HS tutorial (300 minutes)
(iv) Video recordings of graduate Parallel Algorithms lectures (30+ hours)

www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html, or just Google “XMT”

Page 12: Programmability and Portability Problems?  Time for Hardware Upgrades

Q&A

Question: Why do PRAM-type parallel algorithms matter, when we can get by with existing serial algorithms and parallel programming methods like OpenMP on top of them?

Answer: With the latter, you need a strong-willed Comp. Sci. PhD to come up with an efficient parallel program at the end. With the former (the study of parallel algorithmic thinking and PRAM algorithms), high school kids can write efficient (more efficient if fine-grained & irregular!) parallel programs.