Parallel Computing Approaches & Applications
Arthur Asuncion
April 15, 2008
Roadmap
Brief Overview of Parallel Computing
U. Maryland work: PRAM prototype, XMT programming model
Current Standards: MPI, OpenMP
Parallel Algorithms for Bayesian Networks, Gibbs Sampling
Why Parallel Computing?
Moore's law will eventually end.
Processors are becoming cheaper.
Parallel computing provides significant time and memory savings!
Parallel Computing
Goal is to maximize efficiency / speedup:
Efficiency = T_seq / (P * T_par) < 1
Speedup = T_seq / T_par < P
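As a hypothetical illustration (numbers not from the slides): if a job takes T_seq = 100 s on one processor and T_par = 30 s on P = 4 processors, the speedup is 100 / 30 ≈ 3.3 (< 4) and the efficiency is 100 / (4 * 30) ≈ 0.83 (< 1).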
In practice, time savings are substantial, assuming communication costs are low and processor idle time is minimized.
Orthogonal to: advancements in processor speeds; code optimization and data structure techniques
Some issues to consider
Implicit vs. Explicit Parallelization
Distributed vs. Shared Memory
Homogeneous vs. Heterogeneous Machines
Static vs. Dynamic Load Balancing
Other Issues: Communication Costs, Fault Tolerance, Scalability
Main Questions
How can we design parallel algorithms?
Need to think of places in the algorithm that can be made concurrent.
Need to understand data dependencies (the "critical path" is the longest chain of dependent calculations).
How do we implement these algorithms?
An engineering issue with many different options.
U. Maryland Work (Vishkin)
FPGA-Based Prototype of a PRAM-On-Chip Processor
Xingzhi Wen, Uzi Vishkin, ACM Computing Frontiers, 2008
Video: http://videos.webpronews.com/2007/06/28/supercomputer-arrives/
Goals
Find a parallel computing framework that:
is easy to program
gives good performance with any amount of parallelism provided by the algorithm; namely, up- and down-scalability, including backwards compatibility on serial code
supports application programming (VHDL/Verilog, OpenGL, MATLAB) and performance programming
fits current chip technology and scales with it
They claim that PRAM/XMT can meet these goals.
What is PRAM?
"Parallel Random Access Machine"
Virtual model of computation with some simplifying assumptions:
No limit to the number of processors.
No limit on the amount of shared memory.
Any number of concurrent accesses to shared memory take the same time as a single access.
Simple model that can be analyzed theoretically
Eliminates focus on details like synchronization and communication
Different types:
EREW: Exclusive read, exclusive write.
CREW: Concurrent read, exclusive write.
CRCW: Concurrent read, concurrent write.
XMT Programming Model
XMT = "Explicit Multi-Threading"
Assumes CRCW PRAM
Multithreaded extension of C with 3 commands:
Spawn: starts parallel execution mode
Join: resumes serial mode
Prefix-sum: atomic command for incrementing a variable
RAM vs. PRAM
Simple Example
Task: Copy the nonzero elements of A into B
$ is the thread ID; PS is the prefix-sum command
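The code on this slide did not survive extraction. Below is a minimal sketch of how this task is commonly written in XMTC-style pseudocode (assumed syntax, not reproduced from the slide), using the spawn/join and prefix-sum (ps) commands listed above; A, B, n, and the shared counter x are illustrative names.

int x = 0;                  /* shared counter, advanced only via prefix-sum */
spawn(0, n - 1) {           /* one virtual thread per element of A */
    int e = 1;              /* this thread's local increment */
    if (A[$] != 0) {        /* $ is the thread ID */
        ps(e, x);           /* atomically: e gets the old value of x, then x += 1 */
        B[e] = A[$];        /* e is now a unique slot in B */
    }
}                           /* implicit join: execution resumes serial mode */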
Architecture of PRAM prototype
MTCU ("Master Thread Control Unit"): handles sequential portions
TCU clusters: handle parallel portions
64 separate processors, each at 75 MHz; 1 GB RAM; 32 KB per cache module (8 shared cache modules)
Shared cache
Shared PS unit: the only way to communicate!
Envisioned Processor
Performance Results
Using 64 processors
Projected results: 75 MHz -> 800 MHz
Human Results
"As PRAM algorithms are based on first principles that require relatively little background, a full day (300-minute) PRAM/XMT tutorial was offered to a dozen high-school students in September 2007. Followed up with only a weekly office-hour by an undergraduate assistant, some strong students have been able to complete 5 of 6 assignments given in a graduate course on parallel algorithms."
In other words: XMT is an easy way to program in parallel.
Main Claims
"First commitment to silicon for XMT": an actual attempt to implement a PRAM
"Timely case for the education enterprise": XMT can be learned easily, even by high schoolers
"XMT is a candidate for the Processor of the Future"
My Thoughts
Making parallel programming as pain-free as possible is desirable, and XMT makes a good attempt to do this.
Performance is a secondary goal.
Their technology does not seem to be ready for prime time yet: 75 MHz processors; no floating-point operations, no OS.
MPI Overview
MPI ("Message Passing Interface") is the standard for distributed computing.
Basically it is an extension of C/Fortran that allows processors to send messages to each other.
A tutorial: http://www.cs.gsu.edu/~cscyip/csc4310/MPI1.ppt
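As a minimal illustration of the message-passing style (not from the slides): process 0 sends one integer to process 1 using the standard MPI_Send/MPI_Recv calls. Compile with mpicc and run with, e.g., mpirun -np 2 ./a.out.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value = 42;
    MPI_Init(&argc, &argv);                   /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which process am I? */
    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d from process 0\n", value);
    }
    MPI_Finalize();
    return 0;
}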
OpenMP overview
OpenMP is the standard for shared-memory computing.
Extends C with compiler directives to denote parallel sections.
Normally used for the parallelization of "for" loops.
Tutorial: http://vergil.chemistry.gatech.edu/resources/programming/OpenMP.pdf
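A small illustrative sketch (not from the slides) of the typical usage described above: a compiler directive marks a "for" loop whose iterations are independent, and a reduction clause handles a shared accumulator safely. Compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp.

#include <stdio.h>

int main(void) {
    double a[1000], sum = 0.0;

    /* Each iteration is independent, so the loop can be split across threads. */
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        a[i] = i * 0.5;

    /* The reduction clause gives each thread a private partial sum and combines them. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}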
Parallel Computing in AI/ML
Parallel Inference in Bayesian Networks
Parallel Gibbs Sampling
Parallel Constraint Satisfaction
Parallel Search
Parallel Neural Networks
Parallel Expectation Maximization, etc.
Finding Marginals in Parallel through "Pointer Jumping"
(Pennock, UAI 1998)
Each variable is assigned to a separate processor.
Processors rewrite conditional probabilities in terms of the grandparent.
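For intuition, here is a generic pointer-jumping sketch in C with OpenMP (illustrative only, not Pennock's actual algorithm): each node repeatedly shortcuts its parent pointer to its grandparent, so after O(log n) rounds every node points at the root. Pennock's method applies the same doubling idea to conditional probability tables rather than to plain pointers.

#include <string.h>

/* parent[i] is node i's current parent (the root points to itself);
   next[] is caller-provided scratch space of the same length n. */
void pointer_jump(int parent[], int next[], int n) {
    int changed = 1;
    while (changed) {                           /* O(log n) rounds on a chain */
        changed = 0;
        #pragma omp parallel for reduction(||:changed)
        for (int i = 0; i < n; i++) {
            next[i] = parent[parent[i]];        /* shortcut to the grandparent */
            if (next[i] != parent[i])
                changed = 1;
        }
        memcpy(parent, next, n * sizeof(int));  /* synchronous end-of-round update */
    }
}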
Algorithm
Evidence Propagation
"Arc Reversal" + "Evidence Absorption"
Step 1: Make the evidence variable the root node and create a preorder walk (can be done in parallel).
Step 2: Reverse arcs not consistent with that preorder walk (can be done in parallel), and absorb evidence.
Step 3: Run the "Parallel Marginals" algorithm.
Generalizing to Polytrees
Note: Converting Bayesian networks to junction trees can also be done in parallel.
Namasivayam et al. Scalable Parallel Implementation of Bayesian Network to Junction Tree Conversion for Exact Inference. 18th Int. Symp. on Computer Architecture and High Performance Computing, 2006.
Complexity
Time complexity:
O(log n) for polytree networks!
Assuming 1 processor per variable; n = # of processors/variables
O(r^{3w} log n) for arbitrary networks
r = domain size, w = largest cluster size
Parallel Gibbs Sampling
Running multiple parallel chains is trivial.
Parallelizing a single chain can be difficult:
Can use a Metropolis-Hastings step to sample from the joint distribution correctly.
Related ideas: Metropolis-coupled MCMC, Parallel Tempering, Population MCMC
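To make the "multiple parallel chains" case concrete, here is a small self-contained sketch (illustrative only, not from the slides): several independent Gibbs chains sample a bivariate normal with correlation rho = 0.5, one chain per OpenMP thread, each with its own RNG stream. Parallelizing a single chain would instead require the Metropolis-Hastings ideas mentioned above. Uses the POSIX rand_r; compile with e.g. gcc -fopenmp -lm.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_CHAINS 8

/* Crude uniform(0,1) and standard-normal draws from a per-chain RNG state. */
static double runif(unsigned int *seed) { return (rand_r(seed) + 1.0) / ((double)RAND_MAX + 2.0); }
static double rnorm(unsigned int *seed) {
    return sqrt(-2.0 * log(runif(seed))) * cos(2.0 * 3.14159265358979 * runif(seed));
}

int main(void) {
    const int num_sweeps = 100000;
    const double rho = 0.5, s = sqrt(1.0 - rho * rho);
    double mean_x[NUM_CHAINS];

    #pragma omp parallel for
    for (int c = 0; c < NUM_CHAINS; c++) {
        unsigned int seed = 1234u + c;            /* independent RNG stream per chain */
        double x = 0.0, y = 0.0, sum = 0.0;
        for (int t = 0; t < num_sweeps; t++) {
            x = rho * y + s * rnorm(&seed);       /* Gibbs step: sample x | y */
            y = rho * x + s * rnorm(&seed);       /* Gibbs step: sample y | x */
            sum += x;
        }
        mean_x[c] = sum / num_sweeps;             /* should be near 0 for every chain */
    }

    for (int c = 0; c < NUM_CHAINS; c++)
        printf("chain %d: mean of x = %f\n", c, mean_x[c]);
    return 0;
}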
Recap
Many different ways to implement parallel algorithms (XMT, MPI, OpenMP)
In my opinion, designing efficient parallel algorithms is the harder part.
Parallel computing in the context of AI/ML is still not fully explored!