Upload
mary-newman
View
215
Download
0
Tags:
Embed Size (px)
Citation preview
1
OVERVIEW OF MULTICORE, PARALLEL COMPUTING,AND DATA MINING
Indiana UniversityComputer Science Dept.
Seung-Hee Bae1
1
22
OUTLINE
Motivation Multicore Parallel Computing Data Mining
2
MOTIVATION
According to “How Much Information” project at UC Berkeley Print, film, magnetic & optical storage media produced about 5 exabytes
(a billion of billion bytes) of new info. in 2002. 5 exabytes = 37000 Library of Congress (17 million books)
The rate of data increase will continue to accelerate through weblogs, digital photo & video, surveillance monitor, scientific instruments (sensors), and instant message etc.
Thus, we need more powerful computing platforms to deal with this much data. To take advantage of multicore chip, it is critical to build a software with
scalable parallelism. To deal with a huge amount of data and utilize multicore, it is essential
to develop data mining tools with highly scalable parallel programming. 3
4
RECOGNITION, MINING, AND SYNTHESIS (RMS)
(from P.Dubey, “Recognition, Mining and Synthesis Moves Computers to the Era of Tera,” Technology@Intel Magazine, Feb. 2005.)
Intel points out these three processing cycle will be necessary to deal with most generalized decision support (data mining).
Examples Medicine (a tumor) Business (hiring) Investment
4
5
Motivation Multicore
Toward Concurrency What is Multicore?
Parallel Computing Data Mining
55
TOWARD CONCURRENCY IN SOFTWARE
Exponential growth (Moore’s Law) will change
Clock speed: getting more cycles Become harder to exploit higher
clock speeds (2GHz:2001, 3.4GHz:2004, now?)
Execution optimization: more work per cycle Pipelining, branch prediction,
multiple instructions/clock, reordering the instruction
Cache Increasing the size of on-chip
cache: main memory is much slower than the cache.
Moore’s law is over? Not yet (# of transistors ↑)
Hyperthreading Running two or more threads in
parallel inside a single CPU It doesn’t help single-threaded
applications
Multicore Running two or more actual CPUs on
one chip. It will boost reasonably well-written
multi-thread applications, but not single-threaded applications.
Cache Only this will broadly benefit most
existing applications. A cache miss costs 10 to 50 times.
Previous CPU performance gains Current CPU performance gains
666
7
WHAT IS MULTICORE?
Single Chip Multiple distinct processing Engine E.g.) Shared-cache Dual Core Architecture
77
Core 0
CPU
L1 Cache
Core 1
CPU
L1 Cache
L2 Cache
8
Motivation Multicore Parallel Computing
Parallel architectures (Shared-Memory vs. Distributed-Memory) Decomposing Program (Data Parallelism vs. Task Parallelism) MPI and OpenMP
Data Mining
88
99
PARALLEL COMPUTING: INTRODUCTION Parallel computing
More than just a strategy for achieving good performance Vision for how computation can seamlessly scale from a single processor
to virtually limitless computing power Parallel computing software systems
Goal: to make parallel programming easier and the resulting applications more portable and scalable while achieving good performance.
Component Parallel Paradigm (Explicit Parallel) One explicitly programs the different parts of a parallel application. E.g.) MPI, PGAS, CCR & DSS, Workflow, DES
Program Parallel Paradigm (Implicit Parallel) One writes a single program to describe the whole app. compiler and runtime
break up the program into the multiple parts that execute in parallel. E.g.) OpenMP, HPF, HPCS, MapReduce
Parallel Computing Challenges Concurrency & Communication Scalability and portability are difficult to achieve. Diversity of Architectures
9
10
PARALLEL ARCHITECTURE 1
10
Shared-memory machines Have a single shared address
space that can be accessed by any processor.
Examples Multicore Symmetric multiprocessor (SMP) Uniform Memory Access (UMA)
Access time is independent of the location.
Use bus or fully connected net. Hard to achieve the scalability
Distributed-memory machines The system memory is packaged
with individual nodes of one or more processors (c.f. Use separate computers connected by a network)
E.g.) Cluster communication is required to
provide data from a processor to a different processor.
11
PARALLEL ARCHITECTURE 2
11
Shared-Memory Distributed Memory
Pros • Lower latency and higher BW• Data are available to all of the CPUs through load and store instructions• Single address space
• Scalable, if a scalable interconnection network is used.• Quite fast local data access.
Cons • cache coherency should be dealt with carefully.• synchronization is explicitly needed to access shared data.• scalability issue
• Communication required to access data in a diff. processor.• Communication management problem1. Long latency 2. Long transmission time
1212
PARALLEL ARCHITECTURE 3
Hybrid systemsDistributed shared-memory (DSM)
Distributed-memory machine which allows a processor to directly access a datum in a remote memory.
Latency varies with the distance to the remote memory. Emphasize the Non-Uniform Memory Access (NUMA)
characteristics.SMP clusters
distributed-memory system with SMP as a unit.
12
PARALLEL PROGRAMMING MODEL
Shared-Memory Programming model Need for synchronization to preserve the integrity More appropriate to shared-memory machine E.g.) Open Specifications for MultiProcessing (OpenMP)
Message-Passing Programming model Send-receive communication steps. Communication is used to access a remote data location. More appropriate to distributed-memory machine E.g.) Message Passing Interface (MPI)
Shared-memory programming model can be used to distributed-memory machines as well as message-passing programming model can be used to shared-memory architectures. However, the efficiency of the programming model is different. 13
PARALLEL PROGRAM: DECOMPOSITION 1
Subdivides the data domain of a problem into multiple regions and assigns different processors.
Exploit the parallelism inherent in many large data structures.
Same Task on diff. data. (SPMD) More commonly used in scientific
problems. Features
natural form of scalability. Hard to express when geometry
irregular or dynamic Can be expressed by ALL parallel
programming models (i.e. MPI, HPF like, OpenMP like)
Different processors carry out different functions.
Coarse grain parallelism Different task on the same or
different data. Features
Parallelism limited in size Tens not millions
Synchronization probably good Parallelism and Decomposition can
be derived from problem structure. E.g.) workflow
Data Parallelism Functional Parallelism
1414
1515
PARALLEL PROGRAM: DECOMPOSITION 2
Load balance and scalability Scalable: running time is inversely proportional to the number of
processors used. Speedup(n) = T(1)/T(n)
Scalable if speedup(n) ≈ n Second definition of scalability: scaled speedup
Scalable if the running time remains the same when the number of processors and the problem size are increased by a factor of n.
Why scalability is not achieved? a region that must be run sequentially. Total speedup ≤ T(1)/Ts
(Amdahl’s Law) Require for a high degree of communication or coordination. Poor load balance (major goal of parallel programming)
If one of the processors takes half of the parallel work, speedup will be limited to a factor of two. 15
1616
MEMORY MANAGEMENT Memory-Hierarchy Management
Blocking Ensuring that data remains in cache between subsequent accesses to the
same memory location. Elimination of False Sharing
False sharing: When two diff. processors are accessing distinct data items that reside on the same cache line.
Ensure that data used by diff. processors reside on diff. cache line. (by padding: inserting empty bytes in a data structure.)
Communication Minimization and Placement Move send and receive commands far enough apart so that time spent
on communication can be overlapped. Stride-one access
Programs in which the loops access contiguous data items are much more efficient than those that do not.
16
17
MESSAGE PASSING INTERFACE (MPI) Message Passing Interface (MPI)
A specification for a set of functions for managing movement of data among sets of communicating processes.
The dominant scalable parallel computing paradigm with scientific problem.
Explicit message send and receive using rendezvous model. Point-to-point communication Collective communication Commonly implemented in terms of an SPMD model
All processes execute essentially the same logic.
Pros: scalable and portable Race condition avoided (implicit synch. w/ completion of the copy)
Cons: implements details at communication.
17
18
MPI 6 Key Functions
MPI_INIT MPI_COMM_SIZE MPI_COMM_RANK MPI_SEND MPI_RECV MPI_FINALIZE
Collective Communications Barrier, Broadcast, Gather, Scatter, All-to-all, Exchange General reduction operation (sum, minimum, scan)
Blocking, nonblocking, buffered, synchronous messaging
18
19
OPEN SPECIFICATIONS FOR MULTIPROCESSING (OpenMP) 1 Appropriate to uniform-access, shared-memory. A sophisticated set of annotations (compiler directives) for traditional
C, C++, or Fortran codes to aid compilers producing parallel codes. It provides parallel loops and collective operations such as summation
over loop indices. Provide lock variables to allow fine-grain synchronization btwn threads. Specify where multiple threads should be applied, and how to assign
work to those threads. Pros:
Excellent programming interface for uniform-access, shared-memory machines.
Cons: No way to specify locality in machines w/ non-uniform shared-memory or
distributed memory. Cannot express all parallel algorithms.
19
20
OpenMP 2 Directives: instruct the compiler to
Create threads, perform synchronization ops, manage shared memory. Examples
PARALLEL DO ~ END PARALLEL DO SCHEDULE (STATIC) SCHEDULE (DYNAMIC) REDUCTION(+: x) PARALLEL SECTIONS
OpenMP synchronization primitives Critical sections Atomic updates Barriers
20
21
Motivation Multicore Parallel Computing Data Mining
Expectation Maximization (EM) Deterministic Annealing (DA) Hidden Markov Model (HMM) Other Important Algorithms
21
22
EXPECTATION MAXIMIZATION (EM) Expectation Maximization (EM)
A general algorithm for maximum-likelihood (ML) estimation where the data are “incomplete” or the likelihood function involves latent variables.
An efficient iterative procedure Goal: estimate unknown parameters, given measurement. Hill climbing approach guarantee to reach maxima (or local maxima.) Two Steps
E-step (Expectation): the missing data are estimated given the observed data and current estimate of the model parameters.
M-step (Maximization): the likelihood function is maximized under the assumption that the missing data are known. (The estimated missing data from the E-step are used in lieu of the actual missing data.)
Those two steps are repeated until the likelihood converges.22
2323
DETERMINISTIC ANNEALING (DA) Purpose: avoid local minima (optimization) Simulated Annealing (SA)
A sequence of random moves is generated and the random decision to accept a move depends on the cost of resulting configuration relative to the current state cost (Monte Carlo Method)
Deterministic Annealing (DA) Uses expectation instead of stochastic simulations (random move). Deterministic:
Making incremental progress on the average. (minimize the free energy (F) directly)
Annealing: still want to avoid local minima with certain level of uncertainty. Minimizing the cost at prescribed level of randomness (Shannon Entropy)
eq) F = D – TH (T: temperature, H: Shannon Entropy, D: cost) At large T, entropy (H) dominates while at small T cost dominates. Annealing lowers temperature so solution tracks continuously
23
24
DA FOR CLUSTERING
This is an extended K-means algorithm. Start with a single cluster giving as solution Y1 as centroid For some annealing schedule for T, iterate above algorithm testing
covariance matrix in Xi about each cluster center to see if “elongated”
Split cluster if elongation “long enough” phase transition You do not need to assume number of clusters but rather a final
resolution T or equivalent At T=0, uninteresting solution is N clusters; one at each point xi
24
DA CLUSTERING RESULTS (GIS)Age under 5 vs. 25 to 34 Age under 5 vs. 75 and up
25
HIDDEN MARKOV MODEL (HMM) 1
A system being in one of a set of N distinct states, S1, S2, …, SN at any time.
State transition probability The special case of a discrete, first
order Markov chain: P[qt=Sj|qt-1=Si, qt-2=Sk, …]
= P[qt=Sj|qt-1=Si] (1)
Consider the right-hand side of (1) is independent of time, thereby leading to the set of state transition probability aij of the form
aij = P[qt = Sj|qt-1 = Si],
1 ≤ i, j ≤ N, aij ≥ 0 ∑j aij = 1
Initial state probability
Observation is a probabilistic function of the state.
State is hidden. Speech recognition, bioinfo, etc. Elements of an HMM
N, the number of states M, the number of symbols A = {aij}, The state transition probability
distribution B = {bj(k)}, The symbol emission
probability distribution in state j
bj(k) = P[vk at t| qt = Sj],
1 ≤ j ≤ N, 1 ≤ k ≤ M π = {πi}, The initial state distribution
πi = P[q1 = Si], 1 ≤ j ≤ N
Compact notation: λ = (A, B, π)
Markov Model Hidden Markov Model (HMM)
26
HIDDEN MARKOV MODEL (HMM) 2
Prob(observation seq | model): Given the observation sequence
O=O1O2 … OT, and a model λ=(A,B,π), how do we efficiently compute P(O| λ)?
Finding Optimal State Seq: Given O = O1O2 … OT, and
λ=(A,B,π), how do we choose a corresponding optimal state sequence Q = q1q2 … qT in some meaningful sense (i.e. best “explains” the observations)?
Finding Optimal Model Parameters: How do we adjust the model
parameters λ = (A, B, π) to maximize P(O| λ) ?
Prob(observation seq | model): Enumeration: computationally unfeasible. Forward Procedure
αt(i) = P(O1O2 … Ot, qt = Si| λ)
Finding Optimal State Seq: find the best state sequence (path) Viterbi algorithm: dynamic programming method
δt(i) = max P[q1q2…qt = i, O1O2…Ot| λ) Path back tracking
Finding Optimal Model Parameters: Baum-Welch Method: Choose λ = (A, B, π) such that P(O| λ) is
locally maximized Essentially EM method: iterative
ξt(i, j) = P(qt = Si, qt+1 = Sj|O, λ)
Three Basic Problems Solutions of those Problems
27
OTHER IMPORTANT ALGS.
Other Data Mining Algorithms Support Vector Machine (SVM) K-means (special case of DA clustering), Nearest-
neighbor Decision Tree, Neural network, etc.
Dimension Reduction GTM (Generative Topographic Map) MDS (MultiDimensional Scaling) SOM (Self-Organizing Map)
28
SUMMARY
Era of Multicore (Parallelism is essential.) Explosion of information from many kinds of sources. We are interesting scalable parallel data-mining algorithms.
Clustering algorithm (DA clustering) GIS (demographic (census) data) – visualization is natural. Cheminformatics – dimension reduction is necessary to visualize.
Visualization (Dimension Reduction) Hidden Markov Models, …
29
THANK YOU!
QUESTIONS?
30