OVERVIEW OF MULTICORE, PARALLEL COMPUTING, AND DATA MINING

Seung-Hee Bae
Indiana University, Computer Science Dept.

OUTLINE

• Motivation
• Multicore
• Parallel Computing
• Data Mining

MOTIVATION

• According to the “How Much Information” project at UC Berkeley:
  • Print, film, magnetic, and optical storage media produced about 5 exabytes (a billion billion bytes) of new information in 2002.
  • 5 exabytes is equivalent to about 37,000 libraries the size of the Library of Congress book collection (17 million books).
• The rate of data growth will continue to accelerate through weblogs, digital photos and video, surveillance monitors, scientific instruments (sensors), instant messages, etc.
• Thus, we need more powerful computing platforms to deal with this much data.
• To take advantage of multicore chips, it is critical to build software with scalable parallelism.
• To deal with a huge amount of data and utilize multicore, it is essential to develop data mining tools with highly scalable parallel programming.

RECOGNITION, MINING, AND SYNTHESIS (RMS)

(from P. Dubey, “Recognition, Mining and Synthesis Moves Computers to the Era of Tera,” Technology@Intel Magazine, Feb. 2005.)

• Intel points out that these three processing cycles will be necessary to deal with most generalized decision support (data mining).
• Examples
  • Medicine (a tumor)
  • Business (hiring)
  • Investment

• Motivation
• Multicore
  • Toward Concurrency
  • What is Multicore?
• Parallel Computing
• Data Mining

TOWARD CONCURRENCY IN SOFTWARE

• Exponential growth (Moore’s Law) will change.
• Is Moore’s Law over? Not yet (the number of transistors is still increasing).

Previous CPU performance gains
• Clock speed: getting more cycles
  • It has become harder to exploit higher clock speeds (2 GHz in 2001, 3.4 GHz in 2004, now?).
• Execution optimization: more work per cycle
  • Pipelining, branch prediction, multiple instructions per clock, instruction reordering
• Cache
  • Increasing the size of the on-chip cache, because main memory is much slower than the cache.

Current CPU performance gains
• Hyperthreading
  • Running two or more threads in parallel inside a single CPU.
  • It doesn’t help single-threaded applications.
• Multicore
  • Running two or more actual CPUs on one chip.
  • It will boost reasonably well-written multithreaded applications, but not single-threaded applications.
• Cache
  • Only this will broadly benefit most existing applications.
  • A cache miss costs 10 to 50 times as much as a cache hit.

WHAT IS MULTICORE?

• A single chip with multiple distinct processing engines
• E.g., shared-cache dual-core architecture

[Figure: Core 0 and Core 1, each with its own CPU and L1 cache, sharing a single L2 cache]

• Motivation
• Multicore
• Parallel Computing
  • Parallel architectures (Shared-Memory vs. Distributed-Memory)
  • Decomposing Program (Data Parallelism vs. Task Parallelism)
  • MPI and OpenMP
• Data Mining

PARALLEL COMPUTING: INTRODUCTION

• Parallel computing
  • More than just a strategy for achieving good performance
  • A vision for how computation can seamlessly scale from a single processor to virtually limitless computing power
• Parallel computing software systems
  • Goal: to make parallel programming easier and the resulting applications more portable and scalable while achieving good performance.
  • Component Parallel Paradigm (Explicit Parallel)
    • One explicitly programs the different parts of a parallel application.
    • E.g., MPI, PGAS, CCR & DSS, Workflow, DES
  • Program Parallel Paradigm (Implicit Parallel)
    • One writes a single program to describe the whole application; the compiler and runtime break up the program into multiple parts that execute in parallel.
    • E.g., OpenMP, HPF, HPCS, MapReduce
• Parallel Computing Challenges
  • Concurrency & communication
  • Scalability and portability are difficult to achieve.
  • Diversity of architectures

PARALLEL ARCHITECTURE 1

• Shared-memory machines
  • Have a single shared address space that can be accessed by any processor.
  • Examples: multicore, symmetric multiprocessor (SMP)
  • Uniform Memory Access (UMA)
    • Access time is independent of the location.
    • Uses a bus or a fully connected network.
    • Scalability is hard to achieve.
• Distributed-memory machines
  • The system memory is packaged with individual nodes of one or more processors (cf. separate computers connected by a network).
  • E.g., cluster
  • Communication is required to provide data from one processor to a different processor.


Shared-Memory vs. Distributed-Memory

Shared-Memory
  Pros
  • Lower latency and higher bandwidth
  • Data are available to all of the CPUs through load and store instructions.
  • Single address space
  Cons
  • Cache coherency should be dealt with carefully.
  • Synchronization is explicitly needed to access shared data.
  • Scalability issue

Distributed-Memory
  Pros
  • Scalable, if a scalable interconnection network is used.
  • Quite fast local data access.
  Cons
  • Communication is required to access data in a different processor.
  • Communication management problems: (1) long latency, (2) long transmission time


Hybrid systems

• Distributed shared-memory (DSM)
  • A distributed-memory machine that allows a processor to directly access a datum in a remote memory.
  • Latency varies with the distance to the remote memory.
  • Emphasizes the Non-Uniform Memory Access (NUMA) characteristics.
• SMP clusters
  • A distributed-memory system with an SMP as the unit.

PARALLEL PROGRAMMING MODEL

• Shared-memory programming model
  • Synchronization is needed to preserve the integrity of shared data.
  • More appropriate to shared-memory machines.
  • E.g., Open Specifications for MultiProcessing (OpenMP)
• Message-passing programming model
  • Send-receive communication steps.
  • Communication is used to access a remote data location.
  • More appropriate to distributed-memory machines.
  • E.g., Message Passing Interface (MPI)
• A shared-memory programming model can also be used on distributed-memory machines, and a message-passing programming model can also be used on shared-memory architectures. However, the efficiency of the programming model differs.

PARALLEL PROGRAM: DECOMPOSITION 1

Data Parallelism
• Subdivides the data domain of a problem into multiple regions and assigns them to different processors.
• Exploits the parallelism inherent in many large data structures.
• Same task on different data (SPMD).
• More commonly used in scientific problems.
• Features
  • Natural form of scalability.
  • Hard to express when the geometry is irregular or dynamic.
  • Can be expressed by all parallel programming models (i.e., MPI, HPF-like, OpenMP-like).

Functional Parallelism
• Different processors carry out different functions.
• Coarse-grain parallelism.
• Different tasks on the same or different data.
• Features
  • Parallelism limited in size: tens, not millions.
  • Synchronization probably good.
  • Parallelism and decomposition can be derived from the problem structure.
  • E.g., workflow
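The two decomposition styles can be contrasted in a small Python sketch. This is only an illustration under assumptions: the helper names (`split`, `data_parallel_sum_of_squares`, `functional_parallel_stats`) are hypothetical, and Python threads are used for brevity even though CPython's global interpreter lock means they show the structure of the decomposition rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Decompose `data` into `parts` contiguous regions of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def data_parallel_sum_of_squares(data, workers=4):
    """Data parallelism: the same task runs on different regions of the data."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: sum(x * x for x in chunk),
                            split(data, workers))
        return sum(partials)  # combine the per-region results


def functional_parallel_stats(data):
    """Functional parallelism: different tasks run concurrently on the same data."""
    tasks = {"min": min, "max": max, "sum": sum}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


numbers = list(range(10))
squares_total = data_parallel_sum_of_squares(numbers)
stats = functional_parallel_stats(numbers)
```

Note how the data-parallel version scales with the number of regions, while the functional version is limited by the number of distinct tasks (three here).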

PARALLEL PROGRAM: DECOMPOSITION 2

• Load balance and scalability
  • Scalable: running time is inversely proportional to the number of processors used.
  • Speedup(n) = T(1)/T(n); scalable if Speedup(n) ≈ n.
  • Second definition of scalability: scaled speedup.
    • Scalable if the running time remains the same when the number of processors and the problem size are both increased by a factor of n.
• Why is scalability not achieved?
  • A region that must be run sequentially: total speedup ≤ T(1)/Ts, where Ts is the time spent in the sequential region (Amdahl’s Law).
  • A requirement for a high degree of communication or coordination.
  • Poor load balance (avoiding it is a major goal of parallel programming).
    • If one of the processors takes half of the parallel work, speedup will be limited to a factor of two.
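The speedup limits above can be checked with a few lines of arithmetic. The sketch below is a simplified model, assuming no communication cost and a parallel region that divides cleanly; the function names are hypothetical.

```python
def speedup(t1, tn):
    """Speedup(n) = T(1) / T(n)."""
    return t1 / tn


def amdahl_time(t1, ts, n):
    """Running time on n processors when `ts` of the total `t1` must run
    sequentially and the rest is split evenly (assumed: no communication cost)."""
    return ts + (t1 - ts) / n


def imbalanced_time(t1, ts, shares):
    """Running time when processor p receives fraction shares[p] of the parallel
    work; the slowest (most loaded) processor determines the finish time."""
    return ts + (t1 - ts) * max(shares)


# 100 time units in total, 10 of which are sequential: speedup can never exceed 100/10.
s10 = speedup(100.0, amdahl_time(100.0, 10.0, 10))
s_huge = speedup(100.0, amdahl_time(100.0, 10.0, 10**6))

# No sequential region, but one processor takes half of the parallel work.
s_unbalanced = speedup(100.0, imbalanced_time(100.0, 0.0, [0.5, 0.25, 0.25]))
```

With ten processors the sequential region already holds the speedup to 100/19; adding processors only approaches, and never reaches, the bound T(1)/Ts = 10.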

MEMORY MANAGEMENT

• Memory-hierarchy management
  • Blocking
    • Ensuring that data remains in cache between subsequent accesses to the same memory location.
  • Elimination of false sharing
    • False sharing: when two different processors access distinct data items that reside on the same cache line.
    • Ensure that data used by different processors reside on different cache lines (by padding: inserting empty bytes in a data structure).
• Communication minimization and placement
  • Move send and receive commands far enough apart so that the time spent on communication can be overlapped with computation.
• Stride-one access
  • Programs in which the loops access contiguous data items are much more efficient than those that do not.
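Blocking and stride-one access can be illustrated with a tiled matrix multiply. The sketch below is written in pure Python only to show the loop structure; the cache benefit appears in compiled languages with contiguous arrays, and the function names and the block size of 2 are assumptions chosen for illustration.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def matmul_blocked(a, b, block=2):
    """Blocked (tiled) multiply: work on block x block tiles so that a tile of
    each matrix can stay in cache while it is reused."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # Multiply one tile of a by one tile of b into one tile of c.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        a_ik = a[i][k]
                        # Innermost loop walks rows of b and c contiguously
                        # (stride-one access).
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += a_ik * b[k][j]
    return c


a = [[i + j for j in range(3)] for i in range(5)]      # 5 x 3
b = [[i * j + 1 for j in range(4)] for i in range(3)]  # 3 x 4
c_blocked = matmul_blocked(a, b, block=2)
```

The blocked version performs the same multiplications as the naive one, only in a different order, so the two results agree.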

MESSAGE PASSING INTERFACE (MPI)

• Message Passing Interface (MPI)
  • A specification for a set of functions for managing the movement of data among sets of communicating processes.
  • The dominant scalable parallel computing paradigm for scientific problems.
• Explicit message send and receive using a rendezvous model.
  • Point-to-point communication
  • Collective communication
• Commonly implemented in terms of an SPMD model
  • All processes execute essentially the same logic.
• Pros
  • Scalable and portable
  • Race conditions avoided (implicit synchronization with the completion of the copy)
• Cons
  • The programmer must implement the details of communication.

MPI

• 6 key functions
  • MPI_INIT
  • MPI_COMM_SIZE
  • MPI_COMM_RANK
  • MPI_SEND
  • MPI_RECV
  • MPI_FINALIZE
• Collective communications
  • Barrier, Broadcast, Gather, Scatter, All-to-all, Exchange
  • General reduction operations (sum, minimum, scan)
• Blocking, nonblocking, buffered, and synchronous messaging
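The SPMD pattern behind the six key functions can be imitated without an MPI installation. The sketch below is a toy stand-in, not the MPI API: `run_spmd`, `send`, and `recv` are hypothetical helpers built on Python threads and queues, and the queues are buffered rather than rendezvous-style. Each "rank" runs the same program, learns its rank and the communicator size, and ranks 1..size-1 send a partial sum to rank 0.

```python
import queue
import threading


def run_spmd(size, program):
    """Run program(rank, size, send, recv) once per rank, SPMD style."""
    # One mailbox per (source, destination) pair stands in for point-to-point channels.
    mailboxes = {(s, d): queue.Queue() for s in range(size) for d in range(size)}
    results = [None] * size

    def worker(rank):
        def send(dest, value):            # toy counterpart of a send call
            mailboxes[(rank, dest)].put(value)

        def recv(source):                 # toy counterpart of a blocking receive
            return mailboxes[(source, rank)].get()

        results[rank] = program(rank, size, send, recv)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for t in threads:                     # "initialize": start every rank
        t.start()
    for t in threads:                     # "finalize": wait for every rank
        t.join()
    return results


def partial_sum_program(rank, size, send, recv):
    """Every rank executes the same logic on its own slice of 0..99."""
    local = sum(range(100)[rank::size])
    if rank == 0:
        # Rank 0 gathers the partial sums: a hand-written sum reduction.
        return local + sum(recv(source) for source in range(1, size))
    send(0, local)
    return None


results = run_spmd(4, partial_sum_program)
```

In a real MPI program the gather loop on rank 0 would normally be replaced by a collective reduction, which is one reason the collective operations are listed alongside the six basic functions.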

OPEN SPECIFICATIONS FOR MULTIPROCESSING (OpenMP) 1

• Appropriate to uniform-access, shared-memory machines.
• A sophisticated set of annotations (compiler directives) for traditional C, C++, or Fortran codes to aid compilers in producing parallel codes.
• Provides parallel loops and collective operations such as summation over loop indices.
• Provides lock variables to allow fine-grain synchronization between threads.
• Specifies where multiple threads should be applied, and how to assign work to those threads.
• Pros
  • Excellent programming interface for uniform-access, shared-memory machines.
• Cons
  • No way to specify locality in machines with non-uniform shared memory or distributed memory.
  • Cannot express all parallel algorithms.

OpenMP 2

• Directives: instruct the compiler to
  • Create threads, perform synchronization operations, and manage shared memory.
• Examples
  • PARALLEL DO ~ END PARALLEL DO
  • SCHEDULE (STATIC)
  • SCHEDULE (DYNAMIC)
  • REDUCTION(+: x)
  • PARALLEL SECTIONS
• OpenMP synchronization primitives
  • Critical sections
  • Atomic updates
  • Barriers
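What a parallel loop with a static schedule and a sum reduction asks the runtime to do can be mimicked by hand. The sketch below is a Python analogy, not OpenMP itself: `parallel_sum_static` is a hypothetical name, the chunking rule is one simple way to realize a static schedule, and a lock plays the role of a critical section.

```python
import threading


def parallel_sum_static(values, num_threads=4):
    """Hand-written analogue of PARALLEL DO with SCHEDULE(STATIC) and REDUCTION(+: x)."""
    n = len(values)
    chunk = (n + num_threads - 1) // num_threads  # iterations per thread, rounded up
    total = 0
    lock = threading.Lock()

    def body(thread_id):
        nonlocal total
        # Static schedule: each thread owns one fixed, contiguous range of iterations.
        lo = min(n, thread_id * chunk)
        hi = min(n, lo + chunk)
        private = 0                      # thread-private copy of the reduction variable
        for i in range(lo, hi):
            private += values[i]
        with lock:                       # critical section: combine the private copies
            total += private

    threads = [threading.Thread(target=body, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:                    # joining acts as the barrier at the end of the loop
        t.join()
    return total


x = parallel_sum_static(list(range(1, 101)), num_threads=4)
```

A dynamic schedule would instead let idle threads pull the next chunk from a shared work queue, which helps when iterations have uneven cost.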

• Motivation
• Multicore
• Parallel Computing
• Data Mining
  • Expectation Maximization (EM)
  • Deterministic Annealing (DA)
  • Hidden Markov Model (HMM)
  • Other Important Algorithms

EXPECTATION MAXIMIZATION (EM)

• Expectation Maximization (EM)
  • A general algorithm for maximum-likelihood (ML) estimation where the data are “incomplete” or the likelihood function involves latent variables.
  • An efficient iterative procedure.
  • Goal: estimate unknown parameters, given measurements.
  • A hill-climbing approach: guaranteed to reach a maximum, which may be only a local maximum.
• Two steps
  • E-step (Expectation): the missing data are estimated given the observed data and the current estimate of the model parameters.
  • M-step (Maximization): the likelihood function is maximized under the assumption that the missing data are known. (The estimated missing data from the E-step are used in lieu of the actual missing data.)
• Those two steps are repeated until the likelihood converges.
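The E-step and M-step can be made concrete with a mixture of two one-dimensional Gaussians, where the "missing data" are the component labels. This is a minimal sketch under assumptions: the model, the starting values, the fixed iteration count, and the function names are all chosen for illustration and are not taken from the slides.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, mean_a, mean_b, iterations=30):
    """EM for a two-component 1-D Gaussian mixture.
    Returns the fitted means and the log-likelihood recorded at each iteration."""
    weight_a, var_a, var_b = 0.5, 1.0, 1.0   # assumed starting values
    log_likelihoods = []
    for _ in range(iterations):
        # E-step: estimate the missing labels as responsibilities of component a.
        resp_a, log_lik = [], 0.0
        for x in data:
            pa = weight_a * normal_pdf(x, mean_a, var_a)
            pb = (1.0 - weight_a) * normal_pdf(x, mean_b, var_b)
            resp_a.append(pa / (pa + pb))
            log_lik += math.log(pa + pb)
        log_likelihoods.append(log_lik)
        # M-step: maximize the likelihood as if the estimated labels were known.
        n_a = sum(resp_a)
        n_b = len(data) - n_a
        mean_a = sum(r * x for r, x in zip(resp_a, data)) / n_a
        mean_b = sum((1.0 - r) * x for r, x in zip(resp_a, data)) / n_b
        # The small floor guards against a degenerate zero variance.
        var_a = max(sum(r * (x - mean_a) ** 2 for r, x in zip(resp_a, data)) / n_a, 1e-6)
        var_b = max(sum((1.0 - r) * (x - mean_b) ** 2 for r, x in zip(resp_a, data)) / n_b, 1e-6)
        weight_a = n_a / len(data)
    return mean_a, mean_b, log_likelihoods


data = [-1.2, -0.8, -1.0, -1.1, -0.9, 2.9, 3.1, 3.0, 3.2, 2.8]
fit_a, fit_b, history = em_two_gaussians(data, mean_a=-2.0, mean_b=4.0)
```

The recorded log-likelihood never decreases from one iteration to the next, which is the hill-climbing behavior described above. A production version would stop when the change in log-likelihood falls below a tolerance instead of running a fixed number of iterations.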

DETERMINISTIC ANNEALING (DA)

• Purpose: avoid local minima (optimization)
• Simulated Annealing (SA)
  • A sequence of random moves is generated, and the random decision to accept a move depends on the cost of the resulting configuration relative to the cost of the current state (Monte Carlo method).
• Deterministic Annealing (DA)
  • Uses expectation instead of stochastic simulation (random moves).
  • Deterministic: makes incremental progress on the average (minimizes the free energy F directly).
  • Annealing: still wants to avoid local minima, with a certain level of uncertainty.
    • Minimizes the cost at a prescribed level of randomness (Shannon entropy).
    • F = D - TH (T: temperature, H: Shannon entropy, D: cost)
    • At large T the entropy (H) dominates, while at small T the cost dominates.
    • Annealing lowers the temperature so that the solution tracks continuously.

DA FOR CLUSTERING

• This is an extended K-means algorithm.
• Start with a single cluster, which gives the solution Y_1 as its centroid.
• For some annealing schedule for T, iterate the algorithm, testing the covariance matrix of the points x_i about each cluster center to see whether the cluster is “elongated”.
• Split the cluster if the elongation is “long enough” (a phase transition).
• You do not need to assume the number of clusters, but rather a final resolution T or equivalent.
• At T = 0, the uninteresting solution is N clusters, one at each point x_i.
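The annealing idea can be sketched for one-dimensional data. This is a simplified illustration, not the splitting procedure described above: it assumes a fixed number of centroids k instead of testing the covariance for elongation, soft assignments of the form p(j|x) ∝ exp(-d²/T), a geometric cooling schedule, and a tiny deterministic nudge at each temperature so that coincident centroids can separate. All names and default values are hypothetical.

```python
import math


def da_cluster_1d(points, k=2, t_start=100.0, t_stop=0.01, cooling=0.5, sweeps=50):
    """Deterministic-annealing clustering sketch for 1-D points with fixed k."""
    mean = sum(points) / len(points)
    centroids = [mean] * k               # at high T every centroid sits at the global mean
    temperature = t_start
    while temperature > t_stop:
        # Nudge the centroids apart so a phase transition (split) can show up.
        centroids = [c + 1e-3 * (j - (k - 1) / 2.0) for j, c in enumerate(centroids)]
        for _ in range(sweeps):
            sums = [0.0] * k
            weights = [0.0] * k
            for x in points:
                d2 = [(x - c) ** 2 for c in centroids]
                d2_min = min(d2)
                # Soft assignment at this temperature; subtracting the minimum
                # distance keeps the exponentials from all underflowing.
                g = [math.exp(-(d - d2_min) / temperature) for d in d2]
                z = sum(g)
                for j in range(k):
                    p = g[j] / z
                    weights[j] += p
                    sums[j] += p * x
            # Expectation-based centroid update (no random moves).
            centroids = [s / w if w > 0.0 else c
                         for s, w, c in zip(sums, weights, centroids)]
        temperature *= cooling           # annealing: lower the temperature
    return centroids


points = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]
final_centroids = da_cluster_1d(points)
```

At high temperature the entropy term dominates and both centroids stay at the global mean; once the temperature drops far enough, the nudge grows instead of shrinking and the centroids move to the two groups of points.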

DA CLUSTERING RESULTS (GIS)

[Figures: Age under 5 vs. 25 to 34; Age under 5 vs. 75 and up]

HIDDEN MARKOV MODEL (HMM) 1

Markov Model
• A system is in one of a set of N distinct states, S_1, S_2, …, S_N, at any time.
• State transition probability
  • The special case of a discrete, first-order Markov chain:
    P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, …] = P[q_t = S_j | q_{t-1} = S_i]    (1)
  • Consider only processes in which the right-hand side of (1) is independent of time, which leads to the set of state transition probabilities a_ij of the form
    a_ij = P[q_t = S_j | q_{t-1} = S_i], 1 ≤ i, j ≤ N, with a_ij ≥ 0 and ∑_j a_ij = 1
• Initial state probability

Hidden Markov Model (HMM)
• The observation is a probabilistic function of the state.
• The state is hidden.
• Applications: speech recognition, bioinformatics, etc.
• Elements of an HMM
  • N, the number of states
  • M, the number of symbols
  • A = {a_ij}, the state transition probability distribution
  • B = {b_j(k)}, the symbol emission probability distribution in state j:
    b_j(k) = P[v_k at t | q_t = S_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M
  • π = {π_i}, the initial state distribution:
    π_i = P[q_1 = S_i], 1 ≤ i ≤ N
  • Compact notation: λ = (A, B, π)

HIDDEN MARKOV MODEL (HMM) 2

Three Basic Problems
• Prob(observation seq | model): Given the observation sequence O = O_1 O_2 … O_T and a model λ = (A, B, π), how do we efficiently compute P(O | λ)?
• Finding the optimal state sequence: Given O = O_1 O_2 … O_T and λ = (A, B, π), how do we choose a corresponding state sequence Q = q_1 q_2 … q_T that is optimal in some meaningful sense (i.e., best “explains” the observations)?
• Finding the optimal model parameters: How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?

Solutions of Those Problems
• Prob(observation seq | model)
  • Enumeration: computationally infeasible.
  • Forward procedure:
    α_t(i) = P(O_1 O_2 … O_t, q_t = S_i | λ)
• Finding the optimal state sequence: find the best state sequence (path)
  • Viterbi algorithm: a dynamic programming method
    δ_t(i) = max_{q_1, …, q_{t-1}} P[q_1 q_2 … q_{t-1}, q_t = S_i, O_1 O_2 … O_t | λ]
  • Path backtracking
• Finding the optimal model parameters
  • Baum-Welch method: choose λ = (A, B, π) such that P(O | λ) is locally maximized.
  • Essentially the EM method: iterative
    ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ)
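The forward procedure and the Viterbi algorithm fit in a few lines each. The sketch below uses a made-up two-state, two-symbol model (the numbers in `A`, `B`, and `pi` are assumptions for illustration) and includes a brute-force `path_probability` helper so that both algorithms can be checked against full enumeration on a short sequence.

```python
from itertools import product


def forward(A, B, pi, obs):
    """Forward procedure: returns P(O | lambda) by propagating alpha_t(i)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o] for j in range(n)]
    return sum(alpha)


def viterbi(A, B, pi, obs):
    """Viterbi algorithm: returns the best state path and its joint probability."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back_pointers = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        back_pointers.append(step)
        delta = new_delta
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for step in reversed(back_pointers):   # path backtracking
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]


def path_probability(A, B, pi, obs, states):
    """Joint probability P(Q, O | lambda) of one state path (used for enumeration)."""
    p = pi[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p


# Hypothetical model lambda = (A, B, pi) with N = 2 states and M = 2 symbols.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
obs = [0, 1, 1]

likelihood = forward(A, B, pi, obs)
best_path, best_prob = viterbi(A, B, pi, obs)
```

Enumeration sums over N^T state paths, whereas the forward procedure needs on the order of N²·T operations, which is why enumeration is only usable as a check on tiny examples like this one.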

OTHER IMPORTANT ALGS.

• Other data mining algorithms
  • Support Vector Machine (SVM)
  • K-means (a special case of DA clustering), nearest-neighbor
  • Decision tree, neural network, etc.
• Dimension reduction
  • GTM (Generative Topographic Map)
  • MDS (MultiDimensional Scaling)
  • SOM (Self-Organizing Map)

SUMMARY

• Era of multicore (parallelism is essential).
• Explosion of information from many kinds of sources.
• We are interested in scalable parallel data-mining algorithms.
  • Clustering algorithm (DA clustering)
    • GIS (demographic (census) data): visualization is natural.
    • Cheminformatics: dimension reduction is necessary to visualize.
  • Visualization (dimension reduction)
  • Hidden Markov Models, …

THANK YOU!

QUESTIONS?
