
Towards Parallel and Distributed Computing in Large-Scale Data Mining: A Survey

Han Xiao
Technical University of Munich

D-85748 Garching near Munich, Germany
[email protected]

April 8, 2010∗

Abstract

The implementation of data mining ideas in high-performance parallel and distributed computing environments is becoming crucial for ensuring system scalability and interactivity as data continues to grow inexorably in size and complexity. This paper is a survey of the parallelization of well-known data mining techniques covering classification, link analysis, clustering and sequential learning, which are among the most important topics in data mining research and development. Basic terminology related to data mining and parallel computing is introduced. For each algorithm, we provide a description, and we review and discuss current research on its parallel implementations.

1 Introduction

The wide availability of large-scale data sets from different domains, such as collections of images, text, and related data, has created a demand to automate the process of extracting information from them. Data Mining and Knowledge Discovery are commonly defined as the extraction of patterns or models from observed data. Such large-scale data sets offer the ability to explore much richer and more expressive models, as well as providing new and interesting domains for the application of learning algorithms. Examples range from book recommendation at Amazon and social connection mining on Facebook to the clustering of large image collections on Flickr.

Meanwhile, real-time and archival data increase as fast as or faster than computing power. Researchers are realizing that parallel processing is a practical technique for scaling up these algorithms. Although there are various reasons for performing data mining algorithms in a distributed manner, the most immediate and practical motivation is to develop learning algorithms that can take advantage of the increasing availability of multi-processor and grid computing technology. For instance, in Web

∗For latest revision, please download from http://home.in.tum.de/˜xiaoh


mining the input data is usually large and high-dimensional, so the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. On a deeper level, there are fundamental questions about distributed learning from the viewpoints of artificial intelligence and cognitive science.

However, using parallelization for speeding up and scaling up data mining implementations involves a number of challenges. First, the appropriate parallelization strategy can depend on the data mining task; there is no general parallelization technique for data mining, so specialized implementations of popular algorithms rarely lead to widespread use. Secondly, maintaining, debugging, and performance tuning a parallel application are extremely time-consuming tasks. It is usually not easy to modify existing code to achieve high performance on parallel systems, and an inappropriate parallelization buries the originally simple computation under large amounts of complex code for distributing the data and handling failures.

In this survey, we attempt to review work on the parallelization of state-of-the-art data mining algorithms, which is relevant for researchers using or trying to introduce parallel techniques into data mining. The survey is restricted to the application of parallel computing to complex data mining and knowledge discovery tasks involving large data sets. The rest of this paper is organized as follows. In Section 2, we introduce basic terminology on data mining, parallel environments and parallel programming models. In Section 3, we describe previous work on the parallelization of the 9 most well-known data mining algorithms in the research community: k-Nearest Neighbor, Decision Tree, Naïve Bayes Classifier, k-Means Clustering, Expectation-Maximization, PageRank, Support Vector Machine, Latent Dirichlet Allocation, and Conditional Random Field. For each algorithm, different parallelizations are discussed, together with the corresponding experimental work. Section 4 gives an overall view of performance analysis and a comparison between the two different forms of parallelism. Conclusions are presented in Section 5.

2 Basic Concepts and Terminology

In this section, we introduce basic concepts and terminology in data mining as well as some widely used programming models, frameworks and techniques in parallel computing, which will help the reader to understand the parallel algorithms in Section 3.

2.1 Data Mining and Machine Learning

Data Mining is: "The nontrivial extraction of implicit, previously unknown and potentially useful information from data" [39]. The analysis of data using machine learning and statistical techniques aims at finding hidden patterns and connections in these data. Machine Learning is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn" by the analysis of data sets. The focus of most machine learning methods is on automatically recognizing complex patterns and making intelligent decisions based on data. It is also concerned with the algorithmic complexity of computational implementations. [67] presents many of the


commonly used Machine Learning methods. Statistics has its grounds in mathematics and deals with the science and practice of analyzing empirical data. As we will see in Section 3, many methods of statistics are used in the field of Data Mining. Good overviews are given in [50, 10, 52]. Some other areas related to Data Mining are Databases [20] and Information Visualization [61].

Data Mining methods have different goals and commonly involve four classes of tasks [44]: regression, classification, clustering, and identifying meaningful associations between data attributes. It should also be noted that several methods with different goals may be applied successively to achieve a desired result. For example, to recommend books which customers are likely to buy, a business analyst might need to analyze the associations between user profiles and the descriptions of books, and then apply a content-based filtering approach that matches a new book's description to those books known to be of interest to the user.

2.2 Parallel and Distributed Computing

Parallel computing is a form of computation in which many calculations are carried out simultaneously [3]. It is based on the principle that large problems can often be divided into smaller ones, which are then solved concurrently. Some of the more commonly used terms associated with parallel computing are listed below [87, 78].

Speedup: It is one of the simplest and most widely used indicators of a parallel program's performance. It is defined as s_n = t_s / t_p, where t_s is the execution time using only one processor and t_p is the execution time using n processors. The maximum speedup that can be reached is linear speedup.

Scaleup: It captures how well the parallel algorithm handles larger data sets when more processors are available. A scaleup study measures execution times by keeping the problem size per processor fixed while increasing the number of processors.

Shared Memory System: Multiple processors can operate independently but share the same memory resources. Thus, data sharing between tasks is both fast and uniform due to the proximity of memory to the CPUs. However, adding more CPUs can geometrically increase traffic on the shared memory-CPU path.

Distributed Memory System: Each processor has its own local memory, and a communication network is used to connect the inter-processor memories. Increasing the number of processors increases the total memory size proportionately. Since the concepts of cache coherency and a global address space do not apply in a distributed memory system, the programmer is responsible for many of the details associated with data communication between processors.

There are two parallel programming models in common use: Threads and Message Passing. Threaded implementations are not new in computing; two different but well-known implementations of threads are POSIX Threads [16] and OpenMP [26]. In the Message Passing model, multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines. Tasks exchange data through communications



Figure 1: Collective communications in MPI. Each row represents a processor with 4 bytes of memory, one byte per box. Boxes filled with different textures represent different data. In the case of broadcast, processor P0 sends its data to all other processors, which gives rise to the same data on every processor. For the gather operation, P0 receives the data from the other processors and writes them into its local memory. Reduction operations like allreduce basically perform an allgather with some extra computation, which can be as trivial as a sum or an and, or can be carefully designed for a specific algorithm. In the context of data mining, the broadcast and scatter operations are often used at the beginning of an algorithm, where we have to distribute data from a master node to all the slave nodes. The gather and allgather operations are often called at the end of an algorithm to combine values from all processors and update the model parameters.

by sending and receiving messages. The most widely used message-passing library is MPI, which is now the "de facto" industry standard [38]. Collective communication is one of the remarkable features of MPI, since it can transmit data among all processors efficiently. A group of global reduction operations (such as sum, max, and min) is also supported in MPI. In some cases, a more complex reduction operation must be defined for computing or updating the global parameters of the model. Figure 1 gives a pictorial representation of four basic collective functions, and the short sketch below shows how such collectives are typically called from a data mining code.
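The following minimal sketch broadcasts initial model parameters and combines locally computed statistics with allreduce and gather. mpi4py and NumPy are illustrative choices that are not prescribed by this survey; the script would be launched with mpirun, e.g. mpirun -n 4 python collectives.py.

# Minimal sketch of the collective operations of Figure 1 (illustrative only).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# broadcast: the master distributes the initial model parameters to every rank.
params = comm.bcast(np.zeros(4) if rank == 0 else None, root=0)

# Each rank computes a partial statistic on its own data partition.
local_stat = np.full(4, float(rank))

# allreduce: combine the local statistics by summation; every rank gets the result.
global_stat = comm.allreduce(local_stat, op=MPI.SUM)

# gather: the master collects all local statistics, e.g. to update the model.
all_stats = comm.gather(local_stat, root=0)
if rank == 0:
    print(global_stat, len(all_stats))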

Recently, a distributed programming model called MapReduce [28] has attracted the attention of researchers and developers. MapReduce was developed by Google for processing massive amounts of data on large clusters. It is implemented as two functions: Map, which applies a function to all the members of a collection and returns a list of results based on that processing, and Reduce, which collates and resolves the results from two or more Maps executed in parallel by multiple threads, processors, or stand-alone systems. Both Map and Reduce may run in parallel, though not necessarily on the same system at the same time. The toy example below illustrates the two roles.
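This is a toy, single-process illustration of the map and reduce phases on a word-count task; the function names are illustrative, and a real MapReduce job would run the two phases on many machines.

# Toy illustration of the Map and Reduce roles (single process, illustrative only).
from collections import Counter
from functools import reduce

def map_phase(document):
    # Map: emit per-document word counts.
    return Counter(document.split())

def reduce_phase(counts_a, counts_b):
    # Reduce: collate two partial results into one.
    return counts_a + counts_b

documents = ["parallel data mining", "parallel computing", "data clustering"]
partial_counts = map(map_phase, documents)            # could run on many workers
total_counts = reduce(reduce_phase, partial_counts)   # merges the partial counts
print(total_counts.most_common(3))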


The recent improvements of Graphics Processing Units (GPUs) offer a powerful processing platform for both graphics and non-graphics applications [84, 63]. A GPU has a parallel "multi-core" architecture, each core capable of running thousands of threads simultaneously; if an application is suited to this kind of architecture, the GPU can offer large performance benefits. However, a typical computation running on the GPU must expose thousands of threads in order to use the hardware capabilities effectively. Therefore, finding large-scale parallelism is important for programming on the GPU. The introduction of NVIDIA CUDA (Compute Unified Device Architecture) brought, through a C-based API, an easy way to take advantage of the high performance of GPUs for parallel computing. CUDA also exposes a fast shared memory region (16 KB in size) that can be shared amongst threads. We will see in the next section that some data mining algorithms can be significantly accelerated on the GPU architecture.

3 Parallel Data Mining

3.1 k-Nearest Neighbors

The k-Nearest Neighbors (k-NN) algorithm is an easy-to-understand and easy-to-implement classification technique in data mining. It is based on a majority vote of the k closest training examples in the feature space [54]. If k = 1, then the object is simply assigned to the class of its nearest neighbor. In practical applications, k is in the units or tens rather than in the hundreds or thousands. Various measures (e.g. Euclidean distance, cosine similarity, KL-divergence) can be used to compute the distance between two data points; the most suitable distance metric may differ across applications.

k-NN is a type of lazy learning where the function is only approximated locally and all computation is deferred until classification. Thus, building the model is cheap, but classifying unknown objects is relatively expensive since it requires computing the distance of the unlabeled object to all the objects in the labeled set. Unlike the other data mining algorithms discussed in this paper, the parallelization of k-NN is applied not to the training phase but to the prediction of unobserved instances, which proceeds as follows (a minimal sketch is given after the steps):

1. Partition the dataset D into P blocks D1, . . . , DP; each processor handles roughly ∥D∥/P instances.

2. Given an unknown object, processor Pr calculates the k nearest neighbors Nr within its local training samples Dr.

3. A global reduction computes the overall k nearest neighbors Nglobal from N1, . . . , NP, and the object is then assigned to the class that is most common among Nglobal.
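A minimal sketch of these three steps is given below, assuming mpi4py and NumPy (illustrative choices, not prescribed by the surveyed work) and using an allgather in place of a custom reduction operator.

# Minimal sketch of parallel k-NN prediction (illustrative only).
import numpy as np
from mpi4py import MPI

def local_knn(query, X_local, y_local, k):
    # Step 2: the k best (distance, label) pairs from this processor's partition.
    dists = np.linalg.norm(X_local - query, axis=1)       # Euclidean distance
    idx = np.argsort(dists)[:k]
    return list(zip(dists[idx], y_local[idx]))

def parallel_knn_predict(query, X_local, y_local, k, comm=MPI.COMM_WORLD):
    candidates = local_knn(query, X_local, y_local, k)
    # Step 3: global reduction, here an allgather followed by a local merge.
    all_candidates = comm.allgather(candidates)
    merged = sorted(c for part in all_candidates for c in part)[:k]
    labels = [label for _, label in merged]
    return max(set(labels), key=labels.count)             # majority vote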

The main drawback of k-NN lies in its computational burden, which grows polynomially with the data size. A number of techniques have been developed for efficient computation that exploit structure in the data and do some preprocessing to avoid an "exhaustive search". A typical example is the BBD-tree [4], an approximate nearest-neighbor search structure that is commonly used in practice because of its better performance.


[55] evaluated the parallel k-NN on top of the Active Data Repository developed at the University of Maryland [18]. They used 8 Sun Microsystems Ultra Enterprise 450s, each of which has 4 250 MHz Ultra-II processors and 1 GB of RAM, connected by a Myrinet switch. The data set is 2.7 GB with points in a 3-dimensional space, and the value of k used is 10. They reported speedups on 2, 4, and 8 nodes with a single thread of 1.93, 4.04 and 7.70, respectively. They also measured the time taken by a version of the code that only performs I/O and no computation, and showed that the code is I/O bound and cannot benefit from additional threads for computation. It is worth highlighting that the k-NN algorithm can be significantly accelerated using the GPU architecture. The comparison in [41] was between standard k-NN implemented in C and in CUDA and a BBD-tree implemented in C, on a Pentium 4 3.4 GHz with 2 GB of DDR memory and an NVIDIA GeForce 8800 GTX graphics card. This graphics card has 128 stream processors clocked at 1.35 GHz, a core clock of 575 MHz, and 768 MB of 384-bit GDDR3 memory at 1.8 GHz. The results on 38,400 points with 96 dimensions showed that k-NN-CUDA is 100 times faster than k-NN-C, and 40 times faster than BBD-tree-C.

Additionally, researchers from computational geometry also find this algorithm intriguing. A k-Nearest Neighbor Graph (k-NNG) is defined as a graph in which two vertices p and q are connected by an edge if the distance between p and q is among the k smallest distances from p to the other vertices. The k-NNG is widely used in meshing, rendering and geometric embedding, so building the k-NNG efficiently has become a crucial problem in computational geometry. [24] presented a parallel algorithm for k-NNG construction that uses Morton ordering. They performed experiments on multi-core processors with Intel, AMD and Sun architectures, and showed that the algorithm performs best on point sets that use integer coordinates and scales well as more processing power becomes available.

3.2 Decision Tree

Decision tree algorithms are widely used in Data Mining because they can be expressed in a rule-based manner and can be easily converted into SQL statements that can be used to access databases efficiently [1]. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf. A decision tree can be learned by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner. C4.5 is an algorithm used to generate a decision tree using a divide-and-conquer strategy [74]. Given a training set with N instances, each object is represented as a feature vector (x1, . . . , xd) with a class label v. The general C4.5 algorithm is described as follows (a sketch of the gain computation is given after the steps):

1. For each feature fi, compute the information gain gi, i.e. the reduction of the entropy H(S) = −∑_{v=1}^{V} pv log2 pv obtained by splitting on fi. Here v takes on values in {1, . . . , V} and pv is the fraction of items labeled with value v in the set S.

2. Let fbest be the feature with the highest normalized gain gi, and create a decision node nm that splits on fbest.


3. Recurse on the sublists obtained by splitting on fbest, and add those nodes as children of nm.

4. Repeat this procedure on every node until the subset at a node all has the same value of the target variable, or until splitting no longer adds value to the predictions.
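The entropy-based gain computation of step 1 can be sketched as below for a categorical feature; the function names are illustrative only, and the normalization used by C4.5's gain ratio is omitted.

# Minimal sketch of the information gain used to select f_best (illustrative only).
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum_v p_v * log2(p_v) over the class labels in S.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # Reduction of entropy obtained by splitting on one categorical feature.
    total = len(labels)
    split_entropy = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        split_entropy += (len(subset) / total) * entropy(subset)
    return entropy(labels) - split_entropy

# A feature that perfectly separates the two classes has maximal gain:
print(information_gain(["a", "a", "b", "b"], [1, 1, 0, 0]))  # prints 1.0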

The parallelization of decision tree construction algorithms falls into task parallelism, data parallelism and hybrid parallelism. The task-parallelism approach proposed by [27] dynamically distributes the decision nodes among the processors for further expansion. A single processor using all the training data starts the construction phase; when the number of decision nodes equals the number of processors, the nodes are split among them. At this point each processor proceeds with the construction of the decision subtrees rooted at the nodes assigned to it. They experimented on two data sets: Census-Income¹ contains 27,222 cases with 14 attributes (five of the 14 attributes are continuous), and Letter-Recognition² contains 20,000 cases with 16 continuous attributes. They observed average speedups of [8 nodes, 1.5x] and [16 nodes, 2.5x] on a large-scale MIMD parallel computer, the Fujitsu AP1000. Beyond 16 processors the performance remains either static or degrades because of poor load balancing; that is, the sizes of the subtrees allocated to the processors varied, leading to an uneven distribution of work between the processors. [77] proposed a data parallelism that distributes the instances in the data set evenly among the processors. Each processor keeps in its memory only a distinct subset of examples of the training set. The possible splits of the examples associated with a node are evaluated by all the processors, and a global communication is performed at the end to find the global values of the splitting criteria and, thereby, the best split. They implemented this parallelization on a 16-node IBM SP2 Model 9076 using MPI. Each node in the multiprocessor is a 370 Node consisting of a POWER1 processor running at 62.5 MHz with 128 MB of RAM, and the nodes communicate with each other through the High-Performance Switch with HPS-tb2 adaptors. The speedups on 1.6M examples were 1, 1.9, 3.7 and 5.7 on 2, 4, 8 and 16 nodes, respectively. On the other hand, [40] split the data by attributes. Each processor keeps in its memory only the values for the set of attributes assigned to it and the values of the classes. During the evaluation of the possible splits, each processor is responsible only for the evaluation of its attributes. Because the evaluation of continuous attributes requires more processing than the evaluation of discrete attributes, this parallelism still suffers from load imbalance. The hybrid parallelism combines horizontal or vertical data distribution with task parallelism. For the nodes covering a significant number of examples, data parallelism is used to avoid the problems of load imbalance; for the nodes covering fewer examples, one of the processors continues the construction of the tree rooted at the node alone (task parallelism). Two parallel decision tree construction algorithms using hybrid parallelism are described in [79, 59]. [79] evaluated their hybrid algorithm in the same distributed environment and on the same data set used in [77]. The speedups on 1.6M examples were 2, 3.9, 7.4 and 13 on 2, 4, 8 and 16 nodes respectively, which is significantly better than data parallelism

¹ http://archive.ics.uci.edu/ml/support/Census+Income
² http://archive.ics.uci.edu/ml/datasets/Letter+Recognition


and task parallelism.

3.3 Naïve Bayes

Naïve Bayes is an important supervised classification method. It is based on Bayes' theorem with the assumption that the presence of a particular feature of a class is unrelated to the presence of any other feature. Naïve Bayes classifiers often work much better in many complex real-world situations than one might expect. [89] explained why, even with strong dependencies, Naïve Bayes still works well. General discussions of the Naïve Bayes method and its merits are given in [34, 37].

Given a training set with n instances in k classes, where each object is represented as a feature vector (x1, . . . , xd) with a class label v, the general Naïve Bayes classifier is described as follows:

1. Estimate P (y = v) directly from the proportion of class v objects in the trainingset

2. Estimate P(xi = u | y = v). If xi is categorical, taking only a few values, this estimation can be done simply as the fraction of "y = v" records that also have xi = u. If xi is continuous, a common strategy is to assume that xi has a Gaussian probability distribution.

To predict the value y for a new object with observations (x1 = u1, . . . , xd = ud), compute

y = argmax_v P(y = v | x1 = u1, . . . , xd = ud) = argmax_v P(y = v) ∏_{i=1}^{d} P(xi = ui | y = v)

The parallelization of Naïve Bayes is straightforward. Under the assumption that the components of x are independent, each of the distributions P(xi = u | y = v) can be estimated separately, as sketched below. [33] ran an MPI implementation of the algorithm on a cluster with 6 nodes, where each node has a 1.6 GHz CPU and 256 MB of physical memory and is connected by Ethernet. They evaluated the performance on the Reuters dataset³ with 9,603 training and 3,299 test documents; the speedup was [2 nodes, 1.3x], [4 nodes, 1.6x], [6 nodes, 1.8x]. On the other hand, [22] parallelized Naïve Bayes using the MapReduce model on a shared-memory system. They assign different sets of mappers to calculate the sufficient statistics, and the reducer then sums up the intermediate results to obtain the final parameter estimates. Their experiment was on a 16-way Sun Enterprise 6000 running Solaris 10. They evaluated the average speedup on ten datasets of different sizes (from 30,000 to 2,500,000 instances) from the UCI Machine Learning repository⁴, which makes their report more convincing. The results showed that the speedup was [4 nodes, 4x], [8 nodes, 7.8x], [16 nodes, 13x].
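A minimal sketch of this scheme is shown below: each worker counts class and (feature, value, class) occurrences on its partition and the counts are merged by a global reduction. mpi4py is an illustrative choice; the same counting logic maps directly onto mappers and a summing reducer.

# Minimal sketch of distributed Naive Bayes parameter estimation (illustrative only).
from collections import Counter
from mpi4py import MPI

def local_counts(X_local, y_local):
    class_counts, feature_counts = Counter(), Counter()
    for x, y in zip(X_local, y_local):
        class_counts[y] += 1
        for i, u in enumerate(x):
            feature_counts[(i, u, y)] += 1           # count of x_i = u given y = v
    return class_counts, feature_counts

def parallel_fit(X_local, y_local, comm=MPI.COMM_WORLD):
    cc, fc = local_counts(X_local, y_local)
    # Counters support '+', so a SUM reduction merges the partial counts.
    global_cc = comm.allreduce(cc, op=MPI.SUM)
    global_fc = comm.allreduce(fc, op=MPI.SUM)
    n = sum(global_cc.values())
    priors = {v: c / n for v, c in global_cc.items()}                     # P(y = v)
    likelihoods = {k: c / global_cc[k[2]] for k, c in global_fc.items()}  # P(x_i = u | y = v)
    return priors, likelihoods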

³ http://www.daviddlewis.com/resources/testcollections/reuters21578/
⁴ http://www.ics.uci.edu/~mlearn/


3.4 k-means

k-means is an unsupervised method of cluster analysis that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean. [46] provided a nice historical background for k-means placed in the larger context of hill-climbing algorithms. A detailed history of k-means along with descriptions of several variations is given in [35]. Given a set of d-dimensional vectors D = {xi | i = 1, . . . , N}, the algorithm is initialized by picking k "centroids" randomly or by some heuristic, and proceeds by alternating between two steps until convergence:

1. Data Assignment. In iteration t, each data point is assigned to its closest centroid.

S_i^(t) = { xj : ∥xj − m_i^(t)∥ ≤ ∥xj − m_i*^(t)∥ for all i* = 1, . . . , k }

The default measure of closeness is the Euclidean distance. In some applications, the KL-divergence is used to measure the distance between two data points representing two discrete probability distributions.

2. Relocation of “means”. Calculate the new centroid of the data points in thecluster.

m_i^(t+1) = (1 / |S_i^(t)|) ∑_{xj ∈ S_i^(t)} xj

There is no guarantee that the algorithm will converge to the global optimum, and the result may depend on the initial clusters. Therefore, it is common to run the algorithm multiple times with different initial centroids.

There has been some work studying the advantages of parallelism in the k-means procedure. Parallel k-means has previously been studied by [32, 80, 88, 57] for very large databases. The variation of speedup and scaleup with respect to the number of documents (vectors), the number of clusters and the dimension of the documents has been studied in [32]. The parallel k-means algorithm in master-slave mode is described as follows (a minimal sketch is given after the steps):

1. Partition the dataset into P blocks D1, . . . , DP , each processor handles roughlyN/P .

2. Processor P0 builds the initial k centroids (m1, . . . , mk)_global and broadcasts them to all processors.

3. Processor Pr reads the part of the dataset Dr it is responsible for and determines, one xi at a time, the centroid to which each of its xi is closest using the global parameters. Then Pr computes its local centroids (m1, . . . , mk)_local and sends them to P0.

4. After P0 collects all local centroids, it computes the new global centroids and broadcasts them to all processors.

Steps 3 and 4 are iterated until convergence.
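A minimal sketch of this loop is given below; mpi4py and NumPy are illustrative choices, and an allreduce of the per-cluster sums and counts stands in for the explicit gather-to-master and broadcast of steps 3 and 4.

# Minimal sketch of SPMD parallel k-means (illustrative only).
import numpy as np
from mpi4py import MPI

def parallel_kmeans(X_local, k, n_iter=20, comm=MPI.COMM_WORLD):
    # Step 2: rank 0 picks initial centroids (here its first k points) and broadcasts them.
    centroids = comm.bcast(X_local[:k].copy() if comm.Get_rank() == 0 else None, root=0)
    for _ in range(n_iter):
        # Step 3: assign each local point to its closest global centroid.
        dists = np.linalg.norm(X_local[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        local_sums = np.zeros_like(centroids)
        local_counts = np.zeros(k)
        for j in range(k):
            members = X_local[assign == j]
            local_sums[j] = members.sum(axis=0)
            local_counts[j] = len(members)
        # Step 4: combine the local statistics into new global centroids.
        global_sums = comm.allreduce(local_sums, op=MPI.SUM)
        global_counts = comm.allreduce(local_counts, op=MPI.SUM)
        centroids = global_sums / np.maximum(global_counts, 1)[:, None]
    return centroids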


[80] tested the k-means clustering algorithm on a PC cluster with a 10 Mbit Ethernet. The data set consists of 100,000 objects in 20 clusters, each with 20 continuous attributes. They were able to achieve about 90% efficiency for configurations of up to 32 processors. [32] implemented an SPMD version of this algorithm using MPI on an IBM POWERparallel SP2 with a maximum of 16 nodes. Each node in the multiprocessor is a Thin Node 2 consisting of an IBM POWER2 processor running at 160 MHz with 256 megabytes of main memory. The processors all run AIX level 4.2.1 and communicate with each other through the High-Performance Switch with HPS-2 adapters. They report three sets of experiments using artificially generated data sets, varying N, d, and k, respectively. They observed a speedup of 15.62 on 16 processors for the largest data set with n = 2^21 points, and a flattened speedup of 6.22 on 16 processors for n = 2^11. By studying the performance on different data sets, they also found that the speedups are essentially independent of d and k. They further reported that their implementation of parallel k-means has linear scaleup in n and k, and surprisingly better-than-linear scaleup in d. Moreover, [9] used k-means as a test case to investigate scalable implementations of out-of-core, I/O-intensive data mining algorithms on clusters of workstations. [22] developed the same algorithm using MapReduce and obtained linear speedup with up to 16 processors.

It is well known that the k-means algorithm is a hard-threshold version of the expectation-maximization (EM) algorithm. Meanwhile, the EM algorithm has a natural connection to Gibbs sampling methods for Bayesian inference, which are widely used in mixture models. One can envision that the EM algorithm and Gibbs sampling can be effectively parallelized using essentially the same strategy as this parallel k-means.

3.5 Expectation-Maximization

In statistics, the Expectation-Maximization (EM) algorithm is used for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on latent variables [31, 51]. It iteratively alternates between performing an expectation (E) step and a maximization (M) step. The EM algorithm has become a popular tool in data clustering and mixture estimation problems involving incomplete data [65]. In document classification problems, the cost of labeling documents is usually high, while unlabeled documents are commonly available; by applying the EM algorithm, we can use the unlabeled documents to augment the available labeled documents in the training process [70]. Given a likelihood function L(θ|x, z), where θ is the parameter vector, x is the observed data and z represents the latent variables or incomplete data, the maximum likelihood estimate (MLE) is determined by the marginal likelihood of the observed data L(θ|x); however, this quantity is often intractable. The EM algorithm seeks to find the MLE by iteratively applying the following two steps:

E-step In iteration t, calculate the expected value of the log likelihood function, withrespect to the current estimate of the parameters θ(t):

Q(θ | θ^(t)) = E_{z|x,θ^(t)} [ log L(θ | x, z) ]


M-step Find the parameter which maximizes this quantity Q(θ | θ^(t)):

θ^(t+1) = argmax_θ Q(θ | θ^(t))

The new parameters are then used to determine the distribution of the latent variables in the next E-step.

Convergence is assured since the algorithm is guaranteed to increase the likelihood at each iteration. Like k-means, the EM algorithm only gives a locally optimal solution.

The parallel EM algorithm in the SPMD model can be described as follows (a minimal sketch is given after the steps):

1. Partition the dataset into P blocks D1, . . . , DP , each processor handles roughlyN/P

2. Processor P0 builds the initial global parameters θ_global^(0) and broadcasts them to all processors.

3. Processor Pr reads the part of the dataset Dr it is responsible for, and iterates steps 4-6 until convergence.

4. E-step: each processor estimates the quantity Q_r(θ | θ_global^(t), Dr).

5. M-step: each processor re-estimates its own local parameters θ_r^(t+1) by maximizing Q_r(θ | θ_global^(t), Dr).

6. Use a collective communication operation to obtain the new global parameters θ_global^(t+1) from the local parameters θ_1^(t+1), . . . , θ_P^(t+1), and then return θ_global^(t+1) to all processors.

Many parallel implementations of the EM algorithm in different domains have been proposed in recent years. [58] employed the SPMD model of parallel EM for text classification on the PIRUN Cluster. The PIRUN Cluster consists of 72 nodes connected by a Fast Ethernet Switch 3COM SuperStack II. Each node is a 500 MHz Pentium III with 128 MB of memory running Linux, and the processors communicate with each other using MPI. They carried out 3 groups of experiments using 10,000, 5,000 and 2,500 documents drawn from the 20 Newsgroups dataset⁵. They claimed that their parallel algorithm yields better performance for the larger data sets: it achieved speedups of [2 nodes, 1.97x], [4 nodes, 3.72x], [8 nodes, 7.16x], and [16 nodes, 12.16x] on the largest set with 10,000 documents. For the smaller sets of 5,000 and 2,500 documents, the speedup curves tend to drop below the linear curve. [43] reported a hybrid-memory parallelization of the EM algorithm using the FREERIDE middleware [55]. Their experiments were conducted on a cluster with 6 nodes, where each node has a 700 MHz Pentium CPU and 1 GB of memory and is connected through Myrinet LANai 7.0. In the experiments, they generated three datasets of different sizes, containing millions of 10-dimensional points to be clustered. Each dataset was partitioned into thousands of chunks to make it disk-resident. They reported that the average speedups over the three datasets were [2 nodes, 1.76x],

⁵ http://people.csail.mit.edu/~jrennie/20Newsgroups/


[4 nodes, 3.47x], [8 nodes, 6x] with a single thread. When increasing the number of threads to 3, their parallelization demonstrated 10% additional speedup, because the reduction object is small enough to be cached in some of the instances. However, adding a 4th thread creates CPU contention and thus does not result in additional speedup. [22] also showed that the average speedup of parallel EM using the MapReduce model on a 16-core server was [2 nodes, 2x], [4 nodes, 3.8x], [8 nodes, 7x], [16 nodes, 10x]. The reason for the sub-unity slope is the increasing communication overhead.

3.6 PageRank

PageRank is a link analysis algorithm [14], named after Larry Page and used by the Google Internet search engine, that assigns a numerical weighting to each element of a hyperlinked set of documents. The intuition behind PageRank is that a web page is important if several other important web pages point to it. Generally, this algorithm can be used in any graph with link structure to measure the relative importance of the nodes. In addition to ranking web pages, PageRank has recently been proposed as a replacement for the traditional Institute for Scientific Information impact factor [29]. In text mining, PageRank has also been used to automatically rank WordNet synsets according to how strongly they possess a given semantic property, such as positivity or negativity [36]. Even in ecosystems, a modified version of PageRank may be used to determine species that are essential to the continuing health of the environment [2].

Given a directed graph G = (V, E) consisting of a set of pages V (vertices) and a set of directed links E (edges) that connect pages, the PageRank score vector of the pages can be computed as follows:

r = α · T · r + (1 − α) · (1/N) · 1_N

where α is the damping factor, which is generally set to around 0.85 [14], and T is the transition matrix:

T(p, q) = 0 if (q, p) ∉ E, and T(p, q) = 1/outdegree(q) if (q, p) ∈ E.

Parallelization of PageRank has great significance: the link graph of the real Web usually has a billion or more links, so a system that can compute results within minutes is highly desirable. Fortunately, parallelizing PageRank is not a new problem, since PageRank can be computed by basic linear algebra operations. One approach is to solve the linear system (I − U) · r = b in parallel, where U = α · T and b = (1 − α) · (1/N) · 1_N. Parallel linear system solvers are available in well-developed scientific computing packages such as the Portable, Extensible Toolkit for Scientific Computation (PETSc) [8, 7, 6]. The complete distributed algorithm is given as follows (a minimal sketch is given after the steps):

1. Partition the matrix U and the vector b into P blocks {U1, . . . , UP} and {b1, . . . , bP} by dividing the rows.

2. Processor P0 builds the initial global rank vector r_global^(0) and broadcasts it to all processors.


3. In iteration k, each processor Pi uses a linear solver, such as power iterations, Jacobi iterations or Krylov subspace methods, to calculate the local vector ri. With Jacobi iterations, r_i^(k) = U_i · r_global^(k−1) + b_i.

4. Concatenate the local rank vectors into a global rank vector by collective communication: r_global^(k) = [r_1^(k), r_2^(k), . . . , r_P^(k)]^T.

Steps 3 and 4 are iterated until convergence.
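A minimal sketch of the Jacobi variant of this loop is shown below, with each rank owning a block of rows of U and b; mpi4py and NumPy are illustrative choices rather than the PETSc-based implementation discussed next.

# Minimal sketch of distributed PageRank with Jacobi iterations (illustrative only).
import numpy as np
from mpi4py import MPI

def parallel_pagerank(U_local, b_local, n, n_iter=100, comm=MPI.COMM_WORLD):
    # U_local: this rank's rows of alpha*T (e.g. a sparse matrix of shape (n_local, n));
    # b_local: this rank's rows of (1 - alpha)/n * 1_N.
    r_global = comm.bcast(np.full(n, 1.0 / n), root=0)     # step 2: initial rank vector
    for _ in range(n_iter):
        r_local = U_local.dot(r_global) + b_local          # step 3: local Jacobi update
        # Step 4: concatenate the local vectors (in rank order) into the new global vector.
        r_global = np.concatenate(comm.allgather(r_local))
    return r_global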

[42] ran experiments on a parallel computer that was a Beowulf cluster of RLX blades connected in a star topology with gigabit Ethernet. They had seven chassis, each composed of 10 dual-processor Intel Xeon blades with 4 GB of memory (140 processors and 280 GB of memory in total). Each blade inside a chassis was connected to a gigabit switch, and the seven chassis were all connected to one switch. Their parallel PageRank code uses PETSc to implement basic linear algebra operations and basic iterative procedures on parallel sparse matrices. Their experiment on a dataset with 1.4 billion nodes took 35.5 minutes (2128 s) for PageRank and 28.2 minutes (1690 s) for BiCGSTAB on the full cluster of 140 processors, while the most efficient implementation of the serial PageRank algorithm took 12.5 hours on this graph using a quad-processor 667 MHz Alpha server and approximately 10 hours using an 800 MHz Itanium [15]. They studied the parallel performance of many linear solvers, including basic power iterations, Jacobi iterations, Generalized Minimum Residual (GMRES), Biconjugate Gradient (BiCG) and Biconjugate Gradient Stabilized (BiCGSTAB). They argued that BiCGSTAB and GMRES have the highest rate of convergence; nevertheless, their actual runtime can be longer than the runtime for simple power iterations. They also showed experimentally that when the communication and work load balance is approximately preserved, increasing the number of processors leads to a smaller computation time, but the speedup eventually levels off.

Another experiment, by [62], was run on a PC cluster of eight Opteron 240 machines, networked via gigabit Ethernet and running the Linux operating system. Each machine is equipped with 3 GB of main memory and a UW-SCSI hard disk. Their proposed parallel algorithm was written in C using the standard MPICH v1.2.5 library. They used a web graph derived from a crawl of the Thailand (.th) domain during January 2003, which contained around 10.9 million web pages and 97 million links. They reported the speedups of their parallelization as [2 nodes, 1.5x], [4 nodes, 2x], [8 nodes, 2.9x]. They also created additional artificial web graphs by concatenating several copies of the base graph and connecting those copies by rerouting some of the links. Their experiment on this artificial dataset (roughly 174.4 million web pages, 1.55 billion links) gave [2 nodes, 1.9x], [4 nodes, 3.8x], [8 nodes, 6.5x]. The slope of the speedup curve gets closer to linear as the size of the artificial data becomes larger.

3.7 Support Vector Machine

The Support Vector Machine (SVM) is a supervised learning method used for binary classification [13, 25, 49]. Intuitively, the aim of the SVM is to find the optimal separating hyperplane by maximizing the margin between the two classes, which offers the best generalization ability for future data. When the training data are not linearly separable, a


kernel function k(xi, xj) can be used to define a variety of nonlinear relationships between the inputs. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. Much work in recent years has gone into the study of different kernels for SVM classification [76]. Given a training set with N instances, where each object is represented as a feature vector xi with a class label yi ∈ {1, −1}, the general SVM training algorithm can be summarized as follows:

1. Choose a kernel function k(xi, xj)

2. Maximize the function below, subject to αi ≥ 0 and ∑_{i=1}^{N} αi yi = 0:

W(α) = ∑_{i=1}^{N} αi − (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} αi αj yi yj k(xi, xj)

where the αi are non-negative Lagrange multipliers; αi indicates the "support level" of instance xi for the hyperplane, and αi = 0 means that removing xi from the training set does not affect the position of the hyperplane.

3. The bias b is found as follows:

b = −(1/2) [ min_{j : yj = 1} ∑_{i=1}^{N} αi yi k(xi, xj) + max_{j : yj = −1} ∑_{i=1}^{N} αi yi k(xi, xj) ]

4. Given a new object z, the optimal αi go into the decision function:

D(z) = sign( ∑_{i=1}^{N} αi yi k(xi, z) + b )

There are several important extensions of the above basic formulation of the SVM. The "soft margin" idea was introduced to extend the SVM algorithm [83] so that the hyperplane allows a few noisy data points to lie on the wrong side of the margin. To solve problems that involve more than two classes, we can repeatedly use one of the classes as the positive class and the rest as the negative class to train several SVM models, which is known as the one-vs-all method. Moreover, the SVM can easily be extended to perform regression analysis [83].

The core of the SVM is a quadratic programming (QP) problem. Although several approaches for accelerating the QP, such as "chunking" [13, 56], Sequential Minimal Optimization (SMO) [73] and "shrinking" [56], have been proposed, improving compute speed through parallelization is difficult due to dependencies between the computation steps. [23] used a mixture of several SVMs, each of which has a weight and is trained only on a part of the data set. The training method can be implemented in a master-slave model as follows (a minimal sketch is given after the steps):

1. Partition the dataset into P blocks D1, . . . , DP , each processor handles roughlyN/P

2. Processor Pr reads the part of the dataset Dr it is responsible for, and builds a local SVM Sr.



Figure 2: Schematic of a binary Cascade architecture. The data are split into subsets and each one is evaluated individually for support vectors in the first layer. The results are combined two-by-two and entered as training sets for the next layer. The resulting support vectors are tested for global convergence by feeding the result of the last layer into the first layer, together with the non-support vectors. TD: training data, SVi: support vectors produced by optimization i.

3. Processor P0 trains the weight matrix w ∈ R^{P×N} by minimizing the cost function

C = ∑_{i=1}^{N} [ tanh( ∑_{r=1}^{P} w_{ri} S_r(xi) ) − yi ]²

where S_r(xi) is the output of S_r given input xi.

Their experiment on 100,000 examples showed that even on a single processor the training time decreased from 3,231 minutes to 237 minutes by using the mixture of SVMs instead of a single SVM. They claimed the reason is that their algorithm scales linearly with the number of training examples, whereas the complexity of the standard SVM scales much closer to O(N³). They also trained the mixture of SVMs on 50 machines, and the time dropped to 73 minutes (about a 3.2x speedup); it is not clear what parallel environment they used. In addition, they observed a significant improvement in the generalization of the mixture of SVMs.

Another notable parallelization is the Cascade SVM proposed in [45], which filters the non-support vectors out of the optimization in a hierarchical fashion, as shown in Figure 2.

The Cascade provides several advantages over a single SVM because it can reduce compute as well as storage requirements. Their experiment was on a Linux cluster with 16 nodes, where each node has dual AMD 1.8 GHz processors and 2 GB of RAM. The data set consists of 1,016,736 vectors. A Cascade with 1-5 layers was executed to find the global solution. The fully converged solution was found in 3 iterations, and the average speedups compared to a standard SVM were 1.5, 3.3, 4.0 and 4.5 on 2, 4, 8 and 16 nodes. The main limitation of this algorithm is that it only works on 2^k processors, and in the higher layers


it involves fewer processors in the optimization. This is why the acceleration saturates at a relatively small number of layers.

Moreover, [19] developed a parallel SVM algorithm (PSVM), which reduces memory use by performing a row-based, approximate matrix factorization, and which loads only essential data onto each machine to perform parallel computation. Their experiments showed that PSVM enjoys linear speedup when the number of machines is up to 30, and gave 90x and 120x speedups on 150 and 250 machines, respectively. [17] described a solver for SVM training running on a GPU, using Platt's Sequential Minimal Optimization algorithm. They implemented the parallel SVM using MapReduce and CUDA. The experiment was conducted on an NVIDIA GeForce 8800 GTX and achieved an 81-138x speedup compared to LIBSVM on an Intel Core 2 Duo 2.66 GHz processor.

3.8 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a Bayesian network that generates a document using a mixture of topics [12]. It assumes a generative probabilistic model in which documents d are represented as random mixtures over latent topics z, where each topic z is characterized by a probability distribution over words w. The complete generative process and the equivalent graphical model are shown below.

ϕ ∼ Dirichlet(β)
θ ∼ Dirichlet(α)
z_di | θ_d ∼ Multinomial(θ_d)
w_di | z_di, ϕ_{z_di} ∼ Multinomial(ϕ_{z_di})

LDA can capture the heterogeneity in grouped data which exhibit multiple patterns. In the recent past, LDA has emerged as an attractive framework to model, visualize [53] and summarize large document collections in a completely unsupervised fashion. Several extensions of the LDA model have been proposed, such as the Topics Over Time model that permits us to analyze the popularity of various topics as a function of time [85], Hidden Markov-LDA that integrates topic modeling with syntax [48], the Author-Persona-Topic model that models words and their authors [66], etc. In each case, graphical model structures are carefully designed to capture the relevant structure and co-occurrence dependencies among the data.

Although LDA is still a relatively simple model, exact inference is generally intractable. The solution to this is to use approximate inference algorithms, such as mean-field variational EM [12] and Gibbs sampling [47]. Gibbs sampling is a typical MCMC method for Bayesian inference. It directly follows the generative process, which makes it easier to understand and implement compared to variational EM. The general description of the Gibbs sampling estimator for standard LDA is as follows:

1. Assign an initial (random) topic z_di to every word in every document.

2. In each iteration, update the topic assignment z_di by sampling from the full conditional posterior distribution given below.



Figure 3: Graphical representation of LDA. LDA is a hierarchical generative model; it describes the complete generative process of a given corpus. It encapsulates three plates, or repetitive processes. The outermost plate describes the generation process for each document; this repeats D times, where D is the number of documents in the corpus. The embedded smaller plate shows the generation process for each word in a document; the process repeats N_d times, once for each word in document d. Another small plate on the right side denotes the generation process of all T topics, in which each topic is described by a multinomial distribution over the vocabulary. In general, T, α and β have to be hand-tuned for each data set.

The full conditional posterior distribution is

P(z_di | z_−di, w, α, β) ∝ (n_{d,z_di} + α_{z_di}) × (n_{z_di,w_di} + β_{w_di}) / ∑_{v=1}^{V} (n_{z_di,v} + β_v)

where n_{d,z} is the number of tokens in document d assigned to topic z, n_{z,v} is the number of tokens of word v assigned to topic z, and the number of topics is T.

3. After the burn-in period, the sampling algorithm gives direct estimates of z for every word. The document-topic distributions θ and the topic-word distributions ϕ can be obtained from

θ_{zd} = (n_{d,z} + α_z) / ∑_{z′=1}^{T} (n_{d,z′} + α_{z′}),   ϕ_{vz} = (n_{z,v} + β_v) / ∑_{v′=1}^{V} (n_{z,v′} + β_{v′}),

respectively.

In each iteration, the sampler has to go through the whole corpus and assign a topic to each word, which gives Gibbs sampling poor efficiency. When dealing with a large-scale document collection, standard serial Gibbs sampling is computationally infeasible. Prior work has explored multiple alternatives for speeding up LDA, including parallelizing both Gibbs sampling and variational EM across multiple machines. [5] presented an asynchronous distributed Gibbs sampling algorithm. [69] presented two synchronous methods, AD-LDA and HDLDA, to perform distributed Gibbs sampling. From a data-flow perspective, AD-LDA is similar to the parallel EM algorithm mentioned above. The AD-LDA algorithm works as follows (a minimal sketch is given after the steps):


1. Partition the dataset into P blocks D1, . . . , DP. Processor Pr works with its own word content w^|r and the corresponding topic assignments z^|r, and maintains local counts n^|r_{d,z} and n^|r_{z,v}.

2. Processor P0 builds the initial global counts n^|global_{z,v} and broadcasts them to all processors.

Iterate the following steps until a stopping criterion is met:

3. Each processor Pr samples every z^|r_di ∈ z^|r from the approximate posterior distribution:

P(z^|r_di | z_−di, w^|r, α, β) ∝ (n^|r_{d,z_di} + α_{z_di}) × (n^|global_{z_di,w_di} + β_{w_di}) / ∑_{v=1}^{V} (n^|global_{z_di,v} + β_v)

4. Each processor updates its local counts n^|r_{d,z} and n^|r_{z,v} according to the new topic assignments.

5. Use collective communication to obtain the new global counts n^|global_{z,v} from the local counts n^|1_{z,v}, . . . , n^|P_{z,v}, and then return them to all processors.
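A minimal sketch of one AD-LDA iteration is given below: each rank performs a collapsed Gibbs sweep over its own documents against a fixed copy of the global topic-word counts, and the count changes are then merged with an allreduce. mpi4py and NumPy are illustrative choices, α and β are treated as scalar symmetric hyperparameters, and rng is e.g. numpy.random.default_rng().

# Minimal sketch of one AD-LDA iteration (steps 3-5, illustrative only).
import numpy as np
from mpi4py import MPI

def adlda_iteration(docs, z, n_dz, n_zv_global, alpha, beta, rng, comm=MPI.COMM_WORLD):
    # docs: list of word-id lists on this rank; z: matching topic assignments;
    # n_dz: local document-topic counts (D_local x T);
    # n_zv_global: global topic-word counts (T x V), identical on every rank.
    T, V = n_zv_global.shape
    n_zv = n_zv_global.astype(float).copy()        # local working copy
    n_z = n_zv.sum(axis=1)                         # tokens currently in each topic
    for d, words in enumerate(docs):               # step 3: local Gibbs sweep
        for i, w in enumerate(words):
            old = z[d][i]
            n_dz[d, old] -= 1; n_zv[old, w] -= 1; n_z[old] -= 1
            p = (n_dz[d] + alpha) * (n_zv[:, w] + beta) / (n_z + V * beta)
            new = rng.choice(T, p=p / p.sum())
            z[d][i] = new                          # step 4: update local counts
            n_dz[d, new] += 1; n_zv[new, w] += 1; n_z[new] += 1
    # Step 5: merge every rank's count changes into the new global counts.
    delta = n_zv - n_zv_global
    return z, n_dz, n_zv_global + comm.allreduce(delta, op=MPI.SUM)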

[86] implemented the AD-LDA algorithm in both the MPI and MapReduce models and named it PLDA. They applied PLDA to document summarization on the Wikipedia dataset, which consists of 2,122,618 articles with 447,004,756 words. The experiments were conducted on 256 machines at Google's distributed data centers, where each machine is configured with a CPU faster than 2 GHz and more than 4 GB of memory. Their results showed that both MPI-PLDA and MapReduce-PLDA enjoy approximately linear speedup when the number of machines is up to 100. However, when the number of machines continues to increase, MPI-PLDA achieved speedups of [128 nodes, 94x] and [256 nodes, 169x], while MapReduce-PLDA yields [128 nodes, 57x] and [256 nodes, 72x]. They claimed that in the absence of machine failures, MPI-PLDA is more efficient because no disk IO is required between computational iterations. When the number of machines is large and the mean time to machine failure becomes a legitimate concern, the target application should either use MapReduce-PLDA or force checkpoints with MPI-PLDA. [21] applied this MPI-PLDA to the Orkut data set for a community recommendation task; the data set consists of 492,104 users and 118,002 communities. The results showed that the speedup is approximately linear on up to 8 machines, but after that adding more machines yields diminishing returns, [16 nodes, 7.45x] and [32 nodes, 10.66x], since communication takes up more and more of the total running time. With 32 machines they reduced the training time from 8 hours to less than 46 minutes. Additionally, [68] built parallel implementations of the variational EM algorithm for LDA in a multiprocessor architecture as well as in a distributed setting. They used a Linux machine with four 2.40 GHz CPUs sharing 4 GB of RAM for the shared-memory system, and a 96-node cluster in which each node is equipped with a Transmeta Efficeon TM8000 1.2 GHz processor, 1 MB of cache and 1 GB of RAM for the distributed-memory system. They showed that the multiprocessor implementation achieved a speedup of only 1.85 from 1 to


Figure 4: Graphical representation of a linear-chain CRF in which the transition score depends on its neighboring observations (state nodes y_{t-1}, y_t, y_{t+1} over observations x_{t-1}, x_t, x_{t+1}).

4 threads, while the distributed implementation achieved a significantly higher speedup of 14.5 from 1 to 50 nodes. They claimed that the multiprocessor implementation may not scale to large collections, since it stores the entire data set in memory and suffers from read conflicts between threads.

3.9 Conditional Random Fields

Conditional Random Fields (CRF) are a framework for building graphical probabilistic models to segment and label sequence data [60]. They inherit the characteristics of discriminative models and can encode non-independent features. Instead of modeling the subtle dependencies in the input space p(x), CRFs concentrate directly on modeling the conditional distribution p(y|x) between the observation data and the state sequence. Let G(V, F, E) be a factor graph, where V is the set of vertices connected by the edges in E and F is the set of factors. y is indexed by the vertices and T = card(V). Suppose Y is the set of all possible state sequences, so that y \in Y. Given a vertex y_t, y_t^c denotes the set of vertices tied to y_t and x_t^c the set of observations tied to y_t. By using Bayes' rule, the general conditional random field can be written in the form:

p(y|x) = \frac{p(x, y)}{\sum_{y \in Y} p(x, y)} = \frac{1}{Z(x)} \prod_{c \in C} \exp\left\{ \sum_{t=1}^{T} \sum_{k \in K} \lambda_k^c f_k^c(y_t, y_t^c, x_t^c) \right\}

where Z(x) = \sum_{y \in Y} \prod_{c \in C} \exp\left\{ \sum_{t=1}^{T} \sum_{k \in K} \lambda_k^c f_k^c(y_t, y_t^c, x_t^c) \right\} is a normalization term, C is a set of clique templates, and K is the set of all state-state pairs and state-observation pairs. f_k^c(y_t, y_t^c, x_t^c) is the feature function weighted by \lambda_k^c [82, 81, 75].

In a specific problem, {f_k(y_t, y_t^c, x_t^c)} depends on the structure of the graph. Consider the linear-chain structure, which has been used in sequential data mining tasks such as named-entity recognition [64] and part-of-speech tagging [60]: y_t is tied only to y_{t-1} and x_t, there is only one clique template, and the feature function can therefore be written as f_k(y_t, y_{t-1}, x_t).

Parameter estimation of CRFs aims to determine the best parameters \Lambda = \{\lambda_k^c\} for given data sequences (x^i, y^i), i \in [1, s], by maximizing the conditional log-likelihood l(\Lambda) = \sum_{i=1}^{s} \log p(y^i | x^i). In general, numerical approaches such as stochastic gradient


ascent, steepest ascent or quasi-Newton methods (BFGS) [11] are used to solve this optimization problem. A stochastic gradient training process can be described as follows:

1. Initialize \Lambda_0 as a set of random numbers.

2. For each given data sequence (x^i, y^i), calculate the marginal distributions p(y_t^i, y_{t'} | x^i, \Lambda_n) from the forward and backward recursions.

3. The gradient of \Lambda_n can then be calculated by:

   \frac{\partial l}{\partial \lambda_k^c} = \sum_{i=1}^{s} \sum_{t=1}^{T} f_k^c(y_t^i, y_t^{i,c}, x_t^{i,c}) - \sum_{i=1}^{s} \sum_{t=1}^{T} \sum_{y_{t'} \in y_t^{i,c}} f_k^c(y_t^i, y_t^{i,c}, x_t^{i,c}) \, p(y_t^i, y_{t'} | x^i, \Lambda_n) - \frac{\lambda_k^c}{\sigma^2}

4. Update the parameters: \Lambda_{n+1} = \Lambda_n + w \nabla l(\Lambda_n), where w is the learning step.

5. If the stop conditions are met, such as \|\nabla l(\Lambda_n)\| < \delta or n > n_{max}, end the training; otherwise, set n \leftarrow n + 1 and go to step 2.

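To illustrate steps 2-5 on a scale where they can be checked by hand, the sketch below computes the conditional log-likelihood and its gradient for a tiny linear-chain CRF by brute-force enumeration of all label sequences instead of the forward-backward recursions (feasible only for toy sizes). The feature set (emission and transition indicators), the toy data and the regularizer value sigma2 are illustrative assumptions.

import itertools
import numpy as np

L, n_obs, sigma2 = 3, 4, 10.0            # number of labels, observation symbols, regularizer

def score(y_seq, x_seq, W_emit, W_trans):
    # Unnormalized log-score: emission features f_k(y_t, x_t) plus transition features f_k(y_t, y_{t-1}).
    s = sum(W_emit[y_seq[t], x_seq[t]] for t in range(len(x_seq)))
    s += sum(W_trans[y_seq[t - 1], y_seq[t]] for t in range(1, len(x_seq)))
    return s

def feature_counts(y_seq, x_seq):
    # Feature counts of one (x, y) pair, used for both the empirical and the expected terms.
    emit, trans = np.zeros((L, n_obs)), np.zeros((L, L))
    for t in range(len(x_seq)):
        emit[y_seq[t], x_seq[t]] += 1
        if t > 0:
            trans[y_seq[t - 1], y_seq[t]] += 1
    return emit, trans

def loglik_and_grad(data, W_emit, W_trans):
    # Regularized conditional log-likelihood l(Lambda) and its gradient (step 3).
    ll = -(np.sum(W_emit ** 2) + np.sum(W_trans ** 2)) / (2 * sigma2)
    g_emit, g_trans = -W_emit / sigma2, -W_trans / sigma2
    for x_seq, y_seq in data:
        all_y = list(itertools.product(range(L), repeat=len(x_seq)))
        logs = np.array([score(y2, x_seq, W_emit, W_trans) for y2 in all_y])
        log_z = logs.max() + np.log(np.exp(logs - logs.max()).sum())
        probs = np.exp(logs - log_z)                      # exact p(y|x), brute force
        ll += score(y_seq, x_seq, W_emit, W_trans) - log_z
        emp_e, emp_t = feature_counts(y_seq, x_seq)
        g_emit += emp_e
        g_trans += emp_t
        for p, y2 in zip(probs, all_y):                   # subtract expected feature counts
            exp_e, exp_t = feature_counts(y2, x_seq)
            g_emit -= p * exp_e
            g_trans -= p * exp_t
    return ll, g_emit, g_trans

# Steps 1, 4 and 5: initialization, gradient-ascent updates with learning step w, fixed iteration budget.
rng = np.random.default_rng(1)
data = [(rng.integers(0, n_obs, 5), rng.integers(0, L, 5)) for _ in range(3)]
W_emit, W_trans = np.zeros((L, n_obs)), np.zeros((L, L))   # step 1 uses random values; zeros for simplicity
for n in range(20):
    ll, g_e, g_t = loglik_and_grad(data, W_emit, W_trans)
    W_emit += 0.1 * g_e
    W_trans += 0.1 * g_t
print("regularized log-likelihood after training:", round(float(ll), 3))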
Finally, given a new observation sequence x, the most probable assignment is defined as y^* = argmax_y p(y|x), which can be computed by the Viterbi recursion.

From the formulas above for the marginal distribution, the computational complexity is O(TL^2NG), where T is the sequence length, L is the number of labels, N is the number of training examples, and G is the number of gradient iterations. Due to the size of large training sets, parallelization is needed to improve performance. As reported in [72], a team from Tohoku University implemented a parallel training procedure for CRFs. The main idea of the parallel algorithm is to divide the training dataset into P sub-datasets. A main process gathers the values calculated by the slave processes in order to obtain the global gradient of l(\Lambda). After the optimization algorithm is applied and \Lambda is updated for all processes, each process proceeds to the next iteration. Their parallel algorithm can be described as follows [72]:

1. Generate features with initial weights \Lambda = [\lambda_1, \lambda_2, ...]; each process loads its own partition D_i.

2. The root process broadcasts \Lambda to all parallel processes.

3. Each process P_i computes the local log-likelihood l_i and the local gradient vector [\partial l / \partial \lambda_1, \partial l / \partial \lambda_2, ...]_i on D_i.

4. The root process gathers and sums all l_i and [\partial l / \partial \lambda_1, \partial l / \partial \lambda_2, ...]_i by computing

   l_{global} = \sum_i l_i, \quad [\partial l / \partial \lambda_1, \partial l / \partial \lambda_2, ...]_{global} = \sum_i [\partial l / \partial \lambda_1, \partial l / \partial \lambda_2, ...]_i

   to obtain the global log-likelihood and gradient.

5. The root process performs an L-BFGS optimization search to update the feature weights \Lambda.

6. If iterations < m then go to step 2; otherwise stop.

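The communication pattern of this scheme, broadcast the weights, compute local quantities, reduce them, update at the root, can be sketched with mpi4py as follows. The local computation is stubbed out by a hypothetical local_loglik_and_grad function standing in for the forward-backward pass on partition D_i, and a plain gradient step replaces the L-BFGS search of step 5; this is a sketch of the data flow, not the implementation of [72].

# Run with e.g.: mpiexec -n 4 python parallel_crf_sketch.py  (assumes mpi4py is installed)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_features, max_iter, step = 1000, 10, 0.01
weights = np.zeros(n_features)

def local_loglik_and_grad(w, partition_id):
    # Hypothetical stand-in for the CRF forward-backward pass on partition D_i.
    rng = np.random.default_rng(partition_id)
    target = rng.normal(size=w.shape)                 # fake, partition-dependent "data"
    return -0.5 * float(np.sum((w - target) ** 2)), target - w

for it in range(max_iter):
    comm.Bcast(weights, root=0)                                   # step 2: broadcast weights
    ll_local, grad_local = local_loglik_and_grad(weights, rank)   # step 3: local values on D_i
    ll_global = comm.allreduce(ll_local, op=MPI.SUM)              # step 4: global log-likelihood
    grad_global = np.empty_like(grad_local)
    comm.Allreduce(grad_local, grad_global, op=MPI.SUM)           # step 4: global gradient
    if rank == 0:
        weights += step * grad_global                             # step 5: update ([72] uses L-BFGS)

if rank == 0:
    print("finished", max_iter, "iterations, global log-likelihood:", round(ll_global, 3))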
Their test environment is a Cray XT3 system (an MPI system) with 180 AMD Opteron 2.4GHz processors and 8GB RAM each [72]. The results of the NP chunking and chunking tasks on the CoNLL2000-L dataset show that, on the one hand, CRF models have lower prediction error than other models such as the SVMs of Kudo & Matsumoto; compared to the previous best system, their model reduces the error by 22.93% on NP chunking. On the other hand, also on CoNLL2000-L, the training time of a single process is over 61 times that of 45 parallel processes. They also ran a cross-validation test on the Wall Street Journal data set, which took 1h21' on 45 processors, while it was estimated to take 56h on one processor; in this last experiment, the speed-up ratio is 41.5. In addition, the speed-up ratio is nearly linear in the number of parallel processes, as predicted.

4 An Overall Vision

In this section, we summarize the parallel data mining techniques described above in an overall vision. We draw analogies among the parallelizations of the different data mining algorithms and analyze their complexity, characteristics and parallelism. This offers high-level support for creating scalable data mining implementations in an effective and efficient way.

4.1 Performance Analysis

Table 1 shows the time complexity analysis for the nine algorithms discussed above. In general, we assume that the input training data set has N instances of dimension D, the test data set has only one instance, and there are P processors for both parallel training and testing. For the CRF model, we assume the length of the given test sequence is L. In clustering and classification tasks, the number of expected classes or labels is K. For iterative algorithms, we assume I steps are needed until convergence. In collective communications such as Broadcast or Scatter, T is the transmission time for the model parameters, which accounts for a log(P) factor.

The experimental results showed that many parallelized algorithms can achieve almost linear speedup when the number of machines is small, which agrees well with the theoretical analysis. However, when the number of machines increases beyond a certain threshold, the speedup slows down or even decreases.

There are many factors that limit the speedup of a parallel algorithm. In the data mining scenario, two factors deserve particular attention: load balancing and communication overhead. Speedup is generally limited by the speed of the slowest node. Writing an algorithm that evenly distributes its workload across all the processors is known as load balancing. It is possible that an unparallelizable serial component is present within the parallel algorithm due to computation dependencies. Such a serial component would allow only one processor to work on it while the other processors


Table 1: Time complexity analysis in training and testing

Algorithm     | Training (Sequential)    | Training (Parallel)             | Testing (Sequential) | Testing (Parallel)
k-NN          | n/a                      | n/a                             | O(DN)                | O(DN/P + T)
Decision Tree | O(DN log N + N log^2 N)  | O((DN/P) log(N/P) + N log^2 N)  | O(K)                 | n/a
Naive Bayes   | O(DN + DK)               | O(DN/P + DK)                    | O(KD)                | O(K(D/P + T))
k-means       | O(INDK)                  | O(IDK(N/P + T))                 | O(KD)                | O(K(D/P + T))
PageRank      | O(IN^2)                  | O(I(N^2/P + T))                 | n/a                  | n/a
EM            | O(INDK)                  | O(IDK(N/P + T))                 | O(KD)                | O(K(D/P + T))
SVM           | O(KN^3)                  | O(K^2(N/P + T))                 | O(N)                 | O(N/P + T)
LDA           | O(IDNK)                  | O(IDK(N/P + T))                 | O(KD)                | O(K(D/P + T))
CRF           | O(INK^2)                 | O(INK^2/P)                      | O(LK^2)              | O(LK^2/P)

remain idle, as we saw in step 3 of the training of the mixture of SVMs. A similar bottleneck occurs if the data set does not have uniform density and the workload on each processor is not evenly distributed. Take the sparse connection matrix in the PageRank algorithm as an example: if one processor is assigned empty rows that require no computation, it nonetheless has to wait for the others to finish before it can enter the next iteration. Load balancing heuristics for SVM, CRF and PageRank were proposed in [45], [72] and [42].

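The effect of such a serial component can be made explicit with Amdahl's law: if a fraction s of the work cannot be parallelized, the speedup on P processors is at most 1/(s + (1 - s)/P). The small calculation below, with an illustrative serial fraction of 5%, shows how quickly the attainable speedup saturates, which is consistent with the flattening speedup curves reported in the experiments above.

def amdahl_speedup(serial_fraction, processors):
    # Upper bound on speedup when a fraction of the work must run on a single processor.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# A 5% serial component caps the speedup at 20x, no matter how many machines are added.
for p in (8, 32, 128, 256, 1024):
    print("P =", p, " speedup <=", round(amdahl_speedup(0.05, p), 2))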
Communication overhead refers to the increase in the absolute time spent in communication between machines, and in the fraction of the total execution time taken by communication. As pointed out in [30], since the improvement of CPU performance outpaces the improvement of IO/communication performance, communication cost increasingly dominates a parallel algorithm. To reduce communication overhead, parallel algorithm designers make the grain size as large as possible or avoid communication whenever possible. Several data mining works [86, 58] have listed the improvement of disk I/O as their next target.

4.2 Parallelism Comparison

From all the parallelizations shown in Section 3, we observe that most of them use data parallelism (e.g. k-NN, Naive Bayes, k-means, EM, AD-LDA), while a few use task parallelism (e.g. Decision Tree, HDLDA) or hybrid parallelism (e.g. Cascade SVM). Table 2 summarizes the feasible forms of parallelism for all algorithms in Section 3.

Since most of the work in data mining focuses on performing operations on a data set, it is not at all surprising that data parallelism is the most commonly used. Data parallelism emphasizes the distributed nature of the data. It is achieved by distributing the training set among the processors, where each processor is responsible for a distinct set of examples. In the context of data mining, an example often has several dimensions representing its features, so the data set is typically organized into a matrix. The distribution of the data can therefore be performed in two different ways, horizontally or vertically. As shown in Figure 5, horizontal distribution refers to the cases where


Table 2: Feasible parallelism in data mining algorithms. A filled dot (●) indicates that such a parallelization of the algorithm is a feasible scheme, while a hollow dot (○) means such a parallelization is elusive and hard to design.

Algorithm     | Data Parallelism (Horizontal) | Data Parallelism (Vertical) | Task Parallelism | Hybrid Parallelism
k-NN          | ●                             | ●                           | ○                | ○
Decision Tree | ●                             | ●                           | ●                | ●
Naive Bayes   | ●                             | ●                           | ○                | ○
k-means       | ●                             | ●                           | ○                | ○
PageRank      | ●                             | ○                           | ○                | ○
EM            | ●                             | ○                           | ○                | ○
SVM           | ●                             | ○                           | ●                | ●
LDA           | ●                             | ○                           | ●                | ○
CRF           | ●                             | ○                           | ●                | ○

Figure 5: Data parallelism: horizontal (left) and vertical (right). The data matrix with examples x_1, ..., x_n as rows and features f_1, ..., f_n as columns is split row-wise on the left and column-wise on the right.

different database records reside in different places, while vertical data distribution refers to the cases where all the values for different features reside in different places.

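The two schemes amount to splitting the example-by-feature matrix along different axes; a minimal numpy illustration follows (the matrix size and the number of workers are arbitrary):

import numpy as np

X = np.arange(6 * 8).reshape(6, 8)     # 6 examples (rows) by 8 features (columns)
P = 4                                  # number of processors

horizontal_blocks = np.array_split(X, P, axis=0)   # each worker holds a subset of the examples
vertical_blocks = np.array_split(X, P, axis=1)     # each worker holds a subset of the features

print([b.shape for b in horizontal_blocks])        # [(2, 8), (2, 8), (1, 8), (1, 8)]
print([b.shape for b in vertical_blocks])          # [(6, 2), (6, 2), (6, 2), (6, 2)]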
Among the algorithms mentioned in Section 3, k-NN, Decision Tree, Naive Bayes and k-means can be parallelized in a vertical manner. Typically, if the model assumes that features are independent of each other, or if the number of dimensions is greater than the number of examples (i.e., D ≫ N), then vertical distribution can be applied. However, if the features are not independent, as in many hierarchical probabilistic models (e.g. LDA), splitting the data set by features is not a proper choice. On the other hand, horizontal distribution is the usual tactic in parallelization, since most probabilistic models assume that the observations are independent and identically distributed (i.i.d.). In case the data are dependent on each other, or the computation of a local estimator involves global parameters, this only gives an approximate solution, since the model parameters are estimated on local subsets of the data. Obviously, 10 probabilistic models estimated from local examples of size N/10 would produce a worse quality estimate than a sequential model estimated from the data set of size N. A typical example was given in Section 3.8, where Gibbs sampling draws topic assignments from an approximate full conditional distribution P(z_{di}^{|r} | z_{-di}, w^{|r}, \alpha, \beta). Therefore, it is weaker in predictive power and lacks theoretical backing. A more principled way to model


parallel processes is to build them directly into the probabilistic model, and thus make the estimation in the parallelized model theoretically equivalent to the serialized model, or at least pseudo-sequential [71, 69].

Task parallelism focuses on distributing execution processes across different parallel computing nodes, as opposed to data parallelism. The construction of decision trees introduced in Section 3.2 follows a task-parallelism approach: it can be viewed as dynamically distributing the decision nodes among the processors for further expansion. This approach generally suffers from poor load balancing, since it is hard to divide the task into pieces of equal complexity for different processors. Data parallelism and task parallelism complement each other and can be used together to handle large-scale training.

5 Conclusion

Data mining is a broad area that integrates techniques from several fields including machine learning, statistics, pattern recognition, artificial intelligence, and database systems. For the analysis of large volumes of data, speeding up and scaling up data mining implementations by introducing parallel and distributed computing emerges as an effective solution. In this paper, we surveyed research on the use of parallel and distributed techniques for mining large-scale data. We motivated this field of research, gave formal definitions of the terms used herein and presented a brief overview of several state-of-the-art data mining algorithms. We introduced their properties, applications to specific problems, parallel implementations and experimental results. Most of the data mining algorithms can be parallelized by building a local model on each processor and then combining these models into a global model using collective communication. We discussed the theoretical complexity and the bottlenecks of parallelization, and gave a general view of the forms of parallelism used in this paper. We believe the ideas discussed and the provided references can inspire the interested reader to pursue further studies in this field.

Acknowledgement

We thank Prof. Dr. Amitava Gupta and Dr. Thomas Stibor for helpful comments and constructive criticism on a previous draft.

References

[1] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An interval classifier for database mining applications. In Proceedings of the International Conference on Very Large Data Bases, pages 560–560. Citeseer, 1992.

[2] S. Allesina and M. Pascual. Googling Food Webs: Can an Eigenvector Measure Species' Importance for Coextinctions? 2009.


[3] G.S. Almasi and A. Gottlieb. Highly parallel computing. Benjamin/Cummings Pub. Co., 1994.

[4] S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman, and A.Y. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998.

[5] A. Asuncion, P. Smyth, and M. Welling. Asynchronous distributed learning of topic models. Advances in Neural Information Processing Systems, 20:20, 2008.

[6] Satish Balay, Kris Buschelman, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Barry F. Smith, and Hong Zhang. PETSc users manual. Technical Report ANL-95/11 - Revision 3.0.0, Argonne National Laboratory, 2008.

[7] Satish Balay, Kris Buschelman, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Barry F. Smith, and Hong Zhang. PETSc Web page, 2009. http://www.mcs.anl.gov/petsc.

[8] Satish Balay, William D. Gropp, Lois Curfman McInnes, and Barry F. Smith. Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163–202. Birkhauser Press, 1997.

[9] R. Baraglia, D. Laforenza, S. Orlando, P. Palmerini, and R. Perego. Implementation issues in the design of I/O intensive data mining applications on clusters of workstations. Parallel and distributed processing: 15 IPDPS 2000 workshops, Cancun, Mexico, May 1-5, 2000: proceedings, page 350, 2000.

[10] M. Berthold and DJ Hand. Intelligent data analysis: an introduction. Springer Verlag, 2003.

[11] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.

[12] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

[13] B.E. Boser, I.M. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152. ACM New York, NY, USA, 1992.

[14] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer networks and ISDN systems, 30(1-7):107–117, 1998.

[15] A.Z. Broder, R. Lempel, F. Maghoul, and J. Pedersen. Efficient pagerank approximation via graph aggregation. Information Retrieval, 9(2):123–138, 2006.


[16] D.R. Butenhof. Programming with POSIX threads. Addison-Wesley Longman Publishing Co., Inc. Boston, MA, USA, 1997.

[17] B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the 25th international conference on Machine learning, pages 104–111. ACM, 2008.

[18] C. Chang, R. Ferreira, A. Sussman, and J. Saltz. Infrastructure for building parallel database systems for multi-dimensional data. In Proceedings of the Second Merged IPPS/SPDP (13th International Parallel Processing Symposium & 10th Symposium on Parallel and Distributed Processing).

[19] E.Y. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. Psvm: Parallelizing support vector machines on distributed computers. Advances in Neural Information Processing Systems, 20, 2007.

[20] M.S. Chen, J. Han, and P.S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866–883, 1996.

[21] W.Y. Chen, J. Luan, H. Bai, Y. Wang, and E.Y. Chang. Collaborative filtering for orkut communities: discovery of user latent behavior. In Proceedings of the 18th international conference on World wide web, pages 681–690. ACM New York, NY, USA, 2009.

[22] C.T. Chu, S.K. Kim, Y.A. Lin, Y.Y. Yu, G. Bradski, A.Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference, page 281. The MIT Press, 2007.

[23] R. Collobert, S. Bengio, and Y. Bengio. A parallel mixture of SVMs for very large scale problems. Neural computation, 14(5):1105–1114, 2002.

[24] M. Connor and P. Kumar. Parallel construction of k-nearest neighbour graphs for point clouds. In Eurographics Symposium on Point-Based Graphics, 2008.

[25] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

[26] L. Dagum and R. Menon. Open MP: An Industry-Standard API for Shared-Memory Programming. IEEE Computational Science and Engineering, 5(1):46–55, 1998.

[27] J. Darlington, YK Guo, J. Sutiwaraphun, and H.W. To. Parallel induction algorithms for data mining. Lecture Notes in Computer Science, 1280:437–446, 1997.

[28] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters.


[29] R.P. Dellavalle, L.M. Schilling, M.A. Rodriguez, H. Van de Sompel, and J. Bollen. Refining dermatology journal impact factors using PageRank. Journal of the American Academy of Dermatology, 57(1):116–119, 2007.

[30] JW Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-avoiding parallel and sequential QR and LU factorizations. Submitted to SIAM Journal of Scientific Computing, 2008.

[31] A.P. Dempster, N.M. Laird, D.B. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.

[32] I.S. Dhillon and D.S. Modha. A data-clustering algorithm on distributed memory multiprocessors. Lecture Notes in Computer Science, 1759:245–260, 2000.

[33] W. Ding, S. Yu, Q. Wang, J. Yu, and Q. Guo. A Novel Naive Bayesian Text Classifier. In Proceedings of the 2008 International Symposiums on Information Processing-Volume 00, pages 78–82. IEEE Computer Society, 2008.

[34] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine learning, 29(2):103–130, 1997.

[35] R.C. Dubes and A.K. Jain. Algorithms for clustering data, 1988.

[36] A. Esuli and F. Sebastiani. PageRanking WordNet synsets: An application to opinion mining. In ANNUAL MEETING-ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, volume 45, page 424, 2007.

[37] E. Fix and JL Hodges Jr. Discriminatory analysis. Nonparametric discrimination: Consistency properties. International Statistical Review/Revue Internationale de Statistique, 57(3):238–247, 1989.

[38] Message Passing Interface Forum. MPI: A message-passing interface standard. International Journal of Supercomputer Applications, 8:159–416, 1994.

[39] W.J. Frawley, G. Piatetsky-Shapiro, and C.J. Matheus. Knowledge discovery in databases: An overview. Ai Magazine, 13(3):57–70, 1992.

[40] A.A. Freitas and S.H. Lavington. Mining very large databases with parallel processing. Springer, 1998.

[41] V. Garcia, E. Debreuve, and M. Barlaud. Fast k nearest neighbor search using gpu. In CVPR Workshop on Computer Vision on GPU, pages 1–7, 2008.

[42] D. Gleich, L. Zhukov, and P. Berkhin. Fast parallel PageRank: A linear system approach. Yahoo! Research Technical Report YRL-2004-038, available via http://research.yahoo.com/publication/YRL-2004-038.pdf, 2004.

[43] L. Glimcher and G. Agrawal. Parallelizing EM clustering algorithm on a cluster of SMPs. Lecture notes in computer science, pages 372–380, 2004.


[44] M. Goebel. A survey of data mining and knowledge discovery software tools. ACM SIGKDD Explorations Newsletter, 1(1):20–33, 1999.

[45] H.P. Graf, E. Cosatto, L. Bottou, I. Dourdanovic, and V. Vapnik. Parallel support vector machines: The cascade svm. Advances in neural information processing systems, 17(521-528):2, 2005.

[46] R.M. Gray and D.L. Neuhoff. Quantization. IEEE transactions on information theory, 44(6):2325–2383, 1998.

[47] T.L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl 1):5228, 2004.

[48] T.L. Griffiths, M. Steyvers, D.M. Blei, and J.B. Tenenbaum. Integrating topics and syntax. Advances in neural information processing systems, 17:537–544, 2005.

[49] S.R. Gunn. Support vector machines for classification and regression. ISIS technical report, 14, 1998.

[50] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.

[51] R.V. Hogg, A.T. Craig, and J. McKean. Introduction to mathematical statistics. 1978.

[52] J.R.M. Hosking, E.P.D. Pednault, and M. Sudan. A statistical perspective on data mining. Future Generations in Computer Systems, 13(2):117–134, 1997.

[53] T. Iwata, T. Yamada, and N. Ueda. Probabilistic latent semantic visualization: topic model for visualizing documents. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 363–371. ACM, 2008.

[54] H. Jiawei and M. Kamber. Data mining: concepts and techniques. San Francisco, CA, itd: Morgan Kaufmann, 2001.

[55] R. Jin and G. Agrawal. A middleware for developing parallel data mining implementations. In Proceedings of the first SIAM conference on Data Mining. Citeseer, 2001.

[56] T. Joachims. Making large-scale support vector machine learning practical, Advances in kernel methods: support vector learning, 1999.

[57] M.N. Joshi. Parallel K-Means Algorithm on Distributed Memory Multiprocessors. Computer, 2003.

[58] C. Kruengkrai and C. Jaruskulchai. A parallel learning algorithm for text classification. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 201–206. ACM New York, NY, USA, 2002.


[59] R. Kufrin. Decision trees on parallel processors. Parallel Processing for Artificial Intelligence 3. Elsevier Science, pages 279–306, 1995.

[60] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the Eighteenth International Conference on Machine Learning, 1:282–289, 2001.

[61] J.A. Lee and M. Verleysen. Nonlinear dimensionality reduction. Springer Verlag, 2007.

[62] B. Manaskasemsak and A. Rungsawang. Parallel PageRank computation on a gigabit PC cluster. In Proceedings of the 18th International Conference on Advanced Information Networking and Applications, 2004.

[63] S. Manavski and G. Valle. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC bioinformatics, 9(Suppl 2):S10, 2008.

[64] Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Seventh Conference on Natural Language Learning (CoNLL), 2003.

[65] G.J. McLachlan and T. Krishnan. The EM algorithm and extensions. Wiley New York, 1997.

[66] D. Mimno and A. McCallum. Expertise modeling for matching papers with reviewers. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, page 509. ACM, 2007.

[67] T.M. Mitchell. Machine learning. WCB. Mac Graw Hill, page 368, 1997.

[68] R. Nallapati, W. Cohen, and J. Lafferty. Parallelized variational EM for latent Dirichlet allocation: An experimental evaluation of speed and scalability. ICDMW, 7:349–354.

[69] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent dirichlet allocation. Advances in Neural Information Processing Systems, 20:1081–1088, 2007.

[70] K. Nigam, A.K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine learning, 39(2):103–134, 2000.

[71] J. Ocenasek, J. Schwarz, and M. Pelikan. Design of multithreaded estimation of distribution algorithms. Lecture Notes in Computer Science, pages 1247–1258, 2003.

[72] X.H. Phan, L.M. Nguyen, Y. Inoguchi, and S. Horiguchi. High-Performance Training of Conditional Random Fields for Large-Scale Applications of Labeling Sequence Data. IEICE TRANSACTIONS on Information and Systems, E90-D(1):13–21, 2007.


[73] J.C. Platt. Fast training of support vector machines using sequential minimal optimization. 1999.

[74] J.R. Quinlan. C4.5: programs for machine learning. Morgan Kaufmann, 2003.

[75] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62:107–136, 2006.

[76] B. Scholkopf and A.J. Smola. Learning with kernels. Citeseer, 2002.

[77] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proceedings of the International Conference on Very Large Data Bases, pages 544–555. Citeseer, 1996.

[78] H. Shan, J.P. Singh, L. Oliker, and R. Biswas. A comparison of three programming models for adaptive applications on the Origin2000. Journal of Parallel and Distributed Computing, 62(2):241–266, 2002.

[79] A. Srivastava, E.H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery, 3(3):237–261, 1999.

[80] K. Stoffel and A. Belkoniene. Parallel k/h-Means Clustering for Large Data Sets. Lecture notes in computer science, pages 1451–1454, 1999.

[81] C. Sutton. Conditional probabilistic context-free grammars. Master's thesis, 2004.

[82] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI02), 2002.

[83] V.N. Vapnik. The nature of statistical learning theory. Springer Verlag, 2000.

[84] G. Vasiliadis, S. Antonatos, M. Polychronakis, E.P. Markatos, and S. Ioannidis. Gnort: High performance network intrusion detection using graphics processors. In Proceedings of RAID, volume 5230, pages 116–134. Springer.

[85] X. Wang and A. McCallum. Topics over time: a non-markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, page 433. ACM, 2006.

[86] Y. Wang, H. Bai, M. Stanton, W.Y. Chen, and E.Y. Chang. PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications. AAIM, June, 2009.

[87] B. Wilkinson and M. Allen. Parallel programming: techniques and applications using networked workstations and parallel computers. Prentice Hall, 1998.

[88] S. Xu and J. Zhang. A parallel hybrid web document clustering algorithm and its performance study. The Journal of Supercomputing, 30(2):117–131, 2004.

[89] H. Zhang. The optimality of naive Bayes. A A, 1(2):3.
