Clustering with tag for web data by using parallel PSO

1 Vitthal Sadashiv Gutte, 2 Pooja V. Mundhe, 3 Pramod Mundhe, 4 Bhandari Mahesh A.

1,2,3,4 Asst. Professor, Information Technology Department,

MITCOE, Kothrud, Pune, India

[email protected] ,[email protected],

[email protected],[email protected]

May 27, 2018

Abstract

In recent times the World Wide Web has become a collection of billions of web pages, growing exponentially by means of public transport, social media, online shopping, blogs, etc. This data, stored over a long period of time, results in big data. Big data can be in structured, unstructured or semi-structured format. This exponential growth of web pages generates huge data which is beyond the capability of relational databases. Analysis of such large data cannot be handled easily using traditional data mining tools. Thus data mining is being researched intensively and combined with the soft computing domain, which uses mathematical algorithms to segment data and analyze the probability of future events. In this paper we discuss the evolutionary clustering technique Particle Swarm Optimization (PSO) applied to web data, fetched with the help of a crawler and preprocessed by removing stop words and stemming. Then an improved numerical statistic method, Term Frequency Inverse Document Frequency (TFIDF), is applied on the preprocessed data to derive the importance of a word with respect to a set of documents and to overcome the traditional TFIDF issue of inter-class consideration. A parallel PSO clustering technique is applied on this data to get optimized clusters with higher accuracy and to minimize computational time while preserving the compactness of the inter-particle distance.

International Journal of Pure and Applied Mathematics, Volume 118 No. 24 2018, ISSN: 1314-3395 (on-line version), url: http://www.acadpubl.eu/hub/, Special Issue

Key Words: Swarm Intelligence (SI); Particle Swarm Optimization (PSO); Information Retrieval (IR); Data Mining; Clustering; K-Means.

1 Introduction

A huge amount of data is produced daily through the internet, and in the last two years it has reached a historic level of around 2.5 quintillion bytes. So there is a need to mine important information from this data. Data mining is the process of automatically searching large data sets to discover previously undiscovered patterns for decision making purposes, and it is further improved using Artificial Intelligence techniques. AI inspired techniques impart a sixth sense to data mining systems and explore meaningful information. Thus these two interdisciplinary domains can be readily applied to complex real-world problems to get optimized results. The rest of the paper is organized as follows: a literature survey is presented in Section 2. Section 3 specifies the proposed system architecture and the flow chart of the implementation steps. In Section 4 the experimental results are summarized. A comparison between evolutionary algorithms is shown in Table I, whereas a comparative analysis between PSO and K-means is shown in Table III. Finally, the conclusion and references are given at the end.

2 LITERATURE SURVEY

A. Motivation

There has been a tremendous growth in the creation, acquisition and storage of data which contains valuable and important hidden knowledge. Analysis of this knowledge can be used to improve the decision making process of an organization. Clustering of high dimensional data involves cluster analysis of data that have dozens to thousands of dimensions. Data with such high dimensions are encountered in areas like biology, medicine, text document clustering, etc. A large variety of algorithms for analysis of such large data sets have been proposed; however, results produced by conventional algorithms suffer from shortcomings such as slowness of convergence and sensitivity to initialization values. Therefore these algorithms require meticulous investigation to improve their performance and efficiency.

High dimensional data is the norm on the web. Even a normal user operation generates huge data on the web in terms of log files, databases, etc., which leads to unstructured high dimensional data. Analysis of such data is a big challenge which requires effort at the hardware, software and algorithmic levels. A number of frameworks are available, but because of the high dimensionality of such data it becomes challenging to depend on an existing framework.

Common approaches for mining such data include dimensionality reduction as a data preprocessing step. Dimension reduction based on variable selection aims to identify a subspace which is spanned by a small set of variables that includes most of the relevant variables. Selection of these variables generates a natural data cluster and excludes the irrelevant variables. Different approaches have been implemented from different perspectives, such as density based, model based and criteria based.

Clustering techniques that have been implemented and proven to be successful in many real life applications include k-means, hierarchical clustering and soft clustering. K-means is the most widely used algorithm because of its high feasibility and efficiency when dealing with large data sets [JIA 06]. However, the results produced are highly sensitive to the selection of the initial cluster centers and may converge to local optima. PSO is one of the SI paradigms that has received widespread attention in research these days.

It is a novel population based stochastic search algorithm that provides a solution to complex non-linear optimization problems. It is an evolutionary computation technique which simulates the movement and flocking of birds and performs a global search in the solution space [1] [2] [3]. PSO produces better results in complicated and multi-peak problems with few parameters to adjust, giving fast as well as accurate computation results, which makes it a popular optimization technique in the swarm intelligence field. Data clustering with PSO algorithms has recently been shown to produce good results on a wide variety of real-world data, which invites innovative research in the area of PSO based clustering.

To overcome the shortcomings of conventional algorithms, evolutionary algorithms were developed, which include algorithms based on Genetic Algorithm, Simulated Annealing and PSO. Genetic Algorithms are based on mutation operators whereas Simulated Annealing allows making moves that are locally non-optimal. The PSO algorithm replicates the behavior of a swarm of birds looking for food. The advantage of these techniques is that they require few parameters to adjust and can find the optimal value based on the interaction of particles. A comparison between GA, PSO and ACO is shown in Table I [4][5][6][7].

B. Paper Reviewed

As mentioned by Ahmed et al. [2], though PSO is an optimization algorithm, PSO techniques have shortcomings when the search space is large. A large search space results in slow convergence speed near the global optimum and produces poor quality results. A large number of PSO variants have been proposed, like hybrid PSO and adaptation of PSO parameters, which results in an adaptive PSO. PSO and its adaptive variants, combined with other algorithms like k-means and GA and with additional preprocessing like dimensionality reduction and feature selection, can produce better results from high dimensional data [8], which leads to better prediction and analysis. There are various variants of PSO for dimension reduction and document clustering [2] [9].

3 IMPLEMENTATION DETAIL

In this section we discuss the proposed system architecture, its flow structure and the algorithmic pseudo code used for implementation.

A. Proposed System Architecture

The system architecture depicts the conceptual and logical model, along with the structural and behavioral views. Our system architecture covers web data clustering and performs blog analysis using the parallel PSO algorithm. Fig 1 shows the overview of the system architecture and its components.

A large amount of data is generated when web text feature extraction is used, which is the basic problem involved in text mining, as noted by Song Liangtu et al. in [10]. This data is scattered and has no unified management and structure, which greatly reduces the efficiency of using web information. To overcome this problem, the Vector Space Model (VSM) is used to describe web text, together with an improved feature extraction algorithm based on an improved PSO with reverse thinking particles. Such solutions allow searching a multidimensional complex space efficiently, provide dimensionality reduction and improve the efficiency of handling web documents [10].

Dian Palupi Rini et al. describe how the PSO algorithm finds optimal values by following the behavior of an animal society without a leader. Particle swarm optimization consists of a swarm of particles, where each particle represents a potential solution. Particles move through a multidimensional search space to find the best position in that space (optimum values). There are a number of basic variants of PSO that help control the velocity of particles. On the other hand, modified variants of PSO help address other issues, like convergence, that cannot be solved by the basic PSO algorithm. Modified PSO variants have been implemented in various applications [12].

Mita K. Dalal et al. give details about automatic text classification of sports blog data [13] using a semi-supervised machine learning mechanism. That work automatically classifies the text entries made by bloggers on various sports blogs into the appropriate category of sport. For this purpose, steps like preprocessing of text, text feature extraction and naive Bayesian classification are applied. The combination of TFIDF and multi-word features for feature extraction, followed by naive Bayesian classification, is an effective method for classifying unstructured texts. However, classification accuracy can be improved with more extensive training using larger feature sets. It is proposed to use an extensive training set along with semantic resources to mine information for a variety of commercially useful applications, such as content management of sports blogs through automatic classification and frequency analysis of classified sports blog entries to determine the popularity of a specific sport event.

The overall system is divided into three modules for smooth working of the whole system.

Fig. 1. System Architecture

The three modules are:

i. Web data extraction and Preprocessing.
ii. Feature Extraction.
iii. Clustering and Tag Cloud formation.

All the given modules are interrelated, and each depends on the outcome of the previous module while processing.

B. Implementation detail

1) Module I: Web data extraction and Preprocessing.

In module I, a web crawler is used to crawl hyperlinks recursively in an automated manner, which reduces maintenance work like checking broken links and indexing, and creates a copy of all the visited pages for searching over the blogosphere. The crawler [22] extracts unstructured or semi-structured data from the web, tokenizes it into the required format, stores it in a MongoDB database and forwards it for preprocessing. During preprocessing special care is taken with stop word removal and stemming. Stop word removal is the process of removing words which are of less importance and are mostly used for connectivity purposes. Removal of stop words is helpful in keyword searching. Stemming focuses on suffix removal using an algorithm like Porter's algorithm. For this filtering we used the Natural Language Toolkit, written in Python, to process human language data. The Natural Language Toolkit provides an easy interface to over 50 corpora and lexical resources including WordNet, along with text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning [NLP]. Thus, using a parser, the crawled page is extracted in tag format and filtered for smooth further processing.
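The preprocessing step described above can be illustrated with a minimal sketch. It is not the system's actual code: the stop-word list is a small hypothetical sample, and the suffix stripper is a naive stand-in for Porter's algorithm, used here only so the example runs without external libraries.

```python
import re

# Hypothetical, tiny stop-word list; a real system would use a full corpus list.
STOP_WORDS = {"a", "an", "am", "the", "is", "are", "of", "and", "where", "in", "on", "to"}

# Naive suffix list as a stand-in for Porter's algorithm (assumption, not Porter).
SUFFIXES = ("ing", "es", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [strip_suffix(tok) for tok in tokenize(text) if tok not in STOP_WORDS]
```

For instance, `preprocess("The crawler is indexing the blogs")` keeps only the content words and reduces them to rough stems. A production pipeline would more likely rely on a toolkit stemmer and stop-word corpus than on this simplified rule set.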

2) Module II: Feature Extraction

In module II a statistical technique called TFIDF (Term Frequency Inverse Document Frequency) is applied for feature extraction. It is given mathematically as the product of the TF and IDF values for a particular word, i.e.

$$\text{TFIDF} = \text{TF} \times \text{IDF} \tag{1}$$

• Term Frequency defines the number of times a particular term occurs in a document.

$$\text{TF}(\text{term}, \text{document}) = \frac{\text{Frequency of term}}{\text{Number of Document}} \tag{2}$$

• IDF (Inverse Document Frequency) is used to check whether the word is rare or common across all documents. IDF (term, document) is obtained by dividing the total number of documents by the number of documents containing that term and then taking the log of the quotient.

$$\text{IDF}(\text{term}, \text{document}) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term}}\right) \tag{3}$$

• The value of TFIDF increases with the number of occurrences within a document and with the rarity of the term across the corpus. TFIDF plays a crucial role in dimension reduction and keyword searching. Based on the search queries, special focus is placed on precision and recall to measure the accuracy of the retrieved content.

• Precision: It is defined as the fraction of retrieved documents that are relevant. It is calculated as Precision = P(relevant | retrieved).

• Recall: It specifies the fraction of relevant documents that are retrieved. It is calculated as Recall = P(retrieved | relevant).
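As an illustration of equations (1) and (3), the following sketch computes a TFIDF weight for one term. Two points are assumptions rather than details taken from the system: term frequency is normalized by the document length (a common convention, which differs from the denominator printed in equation (2)), and the natural logarithm is used. The function names are hypothetical.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term count divided by document length (assumed normalization)."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """log(total documents / documents containing the term), as in Eq. (3)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # term never occurs; treat its weight as zero
    return math.log(len(corpus) / containing)


def tfidf(term, doc_tokens, corpus):
    """Product of TF and IDF, as in Eq. (1)."""
    return tf(term, doc_tokens) * idf(term, corpus)
```

A term that appears in every document gets an IDF of zero and therefore a TFIDF of zero, which is how common words are pushed out of the keyword list.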

3) Module III: Clustering and Tag Cloud formation

Here specific focus is given to the clustering aspect, where the set of documents is fed as input to the parallel PSO algorithm. The given document set acts as particles, which are initialized randomly with zero velocity. Then each document is assigned a fitness value. The fitness value of each individual changes continuously with the personal best (Pbest) and global best (Gbest), which are updated according to the velocity and position updating equations shown in equations 4 & 5 [1][5][17][24][MAU06].

Velocity updating equation:

$$V_i(t+1) = w\,V_i(t) + c_1 r_1 \left(P_{best} - X_i(t)\right) + c_2 r_2 \left(G_{best} - X_i(t)\right) \tag{4}$$

Position updating equation:

$$X_i(t+1) = X_i(t) + V_i(t+1) \tag{5}$$

where,
$V_i(t)$: Velocity of the ith particle.
$P_{best}$: Personal best position of the ith particle.
$G_{best}$: Global best position of the particles.
$X_i(t)$: Current position of the ith particle.
$c_1$ & $c_2$: Acceleration constants.
$r_1$ & $r_2$: Random functions in the range [0, 1].
$w$: Inertia weight.

A change in the position and velocity of a particle leads to a change in its fitness value. Three terms are of crucial importance in the PSO algorithm:

1. Pbest: best solution achieved so far by a particle.
2. Gbest: best solution achieved in the swarm.
3. Lbest: local best value within a group of the swarm.

In clustering, similar objects are grouped together to minimize intra-cluster distance and maximize inter-cluster distance, in order to achieve compactness and purity of clusters. Clustering is associated with different issues like center initialization, the stagnation problem, multiple cluster problems, and premature or slow convergence [1] [8]; to resolve these issues, Swarm Intelligence (SI) inspired optimization algorithms can be efficiently applied. PSO gives a near-optimal solution as compared to K-means; a comparison between K-means and PSO is shown in Table III. An advanced variant of PSO called the Parallel PSO algorithm is implemented to reduce the computation time of clustering and enhance CPU utilization while preserving the purity of clustering.
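A single particle update can be sketched as follows, assuming the standard inertia-weight form of the velocity update built from the symbols listed above (w, c1, c2, r1, r2, Pbest, Gbest). The parameter values and the fitness function (mean distance of each point to its nearest centroid) are illustrative assumptions, not values reported for this system.

```python
import random


def update_particle(position, velocity, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """One velocity and position update per dimension (standard PSO form)."""
    new_velocity = []
    new_position = []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        r1, r2 = rng.random(), rng.random()  # random factors in [0, 1)
        v_next = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


def intra_cluster_distance(points, centroids):
    """Assumed fitness: mean Euclidean distance from each point to its nearest centroid."""
    total = 0.0
    for pt in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(pt, c)) ** 0.5 for c in centroids
        )
    return total / len(points)
```

A particle that already sits on both its personal best and the global best with zero velocity does not move, which matches the intuition that the swarm settles once it agrees on a solution. Lower values of the fitness function correspond to more compact clusters.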

TABLE III COMPARISON BETWEEN K-MEANS AND PSO[7][29][30]

In the system, Tag Clouds visually represent the important keywords in an individual cluster along with their occurrence frequency. The overall flow of the proposed system, shown in Fig 2, includes the following steps:

i. Accept input from blog/web.
ii. Crawl through the web pages and build search indexes.
iii. Remove stop words.
iv. Perform stemming of remaining words.
v. Apply TFIDF to sort relevant words.
vi. Form clusters of related words using parallel PSO.
vii. Visually represent the weighted list of words.

Fig. 2. Execution flow of System
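The final step of the flow, representing the weighted list of words, can be sketched as a mapping from term frequency to a display size. The function name, the font-size range and the linear scaling are assumptions made for illustration; the system's actual rendering is not described in this level of detail.

```python
from collections import Counter


def tag_cloud_weights(cluster_docs, top_n=5, min_size=10, max_size=40):
    """Map the most frequent terms of one cluster to (frequency, font size)."""
    counts = Counter(tok for doc in cluster_docs for tok in doc)
    top = counts.most_common(top_n)
    if not top:
        return {}
    highest = top[0][1]
    lowest = top[-1][1]
    span = max(highest - lowest, 1)  # avoid division by zero when all counts match
    return {
        term: (freq, min_size + (freq - lowest) * (max_size - min_size) // span)
        for term, freq in top
    }
```

Each cluster produced by the clustering step would be passed in as a list of token lists, and the returned sizes could then drive any tag cloud renderer.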

4 EXPERIMENTAL RESULT

In this section we discuss module-wise processing.

Module I: Data extraction and Preprocessing

Input: Hypertext links are provided as input $I_1 = \{x \mid x \in I\}$, and $\hat{x}.author = x.author$, $\hat{x}.title = x.title$, $\hat{x}.doc = x.doc - \text{stop words} - \text{stemming}$

where Stop words = {a, an, am, where, ...} and Stemming = removal of suffixes like "es" to get the root word.

Processing: As described in the preprocessing module, stop word removal and stemming are performed on the data extracted from the hyperlinks with the help of the Natural Language Toolkit corpora.

Output: Preprocessed extracted data.

Module II: Feature Extraction

Input: $I_2 = \{x \mid x = T_d\}$, where $T_d = \{\langle term_i, tfidf(term_i, x) \rangle\}$ for $1 \le i \le S$

Processing: An improved TFIDF algorithm is implemented to extract important keywords from the document. Earlier, TFIDF was calculated for the terms matching [a-z] [A-Z] [0-9]* occurring in the same document, whereas in our work the improved TFIDF also considers other documents in the calculation. The expected result shows the importance of the terms; the most important keywords are listed in the output file based on their frequency of occurrence.

Output: Terms and documents associated with their TFIDF values.

Module III: Clustering and Tag Cloud formation.

Input: Set of documents D = {D1, D2, D3, ..., Di}

Processing: Each document is initialized as a particle with random position and velocity. A parallel PSO algorithm is applied on these particles to get optimized clusters. Pseudo code for parallel PSO [15] is shown in Fig 3.

Fig. 3. Pseudo Code for Parallel PSO algorithm [15].

Due to parallelization, PSO required less computational time and gave results with better performance. Tag Clouds, as in Fig 4, show the frequency of the terms in the cluster.
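The part of PSO that parallelizes most naturally is the fitness evaluation, since each particle can be scored independently. The sketch below shows only that structure, using a thread pool and a toy fitness function; both are assumptions for illustration. It is not the pseudo code of [15], and in CPython a process pool or a message-passing setup would be needed for real CPU-bound speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def sphere(position):
    """Toy fitness (assumption): sum of squares, minimized at the origin."""
    return sum(x * x for x in position)


def evaluate_swarm(positions, fitness=sphere, workers=4):
    """Evaluate every particle's fitness concurrently; results keep particle order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, positions))


def best_particle(positions, fitness=sphere, workers=4):
    """Return (index, fitness) of the particle with the lowest fitness."""
    scores = evaluate_swarm(positions, fitness, workers)
    index = min(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]
```

In a full iteration, the index returned by `best_particle` would identify the candidate for Gbest before the velocity and position updates are applied to every particle.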

Here performance analysis is performed using the f-score value, which is the harmonic mean of precision and recall. The accuracy rate is also calculated.
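The relationship between precision, recall and the f-score can be made concrete with a short sketch. The function name and the use of document identifiers in sets are assumptions for illustration.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, f-score) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents that are both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```

For example, if four documents are retrieved and two of them belong to a relevant set of three, precision is 2/4, recall is 2/3, and the f-score is their harmonic mean, 4/7.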

Output: Tag Clouds with associated frequency

Fig. 4. Tag Cloud

Platform / Technology

Software:
• Java version 1.7 update 67 (openjdk)
• Eclipse
• Python Scrapy Crawler
• Hadoop Framework (Apache Hadoop 2.4.1)

Platform:
• Unix/Linux (Fedora 20)

Hardware: (minimum requirement)
• PC/Laptop
• Core 2 Duo processor
• RAM 512 MB
• HDD 512 MB

Dataset:
• Data extracted from blogs using crawler.
• Reuters-21578 or Blogger dataset

Database:
• MongoDB (NoSQL Database)

5 RESULTS AND ANALYSIS

Each implemented module contributes to making the system fault tolerant and robust; for example, module I increases computational efficiency by approximately 55% to 65% when preprocessing raw data. The novel approach of improved TFIDF calculation and feature extraction results in 17% faster computation than the traditional method, with a good semantic score, as shown in Fig 5. The different features extracted from web data, such as author name, page title, URL and important keywords, are displayed in the screenshot in Fig 6.

Fig. 5. Traditional vs. Improved TFIDF


Fig. 6. Feature Extraction Screenshot

6 CONCLUSION

As Internet usage increases day by day, the amount of unstructured data generated per day is growing exponentially. Considering this issue, our system is implemented to process such big data using a nature inspired evolutionary clustering algorithm, Particle Swarm Optimization. Our hybrid system is efficient at analyzing big data. Using the preprocessing approach, raw data is mined into meaningful content from which important keywords are extracted along with their weights. Our approach preprocesses such huge data by implementing improved TFIDF for feature extraction and Parallel PSO to form optimal clusters, and finally generates tag clouds. A comparative analysis between PSO, K-means and other evolutionary algorithms is made to show how PSO outperforms these algorithms. Performance parameters of the system such as accuracy rate, precision, recall and f-measure have been discussed to maintain the improved performance, accuracy, and purity of clusters.

References

[1] R. C. Eberhart and J. Kennedy, A new optimizer using particle swarm theory, in Proc. 6th Int. Symp. Micro Machine and Human Science, 1995, pp. 39-45.

[2] Ahmed A. A. Esmin, Rodrigo A. Coelho, Stan Matwin, A review on particle swarm optimization algorithm and its variants to clustering high-dimensional data, Springer, 2013, pp. 1-23.

[3] Rana, Sandeep, Sanjay Jasola, and Rajesh Kumar. "A review on particle swarm optimization algorithms and their applications to data clustering." Artificial Intelligence Review 35.3 (2011): 211-222.

[4] Jayshree Ghorpade-aher and Roshan Bagdiya. Article: A Review on Clustering Web data using PSO. International Journal of Computer Applications 108(6):31-36, December 2014.

[5] Rania Hassan, Babak Cohanim, Olivier de Weck, A comparison of PSO and GA, American Institute of Aeronautics and Astronautics, 2004.

[6] Emad Elbeltagi, Tarek Hegazy, Donald Grierson, Comparison among five evolutionary-based optimization algorithms, Advanced Engineering Informatics, Volume 19, Issue 1, January 2005, pp. 43-53.

[7] Niknam, Taher, and Babak Amiri. "An efficient hybrid approach based on PSO, ACO and k-means for cluster analysis." Applied Soft Computing 10.1 (2010): 183-197.

[8] Jayshree Ghorpade and Vishakha Arun Metre. Article: PSO based Multidimensional Data Clustering: A Survey. International Journal of Computer Applications 87(16), 2014, pp. 41-48.

[9] Xiaohui Cui, Thomas E. Potok, Document Clustering using PSO, IEEE, 2005, pp. 185-191.

[10] Song Liangtu, Zhang Xiaoming, Web Text Feature Extraction with Particle Swarm Optimization, IJCSNS International Journal of Computer Science and Network Security, Vol. 7 No. 6, June 2007, pp. 132-136.

[11] Xing Huang, Qing Wu, Micro-blog Commercial Word Extraction Based On Improved TFIDF Algorithm, IEEE, 2013, pp. 1-5.

[12] Dian Palupi Rini, Siti Mariyam Shamsuddin, Siti Sophiyati Yuhaniz, Particle Swarm Optimization: Technique, System and Challenges, IJOCA, 2011, pp. 19-27.

[13] Mita K. Dalal, Mukesh A. Zaveri, Automatic text classification of sports blog data, IEEE, 2012, pp. 219-222.

[14] Rabanal, Pablo, Ismael Rodríguez, and Fernando Rubio. "Parallelizing Particle Swarm Optimization in a Functional Programming Environment." Algorithms 7.4 (2014): 554-581.

[15] Madamanchi, Manoj Babu. Parallelization of Generic PSO Java Code Using MPJExpress. Diss. North Dakota State University, 2012.

[16] Shafiq Alam, Gillian Dobbie, Yun Sing Koh, Patricia Riddle, Clustering Heterogeneous Web Usage Data Using Hierarchical Particle Swarm Optimization, IEEE, 2013, pp. 147-154.

[17] Stuti Karol and Veenu Mangat, Survey On Particle Swarm Optimization Based Web Mining, IJIOME, 2012, pp. 273-276.

[18] Mohammad Syafrullah and Naomie Salim, Improving Term Extraction Using Particle Swarm Optimization Techniques, JOC, 2010.

[19] Hongbo Li, Yunming Ye, Improved Blog Clustering Through Automated Weighing of Text blocks, IEEE, 2009, pp. 1586-1591.

[20] Ziqiang Wang, Qingzhou Zhang, Dexian Zhang, A PSO-Based Web Document Classification Algorithm, IEEE, 2007, pp. 659-664.

[21] Shouning Qu, Sujuan Wang, Yan Zou, Improvement of Text Feature Selection Method based on TFIDF, IEEE, 2008, pp. 79-81.

[22] Huo Ling Yu, Liu Bingwu, Yan Fang, Similarity Computation of Web Pages of Focused Crawler, International Forum on Information Technology and Applications, 2010, pp. 499-505.

[23] Tang Rui, Simon Fong, Xin-She Yang, Suash Deb, Nature-inspired Clustering Algorithms for Web Intelligence Data, IEEE, 2012, pp. 147-153.

[24] Marco A. Montes de Oca, Thomas Stützle, Mauro Birattari and Marco Dorigo, Frankenstein's PSO: A Composite Particle Swarm Optimization Algorithm, IEEE Transactions on Evolutionary Computation, Vol. 13, No. 5, October 2009, pp. 1-30.

[25] Ching-Yi Cheo, Fun Ye, Particle Swarm Optimization Algorithm and Its Application to Clustering Analysis, IEEE, 2004, pp. 789-794.

[26] Tien-Chi Huang, Shu-Chen Cheng, Yueh-Min Huang, A blog article recommendation generating mechanism using an SBACPSO algorithm, Expert Systems with Applications 36, 2009, pp. 10388-10396.

[27] Mariam El-Tarabily, A PSO-Based Subtractive Data Clustering Algorithm, IJORCS, 2013, pp. 1-9.

[28] Vora, Pritesh, and Bhavesh Oza. "A survey on k-mean clustering and particle swarm optimization." Int. J. Sci. Modern Eng 1 (2013): 24-26.

[29] Rostami, Amin, and Maryam Lashkari. "Extended PSO Algorithm For Improvement Problems K-Means Clustering Algorithm." International Journal of Managing Information Technology 6.3 (2014).

[30] Qinghai B., The Analysis of Particle Swarm Optimization Algorithm, in CCSE, February 2010, vol. 3, 180-184.

BIBLIOGRAPHY

[JIA 06] Jiawei Han and Micheline Kamber, Data Mining Concepts and Techniques, published by Morgan Kaufmann, 2nd Ed, 2006.

[NLP] Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O'Reilly Media Inc.

[MAU 06] Maurice Clerc, Particle Swarm Optimization, ISTE Ltd, 2006.

[PSOL] http://www.particleswarm.info/

[SER] Serkan Kiranyaz, Turker Ince, and Moncef Gabbouj, Multidimensional Particle Swarm Optimization For Machine Learning And Pattern Recognition, Springer Adaptation, Learning, And Optimization Volume 15.