An Efficient K-Means Clustering based Annotation Search from … ANSHUL TIWARI... · 2018-12-15 · clustering using K-Means. K-Means clustering is an efficient learning approach

International Journal of Electrical, Electronics ISSN No. (Online): 2277-2626and Computer Engineering 5(1): 75-81(2016)

An Efficient K-Means Clustering based Annotation Search from WebDatabases

Anshul Tiwari* and Kiran Agrawal***Department of Computer Science and Engineering,

**Department of Information Technology,

(Corresponding author: Anshul Tiwari)(Received 04 May, 2016 Accepted 02 June, 2016)

(Published by Research Trend, Website: www.researchtrend.net)

ABSTRACT: In our work an efficient methodology is implemented to improve the accuracy of search resultsfrom the web databases on various keywords such as movies, CD, books etc. The improvement of Alignmentalgorithm using K-means clustering is proposed for the searching of annotated results from the webdatabases. The technique implemented is for the proficient retrieval of text nodes and data units using K-means clustering which improves precision and recall as compared to the existing approach. Themethodology implemented using K-means clustering and labeling of search records is compared with existingmethodology implemented for the search records.

Keywords: K-means clustering, Data Alignment, Precision, Recall

I. INTRODUCTION

Increasingly, many data sources appear as databaseswhich are open for all and hence various queries can beused for the hidden of data after which deep web ispossible [1]. First of all unique database can give datarecords. Second, due to the dependencies presentbetween the hidden databases, certain databases mustbe queried before others. Third, some database may notbe available at certain times because of some majorproblems, and therefore, the query planning should becapable of dealing with unavailable databases andgenerating alternative plans when the optimal one is notfeasible [2]. One of the key principles in modernsoftware engineering is the separation of concerns. Thisconcept described in [3], states that by separating thebasic algorithm from special purpose concerns makeseach of the parts easier to write, maintain and test. Forexample, the separation of data and layout of a webpageinto a style-sheet [4] and the html content page hasbecome an engineering practice: It not only allowschanging the layout and content independently, but italso allows different people with different skill-sets andtraining to work on the webpage independently, e.g. adeveloper and a designer. To show the effectiveness ofour methods, it is necessary to demonstrate theirapplication to some real world scenarios or conduct astudy using a range of schema or ontology matchingtasks. To different matching systems correspond todifferent evaluations in the literature, with diversemethodologies, and data sets.

It makes difficult to compare their effectiveness withrespect to the state of the art. In this methods areevaluated by adopting the quality measures proposed in[7], which have been used to assess the effectiveness ofseveral relevant schema matchers.For example, once a book comparison shopping systemcollects multiple result records from different booksites, it needs to determine whether any two SRRs refer[8] to the same book.

Fig. 1. Architecture of annotation based web search.

I

J EE

CE

www.researchtrend.net

Tiwari and Agrawal 76

Web Databases. In Web Databases system storesinformation and can be accessed via an internet. Here inthis paper web databases from online e-commersewebsite is taken with a number of categories such asmovies and books etc. A Web Database contains a setof more than one tables with data and that can becommunicated via a programming languugae overwebsite. Here in the paper Web Databases are collectedfrom various websites with its html code over variouscategories.

II. LITERATURE SURVEY

Marja-Riitta Koivunen, Eric Prud'Hommeaux proposeda new technique called Annotea: an open RDFinfrastructure for shared Web annotations [9]. It isbased on the RDF structure for the annotationsextraction. The annotations are extracted on the basis ofmetadata. Efficient extraction of annotations such asfrom HTTP, Xlinks and Xpointer. Huge metadata needsto be maintained for better performance. Luis Gravanoand Héctor García-Molina find a new way of searchresults from Text-Source Discovery over Internet [10].

Here in this paper text source discovery problem isremoved that may occurs over internet. The techniqueremoves the problem of text source discovery problemthat may occur over internet. The technique works onquery basis hence provides less performance for othersources. Khaled Khelif, Rose Dieng-Kuntz, PascalBarbry. Proposed an Ontology-based Approach toSupport Text Mining and Information Retrieval in theBio logical Domain [11]. The technique includesontology for all the text documents especially forbiologists. Hence after using the technique ontologybased semantic can be extracted from the sources.Efficient retrieval of information of biological domains.James Pustejovsky, Robert Gaizauskas, Graham Katzimplemented Robust specification of event andtemporal expressions in text [12]. Here in this paper themethodology uses the concept of TimeML which isused for the time based retrieval of text using naturallanguage text sequence. Efficient extraction of text onspatial and temporal basis. The technique is not suitablefor multi-lingual languages.

S.No.

Paper Author Technique Used Advantages Issues

1 Annotating SearchResults from

Web Databases [1].

Yiyao Lu, HaiHe, HongkunZhao, WeiyiMeng

Here in this paper an efficienttechnique is implemented for thesearching of annotations in theweb databases

Efficientretrieval ofannotations andlabeling of thoseannotations.

Provides lessefficiency and isstatic.

2 Annotea: an open RDFinfrastructure for sharedWeb annotations [13]

Marja-RiittaKoivunen,EricPrud'Hommeaux,

It is based on the RDF structurefor the annotations extraction.The annotations are extracted onthe basis of metadata.

Efficientextraction ofannotations suchas from HTTP,Xlinks andXpointer.

Huge metadataneeds to bemaintained forbetterperformance.

3 An Ontology-basedApproach to SupportText Mining andInformation Retrieval inthe Bio logical Domain[11]

KhaledKhelif , RoseDieng-Kuntz,PascalBarbry.

The technique includes ontologyfor all the text documentsespecially for biologists. Henceafter using the techniqueontology based semantic can beextracted from the sources.

Efficientretrieval ofinformation ofbiologicaldomains.

A differentontology needs tobe maintained fordifferent domains.

4 Robust specification ofevent and temporalexpressions in text [14].

JamesPustejovsky,RobertGaizauskas,Graham Katz.

Here in this paper themethodology uses the concept ofTimeML which is used for thetime based retrieval of text usingnatural language text sequence.

Efficientextraction of texton spatial andtemporal basis.

The technique isnot suitable formulti-linguallanguages.


S.No.

Paper Author Technique Used Advantages Issues

5. An experimentusing ConceptualGraph Structure fora MultilingualInformation System[15]

CatherineRoussey,SylvieCalabrettoand Jean-MariePinon.

The technique isimplemented for theextraction of informationfor multilingual languages.The technique allowsindexing of documents andinformation retrieval formultilingual textdocuments.

Removes theproblem ofmultilingualinformationretrieval andprovides complexknowledgerepresentation.

Themethodology isapplied for thecollection ofEnglish articles.

III. PROPOSED METHODOLOGY

Web Pages are created and implemented using mark uplanguages such as HTML. Since almost all the thingsare available in web pages. Web sites such as mobilescontains a number of categories and sub categorieswhich uses a huge databases to shows the pictures andprices and various other categories from the databases,but sometimes these web pages contains tags fromwhich some meaningful information can be extracted.These HTML web pages contain tag structure whereeach of the tag structure contains either a tag node orcan also contains text node. These HTML web pagescontains a tag node is surrounded by decorative tagnodes such as ‘<’ and ‘>’ in the original HTML textsource. The text that is written outside of the tag nodesare known as text nodes. The major difference betweentag nodes and text nodes is that the tag nodes are notvisible on the web pages while text nodes are visible onthe web pages. These text nodes contain a number ofdata units which can be classified or detected using themajor four types of relations.The Proposed Methodology implemented hereefficiently searches Annotations from the webdatabases using K-Means Clustering based AlignmentAlgorithm. The Existing methodology implemented forthe Searching of Annotations from web databases can’twork well for the Multiple Composite text nodes henceprovides less accuracy of searching Annotations. Hencehere in this paper the problem of composite text nodescan be solved using K-Means Clustering.

A. Data AlignmentThe purpose of data alignment is to put the data units ofthe same concept into one group so that they can beannotated holistically.

B. Data Unit SimilarityData content similarity (SimC). It is the Cosinesimilarity between the term frequency vectors ofd1 and d2:

( , ) = ∗||∗ ||

where ,Vd is the frequency vector of the terms insidedata unit d, ||Vd|| is the length of Vd, and the numeratoris the inner product of two vectors.

C. K-Means ClusteringK-means is an unsupervised learning algorithm whichis used to classify data on the basis of clusters. Itincludes k number of centroids for each of the cluster.The main objective of the algorithm is the minimizationof objective function which is given as:= ( ) −Where the parameter is the chosen distance betweendata point and the cluster center between n data points.The algorithm consists of the following steps:(i) Place K points into the space represented by theobjects that are being clustered. These points representinitial group centroids.(ii) Assign each object to the group that has the closestcentroids.(iii) When all objects have been assigned, recalculatethe positions of the K centroids.(iv) Repeat Steps 2 and 3 until the centroids no longermove. This produces a separation of the objects intogroups from which the metric to be minimized can becalculated.The proposed methodology implemented here uses thefollowing Alignment algorithm for the single or multipletext nodes and then uses K-Means based clustering tocluster the similar text nodes which provide same set ofsearch results from the HTML tags.The proposed algorithm that is implemented hereconsists of the following basis steps:Step 1: Take an input dataset which contains a numberof web pages for various categories such as Book andMovies and Pen Drives and Electronics, since thesecategories contains some basis labels such as title andauthor and price and ISBN for the category Book.


The Web pages taken here contains HTML tags andnods inside which some meaning information is stored.Step 2: As soon as the input dataset is selected the nextstep is to extract the important features from the webpages. Here for the extraction of features for the webpages the following predefined classes are used whichremoves the decorative tags from the web pages andstores text nodes.Step 3: After the extraction of features from each of theinput web page dataset cosine similarity is measure foreach of the text document web pages using thefollowing formula=VectAB=VectAB+(freq1*freq2);VectA_Sq = VectA_Sq + freq1*freq1;VectB_Sq = VectB_Sq + freq2*freq2;sim_score =((VectAB)/(Math.sqrt(VectA_Sq)*Math.sqrt(VectB_Sq)));Step 4: The next step is the computation of Alignmentof the text nodes and the data units that are available inthe web pages. Here Alignment can be done using theclustering using K-Means. K-Means clustering is anefficient learning approach which takes ‘X’ as an inputdata values and ‘Y’ as the label values and mediods andcenters.Step 5: Finally on the basis of clustering of the labelsof the text nodes value a weighted factor is decided andhence labels are assigned._ ( , ) = ∗Where, = + ( ∗ )

1. J 12. While true (means all the available text nodes

from HTML tags provides labels)3. For i 1 to number of search records4. Gi SR[i][j]5. If Gi contains empty labels6. Exit7. V Call_Cluster(G)8. If |V| > 19. S=Call_Merge(SR[i][j])10. V[c]=Call_Similar_Cluster(V,S)11. Shifting of SR and V.

Call_Cluster(G)1.Input G contains all the search records along withthe labels for each text node in the search record.2.Repeat till G[i][j]==null3.For each A in G and B in G

4.Compute Similarity(A,B) in G usingSim(A,B)=

precence(AUB)/precence(A)+precence(B)-precence(AnB)

J 15.While true (means all the available text nodesfrom HTML tags provides labels)6.For i 1 to number of search records7.Gi SR[i][j]8.If Gi contains empty labels9.Exit

10.V Call_Cluster(G)11.If |V| > 113.S=Call_Merge(SR[i][j])14.V[c]=Call_Similar_Cluster(V,S)15.Shifting of SR and V.

Call_Cluster(G)1. Input G contains all the search records along

with the labels for each text node in the searchrecord.

2. Repeat till G[i][j]==null3. For each A in G and B in G4. Compute Similarity(A,B) in G using

Sim(A,B)= precence(AUB)/precence(A)+precence(B)- precence(AnB)

5. BestSim(A,B)6. Remove text node from L7. Remove text node from Right R8. Add LUR to V9. Return V;

Call_Merge(SR)1. Repeat till SR empty2. For i=1,j=1 length (G)3. S[i][j] SR[i][j]4. Return S;

Call_Similar_Cluster(V,S)1. Repeat till Vempty || Sempty2. Sim(V,S) = Sim(VUS) / Sim(V) +Sim(S) –

Sim(VnS)3. Check the minimum value for the cluster4. V[c]=min (Sim(V,S)5. Return V;

IV. RESULT ANALYSIS

The Result Analysis shows the performance of theproposed methodology. The proposed methodologyshows higher precision and recall as well as has highAccuracy for the prediction of annotated search recordsfrom the web databases.


Table 1: Analysis of Existing Technique on WDB.

The table shown below is the experimental analysis of thesearching of annotations using SVM based clustering.The result is analyzed on various domains such as Bookand Pen Drives and Music and Movies as well as Games.The result is analyzed on the basis of Precision andRecall and F-Score. Here Precision can be computed onthe basis of correctly identified annotation to the totalnumber of annotations fetched from the web databases.Recall is the computation of total number of annotations

fetch from the web databases to the total number ofannotated records present.The figure shown below is the experimental analysis andcomparison of Precision based on Existing and Proposedwork. The result is analyzed on various domains such asBook and Pen Drives and Music and Movies as well asGames. Here Precision can be computed on the basis ofcorrectly identified annotations to the total number ofannotations fetched from the web databases.

Table 2: Analysis of Proposed Technique on WDB.

Domain Precision Recall F-Score

Book 0.75 0.64 0.69

Pen Drive 0.925 0.648 0.76

Music 0.854 0.812 0.8324

Movies 0.923 0.843 0.881

Games 0.732 0.693 0.7119

Fig. 2. Comparison Analysis of Precision on various domains.

The figure shown below is the experimental analysisand comparison of Recall based on Existing andProposed work. The result is analyzed on variousdomains such as Book and Pen Drives and Music andMovies as well as Games.

Recall is the computation of total number ofannotations fetch from the web databases to the totalnumber of annotated records present.

0

0.5

1

Book


Book 0.6 0.6 0.6

Pen Drive 0.468 0.732 0.565

Music 0.5 0.7 0.58

Movies 0.56 0.66 0.605098Games 0.68 0.57 0.619238







Book 0.75 0.64 0.69

Pen Drive 0.925 0.648 0.76

Music 0.854 0.812 0.8324

Movies 0.923 0.843 0.881

Games 0.732 0.693 0.7119




Pen Drive Music Movies Games

Precision comparsion

Existing Proposed


Book 0.6 0.6 0.6

Pen Drive 0.468 0.732 0.565

Music 0.5 0.7 0.58

Movies 0.56 0.66 0.605098Games 0.68 0.57 0.619238







Book 0.75 0.64 0.69

Pen Drive 0.925 0.648 0.76

Music 0.854 0.812 0.8324

Movies 0.923 0.843 0.881

Games 0.732 0.693 0.7119





Book 0.6 0.6 0.6

Pen Drive 0.468 0.732 0.565

Music 0.5 0.7 0.58

Movies 0.56 0.66 0.605098Games 0.68 0.57 0.619238


Fig. 3. Comparison Analysis of Recall on various domains.

The figure shown below is the experimental analysis andcomparison of F-Score based on Existing and Proposedwork. The result is analyzed on various domains such as

Book and Pen Drives and Music and Movies as well asGames. It is defined as:− = 2 ∗ ∗+

Fig. 4. Comparison Analysis of Final Score on various domains.

V. CONCLUSION

The comparison between existing and proposedtechniques can be analyzed based on precision and recalland it was found that the proposed methodology notonly removes the problems of the existing technique butalso provides high precision and recall.The proposed methodology implemented here for thesearching of the annotations from the web databases.Here the annotations can be identified on the variouscategories such as Book, Movies, Electronics, Pen

Drives, Auto.The proposed methodology is applied onthe these categories with different web pages and henceon the basis of search web records labels are assigned tothese web pages After identification of annotations inthe web databases accuracy can be computed andcompared to the existing technique that is implementedfor the efficient search of records from the webdatabases and the proposed methodology provides highprecision and recall as compared to the existingtechnique.

0

0.5

1

Book Pen Drive

0

0.2

0.4

0.6

0.8

1

Book Pen Drives

F-Sc

ore

Comparison of F-Score






V. CONCLUSION



Pen Drive Music Movies Games

Recall Comparsion

Existing Proposed

Pen Drives Music Movies Games

Domains

Comparison of F-Score

Previous Work

Proposed Work






V. CONCLUSION



Previous Work

Proposed Work


REFERENCES

[1]. Y. Lu, H. He, H. Zhao, W. Meng, C. Yu, “AnnotatingSearch Result From Web databases” In IEEE Transaction onKnowledge and Data Engineering, Vol. 25, No.3, 2013.[2] Y. Lu, H. He, H. Zhao, W. Meng, and C. Yu, “AnnotatingStructured Data of the Deep Web,” Proc. IEEE 23rd Int’lConf. Data Eng. (ICDE), 2007.[3]. W.L. Hürsch and C.V. Lopes, “Separation of Concerns”.Technical Report NU-CCS-95-03, Northeastern University,1995.[4]. Cascading style sheets: www.w3.org/Style/CSS[5]. A. Arsanjani et al, “S3: A Service-Oriented ReferenceArchitecture”. IT Professional, Vol. 9, Issue 3, pp. 10-17,2007.[6]. T. Elrad, R.E. Filman and A. Bader, “Aspect-orientedprogramming: Introduction”, Communications of the ACM,Vol. 44, Issue 10, pp. 28-32, 2001.[7]. Do, H. H., Melnik, S. and Rahm, E. Comparison ofSchema MatchingEvaluations. In Chaudhri, A. B., Jeckle, M.,Rahm, E., and Unland, R., editors, Web, Web-Services, andDatabase Systems, NODe 2002 Web and Database-RelatedWorkshops, Erfurt, Germany, October 7-10, volume 2593 ofLecture Notes in Computer Science, pages 221–237. Springer,(2002).[8]. Annotating Search Results from Web Databases YiyaoLu, Hai He, Hongkun Zhao, Weiyi Meng, Member, IEEE,andClement Yu, Senior Member, IEEE Transactions On

Knowledge And Data Engineering, VOL. 25, NO. 3, MARCH2013.[9]. J. Kahan, M-R. Koivunen, Annotea: an open RDFinfrastructure for shared Web annotations. Proceedings of the10th international conference on World Wide Web, 2001.[10]. L. Gravano, H. Garcia-Molina, A. Tomasic, "GlOSS:Text-Source Discovery over Internet", TODS 24(2), 1999.[11]. K. Khelif, R. Dieng-Kuntz, P. Barbry, An Ontology-based Approach to Support Text Mining and InformationRetrieval in the Bio logical Domain, in J. UCS 13(12), pp.1881-1907, 2007.[12]. A. Setzer, R. Gaizauskas, TimeM L: Robustspecification of event and temporal expressions in text. InThe second international conference on language resourcesand evaluation, 2000.[13]. J. Kahan, M-R. Koivunen, Annotea: an open RDFinfrastructure for shared Web annotations. Proceedings of the10th international conference on World Wide Web, 2001.[14]. A. Setzer, R. Gaizauskas, TimeM L: Robustspecification of event and temporal expressions in text. InThe second international conference on language resourcesand evaluation, 2000.[15]. C. Roussey, S. Calabretto, An experiment usingConceptual Graph Structure for a Multilingual InformationSystem, in the 13th International Conference on ConceptualStructures, ICCS'2005.

.

www.w3.org/Style/CSS

Documents

An Efficient K-Means Clustering based Annotation Search from … ANSHUL TIWARI... · 2018-12-15 · clustering using K-Means. K-Means clustering is an efficient learning approach