
Web Mining Class Nam Hoai Nguyen Hiep Tuan Nguyen Tri Survey on Web Structure Mining 2014-06-19


Survey on Techniques and Issues of Web Structure Mining

Contents
  Research purpose
  Introduction
  What is Web Structure Mining
  Algorithms in Web Structure Mining
  Comparison table of Web Structure Mining algorithms
  Implementation results
  Conclusion

Research purpose
  Study Web Structure Mining and its techniques.
  Make a systematic comparison of some important Web Structure Mining algorithms through literature analysis.
  Implement some Web Structure Mining techniques in practice to gain insight into them.

Introduction
  What is Web Structure Mining?

Web Mining has three branches: Web Content Mining, Web Structure Mining, and Web Usage Mining.
Web Structure Mining (WSM): a process that discovers the model of the link structure of web pages.
Purpose of WSM: generate a structural summary of the Web site and its Web pages.

Introduction (contd)
Web Structure Mining divides into:
  Link Mining: extracting patterns from hyperlinks in the Web.
  Document Structure Mining: mining the document structure (the tree-like structure of documents).

Introduction (contd)
Four important WSM algorithms:
  PageRank
  Weighted PageRank
  Weighted Content PageRank (WCPR)
  Hyperlink-Induced Topic Search (HITS)

1. PageRank
  Developed by L. Page and S. Brin.
  A page has a high rank when the sum of the ranks of its backlinks is high.
  Used by Google:
    A user submits a search query.
    Google combines pre-computed static PageRank scores with a content-matching score to obtain an overall ranking score for each web page.
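The idea that a page ranks highly when its backlinks rank highly can be illustrated with a small power-iteration sketch. This is not taken from the slides: the function name `pagerank`, the toy graph, the damping factor d = 0.85, the normalized form PR(u) = (1 - d)/N + d * sum(PR(v)/L(v)) over backlinks v, and the even redistribution of rank from pages without outlinks are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to (its outlinks).
    d is the damping factor; 0.85 is an assumed, commonly used value.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives the base share (1 - d) / n.
        new_rank = {p: (1.0 - d) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its rank evenly among its outlinks.
                share = d * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no outlinks spreads its rank over all pages.
                share = d * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: C is linked to by A, B and D; D has no backlinks.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph, C collects rank from three backlinks and ends up on top, while D, which nothing links to, keeps only the base share.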

2. Weighted PageRank
  Proposed by Wenpu Xing and Ali Ghorbani as an extension of PageRank, in order to improve it.
  Method: assigns larger rank values to more popular pages instead of dividing the rank value of a page evenly among its outlink pages.
  Popular page: the more popular a page is, the more links other web pages tend to have to it.

3. Weighted Content PageRank
  Based on WSM and Web Content Mining (WCM).
  Returns a list of the relevant and important pages for a given query.
  WSM is used to calculate the importance of a page.
  WCM is used to find how relevant a page is.
  Popularity of a page = number of inlinks and outlinks of the page.
  The more closely a page matches the query, the more relevant it is.

3. Weighted Content PageRank (contd)
Algorithm summary:
  Input: page P, its inlinks and outlinks, the weights of all backlinks of P, query Q, and d (damping factor).
  Output: rank score.
  Step 1 - Relevance calculation:
    Find all meaningful word strings of Q (say N).
    Check whether the N strings occur in P.
    Z = sum of frequencies of all N strings.
    S = set of the maximum possible strings occurring in P.
    X = sum of frequencies of strings in S.
    Content Weight (CW) = X/Z
    C = number of query terms in P.
    D = number of all query terms of Q, ignoring stop words.
    Probability Weight (PW) = C/D
  Step 2 - Rank calculation:
    Find all backlinks of P (say set B).
    Calculate the rank score.
    Output PR(P) as the rank score.

4. Hyperlink-Induced Topic Search (HITS)
  HITS is a link-analysis algorithm.
  It distinguishes two types of web pages: hubs and authorities.
  Hub:
    A resource list.
    A good hub points to many authoritative pages on the content being queried.
  Authority:
    A page with important content.
    A good authority is pointed to by many good hub pages on the same content.

4. Hyperlink-Induced Topic Search (HITS) (contd)
Algorithm summary:

Input: a search topic, specified by one or more query terms.

Step 1 - Sampling: A sampling component, which constructs a focused collection of several thousand Web pages likely to be rich in relevant authorities

Step 2 - Weight propagation: A weight-propagation component, which determines numerical estimates of hub and authority weights by an iterative procedure.

Output: hubs and authorities for the search.
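The weight-propagation step can be sketched as follows. This is a minimal illustration, not the presenters' implementation: it assumes the sampling step has already produced a focused link graph, and the function name `hits`, the toy graph, the fixed iteration count, and the Euclidean normalization after each round are assumptions.

```python
import math


def hits(links, iterations=50):
    """Iteratively estimate hub and authority weights.

    links maps each page to the list of pages it points to. The graph is
    assumed to be the focused collection produced by the sampling step.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub weights of the pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                new_auth[t] += hub[src]

        # Hub update: sum the (new) authority weights of the pages p points to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, [])) for p in pages}

        # Normalize so the weights stay bounded (assumed Euclidean norm).
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}

    return hub, auth


# Hypothetical toy graph: three resource lists pointing at two content pages.
toy_links = {"H1": ["A1", "A2"], "H2": ["A1", "A2"], "H3": ["A1"]}
hub, auth = hits(toy_links)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

In the toy graph, A1 is pointed to by all three hubs and gets the largest authority weight, while H1 and H2, which point to both authorities, score higher as hubs than H3.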

Comparison table of WSM algorithms

Author/Year
  PageRank: S. Brin et al., 1998
  Weighted PageRank: Wenpu Xing et al., 2004
  Weighted Page Content Rank: P. Sharma et al., 2010
  HITS: Jon Kleinberg, 1998

Mining technique used
  PageRank: WSM
  Weighted PageRank: WSM
  Weighted Page Content Rank: WSM and WCM
  HITS: WSM and WCM

Description
  PageRank: Computes scores at indexing time, not query time. Results are sorted according to the importance of pages.
  Weighted PageRank: Assigns larger values to more important pages instead of dividing the rank value of a page evenly among its outlink pages.
  Weighted Page Content Rank: Gives a sorted order to the web pages returned by a search engine, as a numerical value, in response to a user query.
  HITS: Computes hub and authority scores of n highly relevant pages on the fly. Relevant as well as important pages are returned.

Comparison table of WSM algorithms (contd)

Input/output parameters
  PageRank: Backlinks
  Weighted PageRank: Backlinks, forward links
  Weighted Page Content Rank: Backlinks, forward links, contents
  HITS: Backlinks, forward links, contents

Complexity
  PageRank: O(log n)