Detecting Sequences and Cycles of Web Pages Narayan L. Bhamidipati and Sankar K. Pal Indian Statistical Institute Kolkata


Page 1

Detecting Sequences and Cycles of Web Pages

Narayan L. Bhamidipati and Sankar K. Pal

Indian Statistical Institute, Kolkata

Page 2

Contents

• Introduction
• Objective
• Significance
• Procedure
• Experiments
• Future directions

Page 3

The Web: A Directed Graph

• (V, A)
• Vertices: Web pages
  • V = {v_1, v_2, …, v_N}
• Arcs: Hyperlinks
  • A = {e_ij : a hyperlink joins v_j and v_i}
• Path: p_1.p_2. … .p_n with arcs from p_i to p_(i+1)
• Cycle: a path with p_n = p_1
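
The definitions above can be made concrete with a minimal sketch. The toy graph `web` and the helper names `is_path` and `is_cycle` are illustrative assumptions, not part of the original slides; each page simply maps to the set of pages it links to.

```python
# Hypothetical toy graph: each page maps to the set of pages it links to.
web = {
    "A": {"B"},
    "B": {"C"},
    "C": {"A", "D"},
    "D": set(),
}


def is_path(graph, pages):
    """True if there is an arc from every page to the next one in the list."""
    return all(nxt in graph.get(cur, set()) for cur, nxt in zip(pages, pages[1:]))


def is_cycle(graph, pages):
    """True if the pages form a path whose last page equals its first page."""
    return len(pages) > 1 and pages[0] == pages[-1] and is_path(graph, pages)
```

With this graph, `["A", "B", "C", "D"]` is a path and `["A", "B", "C", "A"]` is a cycle, while `["A", "C"]` is neither because no arc leads from A to C.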

Page 4

Sequences of Web Pages

• Paths consisting of adjacent web pages
• Order sensitive
• A surfer may follow one such sequence when browsing pages

Page 5

Cycles of Web Pages

• http://www.stanford.edu/
• http://www.stanford.edu/home/atoz/letterw.html
• http://www.stanford.edu/group/wellspring/
• http://www.stanford.edu/group/wellspring/yahoo_spotlight.html
• http://www.yahoo.com/
• http://dir.yahoo.com/Education/
• http://dir.yahoo.com/Education/Higher_Education/
• http://dir.yahoo.com/Education/Higher_Education/Colleges_and_Universities/
• http://dir.yahoo.com/Education/Higher_Education/Colleges_and_Universities/United_States/
• http://www.stanford.edu

Page 6

What are we looking for?

• A particular kind of sequence and cycle
  • Regular
  • Consisting of similar units
  • Units having a similar relationship
  • Reasonably sized

Page 7

Why are these Sequences and Cycles Interesting?

• Individual units form a single object
• These were intended to be together
• They collectively include the complete information
• Despite being part of a collection, individuality is maintained

Page 8

Significance of Detecting Such Sequences and Cycles

• Compression
  • Merge groups of pages
  • Fewer pages, fewer links
• Pre-fetching
  • Know where the surfer wants to be next
  • Fetch the page(s) before they are requested
  • Saves time
  • Errors: pre-fetching wrong pages

Page 9

Significance of Detecting Such Sequences and Cycles (Contd.)

• Fair comparison
  • Comparison independent of how content is presented
  • Content split into multiple pages should be treated as equivalent to the same content on a single page
• Better retrieval
  • Retrieval independent of the presentation
  • Output a set of pages instead of a single one as a match

Pages 10 to 12

Fair Comparison

Page 13

Improved Retrieval

• Retrieve only portions of interest
• Instead of whole (huge) documents
• Avoid rewarding more content

Page 14

How to Detect Sequences and Cycles of Web Pages?

• Find navigational links
  • Find consecutive pages
• Define the condition that the elements of a sequence must satisfy
• Identify subsequences (or units)
  • Concatenate
• Check for cycles

Page 15

Finding Navigational Links: Background

• The purpose of a link may be
  • Navigation
  • Reference
  • Advertisement
• Links between pages on the same server are treated as navigational
• They have also been treated as noise

Page 16

Finding Navigational Links: Our Method

• Avoid assuming that links on the same server are navigational links
• Navigational links appear mostly either at the top or at the bottom of a page
• Navigational links are generally huddled together
• Less text and fewer images around such links
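
The three cues above (position on the page, clustering, sparse surroundings) could be combined roughly as follows. This is only a sketch under stated assumptions: the `Link` record, its fields, the thresholds, and the "two out of three" vote are all hypothetical choices made for illustration, not the method used in the talk. Note that the server of the link target plays no role.

```python
from dataclasses import dataclass


@dataclass
class Link:
    href: str
    position: float    # assumed: 0.0 = top of the page, 1.0 = bottom
    nearby_links: int  # assumed: number of other links close to this one
    nearby_text: int   # assumed: characters of plain text around the link


def looks_navigational(link, edge=0.15, min_cluster=2, max_text=40):
    """Heuristic vote; all thresholds are arbitrary illustrative values."""
    near_edge = link.position <= edge or link.position >= 1.0 - edge
    clustered = link.nearby_links >= min_cluster
    sparse = link.nearby_text <= max_text
    # Call the link navigational when at least two of the three cues agree.
    return near_edge + clustered + sparse >= 2
```

For example, a link at the very bottom of a page, next to three other links and almost no text, is classified as navigational, whereas a link in the middle of a long paragraph is not.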

Page 17

Advantages and Limitations

• Simple and fast
• Navigational links across servers are also identified
• Heuristics need not always work; fall back on more sophisticated methods

Page 18

Units of the Sequences

• ABC is a unit if C is “related” to B in the same way as B is “related” to A

• “related” is defined in terms of how they are linked

• The relation is stored as the “position” of the link
• Several ways of defining “position”
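
A minimal sketch of the unit test described above, assuming one particular definition of “position”: the index of the link among the links on its source page. The function name `find_units`, the input layout, and the requirement that the three pages be distinct are assumptions made for illustration.

```python
def find_units(links):
    """Return all triples (A, B, C) where A->B and B->C share the same position.

    links maps (source, target) to the position of that link on the source
    page (assumed here to be an integer index; other definitions are possible).
    """
    units = []
    for (a, b), pos_ab in links.items():
        for (b2, c), pos_bc in links.items():
            # C is "related" to B in the same way as B is "related" to A.
            if b2 == b and pos_bc == pos_ab and len({a, b, c}) == 3:
                units.append((a, b, c))
    return units


# Hypothetical data: position 1 holds a forward link, position 0 a back link.
links = {
    ("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 1,
    ("B", "A"): 0, ("C", "B"): 0,
    ("A", "X"): 5,
}
units = find_units(links)
```

Here the forward links yield the units ABC and BCD, and the back links yield CBA; the stray link from A to X produces no unit.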

Page 19

Combining the units into sequences

• Units: DEF, BCD, ABC, CDE
• Combined sequence: ABCDEF
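
One way to perform this combination, sketched under the assumption that each page has at most one successor among the detected units: record the successor of every page, then walk forward from each page that has no predecessor. The name `combine_units` is hypothetical. Pure cycles have no such starting page and are left to the cycle check.

```python
def combine_units(units):
    """Chain overlapping units (e.g. ABC and BCD) into maximal sequences."""
    succ = {}
    for unit in units:
        for cur, nxt in zip(unit, unit[1:]):
            succ[cur] = nxt  # assumes a single successor per page

    has_pred = set(succ.values())
    sequences = []
    for start in succ:
        if start in has_pred:
            continue  # not the first page of a sequence
        seq, seen = [start], {start}
        # Follow successors, stopping at the end or before revisiting a page.
        while seq[-1] in succ and succ[seq[-1]] not in seen:
            seq.append(succ[seq[-1]])
            seen.add(seq[-1])
        sequences.append(seq)
    return sequences
```

Calling `combine_units(["DEF", "BCD", "ABC", "CDE"])` on the units from the slide gives the single sequence A, B, C, D, E, F, regardless of the order in which the units are listed.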

Page 20

Cycle detection

• Existing cycle detection algorithms
• Cycle detection in number theory
• Special case of cycle detection in graph theory
• Stack-based algorithm
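
The slides do not spell out the stack-based algorithm, so the following is only one plausible sketch, not the authors' procedure. It assumes every page has at most one "next" page (the hypothetical `next_page` mapping), pushes visited pages onto a stack, and reports a cycle as soon as a page already on the stack is reached again.

```python
def find_cycle(next_page, start):
    """Follow next-links from start; return the cycle found, or None.

    The cycle is returned as a path whose last page equals its first page.
    """
    stack = []   # pages in the order they were visited
    index = {}   # page -> its position on the stack
    page = start
    while page is not None and page not in index:
        index[page] = len(stack)
        stack.append(page)
        page = next_page.get(page)
    if page is None:
        return None  # the sequence ended without repeating a page
    return stack[index[page]:] + [page]


# Hypothetical tutorial whose last chapter links back to the first chapter.
next_page = {"intro": "ch1", "ch1": "ch2", "ch2": "ch3", "ch3": "ch1"}
cycle = find_cycle(next_page, "intro")
```

In this example the walk starts at `intro`, which is not itself on the cycle, and the function returns ch1, ch2, ch3, ch1.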

Page 21

Improvements and Speedups

• Trust the “rel” information provided by the (author of the) pages

• Use keywords like “next” and “previous” to infer the relationships

• Use the information in the naming convention
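
The first two speedups can be sketched with Python's standard `html.parser` module. The class name `NavLinkParser`, the keyword table, and the rule that the first match wins are assumptions made for illustration; the sketch reads `rel` attributes on `<link>` and `<a>` tags and falls back on anchor text such as "Next" or "Previous".

```python
from html.parser import HTMLParser


class NavLinkParser(HTMLParser):
    # Assumed keyword table mapping rel values or anchor text to a relation.
    KEYWORDS = {"next": "next", "previous": "prev", "prev": "prev"}

    def __init__(self):
        super().__init__()
        self.relations = {}     # "next" / "prev" -> href (first match wins)
        self._open_href = None  # href of the <a> element currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if tag not in ("a", "link") or not href:
            return
        # Trust the author-provided rel information first.
        for rel in (attrs.get("rel") or "").lower().split():
            if rel in self.KEYWORDS:
                self.relations.setdefault(self.KEYWORDS[rel], href)
        if tag == "a":
            self._open_href, self._text = href, []

    def handle_data(self, data):
        if self._open_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open_href is not None:
            # Fall back on keywords in the anchor text.
            label = "".join(self._text).strip().lower()
            if label in self.KEYWORDS:
                self.relations.setdefault(self.KEYWORDS[label], self._open_href)
            self._open_href = None
```

Feeding a page to the parser with `feed()` and then reading `relations` gives the candidate "next" and "prev" targets without inspecting link positions at all.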

Page 22

Experimental Results

• Data
  • Toy data: Python tutorial in HTML
  • Tutorial split into several chapters and sections
  • Several cycles
• Mutilated data
  • Certain pages deleted (missing links)
• 100% detection in all cases

Page 23

Other experiments planned

• Real test: unorganized web pages
• Difficulties:
  • Finding navigational links
  • Noise (advertisements, etc.)
  • Dynamically generated pages
• Will the relationships hold?

Page 24

Leads us to …

• Concatenate detected sequences for analysis
• Modify retrieval mechanism
• Return sets of pages as results
• Improve mirror/duplicate detection

Page 25

Future Work

• Consider other relations
• Unifying framework?
• Improve identification of navigational links
