Detecting Sequences and Cycles of Web Pages

Narayan L. Bhamidipati and Sankar K. Pal

Indian Statistical Institute, Kolkata

Contents

• Introduction
• Objective
• Significance
• Procedure
• Experiments
• Future directions

The Web: A Directed Graph

• (V, A)
• Vertices: Web pages
  • V = {v1, v2, …, vN}
• Arcs: Hyperlinks
  • A = {eij : vj → vi}
• Path: p1.p2. … .pn with arcs from pi to pi+1
• Cycle: A path with pn = p1
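
The path and cycle definitions above can be checked mechanically. The following is a minimal sketch, assuming the graph is stored as a dictionary from each page to the set of pages it links to; the names `Graph`, `is_path`, `is_cycle` and the toy graph `web` are illustrative and do not come from the slides.

```python
from typing import Dict, List, Set

# Assumed representation: page -> set of pages it links to.
Graph = Dict[str, Set[str]]


def is_path(graph: Graph, pages: List[str]) -> bool:
    """True if there is an arc from every page to the next one in the list."""
    return len(pages) >= 2 and all(
        nxt in graph.get(cur, set()) for cur, nxt in zip(pages, pages[1:])
    )


def is_cycle(graph: Graph, pages: List[str]) -> bool:
    """A cycle is a path whose last page equals its first page."""
    return is_path(graph, pages) and pages[0] == pages[-1]


# Hypothetical three-page web: a -> b -> c -> a
web: Graph = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
```

Here `["a", "b", "c"]` is a path but not a cycle, while `["a", "b", "c", "a"]` is both.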

Sequences of Web Pages

• Paths consisting of adjacent web pages
• Order sensitive
• A surfer may follow one such sequence when browsing pages

Cycles of Web Pages

• http://www.stanford.edu/
• http://www.stanford.edu/home/atoz/letterw.html
• http://www.stanford.edu/group/wellspring/
• http://www.stanford.edu/group/wellspring/yahoo_spotlight.html
• http://www.yahoo.com/
• http://dir.yahoo.com/Education/
• http://dir.yahoo.com/Education/Higher_Education/
• http://dir.yahoo.com/Education/Higher_Education/Colleges_and_Universities/
• http://dir.yahoo.com/Education/Higher_Education/Colleges_and_Universities/United_States/
• http://www.stanford.edu

What are we looking for?

• A particular kind of sequences and cycles
  • Regular
  • Consisting of similar units
  • Units having similar relationship
  • Reasonably sized

Why are these Sequences and Cycles Interesting?

• Individual units form a single object
• These were intended to be together
• They collectively include the complete information
• Despite being part of a collection, individuality is maintained

Significance of Detecting Such Sequences and Cycles

• Compression
  • Merge groups of pages
  • Fewer pages, fewer links

• Pre-fetching
  • Know where the surfer wants to be next
  • Fetch the page(s) before being requested
  • Saves time
  • Errors: pre-fetching wrong pages

Significance of Detecting Such Sequences and Cycles (Contd.)

• Fair comparison
  • Comparison independent of how content is presented
  • Content split into multiple pages should be treated as equivalent to the same content in a single page

• Better retrieval
  • Retrieval independent of the presentation
  • Output a set of pages instead of a single one as a match

Fair Comparison

Improved Retrieval

• Retrieve only portions of interest
• Instead of whole (huge) documents
• Avoid rewarding more content

How to Detect Sequences and Cycles of Web Pages?

• Find navigational links
  • Find consecutive pages

• Define the condition that the elements of a sequence must satisfy

• Identify subsequences (or units)
  • Concatenate

• Check for cycles

Finding Navigational Links: Background

• The purpose of a link may be
  • Navigation
  • Reference
  • Advertisement

• Links between pages on the same server are treated as navigational

• They have also been treated as noise

Finding Navigational Links: Our Method

• Avoid treating links on the same server as navigational links

• Navigational links appear mostly at the top or at the bottom of a page

• Navigational links are generally huddled together

• Less text and fewer images around such links
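
The three cues above (position near the top or bottom, huddling, little surrounding text) could be combined as in the following sketch. It is not the authors' implementation: the `Link` record, the relative `offset` measure, and the thresholds `edge`, `max_text` and `gap` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Link:
    href: str
    offset: float     # assumed: relative position in the page, 0.0 = top, 1.0 = bottom
    nearby_text: int  # assumed: number of plain-text characters around the link


def navigational_links(links: List[Link], edge: float = 0.15,
                       max_text: int = 40, gap: float = 0.05) -> List[str]:
    """Return hrefs of links that look navigational under the three heuristics."""
    # Cues 1 and 3: near the top or bottom of the page, with little text around.
    candidates = sorted(
        (l for l in links
         if (l.offset <= edge or l.offset >= 1 - edge) and l.nearby_text <= max_text),
        key=lambda l: l.offset,
    )
    result = []
    for i, link in enumerate(candidates):
        # Cue 2: "huddled together" means another candidate lies within `gap`.
        near_prev = i > 0 and link.offset - candidates[i - 1].offset <= gap
        near_next = (i + 1 < len(candidates)
                     and candidates[i + 1].offset - link.offset <= gap)
        if near_prev or near_next:
            result.append(link.href)
    return result


# Hypothetical page with two links at the top, one in the body, one at the bottom.
page = [
    Link("prev.html", 0.02, 10),
    Link("next.html", 0.04, 8),
    Link("ref.html", 0.50, 300),
    Link("ad.html", 0.97, 5),
]
```

Note that the server hosting the target page plays no role here, so navigational links across servers are not excluded.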

Advantages and Limitations

• Simple and fast
• Navigational links across servers are also identified

• Heuristics need not always work – fall back on sophisticated methods

Units of the Sequences

• ABC is a unit if C is “related” to B in the same way as B is “related” to A

• “related” is defined in terms of how they are linked

• Relation is stored as “position” of the link
• Several ways of defining “position”
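
One way to read the unit definition is sketched below. It assumes that “position” means the index of a link in the page's ordered list of navigational links, which is only one of the several possible definitions the slide mentions; `find_units` and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed representation: page -> ordered list of its navigational links.
Links = Dict[str, List[str]]


def find_units(links: Links) -> List[Tuple[str, str, str]]:
    """Return triples (A, B, C) where B sits at position k among A's links
    and C sits at the same position k among B's links."""
    units = []
    for a, out_a in links.items():
        for pos, b in enumerate(out_a):
            out_b = links.get(b, [])
            if pos < len(out_b):
                c = out_b[pos]
                # Skip degenerate triples such as (A, B, A).
                if len({a, b, c}) == 3:
                    units.append((a, b, c))
    return units


# Hypothetical chain in which every page has a single "next" link.
chain: Links = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
```

For the chain above this yields the units ABC and BCD.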

Combining the units into sequences

• DEF
• BCD
• ABC
• CDE

• ABCDEF
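
The concatenation step can be sketched as follows: two units are chained when the last two pages of one are the first two pages of the next. This is an illustrative reading of the slide, not the authors' code; `combine_units` is a hypothetical name, and the sketch assumes that no two units share the same first two pages.

```python
from typing import Dict, List, Tuple

Unit = Tuple[str, str, str]


def combine_units(units: List[Unit]) -> List[List[str]]:
    """Chain overlapping units into maximal sequences."""
    # Assumption: at most one unit starts with any given pair of pages.
    by_prefix: Dict[Tuple[str, str], Unit] = {(u[0], u[1]): u for u in units}
    suffixes = {(u[1], u[2]) for u in units}
    sequences = []
    for unit in units:
        if (unit[0], unit[1]) in suffixes:
            continue  # some other unit leads into this one, so it is not a start
        seq = list(unit)
        seen = {unit}
        while (seq[-2], seq[-1]) in by_prefix:
            nxt = by_prefix[(seq[-2], seq[-1])]
            if nxt in seen:
                break  # guard against looping forever on cyclic data
            seen.add(nxt)
            seq.append(nxt[2])
        sequences.append(seq)
    return sequences


# The four units from the slide, in the order listed there.
slide_units = [("D", "E", "F"), ("B", "C", "D"), ("A", "B", "C"), ("C", "D", "E")]
```

Applied to the four units on the slide, the sketch produces the single sequence A, B, C, D, E, F. A pure cycle has no starting unit and is left to the cycle check.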

Cycle detection

• Existing cycle detection algorithms
  • Cycle detection in number theory
  • Special case of cycle detection in graph theory
• Stack based algorithm
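
The slides do not spell out the stack based algorithm, so the following is only a generic sketch of one standard option: a depth-first search with an explicit stack that reports the first cycle it meets in a directed graph. The function name and the toy graph are assumptions.

```python
from typing import Dict, List, Optional


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as [p1, ..., pn] with pn == p1, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the stack, fully explored
    color: Dict[str, int] = {}
    for root in graph:
        if color.get(root, WHITE) != WHITE:
            continue
        path = [root]                         # explicit stack of pages
        iters = [iter(graph.get(root, ()))]   # one link iterator per stacked page
        color[root] = GREY
        while path:
            for nxt in iters[-1]:
                state = color.get(nxt, WHITE)
                if state == GREY:
                    # nxt is already on the stack: the stack closes a cycle.
                    return path[path.index(nxt):] + [nxt]
                if state == WHITE:
                    color[nxt] = GREY
                    path.append(nxt)
                    iters.append(iter(graph.get(nxt, ())))
                    break
            else:
                # All links of the top page explored: pop it.
                color[path.pop()] = BLACK
                iters.pop()
    return None


# Hypothetical graph: a -> b -> c -> a, plus d -> a.
toy = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
```

On the toy graph the search returns a, b, c, a. When every page has a single “next” link, the graph is a simple chain and the same search reduces to following that link until a page repeats.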

Improvements and Speedups

• Believe the “rel” information provided by the (author of the) pages

• Use keywords like “next” and “previous” to perceive the relationships

• Utilize the information of the naming convention
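
The first two speedups can be sketched with Python's standard `html.parser` module: read `rel` attributes where the page author supplies them, and otherwise fall back on anchor text such as “next” and “previous”. The class name, the `KEYWORDS` table and the sample markup are illustrative assumptions; the naming-convention idea is not covered here.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Assumed keyword table mapping rel values and anchor text to a relation name.
KEYWORDS = {"next": "next", "prev": "prev", "previous": "prev"}


class RelationParser(HTMLParser):
    """Collects at most one href per relation ("next" or "prev")."""

    def __init__(self) -> None:
        super().__init__()
        self.relations: Dict[str, str] = {}
        self._href: Optional[str] = None  # href of the <a> currently open
        self._text: List[str] = []        # anchor text collected so far

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # Trust the author's rel information first.
        for name in (attr.get("rel") or "").lower().split():
            if name in KEYWORDS:
                self.relations.setdefault(KEYWORDS[name], href)
        if tag == "a":
            self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Fall back on keywords in the anchor text.
            word = "".join(self._text).strip().lower()
            if word in KEYWORDS:
                self.relations.setdefault(KEYWORDS[word], self._href)
            self._href = None


# Hypothetical page fragment.
sample = ('<link rel="next" href="ch2.html">'
          '<a href="intro.html">Previous</a>'
          '<a href="glossary.html">Glossary</a>')
parser = RelationParser()
parser.feed(sample)
```

After feeding the sample fragment, `parser.relations` maps `"next"` to `ch2.html` and `"prev"` to `intro.html`; the glossary link is ignored.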

Experimental Results

• Data
  • Toy data: Python tutorial in HTML
  • Tutorial split into several chapters and sections
  • Several cycles

• Mutilated data
  • Certain pages deleted (missing links)

• 100% detection in all cases

Other experiments planned

• Real test: unorganized web pages
• Difficulties:
  • Finding navigational links
  • Noise (advertisements, etc.)
  • Dynamically generated pages
• Will the relationships hold?

Leads us to …

• Concatenate detected sequences for analysis
• Modify retrieval mechanism
• Return sets of pages as results
• Improve mirror/duplicate detection

Future Work

• Consider other relations
• Unifying framework?
• Improve identification of navigational links
