AJAX Crawl: Making AJAX Applications Searchable Cristian Duda and many others ICDE09

Page 1: AJAX Crawl

AJAX Crawl: Making AJAX Applications Searchable

Cristian Duda

and many others

ICDE09

Page 2: AJAX Crawl

Outline

Introduction

Modeling AJAX

AJAX Crawling

Architecture of A Search Engine

Experimental Results & Conclusions

Page 3: AJAX Crawl

What is AJAX?

Asynchronous JavaScript and XML

Page 4: AJAX Crawl

What is AJAX?

Asynchronous JavaScript and XML

AJAX applications: Google Mail, Yahoo! Mail, Google Maps.

The URL does not change when the application state changes.

Page 5: AJAX Crawl

Why Do Search Engines Ignore AJAX Content?

No caching/pre-crawl: events cannot be cached.

Duplicate states: several events can lead to the same state.

Very granular events: they lead to a large set of very similar states.

Infinite event invocation.

Page 6: AJAX Crawl

Now

Introduction

Modeling AJAX: Event Model, AJAX Page Model, AJAX Web Sites Model

AJAX Crawling

….

Page 7: AJAX Crawl

Event Model

When JavaScript is used, the application reacts to user events: click, doubleClick, mouseover, etc.

Event structure in JavaScript:

Page 8: AJAX Crawl

AJAX Page Model

An AJAX application = a simple page identified by a URL + a series of states, events and transitions.

Page model: a view of all states in a page (e.g., all comment pages).

In particular, it is an automaton, a Transition Graph, which contains all application entities (states, events, transitions).

Page 9: AJAX Crawl

Model of An AJAX Web Page

Page 10: AJAX Crawl

Model of An AJAX Web Page

Nodes: application states. An application state is represented as a DOM tree.

Edges: transitions between states. A transition is triggered by an event activated on the source element and applied to one or more target elements, whose properties change through an action.

Annotation For The Transition Graph of An AJAX Application
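The page model described above can be sketched as a small data structure. This is only an illustration, not the authors' implementation: the class names `Transition` and `PageModel`, the field names, the example URL, and the use of plain strings in place of DOM trees are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transition:
    """An edge of the transition graph, annotated with what triggered it."""
    source_state: str    # id of the state the event was fired in
    target_state: str    # id of the state reached after the action
    event: str           # e.g. "click"
    source_element: str  # element the event was activated on
    targets: tuple       # elements whose properties the action changes


@dataclass
class PageModel:
    """Transition graph of one AJAX page, identified by its URL."""
    url: str
    states: dict = field(default_factory=dict)      # state id -> DOM (a string here)
    transitions: list = field(default_factory=list)

    def add_state(self, dom: str) -> str:
        """Return the id of the state with this DOM, creating it if needed."""
        for state_id, known_dom in self.states.items():
            if known_dom == dom:
                return state_id
        state_id = f"s{len(self.states)}"
        self.states[state_id] = dom
        return state_id

    def add_transition(self, transition: Transition) -> None:
        self.transitions.append(transition)


# Hypothetical page with two comment pages linked by a "next" click.
model = PageModel(url="http://example.com/watch?v=1")
s0 = model.add_state("<div id='comments'>page 1</div>")
s1 = model.add_state("<div id='comments'>page 2</div>")
model.add_transition(Transition(s0, s1, "click", "next_link", ("comments",)))
```

Adding a DOM that is already known returns the existing state id, so duplicate states collapse into one node.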

Page 11: AJAX Crawl

AJAX Web Sites Model

Page 12: AJAX Crawl

Now

Introduction

Modeling AJAX

AJAX Crawling

Architecture of A Search Engine

Experimental Results & Conclusions

Page 13: AJAX Crawl

Crawling Algorithm

Build the model of AJAX Web Site.

Focus on how to build the AJAX Page Model (e.g., for YouTube, indexing all comment pages of a video).

Page 14: AJAX Crawl

A Basic AJAX Crawling Algorithm

First Step: Read the initial DOM of the document at a given URI (line 2).

Page 15: AJAX Crawl

A Basic AJAX Crawling Algorithm

Next Step: This step is AJAX-specific and consists of running the onLoad event of the body tag in the HTML document (line 3).

All JavaScript-enabled browsers invoke this function first.

Page 16: AJAX Crawl

A Basic AJAX Crawling Algorithm

Crawling starts after this initial state has been constructed (line 5).

The algorithm performs a breadth-first crawl, i.e., it triggers all events in the page and invokes the corresponding JavaScript functions.

Page 17: AJAX Crawl

A Basic AJAX Crawling Algorithm

Whenever the DOM changes, a new state is created (line 11) and the corresponding transition is added to the application model (line 16).
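The basic algorithm walked through on these slides can be sketched as a breadth-first traversal. This is a simplified illustration rather than the listing the slides refer to: the function `crawl_page`, the `max_states` budget (a guard against infinite event invocation), and the toy page in which the DOM is reduced to a comment page number are all assumptions.

```python
from collections import deque


def crawl_page(initial_dom, events, max_states=50):
    """Breadth-first crawl of one page.

    `events` maps an event name to a function dom -> dom that stands in for
    the JavaScript handler. Returns (states, transitions).
    """
    states = [initial_dom]          # position in this list = state id
    transitions = []                # (source id, event name, target id)
    queue = deque([0])
    while queue:
        source = queue.popleft()
        for name, handler in events.items():
            new_dom = handler(states[source])
            if new_dom == states[source]:
                continue            # the DOM did not change: no transition
            if new_dom in states:
                target = states.index(new_dom)   # duplicate state, reuse it
            elif len(states) < max_states:
                states.append(new_dom)
                target = len(states) - 1
                queue.append(target)
            else:
                continue            # state budget exhausted
            transitions.append((source, name, target))
    return states, transitions


# Toy page: three comment pages reachable through "next" and "prev" links.
def next_page(dom):
    return min(dom + 1, 3)


def prev_page(dom):
    return max(dom - 1, 1)


states, transitions = crawl_page(1, {"next": next_page, "prev": prev_page})
```

In this sketch every triggered event calls its handler, which in a real crawl would mean a network request per event; that cost is what the following slides address.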

Page 18: AJAX Crawl

Problem of The Basic Algorithm

The network time needed to fetch pages. In the case of AJAX crawling, multiple individual events per page lead to fetching network content.

Traditional way: pre-caching the Web and crawling locally. Two pages can be checked for identity using only their URLs.

Page 19: AJAX Crawl

A Heuristic Crawling Policy For AJAX Applications

We observe that AJAX applications have a stable structure: a menu, present in all states, and a dynamic part.

Goal: identify that two states are the same without fetching the content.

Page 20: AJAX Crawl

JavaScript Invocation Graph

The heuristic we use is based on the runtime analysis of the JavaScript invocation graph.

Page 21: AJAX Crawl

JavaScript Invocation Graph

Events And Functionalities In The JavaScript Invocation Graph

Nodes: JavaScript functions. The functionality of an AJAX page is expressed through events.

Page 22: AJAX Crawl

JavaScript Invocation Graph

Functions In The JavaScript Invocation Graph On YouTube Page

Functions in the JavaScript code can be invoked either directly by event triggers (event invocations) or indirectly by other functions (local invocations).

Page 23: AJAX Crawl

The dependencies in the code are listed below:

Page 24: AJAX Crawl

JavaScript Invocation Graph

Hot Node: a function that fetches content from the server.

Hot Call: a call to a Hot Node.

A single function fetches content from the server, i.e., getURLXMLResponseAndFillDiv.
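To make the terms concrete, the sketch below marks which events lead to a Hot Call. The slides describe a runtime analysis; this example instead uses a small hand-written static graph purely for illustration. Only `getURLXMLResponseAndFillDiv` comes from the slides; every other function name, the event names, and the helper `reaches_hot_node` are hypothetical.

```python
def reaches_hot_node(graph, hot_nodes, start):
    """Return True if `start` is, or can reach, a hot node via local invocations."""
    stack, seen = [start], set()
    while stack:
        fn = stack.pop()
        if fn in hot_nodes:
            return True
        if fn in seen:
            continue
        seen.add(fn)
        stack.extend(graph.get(fn, []))
    return False


# function -> functions it invokes (local invocations); names are made up
invocation_graph = {
    "showNextComments": ["loadComments"],
    "showPrevComments": ["loadComments"],
    "loadComments": ["getURLXMLResponseAndFillDiv"],
    "toggleMenu": ["setStyle"],
}
hot_nodes = {"getURLXMLResponseAndFillDiv"}

# event -> function it invokes directly (event invocations); names are made up
event_handlers = {
    "click next": "showNextComments",
    "click prev": "showPrevComments",
    "mouseover menu": "toggleMenu",
}

# Events whose handlers end in a Hot Call, i.e. would cause a server request.
network_events = sorted(
    event for event, fn in event_handlers.items()
    if reaches_hot_node(invocation_graph, hot_nodes, fn)
)
```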

Page 25: AJAX Crawl

In AJAX, the same function can be invoked from different comment pages to fetch the same content from the server.

In this approach we detect this situation and we avoid invoking the same function twice.

Page 26: AJAX Crawl

How to solve it?

We solve the problem of caching in AJAX applications and detecting duplicate states by identifying and reusing the result of server calls.

Just as in traditional I/O analysis in databases, we aim to minimize the number of the most expensive operations, i.e., the Hot Calls: invocations that generate AJAX calls to the server.

Page 27: AJAX Crawl

Optimized Crawling Algorithm

Page 28: AJAX Crawl

Optimized Crawling Algorithm

Step 1: Identifying Hot Nodes.

The crawler tags the Hot Nodes, i.e., the functions that directly contain AJAX calls (line 34).

Page 29: AJAX Crawl

Optimized Crawling Algorithm

Step 2: Building Hot Node Cache.

The crawler builds a table containing all Hot Node invocations, the actual parameters used in each call, and the results returned by the server (lines 34-53). This step uses the current runtime stack trace.

Page 30: AJAX Crawl

Optimized Crawling Algorithm

Step 3: Intercepting Hot Node Calls.

The crawler adopts the following policy:

1. Intercept all invocations of Hot Nodes (functions) together with their actual parameters (line 34).

2. Look up each function call in the Hot Node Cache (lines 37-39).

3. If a match is found (same Hot Node with the same parameters), do not invoke the AJAX call; reuse the existing content instead (line 41).
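The three-step policy above amounts to memoizing Hot Node calls on the pair (function, actual parameters). The following is a minimal sketch under that reading, not the crawler's actual code: the class `HotNodeCache`, its `intercept` method, the one-argument signature given to `getURLXMLResponseAndFillDiv`, and the fake URLs are assumptions for illustration.

```python
class HotNodeCache:
    """Caches server results keyed by (hot node name, actual parameters)."""

    def __init__(self):
        self.table = {}
        self.network_calls = 0

    def intercept(self, hot_node):
        """Wrap a hot node so that repeated calls reuse the cached result."""
        def wrapper(*args):
            key = (hot_node.__name__, args)
            if key not in self.table:
                self.network_calls += 1       # only a cache miss reaches the server
                self.table[key] = hot_node(*args)
            return self.table[key]
        return wrapper


def getURLXMLResponseAndFillDiv(url):
    # Stand-in for the real AJAX call; returns fake server content.
    return f"<comments from {url}>"


cache = HotNodeCache()
fetch = cache.intercept(getURLXMLResponseAndFillDiv)

page2_via_next = fetch("/comments?page=2")   # miss: goes to the "server"
page2_via_prev = fetch("/comments?page=2")   # hit: reused from the cache
page3 = fetch("/comments?page=3")            # miss: different parameters
```

Because a hit returns the stored content, two events that issue the same Hot Call yield the same state without a second network request.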

Page 31: AJAX Crawl

Simplifying Assumptions

Snapshot Isolation: the application does not change during crawling.

No Forms: we do not deal with AJAX parts that require the user to input data in forms, such as Google Suggest.

No Update Events: we avoid triggering update events, such as Delete buttons.

No Image-based Retrieval

Page 32: AJAX Crawl

Now

Introduction

Modeling AJAX

AJAX Crawling

Architecture of A Search Engine

Experimental Results

Page 33: AJAX Crawl

The Components

AJAX Crawler

Indexing

Query Processing

Result Aggregation

Page 34: AJAX Crawl

Indexing

Starts from the model of the AJAX Site and builds the physical inverted file.

As opposed to the traditional approach, a result is a URI and a state.

Page 35: AJAX Crawl

Processing Simple Keyword Queries

Each query returns the URI and the state(s) which contain the keywords.

Results are ranked by score.
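As an illustration of indexing and simple keyword queries, the sketch below builds an inverted file whose postings are (URI, state) pairs. The slides do not define the score, so term frequency is used here purely as an assumption; the function names, the example URIs, and the state texts are also made up.

```python
from collections import defaultdict


def build_index(site_model):
    """Build an inverted file: keyword -> list of (uri, state id, count)."""
    index = defaultdict(list)
    for uri, states in sorted(site_model.items()):
        for state_id, text in sorted(states.items()):
            counts = defaultdict(int)
            for word in text.lower().split():
                counts[word] += 1
            for word, count in counts.items():
                index[word].append((uri, state_id, count))
    return index


def search(index, keyword):
    """Return (uri, state id) pairs containing the keyword, best score first."""
    postings = index.get(keyword.lower(), [])
    # Assumed score: term frequency; ties are broken by uri, then state id.
    ranked = sorted(postings, key=lambda p: (-p[2], p[0], p[1]))
    return [(uri, state_id) for uri, state_id, _ in ranked]


# Hypothetical site model: uri -> {state id: text of that state}.
site_model = {
    "http://example.com/video1": {
        "s0": "great song",
        "s1": "the singer of morcheeba is great great",
    },
    "http://example.com/video2": {"s0": "morcheeba live"},
}
index = build_index(site_model)
great_hits = search(index, "great")
morcheeba_hits = search(index, "morcheeba")
```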

Page 36: AJAX Crawl

Processing Conjunctions

Query: Morcheeba singer

Page 37: AJAX Crawl

Processing Conjunctions

Conjunctions are computed as a merge between the individual posting lists of the corresponding keywords.

Two entries are compatible if their URLs match and, in addition, their states are identical.
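The merge of two posting lists can be sketched as a standard sorted-list intersection in which an entry matches only when both the URL and the state agree. The function name `conjunction` and the sample postings are assumptions for illustration, and the lists are assumed to be sorted by (URL, state).

```python
def conjunction(postings_a, postings_b):
    """Merge two sorted posting lists of (uri, state id) pairs."""
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])   # same URL and same state
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical posting lists for the query "Morcheeba singer".
morcheeba = [("video1", "s1"), ("video2", "s0"), ("video3", "s2")]
singer = [("video1", "s0"), ("video1", "s1"), ("video3", "s2")]
both = conjunction(morcheeba, singer)
```

Note that ("video1", "s0") is not returned: the URL matches an entry in the other list, but the state does not.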

Page 38: AJAX Crawl

Parallelization

Crawling AJAX faces the difficulty of not being able to really cache dynamic Web content, and network connections must continuously be created.

Page 39: AJAX Crawl

Parallelization

A precrawler is used to build the traditional, link-based Web site structure.

The total list of URLs of AJAX Web pages is then partitioned and supplied to a set of parallel crawlers.

Page 40: AJAX Crawl

Parallelization

Each crawler applies the crawling algorithm and builds the AJAX Model for each crawled page.

Page 41: AJAX Crawl

Parallelization

Multiple indexes are then built from the disjoint sets of AJAX Models.

Page 42: AJAX Crawl

Parallelization

Query processing is then performed by query shipping: computing the results from each index, merging the individual results, and returning the final list to the client.
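The parallel pipeline described on these slides (partition the URLs, crawl and index each partition, ship the query to every index, merge) can be sketched as follows. This is a toy illustration under several assumptions: round-robin partitioning, threads standing in for separate crawlers, an in-memory `FAKE_SITE` replacing the actual per-page AJAX crawl, and all function names.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(urls, n):
    """Split the URL list into n disjoint partitions (round robin)."""
    return [urls[i::n] for i in range(n)]


def crawl_and_index(urls, fetch_states):
    """One crawler: build an index keyword -> set of (url, state id)."""
    index = {}
    for url in urls:
        for state_id, text in fetch_states(url).items():
            for word in text.lower().split():
                index.setdefault(word, set()).add((url, state_id))
    return index


def query_all(indexes, keyword):
    """Query shipping: ask every index, then merge the partial results."""
    merged = set()
    for index in indexes:
        merged |= index.get(keyword, set())
    return sorted(merged)


# Stand-in for the per-page AJAX crawl: url -> {state id: text}.
FAKE_SITE = {
    "u1": {"s0": "morcheeba live"},
    "u2": {"s0": "cats", "s1": "morcheeba singer"},
    "u3": {"s0": "news"},
    "u4": {"s0": "singer interview"},
}

parts = partition(sorted(FAKE_SITE), 2)
with ThreadPoolExecutor(max_workers=2) as pool:
    indexes = list(pool.map(lambda p: crawl_and_index(p, FAKE_SITE.get), parts))
```

Because the partitions are disjoint, the merge step is a plain union of the partial results.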

Page 43: AJAX Crawl

Now

Introduction

Modeling AJAX

AJAX Crawling

Architecture of A Search Engine

Experimental Results & Conclusions

Page 44: AJAX Crawl

Experimental Setup

YouTube Datasets

Algorithms: Traditional Crawling, AJAX Non-Cached, AJAX Cached, AJAX Parallel

YouTube Statistics

Page 45: AJAX Crawl

Crawling Time

Network time is predominant, which underlines the importance of applying the Hot Node optimization.

Page 46: AJAX Crawl

Total Crawling Time & Network Time

Page 47: AJAX Crawl

Crawling Time

The Hot Node heuristic is effective. It improves crawling time by a factor of 1.29 compared with the Non-Cached approach.
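For readers who prefer a percentage: assuming the factor is the ratio of Non-Cached to Cached crawling time (the slide does not define it explicitly), a factor of 1.29 corresponds to a reduction in crawling time of roughly 22%.

```latex
\frac{T_{\text{non-cached}}}{T_{\text{cached}}} = 1.29
\quad\Longrightarrow\quad
1 - \frac{T_{\text{cached}}}{T_{\text{non-cached}}} = 1 - \frac{1}{1.29} \approx 0.22
```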

Page 48: AJAX Crawl

Number of AJAX Events Resulting In Network Requests

Page 49: AJAX Crawl

Crawling Time

Parallelization is effective. The running time decreases by almost 25% compared with the AJAX Non-Parallel version.

Page 50: AJAX Crawl

Query Processing Time

YouTube queries.

Query processing times on YouTube.

Page 51: AJAX Crawl

Recall

For each query, we compared the number of videos returned using only the traditional approach with the total number of videos returned by the AJAX Crawl approach, in which comment pages are also taken into account.

Page 52: AJAX Crawl

Discussions

Combine with existing search engines.

Focusing on a specific user’s interaction with the server.

Support more AJAX applications, such as forms.

Irrelevant events. This paper focuses on the most important events (click, doubleClick, mouseover).

Page 53: AJAX Crawl

Questions? Comments?