Analysis of DOM Structures for Site-Level Template Extraction

(PSI 2015)

Joint work done in collaboration with Julián Alarte, Josep Silva, Salvador Tamarit


Contents

Motivation

• Content Extraction and Block Detection

• Template Extraction

A Technique for Template Extraction

• State of the art

• The DOM tree

• Template extraction based on DOM

Experiments

• Firefox plugin online DEMO

Conclusions and Future Work

Motivation

Information Retrieval

Web Mining

Template Detection

Content Extraction

Block Detection

What is content extraction?

Content extraction is the process of determining which parts of a webpage contain the main textual content, ignoring additional context such as menus, status bars, advertisements, sponsored information, etc.

What is block detection?

Block detection is the discipline that tries to isolate every information block in a webpage.

(Figure callouts: "The date is different"; "The title is different".)

Why is template extraction useful?

Component reuse: web developers can automatically extract components from a webpage.

Enhancing indexers and text analyzers to increase their performance by processing only the relevant information. It has been measured that about 40-50% of the components of a webpage represent the template.

Extraction of the main content of a webpage so that it can be suitably displayed on a small device such as a PDA or a mobile phone.

Extraction of the relevant content to make the webpage more accessible for visually impaired or blind users.


The Technique

What is a webpage?

There are three main ways to solve the problem:

Using the textual information of the webpage (i.e., the HTML code)

Using the rendered image of the webpage in the browser

Using the DOM tree of the webpage

State of the Art

Approaches based on the textual information of the webpage:

Densitometric features: counting characters and tags.

Statistics on terms: some terms are common in templates.
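To make the idea of a densitometric feature concrete, here is a minimal sketch that counts characters and tags in a line of HTML. The function name `text_to_tag_ratio` and the exact ratio are assumptions chosen for illustration; they are not taken from any of the surveyed approaches.

```python
import re

# Matches anything that looks like an HTML tag; good enough for a sketch,
# not a substitute for a real HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def text_to_tag_ratio(html_line: str) -> float:
    """Characters of visible text per tag in one line of HTML (hypothetical feature)."""
    tags = TAG_RE.findall(html_line)
    text = TAG_RE.sub("", html_line).strip()
    # Avoid division by zero for lines without any tag.
    return len(text) / max(len(tags), 1)


menu_line = '<li><a href="/">Home</a></li><li><a href="/news">News</a></li>'
body_line = "<p>Content extraction determines which parts of a webpage hold the main text.</p>"

print(text_to_tag_ratio(menu_line))
print(text_to_tag_ratio(body_line))
```

In this toy input the menu line has a much lower ratio than the paragraph, which is the kind of signal such approaches rely on.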

Approaches based on the rendered image of the webpage:

Position of elements: lateral menus, main content centered and visible.

Less studied, because rendering webpages is computationally expensive.





Approaches based on the DOM tree of the webpage:

Analysis of the DOM structure: difficulty in analysing DIV-based structures.

Comparing several webpages: search for common structures.

Limitations of Current Approaches

Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags) [10].

Some assume that the main content text is continuous [11].

Some assume that the system knows a priori the format of the webpage [10].

Some assume that the whole website to which the webpage belongs is based on the use of some template that is repeated.

Some need to (randomly) load many webpages (several dozen) to compare them.

Example of <div>-based markup:

<h2>Directory</h2>
<div class="vcard">
  <span class="fn">Vicente Ramos</span>
  <div class="org">Software Development</div>
  <div class="adr">
    <div class="street-address">Atmosphere 118</div>
    <span class="locality">La Piedad, México</span>
    <span class="postal-code">59300</span>
  </div>
  <div class="tel">+52 352 52 68499</div>
  <h4>His Company</h4>
  <a class="url" href="page2.html">Company Page</a>
</div>

The main problem of these approaches is a big loss of generality.

They require the webpages to be known or parsed in advance, or they require the webpages to have a particular structure.

This is very inconvenient because modern webpages are mainly based on <div> tags, which do not need to be hierarchically organized (as in table-based design).

Moreover, many webpages are nowadays generated automatically and dynamically, so it is often impossible to analyze them a priori.


Other approaches are able to work:

+ Online (i.e., with any webpage)

+ In real-time (i.e., without the need to preprocess the webpages or know their structure)


The Document Object Model (DOM)

An API that provides programmers with a standard set of objects for the representation of HTML and XML documents.

Given a webpage, its associated DOM structure can be produced fully automatically, and vice versa.

The DOM structure of a given webpage is a tree where all the elements of the webpage (including scripts and CSS styles) are represented hierarchically.

(Figure: example DOM tree with the nodes Body, Div, Table, H1, Image and Text.)

Nodes in the DOM tree can be of two types: tag nodes and text nodes.

Tag nodes represent the HTML tags of an HTML document, and they contain all the information associated with the tags (e.g., their attributes).

Text nodes are always leaves in the DOM tree because they cannot contain other nodes.
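The two node types can be illustrated with a small sketch that builds a simplified tree using Python's standard `html.parser` module. The `Node` class, the `TreeBuilder` class and the `VOID_TAGS` set are hypothetical names introduced here; this is not a full DOM implementation and it assumes well-formed input.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    kind: str                                   # "tag" or "text"
    value: str                                  # tag name, or the text itself
    attrs: dict = field(default_factory=dict)   # only meaningful for tag nodes
    children: list = field(default_factory=list)


# Tags that never have a closing tag (partial list, enough for the sketch).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("tag", "#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node("tag", tag, dict(attrs))
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Simplification: only well-nested input is handled.
        if len(self.stack) > 1 and self.stack[-1].value == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("text", data.strip()))


def parse(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_nodes_are_leaves(node: Node) -> bool:
    if node.kind == "text":
        return not node.children
    return all(text_nodes_are_leaves(child) for child in node.children)


page = parse('<body><h1>Title</h1><div><img src="a.png"><p>Some text</p></div></body>')
body = page.children[0]
print([child.value for child in body.children])
print(text_nodes_are_leaves(page))
```

Tag nodes keep their attributes in `attrs`, while text nodes only carry their text and never receive children.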


I want to know more!

http://www.w3.org/DOM/


Our method for template extraction in a nutshell:

1. Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.

2. Solve conflicts between those webpages that implement different templates by establishing a voting system between the webpages.

3. The template is the intersection between the initial webpage and the DOM trees in the subdigraph. The intersection is computed with an Equal Top-Down Mapping between the DOM trees.

The three steps can be done with a linear cost with respect to the size of the DOM trees.

1. Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.

(Figure: website topology. Labels: Menu, Submenu, Domain A, Domain B, Domain C, hyperlink distance, DOM distance.)
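The slides do not say how the complete subdigraph is found, so the following is only a sketch under assumptions: the website is given as a hypothetical dictionary `links` from each page to the set of pages it links to, and a greedy pass collects pages that all link to each other in both directions, starting from the initial webpage. A greedy pass is not guaranteed to find the largest such set.

```python
def complete_subdigraph(links: dict[str, set[str]], start: str, size: int) -> list[str]:
    """Greedily grow a set of pages, beginning with `start`, in which every
    pair of pages links to each other in both directions."""
    chosen = [start]
    # Sorted only to make the result deterministic.
    for candidate in sorted(links.get(start, ())):
        if len(chosen) == size:
            break
        if candidate in chosen:
            continue
        mutual = all(
            candidate in links.get(page, ()) and page in links.get(candidate, ())
            for page in chosen
        )
        if mutual:
            chosen.append(candidate)
    return chosen


# Hypothetical site: the menu pages link to each other, the article does not
# receive a link from every menu page.
links = {
    "home": {"news", "about", "contact", "article1"},
    "news": {"home", "about", "contact", "article1"},
    "about": {"home", "news", "contact"},
    "contact": {"home", "news", "about"},
    "article1": {"home", "news", "about", "contact"},
}

print(complete_subdigraph(links, "home", 4))
```

Pages reachable from a shared menu tend to link to one another, which is why mutually linked pages are plausible candidates for sharing a template.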

2. Solve conflicts between those webpages that implement different templates by establishing a voting system between the webpages.

3. The template is the intersection between the initial webpage and the DOM trees in the subdigraph. The intersection is computed with an Equal Top-Down Mapping between the DOM trees.
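Step 3 can be sketched as follows, assuming that an equal top-down mapping pairs two nodes only when they are equal and their parents are also paired. Node equality is taken here to mean "same tag and same attributes", and children are matched greedily from left to right; both choices, as well as the names `Node`, `same`, `etdm_intersection` and `outline`, are assumptions of this sketch rather than the authors' definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    attrs: tuple = ()
    children: list = field(default_factory=list)


def same(a: Node, b: Node) -> bool:
    # Assumed notion of node equality: identical tag and attributes.
    return a.tag == b.tag and a.attrs == b.attrs


def etdm_intersection(a: Node, b: Node):
    """Return the part of tree `a` that is mapped onto tree `b`, or None
    when the two roots are not equal."""
    if not same(a, b):
        return None
    result = Node(a.tag, a.attrs)
    next_free = 0  # children of `b` before this index are already used
    for child in a.children:
        for k in range(next_free, len(b.children)):
            common = etdm_intersection(child, b.children[k])
            if common is not None:
                result.children.append(common)
                next_free = k + 1  # keep the left-to-right order
                break
    return result


def outline(node: Node) -> str:
    if not node.children:
        return node.tag
    return f"{node.tag}({', '.join(outline(c) for c in node.children)})"


# Two hypothetical pages of the same site.
page1 = Node("html", children=[Node("body", children=[
    Node("div", children=[Node("table"), Node("p")]),
    Node("table"),
])])
page2 = Node("html", children=[Node("body", children=[
    Node("div", children=[Node("p"), Node("p")]),
    Node("table", children=[Node("p")]),
])])

template = etdm_intersection(page1, page2)
print(outline(template))
```

For more than two pages, the same function could be folded over the DOM trees of the subdigraph, intersecting the running result with each tree in turn.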

(Figure: the DOM trees of five webpages, P1 to P5; each tree contains the nodes Body, Div, Table, H1, Image and Text.)

Mapping:

(Figure: two DOM trees with HTML, Body, Div, Table and P nodes.)

Top-Down Mapping:

(Figure: two DOM trees with HTML, Body, Div, Table and P nodes.)

Equal Top-Down Mapping:

(Figure: two DOM trees with HTML, Body, Div, Table and P nodes.)


Experiments

Benchmarks: online heterogeneous webpages.

Domains with different layouts and page structures.

Company websites, news articles, forums, etc.

The final evaluation set was randomly selected.

We determined the actual template of each webpage by downloading it and manually selecting the template.

The DOM tree of the selected elements was then produced and used later for comparison in the evaluation.

The F1 metric is computed as (2*P*R)/(P+R), where P is the precision and R is the recall.
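As a small worked example of the F1 formula, the sketch below compares a set of retrieved template nodes against a manually selected set. Representing nodes as strings, and the node identifiers themselves, are assumptions made only for illustration.

```python
def precision_recall_f1(retrieved: set, expected: set) -> tuple[float, float, float]:
    """Precision, recall and F1 of `retrieved` with respect to `expected`."""
    hits = len(retrieved & expected)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = (2 * precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical node identifiers.
retrieved = {"div#menu", "div#header", "div#footer", "p#intro"}
expected = {"div#menu", "div#header", "div#footer", "div#sidebar", "ul#links", "img#logo"}

p, r, f1 = precision_recall_f1(retrieved, expected)
print(p, r, f1)
```

Here three of the four retrieved nodes are correct (P = 0.75) and three of the six expected nodes are found (R = 0.5), so F1 = (2 * 0.75 * 0.5) / 1.25 = 0.6.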


Conclusions and future work

Conclusions:

• New technique proposed for template extraction:

1. It does not make assumptions about the particular structure of webpages.

2. It only needs to process a single webpage (no templates, no other webpages of the same website are needed).

3. No preprocessing stages are needed. The technique can work online.

4. It is fully language independent (it can work with pages written in English, German, etc.).

5. The particular text formatting of the webpage does not influence the performance of the technique.

Future Work:

1. Consider that a website can implement several templates across its webpages:

• Extend the benchmark suite by labelling all templates.

• A new technique to detect all templates of a website.

2. Combine template extraction with content extraction:

• First, apply template extraction to remove the template, and

• Second, look for the main content in the remaining webpage.


Thank You