Analysis of DOM Structures for Site-Level Template Extraction
(PSI 2015)
Joint work done in collaboration with Julián Alarte, Josep Silva, Salvador Tamarit
Contents
Motivation
• Content Extraction and Block Detection
• Template Extraction
A Technique for Template Extraction
• State of the art
• The DOM tree
• Template extraction based on DOM
Experiments
• Firefox plugin online DEMO
Conclusions and Future Work
Information Retrieval
Web Mining
Template Detection
Content Extraction
Block Detection
Motivation
What is content extraction?
Content extraction is the process of determining which parts of a webpage contain the main textual content, ignoring additional context such as menus, status bars, advertisements, sponsored information, etc.
What is block detection?
Block detection is the discipline that tries to isolate every information block in a webpage.
The date is different. The title is different.
Why is template extraction useful?
• Component reuse: web developers can automatically extract components from a webpage.
• Enhancing indexers and text analyzers: their performance increases when they process only relevant information. It has been measured that roughly 40-50% of the components of a webpage belong to the template.
• Extraction of the main content of a webpage so that it can be suitably displayed on a small device such as a PDA or a mobile phone.
• Extraction of the relevant content to make the webpage more accessible to visually impaired or blind users.
The Technique
What is a webpage?
Three main ways to solve the problem:
• Using the textual information of the webpage (i.e., the HTML code)
• Using the rendered image of the webpage in the browser
• Using the DOM tree of the webpage
State of the Art
Using the textual information of the webpage (i.e., the HTML code):
• Densitometric features: counting characters and tags
• Statistics on terms: some terms are common in templates
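As an illustration of what "counting characters and tags" can mean, the following minimal sketch computes a text-to-tag ratio per line of HTML. It is only one possible densitometric feature, not the method of any specific cited work; the function name `text_to_tag_ratio` and the two sample lines are hypothetical.

```python
import re

# Matches any HTML tag, opening or closing (a deliberately naive pattern).
TAG = re.compile(r"<[^>]+>")

def text_to_tag_ratio(line: str) -> float:
    """Characters of visible text divided by the number of tags on the line."""
    tags = len(TAG.findall(line))
    text = TAG.sub("", line).strip()
    return len(text) / max(tags, 1)  # avoid division by zero on tag-free lines

# Hypothetical sample lines: a menu entry and a content paragraph.
menu_line = '<li><a href="/">Home</a></li>'
content_line = "<p>Template extraction isolates the shared layout.</p>"
ratios = [text_to_tag_ratio(menu_line), text_to_tag_ratio(content_line)]
```

Under this feature, lines dominated by markup (typical of menus) tend to score low, while lines of running text score high.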
Using the rendered image of the webpage in the browser:
• Position of elements: lateral menus, main content centered and visible
• Less studied: rendering webpages is computationally expensive
Using the DOM tree of the webpage:
• Analysis of the DOM structure: difficulty in analysing DIV-based structures
• Comparing several webpages: search for common structures
Limitations of Current Approaches
• Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags) [10].
• Some assume that the main content text is continuous [11].
• Some assume that the system knows a priori the format of the webpage [10].
• Some assume that the whole website to which the webpage belongs is based on the use of some template that is repeated.
• Some need to (randomly) load many webpages (several dozens) to compare them.
<h2>Directory</h2>
<div class="vcard">
  <span class="fn">Vicente Ramos</span>
  <div class="org">Software Development</div>
  <div class="adr">
    <div class="street-address">Atmosphere 118</div>
    <span class="locality">La Piedad, México</span>
    <span class="postal-code">59300</span>
  </div>
  <div class="tel">+52 352 52 68499</div>
  <h4>His Company</h4>
  <a class="url" href="page2.html">Company Page</a>
</div>
The main problem of these approaches is a significant loss of generality.
They require knowing or parsing the webpages in advance, or they require the webpage to have a particular structure.
This is very inconvenient because modern webpages are mainly based on <div> tags, which do not need to be hierarchically organized (as in table-based design).
Moreover, many webpages are nowadays generated automatically and dynamically, so it is often impossible to analyze them a priori.
Other approaches are able to work:
+ Online (i.e., with any webpage)
+ In real-time (i.e., without the need to preprocess the webpages or know their structure)
The Document Object Model (DOM)
An API that provides programmers with a standard set of objects for the representation of HTML and XML documents.
Given a webpage, its associated DOM structure can be produced fully automatically, and vice versa.
The DOM structure of a given webpage is a tree in which all the elements of the webpage (including scripts and CSS styles) are represented hierarchically.
[Figure: example DOM tree with Body, Div, H1, Table, Image and Text nodes]
Nodes in the DOM tree can be of two types: tag nodes and text nodes.
Tag nodes represent the HTML tags of an HTML document and contain all the information associated with the tags (e.g., their attributes).
Text nodes are always leaves in the DOM tree because they cannot contain other nodes.
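The distinction between tag nodes and text nodes can be made concrete with a small sketch. It uses Python's standard `html.parser` module to build a simplified tree; this is not a browser DOM, and the `DomBuilder` class, the dictionary layout of the nodes and the sample HTML are assumptions made only for illustration.

```python
from html.parser import HTMLParser

class DomBuilder(HTMLParser):
    """Builds a simplified DOM-like tree: tag nodes have children, text nodes do not."""
    VOID = {"img", "br", "hr", "input", "meta", "link"}  # tags without a closing tag

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in self.VOID:
            self.stack.append(node)  # later nodes become children of this tag

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            # Text nodes carry no "children" key: they are always leaves.
            self.stack[-1]["children"].append({"text": data.strip()})

def text_nodes(node):
    """Collect the text of all text nodes in document order."""
    if "text" in node:
        return [node["text"]]
    return [t for child in node["children"] for t in text_nodes(child)]

# Hypothetical sample page.
html = ('<body><h1>News</h1><div><img src="logo.png">'
        '<table><tr><td>Menu</td></tr></table></div></body>')
builder = DomBuilder()
builder.feed(html)
builder.close()
body = builder.root["children"][0]
```

Here `body["children"]` holds the tag nodes for `h1` and `div`, and the attributes of the image are stored with its tag node.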
I want to know more!
http://www.w3.org/DOM/
Our method for template extraction in a nutshell:
1. Identify a set of webpages in the website topology: select those nodes that belong to the menu, using a complete subdigraph.
2. Solve conflicts between those webpages that implement different templates by establishing a voting system between the webpages.
3. The template is the intersection between the initial webpage and the DOM trees in the subdigraph. The intersection is computed with an Equal Top-Down Mapping between the DOM trees.
The three steps can be done with a linear cost with respect to the size of the DOM trees.
1. Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.
[Figure: website topology showing a Menu, a Submenu, and Domains A, B and C]
[Figures: hyperlink distance; DOM distance]
2. Solve conflicts between those webpages that implement different templates by establishing a voting system between the webpages.
3. The template is the intersection between the initial webpage and the DOM trees in the subdigraph. The intersection is computed with an Equal Top-Down Mapping between the DOM trees.
[Figure: the DOM trees of five webpages, P1 to P5, each with Body, Div, H1, Table, Image and Text nodes]
Mapping:
[Figure: a mapping between two DOM trees whose nodes are HTML, Body, Div, Table and P]
Top-Down Mapping:
[Figure: a top-down mapping between the same two DOM trees]
Equal Top-Down Mapping:
[Figure: an equal top-down mapping between the same two DOM trees]
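The idea behind an Equal Top-Down Mapping can be sketched as follows: two nodes may be mapped only if they are equal and their parents are also mapped, so the mapping always grows from the roots downwards. The sketch assumes that "equal" means "same label", pairs children greedily from left to right, and does not attempt the linear cost mentioned in the slides. The `Node` class and the function `etdm_intersection` are hypothetical names.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    label: str  # tag name, or "#text" for a text node
    children: list[Node] = field(default_factory=list)

def equal(a: Node, b: Node) -> bool:
    # Assumption: nodes are equal when their labels match.
    return a.label == b.label

def etdm_intersection(a: Node, b: Node) -> Optional[Node]:
    """Return the part of tree `a` covered by a greedy equal top-down mapping onto `b`."""
    if not equal(a, b):
        return None  # unequal nodes are never mapped, nor are their descendants
    mapped = Node(a.label)
    used = set()  # children of `b` that are already mapped
    for child_a in a.children:
        for i, child_b in enumerate(b.children):
            if i in used:
                continue
            sub = etdm_intersection(child_a, child_b)
            if sub is not None:
                used.add(i)
                mapped.children.append(sub)
                break
    return mapped

def as_tuple(node: Node):
    """Nested-tuple view of a tree, convenient for comparisons."""
    return (node.label, tuple(as_tuple(c) for c in node.children))

# Two hypothetical pages sharing part of their structure.
page1 = Node("Body", [Node("Div", [Node("Table"), Node("P")]),
                      Node("Table", [Node("P")])])
page2 = Node("Body", [Node("Div", [Node("Table")]),
                      Node("Table", [Node("P"), Node("P")]),
                      Node("P")])
template = etdm_intersection(page1, page2)
```

In this example the candidate template keeps `Body`, `Div` with its `Table`, and `Table` with one `P`; the `P` under `Div` in the first page has no counterpart in the second page and is dropped.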
Experiments
Benchmarks: online heterogeneous webpages
• Domains with different layouts and page structures
• Company websites, news articles, forums, etc.
• Final evaluation set randomly selected
We determined the actual template of each webpage by downloading it and manually selecting the template.
The DOM tree of the selected elements was then produced and used later for comparison in the evaluation.
The F1 metric is computed as (2*P*R)/(P+R), where P is the precision and R is the recall.
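The F1 formula above can be applied as in the following sketch. The slides do not define P and R at the node level, so this example assumes the usual definitions over sets of DOM nodes: precision is the fraction of retrieved nodes that belong to the manually selected template, and recall is the fraction of template nodes that were retrieved. The node identifiers are made up.

```python
def precision_recall_f1(retrieved: set, gold: set) -> tuple:
    """Precision, recall and F1 of a retrieved node set against a gold-standard set."""
    hits = len(retrieved & gold)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(gold) if gold else 0.0
    f1 = (2 * p * r) / (p + r) if p + r else 0.0  # F1 = (2*P*R)/(P+R)
    return p, r, f1

# Hypothetical node identifiers.
gold_template = {1, 2, 3, 4}          # nodes manually marked as template
retrieved_template = {2, 3, 4, 5, 6}  # nodes returned by the technique
p, r, f1 = precision_recall_f1(retrieved_template, gold_template)
```

The guards for empty sets are a design choice of this sketch so that degenerate inputs yield 0.0 instead of raising an error.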
Conclusions and future work
Conclusions:
• New technique proposed for template extraction:
1. It does not make assumptions about the particular structure of webpages.
2. It only needs to process a single webpage (no templates, no other webpages of the same website are needed).
3. No preprocessing stages are needed. The technique can work online.
4. It is fully language independent (it can work with pages written in English, German, etc.).
5. The particular text formatting of the webpage does not influence the performance of the technique.
Future Work:
1. Consider that a website can implement several templates across its webpages:
• Extend the benchmark suite by labelling all templates.
• A new technique to detect all templates of a website.
2. Combine template extraction with content extraction:
• First, apply template extraction to remove the template, and
• Second, look for the main content in the remaining webpage.
Thank You