IEPAD: Information Extraction Based on Pattern Discovery

Chia-Hui Chang, Shao-Chen Lui
Dept. of Computer Science and Information Engineering
National Central University


WWW10 ’01

Introduction (1/4)

April 21, 2023

Introduction (2/4)


Great need for value-added services that integrate information from multiple sources
  Customizable Web information gathering robots/crawlers
  Comparison-shopping agents
  Meta-search engines
  Newsbots

Suppose the data has been collected from different Web sites…
  Write an extractor program to extract the contents of the Web pages
  Observe the extraction rules by hand
  Write a program for each Web site

Since the format of Web pages is often subject to change, maintaining the wrapper can be expensive and impractical

→ labor-intensive!

Introduction (3/4)


Related work

Tools that can generate wrappers automatically
  Machine learning techniques to summarize extraction rules
    Ex: WIEN, Softmealy, Stalker
    Designer must manually label the beginning and the end of the training examples for generating the rules
    Manual labeling is time-consuming and not efficient enough

Fully automated wrapper construction
  Without users’ training examples
  Ex: One-tag separator approach (Embley et al.)
    Discovers record boundaries in Web documents by identifying candidate separator tags using five independent heuristics
    A problem arises when the separator tag is also used inside a record, not only at the boundary

Introduction (4/4)


Eliminate human intervention by pattern mining

The motivation comes from the observation that useful information in a Web page is often placed in a structure having a particular alignment and order
  Ex: Web pages produced by search engines generally present search results in regular and repetitive patterns

Mining repetitive patterns may discover the extraction rules for wrappers

System Overview (1/3)


The IEPAD system includes three components:
  An extraction rule generator
    Accepts an input Web page
  A graphical user interface
    Called the pattern viewer
    Shows the repetitive patterns discovered
  An extractor module
    Extracts desired information from similar Web pages according to the extraction rule chosen by the user


System Overview (2/3)

The extraction rule generator includes:
  Translator
  PAT tree constructor
  Pattern discoverer
  Pattern validator
  Extraction rule composer

The output of the rule generator is the set of extraction rules discovered in a Web page


System Overview (3/3)

1. The user submits an HTML page

2. The translator receives the page and translates it into a string of abstract representations (tokens)

3. The PAT tree constructor receives the binary-encoded token string and constructs a PAT tree

4. The pattern discoverer uses the PAT tree to discover repetitive patterns, called maximal repeats

5. The pattern validator filters out undesired patterns and produces candidate patterns

6. The rule composer revises each candidate pattern to form an extraction rule expressed as a regular expression

Extraction Rule Generator (1/2)


Desired information in a Web page is often placed in a structure having a particular alignment and forms repetitive patterns
  These patterns may constitute the extraction rules for wrappers

Repetitive pattern: any substring that occurs at least twice in the encoded token string
  Too many patterns fit this requirement
  Define maximal repeats to uniquely identify the longest pattern

Extraction Rule Generator (2/2)

The definition of maximal repeats is necessary for identifying well-used, popular repeated terms

Maximal repeats have to be further verified by the validator to keep only the interesting ones
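
A minimal sketch of what "maximal repeat" means, using brute force rather than a PAT tree. It assumes the usual definition, which the slides do not spell out: a substring that occurs at least twice and whose occurrences differ in at least one preceding symbol (left diverse) and at least one following symbol. Tokens are single characters here, the function name `maximal_repeats` is hypothetical, and the string boundary counts as its own symbol.

```python
from collections import defaultdict


def maximal_repeats(s, min_len=1, min_count=2):
    """Return {pattern: [start positions]} for every maximal repeat in s.

    Brute-force illustration only (quadratic number of substrings);
    IEPAD itself uses a PAT tree for this step.
    """
    n = len(s)
    positions = defaultdict(list)
    for length in range(min_len, n):
        for i in range(n - length + 1):
            positions[s[i:i + length]].append(i)

    result = {}
    for sub, occ in positions.items():
        if len(occ) < min_count:
            continue
        # Symbols just before / just after each occurrence (None = boundary).
        lefts = {s[p - 1] if p > 0 else None for p in occ}
        rights = {s[p + len(sub)] if p + len(sub) < n else None for p in occ}
        # Keep the pattern only if it cannot be extended on either side
        # without losing at least one occurrence.
        if len(lefts) > 1 and len(rights) > 1:
            result[sub] = occ
    return result


if __name__ == "__main__":
    found = maximal_repeats("adcwbdadcxbadcxbdadcb", min_len=3)
    for pattern, occ in sorted(found.items()):
        print(pattern, occ)
```

The returned positions are what the later regularity check would consume; the occurrence count is just the length of each list.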


Translator (1/2)

HTML page → token string
Two kinds of tokens:
  Tag token: Html(<tag_name>)
  Text token: the non-tag text between two tags is treated as a single token, Text(_)
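
The translation step can be sketched with Python's standard `html.parser`. This is an illustration under assumptions, not the authors' translator: the class name `TokenTranslator`, the helper `translate`, the upper-casing of tag names, and the sample input are all hypothetical choices, and whitespace-only text is ignored.

```python
from html.parser import HTMLParser


class TokenTranslator(HTMLParser):
    """Turn an HTML page into a list of tag tokens and text tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped; only the tag name is kept.
        self.tokens.append(f"Html(<{tag.upper()}>)")

    def handle_endtag(self, tag):
        self.tokens.append(f"Html(</{tag.upper()}>)")

    def handle_data(self, data):
        # All text between two tags collapses into one Text(_) token.
        if data.strip() and (not self.tokens or self.tokens[-1] != "Text(_)"):
            self.tokens.append("Text(_)")


def translate(html):
    parser = TokenTranslator()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    print("".join(translate("<B>Congo</B><I>242</I><BR>")))
```

An encoding scheme that keeps only some tags could be modeled by skipping unwanted tags in `handle_starttag` and `handle_endtag`.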


Translator (2/2)

Example: Congo code

PAT Tree Construction

Sistring: 000110001010110011100$

Internal nodes hold a bit position in the encoded bit string
  Used when locating a given sistring in the PAT tree

The PAT tree stores all its data in external nodes
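
A small sketch of the input a PAT tree is built from: a fixed-width binary encoding of the token string and the sistrings (semi-infinite strings) that start at each token boundary. It does not build the tree itself. The 3-bit code table `CODES` is hypothetical, chosen only so that a seven-token example reproduces the bit string shown above; the function names are assumptions as well.

```python
# Hypothetical 3-bit code table (not taken from the slides).
CODES = {
    "Html(<B>)": "000",
    "Html(</B>)": "001",
    "Html(<I>)": "010",
    "Html(</I>)": "011",
    "Html(<BR>)": "100",
    "Text(_)": "110",
}


def encode(tokens, codes=CODES):
    """Concatenate the fixed-width code of every token."""
    return "".join(codes[t] for t in tokens)


def sistrings(bits, width=3):
    """Return (bit offset, suffix) pairs, one per token boundary."""
    return [(i, bits[i:] + "$") for i in range(0, len(bits), width)]


def first_diff_bit(a, b):
    """1-based position of the first bit where two sistrings differ."""
    for k, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return k
    return min(len(a), len(b)) + 1


if __name__ == "__main__":
    tokens = ["Html(<B>)", "Text(_)", "Html(</B>)",
              "Html(<I>)", "Text(_)", "Html(</I>)", "Html(<BR>)"]
    bits = encode(tokens)
    print(bits + "$")
    for offset, suffix in sistrings(bits):
        print(offset, suffix)
```

The position returned by `first_diff_bit` is the kind of value an internal node would hold: the bit at which the sistrings below that node stop sharing a prefix.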

Pattern Discoverer (1/2)


Pattern Discoverer (2/2)

Record not only the maximal repeats but also their occurrence counts, reference positions, and pattern lengths

Ex: To find all patterns longer than 3 tokens, note that each token is encoded in 3 bits, so only the internal nodes with index bit > 3*3 = 9 need to be checked: d, e, g, l, m
  Among these, only d is left diverse, so d yields the maximal repeat

Pattern Validator (1/2)

A typical Web page usually contains a large number of maximal repeats
  Not all of them are useful!

The validator uses 3 criteria to decide which maximal repeats are useful

Regularity
  Measured as the standard deviation of the intervals between adjacent occurrences, divided by the mean of those intervals
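
The regularity measure can be written down directly. This sketch assumes the population standard deviation (the slides do not say which variant is used) and sorted, distinct occurrence positions; the function name `regularity` is hypothetical.

```python
from statistics import mean, pstdev


def regularity(positions):
    """Standard deviation of the gaps between adjacent occurrences,
    divided by the mean gap. Values near 0 mean evenly spaced records."""
    intervals = [b - a for a, b in zip(positions, positions[1:])]
    if not intervals:
        return 0.0  # fewer than two occurrences: nothing to measure
    return pstdev(intervals) / mean(intervals)


if __name__ == "__main__":
    print(regularity([0, 5, 10, 15]))    # evenly spaced occurrences
    print(regularity([0, 6, 11, 17]))    # slightly uneven spacing
    print(regularity([0, 5, 10, 100]))   # one large gap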

Pattern Validator (2/2)

Three thresholds are used to filter out maximal repeats that fail the criteria
  Maximal repeats that contain no Text token are also filtered out

Occurrence Partition

Special case:
  The pattern of the target information forms three information blocks in the Web page
  Because regularity is measured over all instances, regularity → large!

Partition the occurrences into segments
  Each segment's regularity must be below a threshold, which is set to a small value close to zero
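
One way to partition occurrences, sketched under assumptions: a greedy pass that grows a segment while its regularity stays below the threshold and starts a new segment otherwise. The slides do not give the actual partitioning procedure, so the greedy strategy, the function names, and the default threshold of 0.1 are all hypothetical.

```python
from statistics import mean, pstdev


def regularity(positions):
    """Standard deviation of adjacent gaps divided by the mean gap."""
    intervals = [b - a for a, b in zip(positions, positions[1:])]
    if not intervals:
        return 0.0
    return pstdev(intervals) / mean(intervals)


def partition(positions, threshold=0.1):
    """Greedily split sorted occurrence positions into regular segments."""
    segments = []
    current = []
    for p in positions:
        # Adding p would make the segment irregular: close it, start anew.
        if current and regularity(current + [p]) >= threshold:
            segments.append(current)
            current = [p]
        else:
            current.append(p)
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    # Three information blocks separated by large gaps.
    occurrences = [0, 5, 10, 100, 105, 110, 300, 305, 310]
    print(regularity(occurrences))
    print(partition(occurrences))
```

Measured over all nine occurrences the regularity is large, while each of the three resulting segments has regularity 0.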

Rule Composer


Find a good representation of the critical common features of multiple strings

Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb”

Multiple alignment for strings
  The extraction pattern can be generalized as “adc[w|x]b[d|-]”
  Records are assumed to be contiguous; if the alternatives exceed 10, the maximal repeat is still used

Center String Algorithm
  Approximation, reduces time complexity

Another problem
  The generated pattern is “c1c2c3...cn”, but the actual pattern is “cjcj+1cj+2...cnc1c2...cj–1”
  Consider the records that begin with cj, and check whether “cjcj+1cj+2...cnc1c2...cj–1” is the correct pattern
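
A sketch of the generalization step once an alignment is available. It assumes the strings are already aligned with "-" marking a gap; the three rows used below are an assumption chosen to reproduce the pattern on the slide, and the alignment itself (for example by the center string algorithm) is not implemented. The helper `rotations` lists the cyclic shifts that the second problem asks to check. Both function names are hypothetical.

```python
def generalize(aligned_rows):
    """Collapse equal-length aligned strings into one pattern.

    Columns with a single symbol stay literal; columns with several
    symbols become [a|b]; a gap '-' is listed last as an alternative.
    """
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("aligned rows must have equal length")
    parts = []
    for column in zip(*aligned_rows):
        seen = list(dict.fromkeys(column))  # unique, in first-seen order
        if len(seen) == 1:
            parts.append(seen[0])
        else:
            symbols = [c for c in seen if c != "-"]
            if "-" in seen:
                symbols.append("-")
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)


def rotations(pattern):
    """All cyclic shifts c_j ... c_n c_1 ... c_(j-1) of a pattern."""
    return [pattern[j:] + pattern[:j] for j in range(len(pattern))]


if __name__ == "__main__":
    print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
    print(rotations("adcxb"))
```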

The Extractor (1/2)


1. Two patterns are discovered

2. Shows the detail measures of the selected pattern

The Extractor (2/2)


3. The selected pattern is then forwarded to the extractor for pattern recognition and extraction
  If the PAT tree has already been constructed: search in the PAT tree
    Searching in a PAT tree is fast, since every subtree of a PAT tree has all its sistrings with a common prefix → efficient, linear-time
  Else: use a pattern-matching algorithm or a finite state machine for the extraction rule (regular expression)
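
The regular-expression route can be sketched with Python's `re` module. The sketch assumes single-character tokens and the rule notation seen in the rule composer example, where "[w|x]" lists alternatives and "-" inside brackets means the position may be empty; `rule_to_regex` and `extract` are hypothetical helper names, and a bracket group is assumed to contain at least one real symbol.

```python
import re


def rule_to_regex(rule):
    """Translate a rule such as 'adc[w|x]b[d|-]' into a compiled regex."""
    out = []
    i = 0
    while i < len(rule):
        if rule[i] == "[":
            j = rule.index("]", i)
            alternatives = rule[i + 1:j].split("|")
            optional = "-" in alternatives  # '-' marks a possible gap
            body = "|".join(re.escape(a) for a in alternatives if a != "-")
            out.append(f"(?:{body})" + ("?" if optional else ""))
            i = j + 1
        else:
            out.append(re.escape(rule[i]))
            i += 1
    return re.compile("".join(out))


def extract(rule, token_string):
    """Return (start position, matched record) for every match."""
    regex = rule_to_regex(rule)
    return [(m.start(), m.group()) for m in regex.finditer(token_string)]


if __name__ == "__main__":
    for start, record in extract("adc[w|x]b[d|-]", "adcwbdadcxbadcxbdadcb"):
        print(start, record)
```

In this example the final "adcb" is not matched, because the rule requires "w" or "x" after "adc"; in a real page each matched span would be mapped back to the text between the corresponding tags.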

Experiments (1/3)


14 search engines, each with 10 Web pages

All-tag encoding scheme

Fixed min. length = 3
Min. frequency = 5

Experiments (2/3)

Recall and precision

Encoding Scheme

0.4%

A pattern may contain only a portion of the data record

Experiments (3/3)


Occurrence partition

Multiple string alignment

Lycos → 92%

Summary

Presented an unsupervised approach for pattern discovery in the encoded token string of Web pages

Discovered maximal repeats are filtered by the measures of regularity and compactness
  Regularity higher than threshold → occurrence partition

Multiple string alignment is applied to patterns to generalize multiple records
  Express the extraction rules in regular expressions

High retrieval rate and accuracy rate
No human intervention and no training examples
Takes only 3 minutes to extract 140 pages → quick and efficient!

