Machine-learning based Semi-structured IE
Chia-Hui Chang
Department of Computer Science & Information Engineering
National Central University
Wrapper Induction
Wrapper: a program that extracts desired information from Web pages.
Semi-structured document → (wrapper) → structured information
Web wrappers wrap:
  "Query-able" or "search-able" Web sites
  Web pages with large itemized lists
The primary issue: how can the extractor be built quickly?
Semi-structured IE
Developed independently of traditional IE
Driven by the need to extract and integrate data from multiple Web-based sources
Machine-Learning Based Approach
A key component of IE systems is a set of extraction patterns that can be generated by machine
learning algorithms.
Related Work
Shopbot: Doorenbos, Etzioni, Weld, AA-97
Ariadne: Ashish, Knoblock, Coopis-97
WIEN: Kushmerick, Weld, Doorenbos, IJCAI-97
SoftMealy wrapper representation: Hsu, IJCAI-99
STALKER: Muslea, Minton, Knoblock, AA-99 (a hierarchical FST)
WIEN
N. Kushmerick, D. S. Weld, R. Doorenbos, University of Washington, 1997
http://www.cs.ucd.ie/staff/nick/
Example 1
Extractor for Example 1
HLRT
Wrapper Induction
Induction: the task of generalizing from labeled examples to a hypothesis
Instances: pages
Labels: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)}
Hypotheses: e.g. (<p>, <HR>, <B>, </B>, <I>, </I>)
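A minimal sketch of how an HLRT hypothesis such as the one above could be executed against a page. The page text is an assumption, reconstructed to fit the labels on this slide, and the function name exec_hlrt is hypothetical. The head delimiter is written <P> here to match the assumed page.

```python
def exec_hlrt(page, head, tail, delimiters):
    """Run an HLRT wrapper: a head, a tail, and one (left, right) pair per attribute."""
    pos = page.find(head)
    if pos < 0:
        return []
    pos += len(head)                      # skip past the head delimiter
    first_left = delimiters[0][0]
    tuples = []
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no further tuple starts before the tail delimiter.
        if next_left < 0 or 0 <= next_tail < next_left:
            break
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start < 0:
                return tuples
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return tuples
            row.append(page[start:end])   # the text between left and right
            pos = end + len(right)
        tuples.append(tuple(row))
    return tuples


# Assumed example page (not taken from the slides).
PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<B>Belize</B> <I>501</I><BR>"
    "<B>Spain</B> <I>34</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)
LR_PAIRS = [("<B>", "</B>"), ("<I>", "</I>")]

print(exec_hlrt(PAGE, "<P>", "<HR>", LR_PAIRS))
```

The head delimiter keeps the bold page heading from being read as a country, and the tail delimiter keeps the bold "End" marker from starting a fifth tuple.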
BuildHLRT
Other Family
OCLR (Open-Close-Left-Right): uses Open and Close as delimiters for each tuple
HOCLRT: combines OCLR with Head and Tail
N-LR and N-HLRT: nested LR and nested HLRT
Terminology
Oracles: Page Oracle, Label Oracle
PAC analysis determines how many examples are necessary to build a wrapper, given two parameters, accuracy $\epsilon$ and confidence $\delta$:
$\Pr[E(w) < \epsilon] > 1 - \delta$, or $\Pr[E(w) > \epsilon] < \delta$
Probably Approximately Correct (PAC) Analysis
With $\epsilon = 0.1$, $\delta = 0.1$, K = 4, and an average of 5 tuples per page, BuildHLRT must examine at least 72 examples.
[Equation: PAC bound on the number of examples in terms of $\epsilon$, $\delta$, and K]
Empirical Evaluation
Extracts 48% of Web pages successfully.
Weaknesses: missing attributes, attributes not in order, tabular data, etc.
SoftMealy
Chun-Nan Hsu, Ming-Tzung Dung, 1998
Arizona State University
http://kaukoai.iis.sinica.edu.tw/~chunnan/mypublications.html
Softmealy Architecture
Finite-State Transducers for Semi-Structured Text Mining
Labeling: use an interface to label examples manually.
Learner: FST (Finite-State Transducer)
Extractor
Demonstration: http://kaukoai.iis.sinica.edu.tw/video.html
Softmealy Wrapper
SoftMealy wrapper representation
Uses a finite-state transducer in which each distinct attribute permutation is encoded as a successful path.
Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes.
Example
Four cases
Label the Answer Key
Finite State Transducer
[Figure: finite-state transducer with states b, U, -U, N, -N, A, -A, M, and e; transitions are labeled extract or skip.]
Additionally handles two more cases: (N, M) and (N, A, M).
Find the starting position -- Single Pass
Newly added definitions
Context-based Rule Learning
Tokens
Separators
  SL ::= … Punc(,) Spc(1) Html(<I>)
  SR ::= C1Alph(Professor) Spc(1) OAlph(of) …
Rule generalization
Taxonomy tree
Tokens
All-uppercase string: CAlph
An uppercase letter followed by at least one lowercase letter: C1Alph
A lowercase letter followed by zero or more characters: OAlph
HTML tag: Html
Punctuation symbol: Punc
Control characters: NL(1), Tab(4), Spc(3)
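A rough sketch of a tokenizer for the token classes listed above. The regular expressions are assumptions about what each class covers. The Num class for digit runs is not on this slide and is added only so that numbers are not dropped. Control-character tokens carry a repeat count, as in NL(1) and Spc(3).

```python
import re

# Order matters: earlier patterns win when several could match.
TOKEN_PATTERNS = [
    ("Html", r"</?[A-Za-z][^>]*>"),
    ("NL", r"\n+"),
    ("Tab", r"\t+"),
    ("Spc", r" +"),
    ("C1Alph", r"[A-Z][a-z]+"),   # uppercase letter, then lowercase letters
    ("CAlph", r"[A-Z]+"),         # all-uppercase string
    ("OAlph", r"[a-z][A-Za-z]*"), # starts with a lowercase letter
    ("Num", r"[0-9]+"),           # assumption: not listed on the slide
    ("Punc", r"[^\sA-Za-z0-9]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))
CONTROL = {"NL", "Tab", "Spc"}


def tokenize(text):
    """Return (class, argument) pairs; control tokens carry a repeat count."""
    tokens = []
    for match in MASTER.finditer(text):
        kind, value = match.lastgroup, match.group()
        tokens.append((kind, len(value) if kind in CONTROL else value))
    return tokens


def show(tokens):
    """Render tokens in the notation used on the slides, e.g. Spc(1)."""
    return " ".join(f"{kind}({arg})" for kind, arg in tokens)


print(show(tokenize(", <I>Professor of")))
```

The sample string is chosen to mirror the separator fragments shown earlier, so its output can be compared against the SL and SR examples by eye.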
Rule Generalization
Learning Algorithm
Generalize each column by replacing its tokens with their least common ancestor in the taxonomy tree.
Taxonomy Tree
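A small sketch of column-wise generalization by least common ancestor. The taxonomy below (the Alph and Control groupings under a single Any root) is an assumption made for illustration, since the tree figure itself is not reproduced here. A token is a (class, argument) pair, and (class, None) stands for "any token of that class".

```python
from functools import reduce

# Hypothetical taxonomy: child class -> parent class.
PARENT = {
    "CAlph": "Alph", "C1Alph": "Alph", "OAlph": "Alph",
    "NL": "Control", "Tab": "Control", "Spc": "Control",
    "Alph": "Any", "Control": "Any", "Num": "Any", "Html": "Any", "Punc": "Any",
}


def parent(node):
    kind, value = node
    if value is not None:
        return (kind, None)            # a concrete token generalizes to its class
    if kind in PARENT:
        return (PARENT[kind], None)    # a class generalizes to its parent class
    return None                        # reached the root


def ancestors(node):
    """Path from a node up to the root, starting with the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = parent(node)
    return path


def lca(a, b):
    """Least common ancestor of two nodes (both must lie under the Any root)."""
    seen = set(ancestors(a))
    return next(n for n in ancestors(b) if n in seen)


def generalize(rows):
    """Generalize aligned token sequences column by column."""
    return [reduce(lca, column) for column in zip(*rows)]


rows = [
    [("Punc", ","), ("Spc", 1), ("Html", "<I>"), ("C1Alph", "Professor")],
    [("Punc", ","), ("Spc", 2), ("Html", "<B>"), ("OAlph", "of")],
]
print(generalize(rows))
```

Columns whose tokens agree exactly are left untouched, so the learned rule stays as specific as the examples allow.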
Generating to Extract the Body
The contextual rules for the head and tail separators are:
hL ::= C1Alph(Staff) Html(</H2>) NL(1) Html(<HR>) NL(1) Html(<UL>)
tR ::= Html(</UL>) NL(1) Html(<HR>) NL(1) Html(<ADDRESS>) NL(1) Html(<I>) C1Alph(Please)
More Expressive Power
SoftMealy allows:
  Disjunction
  Multiple attribute orders within tuples
  Missing attributes
  Features of candidate strings
Stalker
I. Muslea, S. Minton, C. Knoblock, University of Southern California
http://www.isi.edu/~muslea/
STALKER
Embedded Catalog (EC) Tree
Leaves (primitive items): the data to be extracted.
Internal nodes (items): a homogeneous list or a heterogeneous tuple.
EC Tree of a page
Extracting Data from a Document
For each node in the EC tree, the wrapper needs a rule that extracts that particular node from its parent.
Additionally, for each list node, the wrapper requires a list iteration rule that decomposes the list into individual tuples.
Advantages:
First, the hierarchical extraction based on the EC tree allows us to wrap information sources that have arbitrarily many levels of embedded data.
Second, as each node is extracted independently of its siblings, our approach does not rely on there being a fixed ordering of the items, and we can easily handle extraction tasks from documents that may have missing items or items that appear in various orders.
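A simplified sketch of hierarchical extraction over an EC tree. Each node carries a rule that locates it inside its parent, and list nodes carry an extra iteration rule. Plain start and end strings stand in for STALKER's landmark-based rules here. The Node class, the helper names, and the sample document are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    start: str                        # text that precedes the item in its parent
    end: str                          # text that follows the item
    children: List["Node"] = field(default_factory=list)
    item_start: Optional[str] = None  # list nodes only: iteration rule
    item_end: Optional[str] = None


def between(text, start, end):
    """Extraction rule: the text between the first start and the next end."""
    i = text.find(start)
    if i < 0:
        return None
    i += len(start)
    j = text.find(end, i)
    return None if j < 0 else text[i:j]


def iterate(text, start, end):
    """List iteration rule: split a list into its individual items."""
    items, pos = [], 0
    while True:
        i = text.find(start, pos)
        if i < 0:
            return items
        i += len(start)
        j = text.find(end, i)
        if j < 0:
            return items
        items.append(text[i:j])
        pos = j + len(end)


def extract_tuple(children, text):
    # Each child is located independently of its siblings.
    return {child.name: extract(child, text) for child in children}


def extract(node, parent_text):
    text = between(parent_text, node.start, node.end)
    if text is None:
        return None                                   # missing item
    if node.item_start is not None:                   # homogeneous list
        items = iterate(text, node.item_start, node.item_end)
        return [extract_tuple(node.children, item) for item in items]
    if node.children:                                 # heterogeneous tuple
        return extract_tuple(node.children, text)
    return text.strip()                               # leaf


# Hypothetical document and the children of its EC tree root.
DOC = ("<p>Name: <b>Yala</b><p>"
       "<ul><li>4000 Colfax, Phoenix, <i>(602) 508-1570</i></li>"
       "<li>523 Vernon, Las Vegas</li></ul>")
EC_TREE = [
    Node("name", "Name: <b>", "</b>"),
    Node("addresses", "<ul>", "</ul>", item_start="<li>", item_end="</li>",
         children=[Node("street", "", ","), Node("phone", "<i>", "</i>")]),
]
RESULT = extract_tuple(EC_TREE, DOC)
print(RESULT)
```

The second address has no phone number, and because each node is extracted from its parent on its own, that item simply comes back as None without disturbing the street.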
Extraction Rules as Finite Automata
• Landmark: a sequence of tokens and wildcards
• Landmark automaton: a non-deterministic finite automaton
Landmark Automata
• A linear LA has one accepting state
• from each non-accepting state, there are exactly two possible transitions: a loop to itself, and a transition to the next state;
• each non-looping transition is labeled by a landmark;
• all looping transitions have the meaning “consume all tokens until you encounter the landmark that leads to the next state”.
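A sketch of a linear landmark automaton as described in the bullets above. It is simulated deterministically by taking the earliest match of each landmark, which simplifies the non-deterministic definition. The tokenizer, the wildcard definitions, and the name skip_to are assumptions for illustration.

```python
import re

TOKEN_RE = re.compile(r"<[^>]+>|\w+|[^\w\s]")

# Assumed wildcard semantics; literal landmark elements must match exactly.
WILDCARDS = {
    "_HtmlTag_": lambda t: t.startswith("<") and t.endswith(">"),
    "_Symbol_": lambda t: len(t) == 1 and not t.isalnum(),
    "_Word_": lambda t: t.isalpha(),
}


def matches(pattern, token):
    test = WILDCARDS.get(pattern)
    return test(token) if test else pattern == token


def skip_to(tokens, landmarks, pos=0):
    """Run a linear LA; return the index just past the last landmark, or None."""
    for landmark in landmarks:
        n = len(landmark)
        while pos + n <= len(tokens):
            window = tokens[pos:pos + n]
            if all(matches(p, t) for p, t in zip(landmark, window)):
                break          # take the transition to the next state
            pos += 1           # loop transition: consume one token
        else:
            return None        # ran out of tokens before the landmark appeared
        pos += n
    return pos


# Hypothetical document fragment.
TOKENS = TOKEN_RE.findall("<p>Name: <b>Yala</b><p>Phone: <i>(602) 508-1570</i>")
RULE = [["Phone", "_Symbol_"], ["_HtmlTag_"]]
START = skip_to(TOKENS, RULE)
print(START, TOKENS[START:])
```

With this rule the automaton loops over tokens until it sees "Phone" followed by a symbol, then moves on at the next HTML tag, leaving the position at the first token of the phone number.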
Rule Generating
1st iteration:
  terminals: {; reservation _Symbol_ _Word_}
  candidates: {; <i> _Symbol_ _HtmlTag_}
  perfect disjuncts: {<i> _HtmlTag_}
  positive examples: D3, D4
2nd iteration:
  uncovered examples: {D1, D2}
  candidates: {; _Symbol_}
Extract Credit info.
Possible Rules
The STALKER Algorithm
Features
Extraction is performed in a hierarchical manner.
Attributes that are not in order pose no problem.
Disjunctive rules handle missing attributes.
Multi-pass SoftMealy
Chun-Nan Hsu and Chian-Chi Chang
Institute of Information Science
Academia Sinica
Taipei, Taiwan
Multi-pass
Tabular style document
(Quote Server)
Tagged-list style document
(Internet Address Finder)
Layout styles and learnability
Tabular style: missing attributes, ordering as hints
Tagged-list style: variant ordering, tags as hints
Prediction: single-pass for tabular style, multi-pass for tagged-list style
Tabular result (Quote Server)
Tagged-list result (Internet Address Finder)
Comparison
Both: can handle irregular missing attributes; attributes not seen before require training.
Single-pass: only a limited set of attribute permutations is allowed; good for tabular pages; faster.
Multi-pass: unaffected by attribute permutations; good for tagged-list pages; slower.
Comparison
Quote Server
  Stalker: 79%, 10 example tuples, 500 test
  WIEN: the collection is beyond the learner's capability
  SoftMealy: multi-pass 85%, single-pass 97%
Internet Address Finder
  Stalker: 80% ~ 100%, 500 test
  WIEN: the collection is beyond the learner's capability
  SoftMealy: multi-pass 68%, single-pass 41%
Comparison
Okra (tabular pages)
  Stalker: 97%, 1 example tuple
  WIEN: 100%, 13 example tuples, 30 test
  SoftMealy: single-pass 100%, 1 example tuple, 30 test
Big-book (tagged-list pages)
  Stalker: 97%, 8 example tuples
  WIEN: perfect, 18 example tuples, 30 test
  SoftMealy: single-pass 97%, 4 examples, 30 test; multi-pass 100%, 6 examples, 30 test
References
Nicholas Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15-68, Special Issue on Intelligent Internet Systems, 2000.
Chun-Nan Hsu and Ming-Tzung Dung. Generating finite-state transducers for semistructured data extraction from the web. Information Systems, 23(8):521-538, Special Issue on Semistructured Data, 1998.
Ion Muslea, Steve Minton, and Craig Knoblock. Hierarchical Wrapper Induction for Semistructured Information Sources. Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001.