
AMBER WWW 2012 Poster



AMBER Architecture

AMBER GUI
1. Webpage with identified records and attributes
2. Domain schema concepts
3. Learned terms with confidence values
4. URLs for analysis
5. Seed Gazetteer

Reasoning: Attribute Alignment, Record Segmentation, Data Area Identification
Annotation: GATE, Domain Gazetteers
Web Access: Browser Common API (WebKit, Mozilla)

AMBER Learning Evaluation

[Charts. Learning Accuracy: learning and total precision/recall over rounds 1–3 (y-axis 20%–100%). Terms learnt: learnt vs. correct locations per round (y-axis 0–300).]

• Learning Locations from 250 pages from 150 sites (UK real estate)

• Starting with 25% of the full gazetteer (containing 33,243 terms)

Table 1: Total learned instances

                 unannotated instances (328)    total instances (1484)
rnd.   aligned   corr.    prec.     rec.        prec.     rec.
1      226       196      86.7%     59.2%       84.7%     81.6%
2      261       248      95.0%     74.9%       93.2%     91.0%
3      271       265      97.8%     80.6%       95.1%     93.8%
4      271       265      97.8%     80.6%       95.1%     93.8%

Table 2: Incrementally recognized instances and learned terms

rnd.   unannot.   recog.   corr.   prec.     rec.     terms
1      331        225      196     86.7%     59.2%    262
2      118        34       32      94.1%     27.1%    29
3      79         16       16      100.0%    20.3%    4
4      63         0        0       100.0%    0%       0

… miles around Oxford. AMBER uses the content of the seed gazetteer to identify the positions of known terms such as locations. In particular, AMBER identifies three potential new locations, namely "Oxford", "Witney", and "Wallingford", with confidences of 0.70, 0.62, and 0.61, respectively. Since the acceptance threshold for new items is 50%, all three locations are added to the gazetteer.

We repeat the process for several websites and show how AMBER identifies new locations with increasing confidence as the number of analyzed websites grows. We then let AMBER run over 250 result pages from 150 sites of the UK real estate domain, in a configuration for fully automated learning, i.e., g = l = u = 50%, and visualize the results on sample pages.

Starting with the sparse gazetteer (i.e., 25% of the full gazetteer), AMBER performs four learning iterations before it saturates, i.e., learns no new terms. Table 1 shows the outcome of each of the four rounds. Using the incomplete gazetteer, we initially fail to annotate 328 out of 1484 attribute instances. In the first round, the gazetteer learning step identifies 226 unannotated instances, 196 of them correctly, which yields a precision of 86.7% and a recall of 59.2% on the unannotated instances, and 84.7% and 81.6% on all instances. Precision increases steadily across the learning rounds, so that at the end of the fourth iteration AMBER achieves a precision of 97.8% and a recall of 80.6% on the unannotated instances, and an overall precision and recall of 95.1% and 93.8%, respectively.

Table 2 shows the incremental improvements made in each round. For each round, we report the number of unannotated instances, the number of instances recognized through attribute alignment, and the number of correctly identified instances, together with the corresponding precision and recall and the number of new terms added to the gazetteer. Note that the number of learned terms is larger than the number of instances in round 1, as splitting instances yields multiple terms. Conversely, in rounds 2 to 4, the number of terms is smaller than the number of instances, since terms occur in multiple instances simultaneously or are already blacklisted.

We also show the behavior of AMBER with different settings for the threshold g. In particular, increasing g (i.e., the required support for discovered attributes) leads to higher precision of the learned terms at the cost of lower recall. The learning algorithm also converges faster for higher values of g.

Figure 4 illustrates our evaluation of AMBER on the real-estate domain. We evaluate AMBER on 150 UK real-estate web sites, randomly selected among 2810 web sites named in the yellow pages. For each site, we submit its main form with a fixed sequence of fillings to obtain one or, if possible, two result pages with at least two result records, and compare AMBER's results with a manually annotated gold standard. Using a full gazetteer, AMBER extracts data areas, records, price, detailed page link, location, legal status, postcode, and bedroom number with more than 98% precision and recall. For less regular attributes such as property type, reception number, and bathroom number, precision remains at 98%, but recall drops to 94%. Our evaluation shows that AMBER can generate human-quality examples for web sites in a given domain.

[Figure 4: AMBER Evaluation on Real-Estate Domain. Bar chart of precision and recall per attribute: data area, records, price, details URL, location, legal, postcode, bedroom, property type, reception, bath; y-axis 94%–100%.]



• Fails to annotate 328 of 1,484 locations
• Saturates after 3 rounds

AMBER Evaluation

Real Estate
[Charts: precision and recall, overall (data areas, records, attributes; y-axis 97.5%–100%) and per attribute (reception, price, bathroom, legal status, detailed page, bedroom, location, postcode, property type; y-axis 94%–100%); panels: overall, attributes, large scale.]

Used Cars
[Charts: precision and recall, overall (data areas, records, attributes; y-axis 97.5%–100%) and per attribute (door num, detail page, make, fuel type, color, price, mileage, cartype, trans, model, engine size, location; y-axis 90%–100%); panels: overall, attributes.]

[Chart: per-attribute accuracy in the real-estate domain (price, location, detailed page, bedroom, legal status, postcode, property type, bathroom, reception; y-axis 0%–100%), comparing 250 manually annotated pages with 2,215 automatically processed pages.]

[Diagram: corpus sizes. Real estate: 281 pages, 2,785 records, 14,614 attributes. Used car (large scale): 2,215 pages, 20,723 records, 114,714 attributes. Also shown: 151, 1,608, 12,732.]

extracts attributes with >99% precision and >98% recall

Automatically Learning Gazetteers from the Deep Web

AMBER Learning Cycle

Page Segmentation
• Page Retrieval: Mozilla, GATE annotations
• Data Area Identification: pivot node (mandatory fields) clustering
• Record Segmentation: head/tail cut-off, segment boundary shifting

Attribute Alignment
• Attribute Cleanup: discard attributes of low support
• Attribute Disambiguation: discard redundant attributes of lower support
• Attribute Generalization: add new attributes of sufficient support

Gazetteer Learning
• Term Formulation: split new attributes into terms
• Term Validation: track term relevance, discard irrelevant terms
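The cycle above can be read as a simple fixpoint loop. The following Python sketch is illustrative only: the three stage functions are passed in as callables, since segmentation, alignment, and term learning are not implemented here.

```python
def amber_learning_cycle(pages, gazetteer, segment, align, learn, max_rounds=10):
    """Outer loop of the learning cycle (a sketch, not AMBER's code):
    analyse pages with the current gazetteer, learn new terms, and
    repeat until saturation."""
    for _ in range(max_rounds):
        new_terms = set()
        for page in pages:
            records = segment(page, gazetteer)         # Page Segmentation
            attributes = align(records)                # Attribute Alignment
            new_terms |= learn(attributes, gazetteer)  # Gazetteer Learning
        new_terms -= gazetteer
        if not new_terms:       # saturation, as after round 3 in Table 2
            break
        gazetteer |= new_terms  # learned terms feed the next round
    return gazetteer
```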

[Diagram: three result records (1–3) as DOM trees with annotation labels L, P, A, and X at their nodes.]

Term formulation example: the attribute value "Oxford, Walton Street, top-floor apartment" is split into the terms "Oxford", "Walton Street", and "top-floor apartment", which feed into the gazetteers.
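The splitting step in the example is essentially delimiter-based; a minimal sketch, assuming comma-separated attribute values as above:

```python
def formulate_terms(attribute_value: str) -> list[str]:
    """Split a newly inferred attribute value into candidate gazetteer
    terms (a sketch; real values may use richer delimiters than commas)."""
    return [part.strip() for part in attribute_value.split(",") if part.strip()]

print(formulate_terms("Oxford, Walton Street, top-floor apartment"))
# -> ['Oxford', 'Walton Street', 'top-floor apartment']
```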

A data area is a maximal DOM subtree d which
• contains ≥ 2 pivot nodes that are
• depth consistent (depth(n) = k ± ε),
• distance consistent (pathlen(n, n′) = k ± δ), and
• continuous, such that
• their least common ancestor is d's root.

A result record is a sequence of children of the data area root.

A result record segmentation divides a data area
• into non-overlapping records,
• each containing the same number of siblings,
• each based on a single selected pivot node.
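To illustrate the consistency conditions, here is a small Python sketch over a toy DOM. The Node class, the choice to measure pathlen via the least common ancestor, and the tolerance values for ε and δ are assumptions of the sketch, not AMBER's implementation.

```python
from itertools import combinations

class Node:
    """Toy DOM node; the tag is kept only for readability."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        if parent is not None:
            parent.children.append(self)

def depth(n):
    return 0 if n.parent is None else 1 + depth(n.parent)

def ancestors(n):
    out = []
    while n is not None:
        out.append(n)
        n = n.parent
    return out

def pathlen(n, m):
    # Path length between n and m, measured via their least common ancestor.
    lca = next(a for a in ancestors(n) if a in ancestors(m))
    return (depth(n) - depth(lca)) + (depth(m) - depth(lca))

def consistent_pivots(pivots, eps=1, delta=1):
    """Depth and distance consistency from the data-area definition:
    some k fits all pivot depths within k±eps, and some k' fits all
    pairwise path lengths within k'±delta."""
    if len(pivots) < 2:
        return False
    depths = [depth(p) for p in pivots]
    if max(depths) - min(depths) > 2 * eps:
        return False
    dists = [pathlen(a, b) for a, b in combinations(pivots, 2)]
    return max(dists) - min(dists) <= 2 * delta

# Example: a result list where each record carries one pivot node.
root = Node("ul")
pivots = [Node("b", Node("li", root)) for _ in range(3)]
print(consistent_pivots(pivots))  # True: equal depths and distances
```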

[Diagram: two alternative record segmentations (2 and 3) of data areas D into result records R over the same annotated trees (labels L, P, A, X).]

The tag path of a node n in a record r is the tag sequence occurring on the child/next-sibling path from r's root to n.

The support of a type/tag path pair (t, p) is the fraction of records having an annotation for t at path p.
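Both notions translate directly into code. In this sketch, each record is flattened to a set of (type, tag path) pairs, an assumed representation; the example reproduces the 3/4 support of the A annotation from the figure call-outs below.

```python
def support(annotated_records, t, p):
    """Support of a type/tag-path pair (t, p): the fraction of records
    carrying an annotation of type t at tag path p. Each record is
    modelled as a set of (type, tag_path) pairs."""
    return sum((t, p) in anns for anns in annotated_records) / len(annotated_records)

records = [
    {("L", "li/span"), ("P", "li/b"), ("A", "li/i")},
    {("L", "li/span"), ("P", "li/b"), ("A", "li/i")},
    {("L", "li/span"), ("P", "li/b")},
    {("L", "li/span"), ("P", "li/b"), ("A", "li/i")},
]
print(support(records, "A", "li/i"))  # 0.75 -- the 3/4 case from the figure
```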

Remove terms which occur
• in blacklists,
• in other gazetteers.

Compute confidence based on
• the support of its type/tag path pair,
• the relative size of the term within the entire attribute.
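A hedged sketch of this validation step: combining the two confidence factors multiplicatively is an illustrative assumption, as the poster does not fix the formula; the 50% acceptance threshold comes from the evaluation section.

```python
def term_confidence(term, attribute_value, type_path_support):
    """Confidence of a candidate term from the two factors named above:
    support of the attribute's type/tag-path pair, and the term's
    relative size within the whole attribute value."""
    return type_path_support * (len(term) / len(attribute_value))

def validate_term(term, attribute_value, type_path_support,
                  blacklist=frozenset(), other_gazetteers=(), threshold=0.5):
    """Discard blacklisted or already-known terms, then apply the
    acceptance threshold (50% in the evaluation)."""
    if term in blacklist or any(term in g for g in other_gazetteers):
        return False
    return term_confidence(term, attribute_value, type_path_support) >= threshold
```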

Authors
Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang

diadem.cs.ox.ac.uk/amber
[email protected]

Digital Home

Reasoning in Datalog (DLV) rules
• stratified negation
• finite domains
• non-recursive aggregation

Easy integration with domain knowledge.

Number of Rules
• Data Area Identification: 11
• Record Segmentation: 32
• Attribute Alignment: 34
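Since no actual rule listing survives in this text, here is a hypothetical rule, record(R) :- candidate(R), not discarded(R), rendered as Python set comprehensions to show why stratification matters: the negated predicate is fully computed in an earlier stratum before negation is applied.

```python
# Stratum 1: the negated predicate is derived completely first ...
candidates = {"r1", "r2", "r3", "r4"}
discarded = {"r3"}  # e.g. derived from low-support attributes

# Stratum 2: ... so negation is safe (stratified negation, finite domain).
records = {r for r in candidates if r not in discarded}

# Non-recursive aggregation, e.g. counting the derived records:
print(records, len(records))  # {'r1', 'r2', 'r4'} 3
```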

AMBER Applications
• Result Page Analysis
• Data Extraction
• Example Generation for Wrapper Induction
• Gazetteer Learning
• Gazetteer Ontology

Part of DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology): analyzing the pages reached via OPAL to generate OXPath expressions for efficient extraction.

... but usable independently of DIADEM as well.

Sponsors

DIADEM: domain-centric intelligent automated data extraction methodology
diadem.cs.ox.ac.uk

• P is only allowed to appear once, thus the second P with less support is dropped.
• X only occurs once and has too low support to be kept.
• A has a support of 3/4 at this node, hence we add the annotation.
• We inferred that this node is of type A, hence we learn its terms.
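The first three call-outs correspond to attribute cleanup, disambiguation, and generalization. A minimal sketch over (type, tag path) → support mappings; the threshold and the set of once-per-record types are illustrative assumptions.

```python
def align_annotations(pairs, min_support=0.5, unique_types=("P",)):
    """pairs: dict mapping (type, tag_path) -> support within a data area.
    Mirrors the three call-outs above."""
    # Cleanup: discard pairs of low support (the X case).
    kept = {tp: s for tp, s in pairs.items() if s >= min_support}
    # Disambiguation: a type allowed only once keeps its best-supported
    # path; the redundant P of lower support is dropped.
    for t in unique_types:
        paths = [tp for tp in kept if tp[0] == t]
        if len(paths) > 1:
            best = max(paths, key=kept.get)
            for tp in paths:
                if tp != best:
                    del kept[tp]
    # Generalization: whatever survives with sufficient support
    # (e.g. A at 3/4) is added as an annotation.
    return kept

pairs = {("P", "li/b"): 1.0,    # P at its regular position
         ("P", "li/i"): 0.5,    # second P: dropped as redundant
         ("X", "li/em"): 0.25,  # X: occurs once, too low support
         ("A", "li/span"): 0.75}
print(align_annotations(pairs))  # keeps ('P','li/b') and ('A','li/span')
```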