Crawling the Hidden Web. Authors: Sriram Raghavan, Hector Garcia-Molina. Presented by: Jorge Zamora


Page 1: Crawling the Hidden Web

Crawling the Hidden Web

Authors: Sriram Raghavan

Hector Garcia-Molina

Presented by: Jorge Zamora

Page 2: Crawling the Hidden Web

Outline
• Hidden Web
• Crawler Operation Model
• HiWE – Hidden Web Exposer
• LITE – Layout-based Information Extraction
• Experimental Results
• Relation to class lectures
• Pros/Cons
• Conclusion

July 21, 2011 JAZ-2Crawling the Hidden Web

Page 3: Crawling the Hidden Web

Hidden Web
• PIW – Publicly Indexable Web
• Deep Web
  – Estimated at 500 times the size of the PIW
• Hidden Crawler
  – Parses, processes, and interacts with forms
• Task-specific approach
• Two steps
  – Resource Discovery
  – Content Extraction

Page 4: Crawling the Hidden Web

Hidden Crawler – Operation Model


Page 5: Crawling the Hidden Web

Hidden Crawler – Operation Model
• Internal form representation
  – F = ({E1, E2, …, En}, S, M)
  – E1 … En are the form elements, S is the submission information, M is meta-information about the form
• Task-specific database (D)
  – Used to formulate search queries
• Matching function
  – Match(({E1, …, En}, S, M), D) = {[E1←v1, …, En←vn]}
• Response analysis
  – Success vs. error pages, storage, tuning
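The form representation and matching function above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `FormElement`, `Form`, and `match` are hypothetical, the database D is modeled as a plain mapping from label to candidate values, and labels are matched by exact lookup rather than by any fuzzy matching.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class FormElement:
    label: str            # descriptive text associated with the element
    domain: tuple = ()    # allowed values; an empty tuple means free text


@dataclass
class Form:
    elements: list                            # {E1, ..., En}
    submission: dict                          # S: e.g. action URL and method
    meta: dict = field(default_factory=dict)  # M: meta-information


def match(form, database):
    """Return candidate assignments [E1<-v1, ..., En<-vn] as a list of dicts."""
    per_element = []
    for element in form.elements:
        values = database.get(element.label, [])
        if element.domain:
            # A finite-domain element can only take values the form offers.
            values = [v for v in values if v in element.domain]
        if not values:
            return []     # no value available for this element
        per_element.append([(element.label, v) for v in values])
    # Every combination of one value per element is a candidate assignment.
    return [dict(combo) for combo in product(*per_element)]


# Hypothetical example data, not taken from the presentation.
form = Form(
    elements=[FormElement("company"), FormElement("type", ("news", "report"))],
    submission={"action": "/search", "method": "GET"},
)
database = {"company": ["Intel", "AMD"], "type": ["news", "patent"]}
assignments = match(form, database)
```

Here `assignments` holds one dictionary per candidate submission. A real crawler would rank these before submitting them rather than trying every combination.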


Page 6: Crawling the Hidden Web

Hidden Crawler – Performance
• Challenge
  – Wanted a metric that does not depend significantly on D
• Submission Efficiency
  – Ntotal = total number of forms the crawler submits
  – SEstrict = Nsuccess / Ntotal
    • Penalizes the crawler even for submissions that were correct but did not yield any results
  – SElenient = Nvalid / Ntotal
    • Penalizes the crawler only if the form submission is semantically incorrect
    • Difficult to evaluate: every form submission must be evaluated
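The two metrics reduce to simple ratios. The sketch below uses hypothetical function names; the sample counts (3214 submissions, 2853 successes) are the ρfuz figures reported for Task 1 in this presentation.

```python
def se_strict(n_success, n_total):
    """Share of submissions that returned a result page with matches."""
    return n_success / n_total if n_total else 0.0


def se_lenient(n_valid, n_total):
    """Share of submissions that were semantically correct."""
    return n_valid / n_total if n_total else 0.0


strict = se_strict(2853, 3214)
print(f"SEstrict = {strict:.1%}")
```

Since every successful submission is also a valid one, SElenient is never lower than SEstrict for the same crawl.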


Page 7: Crawling the Hidden Web

HiWE
• Hidden Web Exposer
• Prototype hidden Web crawler built at Stanford
• Basic idea
  – Extract some kind of descriptive information, or label, for each element in the form
  – Use a task-specific database that contains a finite set of categories with associated labels
  – A matching algorithm attempts to match form labels with database values to produce value assignment sets

Page 8: Crawling the Hidden Web

HiWE – Conceptual Parts


Page 9: Crawling the Hidden Web

HiWE – Form Representation

• F = ({E1, E2, …, En}, S, ∅)
  – Dom(Ei): domain of element Ei
  – Label(Ei): label of element Ei

Page 10: Crawling the Hidden Web

HiWE – Task-specific Database
• Organized as a finite set of concepts or categories
• Each concept has one or more labels and associated values
• Each row in the LVS (Label Value Set) table is of the form (L, V)
  – L is a label
  – V = {v1, …, vn} is a fuzzy set of values
  – vi represents a value
  – Fuzzy set V has an associated membership function MV
  – MV(vi) is the crawler's confidence in the assignment of vi
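One simple way to picture the LVS table is a mapping from each label to a fuzzy set, stored as value-to-grade pairs. The sketch below is an assumption about representation only; the entries and the helper names `membership` and `candidates` are hypothetical.

```python
# LVS table: label -> fuzzy set of values, stored as {value: membership grade}.
# The entries are made-up examples.
lvs = {
    "company": {"Intel": 1.0, "AMD": 1.0, "Acme Chips": 0.4},
    "document type": {"news": 1.0, "press release": 0.8},
}


def membership(table, label, value):
    """M_V(value) for the row with this label; 0.0 if the value is absent."""
    return table.get(label, {}).get(value, 0.0)


def candidates(table, label, min_grade=0.0):
    """Values for a label, highest grade first, filtered by a minimum grade."""
    row = table.get(label, {})
    ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
    return [value for value, grade in ranked if grade >= min_grade]
```

For example, `candidates(lvs, "company", 0.5)` drops the low-confidence entry and keeps the two values with grade 1.0.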


Page 11: Crawling the Hidden Web

HiWE – Matching Function
• Label matching
  – All labels are normalized
    • Common case, stemming, stop-word removal
  – String matching
    • Uses minimum edit distance, taking word ordering into account
  – If the closest label is more than σ edit operations away, the match is set to nil
• Ranking value assignments
  – Minimum acceptable rank: ρmin
  – Fuzzy conjunction: ρfuz
  – Average: ρavg
  – Probabilistic: ρprob
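A rough sketch of both halves of the matching function follows. Several things here are assumptions rather than details from the slides: the stop-word list, the crude plural stripping that stands in for real stemming, treating σ as a raw count of edit operations, and ignoring word ordering. The three ranking functions use their usual definitions (minimum, mean, and one minus the product of complements), which the slide names but does not spell out.

```python
import math

STOP_WORDS = {"the", "of", "a", "an", "by", "for", "in", "to", "your", "enter"}


def normalize(label):
    """Lowercase, drop punctuation and stop words, strip a trailing plural 's'."""
    words = "".join(c if c.isalnum() else " " for c in label.lower()).split()
    kept = [w for w in words if w not in STOP_WORDS]
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w for w in kept)


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def match_label(form_label, lvs_labels, sigma):
    """Closest LVS label, or None (nil) if it is more than sigma edits away."""
    target = normalize(form_label)
    best, best_dist = None, None
    for candidate in lvs_labels:
        dist = edit_distance(target, normalize(candidate))
        if best_dist is None or dist < best_dist:
            best, best_dist = candidate, dist
    if best is None or best_dist > sigma:
        return None
    return best


def rho_fuz(grades):
    """Fuzzy conjunction: the weakest individual assignment."""
    return min(grades)


def rho_avg(grades):
    """Average of the individual assignment grades."""
    return sum(grades) / len(grades)


def rho_prob(grades):
    """Probabilistic: chance that at least one assignment is useful."""
    return 1.0 - math.prod(1.0 - g for g in grades)
```

A value assignment would then be submitted only if its rank under the chosen function is at least ρmin. Because ρprob is never lower than the other two for the same grades, it lets more assignments through.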


Page 12: Crawling the Hidden Web

HiWE – Populating LVS Table
• Explicit initialization
• Built-in entries
  – Dates, times, names of months, days of the week
• Wrapped data sources
  – Given a set of labels: return new entries
  – Given a set of values: search for similar values, expand existing entries
• Crawling experience
  – Finite domain elements
  – Values learned from one form can be used to fill out a later form more efficiently

Page 13: Crawling the Hidden Web

HiWE – Computing Weights
• Explicit initialization
  – Fixed, predefined weights (usually 1) representing maximum confidence in human-supplied values
• External data sources or crawler activity
  – Positive boost: successful submission
  – Negative boost: unsuccessful submission
• Initial weights for values obtained from external data sources are computed by the wrapper

Page 14: Crawling the Hidden Web

HiWE – Computing Weights
• Finite domain
  – Case 1: crawler extracts a label and a label match is found
    • Unions the values in Dom(E) into the matched value set
    • Boosts the weights/confidence of the existing values
  – Case 2: crawler extracts a label, but label match = nil
    • A new row is added to the LVS table
  – Case 3: crawler cannot extract a label
    • Identifies the value set that most closely resembles Dom(E)
    • Once located, adds the values in Dom(E) to that value set
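The three cases can be sketched as a single update routine over a table of the form label → {value: weight}. The function name `absorb_finite_domain`, the initial weight for new values, the size of the boost, and the use of Jaccard overlap as the resemblance measure in Case 3 are all assumptions made for illustration; the slides do not specify them.

```python
def jaccard(a, b):
    """Overlap between two collections of values, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def absorb_finite_domain(table, domain, matched_label=None, extracted_label=None,
                         new_weight=0.5, boost=0.1):
    """Fold Dom(E) into the table; return the updated label, or None."""
    if matched_label is not None:          # Case 1: label extracted and matched
        label = matched_label
    elif extracted_label is not None:      # Case 2: label extracted, match is nil
        label = extracted_label
        table.setdefault(label, {})        # add a new row
    else:                                  # Case 3: no label could be extracted
        if not table:
            return None
        label = max(table, key=lambda name: jaccard(table[name], domain))
        if jaccard(table[label], domain) == 0.0:
            return None                    # nothing resembles Dom(E)
    row = table[label]
    for value in domain:
        if value in row:
            row[value] = min(1.0, row[value] + boost)  # reinforce a known value
        else:
            row[value] = new_weight                    # union in a new value
    return label


# Hypothetical walk-through of the three cases.
table = {"month": {"January": 1.0, "February": 0.8}}
case1 = absorb_finite_domain(table, ["February", "March"], matched_label="month")
case2 = absorb_finite_domain(table, ["CA", "NY"], extracted_label="state")
case3 = absorb_finite_domain(table, ["January", "April"])
```

After the walk-through, the "month" row has gained "March" and "April", and a new "state" row exists.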


Page 15: Crawling the Hidden Web

HiWE – Explicit Configuration


1 Set of sites to crawl

2 Explicit initialization entries for the LVS table

3 Set of data sources, wrapped if necessary

4 Label matching threshold (σ)

5 Minimum acceptable value assignment rank (ρ min)

6 Minimum form size (α)

7 Value assignment aggregation function
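The seven configuration items map naturally onto a small settings object. The sketch below is hypothetical: `HiWEConfig` and its field names are not from the original system, and reading α as "skip forms with fewer than α elements" is an interpretation of "minimum form size". The default values are the Task 1 parameters reported in this presentation (σ = 0.75, ρmin = 0.6, α = 3, ranking function ρfuz).

```python
from dataclasses import dataclass, field


@dataclass
class HiWEConfig:
    sites: list                                        # 1. sites to crawl
    lvs_init: dict = field(default_factory=dict)       # 2. explicit LVS entries
    data_sources: list = field(default_factory=list)   # 3. wrapped data sources
    sigma: float = 0.75                                # 4. label matching threshold
    rho_min: float = 0.6                               # 5. minimum acceptable rank
    alpha: int = 3                                     # 6. minimum form size
    aggregation: str = "fuz"                           # 7. "fuz", "avg" or "prob"

    def __post_init__(self):
        if self.aggregation not in {"fuz", "avg", "prob"}:
            raise ValueError(f"unknown aggregation function: {self.aggregation}")
        if not 0.0 <= self.rho_min <= 1.0:
            raise ValueError("rho_min must lie between 0 and 1")
        if self.alpha < 1:
            raise ValueError("alpha must be at least 1")

    def accepts_form(self, n_elements):
        """Assumed reading of alpha: ignore forms with fewer elements."""
        return n_elements >= self.alpha


config = HiWEConfig(sites=["http://example.com"])
```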

Page 16: Crawling the Hidden Web

LITE
• Layout-based Information Extraction
• Used to automatically extract semantic information from search forms
• In addition to text, uses the physical layout of the page to aid in extraction
  – Layout is not always reflected in the HTML markup

Page 17: Crawling the Hidden Web

LITE – Usage in HiWE
• Used in label extraction
• Implemented by page pruning
  – Isolates the elements that directly influence the layout of the form elements and labels

Page 18: Crawling the Hidden Web

LITE – Steps
• Approximates the layout of the pruned page, discarding images, font styles, and style sheets
• Identifies the pieces of text closest to the form element as candidates
• Ranks each candidate, taking into account position, font size, font style, and number of words
• Chooses the highest-ranked candidate as the label associated with the element
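The candidate selection and ranking steps above can be sketched as follows. Everything numeric here is an assumption: the slides list the ranking factors (position, font size, font style, number of words) but not how they are weighted, so the multipliers, the cutoff of three nearest candidates, and the `TextBox` structure are purely illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class TextBox:
    text: str
    x: float                 # left edge in the approximate layout
    y: float                 # top edge in the approximate layout
    font_size: float = 12.0
    bold: bool = False


def distance(box, ex, ey):
    return math.hypot(box.x - ex, box.y - ey)


def score(box, ex, ey):
    """Heuristic rank of a candidate label for an element at (ex, ey)."""
    s = 1.0 / (1.0 + distance(box, ex, ey))    # closer text ranks higher
    if box.x <= ex and box.y <= ey:
        s *= 1.5                               # left of or above the element
    if box.bold:
        s *= 1.2                               # emphasized text
    s *= min(box.font_size, 16.0) / 12.0       # larger text, capped
    if len(box.text.split()) > 5:
        s *= 0.5                               # long runs are rarely labels
    return s


def extract_label(boxes, ex, ey, k=3):
    """Take the k nearest pieces of text and return the best-ranked one."""
    nearest = sorted(boxes, key=lambda b: distance(b, ex, ey))[:k]
    if not nearest:
        return None
    return max(nearest, key=lambda b: score(b, ex, ey)).text


# Hypothetical layout: a text input at (100, 50) with three nearby text runs.
boxes = [
    TextBox("Company:", 40, 50, bold=True),
    TextBox("Search our archive of semiconductor industry news", 40, 10),
    TextBox("Go", 260, 50),
]
label = extract_label(boxes, 100, 50)
```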


Page 19: Crawling the Hidden Web

Experiment - Parameters

• Parameters shown are for Task 1: “News articles, reports, press releases, and white papers relating to the semiconductor industry, dated sometime in the last ten years”

PARAMETER                                          VALUE
Number of sites visited                            50
Number of forms encountered                        218
Number of forms chosen for submission              94
Label matching threshold (σ)                       0.75
Minimum form size (α)                              3
Value assignment ranking function                  ρfuz
Minimum acceptable value assignment rank (ρmin)    0.6

Page 20: Crawling the Hidden Web

Results – Value Ranking
• The crawler was executed three times with the same parameters and initialization values, but with a different ranking function each time
• ρavg might be a better choice for maximum content extraction
• ρfuz is the most efficient
• ρprob submits the most forms but performs poorly

Task 1
Ranking function    Ntotal    Nsuccess    SEstrict (%)
ρfuz                3214      2853        88.8
ρavg                3760      3126        83.1
ρprob               4316      2810        65.1

Page 21: Crawling the Hidden Web

Results – Form Size

[Figure: number of form submissions, annotated with submission efficiency. Values shown: 3735 and 2950 (78.9%), 3214 and 2853 (88.77%), 2800 and 2491 (88.96%), 1404 (90%).]

Page 22: Crawling the Hidden Web

Results – Crawler additions to LVS


Page 23: Crawling the Hidden Web

Results – LITE Label Extraction
• Elements from 1 to 10
• Manually analyzed to derive the correct label
• Also ran other label extraction heuristics
  – Purely textual analysis
  – Common ways forms are laid out
• LITE achieved 93% accuracy vs. 72% and 83% for the other heuristics

Total number of forms 100

Number of sites from which forms were picked 52

Total number of elements 460

Total number of finite domain elements 140

Average number of elements per form 4.6

Minimum number of elements per form 1

Maximum number of elements per form 12

Page 24: Crawling the Hidden Web

Relation to Class Notes
• Content-driven crawler
  – Different crawlers for different purposes
• Contains similar crawler metrics
  – Crawling speed
  – Scalability
  – Page importance
  – Freshness
• Data transfer
  – Stored after being crawled

Page 25: Crawling the Hidden Web

Cons
• Freshness/recrawling isn’t addressed
• Task-specific; requires human configuration
• Login-based sites and cookie jar implementation not addressed
• Didn’t discuss hidden fields or CAPTCHAs
• Didn’t run Task 1 without LITE for comparison
• Does not use the “name” attribute of form elements
• Required vs. optional fields
• Wildcards, incomplete forms
• Form element dependencies

Page 26: Crawling the Hidden Web

Pros
• First report on a hidden Web crawler
• Not run at runtime
  – vs. shopping and travel sites that do
• Gets better over time

Page 27: Crawling the Hidden Web

Conclusion / Thoughts
• The hidden Web is much bigger now
• The hidden Web is now reached with Google Analytics and Google Ads
• Now we also have AJAX-based forms. How do we deal with them?

Page 28: Crawling the Hidden Web

Thank You

Questions?