Learning Based Web Query Processing Yanlei Diao Computer Science Department Hong Kong U. of Science & Technology


Learning Based Web Query Processing

Yanlei Diao, Computer Science Department

Hong Kong U. of Science & Technology

Mphil Thesis, Yanlei Diao 2

Outline

• Background
• Learning Based Web Query Processing
• FACT: A Prototype System
• Preliminary System Evaluation
• Conclusions
• Demonstration

Searching the Web

Want to find a piece of information on the Web?
• Huge size
• Heterogeneity
• Lack of structure
• Diversified user bases
• Ever-changing

Search Engines

Search engines maintain indices, accept keyword input, match the input keywords against the indices, and return relevant documents.

Problems:
• Large hit lists with low precision. Users find relevant documents by browsing.
• URLs are returned, not the required information. Users must read the pages to find it.

Web Information Retrieval

• IR: vector-space model, search and browse capabilities
• Web IR: Web navigation, indexing, query languages, query-document matching, output ranking, user relevance feedback
• Recent improvements: hierarchical classification, better presentation of results, hypertext study, metasearching, ...

Web IR for Query Processing

Problems:
• A list of URLs or documents is returned. Users browse a lot to find information.
• It asks users for precise query requirements, which is hard for casual users.
• It lacks a well-defined underlying model. The vector-space model does not convey as much as hypertext.

Large hit lists with low precision; results rely on the input queries.

Intelligent Agents

The agents learn user profiles/models from their search behaviors and employ the knowledge to predict URLs of interest to the user.

Some rely on search engines and heuristics to find targets of a specific type: e.g. papers or homepages

Some help users in an interactive mode: They learn while users are browsing.

Some adaptive agents work autonomously: They use heuristics, recommend pages of interest and take user feedback to improve.


Agents for Query Processing

Problems:
• Recommending pages of interest, but not information of interest to the user
• Using the vector-space model or converting HTML to text documents
• Requiring a priori knowledge, such as user profiles, or using heuristics for a particular domain

Not well suited for ad hoc queries.

Database Approaches

The Web is a directed graph: nodes are Web pages and edges are hyperlinks between pages.

Query languages: 1st generation combines content-based and structure-based queries. 2nd generation accesses structure of Web objects and creates complex objects.

Wrappers and mediators: they present an integrated view of the resources.


DB Approaches for Query Processing

Problems:
• Wrapper generation is only feasible for a limited number of sites in a domain. The Web is growing very fast!
• Web query languages require knowledge of the Web sites (content and linkage) and of the language syntax. They are hard to use.

Not scalable: good for Web site management, but not for queries on the entire Web.

Our Goal

A Web query processing system for any Web user that
• processes ad hoc queries on HTML pages
• automatically extracts succinct and precise query results (a result may take the form of a table, a list or a paragraph).

Learn the knowledge for query processing from the user!

Proposed Approach

An approach with learning capabilities:
• The user enters keywords (probably not precise).
• Search engines return a URL list.
• During browsing, the system learns from the user
    • to navigate through the Web pages
    • to identify the required information on a Web page
• The system processes the remaining URLs automatically.
• The system returns succinct and precise results.

Unique Features

• Returning succinct and precise results, i.e. segments of pages;
• No a priori knowledge or preprocessing, suited for ad hoc queries;
• Exploiting page formatting and linkage information simultaneously, making good use of the rich information conveyed by HTML.

Benefits from Learning

Bridging the gap between keyword input and real query requirements

Capable of navigating in the neighborhoods of documents returned by search engines

Automating the processing of all possibly relevant documents in one query

Almost imperceptible to users, user-friendly


Modeling a Web Page

• Segment: a group of tag-delimited elements and the unit of query processing, e.g. a paragraph, table or list. Segments are nested, from atomic segments up to the whole document, forming a Segment Tree.
• Attributes of a segment
    • content: text in the scope of the segment
    • description: summary of the content
• Hyperlink: represented as a segment, so that links and segments are comparable
    • content: URL
    • description: anchor text associated with the parent segment
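A minimal sketch of how such a segment model could be held in memory. The class name `Segment`, its field names and the `walk` helper are assumptions made for illustration; the slides do not show FACT's actual data structures. The tree is filled in by hand with the hotel page used as the sample in this deck.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Segment:
    """Hypothetical segment node; hyperlinks are segments of kind 'link'."""
    kind: str                      # "document", "paragraph", "table", "list", "link"
    content: str = ""              # text in scope of the segment (URL for a link)
    description: str = ""          # summary of the content (anchor text for a link)
    children: List["Segment"] = field(default_factory=list)

    def walk(self) -> Iterator["Segment"]:
        """Yield this segment and all nested segments, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Segment tree of the sample hotel page, built by hand.
page = Segment("document", children=[
    Segment("paragraph", content="1999 Room Rates"),
    Segment("table", content="Special Promotion", children=[
        Segment("list", children=[
            Segment("link", content="ac01a.html", description="Guest Room"),
            Segment("link", content="ac02a.html", description="Executive Suite"),
        ]),
        Segment("table", content="Room Type Single/Double (HK$) "
                                 "Standard 1000 Executive Suite 2750"),
    ]),
])

kinds = [segment.kind for segment in page.walk()]
```

Treating a link as just another `Segment` is what lets links and segments be scored by the same function later on.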


A Sample

<html>
<head><title> … Hotel </title></head>
<body>
  <p>1999 Room Rates</p>
  <table><tr>
    <td><ul>
      <li><a href="ac01a.html">Guest Room</a></li>
      <li><a href="ac02a.html">Executive Suite</a></li>
    </ul></td>
    <td> Special Promotion <br>
      <table>
        <tr><td>Room Type</td><td>Single/Double (HK$)</td></tr>
        <tr><td>Standard</td><td>1000</td></tr>
        <tr><td>Executive Suite</td><td>2750</td></tr>
      </table>
    </td>
  </tr></table>
</body>
</html>

Segment tree of the page:

Document (content: … & contents of child paragraph and table)
    Paragraph (content: "1999 Room Rates")
    Table (content: "Special Promotion" & the content of the child table)
        List (links: 1. ac01a.html, 2. ac02a.html)
        Table (content: "Room Type Single/Double (HK$) Standard 1000 Executive Suite 2750")

Modeling a Web Site

• Ignore backward links, links pointing to themselves, and links outside a site.
• A Web site is modeled as hyperlink-connected segment trees, called a Segment Graph.

Definition: Sijk: segment; Lm: hyperlink.

[Figure: a Segment Graph with segment trees S1 (S11, S12, S13, S131), S2 (S21), S3 (S31, S32) and S4 (S41), connected by hyperlinks L1, L2, L3 and L4.]

Knowledge for the Locating Task

The locating task is to find a segment in the Segment Graph of a site as the query result.

1) Exhaustive search simplifies it, but is impractical.
2) Navigation in the graph should terminate if a segment answers the query well enough or if a conclusion of irrelevancy can be drawn.

A decision to follow a link or to choose a segment should be made on each page.

Segments and links on a page should be comparable!

Two Types of Knowledge

A link conveys a description of the page it points to, while a queried segment contains both a description and the result itself.

Segments and links on a page are not comparable by content!

Two types of knowledge are needed!
• One only concerns descriptive information and helps find the navigational path.
• The other checks whether a segment meets the query requirements on both the descriptive information and the result.

Navigation Knowledge

• Concerns descriptive information and helps find the navigational path
• A set of (term, weight) pairs
    • Term: a selected word from the description of segments and links on the navigational path
    • Weight: indicates the importance of the term in leading to the queried segment

Learning Navigation Knowledge

• Navigational path: (link)* segment, e.g. L2 L4 S41.
• Extended navigational path: ((segment)* link)* ((segment)* segment), e.g. (S1 S11 L2) (S3 S31 L4) (S4 S41).

Step 1. Assign a weight to each component on the path, e.g. L2, S31, S41. The closer to the target, the higher the weight.

Step 2. Assign a weight to each term in the description of a component on the path.

The weight of a term can be summed up over navigational paths. The set of (term, weight) pairs is stored in the navigation knowledge base.
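The two steps can be sketched as follows. The slides do not state the weighting formula, so this sketch assumes one: the i-th of n components on a path gets weight i/n, and that weight is split evenly among the component's selected terms. The assumption was chosen because it is consistent with the worked example later in the deck (hotel 0.25, single 0.2); the function name `learn_navigation` and the input layout are hypothetical.

```python
from collections import defaultdict


def learn_navigation(paths, knowledge=None):
    """Accumulate (term, weight) pairs over navigational paths.

    paths: list of paths; each path is a list of components ordered from the
    first link followed to the queried segment; each component is the list of
    terms already selected from its description.
    """
    knowledge = defaultdict(float) if knowledge is None else knowledge
    for path in paths:
        n = len(path)
        for position, terms in enumerate(path, start=1):
            if not terms:
                continue
            # ASSUMPTION: linear weighting, closer to the target = heavier.
            component_weight = position / n
            for term in terms:
                # ASSUMPTION: the component weight is shared evenly by its terms.
                knowledge[term] += component_weight / len(terms)
    return knowledge


# Path: link "Hotel Reservation" -> marked room-rate segment.
path = [
    ["hotel", "reservation"],
    ["single", "double", "standard", "deluxe", "executive"],
]
kb = learn_navigation([path])
```

Calling `learn_navigation` again with the same `knowledge` object sums weights across further paths, which matches the "summed up over navigational paths" remark above.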


Classification knowledge

• Checks whether a segment meets the query requirements on both the descriptive information and the result.
• Cast in the Bayesian learning framework.
• A set of triples (feature, NP, NN)
    • Feature: a word, integer, real, symbol, …, date, time, email address, …, contained in a segment
    • NP: number of occurrences of the feature in positive samples
    • NN: number of occurrences of the feature in negative samples

Learning Classification knowledge

• The queried segment is a positive sample. All other segments on the same page are negative samples.
• The content of each segment is parsed into a set of features of either simple or complex types.
• NP and NN are counted cumulatively for each feature over all samples. All triples (feature, NP, NN) are stored in the classification knowledge base.
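A rough sketch of the counting step. The feature parser here is a deliberately crude stand-in: it only distinguishes plain words, integers and decimal numbers, and the type names `&float` and `&integer` as well as both function names are assumptions for illustration, not FACT's parser.

```python
import re
from collections import defaultdict


def extract_features(text):
    """Map each token to a feature; numbers collapse into typed features."""
    features = []
    for token in text.lower().split():
        if re.fullmatch(r"[\d,]*\d\.\d+", token):
            features.append("&float")      # e.g. 999.00 or 1,039.00
        elif re.fullmatch(r"\d+", token):
            features.append("&integer")
        else:
            features.append(token)
    return features


def learn_classification(kb, queried_text, other_texts):
    """Update kb[feature] = [NP, NN] from one marked page.

    queried_text: content of the segment the user marked (positive sample).
    other_texts: contents of all other segments on that page (negative samples).
    """
    for feature in extract_features(queried_text):
        kb[feature][0] += 1
    for text in other_texts:
        for feature in extract_features(text):
            kb[feature][1] += 1
    return kb


kb = defaultdict(lambda: [0, 0])
learn_classification(kb, "standard room 999.00 1,039.00", ["Holiday Inn room"])
```

Because the counts are cumulative, the same `kb` can be passed in again for every page the user marks.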


Query Processing Using Learned Knowledge

• After a Web page is retrieved, the segment graph is built.
• For each segment and link, a score is computed by applying the navigation knowledge (ApplyNavigation). Segments and links are sorted by score.
• If a link has the highest score, the system navigates through the link.
• If a segment has the highest score, all segments on the page are checked to see whether one of them is the queried segment.
• The process is repeated until either a segment is found or it can be concluded that the site does not contain the queried information.
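The loop above could be sketched as a depth-first search that always tries the best-scored unprocessed component first. Everything concrete here is assumed for illustration: the site is a plain dictionary of pre-parsed pages, `nav_score` and `confidence` are toy stand-ins for ApplyNavigation and ApplyClassification, and the `max_pages` budget replaces the unspecified rule for concluding irrelevancy.

```python
def locate(site, url, nav_score, confidence, threshold, visited=None, max_pages=20):
    """Return the queried segment found from `url`, or None."""
    visited = set() if visited is None else visited
    if url in visited or url not in site or len(visited) >= max_pages:
        return None
    visited.add(url)
    components = site[url]
    segments = [c for c in components if c["kind"] != "link"]
    classified = False
    # Process components in descending navigation score.
    for comp in sorted(components, key=nav_score, reverse=True):
        if comp["kind"] == "link":
            found = locate(site, comp["target"], nav_score, confidence,
                           threshold, visited, max_pages)
            if found is not None:
                return found
        elif not classified:
            # A segment ranks highest: check all segments on the page once.
            classified = True
            best = max(segments, key=confidence)
            if confidence(best) >= threshold:
                return best
    return None


# Toy knowledge and toy site (hypothetical data).
WEIGHTS = {"rates": 3.0, "hotel": 0.4}


def nav_score(component):
    terms = component["description"].split()
    return max((WEIGHTS.get(t, 0.0) for t in terms), default=0.0)


def confidence(segment):
    return 1.0 if "1000" in segment["text"] else 0.0


SITE = {
    "home": [
        {"kind": "paragraph", "description": "welcome", "text": "Located in Wanchai"},
        {"kind": "link", "description": "hotel rates", "target": "rates"},
        {"kind": "link", "description": "dining", "target": "dining"},
    ],
    "rates": [{"kind": "table", "description": "room rates", "text": "standard 1000"}],
    "dining": [{"kind": "paragraph", "description": "dining", "text": "restaurants"}],
}

visited = set()
result = locate(SITE, "home", nav_score, confidence, threshold=0.5, visited=visited)
```

In this toy run the "hotel rates" link outranks everything else on the home page, so only two pages are visited before the table is returned.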


Locating Algorithm

On each page, if the result is not found, choose the unprocessed component with the highest score:
• if a link is chosen, follow it;
• if a segment is chosen, classify the segments on the page (ApplyClassification).

[Figure: the locating process traced on a Segment Graph with segment trees S1 (S11, S12, S13, S131), S2 (S21), S3 (S31, S32) and S4 (S41) and hyperlinks L1 to L4. Sijk: segment; Lm: hyperlink.]

Applying Learned Knowledge

Application of navigation knowledge:
• extracts the terms in the description of a link/segment
• reads the weights of the terms and assigns a score to the link/segment by a certain function (currently max)
• sorts all links and segments by their scores

Application of classification knowledge:
• computes the confidence C of classifying a segment as the queried result
• chooses the segment on a page with the largest C; if the largest C is over a threshold, returns the segment
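A sketch of both application steps. The `max` scoring follows the slide. The confidence formula is not given in the deck, so the version below is an assumption: a naive Bayes posterior with equal class priors and add-one (Laplace) smoothing over the (feature, NP, NN) counts. Function and parameter names are hypothetical; the sample weights and counts are taken from the walkthrough slides of this deck.

```python
import math


def navigation_score(terms, nav_kb):
    """Score a link/segment as the maximum weight of its description terms."""
    return max((nav_kb.get(term, 0.0) for term in terms), default=0.0)


def confidence(features, class_kb, alpha=1.0):
    """ASSUMED formula: naive Bayes posterior of 'queried segment'."""
    total_pos = sum(np_count for np_count, _ in class_kb.values())
    total_neg = sum(nn_count for _, nn_count in class_kb.values())
    vocab = max(len(class_kb), 1)
    log_pos = log_neg = 0.0
    for feature in features:
        np_count, nn_count = class_kb.get(feature, (0, 0))
        log_pos += math.log((np_count + alpha) / (total_pos + alpha * vocab))
        log_neg += math.log((nn_count + alpha) / (total_neg + alpha * vocab))
    # Equal priors; clamp the exponent to avoid overflow on long segments.
    return 1.0 / (1.0 + math.exp(min(log_neg - log_pos, 700.0)))


nav_kb = {"hotel": 0.392857, "rate": 3.0, "servic": 0.066667}
score = navigation_score(["hotel", "rate"], nav_kb)

class_kb = {
    "accommodation": (0, 1),
    "banquet": (0, 1),
    "double": (1, 0),
    "&float": (6, 0),
    "standard": (1, 2),
}
c_rates = confidence(["&float", "&float", "double"], class_kb)
c_intro = confidence(["accommodation", "banquet"], class_kb)
```

With a threshold of, say, 0.5 (also an assumed value), the price-like segment would be returned and the introductory paragraph rejected.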


[Screenshots of a training session: the user browses a hotel site from the result list ("User browses it!"), clicks a link ("User clicks here!") and marks the segment holding the room information ("User marks it!").]

Generating Navigation Knowledge

The navigational path looks like:

Hotel Reservation -> single hk$ double hk$ standard room deluxe room +executive room

By our weighting scheme, a weight is assigned to each term:

Term         Weight
hotel        0.25
reservation  0.25
single       0.2
double       0.2
standard     0.2
deluxe       0.2
executive    0.2

Generating Classification Knowledge

Training samples:
• Positive: single hk$ double hk$ standard room 999.00 1,039.00 deluxe room 1,199.00 1,239.00 +executive room 1,399.00 1,499.00
• Negative: Holiday Inn Golden Mile. In the heart of Tsim Sha Tsui - Kowloon, Holiday Inn Golden Mile is your number one choice for accommodation, dining, meetings and banquets. Ideally situated in the heart of ...

Occurrences of each feature are counted:

Feature        Positive  Negative
accommodation  0         1
banquet        0         1
…
deluxe         1         1
double         1         0
&'$'           2         2
executive      1         1
&float         6         0
…
single         1         1
standard       1         2

[Screenshot: FACT starts here!]

Applying Navigation Knowledge

The page contains:
• Links: Main, Features & Services, Dining and Banqueting, Hotel Rates, Reservation, ...
• Paragraph: 57 - 73 Lockhart Road, Wanchai, Hong Kong, SAR, PRC
• Paragraph: Located in the hub of Wanchai, the Wharney Hotel is within walking distance of the Hong Kong Arts Centre, Convention and Exhibition Centre, busy commercial complexes and shopping malls.
• ...
• Paragraph: TEL: (852) 2861-1000 FAX: (852) 2865-6023

Navigation knowledge shows:

Term     Weight
…
execut   0.2
hong     0.285714
hotel    0.392857
…
kong     0.285714
…
rate     3
servic   0.066667
reserve  0.25
suit     0.230769

[Screenshot: navigation knowledge assigns scores to the links and segments of the current page (values shown: 0.285714, 0.392857, 0.0666667, 0, 3.0, 0.25, 0, 0.230769, 0, 0, 0.392857). Annotation: "FACT chooses it!"]

[Screenshot: on the next page, navigation knowledge assigns scores to the segments: Table 0.586447, Paragraph 3.0, Paragraph 0.25, List 0.25.]

[Screenshot: classification knowledge is applied to all segments on the page and computes a confidence for each (C = 6.3e-008, 0.0001, 2.5e-007, 1.0, 0.3569). Annotation: "FACT finds it!"]


A Query Processing System

A learning based query processing system:
• User Interface: accepts user queries, presents query results, and provides a browser capable of capturing user actions
• Query Analyzer: analyzes and transforms user queries
• Session Controller: coordinates learning and locating
• Learner: generates knowledge from captured user actions
• Locator: applies knowledge and locates query results
• Retriever & Parser: retrieves pages and parses them into trees
• Knowledge Base: stores learned knowledge

Reference Architecture

[Figure: reference architecture connecting the User, User Interface, Query Analyzer, Session Controller, Learner, Locator, Knowledge Base, Retriever & Parser, Search Engine and the Web.]

A Query Session

[Figure: a query session. Labels: Session Controller, Training Strategy, Segment Graph, Result Buffer, Knowledge Base, User Actions, Query, Results, Checking, URLs, Locating Process, Locator, Query Result Presenter, Learning Process, Learner, Browser, Scripts.]

Training Strategies

• Sequential
    • First n sites: the user browses and the system learns
    • Next N-n sites: the system processes
• Random
    • Randomly chosen n sites: the user browses and the system learns
    • The system processes the rest
• Interleaved
    • First n0 sites: the user browses and the system learns
    • Next n - n0 sites: the system makes decisions. For incorrect ones, the user browses and the system re-learns
    • Next N-n sites: the system processes
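The three strategies can be sketched as ways of splitting the N sites of a session. The function names, the `learn`/`decide`/`is_correct` callbacks and the seeding parameter are all assumptions for illustration; in the real system the "learn" step is the user browsing and marking, not a function call.

```python
import random


def split_sites(sites, n, strategy="sequential", seed=None):
    """Return (training_sites, remaining_sites) for a session of N sites."""
    if strategy == "sequential":
        return sites[:n], sites[n:]
    if strategy == "random":
        chosen = set(random.Random(seed).sample(range(len(sites)), n))
        training = [s for i, s in enumerate(sites) if i in chosen]
        remaining = [s for i, s in enumerate(sites) if i not in chosen]
        return training, remaining
    raise ValueError(f"unknown strategy: {strategy}")


def interleaved(sites, n0, n, learn, decide, is_correct):
    """Learn on the first n0 sites, re-learn on mistakes up to site n,
    then process the remaining N-n sites without further training."""
    for site in sites[:n0]:
        learn(site)
    for site in sites[n0:n]:
        if not is_correct(site, decide(site)):
            learn(site)              # user browses, system re-learns
    return [decide(site) for site in sites[n:]]


sites = list("abcdef")
train_seq, rest_seq = split_sites(sites, 2)
train_rnd, rest_rnd = split_sites(sites, 2, strategy="random", seed=0)

learned = []
results = interleaved(
    sites, n0=2, n=4,
    learn=learned.append,
    decide=str.upper,                        # toy decision function
    is_correct=lambda site, decision: site != "c",
)
```

In the toy run the system is "wrong" only on site c, so the user trains on a, b and c and the system handles e and f alone.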


System Evaluation

• System capabilities
• Performance
    • Effectiveness: precision, recall, correctness
    • Efficiency: in a site, how many pages the system visits to find a result or to recognize the irrelevancy
    • Training efficiency: how many training samples are needed
• Key issues
    • Effectiveness of the knowledge
    • Effectiveness of training strategies
• Tests on a range of queries

A System Output Sample


System Capabilities

• The system returns segments of the Web pages.
• The segments may not contain any input keyword but still meet the requirement of room rates. The system learned the query requirement from the user!
• Segments can come from pages whose URLs are not directly returned by Yahoo!. The system learned how to follow the hyperlinks to the queried segment!

System Evaluation - Effectiveness

Given a set of URLs in a query session, the system makes N decisions:

N = N1 + N2 + N3 + N4

         Found   Not Found
Right    N1      N2
Wrong    N3      N4

Precision = N1 / (N1 + N3)
Recall = N1 / (number of sites that contain results)
Correctness = (N1 + N2) / N
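The three measures are straightforward to compute from the decision counts. In the sketch below the function name is hypothetical, and the counts are illustrative values chosen to be consistent with the percentages reported for Q11 in this deck (60 processed sites, 29 containing results); the slides themselves do not list N1 to N4.

```python
def effectiveness(n1, n2, n3, n4, sites_with_results):
    """Precision, recall and correctness from the four decision counts.

    n1: found and right      n2: not found and right
    n3: found but wrong      n4: not found but wrong
    """
    n = n1 + n2 + n3 + n4
    return {
        "precision": n1 / (n1 + n3),
        "recall": n1 / sites_with_results,
        "correctness": (n1 + n2) / n,
    }


# Illustrative counts (assumed, not taken from the slides).
metrics = effectiveness(n1=23, n2=26, n3=7, n4=4, sites_with_results=29)
percentages = {name: round(100 * value, 1) for name, value in metrics.items()}
```

Note that recall is measured against the sites that actually contain a result, not against N1 + N4.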


System Evaluation - Efficiency

How efficiently does the system find a queried segment in a site?

Level of a queried segment = the length of the shortest path to find it
Absolute path length = number of visited pages
Relative path length = number of visited pages / level of the queried segment
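As a small worked example of the two path-length measures (the function name and the sample numbers are assumptions for illustration):

```python
def relative_path_length(visited_pages, level):
    """Visited pages divided by the level of the queried segment.

    A value of 1.0 means the system took a shortest path to the segment.
    """
    if level <= 0:
        raise ValueError("level must be a positive number of pages")
    return visited_pages / level


# A segment two pages deep that was found after visiting three pages.
absolute = 3
relative = relative_path_length(absolute, level=2)
```

For sites where nothing is found there is no level to divide by, which is presumably why the result tables report the absolute path length for the "not found" case.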


Basic Performance

Sequential training.

Query  URLs Returned  URLs Selected  URLs Used  URLs for Training  URLs Processed  Relevant URLs
Q11    424            100            69         9                  60              29
Q12    69533          100            71         9                  62              38

Q11: Hong Kong Hotel Room Rate
Q12: Hong Kong Hotel

Query  Precision (%)  Recall (%)  Correctness (%)
Q11    76.7           79.3        81.7
Q12    87.5           73.7        79.0

Effectiveness of Knowledge

Two other systems were implemented for comparison.

Classification Knowledge Only: treats links and segments the same way, using the Bayes classifier.

Learning:

Action          Positive     Negative
click a link    the link     other links on the page
mark a segment  the segment  other segments on the page

Locating:
• Classify all segments and links.
• If a link has the highest confidence, follow the link.
• If a segment has the highest confidence and passes the threshold, return it.

Effectiveness of Knowledge

Navigation Knowledge Only: only checks the descriptive information of links and segments.

Learning: navigational path -> navigation knowledge

Locating:
• Assign scores to all links and segments using navigation knowledge.
• If a link has the highest score, follow the link.
• If a segment has the highest score, return it.

Effectiveness of Knowledge

Effectiveness of three systems:

                         Query 1                            Query 2
System                   Correctness  Precision  Recall     Correctness  Precision  Recall
Both Types of Knowledge  81.70%       76.70%     79.30%     79.00%       87.50%     73.70%
Bayesian Only            58.30%       51.20%     69.00%     38.70%       34.00%     42.10%
Navigation Only          36.70%       28.80%     55.60%     29.00%       26.80%     39.50%

Correctness of three systems, by level of the queried segment:

                         Query 1                            Query 2
System                   Irrelevant  Level=1  Level=2       Irrelevant  Level=1  Level=2
Sites                    31(60)      27(60)   2(60)         24(62)      12(62)   26(62)
Both Types of Knowledge  83.80%      81.50%   50%           87.50%      91.70%   65.40%
Bayesian Only            48.40%      70.40%   50%           33.30%      75%      26.90%
Navigation Only          21.20%      60%      0%            12.50%      58.30%   30.80%

Annotations on the slide:
• Only works for results on the first page
• Bad filtering capability!
• Navigation only checks description, nearly not workable
• Poor navigation capability!

Effects of Training Strategies

[Figures: precision, recall and correctness for query Q12, each plotted against training size (3, 5, 7, 9, 10) on a scale from 0 to 1, with one series each for Sequential, Random and Interactive training.]

Effects of Training Strategies

• Random training performs badly and is low in recall.
• As the training size increases, interleaved training outperforms sequential training.
• Best accuracy reaches or exceeds 90% in all metrics when the interleaved training strategy is used.
• Enlarging the training size for random and sequential training is not effective.

[Figure: correctness for Query 2 plotted against training size (3, 5, 7, 9, 10, 12, 15, 20) on a scale from 0 to 1, for Sequential and Random training.]

Improved Performance

Interleaved training.

Metric                            Q11   Q12
Correctness                       0.93  0.9
Precision                         0.92  0.92
Recall                            0.88  0.94
Relative Path Length (Found)      1     1.21
Absolute Path Length (Not Found)  1.3   1.57

A Range of Queries

• Hotel room rates: targets prices, which are easy to identify
• Admission requirements for graduate students: includes items such as degree, GPA, GRE, etc. that are not easy to specify in keywords but easy to show by marking
• Data mining researcher: a concept, subjective; evidence includes research interests, projects, professional activity, etc.

Query Requirement (QR)                            Keyword Query KQ1                      Keyword Query KQ2
1: room rates of Hong Kong hotels                 11: "Hong Kong hotel room rate"        12: "Hong Kong hotel"
2: admission requirements on graduate applicants  21: "requirements graduate applicant"  22: "graduate applicant"
3: data mining researcher                         3: "data mining researcher"

Results of A Range of Queries

Interleaved training.

                                  QR1           QR2           QR3
Metric                            KQ11   KQ12   KQ21   KQ22   KQ3
Correctness                       0.93   0.9    0.84   0.91   0.83
Precision                         0.92   0.92   0.85   0.88   0.64
Recall                            0.88   0.94   0.94   0.91   0.67
Relative Path Length (Found)      1      1.21   1.08   1.1    1
Absolute Path Length (Not Found)  1.3    1.57   2.5    1.76   1.67

The slide marks one keyword query of QR1 and one of QR2 as "more precise".

Performance for the Queries

• Effectiveness
    • first 4 queries: accuracy ranges from 80% to above 90%
    • the last query: still capable of filtering out irrelevant sites
• Efficiency
    • the relative path length to locate a queried segment is close to 1
    • the absolute path length to conclude irrelevancy is no more than 2.5 pages
• The performance is not affected much by how precise the keyword query is. The system learns the query requirements.


Conclusions

Proposed and implemented learning based Web query processing with the following features:
• Returning succinct results: segments of pages;
• No a priori knowledge or preprocessing, suited for ad hoc queries;
• Exploiting page formatting and linkage information simultaneously.

The preliminary results are promising.

Future Work

• Better segmentation for HTML documents
• Better knowledge, the key factor that affects system performance
    • other weighting schemes for navigation knowledge
    • other implementations of classification knowledge
• More system evaluation
• Dynamic Web pages


Demonstration