Learning to remove Internet advertisements

Nicholas Kushmerick, Department of Computer Science, University College Dublin, Ireland. Presented by Bo Zhang, Department of Computer Science, Michigan Technological University, Dec 6, 2004.


Page 1: Learning to remove Internet advertisements

Dec 6, 2004, Michigan Technological University

Learning to remove Internet advertisements

Nicholas Kushmerick

Department of Computer Science, University College Dublin, Ireland

Presented by Bo Zhang, Department of Computer Science, Michigan Technological University

Page 2: Learning to remove Internet advertisements

Overview

- Background
- Introduction of ADEATER
- Design of ADEATER
- Evaluation
- Related Work
- Conclusion and Future Work

Page 3: Learning to remove Internet advertisements

Background

Negative impact of advertisement images on the Internet

- Slow down browsing
- Consume computer resources
- Impose extra costs on users

[Figure: three regions labeled "Advertisement Image"]

Page 4: Learning to remove Internet advertisements

Introduction of ADEATER

Definition:

- A browsing assistant that automatically removes advertisement images from Internet pages.

Property:

- Rules are generated by a learning algorithm.

Page 5: Learning to remove Internet advertisements

Introduction of ADEATER

Examples

Page 6: Learning to remove Internet advertisements

Design of ADEATER

System Architecture

Page 7: Learning to remove Internet advertisements

Design of ADEATER

Encoding instance

Fixed-width feature vector

An image enclosed in an anchor tag <A> is a candidate advertisement.

Geometric features of an image:
- Height: <IMG height=90>
- Width: <IMG width=90>
- Aspect ratio (ratio of width to height)

Local feature:
- Whether the destination URL and the image URL are in the same Internet domain

  www.ee.mtu.edu/page.html      www.cs.mtu.edu/image.jpg    Yes
  www.dell.com/notebook.html    www.cs.mtu.edu/image.jpg    No
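The geometric and local features above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and "same Internet domain" is assumed here to mean that the last two labels of the host names match (so www.ee.mtu.edu and www.cs.mtu.edu both map to mtu.edu). The slides do not say how ADEATER compares domains.

```python
from urllib.parse import urlparse


def registered_domain(url):
    """Return the last two labels of the host name (an assumed notion of 'domain')."""
    # urlparse only fills in the host when a scheme or '//' prefix is present.
    if "//" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def geometric_and_local_features(height, width, dest_url, img_url):
    """Encode one candidate image; missing dimensions give a missing aspect ratio."""
    aspect = width / height if height and width else None
    return {
        "height": height,
        "width": width,
        "aspect_ratio": aspect,
        "local": registered_domain(dest_url) == registered_domain(img_url),
    }
```

Applied to the two URL pairs on the slide, the `local` feature comes out true for the mtu.edu pair and false for the dell.com pair.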

Page 8: Learning to remove Internet advertisements

Design of ADEATER

Encoding instance

Fixed-width feature vector

Caption feature:
- Words occurring in the enclosing <A> tag, with phrase length < K and phrase count > M
- K is the maximum phrase length
- M is the minimum phrase count

Alt feature:
- Set of "alternate" words in the <IMG> tag (<IMG alt="ad">), with phrase length < K and phrase count > M
- K is the maximum phrase length
- M is the minimum phrase count
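A minimal sketch of how such phrase features might be computed, using the inequalities exactly as written on the slide (phrase length < K, phrase count > M). The function names are hypothetical, the "+" separator follows the rule examples later in the slides ("click+here"), and counting a phrase once per text is an assumption; the slides do not say how ADEATER counts phrases.

```python
import re
from collections import Counter


def phrases(text, k):
    """All runs of consecutive words with length < k, joined with '+'."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    found = set()
    for n in range(1, k):
        for i in range(len(words) - n + 1):
            found.add("+".join(words[i:i + n]))
    return found


def build_vocabulary(texts, k, m):
    """Keep phrases that appear in more than m of the texts (assumed counting rule)."""
    counts = Counter()
    for text in texts:
        counts.update(phrases(text, k))
    return sorted(p for p, c in counts.items() if c > m)


def encode(text, vocabulary, k):
    """Fixed-width 0/1 vector: one position per vocabulary phrase."""
    present = phrases(text, k)
    return [1 if p in present else 0 for p in vocabulary]
```

Because the vocabulary is fixed after the counting pass, every caption or alt text is encoded to a vector of the same width, which is what a fixed-width feature vector requires.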

Page 9: Learning to remove Internet advertisements

Design of ADEATER

Encoding instance

Fixed-width feature vector

Ubase, Udest, Uimg:
- Words occurring in the base URL, destination URL, and image URL, with phrase length < K and phrase count > M
- K is the maximum phrase length
- M is the minimum phrase count

Stop list:
- Low-information terms ("http", "www", "jpg", etc.)
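As a rough illustration of the stop list, the sketch below splits a URL into words and drops stop-listed terms. The function name is hypothetical, the stop list holds only the three terms named on the slide (the full list is not given), and dropping stop words before any phrases are formed is an assumption; the slides do not say how the stop list interacts with multi-word phrases.

```python
import re

# Only the terms named on the slide; the complete stop list is not given there.
STOP_LIST = {"http", "www", "jpg"}


def url_words(url, stop_list=STOP_LIST):
    """Split a URL into lowercase alphanumeric words and drop stop-listed terms."""
    words = re.findall(r"[a-z0-9]+", url.lower())
    return [w for w in words if w not in stop_list]
```

The same function could be applied to the base, destination, and image URLs to obtain the word lists behind Ubase, Udest, and Uimg.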

Page 10: Learning to remove Internet advertisements

Design of ADEATER

Encoding instance

Samples of HTML page

Page 11: Learning to remove Internet advertisements

Design of ADEATER

Encoding of samples

Page 12: Learning to remove Internet advertisements

Design of ADEATER

Encoding of samples (cont.)

Page 13: Learning to remove Internet advertisements

Design of ADEATER

Gathering examples

AD samples are generated by the ADGRABBER browsing assistant:
- Identify candidate advertisements
- Generate vector encoding

NON-AD samples are generated by a custom-built Internet spider:
- Extract images from randomly generated URLs

Page 14: Learning to remove Internet advertisements

Design of ADEATER

Learning rules

Algorithm
- C4.5 decision tree learning algorithm

Properties
- Quick on-line execution of the classifier
- Not overly sensitive to missing features or noise
- Scales well and is insensitive to irrelevant features

Examples of rules
- If the aspect ratio > 4.5833, alt doesn't contain "to" but does contain "click+here", and Udest doesn't contain "http+www", then the instance is an AD
- If Ubase does not contain "messier", and Udest contains "redirect+cgi", then the instance is an AD
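The two example rules can be written directly as a predicate. This sketch assumes a hypothetical instance layout (a dict holding an optional `aspect_ratio` and sets of phrases for `alt`, `ubase`, and `udest`) and treats anything not matched by these two rules as a non-AD, which is a simplification: the learned rule set presumably contains more rules than the two shown.

```python
def is_ad(instance):
    """Apply the two example rules from the slide to one encoded instance."""
    aspect = instance.get("aspect_ratio")  # None when height or width is missing
    alt = instance.get("alt", set())
    ubase = instance.get("ubase", set())
    udest = instance.get("udest", set())

    rule1 = (
        aspect is not None
        and aspect > 4.5833
        and "to" not in alt
        and "click+here" in alt
        and "http+www" not in udest
    )
    rule2 = "messier" not in ubase and "redirect+cgi" in udest
    return rule1 or rule2
```

Each rule is a conjunction of threshold and membership tests, which is why on-line classification can be fast once the feature vector has been built.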

Page 15: Learning to remove Internet advertisements

Design of ADEATER

Removing advertisements

Process
- Fetch HTML pages from the Internet
- Identify candidate advertisements
- Classify instances with the learned rules
- Replace the URL of each image classified as an AD with the URL of an inconspicuous low-bandwidth image

Implementation
- The removal module runs as a proxy server
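The rewriting step of this process might look like the sketch below, which leaves out fetching and the proxy itself. Everything named here is an assumption for illustration: the placeholder URL is made up, the classifier is passed in as a function of the image URL only (the real classifier works on the full feature vector), and a regular expression stands in for proper HTML parsing.

```python
import re

PLACEHOLDER = "http://localhost/blank.gif"  # hypothetical low-bandwidth image

# An <img> tag that immediately follows an opening <a ...> tag.
ANCHORED_IMG = re.compile(
    r'(<a\b[^>]*>\s*<img\b[^>]*?\bsrc=")([^"]*)("[^>]*>)',
    re.IGNORECASE,
)


def remove_ads(html, is_ad_url):
    """Swap the src of anchor-enclosed images that the classifier flags as ads."""
    def swap(match):
        src = match.group(2)
        new_src = PLACEHOLDER if is_ad_url(src) else src
        return match.group(1) + new_src + match.group(3)

    return ANCHORED_IMG.sub(swap, html)
```

Images outside an anchor tag are left untouched, matching the definition of a candidate advertisement as an image enclosed in an <A> tag.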

Page 16: Learning to remove Internet advertisements

Evaluation

Speed and accuracy

Experiment setting

Total samples
- AD: 459 examples
- NON-AD: 2820 examples

10-fold cross-validation
- Training set: 90% of examples
- Test set: 10% of examples

Off-line training phase: 5.8 minutes

On-line classification phase: 70 msec/image

Average accuracy: 97.1%
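For readers unfamiliar with the protocol, 10-fold cross-validation can be sketched as an index split. The function name is hypothetical, and the indices are assumed to refer to examples that have already been shuffled; the slides do not describe how the folds were drawn.

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; test fold sizes differ by at most one."""
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

With the 459 + 2820 = 3279 examples above, each test fold holds 327 or 328 examples and the matching training set holds the rest. The reported accuracy is an average over the ten folds.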

Page 17: Learning to remove Internet advertisements

Evaluation

Learning curves

Simple methodology
- The feature set is not recalculated

Realistic methodology
- The feature set is recalculated

Page 18: Learning to remove Internet advertisements

Evaluation

Alternative encodings

Page 19: Learning to remove Internet advertisements

Related Work

Muffin: Filtering web pages

ImageKill filter: hand-crafted rules

ImageKill.minheight
- Only remove images which are at least n pixels high

ImageKill.minwidth
- Only remove images which are at least n pixels wide

ImageKill.ratio
- Remove images which are more than n times as wide as they are high

ImageKill.exclude
- Don't remove images that match the given string/regexp
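To contrast hand-crafted rules with learned ones, the four options can be mimicked by a small predicate. This is not Muffin's code: the function name and parameters are hypothetical, and combining the four options with a logical AND is an assumption based only on the descriptions above.

```python
import re


def imagekill_like(width, height, url, minheight=0, minwidth=0, ratio=None, exclude=None):
    """Return True if an image would be removed under the four options described above."""
    if exclude and re.search(exclude, url):
        return False  # matches the exclusion pattern: keep the image
    if height < minheight or width < minwidth:
        return False  # too small to be considered
    if ratio is not None and not width > ratio * height:
        return False  # not wide enough relative to its height
    return True
```

Here the user has to pick every threshold by hand, whereas ADEATER derives comparable tests, such as the aspect ratio threshold in its example rule, from training data.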

Page 20: Learning to remove Internet advertisements

Related Work

WebFilter: Filtering web pages

Solution
- User provides a list of URL templates and corresponding filter scripts

Page 21: Learning to remove Internet advertisements

Related Work

Junkbuster: Filtering web pages

Solution
- User provides a block file

Page 22: Learning to remove Internet advertisements

Related Work

Smokey: Detect abusive messages

Solution
- Collect training samples and generate rules by training
- Parse messages and generate a feature vector for each
- Classify the feature vector with the generated rules

Page 23: Learning to remove Internet advertisements

Conclusion and Future Work

Conclusion
- High accuracy
- Modest resource cost (processing time, training samples)

Future Work
- Incremental learning algorithm
- More efficient feature selection mechanism

Page 24: Learning to remove Internet advertisements

Thank you!