Internet Research: What’s hot in Search, Advertising & Cloud Computing
Rajeev Rastogi, Yahoo! Labs Bangalore
The most visited site on the internet
• 600 million+ users per month
• Super popular properties
  – News, finance, sports
  – Answers, flickr, del.icio.us
  – Mail, messaging
  – Search
Unparalleled scale
• 25 terabytes of data collected each day
  – Over 4 billion clicks every day
  – Over 4 billion emails per day
  – Over 6 billion instant messages per day
• Over 20 billion web documents indexed
• Over 4 billion images searchable
No other company on the planet processes as much data as we do!
Yahoo! Labs Bangalore
• Focus is on basic and applied research
  – Search
  – Advertising
  – Cloud computing
• University relations
  – Faculty research grants
  – Summer internships
  – Sharing data/computing infrastructure
  – Conference sponsorships
  – PhD co-op program
Web Search
What does search look like today?
Search results of the future: Structured abstracts
Example sources: yelp.com, babycenter, epicurious, answers.com, webmd, New York Times, Gawker
Search results of the future: Query refinement
Search results of the future: Rich media
Technologies that are enabling search transformation
• Information extraction (structured abstracts)
• Web page classification (query refinement)
• Multimedia search (rich media)
Information extraction (IE)
• Goal: Extract structured records from Web pages
Attributes extracted: Name, Address, Category, Phone, Price, Map
Multiple verticals
• Business, social networking, video, ….
One schema per vertical, e.g.:
  – Business: Name, Category, Address, Phone, Price
  – Social networking: Name, Title, Education, Connections
  – Video: Posted by, Title, Date, Rating, Views
IE on the Web is a hard problem
• Web pages are noisy
• Pages belonging to different Web sites have different layouts

Web page types: template-based vs. hand-crafted
Template-based pages
• Pages within a Web site are generated using scripts and have very similar structure
  – Can be leveraged for extraction
• ~30% of crawled Web pages
• Information rich, frequently appear in the top results of search queries
• E.g., search query: “Chinese Mirch New York”
  – 9 template-based pages in the top 10 results
Wrapper Induction
Learn: Website pages → Sample pages → Annotate pages → Annotations → Learn wrappers → XPath rules
Extract: Website pages → Apply wrappers (XPath rules) → Records
• Enables extraction from template-based pages
Example
XPath: /html/body/div/div/div/div/div/div/span → Generalize → /html/body//div//span
Filters
• Apply filters to prune multiple candidates that match the XPath expression

XPath: /html/body//div//span
Regex Filter (Phone): ([0-9]{3}) [0-9]{3}-[0-9]{4}
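As a rough illustration of a generalized rule plus filter, here is a toy sketch. The page snippet, the helper name extract_phone, and the exact phone pattern are invented; a real wrapper system would evaluate full XPath expressions over crawled HTML.

```python
# Toy wrapper-style extraction: a generalized rule over-matches (all spans),
# and a regex filter prunes the candidates down to the phone number.
import re
import xml.etree.ElementTree as ET

PAGE = """
<html><body>
  <div><span>Chinese Mirch</span></div>
  <div><div><span>(212) 532 3663</span></div></div>
</body></html>
"""

# Phone filter, roughly matching the slide's pattern
PHONE_RE = re.compile(r"\(\d{3}\) \d{3}[- ]\d{4}")

def extract_phone(html):
    root = ET.fromstring(html)
    # Generalized rule: every span anywhere in the page
    # (standing in for /html/body//div//span)
    candidates = [s.text or "" for s in root.iter("span")]
    # Regex filter keeps only candidates that look like a phone number
    for text in candidates:
        if PHONE_RE.search(text):
            return text
    return None

print(extract_phone(PAGE))
```

The generalized path deliberately over-matches; the cheap regex filter then restores precision.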
Limitations of wrappers
• Won’t work across Web sites due to different page layouts
• Scaling to thousands of sites can be a challenge
  – Need to learn a separate wrapper for each site
  – Annotating example pages from thousands of sites can be time-consuming & expensive
Research challenge
• Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site
• Only annotate pages from a few sites initially as training data
Conditional Random Fields (CRFs)
• Models the conditional probability distribution of label sequence y = y1,…,yn given input sequence x = x1,…,xn:

  P(y|x) = (1/Z(x)) · exp( Σt=1..|x| Σk λk fk(yt-1, yt, x, t) )

  – fk: features, λk: weights
• Choose λk to maximize the log-likelihood of the training data
• Use the Viterbi algorithm to compute the label sequence y with highest probability
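To make the inference step concrete, here is a minimal Viterbi decoder for a linear-chain model. The label set, emission scores, and transition scores are toy values standing in for the weighted feature sums; they are not from the talk.

```python
# Minimal Viterbi: find the highest-scoring label sequence given
# per-position (emission) and label-bigram (transition) scores.
LABELS = ["Name", "Addr"]
# emission[t][label]: score of assigning `label` at position t (toy values)
emission = [{"Name": 2.0, "Addr": 0.5},
            {"Name": 0.2, "Addr": 1.5},
            {"Name": 0.1, "Addr": 2.0}]
# transition[prev][cur]: score of the label bigram (toy values)
transition = {"Name": {"Name": 0.1, "Addr": 1.0},
              "Addr": {"Name": 0.2, "Addr": 0.8}}

def viterbi(emission, transition, labels):
    # best[t][y] = score of the best label prefix ending in y at position t
    best = [{y: emission[0][y] for y in labels}]
    back = []
    for t in range(1, len(emission)):
        best.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda yp: best[t - 1][yp] + transition[yp][y])
            best[t][y] = best[t - 1][prev] + transition[prev][y] + emission[t][y]
            back[t - 1][y] = prev
    # Follow back-pointers from the best final label
    y = max(labels, key=lambda yl: best[-1][yl])
    path = [y]
    for t in range(len(back) - 1, -1, -1):
        y = back[t][y]
        path.append(y)
    return list(reversed(path))

print(viterbi(emission, transition, LABELS))
```

Training the weights (the λk above) is a separate optimization; this sketch covers only decoding.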
CRFs-based IE
Example label sequence for a page: Name, Category, Address, Phone, Noise
• Web pages can be viewed as labeled sequences
• Train a CRF using pages from a few Web sites
• Then use the trained CRF to extract from the remaining sites
Drawbacks of CRFs
• Require too many training examples
• Have been used previously to segment short strings with similar structure
• However, may not work well across Web sites that
  – contain long pages with lots of noise
  – have very different structure
An alternate approach that exploits site knowledge
• Build attribute classifiers for each attribute
  – Use pages from a few initial Web sites
• For each page from a new Web site
  – Segment the page into a sequence of fields (using static repeating text)
  – Use attribute classifiers to assign attribute labels to fields
• Use constraints to disambiguate labels
  – Uniqueness: an attribute occurs at most once in a page
  – Proximity: attribute values appear close together in a page
  – Structural: relative positions of attributes are identical across pages of a Web site
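A toy sketch of the disambiguation step, mirroring the restaurant example: each field comes with candidate labels from the attribute classifiers, and we enumerate assignments satisfying a uniqueness constraint and a Name-before-Category precedence constraint. The field values and candidate sets are invented for illustration.

```python
# Enumerate label assignments and keep those satisfying the constraints.
from itertools import product

fields = ["21 Club", "American", "21 W 52nd St", "(212) 582 7200"]
candidates = [["Category", "Name"],   # classifier is unsure about field 0
              ["Name", "Category"],   # ...and field 1
              ["Address"],
              ["Phone"]]

def consistent(assignment):
    # Uniqueness: each attribute (other than Noise) appears at most once
    attrs = [a for a in assignment if a != "Noise"]
    if len(attrs) != len(set(attrs)):
        return False
    # Precedence: Name must appear before Category on the page
    if "Name" in assignment and "Category" in assignment:
        if assignment.index("Name") > assignment.index("Category"):
            return False
    return True

valid = [a for a in product(*candidates) if consistent(a)]
print(valid)
```

Here the constraints cut the four candidate assignments down to a single consistent labeling; in practice the structural constraint across pages of a site does further pruning.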
Attribute classifiers + constraints example
Page 1: Chinese Mirch | Chinese, Indian | 120 Lexington Avenue, New York, NY 10016 | (212) 532 3663
Page 2: Jewel of India | Indian | 15 W 44th St, New York, NY 10016 | (212) 869 5544
Page 3: 21 Club | American | 21 W 52nd St, New York, NY 10019 | (212) 582 7200

Candidate labels for the fields of Page 3: {Category, Name}, {Name, Noise}, {Address}, {Phone}
Uniqueness constraint: Name
Precedence constraint: Name < Category

Result for Page 3: 21 Club → Name, American → Category, 21 W 52nd St, New York, NY 10019 → Address, (212) 582 7200 → Phone
Other IE scenarios: Browse page extraction
Similar-structured records
IE big picture/taxonomy
• Things to extract from
  – Template-based, browse, hand-crafted pages, text
• Things to extract
  – Records, tables, lists, named entities
• Techniques used
  – Structure-based (HTML tags, DOM tree paths) – e.g., wrappers
  – Content-based (attribute values/models) – e.g., dictionaries
  – Structure + content (sequential/hierarchical relationships among attribute values) – e.g., hierarchical CRFs
• Level of automation
  – Manual, supervised, unsupervised
Web Page Classification: Requirements
• Quality
  – High precision and recall
  – Leverage structured input (links, co-citations) and output (taxonomy)
• Scalability
  – Large numbers of training examples, features, and classes
  – Complex structured input and output
• Cost
  – Small human effort (for labeling of pages)
  – Compact classifier model
  – Low prediction time
Structured Output Learning
• Structured output examples
  – Multi-class
  – Taxonomy
• Naïve approach
  – Separate binary classifier per class
  – Separate classifier for each taxonomy level
• Better approach – single (SVM) classifier
  – Higher accuracy, more efficient
  – Sequential Dual Method (SDM)
    • Visit each example sequentially and solve the associated QP problem (in dual) efficiently
    • Order of magnitude faster
Example taxonomy: Sport → {Cricket → {One-day, Test}, Soccer}; Health → {Fitness, Medicine}
Classification With Relational Information
• Relational information
  – Web page links, structural similarity
• Graph representation
  – Pages as nodes (with labels)
  – Edge weights s(j,k): page similarity, out-link/co-citation existence, etc.
• Classification can be expressed as an optimization problem
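One common way to exploit such a graph is iterative label propagation, which smooths label scores along weighted edges; the talk does not give its exact objective, so this is a hedged sketch with an invented graph and weights.

```python
# Label propagation on a small page graph: unlabeled pages take the
# weighted average of their neighbors' scores; labeled pages are clamped.
# Edge weights s(j,k) encode similarity / link evidence between pages.
edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 0.5}
n = 4
seeds = {0: 1.0, 3: 0.0}   # pages with known labels (1.0 = positive class)

def neighbors(j):
    for (a, b), w in edges.items():
        if a == j:
            yield b, w
        elif b == j:
            yield a, w

scores = [seeds.get(j, 0.5) for j in range(n)]
for _ in range(50):
    new = list(scores)
    for j in range(n):
        if j in seeds:
            continue  # clamp labeled pages
        num = sum(w * scores[k] for k, w in neighbors(j))
        den = sum(w for _, w in neighbors(j))
        new[j] = num / den
    scores = new

print([round(s, 3) for s in scores])
```

The fixed point here is the harmonic solution: each unlabeled page's score is the weighted mean of its neighbors', which is one standard formulation of the optimization mentioned above.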
Multimedia Search
• Availability & consumption of multimedia content on the Internet is increasing
  – 500 billion images will be captured in 2010
• Leveraging content and metadata is important for MM search
• Some big technical challenges:
  – Results diversity
  – Relevance
  – Image classification, e.g., pornography detection
Near-Duplicate Detection
• Multiple near-similar versions of an image exist on the internet
  – scaled, cropped, captioned, small scene change, etc.
• Near-duplicates adversely impact user experience
• Can we use a compact description and dedup in constant time?
• Fourier-Mellin Transform (FMT): translation, rotation, and scale invariant
• Signature generation using a small number of low-frequency coefficients of the FMT
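The full FMT is involved; this toy sketch shows only its first ingredient – the magnitude of the Fourier transform is invariant to (circular) translation – using a naive 2-D DFT on an 8×8 image. Everything here is a simplified illustration, not the production signature.

```python
# Magnitude spectrum of a shifted image equals that of the original,
# so low-frequency magnitudes form a compact, shift-invariant signature.
import cmath

N = 8

def dft2_mag(img):
    out = []
    for u in range(N):
        row = []
        for v in range(N):
            s = sum(img[x][y] * cmath.exp(-2j * cmath.pi * (u * x + v * y) / N)
                    for x in range(N) for y in range(N))
            row.append(abs(s))
        out.append(row)
    return out

def shift(img, dx, dy):
    # circular translation by (dx, dy)
    return [[img[(x - dx) % N][(y - dy) % N] for y in range(N)] for x in range(N)]

img = [[(x * 3 + y * 7) % 5 for y in range(N)] for x in range(N)]
sig1 = dft2_mag(img)
sig2 = dft2_mag(shift(img, 2, 3))

# Compare the low-frequency corner of both signatures
diff = max(abs(sig1[u][v] - sig2[u][v]) for u in range(4) for v in range(4))
print(diff)
```

Rotation and scale invariance come from the log-polar resampling step of the FMT, which this sketch omits.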
Filtering noisy tags to improve relevance
• Measures such as IDF may assign high weights to noisy tags
  – They treat tag sets as a bag-of-words, a random collection of terms
• Boosting tag weights based on their co-occurrence with other tags can filter out noise
Tag weights with idf vs. co-occurrence boosting:

idf:      hinduism 10.2765, hindu 8.6589, finger 8.6259, kerala 7.8524, mother 7.3432, smile 6.7895, child 6.6507, women 6.576, point 6.5535, happy 6.4512, orange 6.0312, india 5.2129, family 4.312
co-occur: child 8.8989, smile 8.8033, happy 8.338, mother 7.982, women 6.0989, family 4.8763, india 4.208, hinduism 2.9307, hindu 2.8871, orange 2.8318, kerala 1.4355, point 0.2292, finger 0
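A hedged sketch of one possible co-occurrence boosting scheme: a tag's idf weight is scaled by how often it co-occurs with the image's other tags across a corpus. The corpus, weights, and formula are invented; the talk does not specify the exact scheme.

```python
# Tags that never co-occur with the image's other tags (like "finger"
# in the slide's example) get driven toward zero weight.
from collections import Counter
from itertools import combinations

corpus = [  # tag sets of other images
    ["child", "smile", "mother"],
    ["child", "happy", "smile"],
    ["mother", "child", "family"],
    ["finger", "point"],
]

cooc = Counter()
for tags in corpus:
    for a, b in combinations(sorted(set(tags)), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def boosted(tags, idf):
    # scale each tag's idf weight by its co-occurrence support
    # within this image's own tag set
    return {t: idf[t] * sum(cooc[(t, o)] for o in tags if o != t)
            for t in tags}

idf = {"child": 6.65, "smile": 6.79, "finger": 8.63}
print(boosted(["child", "smile", "finger"], idf))
```

Despite its high idf, "finger" ends up with zero weight because it never co-occurs with the image's other tags.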
Online Advertising
Sponsored search ads
How it works
Advertiser → Sponsored search engine: “I want to bid $5 on canon camera”, “I want to bid $2 on cannon camera”
• Engine decides when/where to show this ad on search results page
• Advertiser pays only if the user clicks on the ad
Ad selection criterion
• Problem: which ads to show from among the ads containing the keyword?
• Ads with the highest bid may not maximize revenue
• Choose ads with maximum expected revenue
  – Weigh bid amount by click probability
Ad   Bid   Click Prob   Expected Revenue
A1   $4    0.1          $0.4
A2   $2    0.7          $1.4
A3   $3    0.3          $0.9
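The selection rule above takes only a few lines; the numbers reproduce the table.

```python
# Pick the ad maximizing expected revenue = bid * click probability.
ads = [("A1", 4.0, 0.1), ("A2", 2.0, 0.7), ("A3", 3.0, 0.3)]

def pick(ads):
    return max(ads, key=lambda ad: ad[1] * ad[2])

best = pick(ads)
print(best[0])
```

Despite the lowest bid, A2 wins: its expected revenue of $1.40 beats A1's $0.40 and A3's $0.90.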
Contextual Advertising
Contextual ads
• Similar to sponsored search, but ads are shown on general Web pages as opposed to only search pages
  – Advertisers bid on keywords
  – Advertiser pays only if the user clicks; Y! & the publisher share the paid amount
  – Ad matching engine ranks ads based on expected revenue (bid amount * click probability)
Estimating click probability
• Use a logistic regression model:

  p(click | ad, page, user) = 1 / (1 + exp( −Σi wi fi(ad, page, user) ))

• fi: ith feature for ad, page, user
• wi: weight for feature fi
• Training data: ad click logs (all clicks + non-click samples)
• Optimize log-likelihood to learn the weights
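A bare-bones sketch of fitting such a model by gradient ascent on the log-likelihood. The features and data are synthetic; a production system would use sparse features and a batch solver such as L-BFGS.

```python
# Fit w to maximize sum_i [ y_i log p_i + (1-y_i) log (1-p_i) ];
# the gradient per example is (y - p) * f.
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# synthetic (features, click) pairs: feature 0 is predictive of a click
data = []
for _ in range(2000):
    f = [random.random(), random.random()]
    p_true = sigmoid(3.0 * f[0] - 1.5)
    data.append((f, 1 if random.random() < p_true else 0))

w = [0.0, 0.0]
rate = 0.5
for _ in range(200):
    grad = [0.0, 0.0]
    for f, y in data:
        p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)))
        for i in range(2):
            grad[i] += (y - p) * f[i]   # d log-likelihood / d w_i
    for i in range(2):
        w[i] += rate * grad[i] / len(data)

print([round(x, 2) for x in w])
```

The weight on the predictive feature comes out clearly positive, while the irrelevant feature absorbs the model's missing intercept and goes negative.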
Features
• Ad: bid terms, title, body, category, …
• Page: url, title, keywords in body, category, …
• User
  – Geographic (location, time)
  – Demographic (age, gender)
  – Behavioral
• Combine the above to get (billions of) richer features
  E.g.: (apple in ad title) AND (ipod in page body) AND (20 < user age < 30)
• Select the subset that leads to improvement in likelihood
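A small sketch of how such conjunction ("cross") features might be generated from base ad/page/user features; the feature names and crossing scheme are illustrative, not the production format.

```python
# Conjoin one active feature from each source into a richer feature.
from itertools import product

def cross_features(ad, page, user):
    return {" & ".join(combo) for combo in product(ad, page, user)}

ad = {"ad_title:apple", "ad_cat:electronics"}
page = {"page_body:ipod"}
user = {"age:20-30"}

feats = cross_features(ad, page, user)
print(sorted(feats))
```

With thousands of active base features per source, the cross product explodes into billions of conjunctions, which is why the slide's last bullet prunes to the subset that actually improves likelihood.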
Banner ads
• Show Web page with display ads
• Creates brand awareness
How it works
• Engine guarantees 1M impressions
• Advertiser pays a fixed price
  – No dependence on clicks
• Engine does admission control, decides allocation of ads to pages
Advertiser → Banner ad engine: “I want 1M impressions on finance.yahoo.com, gender = male, age = 20-30, during the month of April 2009”
Allocation Example

SUPPLY (Qty, Price), by Gender × Age:
            Age 20-30    Age > 30
  Male      (10M, $20)   (10M, $10)
  Female    (10M, $10)   (10M, $10)

DEMAND (Target, Qty): (Gender=Male, 12M), (Age>30, 12M)

Suboptimal allocation: 6M unallocated at $10 → Value = $60M
Optimal allocation: 6M unallocated at $20 → Value = $120M
Research problem
• Goal: Allocate demands so that the value of unallocated inventory is maximized
• Similar to transportation problem
Transportation problem
Bipartite graph: demand nodes d1, d2, …, di on one side; supply nodes s1, s2, …, sj with prices p1, p2, …, pj on the other; edges xi1, xi2, …, xij connect demand i to the supply regions Ri it targets.

Constraints:
  Σj∈Ri xij ≥ di   (each demand i is satisfied)
  Σi xij ≤ sj      (supply j is not exceeded)

Objective: maximize Σj pj (sj − Σi xij)   (value of unallocated supply)

xij: units of demand i allocated to region j
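The allocation example can be checked by brute force, without an LP solver: the Gender=Male demand can draw on the two Male cells and the Age>30 demand on the two Age>30 cells, and we search integer splits (in units of 1M impressions) maximizing the value of unallocated inventory. The cell indexing is my own encoding of the slide's supply grid.

```python
# Cells: 0 = (Male, 20-30, $20), 1 = (Male, >30, $10),
#        2 = (Female, 20-30, $10), 3 = (Female, >30, $10)
supply = [10, 10, 10, 10]
price = [20, 10, 10, 10]

best_val, best_alloc = -1, None
# Male demand: `a` units to cell 0, 12-a to cell 1
# Age>30 demand: `b` units to cell 1, 12-b to cell 3
for a in range(13):
    for b in range(13):
        used = [a, (12 - a) + b, 0, 12 - b]
        if all(u <= s for u, s in zip(used, supply)):
            val = sum(p * (s - u) for p, s, u in zip(price, supply, used))
            if val > best_val:
                best_val, best_alloc = val, used

print(best_val, best_alloc)
```

This total also counts the never-targeted Female/20-30 cell (10M × $10 = $100M) as unallocated; the slide's $120M figure appears to count only the contested Male/20-30 leftover (6M × $20), so the two are consistent.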
Ads taxonomy
Online Ads:
  – Sponsored search: search pages, keyword targeting, non-guaranteed (NG), CPC
  – Contextual: web pages, keyword targeting, non-guaranteed (NG), CPC
  – Banner: web pages, attribute targeting, guaranteed (G) and non-guaranteed (NG), CPM and CPM/CPC
Major trend: Ads convergence
• Today: separate systems for contextual (CPC) & display (CPM)
• Tomorrow: unified ads marketplace – Y! Ad Exchange (CPC, CPM)
  – Unify contextual & display
  – Increase supply & demand
  – Enable better matching
  – CPC, CPM ads compete
• Advertiser: creates demand; Publisher: creates supply of pages
Research challenge
• Which ad to select between competing CPC and CPM ads?
  – Use eCPM
    • For CPM ads: eCPM = bid
    • For CPC ads: eCPM = bid * Pr(click)
  – Select the ad with max eCPM to maximize revenue
• Problem: the ad with the highest actual eCPM may not get selected
  – eCPMs are “estimated” from historical data, which can differ from actual eCPMs
  – Variance in estimated eCPMs is higher for CPC ads
  – Selection gets biased towards ads with higher variance, as they have a higher probability of over-estimated eCPMs
[Figure: estimated vs. actual eCPM for a CPC ad and a CPM ad – the CPC ad’s estimate has much higher variance]
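A quick simulation of the bias: a CPC ad with noisier eCPM estimates and a slightly *lower* true eCPM still wins the auction a large fraction of the time. The true eCPMs and noise levels are invented for illustration.

```python
# Select by max *estimated* eCPM and count how often the inferior,
# high-variance CPC ad wins.
import random

random.seed(0)

TRUE_CPC, TRUE_CPM = 0.9, 1.0     # actual eCPMs: the CPM ad is better
SD_CPC, SD_CPM = 0.5, 0.1         # CPC estimates have much higher variance

wins = 0
trials = 10000
for _ in range(trials):
    est_cpc = random.gauss(TRUE_CPC, SD_CPC)
    est_cpm = random.gauss(TRUE_CPM, SD_CPM)
    if est_cpc > est_cpm:
        wins += 1

print(wins / trials)
```

If estimates were exact, the CPC ad should never win; with noisy estimates it wins roughly 40% of the auctions, purely because its estimate is more often over-optimistic.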
Cloud Computing
Much of the stuff we do is compute/data-intensive
• Search
  – Index 100+ billion crawled Web pages
  – Build the Web graph, compute PageRank
• Advertising
  – Construct ML models to predict click probability
• Cluster, classify Web pages
  – Improve search relevance, ad matching
• Data mining
  – Analyze TBs of Web logs to compute correlations between (billions of) user profiles and page views
Solution: Cloud computing
• A cloud consists of
  – 1000s of commodity machines (e.g., Linux PCs)
  – A software layer for
    • Distributing data across machines
    • Parallelizing application execution across the cluster
    • Detecting and recovering from failures
• Yahoo!’s software layer is based on the Hadoop open source project
Cloud computing benefits
• Enables processing of massive compute-intensive tasks
• Reduces computing and storage costs
  – Resource sharing leads to efficient utilization
  – Commodity hardware, open source
• Shields application developers from the complexity of building reliability and scalability into their programs
  – In large clusters, machines fail every day
  – Parallel programming is hard
Cloud computing at Yahoo!
• 10,000s of nodes running Hadoop, TBs of RAM, PBs of disk
• Multiple clusters, largest is a 1600 node cluster
Hadoop’s Map/Reduce Framework
• Framework for parallel computation over massive data sets on large clusters
• As an example, consider the problem of creating an index for word search
  – Input: thousands of documents/web pages
  – Output: a mapping of word to document IDs
Input document 1: “Farmer1 has the following animals: bees, cows, goats.”
Input document 2: “Some other animals …”
Output index: Animals: 1, 2, 3, 4, 12; Bees: 1, 2, 23, 34; Dog: 3, 9; Farmer1: 1, 7; …
Hadoop’s Map/Reduce: Index example (contd.)

Input split → Map tasks → intermediate output (sorted) → Shuffle → Reduce tasks

Map (sorted intermediate output):
  Machine 1: Animals: 1, 3; Dog: 3
  Machine 2: Animals: 2, 12; Bees: 23
  Machine 3: Dog: 9; Farmer1: 7
Shuffle (group by word):
  Machine 4: Animals: 1, 3 | Animals: 2, 12 | Bees: 23
  Machine 5: Dog: 3 | Dog: 9 | Farmer1: 7
Reduce (merge):
  Machine 4: Animals: 1, 2, 3, 12; Bees: 23
  Machine 5: Dog: 3, 9; Farmer1: 7
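The word-index job can be sketched with plain Python functions standing in for Hadoop's map and reduce phases (the shuffle is just a group-by-key); the document texts here are a cut-down version of the example.

```python
# Map emits (word, doc_id) pairs; shuffle groups by word;
# reduce merges each group into a sorted posting list.
from collections import defaultdict

docs = {1: "Farmer1 has the following animals bees cows goats",
        3: "the animals dog",
        9: "dog"}

def map_phase(doc_id, text):
    for word in text.split():
        yield word.lower(), doc_id

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, doc_ids):
    return word, sorted(set(doc_ids))

pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = dict(reduce_phase(w, ids) for w, ids in shuffle(pairs).items())
print(index["dog"])
```

In Hadoop, the map and reduce calls run on different machines and the framework performs the shuffle, but the data flow is exactly this.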
Research challenges
Data Distribution and Replication

[Figure: compute nodes in racks 1…n; data blocks for a given job are distributed and replicated across nodes within a rack and across racks]

Challenges:
• Optimize distribution to provide maximum locality
• Optimize replication to provide the best fault tolerance
Job Queues based on priorities and SLAs
[Figure: job queues Q1 (SDS, 40%), Q2 (YST, 35%), …, Qm (ATG, 25%), each holding jobs at levels L1, L2, …, Lm]
Job Scheduling
Challenges:
• Schedule jobs to maximize resource utilization while preserving SLAs
• Schedule jobs to maximize data locality
• Performance modeling
Summary
• The Internet is an exciting place; plenty of research is needed to improve
  – User experience
  – Monetization
  – Scalability
• Search → information extraction, classification, …
• Advertising → click prediction, ad placement, …
• Cloud computing → job scheduling, performance modeling, …
• Solving these problems will require techniques from multiple disciplines: ML, statistics, economics, algorithms, systems, …