Machine Learning at Orbitz Robert Lancaster and Jonathan Seidman Strata 2011 February 02 | 2011


Machine Learning at Orbitz

Robert Lancaster and Jonathan Seidman

Strata 2011

February 02 | 2011

page 2

Launched: 2001, Chicago, IL

Why Start the Machine Learning Team at Orbitz?

• Team was created in 2009 with the goal of applying machine learning techniques to improve the customer experience.

• For example:

– Hotel sort optimization: How can we improve the ranking of hotel search results in order to show consumers hotels that more closely match their preferences?

– Cache optimization: can we intelligently cache hotel rates in order to optimize the performance of hotel searches?

– Personalization/segmentation: can we show targeted search results to specific consumer segments?

page 3

Data Challenges

• The team immediately faced challenges getting access to data:

– The required analysis needs access to large amounts of data on user interaction with the site.

– This data is available in web analytics logs, but required fields were not available in our data warehouse because of size considerations.

– Even worse, we had no archive of the data beyond several days.

– Size constraints aside, adding new data to the data warehouse takes considerable time and effort.

page 4

New Data Infrastructure to Address These Challenges

• Hadoop provides a solution to these challenges by:

– Providing long-term storage of entire raw dataset without placing constraints on how that data is processed.

– Allowing us to immediately take advantage of new web analytics data added to the site.

– Providing a platform for efficient analysis of data, as well as preparation of data for input to external processes for further analysis.

• Hive was added to the infrastructure to provide structure over the prepared data, facilitating ad-hoc queries and selection of specific data sets for analysis.

• Data stored in Hive not only supports machine learning efforts, but also provides metrics to analysts not available through other sources.

page 5

New Data Infrastructure – Cont’d

• Hadoop and Hive are now being used by the machine learning team to:

– Extract data from logs for hotel sort and cache optimization analyses.

– Distribute complex cross-validation and performance evaluation operations.

– Extract data for clustering.

• Hadoop and Hive have also gained rapid adoption in the organization beyond the machine learning team: evaluating page download performance, searching production logs, keyword analysis, etc.

page 6

Use Case – Hotel Cache Optimization

Overview:

Search methodology:

• Subset of total properties in a location (1 page at a time).

• Get “just enough” information to present to consumers.

Caching:

• Reduces impact to suppliers (maintain “look-to-book” ratio).

• Reduces latency.

• Increases “coverage.”

Optimization Goal:

Improve the customer experience (reduce latency, increase coverage) when searching for hotel rates while controlling impact on suppliers (maintain look-to-book).

page 7

Hotel Cache Optimization – Early Attempts

Early approaches were well-intentioned, but were not driven by analysis of the available data. For example:

Theory: High amount of thrashing leads to eviction of more useful cache entries.

Attempted Solution: Increase cache size.

Result: No increase in measured coverage.

Problem: No actual analysis on required cache size.

Theory: Locally managed inventory represents “free” information and can be requested without limit to improve coverage.

Attempted Solution: Don’t cache locally managed inventory. Increase the amount of local inventory requested with each user search.

Result: No increase in measured coverage.

Problem: Locally managed inventory doesn’t represent a large percentage of total inventory and is already highly preferenced.

page 8

Hotel Cache Optimization – Data Driven Approaches

Data Driven Approaches:

Traffic Partitioning: Identify the subset of traffic that is most efficient and optimize that subset through prefetching and increased bursting.

TTL Optimization: Use historic logs of availability and rate change information to predict volatility of hotel rates and optimize cache TTL.
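As a rough illustration of the TTL idea only: one could treat each hotel's rate changes as a Poisson process, estimate the change rate from historic logs, and pick the longest TTL that keeps the chance of serving a stale rate under a target. The Poisson model, the function names, the 24-hour cap, and the sample numbers are all assumptions made for this sketch; the slides do not say which volatility model was used.

```python
import math


def estimate_change_rate(num_changes, observed_hours):
    """Average number of rate changes per hour over the logged window."""
    if observed_hours <= 0:
        raise ValueError("observed_hours must be positive")
    return num_changes / observed_hours


def ttl_for_staleness(change_rate, max_stale_prob, max_ttl_hours=24.0):
    """Longest TTL (hours) with P(rate changes within TTL) <= max_stale_prob.

    Under the assumed Poisson model, P(change within t) = 1 - exp(-rate * t),
    so t = -ln(1 - p) / rate. The result is capped at max_ttl_hours.
    """
    if not 0.0 < max_stale_prob < 1.0:
        raise ValueError("max_stale_prob must be between 0 and 1")
    if change_rate <= 0:
        return max_ttl_hours  # no observed changes: fall back to the cap
    ttl = -math.log(1.0 - max_stale_prob) / change_rate
    return min(ttl, max_ttl_hours)


# Hypothetical hotels: (rate changes observed, hours of log history).
history = {"volatile_hotel": (48, 168.0), "stable_hotel": (2, 168.0)}
ttls = {
    name: ttl_for_staleness(estimate_change_rate(changes, hours), 0.10)
    for name, (changes, hours) in history.items()
}
```

With these made-up inputs the volatile hotel gets a TTL of well under an hour while the stable hotel gets several hours, which is the behavior a volatility-aware TTL is meant to produce.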

page 9

Hotel Cache Optimization – Traffic Distribution

page 10

A small number of queries (3%) make up more than a third of search volume.
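A concentration figure like this can be measured from a search log with a few lines of code. The sketch below assumes that a "query" is a distinct set of search parameters and a "search" is one occurrence of it in the log; the function name, the query keys, and the toy log are hypothetical.

```python
from collections import Counter


def share_of_searches(search_log, top_fraction):
    """Fraction of all searches generated by the most frequent
    `top_fraction` of distinct queries.

    `search_log` holds one query key per search performed.
    """
    if not search_log:
        raise ValueError("search_log must not be empty")
    counts = Counter(search_log)
    ranked = sorted(counts.values(), reverse=True)
    # Always keep at least one query so tiny logs still give an answer.
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / len(search_log)


# Toy log: one very popular query plus many one-off queries.
log = ["chicago|fri-sun"] * 40 + [f"query-{i}" for i in range(60)]
top_share = share_of_searches(log, 0.03)
```

In this toy log the top 3% of distinct queries account for 41 of 100 searches; the numbers are invented and only show the shape of the calculation.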

Optimize Hotel Cache – Traffic Partitioning

Evaluate possible mechanisms for determining most frequent queries.

Favor mechanisms that give a high search/query ratio for the greatest percentage of search volume.

Test for stability of mechanism across multiple time periods.

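Partition Strategy | Description                                          | Pct Queries | Pct Searches | Searches/Query
-------------------|------------------------------------------------------|-------------|--------------|---------------
Baseline           | All traffic                                          | 100.00%     | 100.00%      | 2.19
Top 50             | Top 50 searched markets                              | 14.88%      | 26.76%       | 3.94
Heuristic          | Top 50 searched markets, weekend stay within 1 month | 0.87%       | 8.52%        | 21.4
Enumeration        | Queries repeated 5 or more times                     | 3.45%       | 28.80%       | 18.29
Prediction         | TBD                                                  | TBD         | TBD          | TBD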
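The three columns of the table, and the stability test across time periods, can be sketched as follows. This is a minimal illustration under stated assumptions: a "query" is taken to be a distinct search key, the enumeration threshold of 5 comes from the table, and Jaccard overlap is just one possible stability measure chosen here, not necessarily the one the team used. All function names and the toy logs are hypothetical.

```python
from collections import Counter


def evaluate_partition(search_log, in_partition):
    """Return (pct_queries, pct_searches, searches_per_query).

    `search_log` holds one query key per search; `in_partition` is the
    set of query keys selected by the partition strategy.
    """
    counts = Counter(search_log)
    selected = {q: n for q, n in counts.items() if q in in_partition}
    if not selected:
        return 0.0, 0.0, 0.0
    searches = sum(selected.values())
    pct_queries = 100.0 * len(selected) / len(counts)
    pct_searches = 100.0 * searches / len(search_log)
    return pct_queries, pct_searches, searches / len(selected)


def enumerate_frequent(search_log, min_repeats=5):
    """'Enumeration' strategy: queries seen at least `min_repeats` times."""
    return {q for q, n in Counter(search_log).items() if n >= min_repeats}


def stability(partition_a, partition_b):
    """Jaccard overlap of partitions built from two time periods."""
    union = partition_a | partition_b
    return len(partition_a & partition_b) / len(union) if union else 1.0


# Toy logs for two periods.
week1 = ["a"] * 6 + ["b"] * 5 + ["c", "d", "e", "f"]
week2 = ["a"] * 7 + ["b"] * 2 + ["g"] * 5 + ["h"]
p1 = enumerate_frequent(week1)
p2 = enumerate_frequent(week2)
pct_queries, pct_searches, per_query = evaluate_partition(week1, p1)
overlap = stability(p1, p2)
```

The columns are linked: a partition's searches/query equals the baseline ratio times Pct Searches divided by Pct Queries. For the Top 50 row, 2.19 × 26.76 / 14.88 ≈ 3.94, which matches the table.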

page 11

Conclusions and Lessons Learned

• Start with a manageable problem (ease of measuring success, availability of data, etc.)

• Avoid thinking of the machine learning team as an R&D organization.

• Instead, foster machine learning approaches throughout the organization:

– Embed resources on actual feature teams.

– Machine learning study groups, etc.

page 12