3. Why Start the Machine Learning Team at Orbitz?
• Team was created in 2009 with the goal of applying machine
learning techniques to improve the customer experience.
• For example:
– Hotel sort optimization: How can we improve the ranking of
hotel search results in order to show consumers hotels that
more closely match their preferences?
– Cache optimization: Can we intelligently cache hotel rates in
order to optimize the performance of hotel searches?
– Personalization/segmentation: Can we show targeted search
results to specific consumer segments?
4. Data Challenges
• The team immediately faced challenges getting access to data:
– The required analysis depends on access to large
amounts of data on user interaction with the site.
– This data is available in web analytics logs, but required
fields were not available in our data warehouse because of
size considerations.
– Even worse, we had no archive of the data beyond several
days.
– Size constraints aside, it takes considerable time and effort
to get new data added to the data warehouse.
5. New Data Infrastructure to Address These Challenges
• Hadoop provides a solution to these challenges by:
– Providing long-term storage of entire raw dataset without
placing constraints on how that data is processed.
– Allowing us to immediately take advantage of new web
analytics data added to the site.
– Providing a platform for efficient analysis of data, as well as
preparation of data for input to external processes for further
analysis.
• Hive was added to the infrastructure to provide structure over
the prepared data, facilitating ad-hoc queries and selection of
specific data sets for analysis.
• Data stored in Hive not only supports machine learning efforts,
but also provides metrics to analysts not available through
other sources.
6. New Data Infrastructure – Cont’d
• Hadoop and Hive are now being used by the machine learning
team to:
– Extract data from logs for hotel sort and cache optimization
analyses.
– Distribute complex cross-validation and performance
evaluation operations.
– Extract data for clustering.
• Hadoop and Hive have also gained rapid adoption in the
organization beyond the machine learning team: evaluating
page download performance, searching production logs,
keyword analysis, etc.
7. Use Case – Hotel Cache Optimization
Overview:
Search methodology:
• Subset of total properties in a location (1 page at a time).
• Get “just enough” information to present to consumers.
Caching:
• Reduces impact on suppliers (maintain “look-to-book” ratio).
• Reduces latency.
• Increases “coverage.”
Optimization Goal:
Improve the customer experience (reduce latency, increase
coverage) when searching for hotel rates while controlling impact
on suppliers (maintain look-to-book).
8. Hotel Cache Optimization – Early Attempts
Early approaches were well-intentioned but were not driven by analysis of
the available data. For example:
Theory: High amount of thrashing leads to eviction of more useful cache entries.
Attempted Solution: Increase cache size.
Result: No increase in measured coverage.
Problem: No actual analysis on required cache size.
Theory: Locally managed inventory represents “free” information and can be
requested without limit to improve coverage.
Attempted Solution: Don’t cache locally managed inventory. Increase the amount
of local inventory requested with each user search.
Result: No increase in measured coverage.
Problem: Locally managed inventory doesn’t represent a large percentage of total
inventory and is already highly preferenced.
9. Hotel Cache Optimization – Data Driven Approaches
Data Driven Approaches:
Traffic Partitioning: Identify the subset of traffic that is most
efficient and optimize that subset through prefetching and
increased bursting.
TTL Optimization: Use historic logs of availability and rate
change information to predict volatility of hotel rates and
optimize cache TTL.
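
One way to picture TTL optimization is to derive a per-hotel TTL from how often that hotel's rate was seen to change in historic logs. The sketch below is only an illustration under stated assumptions, not the approach the team actually used: the function name `suggest_ttl`, the use of the median gap between changes as the volatility measure, the `safety` factor, and the `min_ttl`/`max_ttl` bounds are all hypothetical.

```python
from statistics import median

def suggest_ttl(change_times, min_ttl=300, max_ttl=86400, safety=0.5):
    """Suggest a cache TTL (seconds) for one hotel.

    `change_times` holds timestamps, in seconds, at which the hotel's rate
    or availability was observed to change. All parameters are illustrative
    assumptions, not values from the presentation.
    """
    if len(change_times) < 2:
        # Not enough history to estimate volatility: fall back to the
        # shortest TTL (an assumed conservative default).
        return min_ttl
    times = sorted(change_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    # Expire entries after a fraction of the typical time between changes.
    ttl = safety * median(gaps)
    # Clamp to the allowed range.
    return int(max(min_ttl, min(max_ttl, ttl)))

# Hypothetical hotels: one whose rate changes every 10-20 minutes,
# one whose rate changes every one to two days.
volatile_hotel = [0, 600, 1200, 2400]
stable_hotel = [0, 86400, 259200]
print(suggest_ttl(volatile_hotel), suggest_ttl(stable_hotel))
```

Under these assumptions a volatile hotel gets a short TTL and a stable hotel gets a long one, which is the trade-off between stale rates and extra supplier requests that the slide describes.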
10. Hotel Cache Optimization – Traffic Distribution
[Chart: traffic distribution. Series: Queries, Searches, Reverse Running Total (Queries), Reverse Running Total (Searches). X-axis: 1 to 20; y-axis: 0% to 100%. Labeled data points: 71.67%, 34.30%, 31.87%, 2.78%.]
• 72% of queries are singletons and make up nearly a third of total
search volume.
• A small number of queries (3%) make up more than a third
of search volume.
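
The split between singleton and repeated queries can be measured from a search log by counting how often each distinct query occurs. The following is a minimal sketch rather than the team's actual job: it assumes each search has already been reduced to a hashable query key (the tuples below are invented), and `traffic_distribution` is a hypothetical helper name.

```python
from collections import Counter

def traffic_distribution(searches):
    """Return {repeat_count: (pct_of_queries, pct_of_searches)}.

    `searches` is an iterable of query keys, one per search performed.
    A "query" is a distinct key; a "search" is one occurrence of it.
    """
    per_query = Counter(searches)            # query -> number of searches
    by_repeat = Counter(per_query.values())  # repeat count -> number of queries
    total_queries = len(per_query)
    total_searches = sum(per_query.values())
    dist = {}
    for repeat, n_queries in sorted(by_repeat.items()):
        pct_queries = 100.0 * n_queries / total_queries
        pct_searches = 100.0 * n_queries * repeat / total_searches
        dist[repeat] = (pct_queries, pct_searches)
    return dist

# Invented toy log: (market, check-in date, nights) per search.
log = (
    [("CHI", "2010-06-04", 2)] * 5
    + [("NYC", "2010-06-11", 1)] * 2
    + [("LAS", "2010-07-02", 3), ("SFO", "2010-06-18", 2), ("MIA", "2010-08-06", 4)]
)
dist = traffic_distribution(log)
for repeat, (pct_q, pct_s) in dist.items():
    print(f"repeated {repeat}x: {pct_q:.1f}% of queries, {pct_s:.1f}% of searches")
```

In this toy log the singletons (repeat count 1) are 60% of queries but only 30% of searches, the same shape of skew the slide reports at much larger scale.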
11. Optimize Hotel Cache – Traffic Partitioning
Evaluate possible mechanisms for determining the most
frequent queries.
Favor mechanisms that give a high search/query ratio for
the greatest percentage of search volume.
Test for stability of each mechanism across multiple time periods.
Partition Strategy | Description                                           | Pct Queries | Pct Searches | Searches/Query
Baseline           | All traffic                                           | 100.00%     | 100.00%      | 2.19
Top 50             | Top 50 searched markets                               | 14.88%      | 26.76%       | 3.94
Heuristic          | Top 50 searched markets, weekend stay within 1 month. | 0.87%       | 8.52%        | 21.4
Enumeration        | Queries repeated 5 or more times.                     | 3.45%       | 28.80%       | 18.29
Prediction         | TBD                                                   | TBD         | TBD          | TBD
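
The three columns above can be computed for any candidate partition, and the stability test amounts to choosing the partition in one period and measuring it in another. The sketch below illustrates this for an enumeration-style partition. It is an assumption-laden example, not the team's code: `frequent_queries`, `partition_metrics`, and the two toy weeks of single-letter query keys are hypothetical.

```python
from collections import Counter

def frequent_queries(searches, min_repeats=5):
    """Enumeration-style partition: queries seen at least `min_repeats` times."""
    counts = Counter(searches)
    return {query for query, n in counts.items() if n >= min_repeats}

def partition_metrics(searches, partition):
    """Compute Pct Queries, Pct Searches and Searches/Query for `partition`."""
    counts = Counter(searches)
    part_counts = [n for query, n in counts.items() if query in partition]
    part_queries = len(part_counts)
    part_searches = sum(part_counts)
    return {
        "pct_queries": 100.0 * part_queries / len(counts),
        "pct_searches": 100.0 * part_searches / sum(counts.values()),
        "searches_per_query": part_searches / part_queries if part_queries else 0.0,
    }

# Invented logs for two periods; each letter stands for one distinct query.
week1 = ["A"] * 6 + ["B"] * 5 + ["C"] * 2 + ["D"]
week2 = ["A"] * 7 + ["B", "C", "E"]

partition = frequent_queries(week1)              # chosen from week 1 only
in_sample = partition_metrics(week1, partition)
out_of_sample = partition_metrics(week2, partition)
print(in_sample)
print(out_of_sample)
```

Comparing `in_sample` with `out_of_sample` shows whether a partition chosen from past traffic still captures a large share of searches with few queries in a later period.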
12. Conclusions and Lessons Learned
• Start with a manageable problem (ease of measuring success,
availability of data, etc.)
• Avoid thinking of machine learning team as an R&D
organization.
• Instead, foster machine learning approaches throughout the
organization:
– Embed resources on actual feature teams.
– Machine learning study groups, etc.