2. Agenda
• Overview of big data analytics
• Insights of big data modeling
• A case for preference profiles
– Recommender for a wine seller
• Cases for behavioral profiles for predictive models
– Yahoo mail retention
– Tribal Fusion display ads impression optimization
– University of Phoenix student retention
– University of Phoenix lead optimization
• Case of Ask.com SEM algorithms
3. Daqing Zhao, PhD
• Big Data scientist with deep domain knowledge
• Academic training
– Analyzed molecular spectra on Cray supercomputers
– Determined, modeled, simulated molecular motions in 3D space
• Enjoys working with large data and large-scale computing
• At Bank of America, led the development of a risk
management system for a global portfolio
• Worked on computational Internet marketing since 1999
4. Big data, Big Opportunities
• Thanks to Moore’s law for CPU, storage, and network connections
• Too much data, too little knowledge
• Data and analytics have changed every field
• From science and government to commerce
5. Things computers are good at
• Computers have perfect memory
– Every page view, click, transaction, every event,…
• Good at finding a needle in a haystack
– Identify clickers of any particular web page at some time
– E.g., target abandoned shopping carts with promotions
• Good at trade offs among a large number of factors
– Female, 25-34, with child < 5, Asian, earning $30K, rent,
divorced, live in Calif., some college, Walmart, visits
Coupons.com, Monster.com, drive Camry, …
– Buyer of X or not?
6. Computers make it possible
• Given data, optimize models and parameters
– Identify reproducible patterns in the data
– Provide a simple picture, predict events in the future
• Simulations generate future events, given
assumptions and the current state
– Given a set of models, what future scenarios will look like
under a given set of conditions (“what ifs”)
– Like a flight simulator
• Crowd sourcing from big data and big data modeling
– Define similarity, translations, quality, relevance
7. Computers can’t do everything
• Data often have issues before being well analyzed
• Data often have no taxonomy and context
• In free-format data, relevant information needs to be
extracted
• Computers don’t define targets, construct predictors
• Don’t know if critical predictive factors are missing
• Computers don’t have common sense
• Computers don’t have goals to achieve
8. Modeling needs to scale
• Traditional predictive models take a long time to build
– Small data sets, samples expensive to collect
• Now data are cheap and models may degrade in weeks
– The dimension of predictors is very large
– The number of categories is large
• Human interactive model building not scalable
• Reasons for target events are complex
• Without detailed analysis, it is unclear what drives the
event
• We need to rely on “out of sample testing” and “off the
shelf” modeling
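Out-of-sample testing can be sketched as follows: fit on one part of the data, then score on rows the model never saw. The synthetic data, the one-feature threshold "model" and all function names are illustrative assumptions, not taken from the talk.

```python
import random

def train_threshold(rows):
    """Pick the cut-off on x that best separates the labels in the training rows."""
    best_cut, best_acc = None, -1.0
    for cut in sorted({x for x, _ in rows}):
        acc = sum((x >= cut) == y for x, y in rows) / len(rows)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut

def accuracy(cut, rows):
    """Share of rows where the rule 'x >= cut' matches the label."""
    return sum((x >= cut) == y for x, y in rows) / len(rows)

def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle once, then hold back a fraction of rows for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Synthetic data (assumption): label is True when x > 0.5, with 10% label noise.
rng = random.Random(42)
data = []
for _ in range(1000):
    x = rng.random()
    y = (x > 0.5) != (rng.random() < 0.1)
    data.append((x, y))

train, test = holdout_split(data)
cut = train_threshold(train)
in_sample = accuracy(cut, train)
out_of_sample = accuracy(cut, test)  # the number to trust when judging the model
```

The in-sample score is optimistic by construction; the held-out score is what tells you whether an off-the-shelf model generalizes.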
9. Big Data problem
• Data size larger than what databases can handle
• Terabytes of data may take hours just to scan
• A solution requires a cloud of servers with local
storage
– Read, process and write intermediate results in
parallel
– Aggregate at the end
• Cloud computing builds models at scale
• Clouds often scale linearly with the number of servers
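The "process partitions in parallel, aggregate at the end" pattern can be sketched in a few lines. Here threads stand in for servers and in-memory lists stand in for local storage; the partition contents and function names are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_clicks(partition):
    """Map step: each worker scans only its own partition."""
    counts = Counter()
    for page, clicked in partition:
        if clicked:
            counts[page] += 1
    return counts

def aggregate(partials):
    """Reduce step: merge the intermediate results at the end."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical event log, already split across three "servers".
partitions = [
    [("home", True), ("cart", False), ("home", True)],
    [("cart", True), ("help", True)],
    [("home", False), ("cart", True)],
]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_clicks, partitions))

clicks = aggregate(partials)
```

Because each worker touches only its own partition, adding servers adds scan capacity, which is why this style of job tends to scale close to linearly.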
10. Cloud computing
• We built a SAS cloud at University of Phoenix
– I have an invited SAS talk available at SAS web site
– We can process billions of impressions in minutes
• Hadoop clouds are used widely
– Open source software
– Commodity servers and storage
• Clouds may have 100Ks of servers
– Find needle in a haystack in milliseconds
– Model computations that would usually take years now
finish in minutes
11. Example: Google Data Centers
• Estimated 500K commodity servers
• Data centers near the Columbia River at The Dalles, Oregon
12. CUSTOMER PREFERENCE PROFILES
In use from 1999 to 2001
13. 1:1 email case
• Weekly emails recommending 6 wines
• Inventory of 20K+ wines
• Wine.com had clean data
– Purchase, time, product, spend
– Wine color, varietal, body, acidity, oak, tannin, sweetness,
complexity, price, producer, region
– Email response
– Self reported preferences and demographics
– Web behavior clusters
– No explicit customer rating data of the kind Netflix has
• Most customers have one or two data points
14. Dynamic Newsletter
Dear “First name”,
Welcome to our Newsletter. Celebrate holidays with family and
friends with a bottle of some wine.
History of some wine. Tips on wine tasting. Recipes using
wine. Health benefits of wine. Wine drinking is socially
fashionable, culturally sophisticated, etc., etc.
Sincerely,
Signature
[Text blurb] [Text blurb] [Text blurb]
[Text blurb] [Text blurb] [Text blurb]
(Callouts: dynamic XML template; clicks tracked and linked to purchases)
15. Wine direct marketing
• Goal: lift purchase revenue
• Present wines customers are more likely to buy
• A/B testing against weekly selections by
merchandisers
• Concentrate on long-term performance
– Over many email campaigns
• Focus on most important predictor – behavior profile
16. Build similarity of all wines
• Decompose purchases into product attributes
– Even 1 click can generate a taste profile
• When a wine goes out of stock, its profile info is still usable
• New inventory immediately mapped to existing profile
• Build an implicit and explicit profile of customer
• Add association rules, “customers who bought these also
bought…”
• For new customers, augment profile with nearest
neighbors who had more purchases as “mentors”
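One way to picture the profile building above: average the attribute vectors of the wines a customer bought, then blend a sparse profile with its most similar well-established one (the "mentor"). The attribute scores, wine names, blend weight and function names are all hypothetical; the talk does not give the actual formulas.

```python
import math

# Hypothetical attribute scores per wine: body, acidity, oak, tannin, sweetness (0..1).
WINES = {
    "cab_a":    [0.9, 0.4, 0.8, 0.9, 0.1],
    "cab_b":    [0.8, 0.5, 0.7, 0.8, 0.1],
    "riesling": [0.3, 0.8, 0.0, 0.1, 0.7],
}

def taste_profile(purchases):
    """Decompose purchases into attributes: average the vectors of the wines bought."""
    vectors = [WINES[w] for w in purchases]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def augment_with_mentor(new_profile, veteran_profiles, weight=0.5):
    """Blend a sparse profile with the most similar profile that has more history."""
    mentor = max(veteran_profiles, key=lambda p: cosine(new_profile, p))
    return [(1 - weight) * n + weight * m for n, m in zip(new_profile, mentor)]

newcomer = taste_profile(["cab_a"])                      # a single data point
veterans = [
    taste_profile(["cab_a", "cab_b", "cab_b"]),          # red-wine buyer
    taste_profile(["riesling", "riesling"]),             # white-wine buyer
]
blended = augment_with_mentor(newcomer, veterans)
```

Since the profile lives in attribute space rather than product space, a newly stocked wine can be scored against it as soon as its attributes are known.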
17. Customer experience is key
• Recommend similar wines
– Based on cosine distance to taste profile, price, and
text mining on producer name, region, country
– Shuffle among higher scored wines
– Repeated campaigns take care of prediction errors
– Dedup recent recommendations and purchases
– Use decaying memory function and factor in
seasonality
• Reinforce learning
• Use simulations to ensure quality
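The recommendation steps above (similarity scoring, dedup, shuffling among top scorers, decaying memory) might be combined roughly like this. The three-attribute vectors, the half-life, the pool size and all names are assumptions chosen for illustration; price, text mining and seasonality are left out.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def decayed_profile(purchases, half_life_days=180.0):
    """Decaying memory: weight each purchase's attribute vector by its recency."""
    total = [0.0] * len(purchases[0][0])
    weight_sum = 0.0
    for vector, age_days in purchases:
        w = 0.5 ** (age_days / half_life_days)
        weight_sum += w
        total = [t + w * v for t, v in zip(total, vector)]
    return [t / weight_sum for t in total]

def recommend(profile, inventory, exclude, k=2, pool=3, seed=0):
    """Score wines, drop recent recommendations and purchases, shuffle the top pool."""
    scored = sorted(
        ((cosine(profile, vec), name)
         for name, vec in inventory.items() if name not in exclude),
        reverse=True,
    )
    top = [name for _, name in scored[:pool]]
    random.Random(seed).shuffle(top)
    return top[:k]

# Hypothetical inventory: (body, tannin, sweetness).
inventory = {
    "red1": [0.9, 0.9, 0.1], "red2": [0.8, 0.8, 0.1],
    "red3": [0.9, 0.7, 0.2], "red4": [0.7, 0.9, 0.1],
    "white1": [0.3, 0.1, 0.7], "white2": [0.2, 0.1, 0.9],
}
# A recent red purchase and a two-year-old white purchase.
history = [([0.9, 0.9, 0.1], 10), ([0.3, 0.1, 0.7], 720)]
profile = decayed_profile(history)
picks = recommend(profile, inventory, exclude={"red1"})
```

Shuffling within the top pool keeps repeated weekly emails from showing the same wines, and lets later campaigns correct for any single week's prediction errors.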
18. Learnings and insights
• Our 1:1 emails increased revenue up to 300%
• Outperformed by 40% over a 2+ year period
• Purchase data most important
– Putting money where your mouth or mouse is
• Email response data also predictive
• Self reported preferences are different
from actions
– Talk the talk versus walk the walk
• Aggregated web segments least useful
20. Email retention models
• New email subscribers, 40% never return
– High “infant” mortality rate
– Activity immediately after sign-up correlates with
normal retention
– Frequent views of certain pages, such as Help and
the Junk folder, are predictive
– Find actionable retention drivers, such as sending
welcome emails, improving customer service, user
experience, etc.
21. Online edu retention models
• Students have a low persistence rate until they have
completed several courses
– Depends on major, credits finished, demographics,
socioeconomic status, first-generation student status
– Also on lead source, lead form entries, etc.
• We set up tracking of data including search,
display, landing page, home site, call center,
enrollment, class finishes, …, for a 360-degree view
• Billions of events per month
22. Lead conversion models
• From impression to sign-up as a lead is just 1/3 of
the student life cycle
• Leads have very low enrollment rates
– Takes 3 to 6 months to enroll
– Leads that are easy to convert may also be more likely to drop out
• Need student performance data over a long time to
assess
– Trade off between statistics and relevance
– Use lifetime values, brand values, cost of service to
determine media allocation
23. Display ad conversion models
• Advertisers have different conversion drivers
– Publisher, channel, geo, behavior, demographic data,
data append, session depth, etc.
– Require an array of predictive models on conversion
to work together with an auction engine
• Billions of display ads
– Individual and event information
• Too many models, too little time for humans to
build them
24. Unexpected data challenges
• Task: predict future enrollment and revenue
• Problem: more than one definition of each metric
– Made by past business analysts, using reasonable
business rules
• Some rules are built into a BI reporting product
– FP&A watches them every month as the “truth” used to
monitor performance and guide the Street
• With IT/BI turnover, rules change over time
– Few current staff know or can articulate the rules
25. Solution with data issues
• Without the rules, cannot calculate their version of
enrollment and revenue from student financial
transaction data
• After several meetings, still no correct rules
• We then modeled time series of reported data
– One-time data errors are diluted
– Rule changes from long ago also carry less weight
• We were able to predict customers and revenue 3
to 6 months ahead
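The slide does not say which time-series model was used. As one possible illustration of "old errors and old rule changes carry less weight", here is a linear trend fitted by weighted least squares, with weights that decay with the age of each month. The decay rate, the series and the function name are assumptions.

```python
def weighted_trend_forecast(series, horizon, decay=0.9):
    """Fit y = a + b*t by weighted least squares; weight = decay ** age_in_months."""
    n = len(series)
    times = range(n)
    weights = [decay ** (n - 1 - t) for t in times]
    sw = sum(weights)
    mean_t = sum(w * t for w, t in zip(weights, times)) / sw
    mean_y = sum(w * y for w, y in zip(weights, series)) / sw
    cov = sum(w * (t - mean_t) * (y - mean_y)
              for w, t, y in zip(weights, times, series))
    var = sum(w * (t - mean_t) ** 2 for w, t in zip(weights, times))
    slope = cov / var
    intercept = mean_y - slope * mean_t
    # Extrapolate the fitted line for the next `horizon` months.
    return [intercept + slope * (n - 1 + h) for h in range(1, horizon + 1)]

# Hypothetical reported metric: steady growth plus a one-time error long ago.
clean = [100 + 10 * t for t in range(24)]
reported = clean[:]
reported[2] += 380  # one-time reporting error in month 2

weighted = weighted_trend_forecast(reported, horizon=3, decay=0.8)
unweighted = weighted_trend_forecast(reported, horizon=3, decay=1.0)
```

With decay below 1 the old error barely moves the forecast, while the unweighted fit is pulled noticeably off the true trend.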
26. Data most important
• In modeling, finding the key data is most important
– Identify the smoking gun
• Data transformations
– PageRank is a game-changing data transformation
– Wine.com case, wineRank
– Social graph is a key data transformation for credit
card fraud detection
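To make "data transformation" concrete, the textbook PageRank power iteration below turns a raw link graph into one score per node that a model can then use as a predictor. The tiny graph is made up, and nothing here describes how wineRank was built, which the talk does not explain.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict {node: [nodes it links to]}; returns {node: score}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new[t] += damping * rank[node] / len(nodes)
        rank = new
    return rank

# Hypothetical link graph: three nodes point at "c", and "c" points back at "a".
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
```

The raw data is a list of edges; the transformed data is a single importance score per node.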
27. Modeling can go wrong
• Leakage in lead scoring model
– For example, use lead source to predict
conversion, when certain values of the field were
populated only for converters
• Display ads conversion model
– Construct data set by taking all converters and a
sample of non-converters
– Predict conversion using page view profiles, etc.
– Problem: sample of non-converters included
customers who had no impressions of the ad
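A minimal sketch of the fix for the sampling problem above: restrict the population to users who actually saw the ad before keeping converters and sampling non-converters. The field names (`impressions`, `converted`) and the sample rate are hypothetical.

```python
import random

def build_training_set(users, sample_rate=0.5, seed=0):
    """All converters plus a sample of non-converters, among exposed users only."""
    rng = random.Random(seed)
    # Users with no impressions could never have converted because of the ad,
    # so they are excluded before any sampling takes place.
    exposed = [u for u in users if u["impressions"] > 0]
    converters = [u for u in exposed if u["converted"]]
    non_converters = [u for u in exposed if not u["converted"]]
    sampled = [u for u in non_converters if rng.random() < sample_rate]
    return converters + sampled

# Hypothetical user table.
users = [
    {"id": 1, "impressions": 3, "converted": True},
    {"id": 2, "impressions": 0, "converted": False},  # never saw the ad
    {"id": 3, "impressions": 5, "converted": False},
    {"id": 4, "impressions": 1, "converted": False},
    {"id": 5, "impressions": 2, "converted": True},
    {"id": 6, "impressions": 0, "converted": False},  # never saw the ad
]
training = build_training_set(users)
```

Without the exposure filter, the model partly learns "was shown the ad" instead of what drives conversion.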
28. Modeling lessons
• Yahoo DSL subscribers with one year contract
• If you try to model month to month retention, you
find high retention rate
– Due to contracts and penalties
• The correct way is to model retention at contract
expiry, using only the 1/12 of customers whose contracts are expiring
• For Yahoo email, if you look at quarter-by-quarter
retention, you find that those acquired early in the
first quarter have a lower retention rate
– Because those customers have more time to churn
• A correct way is to use survival analysis
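Survival analysis handles the "more time to churn" problem by treating customers who have not churned yet as censored rather than as retained. A small Kaplan-Meier estimator, written from the standard formula with made-up data, shows the idea; the function name and inputs are illustrative.

```python
def kaplan_meier(durations, churned):
    """Return [(time, survival probability)] at each observed churn time.

    durations: months each customer was observed.
    churned:   True if the customer churned at that time, False if still
               active when observation ended (censored).
    """
    events = sorted(zip(durations, churned))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = sum(1 for d, e in events if d == t and e)
        leaving = sum(1 for d, _ in events if d == t)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Churned and censored customers both leave the risk set after time t.
        at_risk -= leaving
        i += leaving
    return curve

# Hypothetical cohort: recently acquired customers are censored early.
durations = [1, 2, 2, 3, 4, 5]
churned = [True, True, False, True, False, False]
curve = kaplan_meier(durations, churned)
```

Each customer contributes only for the months actually observed, so early and late sign-ups can be compared on the same curve.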
30. Ask.com background
• Founded ~16 years ago
• Ask.com attracts 100 million global users
– Biggest Q&A site on the web
• Over the last 2 years we’ve revamped our approach to Q&A with a
product that
– Combines search technology with answers from real people
• Instead of 10 blue links, we deliver
– Real answers to people’s questions – both from already published data
sources
– And our growing community of users – on the web and across mobile
31. Ask SEM Analytics Systems
• Select quality keywords at Big Data scale
• Determine bids using search engine and internal data
• Keyword segmentation and clustering at big data level
– Text mining, behavioral association, historic performance
– Use of data from organic traffic
– Map similarity of keywords
• Optimize landing page and custom creatives
• Reinforce learning, testing hypotheses
• Optimize algorithms and parameters via A/B tests
32. SEM Bid Algorithms
• Build predictive models for revenue at the keyword level,
using data including
– User search streams
– Ad depth, Landing Page CTR
– Quality Score and minCPC
– Effective CPC
– Keyword categories
– Natural language clusters
– Search behavioral clusters
• Use Hadoop/Hive/Mahout to process data
33. Benefits of SEM Algorithms
• Predict keyword performance
• Bid on the right keyword at the right price, at the right time
• Improve ROI, maximize profitable traffic volume
• Shift traffic to keywords with higher quality scores
• Optimize user experience
• Find similar keywords for management and expansion
34. Segmenting keywords
• In order to manage a large portfolio
• We group keywords together based on
– Customer behavior
– Text mining
– Keyword performance metrics
• Generate keyword groups for content and bid
management
• Similar keywords have similar performance
• Apply learnings from one keyword to similar keywords
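The text-mining side of keyword grouping can be illustrated with token overlap. This greedy single-pass grouping by Jaccard similarity is only a toy stand-in: the threshold and keywords are made up, and the behavioral and performance signals listed above are ignored.

```python
def jaccard(a, b):
    """Token-set overlap between two keyword phrases."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def group_keywords(keywords, threshold=0.5):
    """Greedy pass: join the first group whose seed keyword is similar enough."""
    groups = []
    for kw in keywords:
        for group in groups:
            if jaccard(kw, group[0]) >= threshold:
                group.append(kw)
                break
        else:
            # No existing group is close enough, so this keyword seeds a new one.
            groups.append([kw])
    return groups

keywords = [
    "cheap red wine",
    "red wine cheap",
    "best red wine",
    "online mba degree",
    "mba degree online cost",
]
groups = group_keywords(keywords)
```

Once keywords share a group, a sparse keyword can borrow bid and performance estimates from its better-measured neighbors.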
35. Conclusions
• For optimal modeling, dive deep in domain knowledge
• Identify key data and transformations
• May require Big Data solutions to scale
• Data are not reliable until after being seriously analyzed
• Test hypotheses and optimize in the real market
• Use simulations to see if changes are reasonable
• Focus on customer experience not data mining tools, model
complexity or predictive accuracy
• Use a lot of common sense
• “The best way to get good ideas is to have a lot of them”
– Linus Pauling