Title: BigData, AllData, Old Data: Predictive Analytics in a Changing Data Landscape
Abstract:
The landscape of platforms, access methodologies, shapes, and storage representations has changed dramatically. Many of the assumptions of a structured data world dominated by relational databases have been rendered obsolete. Today’s data analyst faces a bewildering environment of technologies and challenges involving semi-structured and unstructured data, with access methodologies that have almost no relation to the past. This talk will cover issues and challenges in making the benefits of advanced analytics fit within the application environment. The requirement for real-time data streaming and in situ data mining is stronger than ever. We demonstrate how many of the critical problems remain open, with much opportunity for innovative solutions to play a huge enabling role. This opportunity extends equally well to Knowledge Management and several related fields.
51. CONFIDENTIAL
Why did Google Serve this Ad?
This is how NetSeer actually sees this content
NetSeer: SOLVING ACCURACY ISSUES | AMBIGUITY, WASTE, BRAND SAFETY
52. CONFIDENTIAL
NetSeer – How it works
[Figure: concept map for a sample page, WEBSITE.COM; central node: focus <CONCEPT>]
ECONOMY CARS: high MPG, ford, low emission, fuel efficiency, economy vehicles
VISION TOOLS: microscope, lenses, reading glasses, autofocus, bifocal, refraction, eye chart
MARKET RESEARCH: focus groups, A/B testing, consumer study, surveying, blind study, analytics
Other terms shown: electric vehicles, service record, safety rating
DISCERNS AND MONETIZES HUMAN INTENT
+ Identifies Concepts expressed on a page
+ Disambiguates language
+ Builds increasingly rich profile over
52M CONCEPTS
2.3B RELATIONSHIPS BETWEEN CONCEPTS
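To make the idea of disambiguating a word like "focus" concrete, here is a minimal sketch that picks the concept cluster sharing the most terms with a page. It is not NetSeer's method, which the slides do not describe; the overlap-count rule, the function name `disambiguate`, and the grouping of terms into clusters (read off the slide layout) are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set

# Concept clusters as they appear on the slide (grouping assumed from layout).
CONCEPT_TERMS: Dict[str, Set[str]] = {
    "ECONOMY CARS": {"high mpg", "ford", "low emission", "fuel efficiency",
                     "economy vehicles"},
    "VISION TOOLS": {"microscope", "lenses", "reading glasses", "autofocus",
                     "bifocal", "refraction", "eye chart"},
    "MARKET RESEARCH": {"focus groups", "a/b testing", "consumer study",
                        "surveying", "blind study", "analytics"},
}


def disambiguate(page_terms: Set[str]) -> Optional[str]:
    """Return the concept with the most terms in common with the page.

    Hypothetical rule: plain overlap counting, no weighting.
    Returns None when no cluster shares any term with the page.
    """
    normalized = {term.lower() for term in page_terms}
    overlaps = {
        concept: len(terms & normalized)
        for concept, terms in CONCEPT_TERMS.items()
    }
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None


# A page mentioning "focus" next to car-related terms resolves to the car concept.
print(disambiguate({"Ford", "fuel efficiency", "safety rating"}))
```

A production system would presumably weight terms and use the relationships between concepts rather than raw counts; the sketch only shows why surrounding terms are enough to separate the three senses.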
59. Yahoo! – One of the Largest Destinations on the Web
80% of the U.S. Internet population uses Yahoo!
– Over 600 million users per month globally!
Global network of content, commerce, media, search and access products
100+ properties including mail, TV, news, shopping, finance, autos, travel, games, movies, health, etc.
25+ terabytes of data collected each day
• Representing thousands of cataloged consumer behaviors
More people visited Yahoo! in the past month than:
• Use coupons
• Vote
• Recycle
• Exercise regularly
• Have children living at home
• Wear sunscreen regularly
Sources: Mediamark Research, Spring 2004 and comScore Media Metrix, February 2005.
Data is used to develop content, consumer, category and campaign insights for our key content partners and large advertisers
60. Yahoo! Big Data – A league of its own…
GRAND CHALLENGE PROBLEMS OF DATA PROCESSING
TRAVEL, CREDIT CARD PROCESSING, STOCK EXCHANGE, RETAIL, INTERNET
Terabytes of Warehoused Data
Amazon: 25
Korea Telecom: 49
AT&T: 94
Y!LiveStor: 100
Y!Panama Warehouse: 500
Walmart: 1,000
Y!Main warehouse: 5,000
Millions of Events Processed Per Day
SABRE: 50
VISA: 120
NYSE: 225
YSM: 2,000
Y! Global: 14,000
Y! Data Challenge Exceeds others by 2 orders of magnitude
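The "2 orders of magnitude" claim can be checked with a quick calculation. The sketch below takes the events-per-day figures from the chart, pairing each label with a value in the order they appear (an assumption about the flattened chart), and computes the base-10 log of the ratio between Y! Global and each of the other systems. YSM is left out of the comparison on the assumption that it is also a Yahoo! system.

```python
import math

# Events processed per day, in millions (label/value pairing assumed from order).
events_millions = {
    "SABRE": 50,
    "VISA": 120,
    "NYSE": 225,
    "YSM": 2_000,
    "Y! Global": 14_000,
}


def orders_of_magnitude(larger: float, smaller: float) -> float:
    """Base-10 log of the ratio, i.e. how many powers of ten apart."""
    return math.log10(larger / smaller)


yahoo = events_millions["Y! Global"]
gaps = {
    name: round(orders_of_magnitude(yahoo, value), 2)
    for name, value in events_millions.items()
    if name not in ("YSM", "Y! Global")
}

for name, gap in gaps.items():
    print(f"Y! Global vs {name}: {yahoo / events_millions[name]:.0f}x "
          f"({gap} orders of magnitude)")
```

On these figures the gap is a little above two orders of magnitude for SABRE and VISA and a little below for NYSE, so the slide's statement holds as a rounded figure.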
61. Behavioral Targeting (BT)
Search
Ad Clicks
Content
Search Clicks
BT
Targeting ads to consumers whose recent behaviors online indicate which product category is relevant to them
62. Male, age 32
Lives in SF
Lawyer
Searched on
from London
last week
Searched on: “Italian restaurant Palo Alto”
Checks Yahoo! Mail daily via PC & Phone
Has 25 IM Buddies, moderates 3 Y! Groups, and hosts a 360 page viewed by 10k people
Searched on: “Hillary Clinton”
Clicked on Sony Plasma TV SS ad
Registration Campaign Behavior Unknown
Spends 10 hours/week on the internet
Purchased Da Vinci Code from Amazon
Yahoo! User DNA
• On a per consumer basis: maintain a behavioral/interests profile and
profitability (user value and LTV) metrics
63. How it works | Network + Interests + Modelling
Analyze predictive patterns for purchase cycles in over 100 product categories
In each category, build models to describe behaviour most likely to lead to an ad response (i.e. click).
Score each user for fit with every category…daily.
Target ads to users who get the highest ‘relevance’ scores in the targeting categories
Varying Product Purchase Cycles | Match Users to the Models | Rewarding Good Behaviour | Identify Most Relevant Users
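The last two steps above (score every user against every category, then target the highest scorers) can be sketched as a simple ranking. This is an illustration only: the score values, the user and category names, the cutoff `k`, and the function `top_users_per_category` are hypothetical, and the slides do not say how Yahoo! computed or thresholded relevance scores.

```python
from typing import Dict, List, Tuple


def top_users_per_category(
    scores: Dict[str, Dict[str, float]], k: int
) -> Dict[str, List[str]]:
    """Return the k highest-scoring users for each category.

    `scores` maps user -> category -> relevance score for one day.
    Ties are broken by user id so the result is deterministic.
    """
    by_category: Dict[str, List[Tuple[float, str]]] = {}
    for user, category_scores in scores.items():
        for category, score in category_scores.items():
            by_category.setdefault(category, []).append((score, user))
    return {
        category: [
            user for _, user in sorted(pairs, key=lambda p: (-p[0], p[1]))[:k]
        ]
        for category, pairs in by_category.items()
    }


# Hypothetical daily scores for three users in two categories.
daily_scores = {
    "u1": {"auto": 0.9, "travel": 0.1},
    "u2": {"auto": 0.4, "travel": 0.8},
    "u3": {"auto": 0.7, "travel": 0.6},
}
targets = top_users_per_category(daily_scores, k=2)
print(targets)
```

Because scores are recomputed daily, the same call run on the next day's scores would move users in and out of each targeting list as their behaviour changes.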
65. Differentiation | Category specific modelling
[Two charts of intensity score over time, each marking an Intense Click Zone]
Example 1: Category Automotive
Example 2: Category Travel/Last Minute
Different models allow us to weight and determine intensity and recency
Alt Behaviour 1 (both examples): 5 pages, 2 search keywords, 1 search click, 1 ad click
66. Differentiation | Category specific modelling
[Chart of intensity score over time, marking the Intense Click Zone]
Example 1: Category Automotive
Different models allow us to weight and determine intensity and recency
With no further activity, decay takes effect
Alt Behaviour 1: 5 pages, 2 search keywords, 1 search click, 1 ad click
User is in the Intense Click Zone
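One way to picture "intensity and recency" with decay is a weighted event sum with exponential decay, sketched below. The slides do not give the actual model, so everything numeric here is an assumption: the per-event weights, the half-lives per category, the zone threshold, and the names `intensity_score` and `in_intense_click_zone`. Only the event counts for Alt Behaviour 1 come from the slide.

```python
import math
from dataclasses import dataclass
from typing import List

# Hypothetical weights: stronger signals of intent count for more.
EVENT_WEIGHTS = {
    "page_view": 1.0,
    "search_keyword": 2.0,
    "search_click": 3.0,
    "ad_click": 5.0,
}

# Hypothetical half-lives in days: long purchase cycle vs. short one.
HALF_LIFE_DAYS = {"automotive": 30.0, "travel_last_minute": 3.0}

ZONE_THRESHOLD = 10.0  # hypothetical score needed to be in the Intense Click Zone


@dataclass(frozen=True)
class Event:
    kind: str
    day: float  # day on which the event happened


def intensity_score(events: List[Event], now: float, half_life_days: float) -> float:
    """Sum of event weights, each halved for every half-life elapsed."""
    decay_rate = math.log(2) / half_life_days
    return sum(
        EVENT_WEIGHTS[e.kind] * math.exp(-decay_rate * (now - e.day))
        for e in events
        if e.day <= now
    )


def in_intense_click_zone(events: List[Event], now: float, category: str) -> bool:
    return intensity_score(events, now, HALF_LIFE_DAYS[category]) >= ZONE_THRESHOLD


# Alt Behaviour 1 from the slide, all on day 0.
alt_behaviour_1 = (
    [Event("page_view", 0.0)] * 5
    + [Event("search_keyword", 0.0)] * 2
    + [Event("search_click", 0.0), Event("ad_click", 0.0)]
)

for category in HALF_LIFE_DAYS:
    print(category, in_intense_click_zone(alt_behaviour_1, now=7.0, category=category))
```

With these assumed numbers the same behaviour still counts as intense a week later for Automotive but has already decayed out of the zone for Travel/Last Minute, which is the category-specific effect the two charts illustrate.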
67. Automobile Purchase Intender Example
A test ad campaign with a major European automobile manufacturer
Designed a test that served the same ad creative to test and control groups on Yahoo
Success metric: performing specific actions on Jaguar website
Test results: 900% conversion lift vs. control group
Purchase Intenders were 9 times more likely to configure a vehicle, request a price quote or locate a dealer than consumers in the control group
~3x higher click through rates vs. control group
68. Mortgage Intender Example
Example: Mortgages
Results from a client campaign on Yahoo! Network
We found: 1,900,000 people looking for mortgage loans.
+122% CTR Lift
+626% Conv Lift
Example search terms qualified for this target: Mortgages, Home Loans, Refinancing, Ditech
Example Yahoo! Pages visited:
Financing section in Real Estate
Mortgage Loans area in Finance
Real Estate section in Yellow Pages
Source: Campaign click-through rate lift is determined by Yahoo! internal research. Conversion is the number of qualified leads from clicks over number of impressions served. Audience size represents the audience within this behavioral interest category that has the highest propensity to engage with a brand or product and to click on an offer. Date: March 2006
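For readers who want to relate lift percentages to raw rates, the sketch below uses the usual definition of lift as the relative increase of the targeted group's rate over the control group's rate, together with the conversion definition given in the source note (qualified leads over impressions). The function names and the sample counts are hypothetical; the campaign's underlying counts are not given in the slides.

```python
def conversion_rate(qualified_leads: int, impressions: int) -> float:
    """Conversion as defined in the source note: qualified leads / impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return qualified_leads / impressions


def lift_percent(test_rate: float, control_rate: float) -> float:
    """Relative increase of the test rate over the control rate, in percent."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (test_rate / control_rate - 1.0) * 100.0


def rate_multiple(lift_pct: float) -> float:
    """How many times the control rate a given lift corresponds to."""
    return 1.0 + lift_pct / 100.0


# The reported lifts expressed as multiples of the control rate.
print(f"+122% CTR lift  -> {rate_multiple(122):.2f}x the control CTR")
print(f"+626% conv lift -> {rate_multiple(626):.2f}x the control conversion rate")

# Hypothetical counts, only to show the calculation end to end.
control = conversion_rate(qualified_leads=10, impressions=100_000)
targeted = conversion_rate(qualified_leads=30, impressions=100_000)
print(f"lift: {lift_percent(targeted, control):.0f}%")
```

Under this definition a lift of +100% means the rate doubled, so the reported +626% conversion lift corresponds to a conversion rate a little over seven times that of the control.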
69. Experience summary at Yahoo!
• Dealing with one of the largest data sources (25 terabytes per day)
• Behavioral Targeting business was grown from $20M to > $400M in 3 years of investment!
• Yahoo! Specific? -- BigData critical to operations
– Ad targeting creates huge value
– Right teams to build technology (3 years of recruiting)
– Search is a BigData problem (but this has moved to mainstream)
70. Lessons Learned
A lot more data than qualified talent
Finding talent in BigData is very difficult
Retaining talent in BigData is even harder
At Yahoo! we created a central group that drove huge value to the company
Data people need to feel like they have critical mass
Makes it easier to attract the right people
Makes it easier to retain
Drive data efforts by business need, not by technology priorities
Chief Data Officer role at Yahoo! – now popular
72. RapidMiner’s Strengths
• Open Source Community & Marketplace – Crowd-sourced innovation, quality assurance, market awareness.
• Fully-integrated Platform – Integrated, process-based business analytics platform with focus on predictive analytics.
• No Programming Required – Easy-to-use, low maintenance costs, standard platform for business analysts.
• Advanced Analytics at Every Scale – In-memory, in-database and in-Hadoop analytics offer best option for every size of database.
• Connectivity – More than 60 connectors (incl. SAP & Hadoop), allowing easy access to structured and unstructured data.
73. 30,000+ Downloads per Month
SELECT LIST OF RECIPIENT ORGANIZATIONS
Government & Defense
Pharma & Healthcare
Consulting
Oil & Gas, Chemicals
Financial Services
Software & Analytics
Retail
Manufacturing
Business Services
Consumer Products
Aerospace
Technology
Entertainment
Academia
74. Select Customer Stories
PayPal
Who > world-leading online payment services provider
Solution > Customer feedback and voice of the customer analysis, churn prediction and prevention, text mining and sentiment analysis
SmartSoft
Who > provider of solutions for preventing fraud, money laundering, and risks in financial institutions
Solution > Integration of Rapid-I’s predictive analytics engine into their solutions for fraud detection and fraud prevention for the financial and telecom sectors
76. So the data is naturally moving to Hadoop...
Situation:
–The data is moving to Hadoop, driven by cost (storage) and convenience (ETL)
–How do we get the value of predictive analytics to the data? Rather than move the data out, move the analytics to the data!
–Can we minimize the need for data movement?
–Data copies can become a management nightmare
–Analytics in a “Business As Usual” manner requires convenience
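The idea of moving the analytics to the data can be illustrated with a toy aggregate: each partition computes a small summary where it is stored, and only those summaries travel to be combined. This is a plain-Python sketch of the general pattern, not Hadoop or RapidMiner code; the partition contents and the function names `local_summary` and `combine_mean` are made up for illustration.

```python
from typing import List, Tuple

Summary = Tuple[float, int]  # (sum of values, number of values)


def local_summary(partition: List[float]) -> Summary:
    """Runs where the partition is stored; returns two numbers, not the rows."""
    return (sum(partition), len(partition))


def combine_mean(summaries: List[Summary]) -> float:
    """Runs centrally on the small summaries to produce the global mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    if count == 0:
        raise ValueError("no data in any partition")
    return total / count


# Three hypothetical partitions living on different nodes.
partitions = [[1.0, 2.0, 3.0], [10.0, 20.0], [4.0]]

summaries = [local_summary(p) for p in partitions]
mean = combine_mean(summaries)
print(mean)
```

However large the partitions grow, each one ships only a (sum, count) pair, which is the data-movement saving the slide argues for; models that can be built from such mergeable summaries are the natural candidates for this approach.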
77. Radoop – RapidMiner on Hadoop
Opportunity:
–Avoid expensive data movement
–Leverage convenient data transformation
–Thousands of data connectors, many over semi-structured and unstructured data
Why is this big news?
–Leverages a naturally occurring wave
–Analytics over a richer variety requires much more processing
–The energy placed on data extraction and loading moves to energy applied on actual analysis and modelling
78. Big Picture on Big Data Analytics
Observations and
Concluding Remarks
80. Integrating Mail and News
Data showed that users often check their mail and news in the same session
–But no easy way to navigate to Y! News from Y! Mail
Mail users who also visit Y! News are 3X more active on Yahoo
–Higher retention, repeat visits and time-spent on Yahoo
81. “In the news” Module on Mail Welcome Page
Increased retention on Mail for light users by 40%!
–Est. Incremental revenue of $16m a year on Y! Mail alone
82. Nordstrom: Queries with No Matches
Julie Bornstein, Web Marketing Director
–What are my customers looking for and not finding?
–June 2002: queries for “belly button rings” returned no matches in store
–Why the sudden interest?
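The question "what are my customers looking for and not finding?" maps onto a simple report: count the site-search queries that returned zero results. The sketch below assumes a search log of (query, number of results) pairs; that log format, the sample rows, and the function name `top_zero_result_queries` are hypothetical and not taken from Nordstrom's systems.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_zero_result_queries(
    search_log: Iterable[Tuple[str, int]], n: int
) -> List[Tuple[str, int]]:
    """Most frequent queries that returned no matches.

    Queries are lower-cased and trimmed so trivial variants count together.
    """
    counts = Counter(
        query.strip().lower()
        for query, num_results in search_log
        if num_results == 0
    )
    return counts.most_common(n)


# Hypothetical log rows: (query text, number of matches returned).
sample_log = [
    ("belly button rings", 0),
    ("Belly Button Rings ", 0),
    ("shoes", 120),
    ("navel ring", 0),
    ("belly button rings", 0),
]
report = top_zero_result_queries(sample_log, n=2)
print(report)
```

Tracking this report over time is what would surface a sudden spike such as the one described for June 2002.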
83. Nordstrom: Queries with No Matches
Print Ad Campaign
Models happen to be sporting a navel ring
Nordstrom does not sell navel rings
What to do???
85. Today’s Auto: It just works!
No need to understand what happens when you turn on the ignition
Very complex inside, but all simplicity on the outside
86. Application Dates
Internships: Just launched
Contact: Faizan Chaudhary (faizan.chaudhary@barclays.com) and Tushar Wadaskar (tushar.wadaskar@barclays.com)
87. A Sampling of Technology Opportunities at Barclays:
We seek world-class technologists, data scientists, problem-solvers, and data systems engineers: the team that will re-invent Financial Services
Amer Sajed
CEO, Barclaycard US
Bassel Ojjeh
Chief Data Architect, DSI
Simon Gordon
MD, Head of Risk and Legal Technology - DSI
Innovating Data and Analytics solutions within the Financial Markets industry; understand and influence the future of the global Derivatives market. Risk Technology within Barclays is looking for world-class big data and quantitative modeling skills.
A great opportunity to work on some of the hardest technical problems in the industry by using open source to catch the bad guys stealing money all the way up to helping kids save for college
We at Barclays are unleashing the power of disruptive technology. Innovation led by data and design, keeping our customers at the heart of every product we build, is our mantra. Come join the revolution.
We have a challenge to quadruple our business here in the US, and that can only be delivered through analytics: in marketing, risk, fraud, and operations. And Pune is the undisputed center of the universe!
Faizan Chaudhary
Director, Data Systems & Insights (DSI)