Criteo Infrastructure (Platform) Meetup
3. 3 | Copyright © 2017 Criteo
Our mission
TARGET THE RIGHT USER AT THE RIGHT TIME WITH THE RIGHT MESSAGE
4. 4 | Copyright © 2017 Criteo
Key Figures
• 18 000 publishers
• 90% retention rate (2)
• +130 countries
• Listed on the NASDAQ since October 2013
• R&D represents 21% of the workforce
• 2500 employees
• $21 billion (3)
• 14 000 advertisers
• $1,799 million (1)
• 31 offices
(1) Revenue in 2016
(2) Annual rate 2015
(3) $ of turnover generated to our clients: post-click turnover worldwide from January to December 2015
6. 6 | Copyright © 2017 Criteo
GENERAL CONCEPT
1. Users visit an advertiser’s website
2. Criteo identifies the users (via cookies)
3. Users leave the advertiser’s website and browse publisher sites on the Internet
4. Criteo identifies users on these pages (via cookie)
5. Criteo displays an advertising banner, personalized for each user
6. Users click through directly to the advertiser’s page
Retargeting principles
8. 8 | Copyright © 2017 Criteo
• 3.2B catalog items ingested/day, 6B items stored
• 3.6B cookies/device IDs seen per month
• 3.9B personalized banners/day
• 49 RTBs @ 120B bid requests/day
• 3M QPS at peak
• 90 Gbps bandwidth
• 20K servers
• 27PB of data stored
• 3.6PB of data read daily
• 500B log lines processed/day
• 363TB of RAM in memcached, 37M req/s
• 300K Hadoop jobs/day
Scale @ Criteo
9. 9 | Copyright © 2017 Criteo
Batch processing:
• Hadoop as a Service:
• 2 clusters – main + backup one for degraded mode
• Cloudera CDH5
• 2300 servers total (1300 + 1000), 76K vcores
• 50PiB storage capacity
• Own job scheduler for improved data flow and coordination
• 300k jobs per day
Hadoop @ Criteo
10. 10 | Copyright © 2017 Criteo
Infrastructure Key Figures
Hosting Global Partners:
• Sunnyvale: 2 PoP, 500 kVA, 2 006 servers
• New York: 2 PoP, 930 kVA, 2 793 servers
• Ashburn: 2 PoP, 1.1 MVA, 1 170 servers
• Hong Kong: 2 PoP, 472 kVA, 2 185 servers
• Paris: 3 PoP, 1 800 kVA, 5 003 servers
• Amsterdam: 2 PoP, +2 500 kVA, 3 874 servers
• Tokyo: 2 PoP, 455 kVA, 2 564 servers
• Shanghai: 1 PoP, 200 kVA, 907 servers
• Worldwide: 16 PoP, ~8 MVA contracted, 20 526 servers, up to 90 Gbps, 3M QPS
11. 11 | Copyright © 2017 Criteo
Some of the many technologies used at Criteo
13. 13 | Copyright © 2017 Criteo
Top Level Applications
Platforms
Infrastructure
SRE
Advertiser Publisher
WebScale
Prediction Dynamic
Creative
Recommendation
Engine
• Catalog
• User Events
• Campaigns
• Reporting
• RTB
• Direct
• Campaigns
• Reporting
Systems
Platforms
Systems
Engine
14. 14 | Copyright © 2017 Criteo
Analytics Platforms
Advertiser Publisher
Analytics
AX/BI
Reporting / Billing Reporting / Payments
16. 16 | Copyright © 2017 Criteo
Tonight’s menu
Bill of Fare
***
1st talk: FastTrack: scaling customer integration
- Nicolas Laveau, Leo-Paul Goffic & Camille Coueslant -
2nd talk: Evolution of data structures in Yandex.Metrica
- Alexey Milovidov -
3rd talk: Don't take your software for granted
- Cedrick Montout -
4th talk: Evolution of analytics at Criteo
- Justin Coffey -
***
21:05 - 22:00 Networking
19. 19 | Copyright © 2017 Criteo
What do we do in Criteo?
Deliver the right message to the right user at the right time
20. 20 | Copyright © 2017 Criteo
Integration: Creatives settings
• Banners need branding
• Logo
• Font
• Color palette
• Banners come in many formats
21. 21 | Copyright © 2017 Criteo
Integration: Tags
• Banners are based on user intent
• Tags on customer store
• Different types of intent
• Home page view
• Product view
• Listing view
• Basket
• Sales
• Intent at product level
Home
<script type="text/javascript" src="//static.criteo.net/js/ld/ld.js" async="true"></script>
<script type="text/javascript">
  window.criteo_q = window.criteo_q || [];
  window.criteo_q.push(
    { event: "setAccount", account: 666 },
    { event: "setEmail", email: "harry.potter@hogwarts.org" },
    { event: "setSiteType", type: "g" },
    { event: "viewHome" }
  );
</script>
Sales
<script type="text/javascript" src="//static.criteo.net/js/ld/ld.js" async="true"></script>
<script type="text/javascript">
  window.criteo_q = window.criteo_q || [];
  window.criteo_q.push(
    { event: "setAccount", account: 666 },
    { event: "setEmail", email: "harry.potter@hogwarts.org" },
    { event: "setSiteType", type: "g" },
    { event: "trackTransaction", id: "tr-56182-2123", item: [
      { id: "patronus", price: 12.54, quantity: 3 },
      { id: "avada-kedavra", price: 1099.99, quantity: 1 }
      /* add a line for each item in the user's basket */
    ]}
  );
</script>
22. 22 | Copyright © 2017 Criteo
Integration: Product Feed
• Banners contain products
• Characteristics of products are used for
recommendation
• Name, description, image, price for display
XML
<item>
  <g:id>0</g:id>
  <title>Abracadabra</title>
  <g:image_link>http://www.magic.com/assets/spells/abracadabra.png</g:image_link>
  <link>http://www.magic.com/spells/abracadabra</link>
  <description>Multi-purpose spell. Your companion for every occasion!</description>
  <g:price>625.99</g:price>
  <g:google_product_category>35</g:google_product_category>
</item>
CSV
id;title;image_link;link;description;price;google_product_category
0;Abracadabra;http://www.magic.com/assets/spells/abracadabra.png;http://www.magic.com/spells/abracadabra;Multi-purpose spell. Your companion for every occasion!;625.99;Arts & Entertainment > Hobbies & Creative Arts > Magic & Novelties
23. 23 | Copyright © 2017 Criteo
Back in 2014
When the customer saw what he had to implement
24. 24 | Copyright © 2017 Criteo
Back in 2014
When technical support saw the first implementation
25. 25 | Copyright © 2017 Criteo
Back in 2014
When the customer tried to debug his implementation
26. 26 | Copyright © 2017 Criteo
Criteo grows… fast!
This does not scale!
« Performance is everything »
BUT
we need to onboard first
Clients
TS
27. 27 | Copyright © 2017 Criteo
All is not lost!
Technology & UX to the rescue!
29. 29 | Copyright © 2017 Criteo
Goal
Show near real-time metrics on tracker format issues
Detect mismatches between the trackers and the product feed
Provide fine-grained data (max 24 hours)
Available for each of our clients (i.e. worldwide)
31. 31 | Copyright © 2017 Criteo
How
1. Audit the tracker events
2. Send this audit event to Kafka
3. Consume it from Druid
32. 32 | Copyright © 2017 Criteo
Why Druid
• Druid is an open-source column-oriented distributed data store
• Advantages:
• Fast aggregation queries on huge amount of metrics
• Real-time streaming ingestion
• Scalable
• Highly available
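To make the "fast aggregation queries" point concrete, here is a minimal sketch of a Druid native timeseries query built as a Python dict. The keys (`queryType`, `dataSource`, `granularity`, `intervals`, `filter`, `aggregations`) belong to Druid's native JSON query format; the datasource name `tracker_audit`, the `account` dimension and the `error_count` metric are hypothetical names chosen for illustration, not Criteo's actual schema.

```python
import json
from datetime import datetime, timedelta, timezone


def build_error_timeseries_query(account_id, now, hours=24):
    """Build a Druid native timeseries query as a plain dict.

    Datasource, dimension and metric names are illustrative assumptions.
    """
    start = now - timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    # Druid intervals are ISO-8601 "start/end" strings.
    interval = f"{start.strftime(fmt)}/{now.strftime(fmt)}"
    return {
        "queryType": "timeseries",
        "dataSource": "tracker_audit",  # hypothetical datasource
        "granularity": "hour",
        "intervals": [interval],
        "filter": {
            "type": "selector",
            "dimension": "account",  # hypothetical dimension
            "value": str(account_id),
        },
        "aggregations": [
            # hypothetical metric holding the number of audit errors
            {"type": "longSum", "name": "errors", "fieldName": "error_count"},
        ],
    }


query = build_error_timeseries_query(666, datetime(2017, 1, 2, tzinfo=timezone.utc))
print(json.dumps(query, indent=2))
```

The 24-hour default window mirrors the "max 24 hours" goal stated earlier; the resulting JSON would typically be posted to a Druid broker.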
33. 33 | Copyright © 2017 Criteo
1. Audit the tracker events
2. Send this audit event to Kafka
3. Consume it from Druid
4. Query Druid from Integrate
How
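The first two steps can be sketched in Python as follows. This is an illustration under assumptions, not Criteo's implementation: the set of mandatory fields, the shape of the audit record and the `send` callable (standing in for a Kafka producer, with no specific client library assumed) are all hypothetical. The event shape reuses the tag example shown earlier.

```python
import json

MANDATORY_FIELDS = ("event", "account")  # assumed minimal set, for illustration


def audit_event(event, known_product_ids):
    """Return an audit record listing format issues and unknown products."""
    errors = []
    for field in MANDATORY_FIELDS:
        if field not in event:
            errors.append(f"missing mandatory parameter: {field}")
    if "account" in event and not isinstance(event["account"], int):
        errors.append("account must be an integer")
    # Events such as a basket or a sale reference one or more products.
    product_ids = [item.get("id") for item in event.get("item", [])]
    for pid in product_ids:
        if pid not in known_product_ids:
            errors.append(f"product not found in feed: {pid}")
    return {
        "account": event.get("account"),
        "event": event.get("event"),
        "products": product_ids,
        "errors": errors,
    }


def publish(audit, send):
    """Serialize the audit record and hand it to a transport callable."""
    send(json.dumps(audit).encode("utf-8"))


sent = []  # stands in for the Kafka topic
audit = audit_event(
    {
        "event": "trackTransaction",
        "account": 666,
        "item": [
            {"id": "patronus", "price": 12.54, "quantity": 3},
            {"id": "avada-kedavra", "price": 1099.99, "quantity": 1},
        ],
    },
    known_product_ids={"patronus"},
)
publish(audit, sent.append)
print(audit["errors"])
```

Passing the transport in as a callable keeps the audit logic testable without a running broker.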
36. 36 | Copyright © 2017 Criteo
Tag Debug Mode
How do I make sure I send Criteo the right information from my website?
Fig 1: Criteo Hotline
37. 37 | Copyright © 2017 Criteo
Tag Debug Mode
How do I make sure I send Criteo the right information from my website?
Fig 2: Happy customer
38. 38 | Copyright © 2017 Criteo
How tags work
https://www.mvmtwatches.com/
39. 39 | Copyright © 2017 Criteo
How tags work
https://www.mvmtwatches.com/
ld.js
40. 40 | Copyright © 2017 Criteo
How tags work
https://www.mvmtwatches.com/
ld.js
GET /event?a=%5B30072%…
41. 41 | Copyright © 2017 Criteo
How tags work
https://www.mvmtwatches.com/
ld.js
GET /event?a=%5B30072%…
200 OK
43. 43 | Copyright © 2017 Criteo
Tag Debug Mode
https://www.mvmtwatches.com/#enable-tag-debug-mode
44. 44 | Copyright © 2017 Criteo
Tag Debug Mode
https://www.mvmtwatches.com/#enable-tag-debug-mode ld.js
if (document.location.hash == debugHash)
loadLdDebug();
45. 45 | Copyright © 2017 Criteo
Tag Debug Mode
https://www.mvmtwatches.com/#enable-tag-debug-mode ld.js
ld-debug.js
if (document.location.hash == debugHash)
loadLdDebug();
addDebugIframe();
46. 46 | Copyright © 2017 Criteo
Tag Debug Mode
https://www.mvmtwatches.com/#enable-tag-debug-mode ld.js
GET /event?a=%5B30072%…&debugMode=1
ld-debug.js
if (document.location.hash == debugHash)
loadLdDebug();
addDebugIframe();
47. 47 | Copyright © 2017 Criteo
Tag Debug Mode
https://www.mvmtwatches.com/#enable-tag-debug-mode ld.js
GET /event?a=%5B30072%…&debugMode=1
200 OK
Content-Type: application/javascript
sendDebugInformationToIframe({
  audit: {
    product: { image: '…' },
    errors: […]
  }
});
ld-debug.js
if (document.location.hash == debugHash)
loadLdDebug();
addDebugIframe();
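The server side of this exchange could look roughly like the Python sketch below. It is an assumption-laden illustration: the handler name `build_event_response`, the audit payload and the sample query string are made up; only the `debugMode=1` parameter, the `application/javascript` content type and the `sendDebugInformationToIframe` call come from the slides.

```python
import json
from urllib.parse import parse_qs


def build_event_response(query_string, audit):
    """Return (status, headers, body) for a tracker /event request.

    Hypothetical handler: in debug mode the body is a JavaScript call that
    hands the audit to the debug iframe; otherwise the body is empty.
    """
    params = parse_qs(query_string)
    if params.get("debugMode") == ["1"]:
        payload = json.dumps({"audit": audit})
        body = "sendDebugInformationToIframe(" + payload + ");"
        return 200, {"Content-Type": "application/javascript"}, body
    return 200, {}, ""


# Made-up audit and query string, for illustration only.
audit = {"product": {"image": "https://example.com/p.png"}, "errors": []}
status, headers, body = build_event_response("a=%5B1%5D&debugMode=1", audit)
print(body)
```

Because the debug response is computed by the same endpoint that handles normal events, what the customer sees reflects what will actually be processed.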
48. 48 | Copyright © 2017 Criteo
Tag Debug Mode
Gives you fine-grained insights on the quality of information sent
Requires no technical knowledge
Mirrors exactly what will be processed down the line
50. 50 | Copyright © 2017 Criteo
Goal
Provide feedback ASAP on a subset of products
Provide feedback on the whole feed
Automatic format detection (Google specs)
User can validate the structure of the feed
User can review some products
As close as possible to the daily feed import
51. 51 | Copyright © 2017 Criteo
Full import
Daily import architecture
52. 52 | Copyright © 2017 Criteo
Full import
Update feed processing
Hadoop job to compute
errors and attributes
statistics
53. 53 | Copyright © 2017 Criteo
Full import
Launch full import from
Integrate, retrieve and
display statistics
54. 54 | Copyright © 2017 Criteo
Test import
Create a Marathon application
that:
- Stream incoming feed
- Detect format
- Reuse part of feed processing
Hadoop job java code
- Save import & statistics in DB
- Provide API to fetch statistics
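A minimal Python sketch of the "detect format" and "statistics" steps follows. It is not the Marathon application described above (which reuses Java code from the Hadoop job); the list of required attributes, the detection heuristic and the second sample product are assumptions for illustration.

```python
import csv
import io

REQUIRED = ("id", "title", "image_link", "link", "price")  # assumed subset


def detect_format(sample):
    """Guess the feed format from the beginning of the stream.

    Returns ("xml", None) or ("csv", delimiter).
    """
    head = sample.lstrip()
    if head.startswith("<"):
        return "xml", None
    # Otherwise assume delimited text: pick the candidate delimiter that
    # occurs most often in the header line.
    header = head.splitlines()[0] if head else ""
    delimiter = max(";,\t|", key=header.count)
    return "csv", delimiter


def csv_statistics(lines, delimiter):
    """Stream rows, counting products and missing required attributes."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    stats = {"products": 0, "missing": {name: 0 for name in REQUIRED}}
    for row in reader:
        stats["products"] += 1
        for name in REQUIRED:
            if not (row.get(name) or "").strip():
                stats["missing"][name] += 1
    return stats


feed = (
    "id;title;image_link;link;description;price\n"
    "0;Abracadabra;http://www.magic.com/assets/spells/abracadabra.png;"
    "http://www.magic.com/spells/abracadabra;Multi-purpose spell;625.99\n"
    # Made-up second product with a missing image and price.
    "1;Alohomora;;http://www.magic.com/spells/alohomora;Opens doors;\n"
)
fmt, delimiter = detect_format(feed)
stats = csv_statistics(io.StringIO(feed), delimiter)
print(fmt, repr(delimiter), stats)
```

Reading row by row keeps memory flat, which is what allows feedback on a subset of products before the whole feed has arrived.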
58. 58 | Copyright © 2017 Criteo
How banners work at Criteo
• Actual humans pick predefined
layouts, colors, CTAs
• Then those are combined with product
information and optimized on-the-fly
Example CTAs: “Je découvre !” (“Discover!”), “J’achète !” (“Buy!”)
59. 59 | Copyright © 2017 Criteo
How banners work at Criteo
“Can I have drop shadows on my products?”
“I’m not sure about the pink”
“Could it autoplay loud music?”
As a result, clients worry
“What will my banners look like?”
60. 60 | Copyright © 2017 Criteo
How banners work at Criteo
There is stuff we can’t do, and stuff we don’t necessarily want to do
“What will my banners look like?”
“Can I have drop shadows on my products?”
“I’m not sure about the pink”
“Could it autoplay loud music?”
61. 61 | Copyright © 2017 Criteo
Creatives to the rescue
And it takes back and forth.
Our goal:
• Give advertisers a preview of what it’ll look like
• Give advertisers customization options
• Give feedback on the performance impact
• 80% of advertisers validate their Creatives in < 2 minutes
• 80% of advertisers don’t ask for a change
62. 62 | Copyright © 2017 Criteo
Creatives
Bring on UX, R&D, Product, Sales, Creatives & Technical Support
63. 63 | Copyright © 2017 Criteo
Creatives
Bring on UX, R&D, Product, Sales, Creatives & Technical Support
64. 64 | Copyright © 2017 Criteo
Creatives
• Education
• Preview
• Performance
• Customization
66. 66 | Copyright © 2017 Criteo
eCommerce Platforms
Lots of our clients run on ready-to-use platforms that have APIs
As a result, we can completely automate the integration workflow for them!
67. 67 | Copyright © 2017 Criteo
Shopify integration
Only 2 clicks needed!
Reduced integration time from 14 days to 20 minutes
69. 69 | Copyright © 2017 Criteo
How customers / technical support / we feel
70. 70 | Copyright © 2017 Criteo
“I’m in love with the Tag Debug Mode” (Nassim Aissat, Global TS)
• 75% Tags without help (only 25% in 2014)
• 14d median integration time (43 days in 2014)
• 66% complete Feed in < 1h
• 92% validate Creatives in < 2 mn
• 95% accept “as-is”
• 4% accept with performance downgrade
• Only 1% ask for modification
• 20mn integration w/ Shopify App
• 2014: 600 integrations/quarter
• 2016: 1800 integrations/quarter
• 50% handled through Integrate
Integrate achievements
73. 73 | Copyright © 2017 Criteo
What does Black Friday mean at Criteo?
74. 74 | Copyright © 2017 Criteo
Release freeze: trying to guarantee the stability of the platform...
... with nasty side-effects
Getting ready for Black Friday
75. 75 | Copyright © 2017 Criteo
How to evaluate at a glance the health of the datacenter?
Enter Grafana
Monitoring the datacenter
76. 76 | Copyright © 2017 Criteo
With specific filters, deviant machines can be spotted easily
Monitoring the datacenter
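The slides rely on Grafana filters to spot deviant machines. As a complementary illustration, the Python sketch below flags outliers with a modified z-score based on the median absolute deviation (MAD). This is a generic statistical technique swapped in for the example, not the speakers' method; host names, CPU values and the 3.5 threshold are assumptions.

```python
from statistics import median


def deviant_machines(metric_by_host, threshold=3.5):
    """Return hosts whose metric is far from the fleet median.

    Uses the modified z-score: 0.6745 * |x - median| / MAD.
    """
    values = list(metric_by_host.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # More than half the fleet is identical: anything different deviates.
        return sorted(h for h, v in metric_by_host.items() if v != center)
    return sorted(
        host
        for host, v in metric_by_host.items()
        if 0.6745 * abs(v - center) / mad > threshold
    )


# Made-up CPU usage (%) per host.
cpu = {"web01": 41.0, "web02": 39.5, "web03": 40.2, "web04": 93.0, "web05": 40.8}
print(deviant_machines(cpu))
```

Comparing each server to the fleet median rather than to a fixed limit keeps the check meaningful when overall traffic rises, as it does around Black Friday.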
77. 77 | Copyright © 2017 Criteo
Drilling down...
Monitoring the datacenter
78. 78 | Copyright © 2017 Criteo
Until finding a likely culprit
Monitoring the datacenter
79. 79 | Copyright © 2017 Criteo
And switching to micro analysis to find the root cause
• Process Explorer
• Profiling
• Windbg
• ClrMD
Monitoring the datacenter
85. 85 | Copyright © 2017 Criteo
Gen8 vs Gen9 servers
87. 87 | Copyright © 2017 Criteo
Conclusion
Do not take your software for granted
• Internal Infrastructure will change
• External workload will change
… be prepared
88. 88 | Copyright © 2017 Criteo
The Analytics Stack at Criteo
Yesterday, Today and Tomorrow with an assist from Bill Murray
Justin Coffey, Team Lead
89. 89 | Copyright © 2017 Criteo
The Ghost of Christmas Present
What do we have now?
90. 90 | Copyright © 2017 Criteo
Criteo: Scale of Data
• 4 Billion ads served each day
• 200+ Billion events logged each day
• 50TBs of data ingested each day
• 10 trillion records processed each day
91. 91 | Copyright © 2017 Criteo
Criteo: Scale of the Analytics Stack
50+ TB ingested / day
2000+ jobs / day
7+PB
Under
Management
200+ Analysts
400+ Engineers
1000+
Sales and Ops
92. 92 | Copyright © 2017 Criteo
Criteo: Scaling Analysts
[Chart: Analysts Hired since 2010 (y-axis 0 to 180)]
93. 93 | Copyright © 2017 Criteo
Criteo: Scaling Data
[Chart: Growth of a Single Dataset Since July 2014 (y-axis 0 to 1.4E+11)]
94. 94 | Copyright © 2017 Criteo
Criteo: The Analytics Stack Today
• Ad-Hoc Analysis
• Hadoop for primary storage and point of ingestion
• Data Transformation on top of Hadoop
• Hive (7PB) and Vertica (100+ TB) Data Warehouses
• Ad-Hoc SQL on Hive and Vertica, Reporting on Tableau and Vertica
• Orchestration via Langoustine
95. 95 | Copyright © 2017 Criteo
Our Stack is Simple
• Few moving parts
• Purposefully built with Shiny Thing blinders on
• It's okay to not have the "latest and greatest" tech
• Good enough is, actually, always good enough
96. 96 | Copyright © 2017 Criteo
On Shiny Things: the universe is vast
so be selective, and master what you select
97. 97 | Copyright © 2017 Criteo
The Ghost of Christmas Past
Before we continue, a quick history lesson of how we got here is in order...
98. 98 | Copyright © 2017 Criteo
Everything starts somewhere
and it's not always pretty.
99. 99 | Copyright © 2017 Criteo
In early 2013, you could use SQL Server…
AdServer_Db
Publisher_Db
LogStatus_Db
BlogWidgetStat_Db
BlogWidgetAdStat_dbTraffic_custom_db
Extranet_DbTraffic_custom_db
CATEGORY_DB
Mail_MonitorDB
Inventory_Db
AdServerBo_Db
AdServerStat_Db
DashBoard_DB
Dashboard_Security_DB
WebServerStat_db
ABTesting_DB
AdvertiserFatigueStats_db
ADVERTISING_DB
StatPrediction_DB
CAST_DB
CriteoRefdb
ImportDB
RISK_DBGalacticaStats_DB
MaxCpc_DB
UserProfilingDB
WorkflowPersistency_db
CAST_DB_HOURLY
StatEngine_Db
Crawler_Db
BICustom_DB
Lookalike_DB
Widget_db
AOC_DB
AOC_DB
Build_Deploy_Fake_db
publisher_stats_db
TestFwk_Db
LogMonitorDb
ADMINLOGS_DB
SqoopExport_db
FraudDetection_db
HPClink_DB
DW_DB
tsuissesbenl_stat_db
Heyokr_Stat_db
kiabiit_stat_db
Ultaus_Stat_db
Crutchfieldus_Stat_db
Forzierijp_Stat_db
Retailchoiceuk_Stat_db
Ryanairhotelses_Stat_db
Speakyplanetfr_Stat_db
Autowayjp_Stat_db
Sicilianobr_Stat_db
Jukenhousingjp_Stat_db
Cosyforyoufr_Stat_db
Tripadvisorru_Stat_db
Linasmatkassese_Stat_db
Ellepassionsfr_Stat_db
Skyde_Stat_db
Swimdoctormallkr_Stat_db
Sitescoutbr_Stat_db
Travelzoousnewusers_Stat_db
Platekompanietno_Stat_db
Testaoc110413frcom_Stat_db
Megapoolnl_Stat_db
Elektrototaalmarktnl_Stat_db
Intersportuk_Stat_db
Usineadesignfr_Stat_db
Lekmerno_Stat_db
Vuelingit_Stat_db
Valuedopinions_Stat_db
Forzierino_Stat_db
Artisantiuk_Stat_db
Idbusit_Stat_db
Cocostorykr_Stat_db
Artnaturejp_Stat_db
Byggmaxse_Stat_db
Corporatecriteopmit_Stat_db
Aramisauto_Stat_db
Migoaes_Stat_db
Degrotespeelgoedwinkelnl_Stat_db
Diorcouturit_Stat_db
Kaufuniquede_Stat_db
Codigallerykr_Stat_db
Mandarinaduckfr_Stat_db
Comarketingorangenokiafr_Stat_db
Sinbiangkr_Stat_db
Cheapflightsuk_Stat_db
Undergirlkr_Stat_db
Agradinl_Stat_db
Kofferprofide_Stat_db
Domodipl_Stat_db
Mandarinaduckat_Stat_db
Mobilegermany_Stat_db
Chlit_Stat_db
Spreadshirtuk_Stat_db
Casalrunningfr_Stat_db
Bloomfm_Stat_db
Hotelsbe_Stat_db
Strumentimusicaliit_Stat_db
Bathroomworlduk_Stat_db
Verivoxde_Stat_db
Mcmkr_Stat_db
Viaggiedreamsit_Stat_db
Brille24de_Stat_db
Yjgakuseikaikan_Stat_db
Stylepitnl_Stat_db
Cvlibraryrecruiter_Stat_db
Preis24de_Stat_db
Tigershedsuk_Stat_db
Duvetandpillowuk_Stat_db
Noths_Stat_db
Wizwidkr_Stat_db
Ticketonlinede_Stat_db
Lifestyleeuropeuk_Stat_db
Shopeccose_Stat_db
Swanhellenicuk_Stat_db
Deguisementdiscountfr_Stat_db
Freshcottonnl_Stat_db
Tikamoonfr_Stat_db
Testfp1_Stat_db
warehouse_stat_db
Hisjeans_Stat_db
Mountfieldlawnmowers_Stat_db
Sitescoutnl_Stat_db
Lancomeus_Stat_db
Brandelijp_Stat_db
Mesdessousfr_Stat_db
Beautyplanningjp_Stat_db
Lgcobrandingpriceminister_Stat_db
Stockngous_Stat_db
Kickzde_Stat_db
Rockymountaindecorus_Stat_db
Cellbesse_Stat_db
Yvesrocheres_Stat_db
Toshibadirectjp_Stat_db
Seneukr_Stat_db
Waterfeaturesuk_Stat_db
Cottagesforyouuk_Stat_db
Camif_Stat_db
Lojaskdbr_Stat_db
Hipmunkhotels_Stat_db
Sorteonline_Stat_db
Ediets_Stat_db
Bonsportru_Stat_db
Jobjsenjp_Stat_db
Redcoonit_Stat_db
Hmuk_Stat_db
Srtestcetelem2_Stat_db
Iamprettykr_Stat_db
Lebunnybleushopkr_Stat_db
Condenastit_Stat_db
Hotusaes_Stat_db
Chilitvit_Stat_db
Hellinefr_Stat_db
Cobrasonfr_Stat_db
madeindesign_stat_db
Megagadgetsnl_Stat_db
Todaofertabr_Stat_db
bulbus_Stat_db
Calcioshopit_Stat_db
Edenlyes_Stat_db
Recruiterucajp_Stat_db
Engelhornde_Stat_db
Spreadshirtno_Stat_db
Dusparstde_Stat_db
Tabletbr_Stat_db
Ventesecretfr_Stat_db
Venteunique_Stat_db
Dellchde_Stat_db
Dressforlessnl_Stat_db
Multipopkr_Stat_db
allheartus_Stat_db
Trovitdejobs_Stat_db
lesjeudisfr_stat_db
Expediaukcrosssell_Stat_db
Furniturebrituk_Stat_db
Yooxbe_Stat_db
Skyscannerno_Stat_db
Bluetomatoat_Stat_db
Mechakaitaijp_Stat_db
Destinationlightingus_Stat_db
and 10K+ more
100. 100 | Copyright © 2017 Criteo
SQL Server was Production Infrastructure
• Analyst access to data was an afterthought
• Production databases were not designed for analytics
• Reports and queries were tightly coupled to production
• UX was low and Analysts occasionally broke production systems!
101. 101 | Copyright © 2017 Criteo
Hive also made an early appearance…
2013-04-22 11:28:59,942 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 365222.27 sec
2013-04-22 11:29:01,010 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 365222.27 sec
2013-04-22 11:29:02,071 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 365222.27 sec
2013-04-22 11:29:03,134 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 365222.27 sec
2013-04-22 11:29:04,876 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 365222.27 sec
2013-04-22 11:29:05,112 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 365222.27 sec
2013-04-22 11:29:06,047 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 365222.27 sec
2013-04-22 11:29:06,984 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 365222.27 sec
ZZZZ…
102. 102 | Copyright © 2017 Criteo
But Hive was also an afterthought
• Raw production data batch loaded with no transformations
• Query tools were non-existent
• Queries were slow and only expert analysts could run them
• UX and productivity were extremely low
103. 103 | Copyright © 2017 Criteo
This just wasn't working!
we needed a new approach
105. 105 | Copyright © 2017 Criteo
Requirements for an Analytic Database
• It must be extremely fast
• It must be able to store our most actionable data sets
• Dozens (at the time!) of TBs, now hundreds
• It must be queryable with proper SQL
• It must be deployable on hardware we specify
106. 106 | Copyright © 2017 Criteo
Defining a Proof of Concept Evaluation
• Work with Analysts to identify key data sets
• Analyze query patterns
• Define benchmark queries
• Work with vendors to test closed source solutions
• Test OSS in-house
107. 107 | Copyright © 2017 Criteo
The results
• Vertica struck the right balance between cost, performance and deployment options
• PoC evaluation took ~3 months
• Initial deployment took another ~3 months
• Operations ramped up over the following ~6 months
108. 108 | Copyright © 2017 Criteo
Working with Analysts during deployment
• Analysts in the team helped define and document the data model
• They also created training materials
• Training was done in concert with engineers
109. 109 | Copyright © 2017 Criteo
But was it a success?
• Within a year of the rollout we were able to decommission SQL Server for analytics
• Today Vertica has over 100 unique ad-hoc users connected each day
• It executes hundreds of thousands of queries each day
• It is the most important piece of analytics infrastructure at Criteo
110. 110 | Copyright © 2017 Criteo
A fresh deployment to mature infrastructure
• Vertica at Criteo has scaled from ~12TB to ~120TB (going PB soon)
• Ad-hoc users have grown from ~40 to ~200
• Reporting users have grown from ~300 to ~1500
• The number of tables has grown from ~50 to >500
111. 111 | Copyright © 2017 Criteo
Wait, 500 tables in 3 years?
That's a lot of data modelling!
112. 112 | Copyright © 2017 Criteo
Analysts contribute to the data model
• Engineers know how the DB works and know how to optimize a data model, but they don't always know what to put in it
• With good tools Analysts contribute to the evolutions of the data model, including schema additions and modifications
• Engineers in the team can help guide them in the finer details
• Rinse and repeat
113. 113 | Copyright © 2017 Criteo
Side bar: We also had dashboards with SSRS
But we were told it was ugly and complicated.
We traded ugly for slow, btw, and it's still complicated
114. 114 | Copyright © 2017 Criteo
From SSRS to Tableau and SQL Server to Vertica
• Actually, "slow" is just our current perception—we had SSRS dashboards with timeouts on the order of hours.
• SSRS served as our de facto ETL between those 10K+ SQL Server DBs
• Those SQL Server DBs were also production databases.
115. 115 | Copyright © 2017 Criteo
So to Summarize the Past
• Analysts had to query across thousands of DBs
• Dashboards were slow and complicated
• Analytics work was strongly coupled to production
life was great back then wasn't it?
116. 116 | Copyright © 2017 Criteo
We're done then?
Not quite. Things can go awry!
117. 117 | Copyright © 2017 Criteo
The Ghost of Christmas Future
...here's hoping it's a near future...
118. 118 | Copyright © 2017 Criteo
Criteo is World Wide
We have hundreds of analysts spread across dozens of countries!
119. 119 | Copyright © 2017 Criteo
Criteo has a Rich Product Offering
• Banner Ads, Mobile, In-App, Email, Search
• 10's of Thousands of Advertisers and Publishers
• Some of them very big and very demanding
120. 120 | Copyright © 2017 Criteo
And (reminder!) our Scale Never Seems to Stop Growing
[Chart: Growth of a Single Dataset Since July 2014 (y-axis 0 to 1.4E+11)]
121. 121 | Copyright © 2017 Criteo
(reminder #2) Number of analysts hired since 2010
[Chart: y-axis 0 to 180]
123. 123 | Copyright © 2017 Criteo
New Challenges
• With so many hungry analysts to feed and with so much volume and variety of data, Vertica's query planner is working overtime
• We need to instrument and monitor more
• We need to level-up analysts' SQL skills
• And yes, finally, we do need some data governance*
*oh how I've resisted this day!
124. 124 | Copyright © 2017 Criteo
2 Analysts and 3 Engineers ain't gonna cut it
• We have scaled up our PM team
• We are moving from a proto-CoE team to an official CoE team
• We are scaling engineering operations
125. 125 | Copyright © 2017 Criteo
What's on the TODO list?
• Documentation, and automating it as much as possible
• Non-invasive, but very intimate query monitoring
• Workload isolation
• Query suggestions and preëmptive query blocking
126. 126 | Copyright © 2017 Criteo
More about query inspection
• No matter how wonderful a database may be, its performance comes down to how much IO it has and how much contention there is for it
• The difference between a poorly optimized query and a well optimized one, as far as the IO subsystem is concerned, can be orders of magnitude
• Better queries mean more concurrent, happier users
127. 127 | Copyright © 2017 Criteo
More about query inspection
• Vertica offers lots of ways to find out what is going on behind the scenes, but one of the best ways is to EXPLAIN your users' queries and identify those who need to be trained!
128. 128 | Copyright © 2017 Criteo
Recalling our Current Challenges
• Tableau Workbooks are Slow
• Vertica is Overloaded
• Reporting Data is Frequently Late
129. 129 | Copyright © 2017 Criteo
Patches and the Arc of History
• Each of our current challenges can be addressed in the short term
• But we need long term solutions to avoid regressions
130. 130 | Copyright © 2017 Criteo
Tableau Relief Program (TaRP)
Short Term:
• Double the cores on production server
• Isolate critical workbooks
Medium Term:
• Require all production workbooks to go through gerrit/git review
• Score workbook complexity pre-release
• Monitor released workbooks for QoS
Not So Long Term:
• Work with Product and Central Ops to create Tableau Center of Excellence and level up BI
131. 131 | Copyright © 2017 Criteo
TaRP: reporting alchemy
Before: Productive Analyst → Push to production → No SLA dataset → Angry Sales Person
After: Productive Analyst → Push to review (Knowledgeable Analyst) → Automated deploy → SLA dataset → Happy Sales Person
132. 132 | Copyright © 2017 Criteo
Why impose a dev cycle on report building?
not to be trite, but, well:
that's good money!
133. 133 | Copyright © 2017 Criteo
More seriously
• Tableau workbooks consume data
• Data comes in all sorts of volumes and velocities (sorry)
• Data query complexity is linked to workbook complexity and features
• If you don't know what you're doing, your workbooks will be:
• slow, because of internal workbook complexity
• slow, because of complex database queries
• out of date, because they don't query the proper data sources
Tableau workbook developers are developers, full stop. Treat them like they are.
134. 134 | Copyright © 2017 Criteo
Vertica Roadmap
Components: Consul, RTIngester, HDFSIngester, JVMIngester, HLL, JDBC, VProxy, Admin, VIcO, DataDisco
135. 135 | Copyright © 2017 Criteo
Vertica as a Service
Short Term:
• Scale out as fast as reasonable
• Split reporting and ad hoc workloads
• Better hardware configuration
• More monitoring
Not So Long Term:
• Better monitoring
• Control Input: Trickle and Bulk Loading, Consistently, Durably and Efficiently
• Control Output: Query inspection/prioritization, Workload management
136. 136 | Copyright © 2017 Criteo
Fixing Your Latent Data Problem
Short Term:
• Migrate critical data workflows to Langoustine
• Optimize DAG and long running queries
Medium Term:
• Migrate long-tail datasets to Langoustine
• Better metrics, capacity planning
Not So Long Term:
• Refactor data model to cull useless data sets
• Better complexity analysis of workflow modifications pre-release
137. 137 | Copyright © 2017 Criteo
We're going to need better instrumentation
Better Workflow Insights in Langoustine
Better Hadoop Job Performance Metrics
138. 138 | Copyright © 2017 Criteo
Let's spend less time making data workflows
Langoustine IDE makes building Hive workflows trivial
139. 139 | Copyright © 2017 Criteo
Langoustine IDE promotes best practices
Workflows are source controlled:
Reviews are built-in:
140. 140 | Copyright © 2017 Criteo
We'll need better dev tools (eg dev-cluster)
build an AWS hadoop cluster:
connect to it via a local docker container:
and load it with data saved in S3:
142. 142 | Copyright © 2017 Criteo
Wait, what about Opera and Vizatra?
didn't you guys do a lot of work on that?
143. 143 | Copyright © 2017 Criteo
A Quick Opera Recap
Opera is the internal replacement for CPOP, built in two parts
A scalding-langoustine data pipeline:
And a vizatra-OLAP web app:
144. 144 | Copyright © 2017 Criteo
We learned a lot from building Opera
• How to use SQL to describe a dashboard
• How to master SQL queries executed from an OLAP app
• How to build big, fast databases
• How to build optimal (or so we think) data processing pipelines
• How to make a decent UI with decent UX
146. 146 | Copyright © 2017 Criteo
Using SQL for dashboard meta-data
SELECT
  time_id as hour,                      -- time dimension
  country_code as country,              -- dimension
  network_id as network,                -- dimension
  SUM(clicks) as clicks,                -- metric
  SUM(displays) as displays,            -- metric
  SUM(clicks) / SUM(displays) as ctr    -- metric
FROM
  facts
WHERE
  time_id BETWEEN ?start AND ?end       -- parameters
GROUP BY
  time_id,
  country_code,
  network_id
147. 147 | Copyright © 2017 Criteo
Using SQL for dashboard meta-data
Time dimension
Dimensions
Metrics
Parameters
148. 148 | Copyright © 2017 Criteo
Big-O(lap)
SELECT
time_id as hour,
country_code as country,
network_id as network,
SUM(clicks) as clicks,
SUM(displays) as displays,
SUM(clicks) / SUM(displays) as ctr
FROM
facts
WHERE
time_id BETWEEN ?start AND ?end
GROUP BY
time_id,
country_code,
network_id
PROJECTION
Revenue by country
SELECTION
Last 7 days in EUR
149. 149 | Copyright © 2017 Criteo
Big-O(lap)
SELECT
time_id as hour,
country_code as country,
network_id as network,
SUM(clicks) as clicks,
SUM(displays) as displays,
SUM(clicks) / SUM(displays) as ctr
FROM
facts
WHERE
time_id BETWEEN ?start AND ?end
GROUP BY
time_id,
country_code,
network_id
PROJECTION
Revenue by country
SELECTION
Last 7 days in EUR
150. 150 | Copyright © 2017 Criteo
Big-O(lap)
SELECT
country_code as country,
SUM(clicks) as clicks,
SUM(displays) as displays
FROM
facts
WHERE
time_id BETWEEN '2016-03-01' AND '2016-03-07'
GROUP BY
country_code
PROJECTION
Revenue by country
SELECTION
Last 7 days in EUR
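The step from the dashboard template to the narrowed query above can be sketched in Python. This is one reading of the slides, not Vizatra's actual code: the `TEMPLATE` structure and the `build_query` helper are hypothetical. The projection chooses which dimensions and metrics to keep; the selection supplies values for the `?start` and `?end` parameters.

```python
# Hypothetical description of the dashboard query from the slides.
TEMPLATE = {
    "table": "facts",
    "time_column": "time_id",
    "dimensions": {
        "hour": "time_id",
        "country": "country_code",
        "network": "network_id",
    },
    "metrics": {
        "clicks": "SUM(clicks)",
        "displays": "SUM(displays)",
        "ctr": "SUM(clicks) / SUM(displays)",
    },
}


def build_query(template, dimensions, metrics):
    """Apply a projection (chosen dimensions and metrics) to the template.

    The selection (time range) stays as ?start / ?end parameters, to be
    bound when the query is executed.
    """
    select = [f"{template['dimensions'][d]} as {d}" for d in dimensions]
    select += [f"{template['metrics'][m]} as {m}" for m in metrics]
    group_by = [template["dimensions"][d] for d in dimensions]
    sql = (
        f"SELECT {', '.join(select)} "
        f"FROM {template['table']} "
        f"WHERE {template['time_column']} BETWEEN ?start AND ?end"
    )
    if group_by:
        sql += f" GROUP BY {', '.join(group_by)}"
    return sql


sql = build_query(TEMPLATE, dimensions=["country"], metrics=["clicks", "displays"])
params = {"start": "2016-03-01", "end": "2016-03-07"}  # the selection
print(sql)
```

Dropping unused dimensions shrinks the GROUP BY, so the database aggregates over fewer groups than the full template would require.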
151. 151 | Copyright © 2017 Criteo
Now that we've gotten intimate with SQL...
Let's see what else we can build...
152. 152 | Copyright © 2017 Criteo
Vizatra Client: One DB Client to Rule Them All
153. 153 | Copyright © 2017 Criteo
Vizatra Client: One DB Client to Rule Them All
• Parse every query and analyze complexity before executing it
• Enforce best practices (e.g. predicates on partitions)
• Degrade gracefully (e.g. don't submit queries to an overloaded DB)
• Score users and queries, share with other users
• Provide basic visualizations to increase analytic productivity
• Support non-SQL datasources
• And your feature?
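As an illustration of "enforce best practices (e.g. predicates on partitions)", the Python sketch below warns when a query reads a partitioned table without filtering on its partition column. It is deliberately naive and regex-based; a real client would rely on a proper SQL parser. The `PARTITION_COLUMNS` mapping and the function name are hypothetical.

```python
import re

# Hypothetical mapping: table name -> partition column.
PARTITION_COLUMNS = {"facts": "time_id"}


def check_partition_predicate(sql):
    """Return warnings for tables queried without a partition filter."""
    warnings = []
    lowered = " ".join(sql.lower().split())  # normalize case and whitespace
    where = lowered.split(" where ", 1)[1] if " where " in lowered else ""
    # Only the WHERE clause counts: cut at GROUP BY / ORDER BY.
    where = re.split(r" group by | order by ", where)[0]
    for table, column in PARTITION_COLUMNS.items():
        reads_table = re.search(rf"\bfrom {table}\b", lowered)
        has_predicate = re.search(rf"\b{column}\b", where)
        if reads_table and not has_predicate:
            warnings.append(f"{table}: add a predicate on partition column {column}")
    return warnings


good = (
    "SELECT country_code, SUM(clicks) FROM facts "
    "WHERE time_id BETWEEN '2016-03-01' AND '2016-03-07' GROUP BY country_code"
)
bad = "SELECT time_id, SUM(clicks) FROM facts GROUP BY time_id"
print(check_partition_predicate(good), check_partition_predicate(bad))
```

A check like this could run before submission, so a full-table scan is rejected or flagged instead of competing for IO with other users.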
154. 154 | Copyright © 2017 Criteo
The End.
Thanks for listening. If any of this sounds fun, we're hiring!
Editor's Notes
The solution is based on the existing process that handles all user events. Each time a user sees a product or puts a product in their basket, Criteo receives an event and stores it in order to display a relevant ad later on. For performance reasons, tracker servers call a “lazy refreshing” memcache rather than the real products directly, which are stored on a Couchbase cluster.
We needed to plug our solution in just after Criteo receives the event.
Step 1: Audit the tracker events. Check the event for mandatory parameters and parameter formats, check whether the event relates to one or several products, and check whether these products actually exist in our system (a product can be missing because of an incomplete product feed, or simply because the advertiser passed us a wrong product id).
Step 2: Send this audit event to Kafka, the well-known Apache messaging system, where a global-scale mirroring system has been set up, allowing us to aggregate data from all around the world.
Step 3: Consume Kafka from Druid, which is a column-oriented distributed data store built on a delta architecture, allowing us to run sub-second queries on the huge amount of metrics we needed to compute.
WebScale writes code to ensure the sustainability and the maintainability of the Criteo real-time platform.
We are 12, and we spend a good chunk of our time looking at performance problems.
Some of the performance problems come from changes in your traffic pattern.
There is no killer like a giant planetary sale, and Kevin will talk about the way we prepare for that.
Significant increase of traffic over a few days
Release freeze
Teams rush to release features before the release freeze. The platform actually becomes more unstable than usual. It is critical to find the issues and fix them before Black Friday.
Monitoring deviant machines across the datacenters
Spotting isolated abnormally behaving servers
Proactively diagnose and fix the issues before they spread to the DC
Gen9: more cores but lower frequency