Patchwork Data at Etsy
4. Etsy
• Timeline (2005 to 2013): June
7. Okay, we do
• http://codeascraft.etsy.com
• https://www.etsy.com/codeascraft/talks
• http://kongscreenprinting.com
8. Catch Phrases
• Continuous deployment
• Blameless postmortems
• Measure everything
• Continuous experimentation
12. Adtuitive
• Online advertising network
• Match forum posts with rich product advertisements
• Unafraid of scaling across Etsy sellers
15. LAMP Stack for Big Data
• HDFS • Pig
• MapReduce • Oozie
• HBase • Avro
• Hive • Zookeeper
• Flume
• JDBC/ODBC
• Hue
Source: http://gigaom.com/2010/08/01/meet-big-data-equivalent-of-the-lamp-stack/
17. LAMP Stack for Big Data
• HDFS → S3 • Pig → Cascading
• MapReduce → Elastic MapReduce • Oozie
• HBase • Avro → TupleSerialization
• Hive • Zookeeper
• Flume
• JDBC/ODBC
• Hue
19. Applications
• Log ETL • A/B Analyzer
• Database snapshotter • Catapult
• TasteTest • Distributed search indexing
• Facebook Gift Recommender • Fast Game (search index)
• Complementary/similar listings • Search autosuggest
• Funnel Cake • SearchAds
• Feature Funnel • SCRAM ETL (fraud detection)
21. Catapult
• End-to-end success story
• Extremely valuable for a web shop
24. Relevancy Thursdays
• Default search order was recency
• Relisting was our equivalent of advertising
• $0.20 updated your listing’s timestamp
25. Relevancy Thursdays
• Recency was meant to support “freshness” in search results
• Search originated as PostgreSQL query
• Converted to Solr to scale
34. Heyday of Tooling
• A/B framework
• Front end event logger
• Database snapshotter
• Barnum and Bailey
• Custom operator library
• Loaders
35. LAMP Stack for Big Data
• HDFS → S3 • Pig → Cascading
• MapReduce → Elastic MapReduce • Oozie
• HBase • Avro → TupleSerialization
• Hive • Zookeeper
• Flume
• JDBC/ODBC
• Hue
36. LAMP Stack for Big Data
• HDFS → S3 • Pig → Cascading
• MapReduce → Elastic MapReduce • Oozie → Barnum
• HBase • Avro → TupleSerialization
• Hive • Zookeeper
• Flume → Akamai
• JDBC/ODBC → snapshotter/loaders
• Hue
41. Why did it take so long?
• Non-web developers learning the PHP stack
• Failed experiments with “easier to use” MapReduce tools
• Realizing self-service analytics was what Etsy needed
44. Catapult
• Timeline (2005 to 2013): February
45. Catapult
• A/B Analyzer + Launch Calendar
• Full product lifecycle
49. LAMP Stack for Big Data
• HDFS → S3 • Pig → Cascading
• MapReduce → Elastic MapReduce • Oozie → Barnum
• HBase • Avro → TupleSerialization
• Hive • Zookeeper
• Flume → Akamai
• JDBC/ODBC → snapshotter/loaders
• Hue
50. LAMP Stack for Big Data
• HDFS • Pig → Cascading
• MapReduce • Oozie
• HBase • Avro → TupleSerialization
• Hive → Vertica • Zookeeper
• Flume → logrotate
• JDBC/ODBC → snapshotter/loaders
• Hue
55. RDBMS / Cascading
RDBMS layer | Cascading equivalent
SQL | cascading.jruby
Query Planner/Optimizer | Cascading
Execution Engine | MapReduce
Storage | HDFS
57. cascading.jruby
• Productivity: no compile
• Reuse: factor out structure
• Efficiency: no JRuby runtime
• Optimization: move aggregations map-side
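The "move aggregations map-side" bullet refers to summing partial results inside each map task so that fewer records cross the shuffle. The following is a minimal Python sketch of that idea, not cascading.jruby's actual implementation; the function names and the sample records are hypothetical.

```python
from collections import Counter, defaultdict

def map_naive(records):
    """Emit one (key, 1) pair per record; every pair crosses the shuffle."""
    return [(key, 1) for key in records]

def map_with_partial_aggregation(records):
    """Pre-sum counts inside the map task: one pair per distinct key."""
    return list(Counter(records).items())

def reduce_counts(shuffled_pairs):
    """Sum the values for each key, as a reducer would after the shuffle."""
    totals = defaultdict(int)
    for key, value in shuffled_pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical input splits, one list per map task.
splits = [["view", "view", "cart", "view"], ["cart", "view", "purchase"]]

naive_pairs = [p for split in splits for p in map_naive(split)]
combined_pairs = [p for split in splits for p in map_with_partial_aggregation(split)]

# Both strategies give the same totals; the second shuffles fewer pairs.
print(len(naive_pairs), len(combined_pairs))
print(reduce_counts(combined_pairs))
```

The reducer is unchanged in both cases, which is why this kind of rewrite can be applied automatically for associative aggregations such as counts and sums.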
60. Productivity
• Job templates
• Reloader
• Cascading local mode
• Sampled data
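The slide does not say how the sampled data was produced. One common approach, shown here purely as an assumption, is to sample deterministically by hashing a key, so that the same users survive in every dataset and joins on the sample still match. The helper names, field names, and records below are hypothetical.

```python
import hashlib

def in_sample(key, percent):
    """Return True if `key` falls in a deterministic `percent`% sample."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent

def sample_records(records, key_field, percent):
    """Keep only the records whose key hashes into the sample."""
    return [r for r in records if in_sample(r[key_field], percent)]

# Hypothetical datasets sharing a user_id key.
visits = [{"user_id": f"u{i}", "page": "listing"} for i in range(1000)]
purchases = [{"user_id": f"u{i}", "cents": 500} for i in range(0, 1000, 10)]

sampled_visits = sample_records(visits, "user_id", 10)
sampled_purchases = sample_records(purchases, "user_id", 10)
```

Because the decision depends only on the key, rerunning a job against the sample gives the same rows each time, which suits fast local iteration.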
65. Efficiency
• Just a constructor
• Calls into Cascading API
• No JRuby runtime on cluster
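The bullets above describe a DSL that only constructs a job description on the client, so the scripting runtime is not needed where the job executes. The sketch below illustrates that separation in Python. It is not cascading.jruby: the `FlowBuilder` class, its methods, and the paths are hypothetical, and a JSON plan stands in for the Cascading objects the real DSL would build through the Cascading API.

```python
import json

class FlowBuilder:
    """Client-side DSL: each call only records a step; nothing runs here."""

    def __init__(self, source):
        self.steps = [{"op": "source", "path": source}]

    def group_by(self, field):
        self.steps.append({"op": "group_by", "field": field})
        return self

    def count(self, into):
        self.steps.append({"op": "count", "into": into})
        return self

    def sink(self, path):
        self.steps.append({"op": "sink", "path": path})
        return self

    def to_plan(self):
        """Serialize the recorded steps; this is all the cluster would see."""
        return json.dumps(self.steps)

plan = (
    FlowBuilder("logs/visits")
    .group_by("listing_id")
    .count("views")
    .sink("out/views_by_listing")
    .to_plan()
)
print(plan)
```

The plan is plain data, so whatever executes it needs no knowledge of the builder that produced it.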
69. Scalding
• Distributed collections
• Function literals replace UDFs
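To illustrate "function literals replace UDFs", here is a small Python stand-in for a collection-style pipeline in which inline lambdas do the work that would otherwise need separately defined user functions. This is not the Scalding API: the `Pipe` class is hypothetical and runs on local lists rather than distributed collections.

```python
from collections import defaultdict

class Pipe:
    """A local, list-backed stand-in for a distributed collection."""

    def __init__(self, rows):
        self.rows = list(rows)

    def flat_map(self, fn):
        return Pipe(item for row in self.rows for item in fn(row))

    def filter(self, pred):
        return Pipe(row for row in self.rows if pred(row))

    def group_sum(self, key_fn, value_fn):
        totals = defaultdict(int)
        for row in self.rows:
            totals[key_fn(row)] += value_fn(row)
        return Pipe(sorted(totals.items()))

lines = Pipe(["etsy sells handmade goods", "handmade goods ship daily"])

# Each transformation is an inline function literal, not a registered UDF.
word_counts = (
    lines.flat_map(lambda line: line.split())
    .filter(lambda word: len(word) > 3)
    .group_sum(lambda word: word, lambda word: 1)
    .rows
)
print(word_counts)
```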
77. Hive
• Timeline (2005 to 2013): January
79. Hive
• Slow
• Sensitive
• Operational burden
• Educational burden
80. Vertica
• Offline copy of shards, master, auxiliary databases
• Joins are easy
• Reasonable latency
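Once copies of the shards and the master database sit in one analytics store, a cross-table question becomes a single SQL join. The sketch below shows the shape of that workflow using Python's built-in `sqlite3` as a stand-in for Vertica; the table names, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE listings (
        listing_id INTEGER PRIMARY KEY,
        user_id INTEGER,
        price_cents INTEGER
    );
""")

# Hypothetical offline copies: users from the master, listings from two shards.
users_from_master = [(1, "US"), (2, "FR")]
listings_shard_a = [(10, 1, 1500), (11, 1, 2500)]
listings_shard_b = [(20, 2, 4000)]

conn.executemany("INSERT INTO users VALUES (?, ?)", users_from_master)
for shard in (listings_shard_a, listings_shard_b):
    conn.executemany("INSERT INTO listings VALUES (?, ?, ?)", shard)

# With everything in one place, the join needs no application-side stitching.
rows = conn.execute("""
    SELECT u.country, COUNT(*) AS listing_count, SUM(l.price_cents) AS total_cents
    FROM listings AS l
    JOIN users AS u ON u.user_id = l.user_id
    GROUP BY u.country
    ORDER BY u.country
""").fetchall()
print(rows)
```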
81. Vertica
• Timeline (2005 to 2013): November
82. Vertica
• Game changer at Etsy
• High demand for joins
• Rapid prototyping data pipelines
84. RDBMS / Cascading
RDBMS layer | Cascading equivalent
SQL | cascading.jruby
Query Planner/Optimizer | Cascading
Execution Engine | MapReduce
Storage | HDFS
86. Vertica
• Not Hive, Impala, Shark, etc.
• May change our minds
89. Etsyweb
• memcached
• Gearman
• Sharded MySQL
92. Turns out people don’t make product decisions in real time
http://mcfunley.com/whom-the-gods-would-destroy-they-first-give-real-time-analytics
93. Summing Up
• Be glad you’re living in the future
• Automated tools for the common case
• Don’t be afraid to experiment
94. Image Credits
• http://kongscreenprinting.com/what-we-do-showcase
• http://www.globaltimes.cn/SPECIALCOVERAGE/Top10Peopleof2011.aspx
• http://animal.discovery.com
• http://www.theculturemap.com/scream-time-edvard-munch-museum/
• http://www.rallyrace.com/turning-over-the-stone-event-production-basics/
• http://www.repentamerica.com/webelieve.html
• http://www.flickr.com/photos/bbalaji/2443820505/
• https://soundcloud.com/tearland/tl-hive
• http://pocketnow.com/2012/08/02/wifi-vs-data-speed-vs-battery-life/bush-scratching-head
• http://www.madeyoulaugh.com/funny_photos/caveman_harley/caveman_harley.jpg
• http://theundercoverrecruiter.com/6-ways-catapult-your-job-search-after-layoff/
95. Contact / Reference
• Matt Walker
• @data_daddy
• http://codeascraft.etsy.com/
• http://www.etsy.com/codeascraft/talks