2. I’m Danielle Jabin
• Data Engineer in the Stockholm office
• A/B testing infrastructure
• California born & raised
• If I can survive a Swedish winter, so can you!
• Studied Computer Science, Statistics, and Real Estate through the M&T program at the University of Pennsylvania
5. Big Data
• 40 million Monthly Active Users
• 20+ million tracks
• 1.5 TB of compressed data from users per day
• 64 TB of data generated in Hadoop each day (including a replication factor of 3)
As of June 9, 2014
20. Kafka
• High-volume pub-sub system
• “Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages.”
21. Kafka
• Robust and scalable solution for collection of logs
• Fast data transfer
• Low CPU overhead
• Built-in partitioning, replication, and fault-tolerance
• Consumers can pull data at different rates
• Able to handle extremely high volumes
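The topic/consumer-group model above can be sketched in a few lines. This is a toy in-memory stand-in for a Kafka broker, not the Kafka client API; the class and method names are illustrative only. It shows how per-group offsets let consumers pull from the same topic at different rates:

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory model of Kafka's topic log: producers append
    messages, and each consumer group pulls at its own pace by
    tracking its own offset into the log."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only log
        self.offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def publish(self, topic, message):
        # Producers only ever append; existing messages are immutable.
        self.topics[topic].append(message)

    def consume(self, group, topic, max_messages=10):
        # Each group advances independently, so a slow consumer
        # never holds back a fast one reading the same topic.
        start = self.offsets[(group, topic)]
        batch = self.topics[topic][start:start + max_messages]
        self.offsets[(group, topic)] += len(batch)
        return batch

broker = MiniBroker()
broker.publish("plays", "track:1")
broker.publish("plays", "track:2")
print(broker.consume("dashboards", "plays"))        # both messages at once
print(broker.consume("hadoop-etl", "plays", 1))     # same topic, one at a time
```

A real broker adds partitioning, replication, and durable storage on top of this append-and-offset core, which is what the bullets above refer to.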
23. Hadoop
• Process and store massive amounts of unstructured data across a distributed cluster
• Grew from one 37-node cluster to 690 nodes today
• 28 PB of storage
• The largest Hadoop cluster in Europe
24. Hadoop
• Entering the land of optimizations
• Data retention policy
• Move to JVM-based languages
• MapReduce languages
• Moving to Crunch, JVM-based, for speed and scalability
• Python with Hadoop Streaming, Java, Hive, Pig, Scala
• Sprunch: Scala wrapper for Crunch, open sourced by Spotify
• Luigi: Spotify’s open-sourced job scheduler, written in Python
• Simple and easy way to chain jobs
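Hadoop Streaming, mentioned above, runs ordinary scripts as the map and reduce phases, reading and writing tab-separated `key\tvalue` lines. Below is a minimal word-count sketch in that style; the function names are illustrative and it runs locally, with a `sorted()` call standing in for Hadoop’s sort/shuffle step:

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit "word\t1" for every word in the input lines.
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(lines):
    # Hadoop sorts mapper output by key before the reduce phase,
    # so identical words arrive as one contiguous run.
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

lines = ["to be or not to be"]
shuffled = sorted(mapper(lines))      # stands in for Hadoop's sort/shuffle
print(list(reducer(shuffled)))        # [('be', 2), ... as "word\tcount" lines]
```

In a real job the same two functions would read stdin and write stdout inside `hadoop jar hadoop-streaming.jar`; JVM-based options like Crunch avoid the per-record serialization cost of this stdin/stdout boundary, which is the speed argument in the bullets above.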
26. Databases
• Aggregates from Hadoop put into PostgreSQL or Cassandra
• Sqoop for bulk transfer between Hadoop and relational databases
• Core data can be used and manipulated for various needs
• Ad hoc queries
• Dashboards
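The flow in this slide can be sketched end to end: aggregates land in a relational database, and dashboards or analysts run ad hoc SQL against them. SQLite stands in for PostgreSQL here, and the table name and columns are illustrative, not Spotify’s actual schema:

```python
import sqlite3

# SQLite stands in for PostgreSQL; daily_plays is a hypothetical aggregate table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_plays (day TEXT, country TEXT, plays INTEGER)")

# In production these rows would arrive from Hadoop via a tool like Sqoop;
# here we insert a few pre-computed aggregates by hand.
rows = [("2014-06-09", "SE", 1200),
        ("2014-06-09", "US", 5400),
        ("2014-06-10", "SE", 1350)]
conn.executemany("INSERT INTO daily_plays VALUES (?, ?, ?)", rows)

# An ad hoc query of the kind a dashboard might run.
total = conn.execute(
    "SELECT country, SUM(plays) FROM daily_plays "
    "GROUP BY country ORDER BY country"
).fetchall()
print(total)  # [('SE', 2550), ('US', 5400)]
```

Keeping only small, pre-aggregated tables in the relational store is what makes these interactive queries cheap; the raw event data stays in Hadoop.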