Scalding @ Coursera
Daniel Jin Hao Chia
A lightning talk I gave about how Coursera decided to use Scalding.
1.
@ Coursera
Daniel Chia
@DanielJHChia
Software Engineer, Infrastructure
2.
Overview
• Context
• Growing Needs
• Hive / Pig / Scalding
3.
Technical (Online Stack)
• 100% hosted on AWS
• Service-oriented architecture
• Mix of MySQL and Cassandra for persistence
• Scala
4.
Existing Warehouse Streaming
5.
Future Warehouse Flow
S3 Event Data
6.
Need 1: Expressive
• Joins
• Aggregations
• Secondary sort
• Multiple map-reduce
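To make two of these needs concrete, here is a minimal sketch of an aggregation and a secondary sort, using plain Scala collections as a stand-in for a distributed pipeline. It is not Scalding code, and the `Click` record and its fields are hypothetical names chosen for illustration.

```scala
object ExpressiveSketch {
  // Hypothetical record type for illustration; not taken from the talk.
  final case class Click(userId: Long, timestamp: Long, page: String)

  // Aggregation: number of clicks per user.
  def clicksPerUser(clicks: Seq[Click]): Map[Long, Int] =
    clicks.groupBy(_.userId).map { case (user, cs) => user -> cs.size }

  // Secondary sort: group by user, then order each user's clicks by timestamp.
  def pagesInOrder(clicks: Seq[Click]): Map[Long, Seq[String]] =
    clicks.groupBy(_.userId).map { case (user, cs) =>
      user -> cs.sortBy(_.timestamp).map(_.page)
    }
}
```

In a MapReduce setting the same idea means that values arrive at each reducer already ordered within their key, rather than being sorted in memory as they are here.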
7.
Need 2: Semi-structured Data
• Increased usage of Cassandra
• Events data
8.
{
  "timestamp": 1411359695744,
  "membershipState": "LearnerEnrolled"
}
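An event like this maps naturally onto a Scala case class. The sketch below is only an illustration: the `MembershipEvent` name is an assumption, the talk does not show Coursera's actual data models, and the timestamp is assumed to be milliseconds since the Unix epoch.

```scala
import java.time.Instant

object EventModel {
  // Hypothetical model whose fields mirror the JSON keys on the slide.
  final case class MembershipEvent(timestamp: Long, membershipState: String) {
    // Assumes the timestamp is milliseconds since the Unix epoch.
    def occurredAt: Instant = Instant.ofEpochMilli(timestamp)
  }
}
```

Under that assumption, `EventModel.MembershipEvent(1411359695744L, "LearnerEnrolled").occurredAt` is `2014-09-22T04:21:35.744Z`.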
9.
{
  "typeName": "multipart",
  "definition": {
    "assignmentParts": {
      "id1": {
        "typeName": "plainText",
        "order": 0,
        "definition": { "prompt": "Write a sentence describing what you think about cereal." }
      },
      "id2": {
        "typeName": "richText",
        "order": 1,
        "definition": { "prompt": "Write a long essay with lots of fancy formatting describing what you think about cereal." }
      },
      "id3": {
        "typeName": "url",
        "order": 2,
        "definition": { "prompt": "Post a link to your favorite cereal." }
      },
      "id4": {
        "typeName": "plainText",
        "order": 3,
        "definition": { …
10.
Choices
• Hive
• Pig
• Scalding
11.
Hive
• SQL-like language
• Great for simple rollups and aggregations
• Procedural transforms difficult to express
12.
Pig
• Mature
• Procedural
• Pig Latin + Lots of UDFs
13.
Scalding – Pros
• Succinct
• Expressive
• All code in one language
• Re-use online data models
14.
Scalding – Pros
• Easy to test
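The slide does not say how jobs were tested at Coursera. One common reason such pipelines are easy to test is that the transformation logic can be written as ordinary functions and checked against small in-memory fixtures. A minimal sketch, using plain Scala collections rather than Scalding's own test helpers (the `Enrollment` record and the `"LearnerUnenrolled"` state are hypothetical):

```scala
object EnrollmentStats {
  // Hypothetical record; only "LearnerEnrolled" appears in the talk's sample event.
  final case class Enrollment(userId: Long, courseId: Long, state: String)

  // Pure transformation: number of enrolled learners per course.
  def enrolledPerCourse(rows: Iterable[Enrollment]): Map[Long, Int] =
    rows
      .filter(_.state == "LearnerEnrolled")
      .groupBy(_.courseId)
      .map { case (course, es) => course -> es.size }

  // A test is then just an assertion over a small fixture.
  def testEnrolledPerCourse(): Unit = {
    val fixture = Seq(
      Enrollment(1L, 10L, "LearnerEnrolled"),
      Enrollment(2L, 10L, "LearnerEnrolled"),
      Enrollment(3L, 20L, "LearnerEnrolled"),
      Enrollment(4L, 20L, "LearnerUnenrolled") // hypothetical state name
    )
    assert(enrolledPerCourse(fixture) == Map(10L -> 2, 20L -> 1))
  }
}
```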
15.
Scalding – Cons
• Have to learn Scala
• More heavyweight for simple experimental work
• Many layers abstracted from MapReduce
16.
Scalding – Example
• User event data
• Want to join with course and topic data
17.
Scalding – Example
val events = TypedTsv … /* load data */ .toTypedPipe
val courses = TypedTsv … .toTypedPipe
val topics = TypedTsv … .toTypedPipe
18.
Scalding – Example
events.groupBy(_.courseId)
  .leftJoin(courses.groupBy(_.courseId))
  .groupBy(_._2.topicId)
  .leftJoin(topics.groupBy(_.topicId))
  /* more analysis */
19.
Scalding – Example
events.groupBy(_.courseId)
  .leftJoin(courses.groupBy(_.courseId))
  .groupBy(_._2.topicId)
  .leftJoin(topics.groupBy(_.topicId))
  /* more analysis */
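The slide code is abbreviated, so as a reading aid here is what the two left joins compute, modelled with plain Scala collections instead of Scalding pipes. The record shapes are hypothetical (the slides elide the real schemas), and the sketch assumes course and topic ids are unique. In a left join the right-hand side may be missing, which is why the course and topic are wrapped in `Option`.

```scala
object JoinSketch {
  // Hypothetical record shapes; the slides elide the real schemas.
  final case class Event(userId: Long, courseId: Long)
  final case class Course(courseId: Long, topicId: Long)
  final case class Topic(topicId: Long, name: String)

  // Left join: every event is kept; course and topic are None when unmatched.
  // Assumes ids are unique, so each lookup table holds one row per id.
  def enrich(
      events: Seq[Event],
      courses: Seq[Course],
      topics: Seq[Topic]
  ): Seq[(Event, Option[Course], Option[Topic])] = {
    val courseById = courses.map(c => c.courseId -> c).toMap
    val topicById = topics.map(t => t.topicId -> t).toMap
    events.map { e =>
      val course = courseById.get(e.courseId)
      val topic = course.flatMap(c => topicById.get(c.topicId))
      (e, course, topic)
    }
  }
}
```

An event whose course is unknown comes out as `(event, None, None)` rather than being dropped, which is the behaviour that distinguishes a left join from an inner join.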
20.
Scalding – Example
events.groupBy(_.courseId)
  .leftJoin(courses.groupBy(_.courseId))
  .groupBy(_._2.topicId)
  .sketch(reducer = 100)
  .leftJoin(topics.groupBy(_.topicId))
21.
Scalding – Wish-list
• More documentation
• Scala 2.11 soon, please?
22.
Questions? We’re hiring!
coursera.org/jobs