@Scalding
https://github.com/twitter/scalding


Oscar Boykin
Twitter
@posco
#hadoopsummit
I encourage live tweeting (mention @posco/@scalding)
• What is Scalding?
• Why Scala for Map/Reduce?
• How is it used at Twitter?
• What’s next for Scalding?
Yep, we’re counting words:
Scalding jobs subclass Job
Yep, we’re counting words:
Logic is in the constructor
Yep, we’re counting words:
Functions can be called or defined inline
Scalding Model

• Source objects read and write data (from
  HDFS, DBs, MemCache, etc...)
• Pipes represent the flows of the data in the
  job. You can think of Pipe as a distributed
  list.
Yep, we’re counting words:
Read and write data through Source objects
Yep, we’re counting words:
Data is modeled as streams of named Tuples (of objects)
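The code on these slides was an image and did not survive in this transcript. As a rough stand-in, the sketch below shows the same word-count shape using plain Scala collections rather than Scalding itself. The names WordCountSketch and countWords are hypothetical, and a real job would read and write through Source objects instead of an in-memory Seq.

```scala
object WordCountSketch {
  // Split each line into words, group identical words, and count each group.
  // This is a local-collections analog, not the Scalding API.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+"))  // one record per word
      .filter(_.nonEmpty)                    // drop empty tokens (e.g. from leading whitespace)
      .groupBy(identity)                     // comparable to grouping on a word field
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit =
    countWords(Seq("hello world", "hello scalding"))
      .toSeq
      .sortBy(_._1)
      .foreach(println)
}
```

The flatMap / groupBy / size pipeline here is the same sequence of steps the slides annotate, only running on a local list instead of a Pipe.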
Why Scala
• The Scala language has many built-in
  features that make domain-specific
  languages easy to implement.
• Map/Reduce already fits the functional
  paradigm.
• Scala’s collection API covers almost all the
  usual use cases.
Word Co-occurrence
Word Co-occurrence
We can use standard Scala containers
Word Co-occurrence
We can do real logic in the mapper without external UDFs.
Word Co-occurrence
Generalized “plus” handles lists/sets/maps and can be customized (implement Monoid[T])
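To make the "generalized plus" idea concrete, here is a minimal sketch of a Monoid type class with instances for Int and Map. The trait below is a hypothetical, simplified definition written for illustration; it is not the actual definition used by Scalding, and the names MonoidSketch and sum are assumptions.

```scala
object MonoidSketch {
  // Hypothetical minimal type class: an identity element plus an associative "plus".
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  object Monoid {
    implicit val intMonoid: Monoid[Int] = new Monoid[Int] {
      def zero: Int = 0
      def plus(a: Int, b: Int): Int = a + b
    }

    // Maps add value-wise: a key present on both sides has its values combined.
    implicit def mapMonoid[K, V](implicit vm: Monoid[V]): Monoid[Map[K, V]] =
      new Monoid[Map[K, V]] {
        def zero: Map[K, V] = Map.empty
        def plus(a: Map[K, V], b: Map[K, V]): Map[K, V] =
          b.foldLeft(a) { case (acc, (k, v)) =>
            acc.updated(k, vm.plus(acc.getOrElse(k, vm.zero), v))
          }
      }
  }

  // Sum any collection whose element type has a Monoid instance.
  def sum[T](items: Iterable[T])(implicit m: Monoid[T]): T =
    items.foldLeft(m.zero)(m.plus)
}
```

With this in place, MonoidSketch.sum(Seq(Map("a" -> 1), Map("a" -> 2, "b" -> 3))) merges the maps key by key, which is the kind of reduction a co-occurrence count needs.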
GroupBuilder: enabling parallel reductions
• groupBy takes a function that mutates a
  GroupBuilder.
• GroupBuilder adds fields which are
  reductions of (potentially different) inputs.
• On the left (the slide’s code sample), we
  add 7 fields.
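The slide's code sample is missing from this transcript. As an illustration of the idea only, the sketch below computes several reductions per key in one pass over the data, using plain Scala collections. Stats, statsByKey, and GroupStatsSketch are hypothetical names, and this is not how GroupBuilder is implemented.

```scala
object GroupStatsSketch {
  // Three reductions carried together so one pass over the group fills all of them.
  final case class Stats(count: Long, sum: Double, max: Double)

  def statsByKey[K](rows: Seq[(K, Double)]): Map[K, Stats] =
    rows.foldLeft(Map.empty[K, Stats]) { case (acc, (k, v)) =>
      val next = acc.get(k) match {
        case Some(s) => Stats(s.count + 1, s.sum + v, math.max(s.max, v))
        case None    => Stats(1L, v, v)  // first value seen for this key
      }
      acc.updated(k, next)
    }
}
```

Each field of Stats plays the role of one added output field: all of them are reductions over the same group, computed side by side.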
scald.rb
• driver script that compiles the job and runs
  it locally or transfers and runs remotely.
• we plan to add EMR support.
Most functions in the API have very close analogs in
scala.collection.Iterable.
Cascading
• is the Java library that handles most of the
  map/reduce planning for Scalding.
• has years of production use.
• is used, tested, and optimized by many teams
  (Concurrent Inc.; DSLs in Scala, Clojure, and
  Python at Twitter; Ruby at Etsy).
• has a (very fast) local mode that runs without
  Hadoop.
• has a flow planner designed to be portable
  (Cascading on Spark? Storm?)
mapReduceMap
• We abstract Cascading’s map-side
  aggregation ability with a function called
  mapReduceMap.
• If only mapReduceMaps are called, map-side
  aggregation works. If a foldLeft is called
  (which cannot be done map-side), scalding
  falls back to pushing everything to the
  reducers.
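The map / reduce / map shape can be sketched with plain Scala collections. The function below is a hypothetical local stand-in that only mirrors the idea; its signature is an assumption and not Scalding's API. Because the reduce step is associative, each partition can be pre-aggregated before partial results are merged, which is what makes map-side aggregation possible. A foldLeft has no such merge step, since its accumulator depends on seeing the values in order.

```scala
object MapReduceMapSketch {
  // mapFn: T => X, reduceFn: (X, X) => X (must be associative), finishFn: X => U.
  // Each inner Seq stands in for the records seen by one mapper.
  def mapReduceMap[T, X, U](partitions: Seq[Seq[T]])(mapFn: T => X)(
      reduceFn: (X, X) => X)(finishFn: X => U): Option[U] = {
    // "Map-side": one partial aggregate per non-empty partition.
    val partials: Seq[X] =
      partitions.flatMap(part => part.map(mapFn).reduceOption(reduceFn))
    // "Reduce-side": merge the partials, then apply the final map.
    partials.reduceOption(reduceFn).map(finishFn)
  }

  // Average expressed as map (value -> (value, 1)), reduce (pairwise sum), map (divide).
  def average(partitions: Seq[Seq[Double]]): Option[Double] =
    mapReduceMap(partitions)((x: Double) => (x, 1L)) { (a, b) =>
      (a._1 + b._1, a._2 + b._2)
    } { case (total, n) => total / n }
}
```

Sum, count, min, max, and average all fit this pattern, which is consistent with the claim that most reductions can be written this way.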
Most Reductions are mapReduceMap
Optimized Joins
• The map-side join is called joinWithTiny.
  It implements a left or inner join against a
  very small pipe.
• blockJoin: deals with data skew by
  replicating the data (useful for walking the
  Twitter follower graph, where everyone
  follows Gaga/Bieber/Obama).
• coming: combine the above to dynamically
  set replication on a per key basis: only Gaga
  is replicated, and just the right amount.
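The idea behind a map-side join against a tiny input can be sketched with plain Scala collections: the small side is loaded into an in-memory hash map, and the large side is streamed past it without any shuffle. The function names below are hypothetical and only illustrate the technique; they are not the joinWithTiny implementation.

```scala
object TinyJoinSketch {
  // Build an in-memory lookup from the tiny side, keeping duplicate keys.
  private def index[K, B](tiny: Seq[(K, B)]): Map[K, Seq[B]] =
    tiny.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Inner join: large-side rows with no match in the tiny side are dropped.
  def innerJoinWithTiny[K, A, B](large: Iterator[(K, A)],
                                 tiny: Seq[(K, B)]): Iterator[(K, (A, B))] = {
    val lookup = index(tiny)
    large.flatMap { case (k, a) =>
      lookup.getOrElse(k, Seq.empty[B]).map(b => (k, (a, b)))
    }
  }

  // Left join: unmatched large-side rows are kept, paired with None.
  def leftJoinWithTiny[K, A, B](large: Iterator[(K, A)],
                                tiny: Seq[(K, B)]): Iterator[(K, (A, Option[B]))] = {
    val lookup = index(tiny)
    large.flatMap { case (k, a) =>
      lookup.get(k) match {
        case Some(bs) => bs.map(b => (k, (a, Option(b))))
        case None     => Seq((k, (a, Option.empty[B])))
      }
    }
  }
}
```

The large side is an Iterator to stress that it is only streamed once; only the tiny side has to fit in memory.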
Scalding @Twitter
• Revenue quality team (ads targeting, market
  insight, click-prediction, traffic-quality) uses
  scalding for all our work.
• Scala engineers throughout the company
  use it (e.g. storage, platform).
• More than 60 in-production scalding jobs,
  more than 200 ad-hoc jobs.
• Not our only tool: Pig, PyCascading,
  Cascalog, Hive are also used.
Example: finding similarity
• A simple recommendation algorithm is
  cosine similarity.
• Represent user-tweet interaction as a
  vector, then find the users whose vectors
  point in directions near that of the user in
  question.
• We’ve developed a Matrix library on top of
  scalding to make this easy.
Cosine Similarity
Matrices are strongly typed.
Cosine Similarity
Col,Row types (Int,Int) can be anything comparable. Strings are useful for text indices.
Cosine Similarity
Value (Double) can be anything with a Ring[T] (plus/times)
Cosine Similarity
Operator overloading gives intuitive code.
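The Matrix code from these slides is not in the transcript. For reference, cosine similarity between two sparse vectors can be sketched in plain Scala as below. The SparseVec layout (item id to weight) and all names are assumptions for illustration; this is a local computation, not the Matrix library.

```scala
object CosineSketch {
  // Hypothetical layout: tweet id -> interaction weight for one user.
  type SparseVec = Map[String, Double]

  // Dot product: iterate the smaller vector and look up matches in the larger one.
  def dot(a: SparseVec, b: SparseVec): Double = {
    val (small, big) = if (a.size <= b.size) (a, b) else (b, a)
    small.iterator.map { case (k, v) => v * big.getOrElse(k, 0.0) }.sum
  }

  def norm(a: SparseVec): Double = math.sqrt(dot(a, a))

  // cos(a, b) = (a . b) / (|a| * |b|); undefined when either vector is all zeros.
  def cosine(a: SparseVec, b: SparseVec): Option[Double] = {
    val denom = norm(a) * norm(b)
    if (denom == 0.0) None else Some(dot(a, b) / denom)
  }
}
```

Two users who interacted with exactly the same tweets in the same proportions get a similarity of 1, and users with no tweets in common get 0.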
Matrix in foreground, map/reduce behind
With this syntax, we can focus on logic, not how to map linear algebra to Hadoop
Example uses:
• Do random-walks on the following graph.
  Matrix power iteration until convergence:
  (m * m * m * m).
• Dimensionality reduction of follower graph
  (Matrix product by a lower dimensional
  projection matrix).
• Triangle counting: (M*M*M).trace / 3
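The triangle-counting formula can be checked on a small dense matrix. The sketch below uses plain Scala Vectors, assumes square matrices, and all names are hypothetical; it is not the Matrix library. Entry (i, j) of M is 1 if i follows j. Each directed 3-cycle contributes 3 to the trace of M*M*M (once per starting vertex), hence the division by 3. For a symmetric (undirected) graph, every triangle is counted once in each direction, so the same formula returns twice the number of undirected triangles.

```scala
object TriangleSketch {
  type Matrix = Vector[Vector[Long]]

  // Dense product of two n x n matrices.
  def multiply(a: Matrix, b: Matrix): Matrix = {
    val n = a.length
    Vector.tabulate(n, n) { (i, j) =>
      (0 until n).map(k => a(i)(k) * b(k)(j)).sum
    }
  }

  // Sum of the diagonal entries.
  def trace(m: Matrix): Long = m.indices.map(i => m(i)(i)).sum

  // Number of directed 3-cycles: trace(M * M * M) / 3.
  def directedTriangles(m: Matrix): Long =
    trace(multiply(multiply(m, m), m)) / 3
}
```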
What is next?

• Improve the logical flow planning (reorder
  commuting filters/projections before maps,
  etc...).
• Improve Matrix flow planning to narrow
  the gap to hand optimized code.
One more thing:


• Type-safety geeks can relax: we just pushed
  a type-safe API to scalding 0.6.0, analogous
  to Scoobi/Scrunch/Spark.
That’s it.


• follow and mention: @scalding @posco
• pull reqs: http://github.com/twitter/scalding

Scalding: Twitter's Scala DSL for Hadoop/Cascading

