Apache Crunch
SIMPLE AND EFFICIENT MAPREDUCE PIPELINES
THIN VENEER ON TOP OF MAPREDUCE
Aim of Crunch
The main goal of Crunch is to provide a high-level API for writing and testing
complex MapReduce jobs that require multiple processing stages.
In other words:
Make pipelines that are composed of many user-defined functions simple
to write, easy to test, and efficient to run.
Why Crunch?
• A framework for writing, testing, and running MapReduce pipelines.
• Crunch does not impose a single data type that all of its inputs must conform to. This is useful when
processing time series data, serialized object formats, HBase rows and columns, etc.
• Crunch provides a library of patterns for common tasks such as joining data, performing
aggregations, and sorting records.
• Type safety makes mistakes in your code much less likely.
• Simple, powerful testing using the supplied MemPipeline for fast in-memory unit tests.
• Pluggable execution engines such as MapReduce and Spark let us keep up with
new technology in the big data space without rewriting all of our
pipelines for each new engine.
• Manages pipeline execution.
Metadata about Crunch
• Modeled after Google's ‘FlumeJava’.
• The initial code was written by Josh Wills at Cloudera in 2011.
• Released under the Apache License, Version 2.0.
• Crunch uses DoFns in the same way that MapReduce uses the
Mapper or Reducer classes.
• Runs on Hadoop MapReduce and Apache Spark.
Crunch APIs
• Centered around three interfaces that represent immutable distributed
datasets:
1. PCollection
2. PTable
3. PGroupedTable
PCollection – Lazily evaluated parallel collection
• PCollection<T> represents a distributed, unsorted, and immutable
collection of elements of type T.
• E.g.: PCollection<String>
• PCollection<T> provides a method, parallelDo, that applies a DoFn<T, S> to each
element of the PCollection<T> in parallel and returns a new PCollection<S>
as its result.
parallelDo
It supports element-wise computation over an input PCollection<T>.
Signature: PCollection<S> parallelDo(DoFn<T, S> fn, PType<S> type), optionally with a leading String name.
Pipeline – Source > PType > Target
• A Crunch program is built as a processing pipeline.
• A pipeline is a programmatic description of a DAG.
• The available pipeline implementations are:
   • MapReduce pipeline (MRPipeline)
   • Memory pipeline (MemPipeline)
   • Spark pipeline (SparkPipeline)
• A pipeline starts with a ‘Source’, which supplies its input data (at least one source per
pipeline).
• Available input sources include Avro, Parquet, Sequence files, HBase, HFiles, CSV, JDBC, and text.
• The data from a ‘Source’ is read as the type described by a ‘PType’.
• A PType hides the serialization and exposes the data as native Java objects.
• The data is persisted to a ‘Target’ (at least one target per pipeline).
• Available output targets include Avro, Parquet, Sequence files, HBase, HFiles, CSV, JDBC, and text.
DoFn – The data processor
• A simple API to implement.
• Used to transform PCollections from one form to another.
• DoFn is the place for custom logic.
Example:
class example extends DoFn<String, String>{
….
}
The class needs to define a method called ‘process()’:
public void process(String s, Emitter<String> emitter){
String data = ..;
emitter.emit(data);
}
This is where we write our custom logic.
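The idea of a per-element function that writes to an emitter can be sketched in plain Java without any Hadoop or Crunch dependency. Everything below (SketchDoFn, DoFnSketch, splitWords) is a hypothetical stand-in made up for illustration, not the Crunch classes; it only shows that one input may produce zero, one, or many outputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for the DoFn idea; NOT the Crunch class.
interface SketchDoFn<S, T> {
    // Called once per input element; may emit zero, one, or many outputs.
    void process(S input, Consumer<T> emitter);
}

class DoFnSketch {
    // Mimics the shape of parallelDo, but runs sequentially over an in-memory list.
    static <S, T> List<T> parallelDo(List<S> input, SketchDoFn<S, T> fn) {
        List<T> output = new ArrayList<>();
        for (S element : input) {
            fn.process(element, output::add);
        }
        return output;
    }

    // Example of custom logic: split each line into its non-empty words.
    static List<String> splitWords(List<String> lines) {
        return DoFnSketch.<String, String>parallelDo(lines, (line, emitter) -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.accept(word);
                }
            }
        });
    }
}
```

In this sketch an empty line emits nothing and a line with several words emits several outputs, which is the flexibility that separates a general DoFn from a one-to-one mapping.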
DoFn runtime processing steps
1. The DoFn is given access to the ‘TaskInputOutputContext’ implementation for the current
task. This allows the DoFn to access any configuration and runtime
information it needs before or during processing.
2. The DoFn’s ‘initialize’ method is called. It is similar to the ‘setup’ method of the Mapper/Reducer classes.
3. Data processing begins. The map/reduce phase passes each input to the
‘process’ method of the DoFn. The output is captured by an ‘Emitter<T>’, which
can either pass it to another DoFn or serialize it as the output
of the current stage.
4. Cleaning up: after all input has been processed, the ‘cleanup(Emitter<T>)’ method
is called on every DoFn. It has two purposes: to emit any state the DoFn wants to
pass to the next stage, and to release any resources the DoFn holds.
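The order of the steps above can be illustrated with a small plain-Java model. LifecycleFn, SumFn, and LifecycleSketch are hypothetical names invented for this sketch and are not Crunch classes; the context-injection step is left out, and the point is only the initialize, process, cleanup ordering, with cleanup used to emit accumulated state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in that mirrors the lifecycle hooks named above; NOT the Crunch class.
abstract class LifecycleFn<S, T> {
    void initialize() {}                                   // step 2: one-time setup
    abstract void process(S input, Consumer<T> emitter);   // step 3: once per element
    void cleanup(Consumer<T> emitter) {}                   // step 4: emit leftover state
}

// Keeps a running total and emits it only once, from cleanup.
class SumFn extends LifecycleFn<Integer, Long> {
    private long total;

    @Override
    void initialize() {
        total = 0L;
    }

    @Override
    void process(Integer input, Consumer<Long> emitter) {
        total += input; // nothing is emitted per element
    }

    @Override
    void cleanup(Consumer<Long> emitter) {
        emitter.accept(total);
    }
}

class LifecycleSketch {
    // Drives the hooks in the order the steps describe.
    static <S, T> List<T> run(LifecycleFn<S, T> fn, List<S> input) {
        List<T> output = new ArrayList<>();
        fn.initialize();
        for (S element : input) {
            fn.process(element, output::add);
        }
        fn.cleanup(output::add);
        return output;
    }
}
```

Running SumFn over a list of integers yields a single-element output holding their sum, because the only emit happens in cleanup.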
Accessing runtime MapReduce APIs
• DoFn provides access to the ‘TaskInputOutputContext’ object, which offers:
   • getConfiguration()
   • progress()
   • setStatus()/getStatus()
   • getTaskAttemptID()
• DoFn provides a helper method, ‘increment’, for working with Hadoop counters.
The final value of a counter can be retrieved from the ‘StageResult’ object.
Common DoFn patterns
The following are common specializations of DoFn:
• FilterFn – keeps only those elements of a PCollection<T> that satisfy the
filter condition.
• MapFn – used in transformations where each input has exactly one
output.
• CombineFn – used in conjunction with the ‘combineValues’ method defined
on a PGroupedTable instance. It performs associative
operations that can run in the combiner phase of a MapReduce job.
• The supported associative patterns include sums, counts, and unions, via the
‘Aggregator’ interface.
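How the filter and map patterns relate to the general process method can be sketched in plain Java. The classes below (BaseFn, SketchFilterFn, SketchMapFn, PatternsSketch) are hypothetical stand-ins, not the Crunch classes; the method names accept and map are chosen to match Crunch's FilterFn and MapFn, and everything else is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-ins; NOT the Crunch classes.
abstract class BaseFn<S, T> {
    abstract void process(S input, Consumer<T> emitter);
}

abstract class SketchFilterFn<T> extends BaseFn<T, T> {
    abstract boolean accept(T input);

    @Override
    void process(T input, Consumer<T> emitter) {
        if (accept(input)) {
            emitter.accept(input); // zero or one output per input
        }
    }
}

abstract class SketchMapFn<S, T> extends BaseFn<S, T> {
    abstract T map(S input);

    @Override
    void process(S input, Consumer<T> emitter) {
        emitter.accept(map(input)); // exactly one output per input
    }
}

class PatternsSketch {
    // Applies a function to every element of an in-memory list.
    static <S, T> List<T> apply(List<S> input, BaseFn<S, T> fn) {
        List<T> output = new ArrayList<>();
        for (S element : input) {
            fn.process(element, output::add);
        }
        return output;
    }

    // Drops empty strings, then maps each remaining string to its length.
    static List<Integer> lengthsOfNonEmpty(List<String> lines) {
        List<String> kept = apply(lines, new SketchFilterFn<String>() {
            @Override
            boolean accept(String s) {
                return !s.isEmpty();
            }
        });
        return apply(kept, new SketchMapFn<String, Integer>() {
            @Override
            Integer map(String s) {
                return s.length();
            }
        });
    }
}
```

The specialized classes only narrow what process may emit, so user code implements a one-line accept or map instead of handling the emitter directly.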
PTable<K,V>
• Sub-interface of PCollection<Pair<K,V>>.
• Represents a distributed, immutable, and unordered multimap of key type
K and value type V.
• PTable<K,V> provides the parallelDo, groupByKey, join, and cogroup operations.
• The groupByKey operation groups together all values in the PTable that have the same
key. (It triggers the sort phase of a MapReduce job.)
• Map-side, Bloom filter, and sharded joins are available.
• The number of reducers and the partitioning, grouping, and sorting strategies
used in the shuffle phase can be specified in an instance of the GroupingOptions
class, which is then passed to the groupByKey function.
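The effect of grouping by key can be sketched on a single machine with plain Java collections. GroupByKeySketch is a hypothetical name and the code does not use Crunch; a TreeMap stands in for the key sort that the shuffle performs, and Map.Entry stands in for a key/value pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupByKeySketch {
    // Collects all values that share a key. The TreeMap keeps keys sorted,
    // standing in for the sort phase that a real shuffle performs.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }
}
```

For the input pairs (b, 1), (a, 2), (b, 3) this yields the keys in the order a, b, with b mapped to the values 1 and 3.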
PGroupedTable<K,V>
• The result of the groupByKey function is a PGroupedTable<K,V> object, which
is a distributed, sorted map from keys of type K to an Iterable<V> that may be
iterated over only once.
• PGroupedTable<K,V> has the parallelDo and combineValues operations.
• combineValues applies a commutative and associative ‘Aggregator’
to the values of the PGroupedTable instance on both the map
and reduce sides of the shuffle.
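Why the aggregation must be commutative and associative can be seen in a small plain-Java sketch. CombineSketch and its methods are hypothetical names, not Crunch APIs: the values for one key are assumed to be spread over several non-empty partitions, each partition is pre-combined (the map side), and the partial results are combined again (the reduce side).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

class CombineSketch {
    // Folds a non-empty list of values with the given operator.
    static long combine(List<Long> values, BinaryOperator<Long> op) {
        Long acc = values.get(0);
        for (int i = 1; i < values.size(); i++) {
            acc = op.apply(acc, values.get(i));
        }
        return acc;
    }

    // Map side: each partition pre-combines its own values.
    // Reduce side: the partial results are combined once more.
    // Assumes every partition holds at least one value.
    static long combineInTwoPhases(List<List<Long>> partitions, BinaryOperator<Long> op) {
        List<Long> partials = new ArrayList<>();
        for (List<Long> partition : partitions) {
            partials.add(combine(partition, op));
        }
        return combine(partials, op);
    }
}
```

With an operator such as Long::sum or Math::max, the two-phase result equals the result of combining all values at once, regardless of how they are partitioned. That property is what lets the same aggregation run safely in the combiner and in the reducer.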
Across various technologies
Concept | Apache Hadoop MapReduce | Apache Crunch | Apache Pig | Apache Spark | Cascading | Apache Hive | Apache Tez
Input Data | InputFormat | Source | LoadFunc | InputFormat | Tap (Source) | SerDe | Tez Input
Output Data | OutputFormat | Target | StoreFunc | OutputFormat | Tap (Sink) | SerDe | Tez Output
Data Container Abstraction | N/A | PCollection | Relation | RDD | Pipe | Table | Vertex
Data Format and Serialization | Writables | POJOs and PTypes | Pig Tuples and Schemas | POJOs and Java/Kryo Serialization | Cascading Tuples and Schemes | List<Object> and ObjectInspectors | Events
Data Transformation | Mapper, Reducer, and Combiner | DoFn | PigLatin and UDFs | Functions (Java API) | Operations | HiveQL and UDFs | Processor
Miscellaneous
• Crunch supports two serialization frameworks, each with a number of convenience methods for
defining PTypes:
   • Hadoop's ‘Writable’ interface
   • Apache ‘Avro’ serialization
• Crunch can execute an individual DoFn in either the map or the reduce phase of a
MapReduce job, and it can also execute multiple DoFns in a single phase.
• Apache Hive and Apache Pig define domain-specific languages (DSLs) that are
intended to make it easy for data analysts to work with data stored in Hadoop,
while Cascading and Apache Crunch provide Java libraries that are aimed at
developers who are building pipelines and applications with a focus on
performance and testability.
Use Case – Log Data Processor
Let's see how the simple log data processor below can be implemented in Crunch.
Use Case – Log Data Processor
Crunch implementation of above use case
Crunch Vs Cascading, Pig, Hive
• Developers who tend to think about problems as data flow patterns prefer Crunch and
Pig, while those who think in SQL style prefer Cascading and Hive.
• Crunch supports an in-memory execution engine that can be used to test and debug
pipelines on local data.
• Pig and Cascading use a ‘Tuple model’, whereas Crunch uses arbitrary objects.
• Trade-off:
   • Simple data types that need only basic built-in functions – use Cascading
   • Complex data types that require more user-defined functions – use Crunch
• Crunch's compile-time type checking is highly useful.
QUERIES?
