2. Aim of Crunch
The main goal of Crunch is to provide a high-level API for writing and testing
complex MapReduce jobs that require multiple processing stages.
In other words:
Make pipelines that are composed of many user-defined functions simple
to write, easy to test, and efficient to run
3. Why Crunch?
A framework for writing, testing, and running MapReduce pipelines.
Crunch does not impose a single data type that all of its inputs must conform to. This is
useful when processing time-series data, serialized object formats, HBase rows and columns, etc.
Crunch provides a library of patterns to implement common tasks like joining data, performing
aggregations and sorting records.
Type safety makes it much less likely to make mistakes in your code.
Simple, powerful testing using the supplied MemPipeline for fast in-memory unit tests (see the
sketch after this list).
Pluggable execution engines such as MapReduce and Spark, which let us keep up to date
with new technology advancements in the big data space without having to rewrite all of our
pipelines with each increment.
Manages pipeline execution.
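For example, a minimal sketch of such an in-memory test; the sample data and the length check are assumptions for illustration:

import static org.junit.Assert.assertEquals;

import org.apache.crunch.PCollection;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;
import org.junit.Test;

public class MemPipelineTest {
  @Test
  public void countsElements() {
    // Build a PCollection directly in memory; no cluster required
    PCollection<String> words = MemPipeline.typedCollectionOf(
        Writables.strings(), "a", "b", "a");
    // length() returns a PObject<Long> that is materialized on getValue()
    assertEquals(3L, words.length().getValue().longValue());
  }
}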
4. Metadata about Crunch
Modeled after ‘FlumeJava’ by Google.
Initial coding of Crunch was done by Josh Wills at Cloudera in 2011.
Licensed under the Apache License, Version 2.0.
DoFns are used by Crunch in the same way that MapReduce uses the
Mapper or Reducer classes.
Runs over Hadoop MapReduce and Apache Spark
5. Crunch APIs
Centered around three interfaces that represent immutable distributed
datasets:
1. PCollection
2. PTable
3. PGroupedTable
6. PCollection – Lazily evaluated parallel collection
PCollection<T> represents a distributed, unsorted and immutable
collection of elements of type T.
E.g.: PCollection<String>
PCollection<T> provides a method, parallelDo, that applies a DoFn to each
element in the PCollection<T> in parallel and returns a new PCollection<U>
as its result.
parallelDo
It supports element-wise processing over an input PCollection<T>.
Signature: collection.parallelDo(name, doFn, pType)
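A minimal sketch of parallelDo; the uppercasing transform is an assumption for illustration:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.types.writable.Writables;

public class UpperCaseExample {
  static PCollection<String> toUpper(PCollection<String> lines) {
    // Apply a DoFn to every element in parallel; the PType argument
    // tells Crunch how to serialize the resulting collection
    return lines.parallelDo("to-upper",
        new DoFn<String, String>() {
          @Override
          public void process(String input, Emitter<String> emitter) {
            emitter.emit(input.toUpperCase());
          }
        },
        Writables.strings());
  }
}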
7. Pipeline – Source > PType > Target
Crunch composes processing into pipelines.
A pipeline is a programmatic description of a DAG.
Different pipelines available are:
MapReduce pipeline
Memory pipeline
Spark pipeline
A pipeline starts with a ‘Source’, which reads the inputs (at least one source per
pipeline).
Input sources available are Avro, Parquet, sequence files, HBase, HFiles, CSV, JDBC, and text.
The data from ‘Source’ is read into ‘PType’.
PType hides the serialization and exposes data in native Java forms.
The data is persisted into a ‘Target’ (at least one target per pipeline).
Output targets available are Avro, Parquet, sequence files, HBase, HFiles, CSV, JDBC, and text.
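A minimal sketch of a complete pipeline from Source to Target; the text-file paths are assumptions:

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;

public class CopyPipeline {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(CopyPipeline.class);
    // Source: read text lines into a PCollection<String>
    PCollection<String> lines = pipeline.read(From.textFile("/in/logs"));
    // Target: persist the collection
    pipeline.write(lines, To.textFile("/out/logs"));
    // Plan and execute the DAG
    pipeline.done();
  }
}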
8. DoFn – The data processor
A simple API to implement
Used to transform PCollections from one form to another.
DoFn is the place for custom logic.
Example:
class Example extends DoFn<String, String> {
  ....
}
The class needs to define a method called ‘process()’:
public void process(String s, Emitter<String> emitter) {
  String data = ...;
  emitter.emit(data);
}
This is where we write our custom logic
9. DoFn runtime processing steps
1. DoFn is given access to the ‘TaskInputOutputContext’ implementation for the current
task. This allows the DoFn to access any configuration and runtime
information needed before or during processing.
2. DoFn’s ‘initialize’ method is called, similar to the ‘setup’ method of the Mapper/Reducer classes.
3. Data processing begins. The map/reduce phase passes each input record to the
‘process’ method of the DoFn. The output is captured by an ‘Emitter<T>’, which
can hand it to another DoFn or serialize it as the output of the current stage.
4. Cleaning up: performed by the ‘cleanup’ method. It has two purposes: emitting
any remaining state of the DoFn through its ‘Emitter<T>’, and releasing any
resources the DoFn holds.
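A sketch of this lifecycle in code; the line-counting logic is an assumption for illustration:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

public class LineCountFn extends DoFn<String, Long> {
  private long count;

  @Override
  public void initialize() {
    count = 0;  // step 2: like setup() in a Mapper/Reducer
  }

  @Override
  public void process(String input, Emitter<Long> emitter) {
    count++;    // step 3: accumulate state instead of emitting per record
  }

  @Override
  public void cleanup(Emitter<Long> emitter) {
    emitter.emit(count);  // step 4: emit accumulated state before the task ends
  }
}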
10. Accessing runtime MapReduce APIs
DoFn provides access to the ‘TaskInputOutputContext’ object:
getConfiguration()
progress()
setStatus()/getStatus()
getTaskAttemptID()
DoFn provides helper methods for working with Hadoop counters, such as ‘increment’.
The final value of a counter can be retrieved from the ‘StageResult’ object.
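A sketch of counter usage; the group and counter names are hypothetical:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

public class NonEmptyFn extends DoFn<String, String> {
  @Override
  public void process(String input, Emitter<String> emitter) {
    if (input.isEmpty()) {
      // 'increment' is the DoFn helper around Hadoop counters
      increment("LogProcessor", "EMPTY_LINES");
      return;
    }
    emitter.emit(input);
  }
}

After pipeline.done(), the final counter values can be read from the StageResult objects of the returned PipelineResult.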
11. Common DoFn patterns
Following are the different flavors of DoFn (see the sketches after this list):
FilterFn – used to keep only the elements of a PCollection<T> that satisfy the
filter condition.
MapFn – Used in transformations where each input will have exactly one
output.
CombineFn – used in conjunction with the ‘combineValues’ method defined
on the PGroupedTable instance. This is used to perform associative
functions in the combiner phase of a MapReduce job.
The associative patterns supported include sums, counts, and unions, via the
‘Aggregator’ interface.
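Sketches of the first two flavors; the filtering and length logic are assumptions for illustration:

import org.apache.crunch.FilterFn;
import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.types.writable.Writables;

public class FlavorExamples {
  static PCollection<Integer> nonEmptyLengths(PCollection<String> lines) {
    // FilterFn: accept() decides whether an element is kept
    PCollection<String> kept = lines.filter(new FilterFn<String>() {
      @Override
      public boolean accept(String input) {
        return !input.isEmpty();
      }
    });
    // MapFn: exactly one output per input
    return kept.parallelDo(new MapFn<String, Integer>() {
      @Override
      public Integer map(String input) {
        return input.length();
      }
    }, Writables.ints());
  }
}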
12. PTable<K,V>
Sub-interface of PCollection<Pair<K,V>>
Represents a distributed, immutable and unordered multimap of key type
K and value type V
PTable<K,V> provides parallelDo, groupByKey, join, and cogroup operations
The groupByKey operation groups together all values in the PTable that have the
same key. (It triggers the sort phase of a MapReduce job.)
Map-side, Bloom filter, and sharded joins are available.
The number of reducers and the partitioning, grouping, and sorting strategies
used in the shuffle phase can be specified in an instance of the GroupingOptions
class, which is then given to the groupByKey function (see the sketch below).
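A sketch of grouping with explicit shuffle options; the reducer count is an assumption:

import org.apache.crunch.GroupingOptions;
import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;

public class GroupingExample {
  static PGroupedTable<String, Long> group(PTable<String, Long> counts) {
    GroupingOptions opts = GroupingOptions.builder()
        .numReducers(10)  // control the parallelism of the shuffle
        .build();
    return counts.groupByKey(opts);
  }
}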
13. PGroupedTable<K,V>
The result of the groupByKey function is a PGroupedTable<K,V> object, which
is a distributed, sorted map of keys of type K to an Iterable of values that may be
iterated over once.
PGroupedTable<K,V> has parallelDo and combineValues operations.
combineValues applies a commutative and associative ‘Aggregator’
to the values of the PGroupedTable instance on both the map
and reduce sides of the shuffle.
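A sketch of combineValues using a built-in Aggregator; the PTable is assumed to hold per-key counts:

import org.apache.crunch.PTable;
import org.apache.crunch.fn.Aggregators;

public class CombineExample {
  static PTable<String, Long> sumPerKey(PTable<String, Long> counts) {
    // SUM_LONGS is commutative and associative, so Crunch can run it
    // as a combiner on the map side as well as in the reducer
    return counts.groupByKey()
                 .combineValues(Aggregators.SUM_LONGS());
  }
}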
14. Across various technologies
Concept | Apache Hadoop MapReduce | Apache Crunch | Apache Pig | Apache Spark | Cascading | Apache Hive | Apache Tez
------- | ----------------------- | ------------- | ---------- | ------------ | --------- | ----------- | ----------
Input Data | InputFormat | Source | LoadFunc | InputFormat | Tap (Source) | SerDe | Tez Input
Output Data | OutputFormat | Target | StoreFunc | OutputFormat | Tap (Sink) | SerDe | Tez Output
Data Container Abstraction | N/A | PCollection | Relation | RDD | Pipe | Table | Vertex
Data Format and Serialization | Writables | POJOs and PTypes | Pig Tuples and Schemas | POJOs and Java/Kryo Serialization | Cascading Tuples and Schemes | List<Object> and ObjectInspectors | Events
Data Transformation | Mapper, Reducer, and Combiner | DoFn | Pig Latin and UDFs | Functions (Java API) | Operations | HiveQL and UDFs | Processor
15. Miscellaneous
Two different serialization frameworks, with a number of convenience methods for
defining PTypes:
Hadoop’s ‘Writable’ interface
Apache ‘Avro’ serialization
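For example, the same logical PType expressed via each framework (a minimal sketch):

import org.apache.crunch.types.PType;
import org.apache.crunch.types.avro.Avros;
import org.apache.crunch.types.writable.Writables;

public class PTypeExamples {
  static final PType<String> WRITABLE_STRINGS = Writables.strings(); // Writable-based
  static final PType<String> AVRO_STRINGS = Avros.strings();         // Avro-based
}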
Crunch can execute an individual DoFn in either the map or reduce phase of a
MapReduce job; it can also execute multiple DoFns in a single phase.
Apache Hive and Apache Pig define domain-specific languages (DSLs) that are
intended to make it easy for data analysts to work with data stored in Hadoop,
while Cascading and Apache Crunch provide Java libraries aimed at
developers who are building pipelines and applications with a focus on
performance and testability.
16. Use Case – Log Data Processor
Let’s see how a simple log data processor can be implemented in Crunch.
17. Use Case – Log Data Processor
Crunch implementation of the above use case:
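A minimal sketch of what such an implementation could look like; the paths, the tab-separated record layout, and the position of the log level are all assumptions:

import org.apache.crunch.FilterFn;
import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.types.writable.Writables;

public class LogProcessor {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(LogProcessor.class);
    PCollection<String> logs = pipeline.read(From.textFile("/logs/raw"));

    // Drop malformed records (assumed shape: at least three tab-separated fields)
    PCollection<String> valid = logs.filter(new FilterFn<String>() {
      @Override
      public boolean accept(String line) {
        return line.split("\t").length >= 3;
      }
    });

    // Extract the log level (assumed to be the first field) and count per level
    PTable<String, Long> countsByLevel = valid.parallelDo(
        new MapFn<String, String>() {
          @Override
          public String map(String line) {
            return line.split("\t")[0];
          }
        }, Writables.strings()).count();

    pipeline.writeTextFile(countsByLevel, "/logs/level-counts");
    pipeline.done();
  }
}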
18. Crunch vs. Cascading, Pig, and Hive
Developers who tend to think about problems as data flow patterns prefer Crunch and
Pig, while those who think in SQL style prefer Cascading and Hive.
Crunch supports an in-memory execution engine that can be used to test and debug
pipelines on local data.
Pig and Cascading use a tuple model; Crunch, however, uses arbitrary Java objects.
Trade-off:
Simple data types requiring only basic built-in functions – use Cascading.
Complex data types requiring more user-defined functions – use Crunch.
Crunch’s compile-time type checking is highly useful.