Spark or Hadoop: Is it an either-or proposition?
By Slim Baltagi (@SlimBaltagi)
Big Data Practice Director, Advanced Analytics LLC
OR … XOR??
Los Angeles Spark Users Group
March 12, 2015
Your Presenter – Slim Baltagi
• Sr. Big Data Solutions Architect living in Chicago.
• Over 17 years of IT and business experience.
• Over 4 years of Big Data experience working on over 12 Hadoop projects.
• Speaker at Big Data events.
• Creator and maintainer of the Apache Spark Knowledge Base http://www.SparkBigData.com with over 4,000 categorized Apache Spark web resources.
@SlimBaltagi
https://www.linkedin.com/in/slimbaltagi
sbaltagi@gmail.com
Disclaimer: This is a vendor-independent talk that expresses my own opinions. I am not endorsing or promoting any product or vendor mentioned in this talk.
Agenda
I. Motivation
II. Big Data, Typical Big Data Stack, Apache Hadoop, Apache Spark
III. Spark with Hadoop
IV. Spark without Hadoop
V. More Q&A
I. Motivation
1. News
2. Surveys
3. Vendors
4. Analysts
5. Key Takeaways
1. News
• Is it Spark 'vs' OR 'and' Hadoop?
• Apache Spark: Hadoop friend or foe?
• Apache Spark: killer or savior of Apache Hadoop?
• Apache Spark's Marriage To Hadoop Will Be Bigger Than Kim And Kanye.
• Adios Hadoop, Hola Spark!
• Apache Spark: Moving on from Hadoop
• Apache Spark Continues to Spread Beyond Hadoop.
• Escape From Hadoop!
• Spark promises to up-end Hadoop, but in a good way.
2. Surveys
• "Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of 'mid-size' datasets and use cases that call for real-time processing." 2015 Apache Spark Survey by Typesafe, January 27, 2015. http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162.htm
• Apache Spark: Preparing for the Next Wave of Reactive Big Data, January 27, 2015, by Typesafe. http://typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data
Apache Spark Survey 2015 by Typesafe – Quick Snapshot
3. Vendors: Databricks
• Spark and Hadoop: Working Together. January 21, 2014, by Ion Stoica. https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• Uniform API for diverse workloads over diverse storage systems and runtimes. Source: Slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia. http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: Slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop World, Matei Zaharia. http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
3. Vendors: Cloudera
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013. http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
3. Vendors: MapR
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop. https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
3. Vendors: Hortonworks
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop. October 31, 2014. https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30, 2014. http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
4. Analysts: Gartner
• Is Apache Spark replacing Hadoop or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop," there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum. February 25, 2015. http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
4. Analysts: Forrester
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System and an extended group of components that leverage but do not require it."
Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! – It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014. http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly!
2. Surveys: Listen to what Spark developers are saying!
3. Vendors: <Hadoop Vendor>-tinted goggles!? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics!?
II. Big Data, Typical Big Data Stack, Apache Hadoop, Apache Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; is the definition above now inadequate!?
• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS).
2. Typical Big Data Stack
3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial).
• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name/: an incomplete but useful list of Big Data related projects packed into a JSON dataset.
• "Hadoop's Impact on Data Management's Future" - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0
4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing and more:
• BYOS: Bring Your Own Storage.
• BYOC: Bring Your Own Cluster.
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial ones. I'm compiling a list. Stay tuned!
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords!?
2. Typical Big Data Stack: Big Data stacks look similar on paper, don't they!?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data!
4. Apache Spark: Emergence of the Apache Spark ecosystem.
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data! http://wiki.apache.org/hadoop/WordCount
• Pig http://pig.apache.org
• Hive http://hive.apache.org
• Scoobi: A Scala productivity framework for Hadoop. https://github.com/NICTA/scoobi
• Cascading http://www.cascading.org/
• Scalding: A Scala API for Cascading. http://twitter.com/scalding
• Crunch http://crunch.apache.org
• Scrunch http://crunch.apache.org/scrunch.html
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink:

MapReduce (1st generation): Batch
Tez (2nd generation): Batch, Interactive
Spark (3rd generation): Batch, Interactive, Near-Real-Time
Flink (4th generation): Batch, Interactive, Real-Time, Iterative
1. Evolution: Hadoop MapReduce
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
1. Evolution: Apache Tez
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org/
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
1. Evolution: Apache Spark
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine".
• Apache Flink http://flink.apache.org/ offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing Model | On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive | In-Memory, On-Disk, Batch, Interactive, Streaming (Near Real-Time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java, [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
Hadoop MapReduce vs. Tez vs. Spark (continued)

Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of Use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
Hadoop MapReduce vs. Tez vs. Spark (continued)

Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN*, SIMR, Mesos, …]
Performance | – | – | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy

* Partial support
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark, from Java or Scala (see the sketch below).
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark. http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
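A minimal sketch of point 1, assuming a hypothetical word-count job whose per-record mapper and reducer logic is factored into plain functions:

```scala
// The body of a former Hadoop Mapper.map(): one line in, (word, 1) pairs out.
def extractWords(line: String): Seq[(String, Int)] =
  line.split("\\s+").map(word => (word, 1)).toSeq

// The body of a former Hadoop Reducer.reduce(): sum the counts per key.
def sum(a: Int, b: Int): Int = a + b

// The same logic, called from Spark transformations (sc is a SparkContext;
// the input and output paths are assumptions).
val counts = sc.textFile("hdfs:///input/logs")
  .flatMap(extractWords)
  .reduceByKey(sum)
counts.saveAsTextFile("hdfs:///output/wordcounts")
```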
2. Transition (continued)
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark: Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open). https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015. https://issues.apache.org/jira/browse/HIVE-7292
Hive on Spark (continued)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Getting+Started
• Hive on Spark. February 11, 2015, Szehon Ho, Cloudera. http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015. http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (a.k.a. SQL-to-Hadoop) was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion. https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
Cascading on Spark (Expected in Cascading 3.1 release)
• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
Apache Crunch on Spark
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 ships with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
Mahout on Spark (Expected in Mahout 1.0)
• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark: programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
Mahout on Spark (continued)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings. Dmitriy Lyubimov, April 2014. http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark. Published on May 30, 2014. http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
3. Integration
Spark integrates with open source tools across the layers of the Hadoop stack:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
The following slides work through these, tool by tool.
3. Integration: HDFS
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
3. Integration: HBase
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code (see the sketch below). https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
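A minimal sketch in the spirit of HBaseTest.scala, reading a hypothetical `webtable` table through the standard Hadoop InputFormat (sc is an existing SparkContext):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "webtable")
// Spark reads HBase through TableInputFormat; no extra connector required.
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println("Rows in webtable: " + hBaseRDD.count())
```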
3. Integration: Cassandra
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications (see the sketch after the next slide). It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
3. Integration: Cassandra (continued)
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1). http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
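A minimal sketch with the DataStax spark-cassandra-connector (the `test` keyspace and the `kv(key TEXT, value INT)` table are assumptions):

```scala
import com.datastax.spark.connector._

// Expose a Cassandra table as an RDD and aggregate over it.
val rdd = sc.cassandraTable("test", "kv")
println("Sum of values: " + rdd.map(_.getInt("value")).fold(0)(_ + _))

// Write an RDD of tuples back to the same table.
sc.parallelize(Seq(("key3", 3), ("key4", 4)))
  .saveToCassandra("test", "kv", SomeColumns("key", "value"))
```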
3. Integration: MongoDB
• MongoDB is not directly served by Spark, although it can be used from Spark via the official mongo-hadoop connector (see the sketch after the next slide).
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights. http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files. https://github.com/mongodb/mongo-hadoop
3. Integration: MongoDB (continued)
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
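A minimal sketch with the official mongo-hadoop connector (the connection URI and collection name are assumptions):

```scala
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

val mongoConf = new Configuration()
mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection")
// Read a MongoDB collection as an RDD of (id, document) pairs.
val documents = sc.newAPIHadoopRDD(mongoConf, classOf[MongoInputFormat],
  classOf[Object], classOf[BSONObject])
println("Documents read: " + documents.count())
```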
3. Integration: Neo4j
• Neo4j is a highly scalable, robust (fully ACID), native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose. By Kenny Bastani, March 10, 2015. http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark. By Kenny Bastani, January 19, 2015. http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics. By Kenny Bastani, November 3, 2014. http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
3. Integration: YARN
• YARN: Yet Another Resource Negotiator, an implicit reference to Mesos as the resource negotiator!
• Integration is still improving. https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC%0A
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
3. Integration: Hive
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883). https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib (see the sketch below).
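A minimal sketch of querying a Hive table from Spark SQL (Spark 1.2-era API; the `logs` table is an assumption):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext reads table definitions from the Hive metastore.
val hiveCtx = new HiveContext(sc)
// Run SQL over an existing Hive table and pull the results back.
val levels = hiveCtx.sql("SELECT level, COUNT(*) AS cnt FROM logs GROUP BY level")
levels.collect().foreach(println)
```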
3. Integration: Drill
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark. Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs. Use BI tools to query in-memory data in Spark. Embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill? http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
3. Integration: Kafka
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org/
• Spark Streaming integrates natively with Kafka (see the sketch below): Spark Streaming + Kafka Integration Guide. http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game. http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
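A minimal sketch of the receiver-based Kafka integration (Spark 1.2-era API; the ZooKeeper address, consumer group and `events` topic are assumptions):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
// One receiver consuming the "events" topic via the ZooKeeper quorum.
val stream = KafkaUtils.createStream(
  ssc, "zkhost:2181", "spark-consumer-group", Map("events" -> 1))
// Each element is a (key, message) pair; count the messages per batch.
stream.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()
```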
3. Integration: Flume
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org/
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach (see the sketch below)
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
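A minimal sketch of Approach 1 (push-based), where a Flume Avro sink pushes events to the host and port Spark Streaming listens on (both are assumptions):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val ssc = new StreamingContext(sc, Seconds(5))
// Spark Streaming acts as the Avro receiver for the Flume sink.
val flumeStream = FlumeUtils.createStream(ssc, "worker1.example.com", 9999)
flumeStream.count().map(c => s"Received $c Flume events").print()
ssc.start()
ssc.awaitTermination()
```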
3. Integration: JSON
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query (see the sketch below). Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL. February 2, 2015. http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
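A minimal sketch of the schema-inference workflow (Spark 1.2-era SchemaRDD API; `people.json`, a file of one JSON record per line, is an assumption):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Point Spark SQL at the JSON file; the schema is inferred automatically.
val people = sqlContext.jsonFile("people.json")
people.printSchema()
// Register and query with plain SQL; no DDL needed.
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 21")
  .collect().foreach(println)
```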
3. Integration: Parquet
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org/
• Built-in support in Spark SQL allows you to (see the sketch below):
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
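A minimal sketch of the Parquet round trip (Spark 1.2-era API; the file paths are assumptions):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("people.json")
// Write the data out in the columnar Parquet format.
people.saveAsParquetFile("people.parquet")
// Read it back and query it through a temporary table.
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT COUNT(*) FROM parquet_people").collect().foreach(println)
```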
3. Integration: Avro
• The Spark SQL Avro library is for querying Avro data with Spark SQL. This library requires Spark 1.2+ (see the sketch below). https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
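A minimal sketch with the spark-avro library (assuming its early avroFile helper; `episodes.avro` is a hypothetical input file):

```scala
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._   // adds avroFile() to SQLContext

val sqlContext = new SQLContext(sc)
// Load an Avro file and query it like any other Spark SQL table.
val episodes = sqlContext.avroFile("episodes.avro")
episodes.registerTempTable("episodes")
sqlContext.sql("SELECT title FROM episodes").collect().foreach(println)
```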
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
3. Integration: Elasticsearch
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark. https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents (see the sketch below). https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
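A minimal sketch of elasticsearch-hadoop's native Spark support (the index/type name and the documents are assumptions):

```scala
import org.elasticsearch.spark._

// Index an RDD of documents into Elasticsearch.
val docs = Seq(Map("user" -> "alice", "message" -> "hello"),
               Map("user" -> "bob",   "message" -> "world"))
sc.makeRDD(docs).saveToEs("spark/docs")

// Read the index back as an RDD of (id, document) pairs.
val fromEs = sc.esRDD("spark/docs")
println("Indexed documents: " + fromEs.count())
```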
3. Integration: Solr
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
3. Integration: HUE
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
4. Complementarity: HDFS + Tachyon + Spark
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark. October 14, 2014. http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack. January 2015. http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together: each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
4. Complementarity: YARN + Mesos (continued)
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster. https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management. http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
4. Complementarity: Spark + Tez (continued)
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine. By Peter Voss, September 30, 2014. http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
4. Complementarity (continued)
• Operating in a Multi-execution Engine Hadoop Environment. By Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group. http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption. February 12, 2015. http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms. February 23, 2015. http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop. March 9, 2015. http://blog.syncsort.com/2015/03/framework-future-hadoop/
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases!
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all!
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
1. File System
Spark does not require HDFS, the Hadoop Distributed File System! Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon. http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark (see the sketch below):
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
• OpenStack Swift (Object Store): https://spark.apache.org/docs/latest/storage-openstack-swift.html and https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
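A minimal sketch of the same RDD API running against different storage URIs (bucket names, hosts and paths are assumptions):

```scala
// The storage backend is selected by the URI scheme, not by the code.
val fromS3      = sc.textFile("s3n://my-bucket/logs/")           // Amazon S3
val fromLocal   = sc.textFile("file:///tmp/logs/")               // local file system
val fromTachyon = sc.textFile("tachyon://master:19998/logs/")    // Tachyon
println(fromS3.count() + fromLocal.count() + fromTachyon.count())
```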
1. File System (continued)
Coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS. July 11, 2012. https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage. March 9, 2015. http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support). http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible (see the sketch after this list):
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
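A minimal sketch showing that only the master URL changes across deployment modes (host names and ports are assumptions; Spark 1.x syntax):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("SameAppAnyCluster")
  .setMaster("local[4]")              // local mode, 4 threads
//.setMaster("spark://master:7077")   // standalone cluster
//.setMaster("mesos://master:5050")   // Apache Mesos
//.setMaster("yarn-client")           // Hadoop YARN (Spark 1.x syntax)
val sc = new SparkContext(conf)
```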
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
3. Distributions
• Using Spark on a non-Hadoop distribution:
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data, to insights and data products in an instant! March 4, 2015. https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014. July 2, 2014. https://www.youtube.com/watch?v=dJQ5lV5Tldw
DSE: DataStax Enterprise
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra. Piotr Kolaczkowski, September 26, 2014. http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop: with Apache Spark and Cassandra with the Spark Cassandra Connector. Helena Edelson, published on November 24, 2014. http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as the complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
xPatterns
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
BlueData
• The BlueData (http://www.bluedata.com/) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
Guavus
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos. September 25, 2014, by Eric Carr. http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
4. Alternatives

Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
Tachyon
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible. Existing Spark and MapReduce programs can run on top of it without any code change (see the sketch below).
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software/
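A minimal sketch of keeping RDD blocks in Tachyon, outside the Spark application's JVM (Spark 1.x semantics of OFF_HEAP; the host and paths are assumptions):

```scala
import org.apache.spark.storage.StorageLevel

// Read from and persist into Tachyon, outside the executors' heap.
val logs = sc.textFile("tachyon://master:19998/data/logs")
logs.persist(StorageLevel.OFF_HEAP)
println("Lines: " + logs.count())
logs.saveAsTextFile("tachyon://master:19998/data/logs-out")
```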
Mesos
• Mesos (http://mesos.apache.org/) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
YARN vs. Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
Spark Native API
• The Spark native API is available in Scala, Java and Python.
• Interactive shells are available in Scala and Python (see the sketch below).
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup. May 28, 2014. http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
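A minimal word count in the interactive shell (spark-shell provides `sc`; the input file is an assumption):

```scala
val lines = sc.textFile("README.md")
// Classic word count in three chained transformations.
val wordCounts = lines.flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)
wordCounts.take(5).foreach(println)
```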
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
GraphX
• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component
Notebooks
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging!
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksSlim Baltagi
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksSlim Baltagi
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
Hadoop as a service presented by Ajay Jha at Houston Hadoop Meetup
Hadoop as a service presented by Ajay Jha at Houston Hadoop MeetupHadoop as a service presented by Ajay Jha at Houston Hadoop Meetup
Hadoop as a service presented by Ajay Jha at Houston Hadoop MeetupMark Kerzner
 
Hadoop Hadoop & Spark meetup - Altiscale
Hadoop Hadoop & Spark meetup - AltiscaleHadoop Hadoop & Spark meetup - Altiscale
Hadoop Hadoop & Spark meetup - AltiscaleMark Kerzner
 
Why contribute to open source projects
Why contribute to open source projectsWhy contribute to open source projects
Why contribute to open source projectsKranti Parisa
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
The DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to ProductionThe DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to ProductionDataWorks Summit/Hadoop Summit
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Edureka!
 

Similar to Spark or Hadoop: Is it an either-or proposition (20)

Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
JEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkJEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache Spark
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
 
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Hadoop as a service presented by Ajay Jha at Houston Hadoop Meetup
Hadoop as a service presented by Ajay Jha at Houston Hadoop MeetupHadoop as a service presented by Ajay Jha at Houston Hadoop Meetup
Hadoop as a service presented by Ajay Jha at Houston Hadoop Meetup
 
Hadoop Hadoop & Spark meetup - Altiscale
Hadoop Hadoop & Spark meetup - AltiscaleHadoop Hadoop & Spark meetup - Altiscale
Hadoop Hadoop & Spark meetup - Altiscale
 
Why contribute to open source projects
Why contribute to open source projectsWhy contribute to open source projects
Why contribute to open source projects
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
The DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to ProductionThe DAP - Where YARN, HBase, Kafka and Spark go to Production
The DAP - Where YARN, HBase, Kafka and Spark go to Production
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
View on big data technologies
View on big data technologiesView on big data technologies
View on big data technologies
 
Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?
 

More from Slim Baltagi

How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?Slim Baltagi
 
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiModern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiSlim Baltagi
 
Modern big data and machine learning in the era of cloud, docker and kubernetes
Modern big data and machine learning in the era of cloud, docker and kubernetesModern big data and machine learning in the era of cloud, docker and kubernetes
Modern big data and machine learning in the era of cloud, docker and kubernetesSlim Baltagi
 
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision TreeApache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision TreeSlim Baltagi
 
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
Apache Fink 1.0: A New Era  for Real-World Streaming AnalyticsApache Fink 1.0: A New Era  for Real-World Streaming Analytics
Apache Fink 1.0: A New Era for Real-World Streaming AnalyticsSlim Baltagi
 
Apache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiApache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiSlim Baltagi
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Slim Baltagi
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuSlim Baltagi
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkSlim Baltagi
 
Big Data at CME Group: Challenges and Opportunities
Big Data at CME Group: Challenges and Opportunities Big Data at CME Group: Challenges and Opportunities
Big Data at CME Group: Challenges and Opportunities Slim Baltagi
 
A Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to FinanceA Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to FinanceSlim Baltagi
 

More from Slim Baltagi (13)

How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?
 
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiModern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
 
Modern big data and machine learning in the era of cloud, docker and kubernetes
Modern big data and machine learning in the era of cloud, docker and kubernetesModern big data and machine learning in the era of cloud, docker and kubernetes
Modern big data and machine learning in the era of cloud, docker and kubernetes
 
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision TreeApache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
 
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
Apache Fink 1.0: A New Era  for Real-World Streaming AnalyticsApache Fink 1.0: A New Era  for Real-World Streaming Analytics
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
 
Apache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiApache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim Baltagi
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache Flink
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
 
Big Data at CME Group: Challenges and Opportunities
Big Data at CME Group: Challenges and Opportunities Big Data at CME Group: Challenges and Opportunities
Big Data at CME Group: Challenges and Opportunities
 
A Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to FinanceA Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to Finance
 

Recently uploaded

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 

Recently uploaded (20)

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 

Spark or Hadoop: Is it an either-or proposition?

  • 10. 3. Vendors • “Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance.” https://www.mapr.com/products/apache-spark • MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop 10
  • 11. 3. Vendors • “Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing.” http://hortonworks.com/hadoop/spark/ • Hortonworks: A shared vision for Apache Spark on Hadoop. October 21, 2014 https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html • “At Hortonworks, we love Spark and want to help our customers leverage all its benefits.” October 30th, 2014 http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/ 11
  • 12. 4. Analysts • Is Apache Spark replacing Hadoop or complementing existing Hadoop practice? • Both are already happening: • With uncertainty about “what is Hadoop” there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures. • At the same time, Hadoop distributions are all embracing Spark and including it in their offerings. Source: Hadoop Questions from Recent Webinar Span Spectrum. February 25, 2015. http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/ 12
  • 13. 4. Analysts • “After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014. • For those that have day jobs that don’t include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework. • We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System and an extended group of components that leverage but do not require it.” Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! – It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014 http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop 13
  • 14. 5. Key Takeaways 1. News: Big Data is no longer a Hadoop monopoly! 2. Surveys: Listen to what Spark developers are saying! 3. Vendors: <Hadoop Vendor>-tinted goggles!? FUD is still being ‘offered’ by some Hadoop vendors. Claims need to be contextualized. 4. Analysts: Thorough understanding of the market dynamics!? 14
  • 15. II. Big Data, Typical Big Data Stack, Hadoop, Spark 1. Big Data 2. Typical Big Data Stack 3. Apache Hadoop 4. Apache Spark 5. Key Takeaways 15
  • 16. 1. Big Data • Big Data is still one of the most inflated buzzwords of recent years. • Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data • Hadoop is becoming a traditional tool. The above definition is inadequate!? • “Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand.” Amir H. Payberah, Swedish Institute of Computer Science (SICS). 16
  • 17. 2. Typical Big Data Stack 17
  • 18. 3. Apache Hadoop • Apache Hadoop as an example of a Typical Big Data Stack. • Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial). • Big Data Ecosystem Dataset http://bigdata.andreamostosi.name/ An incomplete but useful list of Big Data related projects packed into a JSON dataset. • "Hadoop's Impact on Data Management's Future" - Amr Awadallah (Strata + Hadoop 2015). February 19, 2015: Watch the video at 2:36 on ‘Hadoop Isn’t Just Hadoop Anymore’ for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0 18
  • 19. 4. Apache Spark • Apache Spark as an example of a Typical Big Data Stack. • Apache Spark provides you Big Data computing and more: • BYOS: Bring Your Own Storage. • BYOC: Bring Your Own Cluster. • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql • MLlib (Machine Learning) http://sparkbigdata.com/component/tags/tag/5-mllib • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx • The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I’m compiling a list. Stay tuned! 19
  • 20. 5. Key Takeaways 1. Big Data: Still one of the most inflated buzzwords!? 2. Typical Big Data Stack: Big Data Stacks look similar on paper. Don’t they!? 3. Apache Hadoop: Hadoop is no longer ‘synonymous’ with Big Data! 4. Apache Spark: Emergence of the Apache Spark ecosystem. 20
  • 21. III. Spark with Hadoop 1. Evolution 2. Transition 3. Integration 4. Complementarity 5. Key Takeaways 21
  • 22. 1. Evolution of Programming APIs • MapReduce in Java is like the assembly code of Big Data! http://wiki.apache.org/hadoop/WordCount • Pig http://pig.apache.org • Hive http://hive.apache.org • Scoobi: A Scala productivity framework for Hadoop https://github.com/NICTA/scoobi • Cascading http://www.cascading.org/ • Scalding: A Scala API for Cascading http://twitter.com/scalding • Crunch http://crunch.apache.org • Scrunch http://crunch.apache.org/scrunch.html 22
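To make the verbosity gap concrete: the WordCount linked above needs a Mapper class, a Reducer class and a driver in the Java MapReduce API, while the same logic in Spark's Scala API is a handful of transformations. A minimal sketch (the input and output paths are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
        // Read text, split lines into words, and sum a count of 1 per word.
        val counts = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs:///data/wordcounts")      // hypothetical path
        sc.stop()
      }
    }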
  • 23. 1. Evolution of Compute Models When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (Execution Engine) on Hadoop. Now we have, in addition to MapReduce v2, Tez, Spark and Flink: • 1st Generation (MapReduce): Batch • 2nd Generation (Tez): Batch, Interactive • 3rd Generation (Spark): Batch, Interactive, Near-Real time • 4th Generation (Flink): Batch, Interactive, Real-Time, Iterative 23
  • 24. 1. Evolution: • This is how Hadoop MapReduce is branding itself: “A YARN-based system for parallel processing of large data sets.” http://hadoop.apache.org • Batch, Scalability, Abstractions (see slide on evolution of Programming APIs), User Defined Functions (UDFs)… • Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job. • There is a need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning and Graph Analytics. 24
  • 25. 1. Evolution: • Tez: Hindi for “speed” • This is how Apache Tez is branding itself: “The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN.” Source: http://tez.apache.org/ • Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. 25
  • 26. 1. Evolution: • ‘Spark’ for lightning fast speed. • This is how Apache Spark is branding itself: “Apache Spark™ is a fast and general engine for large-scale data processing.” https://spark.apache.org • Apache Spark is a general purpose cluster computing framework whose execution model supports a wide variety of use cases: batch, interactive, near-real time. • The rapid in-memory processing of resilient distributed datasets (RDDs) is the “core capability” of Apache Spark. 26
  • 27. 1. Evolution: Apache Flink • Flink: German for “nimble, swift, speedy” • This is how Apache Flink is branding itself: “Fast and reliable large-scale data processing engine” • Apache Flink http://flink.apache.org/ offers: • Batch and Streaming in the same system • Beyond DAGs (Cyclic operator graphs) • Powerful, expressive APIs • Inside-the-system iterations • Full Hadoop compatibility • Automatic, language independent optimizer • ‘Flink’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/27-flink 27
  • 28. Hadoop MapReduce vs. Tez vs. Spark
    Criteria            | Hadoop MapReduce                            | Tez                                  | Spark
    License             | Open Source, Apache 2.0, version 2.x        | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
    Processing Model    | On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive          | In-Memory, On-Disk, Batch, Interactive, Streaming (Near Real-Time)
    Language written in | Java                                        | Java                                 | Scala
    API                 | [Java, Python, Scala], user-facing          | Java, [ISV/Engine/Tool builder]      | [Scala, Java, Python], user-facing
    Libraries           | None, separate tools                        | None                                 | Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX
    28
  • 29. Hadoop MapReduce vs. Tez vs. Spark
    Criteria         | Hadoop MapReduce                                                               | Tez                                                         | Spark
    Installation     | Bound to Hadoop                                                                | Bound to Hadoop                                             | Isn’t bound to Hadoop
    Ease of Use      | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
    Compatibility    | Same data types and data sources                                               | Same data types and data sources                            | Same data types and data sources
    YARN integration | YARN application                                                               | Ground-up YARN application                                  | Spark is moving towards YARN
    29
  • 30. Hadoop MapReduce vs. Tez vs. Spark
    Criteria    | Hadoop MapReduce           | Tez                        | Spark
    Deployment  | YARN                       | YARN                       | Standalone, YARN*, SIMR, Mesos, …
    Performance | -                          | -                          | Good performance when data fits into memory; performance degradation otherwise
    Security    | More features and projects | More features and projects | Still in its infancy
    * Partial support
    30
  • 31. III. Spark with Hadoop 1. Evolution 2. Transition 3. Integration 4. Complementarity 5. Key Takeaways 31
  • 32. 2. Transition • Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine: 1. You can often reuse your mapper and reducer functions and just call them in Spark, from Java or Scala. 2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/ 32
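As a hedged illustration of point 1: per-record logic lifted out of an old Mapper can usually be invoked from a Spark transformation as a plain function, with reduceByKey standing in for the shuffle-and-reduce phase. A sketch under those assumptions (extractPairs and the paths are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object MigratedJob {
      // Hypothetical helper: the body of a legacy Mapper.map(), minus the Context plumbing.
      def extractPairs(line: String): Seq[(String, Int)] =
        line.split(",").map(field => (field.trim, 1)).toSeq

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MRToSpark"))
        val counts = sc.textFile("hdfs:///data/events")   // hypothetical input
          .flatMap(extractPairs)   // reuse the map-side logic unchanged
          .reduceByKey(_ + _)      // plays the role of the Reducer
        counts.saveAsTextFile("hdfs:///data/out")          // hypothetical output
        sc.stop()
      }
    }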
  • 33. 2. Transition 3. The following tools originally based on Hadoop MapReduce are being ported to Apache Spark: • Pig, Hive, Sqoop, Cascading, Crunch, Mahout, … 33
  • 34.  Pig on Spark (Spork) • Run Pig with the “-x spark” option for an easy migration without development effort. • Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan). • Leverage new Spark-specific operators in Pig such as Cache. • Still leverage many existing Pig UDF libraries. • Pig on Spark Umbrella Jira (Status: Passed end-to-end test cases on Pig, still Open) https://issues.apache.org/jira/browse/PIG-4059 • Fix outstanding issues and address additional Spark functionality through the community • ‘Pig on Spark’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/19 34
  • 35.  Hive on Spark (Currently in Beta, Expected in Hive 1.1.0) • New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark; • Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort. • Exposes Spark users to a viable, feature-rich de facto standard SQL tool on Hadoop. • Performance benefits, especially for Hive queries involving multiple reducer stages. • Hive on Spark Umbrella Jira (Status: Open). Q1 2015 https://issues.apache.org/jira/browse/HIVE-7292 35
  • 36. Hive on Spark (Currently in Beta, Expected in Hive 1.1.0) • Design http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/ • Getting Started https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Getting+Started • Hive on Spark, February 11, 2015, Szehon Ho, Cloudera http://www.slideshare.net/trihug/trihug-feb-hive-on-spark • Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostafa Mokhtar (Hortonworks). February 20, 2015. http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final • ‘Hive on Spark’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/12 36
  • 37.  Sqoop on Spark (Expected in Sqoop 2) • Sqoop (a.k.a. SQL-to-Hadoop) was initially developed as a tool to transfer data from RDBMS to Hadoop. • The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources. • The Sqoop 2 Proposal is still under discussion. https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal • Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532 37
  • 38. Cascading (Expected in 3.1 release) • Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop. • Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/ • Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/ 38
  • 39. Apache Crunch • The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org • Apache Crunch 0.11 ships with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html • Running Crunch with Spark http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html 39
  • 40. Mahout (Expected in Mahout 1.0) • Mahout News: 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org • Integration of Mahout and Spark: • Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark: Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark. • Mahout Interactive Shell: Interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html 40
  • 41. Mahout (Expected in Mahout 1.0) • Playing with Mahout's Spark Shell https://mahout.apache.org/users/sparkbindings/play-with-shell.html • Mahout Scala and Spark bindings. Dmitriy Lyubimov, April 2014 http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings • Co-occurrence Based Recommendations with Mahout, Scala and Spark. Published on May 30, 2014 http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark • Mahout 1.0 Features by Engine (unreleased) - MapReduce, Spark, H2O, Flink http://mahout.apache.org/users/basics/algorithms.html 41
  • 42. III. Spark with Hadoop 1. Evolution 2. Transition 3. Integration 4. Complementarity 5. Key Takeaways 42
  • 43. 3. Integration [Diagram: open source tools mapped to each service category] • Storage/Serving Layer • Data Formats • Data Ingestion Services • Resource Management • Search • SQL 43
  • 44. 3. Integration: • Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon’s S3. • Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory cached data. https://issues.apache.org/jira/browse/SPARK-1767 • Use DDM: Discardable Distributed Memory http://hortonworks.com/blog/ddm/ to store RDDs in memory. This allows many Spark applications to share RDDs since they are now resident outside the address space of the application. Related HDFS-5851 is planned for Hadoop 3.0 https://issues.apache.org/jira/browse/HDFS-5851 44
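Because HDFS is exposed through the standard Hadoop API, reading and writing it from Spark needs no extra plumbing. A minimal sketch (the namenode address and paths are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object HdfsRoundTrip {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HdfsRoundTrip"))
        // Any Hadoop-supported URI works here: hdfs://, file://, s3n://, ...
        val logs = sc.textFile("hdfs://namenode:8020/logs/2015/03/12/*")   // hypothetical
        val errors = logs.filter(_.contains("ERROR"))
        errors.saveAsTextFile("hdfs://namenode:8020/out/errors")           // hypothetical
        sc.stop()
      }
    }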
  • 45. 3. Integration: • Out of the box, Spark can interface with HBase as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala • There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector • SparkOnHBase is a project for HBase integration with Spark. Status: Still in experimentation and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ 45
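A hedged sketch of the out-of-the-box route, following the newAPIHadoopRDD pattern of the HBaseTest.scala example linked above (the table name is hypothetical):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object HBaseRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HBaseRead"))
        val conf = HBaseConfiguration.create()
        conf.set(TableInputFormat.INPUT_TABLE, "web_pages")   // hypothetical table
        // Each record arrives as a (row key, Result) pair from HBase.
        val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
          classOf[ImmutableBytesWritable], classOf[Result])
        println(s"Rows in table: ${hBaseRDD.count()}")
        sc.stop()
      }
    }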
  • 46. 3. Integration: • Spark Cassandra Connector This library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector • Spark + Cassandra using Deep: The integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/ • Getting Started with Apache Spark and Cassandra http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/ • ‘Cassandra’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/20-cassandra 46
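A minimal sketch with the Spark Cassandra Connector (the keyspace, tables and column names are hypothetical):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    object CassandraRoundTrip {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("CassandraRoundTrip")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)
        // Expose a Cassandra table as an RDD of rows.
        val words = sc.cassandraTable("test_ks", "words")   // hypothetical keyspace/table
        println(words.first())
        // Write a plain RDD back to another table.
        sc.parallelize(Seq(("spark", 10), ("hadoop", 7)))
          .saveToCassandra("test_ks", "word_counts", SomeColumns("word", "count"))
        sc.stop()
      }
    }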
  • 47. 3. Integration: • Benchmark of Spark & Cassandra integration using different approaches. http://www.stratio.com/deep-vs-datastax/ • Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandra. http://tuplejump.github.io/calliope/ • The Cassandra storage backend with Spark is opening many new avenues. • Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/ 47
  • 48. 3. Integration: • MongoDB is not directly supported by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. • MongoDB-Spark Demo https://github.com/crcsmnky/mongodb-spark-demo • MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights • Spark SQL also provides indirect support via its support for reading and writing JSON text files. https://github.com/mongodb/mongo-hadoop 48
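A hedged sketch of the Mongo-Hadoop route, which reuses the same newAPIHadoopRDD mechanism shown earlier for HBase (the MongoDB URI, database and collection are hypothetical):

    import com.mongodb.hadoop.MongoInputFormat
    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.{SparkConf, SparkContext}
    import org.bson.BSONObject

    object MongoRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MongoRead"))
        val mongoConf = new Configuration()
        // Hypothetical URI: database "shop", collection "orders".
        mongoConf.set("mongo.input.uri", "mongodb://127.0.0.1:27017/shop.orders")
        // MongoInputFormat yields (document id, BSON document) pairs.
        val orders = sc.newAPIHadoopRDD(mongoConf, classOf[MongoInputFormat],
          classOf[Object], classOf[BSONObject])
        println(s"Order documents: ${orders.count()}")
        sc.stop()
      }
    }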
  • 49. 3. Integration: • There is also NSMC: Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). • GitHub https://github.com/spirom/spark-mongodb-connector • Using MongoDB with Hadoop & Spark: • https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup PART 1 • http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example PART 2 • http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3 • Interesting blog on using Spark with MongoDB without Hadoop http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html 49
  • 50. 3. Integration: • Neo4j is a highly scalable, robust (fully ACID), native graph database. • Getting Started with Apache Spark and Neo4j Using Docker Compose. By Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html • Categorical PageRank Using Neo4j and Apache Spark. By Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html • Using Apache Spark and Neo4j for Big Data Graph Analytics. By Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html 50
  • 51. 3. Integration: YARN • YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator!). • Integration is still improving. https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC%0A • Some issues are critical ones. • Running Spark on YARN http://spark.apache.org/docs/latest/running-on-yarn.html • Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU 51
  • 52. 3. Integration: • Spark SQL provides built-in support for Hive tables: • Import relational data from Hive tables • Run SQL queries over imported data • Easily write RDDs out to Hive tables • Hive 0.13 is supported in Spark 1.2.0. • Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883 • Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib. 52
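A minimal sketch of this Hive support in the Spark 1.2-era API (the table name and query are hypothetical; HiveContext picks up hive-site.xml from the classpath):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveQuery {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HiveQuery"))
        val hiveCtx = new HiveContext(sc)
        // Run HiveQL over an existing Hive table (hypothetical name).
        val top = hiveCtx.sql(
          "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10")
        top.collect().foreach(println)
        sc.stop()
      }
    }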
  • 53. 3. Integration: • Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org • Drill and Spark integration is work in progress in 2015 to address new use cases: • Use a Drill query (or view) as the input to Spark. Drill extracts and pre-processes data from various data sources and turns it into input to Spark. • Use Drill to query Spark RDDs. Use BI tools to query in-memory data in Spark. Embed Drill execution in a Spark data pipeline. Source: What's Coming in 2015 for Drill? http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/ 53
  • 54. 3. Integration: • Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org/ • Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html • Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/ • ‘Kafka’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/24-kafka 54
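A minimal sketch of the native receiver-based integration (the ZooKeeper quorum, consumer group and topic are hypothetical):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("KafkaWordCount")
        val ssc = new StreamingContext(conf, Seconds(10))
        // Subscribe to topic "events" with one receiver thread (hypothetical names).
        val lines = KafkaUtils.createStream(ssc, "zk1:2181", "demo-group", Map("events" -> 1))
          .map(_._2)   // drop the Kafka message key, keep the payload
        lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
        ssc.start()
        ssc.awaitTermination()
      }
    }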
  • 55. 3. Integration: • Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org/ • Spark Streaming integrates natively with Flume. There are two approaches to this: • Approach 1: Flume-style Push-based Approach • Approach 2 (Experimental): Pull-based Approach using a Custom Sink. • Spark Streaming + Flume Integration Guide https://spark.apache.org/docs/latest/streaming-flume-integration.html 55
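For Approach 1, Spark Streaming opens an Avro listener that a Flume agent's sink pushes events to. A minimal sketch (the host and port are hypothetical and must match the Flume Avro sink configuration):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    object FlumeEventCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("FlumeEventCount")
        val ssc = new StreamingContext(conf, Seconds(5))
        // Listen on a worker's host:port; the Flume Avro sink must point here.
        val stream = FlumeUtils.createStream(ssc, "worker-1", 4141)   // hypothetical
        stream.count().map(c => s"Received $c flume events.").print()
        ssc.start()
        ssc.awaitTermination()
      }
    }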
  • 56. 3. Integration: • Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data. • Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL. Just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame. • An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html 56
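A minimal sketch in the Spark 1.2-era API (the file path and fields are hypothetical; from 1.3 the same steps exist on DataFrames):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object JsonQuery {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("JsonQuery"))
        val sqlCtx = new SQLContext(sc)
        // The schema is inferred automatically from the JSON records.
        val people = sqlCtx.jsonFile("hdfs:///data/people.json")   // hypothetical path
        people.printSchema()
        people.registerTempTable("people")
        sqlCtx.sql("SELECT name FROM people WHERE age >= 21").collect().foreach(println)
        sc.stop()
      }
    }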
  • 57. 3. Integration: • Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org/ • Built-in support in Spark SQL allows you to: • Import relational data from Parquet files • Run SQL queries over imported data • Easily write RDDs out to Parquet files http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files • This is an illustrating example of the integration of Parquet and Spark SQL http://www.infoobjects.com/spark-sql-parquet/ 57
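A minimal sketch of the built-in Parquet support, again in the Spark 1.2-era API (the paths are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object ParquetRoundTrip {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ParquetRoundTrip"))
        val sqlCtx = new SQLContext(sc)
        val people = sqlCtx.jsonFile("hdfs:///data/people.json")   // hypothetical source
        people.saveAsParquetFile("hdfs:///data/people.parquet")    // write a columnar copy
        // Read it back; the schema is preserved in the Parquet footer.
        val parquetPeople = sqlCtx.parquetFile("hdfs:///data/people.parquet")
        parquetPeople.registerTempTable("people_pq")
        sqlCtx.sql("SELECT COUNT(*) FROM people_pq").collect().foreach(println)
        sc.stop()
      }
    }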
  • 58. 3. Integration: • Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro • This is an example of using Avro and Parquet in Spark SQL. http://www.infoobjects.com/spark-with-avro/ • Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015 • Problem: various inbound data sets; data layout can change without notice; new data sets can be added without notice. • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format. 58
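A hedged sketch of loading Avro data for SQL queries, following the early spark-avro README's import-based avroFile entry point (assumes the spark-avro package is on the classpath; the path is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import com.databricks.spark.avro._   // adds avroFile(...) to SQLContext

    object AvroQuery {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("AvroQuery"))
        val sqlCtx = new SQLContext(sc)
        // Load an Avro file as a queryable SchemaRDD (hypothetical path).
        val events = sqlCtx.avroFile("hdfs:///data/events.avro")
        events.registerTempTable("events")
        sqlCtx.sql("SELECT COUNT(*) FROM events").collect().foreach(println)
        sc.stop()
      }
    }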
  • 59. 3. Integration: Kite SDK • The Kite SDK provides high level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/ • Spark support has been added to Kite 0.16 release, so Spark jobs can read and write to Kite datasets. • Kite Java Spark Demo https://github.com/kite-sdk/kite-examples/tree/master/spark 59
  • 60. 3. Integration: • Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org • Apache Spark support in Elasticsearch was added in 2.1. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html • Deep-Spark also provides an integration with Spark. https://github.com/Stratio/deep-spark • elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop • Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch. http://www.intellilink.co.jp/article/column/bigdata-kk02.html 60
  • 61. 3. Integration: • Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: “CrunchIndexerTool on Spark”. • Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines: • Migrate ingestion of HDFS data into Solr from MapReduce to Spark • Update and delete existing documents in Solr at scale • Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed 61
  • 62. 3. Integration: • HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com • A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive. • Demo of Spark Igniter http://vimeo.com/83192197 • Big Data Web applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014 62
  • 63. III. Spark with Hadoop 1. Evolution 2. Transition 3. Integration 4. Complementarity 5. Key Takeaways 63
• 64. 4. Complementarity Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them. [Diagram: the Hadoop ecosystem and the Spark ecosystem side by side] 64
• 65. 4. Complementarity: + + • Tachyon is an in-memory distributed file system. By storing file-system contents in the main memory of all cluster nodes, it achieves higher throughput than traditional disk-based storage systems like HDFS. • The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014 http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark • Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html 65
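A sketch of using Tachyon from Spark, both as a Hadoop-compatible file system and as an off-heap store for RDD blocks (Spark 1.x); the host, port, and paths are hypothetical, and spark.tachyonStore.url must be configured for OFF_HEAP to work:

```scala
import org.apache.spark.storage.StorageLevel

// Read and write Tachyon like any Hadoop-compatible file system.
val lines = sc.textFile("tachyon://tachyonmaster:19998/logs/input.txt")

// Keep the RDD's blocks off-heap in Tachyon instead of on the JVM heap.
lines.persist(StorageLevel.OFF_HEAP)

lines.saveAsTextFile("tachyon://tachyonmaster:19998/logs/output")
```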
• 66. 4. Complementarity: + • Mesos and YARN can work together: each for what it is especially good at, rather than choosing one of the two for Spark deployment. • “Big data developers get the best of YARN’s power for Hadoop-driven workloads, and Mesos’ ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.” • Project Myriad is an open source framework for running YARN on Mesos. • ‘Myriad’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41 66
• 67. 4. Complementarity: + References: • Apache Mesos vs. Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E • Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad • Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management • YARN vs. MESOS: Can’t We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620 67
• 68. 4. Complementarity: + • Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn • Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics or… HDFS caching). • The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling. • Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters). • Tez supports enterprise security. 68
• 69. 4. Complementarity: + • Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better: it is more “stream oriented”, has a more mature shuffle implementation, and integrates more closely with YARN. • Data << RAM: Since Spark can cache parsed data in memory, it can be much better when the data we process is smaller than the cluster’s memory. • Improving Spark for Data Pipelines with Native YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/ • Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU 69
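To see why the Data << RAM case favors Spark: once the parsed data is cached, every subsequent pass skips both the disk read and the parsing. A sketch with hypothetical paths and record layout:

```scala
// Parse once, cache the result in cluster memory.
val parsed = sc.textFile("hdfs:///logs/2015/*")
  .map(_.split('\t'))
  .cache()

// The first action reads from disk and parses; later actions hit the cache.
val errors = parsed.filter(r => r(2) == "ERROR").count()
val logins = parsed.filter(r => r(3) == "login").count()
```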
• 70. 4. Complementarity • Emergence of the ‘Smart Execution Engine’ Layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster. • Matt Schumpert on the Datameer Smart Execution Engine: an interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer. http://www.infoq.com/articles/datameer-smart-execution-engine • The Challenge to Choosing the “Right” Execution Engine. By Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html 70
• 71. 4. Complementarity • Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group. http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf • New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366 • Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html • Framework for the Future of Hadoop, March 9, 2015 http://blog.syncsort.com/2015/03/framework-future-hadoop/ 71
• 72. 5. Key Takeaways 1. Evolution: the evolution of compute models is still ongoing. Keep an eye on the Apache Flink project for true low-latency and iterative use cases! 2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Watch for general availability, and balance risk against opportunity. 3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more is on the way. 4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn’t fit all! 72
  • 73. IV. Spark without Hadoop 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways 73
• 74. 1. File System Spark does not require HDFS, the Hadoop Distributed File System! Your ‘Big Data’ use case might be implemented without HDFS. For example: 1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS). 2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn’t need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html 3. Use an in-memory distributed file system such as Spark’s cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13 4. Use a non-HDFS file system already supported by Spark (see the sketch after this list): • Amazon S3 http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html • MapR-FS https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots • OpenStack Swift (Object Store) https://spark.apache.org/docs/latest/storage-openstack-swift.html https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift 74
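A sketch of reading from one such non-HDFS store, Amazon S3; the bucket name, key prefixes, and credentials are all placeholders:

```scala
// Credentials for the s3n:// scheme (Spark 1.x era); values are placeholders.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// Spark addresses any Hadoop-supported file system by URI -- no HDFS involved.
val s3Logs = sc.textFile("s3n://my-bucket/logs/2015-03-*")
s3Logs.filter(_.contains("ERROR"))
  .saveAsTextFile("s3n://my-bucket/reports/errors")
```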
• 75. 1. File System When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn’t perfect: 8 ways to replace HDFS. July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/ A few HDFS alternatives to choose from include: • Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage. March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/ • Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/ • Quantcast QFS https://www.quantcast.com/engineering/qfs • … 75
  • 76. IV. Spark without Hadoop 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways 76
• 77. 2. Deployment While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible (see the sketch after this list): 1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local 2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone 3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos 4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2 5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr 6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace 7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud 8. HPC Clusters: • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster 77
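Because the cluster manager is chosen per application, the same job can move between these deployments by changing only the master URL; a sketch with hypothetical host names:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The application code is unchanged across deployments; only the master differs.
val conf = new SparkConf().setAppName("deployment-agnostic")
  .setMaster("local[4]")                     // local mode, 4 worker threads
// .setMaster("spark://master:7077")         // standalone cluster
// .setMaster("mesos://mesosmaster:5050")    // Apache Mesos
// .setMaster("yarn-client")                 // YARN, when running with Hadoop
val sc = new SparkContext(conf)
```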
• 78. IV. Spark without Hadoop 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways 78
  • 79. 3. Distributions • Using Spark on a Non-Hadoop distribution: 79
• 80. Cloud • Databricks Cloud is not dependent on Hadoop. It gets its data most commonly from Amazon S3, but also from Redshift and Elastic MapReduce. https://databricks.com/product/databricks-cloud • Databricks Cloud: From raw data, to insights and data products in an instant! March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html • Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw 80
• 81. DSE: • DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html • Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4 • Escape from Hadoop: with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082 81
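A sketch of Spark reading and writing Cassandra through the DataStax Spark Cassandra Connector; the keyspace, table, and columns are hypothetical:

```scala
// Assumes the SparkConf was created with:
//   conf.set("spark.cassandra.connection.host", "cassandrahost")
import com.datastax.spark.connector._

// Expose a Cassandra table as an RDD -- no HDFS anywhere.
val users = sc.cassandraTable("store", "users")
println(users.count())

// Write an RDD back to Cassandra, mapping tuple fields to columns.
sc.parallelize(Seq(("bob", 42)))
  .saveToCassandra("store", "users", SomeColumns("name", "age"))
```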
• 82. • Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com • Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing. http://stratio.github.io/streaming-cep-engine/ • ‘Stratio’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/40 82
• 83. 83 • xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications. • xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems. • ‘xPatterns’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/39
  • 84. 84 • The BlueData (http://www.bluedata.com/) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments. • With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU • ‘BlueData’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/37
  • 85. 85 • Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World’s Largest Telcos. September 25, 2014 by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its- operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html • Guavus operational intelligence platform analyzes streaming data and data at rest. • The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex- platform-now-certified-spark-distribution/ • ‘Guavus’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/38
  • 86. IV. Spark without Hadoop 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways 86
• 87. 4. Alternatives

              Hadoop Ecosystem   Spark Ecosystem
  Component   HDFS               Tachyon
              YARN               Mesos
  Tools       Pig                Spark native API
              Hive               Spark SQL
              Mahout             MLlib
              Storm              Spark Streaming
              Giraph             GraphX
              HUE                Spark Notebook / ISpark

87
• 88. • Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org • Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change. • Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) https://amplab.cs.berkeley.edu/software/ 88
• 89. • Mesos (http://mesos.apache.org/) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs. • Mesos as data center “OS”: • Share the datacenter between multiple cluster computing apps; provide new abstractions and services. • Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, YARN, HDFS… • ‘Mesos’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/16-mesos 89
• 90. YARN vs. Mesos

  Criteria           YARN                        Mesos
  Resource sharing   Yes                         Yes
  Written in         Java                        C++
  Scheduling         Memory only                 CPU and memory
  Running tasks      Unix processes              Linux container groups
  Requests           Specific requests and       More generic, but more coding
                     locality preference         to write frameworks
  Maturity           Less mature                 Relatively more mature

90
• 91. Spark Native API • The Spark native API is available in Scala, Java and Python. • Interactive shells are available in Scala and Python. • Spark supports Java 8 lambda expressions, allowing code nearly as concise as the Scala API. • ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup • ‘Spark Core’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/11-core-spark 91
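For reference, the canonical word count in the native Scala API; the input path is hypothetical:

```scala
val counts = sc.textFile("hdfs:///data/corpus.txt")
  .flatMap(_.split("""\s+"""))   // tokenize on whitespace
  .map(word => (word, 1))
  .reduceByKey(_ + _)            // sum the counts per word

counts.take(10).foreach(println)
```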
• 92. Spark SQL • Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/ • Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore. • Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics. 92
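A sketch of the Hive compatibility path, assuming a reachable Hive metastore and an existing table named web_logs (both hypothetical):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext uses the Hive metastore, data formats, and UDFs directly.
val hiveContext = new HiveContext(sc)
hiveContext.sql(
  "SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url ORDER BY hits DESC LIMIT 10"
).collect().foreach(println)
```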
• 93. Spark MLlib 93 ‘Spark MLlib’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/5-mllib
• 94. Spark Streaming 94 ‘Spark Streaming’ Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• 95. Storm vs. Spark Streaming

  Criteria                      Storm                      Spark Streaming
  Processing model              Record at a time           Mini batches
  Latency                       Sub-second                 Few seconds
  Fault tolerance (every        At least once              Exactly once
  record processed)             (may be duplicates)
  Batch framework integration   Not available              Core Spark API
  Supported languages           Any programming language   Scala, Java, Python

95
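A sketch of the mini-batch model from the table: each 2-second batch arrives as a small RDD and is processed with the core Spark API (the socket source and port are hypothetical):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))   // 2-second mini batches

ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)   // word counts per batch, via the core Spark API
  .print()

ssc.start()
ssc.awaitTermination()
```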
• 96. GraphX 96 ‘GraphX’ Tag at SparkBigData.com http://sparkbigdata.com/component
• 97. Notebook 97 • Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support. • Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook • ISpark is an Apache Spark shell backend for IPython. https://github.com/tribbloid/ISpark
• 98. IV. Spark without Hadoop 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways 98
• 99. 5. Key Takeaways 1. File System: Spark is file-system agnostic. Bring your own storage! 2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment. 3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging! 4. Alternatives: do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another. 99