From Hadoop to Spark
Introduction
Hadoop and Spark Comparison
From Hadoop to Spark
Hi, I’m Sujee Maniyam
•  Founder / Principal @ ElephantScale
•  Consulting & Training in Big Data
•  Spark / Hadoop / NoSQL / Data Science
•  Author
–  “Hadoop illuminated” open source book
–  “HBase Design Patterns”
•  Open Source contributor: github.com/sujee
•  sujee@elephantscale.com
•  www.ElephantScale.com
(c) ElephantScale.com 2015
Spark Training available!
Webinar Audience
•  I am already using Hadoop. Should I go to Spark?
•  I am thinking about Hadoop. Should I skip Hadoop and go to Spark?
Webinar Outline
•  Intro: what is Hadoop and what is Spark?
•  Capabilities and advantages of Spark & Hadoop
•  Best use cases for Spark / Hadoop
•  From Hadoop to Spark – how to?
Introduction
Introduction
Hadoop and Spark Comparison
From Hadoop to Spark
Hadoop in 20 Seconds
•  ‘The Original’ Big Data platform
•  Very well field-tested
•  Scales to petabytes of data
•  Enables analytics at massive scale
Hadoop Ecosystem
[Figure: Hadoop ecosystem components, grouped into Batch and Real Time]
Hadoop Ecosystem – by function
•  HDFS
–  Provides distributed storage
•  MapReduce
–  Provides distributed computing
•  Pig
–  High-level MapReduce
•  Hive
–  SQL layer over Hadoop
•  HBase
–  NoSQL storage for real-time queries
Hadoop Extended Ecosystem
[Figure: extended Hadoop ecosystem. Source : Hortonworks]
Hadoop : Use Cases
•  Two modes : Batch & Real Time
•  Batch use case
–  Analytics at large scale (terabytes to petabytes)
–  Analytics times can be minutes / hours, depending on
   •  Size of data being analyzed
   •  Type of query
–  Examples:
   •  Large ETL workloads
   •  “Analyze clickstream data and calculate top page visits”
   •  “Combine purchase data and click data and figure out discounts to apply”
Hadoop Use Cases
•  Real-time use cases do not rely on MapReduce
•  Instead we use HBase
–  A real-time NoSQL datastore built on Hadoop
•  Example : Tracking sensor data
–  Store data from millions of sensors
–  Could be billions of data points
–  “Find latest reading from a sensor”
–  This query must be done in real time (in milliseconds)
•  “Needle in a haystack” scenarios
–  We look for one / few records within billions
Hadoop Reference Architecture (Example)
[Figure: example Hadoop reference architecture. Source : Hortonworks]
Data Spectrum
Big Data Analytics Evolution (v1)
•  Decision times : batch (hours / days)
•  Use cases:
–  Modeling
–  ETL
–  Reporting
Moving Towards Fast Data (v2)
•  Decision time : (near) real time
–  Seconds (or milliseconds)
•  Use cases
–  Alerts (medical / security)
–  Fraud detection
Current Big Data Processing Challenges
•  Processing needs are outpacing 1st generation tools
•  Beyond Batch
–  Not everyone has terabytes of data to process
–  Small to medium data sets (a few hundred gigs) are more prevalent
–  Data may not be on disk
   •  In memory
   •  Coming via streaming channels
•  MapReduce (MR) limitations
–  Batch processing doesn't fit all needs
–  Not effective for ‘iterative programming’ (machine learning algorithms, etc.)
–  High latency for streaming needs
•  Spark is a 2nd generation tool addressing these needs
What is Spark?
•  Open source cluster computing engine
–  Very fast: in-memory ops up to 100x faster than MR
   •  On-disk ops up to 10x faster than MR
–  General purpose: MR, SQL, streaming, machine learning, analytics
–  Compatible: runs on Hadoop (YARN), Mesos, or standalone
   •  Works with HDFS, S3, Cassandra, HBase, …
–  Easier to code: word count in 2 lines
•  Spark's roots:
–  Came out of Berkeley AMP Lab
–  Now a top-level Apache project
–  Version 1.5 released in Sept 2015
“First Big Data platform to integrate batch, streaming and interactive computations in a unified framework” – stratio.com
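A minimal sketch of the ‘word count in 2 lines’ idea, assuming Spark 1.3 or later is on the classpath and a local master is acceptable. The object name `WordCountSketch`, the helper `countWords`, and the sample sentences are made up for illustration; only the RDD calls (`parallelize`, `flatMap`, `map`, `reduceByKey`, `collect`) are standard Spark API.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  // The word count itself: split lines into words, pair each word with 1,
  // then sum the 1s per word. Takes an in-memory Seq (an assumption made
  // here so the sketch needs no input file).
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; on a cluster the master would differ.
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val counts = countWords(sc, Seq("to be or not to be", "that is the question"))
      counts.foreach { case (word, n) => println(s"$word: $n") }
    } finally {
      sc.stop()
    }
  }
}
```

For data in HDFS or S3, `sc.textFile(path)` would take the place of `sc.parallelize(lines)`; the rest of the chain stays the same.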
Spark Illustrated
[Figure: the Spark stack]
–  Libraries on top of Spark Core: Spark SQL (schema / SQL), Spark Streaming (real time), MLlib (machine learning), GraphX (graph processing)
–  Cluster managers: Standalone, YARN, Mesos
–  Data storage: HDFS, S3, Cassandra, ???
Spark Core
•  Basic building blocks for distributed compute engine
–  Task schedulers and memory management
–  Fault recovery (recovers missing pieces on node failure)
–  Storage system interfaces
•  Defines Spark API and data model
•  Data Model: RDD (Resilient Distributed Dataset)
–  Distributed collection of items
–  Can be worked on in parallel
–  Easily created from many data sources (any Hadoop InputFormat)
•  Spark API: Scala, Python, and Java
–  Compact API for working with RDDs and interacting with Spark
–  Much easier to use than the MapReduce API
Session 2: Introduction to Spark
Spark Components
•  Spark SQL: Structured data
–  Supports SQL and HQL (Hive Query Language)
–  Data sources include Hive tables, JSON, CSV, Parquet
•  Spark Streaming: Live streams of data in real time
–  Low latency, high throughput (1000s of events / sec)
–  Log files, stock ticks, sensor data / IoT (Internet of Things) …
•  MLlib: Machine learning at scale
–  Classification / regression, collaborative filtering …
–  Model evaluation and data import
•  GraphX: Graph manipulation, graph-parallel computation
–  Social network friendships, link data, …
–  Graph manipulation, operations, and common algorithms
Spark : 'Unified' Stack
•  Spark components support multiple programming models
–  MapReduce-style batch processing
–  Streaming / real-time processing
–  Querying via SQL
–  Machine learning
•  All modules are tightly integrated
–  Facilitates rich applications
•  Spark can be the only stack you need!
–  No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
Hype-o-meter :)
Spark Job Trends
Hadoop and Spark Comparison
Introduction
Hadoop and Spark Comparison
Going from Hadoop to Spark
Spark Benchmarks
Source : stratio.com
Spark Code / Activity
Source : stratio.com
Timeline : Hadoop & Spark
Hadoop Vs. Spark
Hadoop
Spark
Source : http://www.kwigger.com/mit-skifte-til-mac/
Comparison With Hadoop
Hadoop | Spark
Distributed storage + distributed compute | Distributed compute only
MapReduce framework | Generalized computation
Usually data on disk (HDFS) | On disk / in memory
Not ideal for iterative work | Great at iterative workloads (machine learning, etc.)
Batch processing | Up to 10x faster for data on disk; up to 100x faster for data in memory
Mostly Java | Compact code; Java, Python, Scala supported
No unified shell | Shell for ad-hoc exploration
Spark Is Better Fit for Iterative Workloads
Spark Programming Model
•  More generic than MapReduce
Is Spark Replacing Hadoop?
•  Spark runs on Hadoop / YARN
•  Can access data in HDFS
•  Can use YARN for clustering
•  Spark programming model is more flexible than MapReduce
•  Spark is really great if data fits in memory (a few hundred gigs)
•  Spark is ‘storage agnostic’ (see next slide)
Spark & Pluggable Storage
[Figure: Spark (compute engine) on top of pluggable storage: HDFS, Amazon S3, Cassandra, ???]
Spark & Hadoop
Use Case | Hadoop | Spark
Batch processing | Hadoop’s MapReduce (Java, Pig, Hive) | Spark RDDs (Java / Scala / Python)
SQL querying | Hive | Spark SQL
Stream processing / real-time processing | Storm, Kafka | Spark Streaming
Machine learning | Mahout | Spark MLlib
Real-time lookups | HBase (NoSQL) | No native Spark component, but Spark can query data in NoSQL stores
Hadoop + YARN : OS for Distributed Compute
[Figure: layered stack]
–  Applications: Batch (MapReduce), Streaming (Storm, S4), In-memory (Spark)
–  Cluster management: YARN
–  Storage: HDFS
(or at least, that’s the idea)
Hadoop & Spark Future ???
Going from Hadoop to Spark
Introduction
Hadoop and Spark Comparison
Going from Hadoop to Spark
Why Move From Hadoop to Spark?
•  Spark is ‘easier’ than Hadoop
•  ‘Friendlier’ for data scientists / analysts
–  Interactive shell
   •  Fast development cycles
   •  Ad-hoc exploration
•  API supports multiple languages
–  Java, Scala, Python
•  Great for small (gigs) to medium (100s of gigs) data
Spark : ‘Unified’ Stack
•  Spark supports multiple programming models
–  MapReduce-style batch processing
–  Streaming / real-time processing
–  Querying via SQL
–  Machine learning
•  All modules are tightly integrated
–  Facilitates rich applications
•  Spark can be the only stack you need!
–  No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
Image: buymeposters.com
Migrating From Hadoop → Spark
Functionality | Hadoop | Spark
Distributed storage | HDFS; cloud storage (Amazon S3) | HDFS; cloud storage (Amazon S3); distributed file system (NFS / Ceph); distributed NoSQL (Cassandra); Tachyon (in memory)
SQL querying | Hive | Spark SQL (DataFrames)
ETL workflow | Pig | Spork (Pig on Spark); mix of Spark SQL + RDD programming
Machine learning | Mahout | MLlib
NoSQL DB | HBase | ???
Things to Consider When Moving From Hadoop to Spark
1.  Data size
2.  File system
3.  Analytics
    A.  SQL
    B.  ETL
    C.  Machine learning
Data Size : “You Don’t Have Big Data”
Data Size (T-shirt sizing)
[Figure: T-shirt sizing of data: < few G, 10 G+, 100 G+, 1 TB+, 100 TB+, PB+, with regions labeled Spark and Hadoop / Spark]
Image credit : blog.trumpi.co.za
Data Size
•  Lots of Spark adoption at SMALL to MEDIUM scale
–  Good fit
–  Data might fit in memory!
•  Applications
–  Iterative workloads (machine learning, etc.)
–  Streaming
Decision : Data Size
•  Data size
–  < 1 TB : Spark
–  > 1 TB : Hadoop / Spark
Decision : File System
“What kind of file system do I need for Spark?”
File System
•  Hadoop = Storage + Compute
•  Spark = Compute only
•  Spark needs a distributed FS
•  File system choices for Spark
–  HDFS : Hadoop Distributed File System
   •  Reliable
   •  Good performance (data locality)
   •  Field-tested for PBs of data
–  S3 : Amazon
   •  Reliable cloud storage
   •  Huge scale
–  NFS : Network File System (shared FS across machines)
–  Tachyon (in memory, experimental)
Spark File Systems
File Systems For Spark
 | HDFS | NFS | Amazon S3
Data locality | High (best) | Local enough | None (ok)
Throughput | High (best) | Medium (good) | Low (ok)
Latency | Low (best) | Low | High
Reliability | Very high (replicated) | Low | Very high
Cost | Varies | Varies | $30 / TB / month
File Systems Throughput Comparison
•  Data : 10 G+ (11.3 G)
•  Each file : ~1+ G (x 10)
•  400 million records total
•  Partition size : 128 M
•  On HDFS & S3
•  Cluster :
–  8 nodes on Amazon m3.xlarge (4 CPU, 15 G mem, 40 G SSD)
–  Hadoop cluster, Hortonworks HDP v2.2
–  Spark : on same 8 nodes, standalone, v1.2
HDFS Vs. S3 (lower is better)
HDFS Vs. S3 (lower is better)
HDFS Vs. S3 Conclusions
HDFS | S3
Data locality → much higher throughput | Data is streamed → lower throughput
Need to maintain a Hadoop cluster | No Hadoop cluster to maintain → convenient
Large data sets (TB+) | Good use case: smallish data sets (few gigs); load once, cache, and re-use
Decision : File Systems
•  Already have Hadoop?
–  YES : use HDFS
–  NO : HDFS, S3, NFS (Ceph), Cassandra (real time)
Next Decision : SQL
“We use SQL heavily for data mining. We are using Hive / Impala on Hadoop. Is Spark right for us?”
SQL in Hadoop / Spark
 | Hadoop | Spark
Engine | Hive (on MapReduce, or Tez on Hortonworks); Impala (Cloudera) | Spark SQL using DataFrames; Hive context
Language | HiveQL | HiveQL; RDD programming in Java / Python / Scala
Scale | Terabytes / Petabytes | Gigabytes / Terabytes / Petabytes
Interoperability | Data stored in HDFS | Hive tables; file system
Formats | CSV, JSON, Parquet | CSV, JSON, Parquet
DataFrames Vs. RDDs
•  RDDs have data
•  DataFrames also have schema
•  DataFrames used to be called ‘SchemaRDD’
•  Unified way to load / save data in multiple formats
•  Provides high level operations
–  Count / sum / average
–  Select columns & filter them
// load json data
val df = sqlContext.read
  .format("json")
  .load("/data/data.json")
// save as parquet (faster queries); save() takes a path, saveAsTable() takes a table name
df.write
  .format("parquet")
  .save("/data/datap/")
Supported Formats
Creating a DataFrame From JSON
{"name": "John", "age": 35 }
{"name": "Jane", "age": 40 }
{"name": "Mike", "age": 20 }
{"name": "Sue", "age": 52 }
Session 6: Spark SQL
scala> val peopleDF = sqlContext.read.json("people.json")
peopleDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> peopleDF.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
scala> peopleDF.show()
+---+----+
|age|name|
+---+----+
| 35|John|
| 40|Jane|
| 20|Mike|
| 52| Sue|
+---+----+
Querying Using SQL
•  A DataFrame can be registered as a temporary table
–  You can then use SQL to query it, as shown below
–  This is handled similarly to DSL queries: an AST is built up and sent to Catalyst
scala> peopleDF.registerTempTable("people")
scala> sqlContext.sql("select * from people").show()
+---+----+
|age|name|
+---+----+
| 35|John|
| 40|Jane|
| 20|Mike|
| 52| Sue|
+---+----+
scala> sqlContext.sql("select * from people where age > 35").show()
+---+----+
|age|name|
+---+----+
| 40|Jane|
| 52| Sue|
+---+----+
Going From Hive → Spark
•  Spark natively supports querying data stored in Hive tables!
•  Handy to use in an existing Hadoop cluster!
HIVE
hive> select customer_id, SUM(cost) as total from billing group by customer_id order by total DESC LIMIT 10;
SPARK
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
// triple quotes let the SQL string span more than one line
val top10 = hiveCtx.sql(
  """select customer_id, SUM(cost) as total from billing
     group by customer_id order by total DESC LIMIT 10""")
top10.collect()
Spark SQL Vs. Hive
Fast on same HDFS data!
Spark SQL Vs. Hive
Fast on same data on HDFS
Decision : SQL
•  Using Hive?
–  YES : Spark using HiveContext
–  NO : Spark SQL with DataFrames
Next Decision : ETL
“We do a lot of ETL work on our Hadoop cluster, using tools like Pig / Cascading. Can we use Spark?”
ETL on Hadoop / Spark
ETL | Hadoop | Spark
ETL tools | Pig, Cascading, Oozie | Native RDD programming (Scala, Java, Python); Cascading?
Pig | High level ETL workflow | Spork : Pig on Spark
Cascading | High level | Spark-scalding
Cask | Works | Works
Data Transformation on Spark
•  DataFrames are great for high level manipulation of data
–  High level operations : join / union, etc.
–  Joining / merging disparate data sets
–  Can read and understand a multitude of data formats (JSON / Parquet, etc.)
–  Very easy to program
•  RDD APIs allow low level programming
–  Complex manipulations
–  Lookups
–  Supports multiple languages (Java / Scala / Python)
•  High level libraries are emerging
–  Tresata
–  CASK
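A hedged sketch of the ‘joining / merging disparate data sets’ point with DataFrames, written against the Spark 1.x `SQLContext` API used elsewhere in this deck. The object name `DataFrameEtlSketch`, the helper `totalsByCustomer`, and the tiny in-line customers / billing data are assumptions for illustration only; in a real job the two DataFrames would be loaded from files or Hive tables.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

object DataFrameEtlSketch {
  // Joins two small data sets on customer_id and totals the cost per customer,
  // highest total first. Returns plain Scala values so the result is easy to inspect.
  def totalsByCustomer(sqlContext: SQLContext): Array[(String, Double)] = {
    import sqlContext.implicits._ // enables toDF(...) and the $"col" syntax

    // Hypothetical stand-ins for two disparate sources
    val customers = Seq((1, "John"), (2, "Jane")).toDF("customer_id", "name")
    val billing = Seq((1, 10.0), (1, 5.5), (2, 7.0)).toDF("customer_id", "cost")

    val totals = customers
      .join(billing, "customer_id")      // equi-join on the shared column
      .groupBy("name")
      .agg(sum("cost").as("total"))
      .orderBy($"total".desc)

    totals.collect().map(row => (row.getString(0), row.getDouble(1)))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("df-etl-sketch").setMaster("local[*]"))
    try {
      val sqlContext = new SQLContext(sc)
      totalsByCustomer(sqlContext).foreach { case (name, total) =>
        println(s"$name: $total")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The same join could be written with the RDD API, but the DataFrame version stays close to the SQL a Pig or Hive user would already have in mind.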
Decisions : ETL
•  Current ETL
–  Pig : Spork, RDD API, DataFrames
–  Cascading : Cascading on Spark, RDD, DataFrames
–  Java MapReduce / custom : RDD, DataFrames
Decision : Machine Learning
Can we use Spark for Machine Learning? YES
Machine Learning : Hadoop / Spark
 | Hadoop | Spark
Tool | Mahout | MLlib
API | Java | Java / Scala / Python
Iterative algorithms | Slower | Very fast (in memory)
In-memory processing | No | YES
 | Mahout runs on Hadoop or on Spark | New and young lib
Latest news! | Mahout only accepts new code that runs on Spark | Mahout & MLlib on Spark
Future? | Many opinions |
Decision : In-Memory Processing
How can we do in-memory processing using Spark?
Numbers Everyone Should Know, by Jeff Dean, Fellow @ Google
Operation | Cost (in nanoseconds) | In milliseconds
L1 cache reference | 0.5 |
Branch mispredict (CPU) | 5 |
L2 cache reference | 7 |
Mutex lock / unlock | 100 |
Main memory reference | 100 |
Compress 1K bytes with Zippy | 10,000 |
Send 2K bytes over 1 Gbps network | 20,000 |
Read 1 MB sequentially from memory | 250,000 | 0.25 ms
Round trip within same datacenter | 500,000 | 0.5 ms
Disk seek | 10,000,000 | 10 ms
Read 1 MB sequentially from network | 10,000,000 | 10 ms
Read 1 MB sequentially from disk | 30,000,000 | 30 ms
Send packet CA → Netherlands → CA | 150,000,000 | 150 ms
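To make the table concrete, here is a back-of-the-envelope estimate using only its three ‘read 1 MB sequentially’ rows. The 10 GB data set size and the names `LatencyEstimate` and `readSeconds` are assumptions chosen for illustration; the estimate is for a single sequential reader and ignores parallelism, caching layers, and compression.

```scala
object LatencyEstimate {
  // Per-megabyte sequential read costs taken from the table above, in milliseconds.
  val MemoryMsPerMb = 0.25
  val NetworkMsPerMb = 10.0
  val DiskMsPerMb = 30.0

  /** Estimated seconds to read `megabytes` sequentially at `msPerMb` per megabyte. */
  def readSeconds(megabytes: Double, msPerMb: Double): Double =
    megabytes * msPerMb / 1000.0

  def main(args: Array[String]): Unit = {
    val megabytes = 10.0 * 1024 // hypothetical 10 GB data set

    val memory = readSeconds(megabytes, MemoryMsPerMb)
    val network = readSeconds(megabytes, NetworkMsPerMb)
    val disk = readSeconds(megabytes, DiskMsPerMb)

    println(f"memory  : $memory%.2f s")
    println(f"network : $network%.2f s")
    println(f"disk    : $disk%.2f s")
    println(f"disk / memory ratio : ${disk / memory}%.0fx")
  }
}
```

With these numbers a single sequential pass over 10 GB takes about 2.6 seconds from memory versus roughly 5 minutes from disk, which is the intuition behind caching data sets that are read repeatedly.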
Spark Caching
•  Caching is pretty effective (small / medium data sets)
•  Cached data cannot be shared across applications (each application executes in its own sandbox)
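A minimal sketch of caching inside a single application, assuming Spark on the classpath and a local master. The object name `CachingSketch`, the helper `countTwice`, and the generated range of numbers are illustrative assumptions; `cache()`, `count()`, and `filter()` are standard RDD calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  // Runs two actions over the same RDD. cache() only marks the RDD;
  // the first action materializes it in memory, and the second action
  // reuses the cached partitions instead of recomputing them.
  def countTwice(sc: SparkContext): (Long, Long) = {
    val numbers = sc.parallelize(1 to 1000000).cache()
    val total = numbers.count()                    // computes and populates the cache
    val evens = numbers.filter(_ % 2 == 0).count() // reads from the cache
    (total, evens)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("caching-sketch").setMaster("local[*]"))
    try {
      val (total, evens) = countTwice(sc)
      println(s"total = $total, evens = $evens")
    } finally {
      sc.stop() // the cache is gone once this application's context stops
    }
  }
}
```

The cached partitions belong to this one `SparkContext`, which is why a second application cannot see them.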
Caching Results
Cached!
Caching Results
Cached!
Sharing Cached Data
•  By default Spark applications cannot share cached data
–  Running in isolation
•  1) ‘Spark Job Server’
–  Multiplexer
–  All requests are executed through the same ‘context’
–  Provides web-service interface
•  2) Tachyon
–  Distributed in-memory file system
–  Memory is the new disk!
–  Out of AMP Lab, Berkeley
–  Early stages (very promising)
Spark Job Server
Spark Job Server
•  Open sourced from Ooyala
•  ‘Spark as a Service’ : simple REST interface to launch jobs
•  Sub-second latency!
•  Pre-load jars for even faster spin-up
•  Share cached RDDs across requests (NamedRDD)
•  https://github.com/spark-jobserver/spark-jobserver
Pseudocode:
// App1
sharedCtx.saveRDD("my cached rdd", rdd1)
// App2
RDD rdd2 = sharedCtx.loadRDD("my cached rdd")
Tachyon + Spark
How to Get Spark?
Getting Spark
•  Running Hadoop?
–  YES : Install Spark on Hadoop cluster
–  NO : Need HDFS?
   •  YES : Install HDFS + YARN + Spark
   •  NO : Environment?
      –  Production : Spark + Mesos + S3
      –  Testing : Spark (standalone) + NFS or S3
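The decision tree on this slide can be sketched as a small function. This is only a restatement of the slide in code: the names `GettingSpark`, `Environment`, and `recommend` are made up here, and the returned strings are the slide's own labels.

```scala
object GettingSpark {
  sealed trait Environment
  case object Production extends Environment
  case object Testing extends Environment

  /** Walks the slide's decision tree from top to bottom. */
  def recommend(runningHadoop: Boolean, needHdfs: Boolean, env: Environment): String =
    if (runningHadoop) "Install Spark on Hadoop cluster"
    else if (needHdfs) "Install HDFS + YARN + Spark"
    else env match {
      case Production => "Spark + Mesos + S3"
      case Testing    => "Spark (standalone) + NFS or S3"
    }

  def main(args: Array[String]): Unit = {
    // Example: no Hadoop today, no need for HDFS, just trying things out.
    println(recommend(runningHadoop = false, needHdfs = false, env = Testing))
  }
}
```

Note that the later questions only matter when the earlier answers are NO, so `needHdfs` and `env` are ignored for teams already running Hadoop.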
Spark Cluster Setup 1 : Simple
•  Great for POCs / experimentation
•  No dependencies
•  Uses Spark’s ‘standalone’ manager
Spark Cluster Setup 2 : Production
•  Works well with Hadoop ecosystem (HDFS / Hive, etc.)
•  Best way to adopt Spark on Hadoop
•  Uses YARN as cluster manager
Spark Cluster Setup 3 : Production
•  Uses Mesos as cluster manager
Hadoop -> Spark Case Study
Use Case 1 : Moving to Cloud
Use Case 1 : Lessons Learned
•  Size
–  Small Hadoop cluster (8 nodes)
–  Smallish data : 50G – 300G
–  Data for processing : few gigs per query
•  Good!
–  Only one moving part : Spark
–  No Hadoop cluster to maintain
–  S3 was a dependable storage (passive)
–  Query response time went from minutes to seconds (because we went from MR → Spark)
•  Not so good
–  We lost the data locality of HDFS (ok for small / medium data sets)
Use Case 2 : Persistent Caching in Spark
•  Can we improve latency in this setup?
•  Caching will help
•  However, in Spark cached data cannot be shared across applications :(
Use Case 2 : Persistent Caching in Spark
•  Spark Job Server to the rescue!
Final Thoughts
•  Already on Hadoop?
–  Try Spark side-by-side
–  Process some data in HDFS
–  Try Spark SQL for Hive tables
•  Contemplating Hadoop?
–  Try Spark (standalone)
–  Choose NFS or S3 file system
•  Take advantage of caching
–  Iterative loads
–  Spark Job Server
–  Tachyon
Thanks and questions?
Sujee Maniyam
Founder / Principal @ ElephantScale
Expert Consulting + Training in Big Data technologies
sujee@elephantscale.com
Elephantscale.com
Sign up for upcoming trainings : ElephantScale.com/training

More Related Content

What's hot

Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet odsc
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsMilind Bhandarkar
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterMilind Bhandarkar
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinDatabricks
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Databricks
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Databricks
 
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC ResourcesmyHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC ResourcesSriram Krishnan
 

What's hot (20)

Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther Kundin
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
 
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC ResourcesmyHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
 

Viewers also liked

Reference architecture for Internet of Things
Reference architecture for Internet of ThingsReference architecture for Internet of Things
Reference architecture for Internet of ThingsSujee Maniyam
 
Launching your career in Big Data
Launching your career in Big DataLaunching your career in Big Data
Launching your career in Big DataSujee Maniyam
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pigRavi Mutyala
 
Porting your hadoop app to horton works hdp
Porting your hadoop app to horton works hdpPorting your hadoop app to horton works hdp
Porting your hadoop app to horton works hdpMark Kerzner
 
Night owl by Boyd Meyer of PROS
Night owl by Boyd Meyer of PROS Night owl by Boyd Meyer of PROS
Night owl by Boyd Meyer of PROS Mark Kerzner
 
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Hadoop to spark_v2

  • 1. From Hadoop to Spark
      • Introduction
      • Hadoop and Spark Comparison
      • From Hadoop to Spark
  • 2. Hi, I’m Sujee Maniyam
      • Founder / Principal @ ElephantScale
      • Consulting & Training in Big Data
      • Spark / Hadoop / NoSQL / Data Science
      • Author
          – “Hadoop illuminated” open source book
          – “HBase Design Patterns”
      • Open Source contributor: github.com/sujee
      • sujee@elephantscale.com
      • www.ElephantScale.com
      • Spark Training available!
      (c) ElephantScale.com 2015
  • 3. Webinar Audience
      • I am already using Hadoop; should I go to Spark?
      • I am thinking about Hadoop; should I skip Hadoop and go to Spark?
  • 4. Webinar Outline
      • Intro: what is Hadoop and what is Spark?
      • Capabilities and advantages of Spark & Hadoop
      • Best use cases for Spark / Hadoop
      • From Hadoop to Spark – how to?
      Webinar: From Hadoop to Spark
  • 5. Introduction
      • Introduction
      • Hadoop and Spark Comparison
      • From Hadoop to Spark
  • 6. Hadoop in 20 Seconds
      • ‘The Original’ Big Data platform
      • Very well field-tested
      • Scales to petabytes of data
      • Enables analytics at massive scale
  • 7. Hadoop Ecosystem (diagram: Batch, Real Time)
  • 8. Hadoop Ecosystem – by function
      • HDFS
          – Provides distributed storage
      • MapReduce
          – Provides distributed computing
      • Pig
          – High-level MapReduce
      • Hive
          – SQL layer over Hadoop
      • HBase
          – NoSQL storage for real-time queries
  • 9. Hadoop Extended Ecosystem (diagram; source: Hortonworks)
  • 10. Hadoop: Use Cases
      • Two modes: Batch & Real Time
      • Batch use case
          – Analytics at large scale (terabytes to petabytes)
          – Analytics times can be minutes or hours, depending on
              • Size of the data being analyzed
              • Type of query
          – Examples:
              • Large ETL workloads
              • “Analyze clickstream data and calculate top page visits”
              • “Combine purchase data and click data and figure out discounts to apply”
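The clickstream example on this slide can be sketched as a map step followed by a reduce step. This is a single-process illustration in plain Python, not Hadoop code; the (user, URL) record layout and the function name `top_pages` are assumptions made for the example.

```python
from collections import Counter

def top_pages(clicks, n=3):
    """Return the n most visited URLs from (user, url) click records."""
    # Map: emit one (url, 1) pair per click.
    pairs = ((url, 1) for _user, url in clicks)
    # Reduce: sum the counts per URL.
    counts = Counter()
    for url, one in pairs:
        counts[url] += one
    return counts.most_common(n)

# Hypothetical sample data.
clicks = [
    ("u1", "/home"), ("u2", "/home"), ("u1", "/cart"),
    ("u3", "/home"), ("u2", "/cart"), ("u3", "/help"),
]
print(top_pages(clicks, n=2))  # [('/home', 3), ('/cart', 2)]
```

On a cluster, the map and reduce steps would run in parallel over many input splits; the shape of the computation stays the same.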
  • 11. Hadoop Use Cases
      • Real-time use cases do not rely on MapReduce
      • Instead we use HBase
          – A real-time NoSQL datastore built on Hadoop
      • Example: tracking sensor data
          – Store data from millions of sensors
          – Could be billions of data points
          – “Find latest reading from a sensor”
          – This query must be done in real time (in milliseconds)
      • “Needle in a haystack” scenarios
          – We look for one or a few records within billions
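The “find latest reading from a sensor” query is a point lookup by key rather than a scan. A minimal sketch of that access pattern, using a plain dictionary as a stand-in for a key-value store (this is not the HBase API; the class and method names are hypothetical):

```python
class LatestReadingStore:
    """Toy in-memory key-value store: sensor id -> most recent reading."""

    def __init__(self):
        self._latest = {}

    def put(self, sensor_id, timestamp, value):
        # Keep only the newest reading per sensor so lookups stay constant time.
        current = self._latest.get(sensor_id)
        if current is None or timestamp >= current[0]:
            self._latest[sensor_id] = (timestamp, value)

    def latest(self, sensor_id):
        # A point lookup by key; no scan over all readings is needed.
        return self._latest.get(sensor_id)

store = LatestReadingStore()
store.put("sensor-1", 100, 20.5)
store.put("sensor-2", 101, 19.0)
store.put("sensor-1", 105, 21.0)
store.put("sensor-1", 102, 20.8)  # late-arriving older reading is ignored
print(store.latest("sensor-1"))   # (105, 21.0)
```

The point of the sketch is the contrast with batch jobs: the answer comes from one keyed read, which is why this kind of query can be served in milliseconds.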
  • 12. Hadoop Reference Architecture (Example) (diagram; source: Hortonworks)
  • 14. Big Data Analytics Evolution (v1)
      • Decision times: batch (hours / days)
      • Use cases:
          – Modeling
          – ETL
          – Reporting
  • 15. Moving Towards Fast Data (v2)
      • Decision time: (near) real time, in seconds (or milliseconds)
      • Use cases
          – Alerts (medical / security)
          – Fraud detection
  • 16. Current Big Data Processing Challenges
      • Processing needs are outpacing 1st-generation tools
      • Beyond batch
          – Not everyone has terabytes of data to process
          – Small to medium data sets (a few hundred gigabytes) are more prevalent
          – Data may not be on disk
              • In memory
              • Coming via streaming channels
      • MapReduce (MR) limitations
          – Batch processing doesn't fit all needs
          – Not effective for ‘iterative programming’ (machine learning algorithms, etc.)
          – High latency for streaming needs
      • Spark is a 2nd-generation tool addressing these needs
  • 17. What is Spark?
      • Open source cluster computing engine
          – Very fast: in-memory ops up to 100x faster than MR
              • On-disk ops up to 10x faster than MR
          – General purpose: MR, SQL, streaming, machine learning, analytics
          – Compatible: runs over Hadoop, Mesos, YARN, standalone
              • Works with HDFS, S3, Cassandra, HBase, …
          – Easier to code: word count in 2 lines
      • Spark's roots:
          – Came out of Berkeley AMP Lab
          – Now a top-level Apache project
          – Version 1.5 released in Sept 2015
      • “First Big Data platform to integrate batch, streaming and interactive computations in a unified framework” – stratio.com
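The slide does not show its two-line word count, so the sketch below is an assumption about its shape: the same three steps written in plain Python, with comments naming the RDD operations (`flatMap`, `map`, `reduceByKey`) they would correspond to in Spark. It runs in a single process and is not Spark code.

```python
from collections import defaultdict

def word_count(lines):
    # "flatMap": split each line into words.
    words = [w for line in lines for w in line.split()]
    # "map": pair each word with a count of 1.
    pairs = [(w, 1) for w in words]
    # "reduceByKey": add up the counts per word.
    counts = defaultdict(int)
    for w, c in pairs:
        counts[w] += c
    return dict(counts)

print(word_count(["to be or", "not to be"]))
```

In Spark the three steps chain into a single expression on an RDD, which is what makes the compact version possible.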
  • 18. Spark Illustrated (diagram)
      • Spark Core
      • Spark SQL (schema / SQL), Spark Streaming (real time), MLlib (machine learning), GraphX (graph processing)
      • Cluster managers: Standalone, YARN, Mesos
      • Data storage: HDFS, S3, Cassandra, ???
  • 19. Spark Core
      • Basic building blocks for a distributed compute engine
          – Task schedulers and memory management
          – Fault recovery (recovers missing pieces on node failure)
          – Storage system interfaces
      • Defines the Spark API and data model
      • Data model: RDD (Resilient Distributed Dataset)
          – Distributed collection of items
          – Can be worked on in parallel
          – Easily created from many data sources (any HDFS InputSource)
      • Spark API: Scala, Python, and Java
          – Compact API for working with RDDs and interacting with Spark
          – Much easier to use than the MapReduce API
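To make the RDD idea concrete, here is a toy model of a partitioned collection that records its transformations and can rebuild any one partition from them. It is a simplified illustration of the concept, not Spark's implementation; `ToyRDD` and its methods are hypothetical names, and everything runs in one process.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions plus the steps used to build them."""

    def __init__(self, source_partitions, steps=()):
        self._source = source_partitions  # original input, split into partitions
        self._steps = tuple(steps)        # recorded transformations, in order

    def map(self, fn):
        # Transformations are recorded here and only run when data is requested.
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self._source, self._steps + (step,))

    def filter(self, pred):
        step = lambda part: [x for x in part if pred(x)]
        return ToyRDD(self._source, self._steps + (step,))

    def compute_partition(self, index):
        # One partition can be rebuilt from its source data and the recorded
        # steps, which is the idea behind recovering a lost piece.
        part = list(self._source[index])
        for step in self._steps:
            part = step(part)
        return part

    def collect(self):
        out = []
        for i in range(len(self._source)):
            out.extend(self.compute_partition(i))
        return out

rdd = ToyRDD([[1, 2, 3], [4, 5, 6]]).map(lambda x: x * 10).filter(lambda x: x > 20)
print(rdd.collect())             # [30, 40, 50, 60]
print(rdd.compute_partition(1))  # [40, 50, 60]
```

Because each partition is computed independently, the partitions could be processed in parallel on different machines; the toy version simply loops over them.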
  • 20. Spark Components
      • Spark SQL: structured data
          – Supports SQL and HQL (Hive Query Language)
          – Data sources include Hive tables, JSON, CSV, Parquet
      • Spark Streaming: live streams of data in real time
          – Low latency, high throughput (1000s of events / sec)
          – Log files, stock ticks, sensor data / IoT (Internet of Things), …
      • MLlib: machine learning at scale
          – Classification / regression, collaborative filtering, …
          – Model evaluation and data import
      • GraphX: graph manipulation, graph-parallel computation
          – Social network friendships, link data, …
          – Graph manipulation, operations, and common algorithms
  • 21. Spark: ‘Unified’ Stack
      • Spark components support multiple programming models
          – MapReduce-style batch processing
          – Streaming / real-time processing
          – Querying via SQL
          – Machine learning
      • All modules are tightly integrated
          – Facilitates rich applications
      • Spark can be the only stack you need!
          – No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
  • 23. Spark Job Trends (chart)
  • 24. Hadoop and Spark Comparison
      • Introduction
      • Hadoop and Spark Comparison
      • Going from Hadoop to Spark
  • 25. Spark Benchmarks (chart; source: stratio.com)
  • 26. Spark Code / Activity (chart; source: stratio.com)
  • 27. Timeline: Hadoop & Spark (diagram)
  • 28. Hadoop vs. Spark (image; source: http://www.kwigger.com/mit-skifte-til-mac/)
  • 29. Comparison With Hadoop
      Hadoop | Spark
      Distributed storage + distributed compute | Distributed compute only
      MapReduce framework | Generalized computation
      Usually data on disk (HDFS) | On disk / in memory
      Not ideal for iterative work | Great at iterative workloads (machine learning, etc.)
      Batch process | Up to 10x faster for data on disk; up to 100x faster for data in memory
      Mostly Java | Compact code; Java, Python, Scala supported
      No unified shell | Shell for ad hoc exploration
  • 30. Spark Is a Better Fit for Iterative Workloads (diagram)
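A small sketch of why iterative workloads benefit from keeping data in memory: an algorithm that makes several passes over the same input either re-reads it on every pass or loads it once and reuses it. The loader, the update rule, and the function names are all hypothetical, and a counter stands in for the cost of reading from disk.

```python
def load_points(counter):
    """Stand-in for an expensive read from disk; counts how often it runs."""
    counter["loads"] += 1
    return [1.0, 2.0, 3.0, 4.0]

def iterate_without_cache(iterations):
    counter = {"loads": 0}
    estimate = 0.0
    for _ in range(iterations):
        points = load_points(counter)   # input is re-read on every pass
        mean = sum(points) / len(points)
        estimate += 0.5 * (mean - estimate)
    return estimate, counter["loads"]

def iterate_with_cache(iterations):
    counter = {"loads": 0}
    points = load_points(counter)       # read once, keep in memory
    estimate = 0.0
    for _ in range(iterations):
        mean = sum(points) / len(points)
        estimate += 0.5 * (mean - estimate)
    return estimate, counter["loads"]

print(iterate_without_cache(5)[1])  # 5
print(iterate_with_cache(5)[1])     # 1
```

Both versions produce the same estimate; only the number of loads differs. In Spark the second pattern corresponds to calling `cache()` or `persist()` on a dataset that is reused across iterations.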
  • 31. Spark Programming Model
      • More generic than MapReduce
  • 32. Is Spark Replacing Hadoop?
      • Spark runs on Hadoop / YARN
      • Can access data in HDFS
      • Uses YARN for clustering
      • Spark programming model is more flexible than MapReduce
      • Spark is really great if data fits in memory (a few hundred gigabytes)
      • Spark is ‘storage agnostic’ (see next slide)
  • 33. Spark & Pluggable Storage (diagram)
      • Spark (compute engine)
      • Storage options: HDFS, Amazon S3, Cassandra, ???
  • 34. Spark & Hadoop
      Use case | Hadoop | Spark
      Batch processing | Hadoop’s MapReduce (Java, Pig, Hive) | Spark RDDs (Java / Scala / Python)
      SQL querying | Hive | Spark SQL
      Stream processing / real-time processing | Storm, Kafka | Spark Streaming
      Machine learning | Mahout | Spark MLlib
      Real-time lookups | HBase (NoSQL) | No native Spark component, but Spark can query data in NoSQL stores
  • 35. Hadoop + YARN: OS for Distributed Compute (or at least, that’s the idea)
      • Applications: batch (MapReduce), streaming (Storm, S4), in-memory (Spark)
      • Cluster management: YARN
      • Storage: HDFS
  • 36. Hadoop & Spark Future ???
  • 37. Going from Hadoop to Spark
      • Introduction
      • Hadoop and Spark Comparison
      • Going from Hadoop to Spark
  • 38. Why Move From Hadoop to Spark?
      • Spark is ‘easier’ than Hadoop
      • ‘Friendlier’ for data scientists / analysts
          – Interactive shell
              • Fast development cycles
              • Ad hoc exploration
      • API supports multiple languages
          – Java, Scala, Python
      • Great for small (gigabytes) to medium (100s of gigabytes) data
  • 39. Spark : ‘Unified’ Stack u  Spark supports multiple programming models – Map reduce style batch processing – Streaming / real time processing – Querying via SQL – Machine learning u  All modules are tightly integrated – Facilitates rich applications u  Spark can be the only stack you need ! – No need to run multiple clusters (Hadoop cluster, Storm cluster, … etc.)
  • 40. Migrating From Hadoop -> Spark
    Functionality | Hadoop | Spark
    Distributed storage | HDFS; cloud storage (Amazon S3) | HDFS; cloud storage (Amazon S3); distributed file system (NFS / Ceph); distributed NoSQL (Cassandra); Tachyon (in memory)
    SQL querying | Hive | Spark SQL (DataFrames)
    ETL workflow | Pig | Spork: Pig on Spark; mix of Spark SQL + RDD programming
    Machine learning | Mahout | MLlib
    NoSQL DB | HBase | ???
  • 41. Things to Consider When Moving From Hadoop to Spark 1.  Data size 2.  File System 3.  Analytics A.  SQL B.  ETL C.  Machine Learning
  • 42. Data Size : “You Don’t Have Big Data”
  • 43. Data Size (T-shirt sizing) [Figure: data sizes from < few G through 10 G+, 100 G+, 1 TB+, 100 TB+, to PB+, labeled ‘Spark’ and ‘Hadoop / Spark’]
  • 44. Data Size u  A lot of Spark adoption at SMALL – MEDIUM scale – Good fit – Data might fit in memory !! u  Applications – Iterative workloads (Machine learning, etc.) – Streaming
  • 45. Decision : Data Size. Data size < 1 TB: Spark. Data size > 1 TB: Hadoop / Spark.
  • 46. Decision : File System “What kind of file system do I need for Spark?”
  • 47. File System u  Hadoop = Storage + Compute u  Spark = Compute only u  Spark needs a distributed FS u  File system choices for Spark – HDFS - Hadoop File System •  Reliable •  Good performance (data locality) •  Field tested for PB of data – S3 : Amazon •  Reliable cloud storage •  Huge scale – NFS : Network File System (shared FS across machines) – Tachyon (in memory - experimental)
  • 48. Spark File Systems
  • 49. File Systems For Spark
    Property | HDFS | NFS | Amazon S3
    Data locality | High (best) | Local enough | None (ok)
    Throughput | High (best) | Medium (good) | Low (ok)
    Latency | Low (best) | Low | High
    Reliability | Very high (replicated) | Low | Very high
    Cost | Varies | Varies | $30 / TB / month
  • 50. File Systems Throughput Comparison u  Data : 10G + (11.3 G) u  Each file : ~1+ G ( x 10) u  400 million records total u  Partition size : 128 M u  On HDFS & S3 u  Cluster : – 8 nodes on Amazon m3.xlarge (4 CPU, 15 G memory, 40 G SSD) – Hadoop cluster, Hortonworks HDP v2.2 – Spark : on the same 8 nodes, stand-alone, v 1.2
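As a rough back-of-the-envelope check on this benchmark setup, the sketch below estimates how many 128 MB partitions the 11.3 GB data set produces and how many task 'waves' that means on 8 nodes with 4 CPUs each. It assumes one task per core and ignores per-file boundaries and input-split rounding rules, so treat the result as an approximation. The object and function names are illustrative only.

```scala
object PartitionEstimate {
  /** Number of fixed-size partitions needed to cover `totalMb` of input. */
  def partitions(totalMb: Double, partitionMb: Double): Int =
    math.ceil(totalMb / partitionMb).toInt

  /** Number of scheduling rounds if every core runs one task at a time. */
  def waves(numPartitions: Int, cores: Int): Int =
    math.ceil(numPartitions.toDouble / cores).toInt

  def main(args: Array[String]): Unit = {
    val totalMb = 11.3 * 1024      // 11.3 GB of input, from the slide
    val n = partitions(totalMb, 128.0)
    val cores = 8 * 4              // 8 nodes x 4 CPUs, from the slide
    println(s"~$n partitions, ~${waves(n, cores)} waves on $cores cores")
  }
}
```

With these inputs the estimate comes to about 91 partitions, or roughly 3 waves of tasks across the 32 cores.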
  • 51. HDFS Vs. S3 (lower is better)
  • 52. HDFS Vs. S3 (lower is better)
  • 53. HDFS Vs. S3 Conclusions
    HDFS | S3
    Data locality -> much higher throughput | Data is streamed -> lower throughput
    Need to maintain a Hadoop cluster | No Hadoop cluster to maintain -> convenient
    Large data sets (TB+) | Good use case: smallish data sets (few gigs); load once, cache, and re-use
  • 54. Decision : File Systems. Already have Hadoop? YES: use HDFS. NO: choose among HDFS, S3, NFS (Ceph), Cassandra (real time).
  • 55. Next Decision : SQL “We use SQL heavily for data mining. We are using Hive / Impala on Hadoop. Is Spark right for us?”
  • 56. SQL in Hadoop / Spark
    Aspect | Hadoop | Spark
    Engine | Hive (on MapReduce, or Tez on Hortonworks); Impala (Cloudera) | Spark SQL using DataFrames; Hive context
    Language | HiveQL | HiveQL; RDD programming in Java / Python / Scala
    Scale | Terabytes / Petabytes | Gigabytes / Terabytes / Petabytes
    Interoperability | Data stored in HDFS | Hive tables; file system
    Formats | CSV, JSON, Parquet | CSV, JSON, Parquet
  • 57. Dataframes Vs. RDDs u  RDDs have data u  DataFrames also have schema u  DataFrames used to be called ‘SchemaRDD’ u  Unified way to load / save data in multiple formats u  Provides high level operations – Count / sum / average – Select columns & filter them
    // load JSON data
    val df = sqlContext.read.format("json").load("/data/data.json")
    // save as Parquet files under a path (faster queries)
    df.write.format("parquet").save("/data/datap/")
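The high-level operations listed on this slide (count / sum / average, select columns and filter) might look like the following. This is a sketch assuming the Spark 1.4-era `SQLContext` API that the deck uses; the object name and the small inline data set are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{avg, count, sum}

object DataFrameOpsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("DataFrameOpsSketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._   // enables .toDF on RDDs of tuples

    // Illustrative data: an RDD of tuples gains a schema when converted to a DataFrame.
    val people = sc.parallelize(Seq(("John", 35), ("Jane", 40), ("Mike", 20), ("Sue", 52)))
      .toDF("name", "age")

    // Select columns and filter rows.
    people.filter(people("age") > 35).select("name", "age").show()

    // Count / sum / average over the whole DataFrame.
    people.agg(count("name"), sum("age"), avg("age")).show()

    sc.stop()
  }
}
```

In spark-shell the `sc` and `sqlContext` values already exist, so only the lines from `val people` onward would be needed there.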
  • 59. Creating a DataFrame From JSON
    {"name": "John", "age": 35 }
    {"name": "Jane", "age": 40 }
    {"name": "Mike", "age": 20 }
    {"name": "Sue", "age": 52 }
    scala> val peopleDF = sqlContext.read.json("people.json")
    peopleDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
    scala> peopleDF.printSchema()
    root
     |-- age: long (nullable = true)
     |-- name: string (nullable = true)
    scala> peopleDF.show()
    +---+----+
    |age|name|
    +---+----+
    | 35|John|
    | 40|Jane|
    | 20|Mike|
    | 52| Sue|
    +---+----+
  • 60. Querying Using SQL u  A DataFrame can be registered as a temporary table – You can then use SQL to query it, as shown below – This is handled similarly to DSL queries - building up an AST and sending it to Catalyst
    scala> df.registerTempTable("people")
    scala> sqlContext.sql("select * from people").show()
    name age
    John 35
    Jane 40
    Mike 20
    Sue  52
    scala> sqlContext.sql("select * from people where age > 35").show()
    name age
    Jane 40
    Sue  52
  • 61. Going From Hive -> Spark u  Spark natively supports querying data stored in Hive tables! u  Handy to use in an existing Hadoop cluster !!
    HIVE
    hive> select customer_id, SUM(cost) as total from billing group by customer_id order by total DESC LIMIT 10;
    SPARK
    val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
    val top10 = hiveCtx.sql("select customer_id, SUM(cost) as total from billing group by customer_id order by total DESC LIMIT 10")
    top10.collect()
  • 62. Spark SQL Vs. Hive: Fast on same HDFS data !
  • 63. Spark SQL Vs. Hive: Fast on same data on HDFS
  • 64. Decision : SQL. Using Hive? YES: Spark using HiveContext. NO: Spark SQL with DataFrames.
  • 65. Next Decision : ETL “We do a lot of ETL work on our Hadoop cluster, using tools like Pig / Cascading. Can we use Spark?”
  • 66. ETL on Hadoop / Spark
    ETL | Hadoop | Spark
    ETL tools | Pig, Cascading, Oozie | Native RDD programming (Scala, Java, Python); Cascading?
    Pig | High level ETL workflow | Spork : Pig on Spark
    Cascading | High level | Spark-scalding
    Cask | Works | Works
  • 67. Data Transformation on Spark u  Dataframes are great for high level manipulation of data – High level operations : Join / Union, etc. – Joining / Merging disparate data sets – Can read and understand a multitude of data formats (JSON / Parquet, etc.) – Very easy to program u  RDD APIs allow low level programming – Complex manipulations – Lookups – Supports multiple languages (Java / Scala / Python) u  High level libraries are emerging – Tresata – CASK
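A small sketch of the high-level join and aggregate operations mentioned on this slide, assuming the Spark 1.4-era DataFrame API. The `billing` columns `customer_id` and `cost` follow the Hive example earlier in the deck; the `customers` data set, the object name, and all values are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("JoinSketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Two disparate data sets (illustrative values).
    val customers = sc.parallelize(Seq((1, "John"), (2, "Jane")))
      .toDF("customer_id", "name")
    val billing = sc.parallelize(Seq((1, 10.0), (1, 5.5), (2, 7.0)))
      .toDF("customer_id", "cost")

    // Aggregate billing per customer, then join with the customer names.
    val totals = billing.groupBy("customer_id").agg(sum("cost").as("total"))
    val report = customers
      .join(totals, customers("customer_id") === totals("customer_id"))
      .select(customers("name"), totals("total"))

    report.show()
    sc.stop()
  }
}
```

The same ETL step could also be written against the RDD API with `join` on key-value pairs when lower-level control is needed.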
  • 68. Decisions : ETL
    Current ETL | Options on Spark
    Pig | Spork; RDD API; DataFrames
    Cascading | Cascading on Spark; RDD; DataFrames
    Java MapReduce / custom | RDD; DataFrames
  • 69. Decision : Machine Learning. Can we use Spark for Machine Learning? YES
  • 70. Machine Learning : Hadoop / Spark
    Aspect | Hadoop | Spark
    Tool | Mahout | MLlib
    API | Java | Java / Scala / Python
    Iterative algorithms | Slower | Very fast (in memory)
    In-memory processing | No | Yes
    Notes | Mahout runs on Hadoop or on Spark | New and young lib
    Latest news! Mahout only accepts new code that runs on Spark. Mahout & MLlib on Spark. Future? Many opinions.
  • 71. Decision : In Memory Process. How can we do in-memory processing using Spark?
  • 72. Numbers Everyone Should Know, by Jeff Dean, Fellow @ Google
    Operation | Cost (in nanoseconds)
    L1 cache reference | 0.5
    Branch mispredict (CPU) | 5
    L2 cache reference | 7
    Mutex lock/unlock | 100
    Main memory reference | 100
    Compress 1K bytes with Zippy | 10,000
    Send 2K bytes over 1 Gbps network | 20,000
    Read 1 MB sequentially from memory | 250,000 (0.25 ms)
    Round trip within same datacenter | 500,000 (0.5 ms)
    Disk seek | 10,000,000 (10 ms)
    Read 1 MB sequentially from network | 10,000,000 (10 ms)
    Read 1 MB sequentially from disk | 30,000,000 (30 ms)
    Send packet CA->Netherlands->CA | 150,000,000 (150 ms)
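To make the gap concrete, the sketch below scales the per-megabyte figures from this slide up to a 1 GB sequential read. It is plain arithmetic on the slide's numbers, not a measurement, and real hardware will differ. The object and function names are illustrative.

```scala
object LatencyNumbers {
  // Cost in nanoseconds to read 1 MB sequentially, from the slide.
  val memoryNsPerMb: Long = 250000L
  val networkNsPerMb: Long = 10000000L
  val diskNsPerMb: Long = 30000000L

  /** Seconds needed to read `mb` megabytes at a given per-MB cost. */
  def secondsFor(mb: Long, nsPerMb: Long): Double = mb * nsPerMb / 1e9

  def main(args: Array[String]): Unit = {
    val oneGbInMb = 1024L
    println(s"memory : ${secondsFor(oneGbInMb, memoryNsPerMb)} s")
    println(s"network: ${secondsFor(oneGbInMb, networkNsPerMb)} s")
    println(s"disk   : ${secondsFor(oneGbInMb, diskNsPerMb)} s")
    println(s"disk / memory ratio: ${diskNsPerMb / memoryNsPerMb}x")
  }
}
```

By these numbers a 1 GB read takes about a quarter of a second from memory versus about 31 seconds from disk, a ratio of 120x, which is the intuition behind keeping working sets in memory.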
  • 73. Spark Caching u  Caching is pretty effective (small / medium data sets) u  Cached data cannot be shared across applications (each application executes in its own sandbox)
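A minimal sketch of caching inside a single application, assuming Spark 1.3 or later running in local mode. The RDD is marked with `cache()`, the first action materializes it in memory, and later passes reuse it. The object name, function name, and synthetic data are illustrative stand-ins for a data set loaded from storage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  /** Runs `passes` computations over one cached RDD and returns each result. */
  def scaledSums(sc: SparkContext, passes: Int): Seq[Double] = {
    // Synthetic data standing in for something read from HDFS or S3.
    val numbers = sc.parallelize(1 to 1000).map(_.toDouble)
    numbers.cache()   // only marks the RDD; the first action below fills the cache

    // Iterative workload: every pass after the first reads from memory.
    val results = (1 to passes).map(i => numbers.map(_ * i).sum())

    numbers.unpersist()
    results
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CachingSketch").setMaster("local[*]"))
    scaledSums(sc, 5).zipWithIndex.foreach { case (total, i) =>
      println(s"pass ${i + 1}: $total")
    }
    sc.stop()
  }
}
```

The cache lives inside this application's SparkContext, which is why another application cannot see it.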
  • 76. Sharing Cached Data u  By default Spark applications cannot share cached data – Running in isolation u  1) ‘spark job server’ – Multiplexer – All requests are executed through the same ‘context’ – Provides a web-service interface u  2) Tachyon – Distributed in-memory file system – Memory is the new disk! – Out of AMP Lab, Berkeley – Early stages (very promising)
  • 77. Spark Job Server
  • 78. Spark Job Server u  Open sourced from Ooyala u  ‘Spark as a Service’ – simple REST interface to launch jobs u  Sub-second latency ! u  Pre-load jars for even faster spinup u  Share cached RDDs across requests (NamedRDD) u  https://github.com/spark-jobserver/spark-jobserver
    App1: sharedCtx.saveRDD("my cached rdd", rdd1)
    App2: RDD rdd2 = sharedCtx.loadRDD("my cached rdd")
  • 79. Tachyon + Spark
  • 80. How to Get Spark?
  • 81. Getting Spark
    Running Hadoop?
      YES: Install Spark on Hadoop cluster
      NO: Need HDFS?
        YES: Install HDFS + YARN + Spark
        NO: Environment?
          Production: Spark + Mesos + S3
          Testing: Spark (standalone) + NFS or S3
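The decision flow on this slide can be written down as a small function. This is only an illustrative encoding of the slide's questions and answers; the object, type, and function names are invented here.

```scala
object GettingSpark {
  sealed trait Environment
  case object Production extends Environment
  case object Testing extends Environment

  /** Returns the setup suggested by the slide for a given situation. */
  def recommend(runningHadoop: Boolean, needHdfs: Boolean, env: Environment): String =
    if (runningHadoop) "Install Spark on Hadoop cluster"
    else if (needHdfs) "Install HDFS + YARN + Spark"
    else env match {
      case Production => "Spark + Mesos + S3"
      case Testing    => "Spark (standalone) + NFS or S3"
    }

  def main(args: Array[String]): Unit = {
    // Example: no Hadoop today, no need for HDFS, just experimenting.
    println(recommend(runningHadoop = false, needHdfs = false, env = Testing))
  }
}
```

The environment only matters on the branch where there is neither an existing Hadoop cluster nor a need for HDFS.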
  • 82. Spark Cluster Setup 1 : Simple u  Great for POCs / experimentation u  No dependencies u  Using Spark’s ‘stand alone’ manager
  • 83. Spark Cluster Setup 2 : Production u  Works well with Hadoop eco system (HDFS / Hive ..etc) u  Best way to adopt Spark on Hadoop u  Uses YARN as cluster manager
  • 84. Spark Cluster Setup 3 : Production u  Uses Mesos as cluster manager
  • 85. Hadoop -> Spark Case Study
  • 86. Use Case 1 : Moving to Cloud
  • 87. Use Case 1 : Lessons Learned u  Size – Small Hadoop cluster (8 nodes) – Smallish data : 50G – 300G – Data for processing : a few gigs per query u  Good ! – Only one moving part: 'spark' – No Hadoop cluster to maintain – S3 was a dependable storage (passive) – Query response time went from minutes to seconds (because we went from MR -> Spark) u  Not so good – We lost the data locality of HDFS (ok for small/medium data sets)
  • 88. Use Case 2 : Persistent Caching in Spark u  Can we improve latency in this setup? u  Caching will help u  However, in Spark cached data cannot be shared across applications :(
  • 89. Use Case 2 : Persistent Caching in Spark u  Spark Job Server to the rescue !
  • 90. Final Thoughts u  Already on Hadoop? – Try Spark side-by-side – Process some data in HDFS – Try Spark SQL for Hive tables u  Contemplating Hadoop? – Try Spark (standalone) – Choose NFS or S3 file system u  Take advantage of caching – Iterative loads – Spark Job server – Tachyon
  • 91. Thanks and questions? Sujee Maniyam Founder / Principal @ ElephantScale Expert Consulting + Training in Big Data technologies sujee@elephantscale.com Elephantscale.com Sign up for upcoming trainings : ElephantScale.com/training