SlideShare a Scribd company logo
1 of 47
© 2014 MapR Technologies 1© 2014 MapR Technologies
adawar@mapr.com
pat.mcdonough@databricks.com
© 2014 MapR Technologies 2
About MapR and Databricks
• Project leads for Spark,
formerly with UC Berkeley’s
AMPLab
• Founded in June 2013 and
backed by Andreessen
Horowitz
• Strong Engineering focus
* Forrester Wave Big Data Hadoop Solutions, Q1 2014
• Top Ranked distribution for
Hadoop*
• Hundreds of deployments
– 17 of Fortune 100
– Largest deployment in FSI
(1000+ nodes)
• Strong focus on making
Hadoop resilient and
enterprise grade
• Worldwide Presence
© 2014 MapR Technologies 3
Hadoop Evolves
Make it solid
• HA: eliminate SPOFs
• Data Protection: recover
from application/user
errors
• Disaster Recovery: data
center outages
• Enterprise Integration:
breaking the wall that
separates Hadoop from
the rest
• Security & Multi-
tenancy: sharing the
cluster and meeting
SLA’s, secure
authorization, data
governance
Make it do more
(easily)
• Interactive apps (i.e.
SQL)
• Iterative programs
• Streaming apps
• Medium/Small Data
• Architecture: using
memory efficiently
• How many different tools
should it take?
– It’s hard to get
interoperability amongst
different data-parallel models
right
– Learning curves and
operational costs increase
with each new tool
© 2014 MapR Technologies 4
MapR – Top ranked Hadoop distribution
Management
MapR Data Platform
APACHE HADOOP AND OSS ECOSYSTEM
Security
YARN
Pig
Cascading
Batch
Storm*
Streaming
HBase
Solr
NoSQL &
Search
Juju
Provisioning /
coordination
Savannah*
Mahout
ML, Graph
MapReduce
v1 & v2
EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS
Workflow
& Data
Governance
Tez*
Accumulo*
Hive
Impala
Drill*
SQL
Sentry* Oozie ZooKeeperSqoop
Knox* WhirrFalcon*Flume
Data
Integratio
n
& Access
HttpFS
Hue
* Certification/support planned for 2014
Enterprise-grade Security OperationalPerformance Multi-tenancyInteroperability
• High availability
• Data protection
• Disaster recovery
• Standard file
access
• Standard
database access
• Pluggable
services
• Broad developer
support
• Enterprise
security
authorization
• Wire-level
authentication
• Data governance
• Ability to support
predictive
analytics, real-
time database
operations, and
support high
arrival rate data
• Ability to logically
divide a cluster to
support different
use cases, job
types, user
groups, and
administrators
• 2X to 7X higher
performance
• Consistent, low
latency
* Forrester Wave Big Data Hadoop Solutions, Q1 2014
© 2014 MapR Technologies 5
MapR – The Only Distribution to Integrate
the Complete Apache Spark Stack
Management
MapR Data Platform
APACHE HADOOP AND OSS ECOSYSTEM
Security
YARN
Pig
Cascading
Batch
Storm*
Streaming
HBase
Solr
NoSQL &
Search
Juju
Provisioning
&
coordination
Savannah*
Mahout
ML, Graph
MapReduce
v1 & v2
EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS
Workflow
& Data
Governan
ce
Tez*
Accumulo*
Hive
Impala
Drill*
SQL
Sentry* Oozie ZooKeeperSqoop
Knox* WhirrFalcon*Flume
Data
Integratio
n
& Access
HttpFS
Hue
* Certification/support planned for 2014
Shark
(SQL)
Spark
Streaming
(Streaming)
MLLib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
Spark
Spark
Streaming
MLLib
GraphX Shark
© 2014 MapR Technologies 6
Spark on MapR
World-record performance on
disk coupled with in-memory
processing advantages
High Performance
Industry-leading enterprise-grade
High Availability, Data Protection
and Disaster Recovery
Enterprise-grade dependability for
Spark
Strategic partnership with
Databricks to ensure enterprise
support for the entire stack
24/7 Best-in-class Global Support
Spark stack can also be deployed
natively as an independent
standalone service on the MapR
cluster
Can Run Natively on MapR
Apache Spark
Apache Spark
spark.apache.org
github.com/apache/spark
user@spark.apache.org
• Originally developed in 2009
in UC Berkeley’s AMP Lab
• Fully open sourced in 2010
• Top-level Apache Project as of
2014
The Spark Community
Spark is The Most Active Open Source
Project in Big Data
Giraph
Storm
Tez
0
20
40
60
80
100
120
140
Projectcontributorsinpastyear
Spark: Easy and Fast Big Data
Easy to Develop
> Rich APIs in
Java, Scala, Pytho
n
> Interactive shell
Fast to Run
> General execution
graphs
> In-memory storage
Spark: Easy and Fast Big Data
Easy to Develop
> Rich APIs in
Java, Scala, Pytho
n
> Interactive shell
Fast to Run
> General execution
graphs
> In-memory storage
2-5× less code Up to 10× faster on disk,
100× in memory
Easy: Get Started Immediately
• Multi-language support
• Interactive Shell
Python
lines = sc.textFile(...)
lines.filter(lambda s: “ERROR” in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains(“ERROR”)).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
Boolean call(String s) {
return s.contains(“error”);
}
}).count();
Easy: Get Started Immediately
• Multi-language support
• Interactive Shell
Python
lines = sc.textFile(...)
lines.filter(lambda s: “ERROR” in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains(“ERROR”)).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
Boolean call(String s) {
return s.contains(“error”);
}
}).count();
Java 8 (Coming Soon)
JavaRDD<String> lines = sc.textFile(...)
lines.filter(x -> x.contains(“ERROR”)).count()
Easy: Clean API
Resilient Distributed Datasets
• Collections of objects spread
across a cluster, stored in RAM
or on Disk
• Built through parallel
transformations
• Automatically rebuilt on failure
Operations
• Transformations
(e.g.
map, filter, groupBy)
• Actions
(e.g.
count, collect, save)
Write programs in terms of transformations on
distributed datasets
Easy: Expressive API
map reduce
Easy: Expressive API
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save ...
Easy: Example – Word Count
Spark
public static class WordCountMapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
public static class WorkdCountReduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Hadoop MapReduce
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Easy: Example – Word Count
Spark
public static class WordCountMapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
public static class WorkdCountReduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Hadoop MapReduce
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Easy: Works Well With Hadoop
Data Compatibility
• Access your existing
Hadoop Data
• Use the same data
formats
• Adheres to data locality
for efficient processing
Deployment Models
• “Standalone” deployment
• YARN-based deployment
• Mesos-based deployment
• Deploy on existing
Hadoop cluster or side-
by-side
Easy: User-Driven Roadmap
Language support
> Improved Python
support
> SparkR
> Java 8
> Integrated Schema
and SQL support in
Spark’s APIs
Better ML
> Sparse Data Support
> Model Evaluation
Framework
> Performance Testing
Example: Logistic Regression
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
gradient = data
.map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))))
* p.y * p.x)
.reduce(lambda x, y: x + y)
w -= gradient
print “Final w: %s” % w
Fast: Logistic Regression Performance
0
500
1000
1500
2000
2500
3000
3500
4000
1 5 10 20 30
RunningTime(s)
Number of Iterations
Hadoop
Spark
110 s / iteration
first iteration 80 s
further iterations 1 s
Fast: Using RAM, Operator Graphs
In-memory Caching
• Data Partitions read from
RAM instead of disk
Operator Graphs
• Scheduling Optimizations
• Fault Tolerance
= cached partition
= RDD
join
filter
groupBy
Stage 3
Stage 1
Stage 2
A: B:
C: D: E:
F:
map
Fast: Scaling Down
69
58
41
30
12
0
20
40
60
80
100
Cache
disabled
25% 50% 75% Fully
cached
Executiontime(s)
% of working set in cache
Easy: Fault Recovery
RDDs track lineage information that can be used to
efficiently recompute lost data
msgs = textFile.filter(lambda s: s.startsWith(“ERROR”))
.map(lambda s: s.split(“t”)[2])
HDFS File Filtered RDD Mapped RDD
filter
(func = startsWith(…))
map
(func = split(...))
Easy: Unified Platform
Spark SQL
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
Continued innovation bringing new functionality, e.g.:
• BlinkDB (Approximate Queries)
• SparkR (R wrapper for Spark)
• Tachyon (off-heap RDD caching)
Spark SQL
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
Hive Compatibility
• Interfaces to access data and code in the Hive
ecosystem:
o Support for writing queries in HQL
o Catalog for that interfaces with the
Hive MetaStore
o Tablescan operator that uses Hive SerDes
o Wrappers for Hive UDFs, UDAFs, UDTFs
Parquet Support
Native support for reading data stored in
Parquet:
• Columnar storage avoids reading
unneeded data.
• Currently only supports flat structures
(nested data on short-term roadmap).
• RDDs can be written to parquet
files, preserving the schema.
Mixing SQL and Machine Learning
val trainingDataTable = sql(""" SELECT
e.action, u.age, u.latitude, u.logitude FROM Users u
JOIN Events e ON u.userId = e.userId""")// Since `sql`
returns an RDD, the results of can be easily used in MLlib
val trainingData = trainingDataTable.map { row =>
val features = Array[Double](row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
val model = new
LogisticRegressionWithSGD().run(trainingData)
Relationship to
Borrows
• Hive data loading code / in-
memory columnar
representation
• hardened spark execution
engine
Adds
• RDD-aware optimizer /
query planner
• execution engine
• language interfaces.
Catalyst/SparkSQL is a nearly from scratch
rewrite that leverages the best parts of Shark
Shark
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
Spark Streaming
Run a streaming computation as a series of very
small, deterministic batch jobs
34
Spark
Spark
Streaming
batches of X
seconds
live data stream
processed
results
• Chop up the live stream into batches of
½ second or more, leverage RDDs for
micro-batch processing
• Use the same familiar Spark APIs to
process streams
• Combine your batch and online
processing in a single system
• Guarantee exactly-once semantics
DStream of data
Window-based Transformations
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
sliding window
operation
window length sliding interval
window length
sliding interval
Shark
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
MLlib – Machine Learning library
Logis] c*Regression,*Linear*SVM*(+L1,*L2),*Decision*
Trees,*Naive*Bayes"
Linear*Regression*(+Lasso,*Ridge)*
Alterna] ng*Least*Squares*
KZMeans,*SVD*
SGD,*Parallel*Gradient*
Scala,*Java,*PySpark*(0.9)
MLlib
Classifica. on:"
Regression:"
Collabora. ve"Filtering:"
Clustering"/"Explora. on:"
Op. miza. on"Primi. ves:"
Interopera. lity:"
Shark
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
Enabling users to easily and efficiently
express the entire graph analytics
pipeline
New API
Blurs the distinction
between Tables and
Graphs
New System
Combines Data-Parallel
Graph-Parallel Systems
The GraphX Unified Approach
Easy: Unified Platform
Shark
(SQL)
Spark
Streaming
(Streaming)
MLlib
(Machine
learning)
Spark (General execution engine)
GraphX
(Graph
computation)
Continued innovation bringing new functionality, e.g.,:
• BlinkDB (Approximate Queries)
• SparkR (R wrapper for Spark)
• Tachyon (off-heap RDD caching)
Use Cases
Interactive Exploratory Analytics
• Leverage Spark’s in-memory caching and efficient
execution to explore large distributed datasets
• Use Spark’s APIs to explore any kind of data
(structured, unstructured, semi-structured, etc.) and
combine programming models
• Execute arbitrary code using a fully-functional interactive
programming environment
• Connect external tools via SQL Drivers
Machine Learning
• Improve performance of iterative algorithms by caching
frequently accessed datasets
• Develop programs that are easy to reason using a fully-
capable functional programming style
• Refine algorithms using the interactive REPL
• Use carefully-curated algorithms out-of-the-box with
MLlib
Power Real-time Dashboards
• Use Spark Streaming to perform low-latency window-
based aggregations
• Combine offline models with streaming data for online
clustering and classification within the dashboard
• Use Spark’s core APIs and/or Spark SQL to give users
large-scale, low-latency drill-down capabilities in
exploring dashboard data
Faster ETL
• Leverage Spark’s optimized scheduling for more efficient
I/O on large datasets, and in-memory processing for
aggregations, shuffles, and more
• Use Spark SQL to perform ETL using a familiar SQL
interface
• Easily port PIG scripts to Spark’s API
• Run existing HIVE queries directly on Spark SQL or
Shark
San Francisco
June 30 – July 2
• Use Cases
• Tech Talks
• Training
http://spark-summit.org/
© 2014 MapR Technologies 47
Q&A
@mapr maprtech
adawar@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

More Related Content

What's hot

Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigDataWorks Summit/Hadoop Summit
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEDataWorks Summit/Hadoop Summit
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Richard Seymour
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...spinningmatt
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
 

What's hot (20)

Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
 
PySaprk
PySaprkPySaprk
PySaprk
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 

Viewers also liked

Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionTed Dunning
 
Extending Appcelerator Titanium Mobile through Native Modules
Extending Appcelerator Titanium Mobile through Native ModulesExtending Appcelerator Titanium Mobile through Native Modules
Extending Appcelerator Titanium Mobile through Native Modulesomorandi
 
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Yahoo Developer Network
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkVince Gonzalez
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLNick Dimiduk
 
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Spark Summit
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)Nick Dimiduk
 
Data Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operationsData Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operationsVincenzo Gulisano
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark OverviewairisData
 
[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming OverviewStratio
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Spark Summit
 
Pancasila sebagai konteks ketatanegaraan
Pancasila sebagai konteks ketatanegaraanPancasila sebagai konteks ketatanegaraan
Pancasila sebagai konteks ketatanegaraanElla Feby
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
 
Spark architecture
Spark architectureSpark architecture
Spark architecturedatamantra
 
Not Your Father's Database by Vida Ha
Not Your Father's Database by Vida HaNot Your Father's Database by Vida Ha
Not Your Father's Database by Vida HaSpark Summit
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBaseCarol McDonald
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for ArchitectsNick Dimiduk
 

Viewers also liked (20)

Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly Detection
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Extending Appcelerator Titanium Mobile through Native Modules
Extending Appcelerator Titanium Mobile through Native ModulesExtending Appcelerator Titanium Mobile through Native Modules
Extending Appcelerator Titanium Mobile through Native Modules
 
Big Data at Riot Games
Big Data at Riot GamesBig Data at Riot Games
Big Data at Riot Games
 
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQL
 
Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0
 
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)
 
Data Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operationsData Streaming (in a Nutshell) ... and Spark's window operations
Data Streaming (in a Nutshell) ... and Spark's window operations
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
 
Pancasila sebagai konteks ketatanegaraan
Pancasila sebagai konteks ketatanegaraanPancasila sebagai konteks ketatanegaraan
Pancasila sebagai konteks ketatanegaraan
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Not Your Father's Database by Vida Ha
Not Your Father's Database by Vida HaNot Your Father's Database by Vida Ha
Not Your Father's Database by Vida Ha
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for Architects
 

Similar to Let Spark Fly: Advantages and Use Cases for Spark on Hadoop

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014NoSQLmatters
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionChirag Ahuja
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?rhatr
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellKhalid Imran
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Dataconomy Media
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Milos Milovanovic
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsgagravarr
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterDataWorks Summit
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Darko Marjanovic
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 

Similar to Let Spark Fly: Advantages and Use Cases for Spark on Hadoop (20)

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
 
Is Spark Replacing Hadoop
Is Spark Replacing HadoopIs Spark Replacing Hadoop
Is Spark Replacing Hadoop
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 

More from MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

More from MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Recently uploaded

Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 

Recently uploaded (20)

Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 

Let Spark Fly: Advantages and Use Cases for Spark on Hadoop

  • 1. © 2014 MapR Technologies 1© 2014 MapR Technologies adawar@mapr.com pat.mcdonough@databricks.com
  • 2. © 2014 MapR Technologies 2 About MapR and Databricks • Project leads for Spark, formerly with UC Berkeley’s AMPLab • Founded in June 2013 and backed by Andreessen Horowitz • Strong Engineering focus * Forrester Wave Big Data Hadoop Solutions, Q1 2014 • Top Ranked distribution for Hadoop* • Hundreds of deployments – 17 of Fortune 100 – Largest deployment in FSI (1000+ nodes) • Strong focus on making Hadoop resilient and enterprise grade • Worldwide Presence
  • 3. © 2014 MapR Technologies 3 Hadoop Evolves Make it solid • HA: eliminate SPOFs • Data Protection: recover from application/user errors • Disaster Recovery: data center outages • Enterprise Integration: breaking the wall that separates Hadoop from the rest • Security & Multi- tenancy: sharing the cluster and meeting SLA’s, secure authorization, data governance Make it do more (easily) • Interactive apps (i.e. SQL) • Iterative programs • Streaming apps • Medium/Small Data • Architecture: using memory efficiently • How many different tools should it take? – It’s hard to get interoperability amongst different data-parallel models right – Learning curves and operational costs increase with each new tool
  • 4. © 2014 MapR Technologies 4 MapR – Top ranked Hadoop distribution Management MapR Data Platform APACHE HADOOP AND OSS ECOSYSTEM Security YARN Pig Cascading Batch Storm* Streaming HBase Solr NoSQL & Search Juju Provisioning / coordination Savannah* Mahout ML, Graph MapReduce v1 & v2 EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS Workflow & Data Governance Tez* Accumulo* Hive Impala Drill* SQL Sentry* Oozie ZooKeeperSqoop Knox* WhirrFalcon*Flume Data Integratio n & Access HttpFS Hue * Certification/support planned for 2014 Enterprise-grade Security OperationalPerformance Multi-tenancyInteroperability • High availability • Data protection • Disaster recovery • Standard file access • Standard database access • Pluggable services • Broad developer support • Enterprise security authorization • Wire-level authentication • Data governance • Ability to support predictive analytics, real- time database operations, and support high arrival rate data • Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators • 2X to 7X higher performance • Consistent, low latency * Forrester Wave Big Data Hadoop Solutions, Q1 2014
  • 5. © 2014 MapR Technologies 5 MapR – The Only Distribution to Integrate the Complete Apache Spark Stack Management MapR Data Platform APACHE HADOOP AND OSS ECOSYSTEM Security YARN Pig Cascading Batch Storm* Streaming HBase Solr NoSQL & Search Juju Provisioning & coordination Savannah* Mahout ML, Graph MapReduce v1 & v2 EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS Workflow & Data Governan ce Tez* Accumulo* Hive Impala Drill* SQL Sentry* Oozie ZooKeeperSqoop Knox* WhirrFalcon*Flume Data Integratio n & Access HttpFS Hue * Certification/support planned for 2014 Shark (SQL) Spark Streaming (Streaming) MLLib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Spark Spark Streaming MLLib GraphX Shark
  • 6. © 2014 MapR Technologies 6 Spark on MapR World-record performance on disk coupled with in-memory processing advantages High Performance Industry-leading enterprise-grade High Availability, Data Protection and Disaster Recovery Enterprise-grade dependability for Spark Strategic partnership with Databricks to ensure enterprise support for the entire stack 24/7 Best-in-class Global Support Spark stack can also be deployed natively as an independent standalone service on the MapR cluster Can Run Natively on MapR
  • 8. Apache Spark spark.apache.org github.com/apache/spark user@spark.apache.org • Originally developed in 2009 in UC Berkeley’s AMP Lab • Fully open sourced in 2010 • Top-level Apache Project as of 2014
  • 10. Spark is The Most Active Open Source Project in Big Data Giraph Storm Tez 0 20 40 60 80 100 120 140 Projectcontributorsinpastyear
  • 11. Spark: Easy and Fast Big Data Easy to Develop > Rich APIs in Java, Scala, Pytho n > Interactive shell Fast to Run > General execution graphs > In-memory storage
  • 12. Spark: Easy and Fast Big Data Easy to Develop > Rich APIs in Java, Scala, Pytho n > Interactive shell Fast to Run > General execution graphs > In-memory storage 2-5× less code Up to 10× faster on disk, 100× in memory
  • 13. Easy: Get Started Immediately • Multi-language support • Interactive Shell Python lines = sc.textFile(...) lines.filter(lambda s: “ERROR” in s).count() Scala val lines = sc.textFile(...) lines.filter(x => x.contains(“ERROR”)).count() Java JavaRDD<String> lines = sc.textFile(...); lines.filter(new Function<String, Boolean>() { Boolean call(String s) { return s.contains(“error”); } }).count();
  • 14. Easy: Get Started Immediately • Multi-language support • Interactive Shell Python lines = sc.textFile(...) lines.filter(lambda s: “ERROR” in s).count() Scala val lines = sc.textFile(...) lines.filter(x => x.contains(“ERROR”)).count() Java JavaRDD<String> lines = sc.textFile(...); lines.filter(new Function<String, Boolean>() { Boolean call(String s) { return s.contains(“error”); } }).count(); Java 8 (Coming Soon) JavaRDD<String> lines = sc.textFile(...) lines.filter(x -> x.contains(“ERROR”)).count()
  • 15. Easy: Clean API Resilient Distributed Datasets • Collections of objects spread across a cluster, stored in RAM or on Disk • Built through parallel transformations • Automatically rebuilt on failure Operations • Transformations (e.g. map, filter, groupBy) • Actions (e.g. count, collect, save) Write programs in terms of transformations on distributed datasets
  • 18. Easy: Example – Word Count Spark public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } } public static class WorkdCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } Hadoop MapReduce val spark = new SparkContext(master, appName, [sparkHome], [jars]) val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  • 19. Easy: Example – Word Count Spark public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } } public static class WorkdCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } Hadoop MapReduce val spark = new SparkContext(master, appName, [sparkHome], [jars]) val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  • 20. Easy: Works Well With Hadoop Data Compatibility • Access your existing Hadoop Data • Use the same data formats • Adheres to data locality for efficient processing Deployment Models • “Standalone” deployment • YARN-based deployment • Mesos-based deployment • Deploy on existing Hadoop cluster or side- by-side
  • 21. Easy: User-Driven Roadmap Language support > Improved Python support > SparkR > Java 8 > Integrated Schema and SQL support in Spark’s APIs Better ML > Sparse Data Support > Model Evaluation Framework > Performance Testing
  • 22. Example: Logistic Regression data = spark.textFile(...).map(readPoint).cache() w = numpy.random.rand(D) for i in range(iterations): gradient = data .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) .reduce(lambda x, y: x + y) w -= gradient print “Final w: %s” % w
  • 23. Fast: Logistic Regression Performance 0 500 1000 1500 2000 2500 3000 3500 4000 1 5 10 20 30 RunningTime(s) Number of Iterations Hadoop Spark 110 s / iteration first iteration 80 s further iterations 1 s
  • 24. Fast: Using RAM, Operator Graphs In-memory Caching • Data Partitions read from RAM instead of disk Operator Graphs • Scheduling Optimizations • Fault Tolerance = cached partition = RDD join filter groupBy Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: map
  • 25. Fast: Scaling Down 69 58 41 30 12 0 20 40 60 80 100 Cache disabled 25% 50% 75% Fully cached Executiontime(s) % of working set in cache
  • 26. Easy: Fault Recovery RDDs track lineage information that can be used to efficiently recompute lost data msgs = textFile.filter(lambda s: s.startsWith(“ERROR”)) .map(lambda s: s.split(“t”)[2]) HDFS File Filtered RDD Mapped RDD filter (func = startsWith(…)) map (func = split(...))
  • 27. Easy: Unified Platform Spark SQL (SQL) Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Continued innovation bringing new functionality, e.g.: • BlinkDB (Approximate Queries) • SparkR (R wrapper for Spark) • Tachyon (off-heap RDD caching)
  • 29. Hive Compatibility • Interfaces to access data and code in the Hive ecosystem: o Support for writing queries in HQL o Catalog for that interfaces with the Hive MetaStore o Tablescan operator that uses Hive SerDes o Wrappers for Hive UDFs, UDAFs, UDTFs
  • 30. Parquet Support Native support for reading data stored in Parquet: • Columnar storage avoids reading unneeded data. • Currently only supports flat structures (nested data on short-term roadmap). • RDDs can be written to parquet files, preserving the schema.
  • 31. Mixing SQL and Machine Learning val trainingDataTable = sql(""" SELECT e.action, u.age, u.latitude, u.logitude FROM Users u JOIN Events e ON u.userId = e.userId""")// Since `sql` returns an RDD, the results of can be easily used in MLlib val trainingData = trainingDataTable.map { row => val features = Array[Double](row(1), row(2), row(3)) LabeledPoint(row(0), features) } val model = new LogisticRegressionWithSGD().run(trainingData)
  • 32. Relationship to Borrows • Hive data loading code / in- memory columnar representation • hardened spark execution engine Adds • RDD-aware optimizer / query planner • execution engine • language interfaces. Catalyst/SparkSQL is a nearly from scratch rewrite that leverages the best parts of Shark
  • 34. Spark Streaming Run a streaming computation as a series of very small, deterministic batch jobs 34 Spark Spark Streaming batches of X seconds live data stream processed results • Chop up the live stream into batches of ½ second or more, leverage RDDs for micro-batch processing • Use the same familiar Spark APIs to process streams • Combine your batch and online processing in a single system • Guarantee exactly-once semantics
  • 35. DStream of data Window-based Transformations val tweets = ssc.twitterStream() val hashTags = tweets.flatMap(status => getTags(status)) val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue() sliding window operation window length sliding interval window length sliding interval
  • 37. MLlib – Machine Learning library Logis] c*Regression,*Linear*SVM*(+L1,*L2),*Decision* Trees,*Naive*Bayes" Linear*Regression*(+Lasso,*Ridge)* Alterna] ng*Least*Squares* KZMeans,*SVD* SGD,*Parallel*Gradient* Scala,*Java,*PySpark*(0.9) MLlib Classifica. on:" Regression:" Collabora. ve"Filtering:" Clustering"/"Explora. on:" Op. miza. on"Primi. ves:" Interopera. lity:"
  • 39. Enabling users to easily and efficiently express the entire graph analytics pipeline New API Blurs the distinction between Tables and Graphs New System Combines Data-Parallel Graph-Parallel Systems The GraphX Unified Approach
  • 40. Easy: Unified Platform Shark (SQL) Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Continued innovation bringing new functionality, e.g.,: • BlinkDB (Approximate Queries) • SparkR (R wrapper for Spark) • Tachyon (off-heap RDD caching)
  • 42. Interactive Exploratory Analytics • Leverage Spark’s in-memory caching and efficient execution to explore large distributed datasets • Use Spark’s APIs to explore any kind of data (structured, unstructured, semi-structured, etc.) and combine programming models • Execute arbitrary code using a fully-functional interactive programming environment • Connect external tools via SQL Drivers
  • 43. Machine Learning • Improve performance of iterative algorithms by caching frequently accessed datasets • Develop programs that are easy to reason using a fully- capable functional programming style • Refine algorithms using the interactive REPL • Use carefully-curated algorithms out-of-the-box with MLlib
  • 44. Power Real-time Dashboards • Use Spark Streaming to perform low-latency window- based aggregations • Combine offline models with streaming data for online clustering and classification within the dashboard • Use Spark’s core APIs and/or Spark SQL to give users large-scale, low-latency drill-down capabilities in exploring dashboard data
  • 45. Faster ETL • Leverage Spark’s optimized scheduling for more efficient I/O on large datasets, and in-memory processing for aggregations, shuffles, and more • Use Spark SQL to perform ETL using a familiar SQL interface • Easily port PIG scripts to Spark’s API • Run existing HIVE queries directly on Spark SQL or Shark
  • 46. San Francisco June 30 – July 2 • Use Cases • Tech Talks • Training http://spark-summit.org/
  • 47. © 2014 MapR Technologies 47 Q&A @mapr maprtech adawar@mapr.com Engage with us! MapR maprtech mapr-technologies

Editor's Notes

  1. The power of MapR begins with the power of open source innovation and community participation.In some cases MapR leads the community in projects like Apache Mahout (machine learning) or Apache Drill (SQL on Hadoop)In other areas, MapR contributes, integrates Apache and other open source software (OSS) projects into the MapR distribution, delivering a more reliable and performant system with lower overall TCO and easier system management.MapR releases a new version with the latest OSS innovations on a monthly basis. We add 2-4 new Apache projects annually as new projects become production ready and based on customer demand.
  2. The power of MapR begins with the power of open source innovation and community participation.In some cases MapR leads the community in projects like Apache Mahout (machine learning) or Apache Drill (SQL on Hadoop)In other areas, MapR contributes, integrates Apache and other open source software (OSS) projects into the MapR distribution, delivering a more reliable and performant system with lower overall TCO and easier system management.MapR releases a new version with the latest OSS innovations on a monthly basis. We add 2-4 new Apache projects annually as new projects become production ready and based on customer demand.
  3. You can find Project Resources on the Apache Incubator siteYou’ll also find information about the mailing list there (including archives)
  4. One of the most exciting things you’ll findGrowing all the timeNASCAR slideIncluding several sponsors of this event are just starting to get involved…If your logo is not up here, forgive us – it’s hard to keep up!
  5. Key idea: add “variables” to the “functions” in functional programming
  6. This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  7. Gracefully