Tim Vaillancourt
Sr. Technical Operations Architect
One Tool to Rule Them All: Seamless SQL on
MongoDB, MySQL and Redis with Apache Spark
Starring..
• MySQL as “the RDBMS”
• MongoDB as “the document-store”
• Redis as “the in-memory K/V store”
About Me
• Joined Percona in January 2016
• Sr Technical Operations Architect for MongoDB
• Previous:
  • EA DICE (MySQL DBA)
  • EA SPORTS (Sys/NoSQL DBA Ops)
  • Amazon/AbeBooks Inc (Sys/MySQL+NoSQL DBA Ops)
• Main techs: MySQL, MongoDB, Cassandra, Solr, Redis, queues, etc
• 10+ years tuning Linux for database workloads (off and on)
• NOT an Apache Spark expert
Apache Spark
• “…is a fast and general engine for large-scale data processing”
• Written in Scala, uses the Akka actor toolkit (in Spark 1.x) and runs on the JVM
• Supports jobs written in Java, Scala, Python and R
• Pluggable datasources for various file types, databases, SaaS products, etc
• Fast and efficient: jobs work on datasources quickly in parallel
• Optional clustering
  • Master/Slave Spark cluster, ZooKeeper for Master elections
  • Slave workers connect to master
  • Master distributes tasks evenly to available workers
• Streaming and Machine Learning (MLlib) capabilities
• Programmatic and SQL(!) querying capabilities
Apache Spark: Software Architecture
• Jobs go to the Cluster Master or run in the client JVM directly
• Cluster Master directs jobs to nodes with available resources via messages
• Cluster Master HA
  • Slaves reconnect to Master
  • Apache ZooKeeper for true HA
Apache Spark: Hadoop Comparison
• Hadoop MapReduce
  • Batch-based
  • Uses data less efficiently
  • Relatively hard to develop/maintain
• Spark
  • Stream Processing
  • Fast/Parallelism
    • Prefers memory as much as possible in jobs
    • Divides work into many lightweight sub-tasks in threads
  • Datasources
    • Uses datasource-awareness to scale (eg: indices, shard-awareness, etc)
    • Spark allows processing and storage to scale separately
Apache Spark: RDDs and DataFrames
• RDD: Resilient Distributed Dataset
  • Original API to access data in Spark
  • Lazy: does not access data until a real action is performed
  • Spark’s optimiser cannot see inside
  • RDDs are slow in Python
• DataFrames API
  • Higher-level API, focused on “what” is being done
  • Has schemas / table-like
  • Interchangeable programming and SQL APIs
  • Much easier to read and comprehend
  • Optimises execution plan
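The “lazy” point on this slide can be illustrated without a cluster. What follows is a plain-Python analogy built on generators, not Spark code: building the pipeline stands in for chaining transformations, and `list()` plays the role of an action. The names `read_samples` and `reads` and the sample values are made up for illustration.

```python
# Plain-Python analogy for lazy evaluation (this is NOT Spark code).
reads = []  # records which samples were actually read from the "source"


def read_samples(samples):
    """Yield samples one at a time, logging each read."""
    for sample in samples:
        reads.append(sample)
        yield sample


samples = [11.5, 12.0, 13.5, 9.0]  # illustrative water temperatures

# "Transformations": defining the pipeline reads nothing yet.
warm = (s for s in read_samples(samples) if s > 10.0)
doubled = (s * 2 for s in warm)
reads_before_action = len(reads)

# "Action": only now is the source actually consumed.
result = list(doubled)
reads_after_action = len(reads)

print(reads_before_action, result, reads_after_action)
```

Spark defers work in a similar spirit: nothing touches the datasource until an action such as a count or a collect asks for a result.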
Apache Spark: Datasources
• Provides a pluggable mechanism for accessing structured data through Spark SQL
• At least these databases are supported in some way
  • MySQL
  • MongoDB
  • Redis
  • Cassandra
  • Postgres
  • HBase
  • HDFS
  • File
  • S3
• In practice: search GitHub, find .jar file, deploy it!
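“Deploy it” usually means handing the datasource .jar to `spark-submit`. The sketch below only assembles the command line, using the `--master` and `--jars` options. The script name, master URL and jar path are hypothetical, and it assumes a standalone master on its default port 7077.

```python
def spark_submit_command(script, master_url, jars=()):
    """Build a spark-submit argument list that ships extra datasource jars."""
    command = ["spark-submit", "--master", master_url]
    if jars:
        # --jars takes a single comma-separated list of jar paths
        command += ["--jars", ",".join(jars)]
    command.append(script)
    return command


cmd = spark_submit_command(
    "weather_summary.py",                         # hypothetical job script
    "spark://spark-master:7077",                  # hypothetical master URL
    jars=["/opt/jars/mysql-connector-java.jar"],  # hypothetical jar path
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess` on a machine where Spark is installed.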
Apache Spark: SQLContext
• ANSI SQL
  • 30+ year old language..
  • Easy to understand
  • Everyone usually knows it
• Spark SQLContext
  • A Spark module for structured data processing, wrapping the RDD API
  • Uses the same execution engine as the programmatic APIs
  • Supports:
    • JOINs/unions
    • EXPLAINs
    • Subqueries
    • ORDER/GROUP/SORT BYs
    • Most datatypes you’d expect
Apache Spark: Use Cases
• Business Intelligence/Analytics
• Understand
• Tip: use dedicated replicas for expensive queries!
• Data Summaries and Batch Jobs
  • Perform expensive summaries in the background, save result
  • Tip: use burstable/cloud hardware for infrequent batch jobs
• Real-time Stream Processing
  • Process data as it enters your system
So why not Apache Drill?
• A schema-free SQL engine for Hadoop, NoSQL and Cloud Storage
• Drill does not support / work with
  • Relational databases (MySQL) or Redis
• No programmatic-level querying
• No streaming/continuous query functionality
• I don’t know much about it
The Demo
• Scenario: You run a Weather Station data app that stores data in both an RDBMS and a document store
• Goal: summarise weather station data stored in an RDBMS and a document store
  • Min Water Temperature
  • Avg Water Temperature
  • Max Water Temperature
  • Total Sample Count
• Get Top-10 (based on avg water temp)
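The goal on this slide maps onto a single SQL statement: join stations to samples, aggregate per station, order by average and keep ten rows. The sketch below is an assumption-heavy stand-in. It runs the query on Python's built-in `sqlite3` rather than on Spark, so that it works without a cluster, and the table names, column names and sample rows are all hypothetical. In the talk's setup a similar statement would presumably be run through the SQLContext against the registered temp tables.

```python
import sqlite3

# Hypothetical table and column names; sqlite3 stands in for the SQL engine.
SUMMARY_SQL = """
SELECT st.name            AS station,
       MIN(sa.water_temp) AS min_temp,
       AVG(sa.water_temp) AS avg_temp,
       MAX(sa.water_temp) AS max_temp,
       COUNT(*)           AS samples
FROM stations st
JOIN samples sa ON sa.station_id = st.id
GROUP BY st.name
ORDER BY avg_temp DESC
LIMIT 10
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stations (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE samples (station_id INTEGER, water_temp REAL)")
conn.executemany(
    "INSERT INTO stations VALUES (?, ?)",
    [(1, "North Pier"), (2, "South Bay")],  # made-up stations
)
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(1, 10.0), (1, 14.0), (2, 18.0), (2, 20.0), (2, 22.0)],  # made-up readings
)

# One row per station: name, min, avg, max, sample count; warmest average first.
top10 = conn.execute(SUMMARY_SQL).fetchall()
for row in top10:
    print(row)
```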
The Demo
• RDBMS: Percona Server for MySQL 5.7
  • Stores the Weather station metadata (roughly 350 stations: ID, name, location, etc)
• Document-Store: Percona Server for MongoDB 3.2
  • Stores the Weather time-series sample data (roughly 80,000 samples: various weather readings from stations)
• In-Memory K/V Store: Redis 2.8
  • Stores summarised Top-10 data for fast querying of min, avg, max temperature and total sample counts
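One way to picture the Redis side is one hash per station, with one field per summary value. The sketch below is a guess at such a layout, not the schema used in the talk: the `weather:top10:` key prefix and the field names are hypothetical. The `client` argument is any object with a redis-py style `hset(name, key, value)` method, and `FakeRedis` is an in-memory stand-in so the sketch runs without a server.

```python
def summary_to_hash(row):
    """Map one summary row to a (key, fields) pair for a Redis hash."""
    key = "weather:top10:%s" % row["station"]  # hypothetical key scheme
    fields = {
        "min_temp": row["min_temp"],
        "avg_temp": row["avg_temp"],
        "max_temp": row["max_temp"],
        "samples": row["samples"],
    }
    return key, fields


def store_summaries(client, rows):
    """Write each summary row as one hash, one HSET per field."""
    for row in rows:
        key, fields = summary_to_hash(row)
        for field, value in sorted(fields.items()):
            client.hset(key, field, value)


class FakeRedis(object):
    """In-memory stand-in that mimics only hset(name, key, value)."""

    def __init__(self):
        self.data = {}

    def hset(self, name, key, value):
        self.data.setdefault(name, {})[key] = value


rows = [{"station": "South Bay", "min_temp": 18.0, "avg_temp": 20.0,
         "max_temp": 22.0, "samples": 3}]  # made-up summary row
fake = FakeRedis()
store_summaries(fake, rows)
print(fake.data)
```

Writing one field per call keeps the sketch compatible with older servers such as Redis 2.8, where `HSET` accepts a single field/value pair.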
The Demo
• Apache Spark 1.6.2 Cluster
  • 1 x Master
  • 2 x Worker/Slaves
• 1 x PySpark Job
• 1 x MacBook Pro
  • 3 x VirtualBox VMs
• Job submitted on Master
The Demo
(Play Demo Video Now)
The Demo: The PySpark Job
(Code screenshot. Annotations: SparkContext and SQLContext are created; the MySQL table is registered as a SQLContext temp table.)
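The slide's code is not reproduced in this transcript, so the following is only a sketch of what registering a MySQL table as a temp table can look like with the Spark 1.x DataFrame reader. The host, database, table and credentials are hypothetical. It assumes the MySQL Connector/J jar is on the classpath, and `sql_context` is expected to be a `pyspark.sql.SQLContext` built on a `SparkContext`.

```python
def mysql_jdbc_options(host, database, table, user, password):
    """Build the option dict for Spark's built-in JDBC datasource."""
    return {
        "url": "jdbc:mysql://%s:3306/%s" % (host, database),
        "dbtable": table,
        "user": user,
        "password": password,
        # Assumes MySQL Connector/J is available to the driver and executors.
        "driver": "com.mysql.jdbc.Driver",
    }


def register_mysql_table(sql_context, options, temp_name):
    """Load a MySQL table over JDBC and expose it to SQL as a temp table."""
    df = sql_context.read.format("jdbc").options(**options).load()
    df.registerTempTable(temp_name)  # Spark 1.x API
    return df


# Hypothetical connection details, for illustration only.
opts = mysql_jdbc_options("mysql-host", "weather", "stations", "spark", "secret")
print(opts["url"])
```

Once registered, the table can be referenced by `temp_name` in SQL statements run through the same SQLContext.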
The Demo: The PySpark Job
(Code screenshot. Annotations: the MongoDB collection is registered as a SQLContext temp table; a new Redis hash schema is registered as a SQLContext temp table.)
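The transcript does not say which MongoDB connector the job used, so this sketch assumes the MongoDB Spark Connector and its `com.mongodb.spark.sql.DefaultSource` source name with a `uri` option. Another connector would need a different format string and options. The host, database and collection names are hypothetical, and `sql_context` is again expected to be a `pyspark.sql.SQLContext`.

```python
def mongo_uri(host, database, collection, port=27017):
    """Build a mongodb:// URI that names the database and collection."""
    return "mongodb://%s:%d/%s.%s" % (host, port, database, collection)


def register_mongo_collection(sql_context, uri, temp_name):
    """Load a MongoDB collection as a DataFrame and register a temp table."""
    # ASSUMPTION: the MongoDB Spark Connector jar is on the classpath;
    # the talk does not state which connector it used.
    df = (sql_context.read
          .format("com.mongodb.spark.sql.DefaultSource")
          .option("uri", uri)
          .load())
    df.registerTempTable(temp_name)  # Spark 1.x API
    return df


# Hypothetical connection details, for illustration only.
uri = mongo_uri("mongo-host", "weather", "samples")
print(uri)
```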
The Demo: The PySpark Job
(Code screenshot. Annotations: from Redis; aggregation.)
The Demo: The PySpark Job
(Code screenshot. Annotation: aggregation.)
Questions?
DATABASE PERFORMANCE
MATTERS
