Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture

•

24 likes•4,158 views

This session will cover a series of problems that are adequately solved with Apache Spark, as well as those that are require additional technologies to implement correctly. Here’s an example outline of some of the topics that will be covered in the talk: Problems that are perfectly solved with Apache Spark: 1) Analyzing a large set of data files. 2) Doing ETL of a large amount of data. 3) Applying Machine Learning & Data Science to a large dataset. 4) Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally. By Vida Ha at Spark Summit East 2016.

Engineering

Not Your Father’s Database:
How to Use Apache Spark Properly  
in Your Big Data Architecture
Spark Summit East 2016

About Me
2005 Mobile Web & Voice Search
3

About Me
2005 Mobile Web & Voice Search
4
2012 Reporting & Analytics

About Me
2005 Mobile Web & Voice Search
5
2012 Reporting & Analytics
2014 Solutions Engineering

This system talks like a SQL Database…
Is this your Spark infrastructure?
6
HDFS
SQL

But the performance is very different…
Is this your Spark infrastructure?
7
SQL
HDFS

Just in Time Data Warehouse w/ Spark
HDFS

Just in Time Data Warehouse w/ Spark
and more…
HDFS

11
Know when to use other data stores  
besides file systems
Today’s Goal

Good: General Purpose Processing
Types of Data Sets to Store in File Systems:
• Archival Data
• Unstructured Data
• Social Media and other web datasets
• Backup copies of data stores
12

Types of workloads
• Batch Workloads
• Ad Hoc Analysis
– Best Practice: Use in memory caching
• Multi-step Pipelines
• Iterative Workloads
13
Good: General Purpose Processing

Benefits:
• Inexpensive Storage
• Incredibly flexible processing
• Speed and Scale
14
Good: General Purpose Processing

Bad: Random Access
sqlContext.sql(
“select * from my_large_table where id=2I34823”)
Will this command run in Spark?
15

Bad: Random Access
sqlContext.sql(
“select * from my_large_table where id=2I34823”)
Will this command run in Spark?
Yes, but it’s not very efficient — Spark may have  
to go through all your files to find your row.
16

Bad: Random Access
Solution: If you frequently randomly access your
data, use a database.
• For traditional SQL databases, create an index  
on your key column.
• Key-Value NOSQL stores retrieves the value  
of a key efficiently out of the box.
17

Bad: Frequent Inserts
sqlContext.sql(“insert into TABLE myTable
select fields from my2ndTable”)
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…
18

Bad: Frequent Inserts
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files.
19

Good: Data Transformation/ETL
Use Spark to splice and dice your data files any way:
File storage is cheap:
Not an “Anti-pattern” to duplicately store your data.
20

Bad: Frequent/Incremental Updates
Update statements — not supported yet.
Why not?
• Random Access: Locate the row(s) in the files.
• Delete & Insert: Delete the old row and insert a new one.
• Update: File formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.
21

Use Case: Up-to-date, live views of your SQL tables.
Tip: Use ClusterBy for fast joins or Bucketing with 2.0.
Bad: Frequent/Incremental Updates
22
Incremental
SQL Query
Database
Snapshot
+

Good: Connecting BI Tools
Tip: Cache your tables for optimal performance.
23
HDFS

Bad: External Reporting w/ load
Too many concurrent requests will overload Spark.
24
HDFS

Solution: Write out to a DB to handle load.
Bad: External Reporting w/ load
25
HDFS
DB

Good: Machine Learning & Data Science
Use MLlib, GraphX and Spark packages for machine
learning and data science.
Benefits:
• Built in distributed algorithms.
• In memory capabilities for iterative workloads.
• Data cleansing, featurization, training, testing, etc.
26

Bad: Searching Content w/ load
sqlContext.sql(“select * from mytable
where name like '%xyz%'”)
Spark will go through each row to find results.
27

What's hot

Distributed ML in Apache SparkDatabricks

Enabling exploratory data science with Spark and RDatabricks

New Developments in SparkDatabricks

Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteDatabricks

Spark Under the Hood - Meetup @ Data Science LondonDatabricks

Announcing Databricks Cloud (Spark Summit 2014)Databricks

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit

HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...Modern Data Stack France

Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks

Spark streaming state of the unionDatabricks

Spark Summit EU 2015: Reynold Xin KeynoteDatabricks

Introduction to Spark (Intern Event Presentation)Databricks

Spark Summit 2015 keynote: Making Big Data Simple with SparkDatabricks

What to Expect for Big Data and Apache Spark in 2017 Databricks

What's new in pandas and the SciPy stack for financial usersWes McKinney

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks

New Directions for Spark in 2015 - Spark Summit EastDatabricks

H2O World - H2O Rains with Databricks CloudSri Ambati

A look ahead at spark 2.0 Databricks

What's hot (20)

Distributed ML in Apache Spark

Enabling exploratory data science with Spark and R

New Developments in Spark

Spark Summit San Francisco 2016 - Ali Ghodsi Keynote

Spark Under the Hood - Meetup @ Data Science London

Announcing Databricks Cloud (Spark Summit 2014)

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...

HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...

Optimizing Delta/Parquet Data Lakes for Apache Spark

Spark streaming state of the union

Spark Summit EU 2015: Reynold Xin Keynote

Introduction to Spark (Intern Event Presentation)

Spark Summit 2015 keynote: Making Big Data Simple with Spark

What to Expect for Big Data and Apache Spark in 2017

What's new in pandas and the SciPy stack for financial users

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

New Directions for Spark in 2015 - Spark Summit East

H2O World - H2O Rains with Databricks Cloud

A look ahead at spark 2.0

Viewers also liked

Operational Tips for Deploying SparkDatabricks

Apache Spark Use case for Education IndustryVinayak Agrawal

Cancer Outlier Profile Analysis using Apache SparkMahmoud Parsian

How Totango uses Apache SparkOren Raboy

Getting Apache Spark Customers to ProductionCloudera, Inc.

Kodu Game Lab e Project SparkFabrício Catae

SampleClean: Bringing Data Cleaning into the BDAS Stackjeykottalam

Fighting Fraud with Apache SparkMiklos Christine

Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015Modern Data Stack France

Moa: Real Time Analytics for Data StreamsAlbert Bifet

Breakthrough OLAP performance with Cassandra and SparkEvan Chan

Building a Turbo-fast Data Warehousing Platform with DatabricksDatabricks

Consumer offset management in KafkaJoel Koshy

Hadoop最新情報 - YARN, Omni, Drill, Impala, Shark, Vertica - MapR CTO Meetup 2014...MapR Technologies Japan

Lambda Architectures in PracticeC4Media

Spark For Faster Batch ProcessingEdureka!

Not Your Father's Database by Vida HaSpark Summit

Akka Streams and HTTPRoland Kuhn

Viewers also liked (18)

Operational Tips for Deploying Spark

Apache Spark Use case for Education Industry

Cancer Outlier Profile Analysis using Apache Spark

How Totango uses Apache Spark

Getting Apache Spark Customers to Production

Kodu Game Lab e Project Spark

SampleClean: Bringing Data Cleaning into the BDAS Stack

Fighting Fraud with Apache Spark

Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015

Moa: Real Time Analytics for Data Streams

Breakthrough OLAP performance with Cassandra and Spark

Building a Turbo-fast Data Warehousing Platform with Databricks

Consumer offset management in Kafka

Hadoop最新情報 - YARN, Omni, Drill, Impala, Shark, Vertica - MapR CTO Meetup 2014...

Lambda Architectures in Practice

Spark For Faster Batch Processing

Not Your Father's Database by Vida Ha

Akka Streams and HTTP

Similar to Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture

Not Your Father's Database by DatabricksCaserta

Slide 2 collecting, storing and analyzing big dataTrieu Nguyen

Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks

Agile data warehousingSneha Challa

Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Spark Summit

NoSQL databasesFilip Ilievski

Demystifying data engineeringThang Bui (Bob)

Etl with apache impala by athemasterAthemaster Co., Ltd.

One Large Data Lake, Hold the HypeJared Winick

One Large Data Lake, Hold the HypeKoverse, Inc.

Bender kuszmaul tutorial-xldb12Atner Yegorov

Data Structures and Algorithms for Big Databasesomnidba

Big Data Technologies and Why They Matter To R UsersAdaryl "Bob" Wakefield, MBA

Hands On: Introduction to the Hadoop EcosystemAdaryl "Bob" Wakefield, MBA

Data modeling trends for analyticsIke Ellis

No sq lv1_0Tuan Luong

Splice machine-bloor-webinar-data-lakesEdgar Alejandro Villegas

Database as a Service on the Oracle Database Appliance PlatformMaris Elsins

The modern analytics architectureJoseph D'Antoni

Connecting Hadoop and OracleTanel Poder

Similar to Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture (20)

Not Your Father's Database by Databricks

Slide 2 collecting, storing and analyzing big data

Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...

Agile data warehousing

Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)

NoSQL databases

Demystifying data engineering

Etl with apache impala by athemaster

One Large Data Lake, Hold the Hype

Bender kuszmaul tutorial-xldb12

Data Structures and Algorithms for Big Databases

Big Data Technologies and Why They Matter To R Users

Hands On: Introduction to the Hadoop Ecosystem

Data modeling trends for analytics

No sq lv1_0

Splice machine-bloor-webinar-data-lakes

Database as a Service on the Oracle Database Appliance Platform

The modern analytics architecture

Connecting Hadoop and Oracle

Recently uploaded

8251 universal synchronous asynchronous receiver transmitterShivangiSharma879191

Indian Dairy Industry Present Status and.pptMadan Karki

Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721

Transport layer issues and challenges - GuideGOPINATHS437943

IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst

Vishratwadi & Ghorpadi Bridge Tender documentsSachinPawar510423

Past, Present and Future of Generative AIabhishek36461

Class 1 | NFPA 72 | Overview Fire Alarm Systemirfanmechengr

Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441

9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Low Rate Call Girls In Saket, Delhi NCR

Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3

Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066

INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12

Electronically Controlled suspensions system .pdfme23b1001

Oxy acetylene welding presentation note.eptoze12

Earthing details of Electrical Substationstephanwindworld

lifi-technology with integration of IOT.pptxsomshekarkn64

Solving The Right Triangles PowerPoint 2.pptJasonTagapanGulla

Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774

CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani

Recently uploaded (20)

8251 universal synchronous asynchronous receiver transmitter

Indian Dairy Industry Present Status and.ppt

Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync

Transport layer issues and challenges - Guide

IVE Industry Focused Event - Defence Sector 2024

Vishratwadi & Ghorpadi Bridge Tender documents

Past, Present and Future of Generative AI

Class 1 | NFPA 72 | Overview Fire Alarm System

Instrumentation, measurement and control of bio process parameters ( Temperat...

9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf

Concrete Mix Design - IS 10262-2019 - .pptx

Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)

INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE

Electronically Controlled suspensions system .pdf

Oxy acetylene welding presentation note.

Earthing details of Electrical Substation

lifi-technology with integration of IOT.pptx

Solving The Right Triangles PowerPoint 2.ppt

Arduino_CSE ece ppt for working and principal of arduino.ppt

CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf

Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture

1. Not Your Father’s Database: How to Use Apache Spark Properly   in Your Big Data Architecture Spark Summit East 2016

2. Not Your Father’s Database: How to Use Apache Spark Properly   in Your Big Data Architecture Spark Summit East 2016

3. About Me 2005 Mobile Web & Voice Search 3

4. About Me 2005 Mobile Web & Voice Search 4 2012 Reporting & Analytics

5. About Me 2005 Mobile Web & Voice Search 5 2012 Reporting & Analytics 2014 Solutions Engineering

6. This system talks like a SQL Database… Is this your Spark infrastructure? 6 HDFS SQL

7. But the performance is very different… Is this your Spark infrastructure? 7 SQL HDFS

8. Just in Time Data Warehouse w/ Spark HDFS

9. Just in Time Data Warehouse w/ Spark HDFS

10. Just in Time Data Warehouse w/ Spark and more… HDFS

11. 11 Know when to use other data stores   besides file systems Today’s Goal

12. Good: General Purpose Processing Types of Data Sets to Store in File Systems: • Archival Data • Unstructured Data • Social Media and other web datasets • Backup copies of data stores 12

13. Types of workloads • Batch Workloads • Ad Hoc Analysis – Best Practice: Use in memory caching • Multi-step Pipelines • Iterative Workloads 13 Good: General Purpose Processing

14. Benefits: • Inexpensive Storage • Incredibly flexible processing • Speed and Scale 14 Good: General Purpose Processing

15. Bad: Random Access sqlContext.sql( “select * from my_large_table where id=2I34823”) Will this command run in Spark? 15

16. Bad: Random Access sqlContext.sql( “select * from my_large_table where id=2I34823”) Will this command run in Spark? Yes, but it’s not very efficient — Spark may have   to go through all your files to find your row. 16

17. Bad: Random Access Solution: If you frequently randomly access your data, use a database. • For traditional SQL databases, create an index   on your key column. • Key-Value NOSQL stores retrieves the value   of a key efficiently out of the box. 17

18. Bad: Frequent Inserts sqlContext.sql(“insert into TABLE myTable select fields from my2ndTable”) Each insert creates a new file: • Inserts are reasonably fast. • But querying will be slow… 18

19. Bad: Frequent Inserts Solution: • Option 1: Use a database to support the inserts. • Option 2: Routinely compact your Spark SQL table files. 19

20. Good: Data Transformation/ETL Use Spark to splice and dice your data files any way: File storage is cheap: Not an “Anti-pattern” to duplicately store your data. 20

21. Bad: Frequent/Incremental Updates Update statements — not supported yet. Why not? • Random Access: Locate the row(s) in the files. • Delete & Insert: Delete the old row and insert a new one. • Update: File formats aren’t optimized for updating rows. Solution: Many databases support efficient update operations. 21

22. Use Case: Up-to-date, live views of your SQL tables. Tip: Use ClusterBy for fast joins or Bucketing with 2.0. Bad: Frequent/Incremental Updates 22 Incremental SQL Query Database Snapshot +

23. Good: Connecting BI Tools Tip: Cache your tables for optimal performance. 23 HDFS

24. Bad: External Reporting w/ load Too many concurrent requests will overload Spark. 24 HDFS

25. Solution: Write out to a DB to handle load. Bad: External Reporting w/ load 25 HDFS DB

26. Good: Machine Learning & Data Science Use MLlib, GraphX and Spark packages for machine learning and data science. Benefits: • Built in distributed algorithms. • In memory capabilities for iterative workloads. • Data cleansing, featurization, training, testing, etc. 26

27. Bad: Searching Content w/ load sqlContext.sql(“select * from mytable where name like '%xyz%'”) Spark will go through each row to find results. 27

28. Thank you

Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (18)

Similar to Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture

Similar to Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture (20)

More from Databricks

More from Databricks (20)

Recently uploaded

Recently uploaded (20)

Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data Architecture