Using Apache Spark and MySQL for Data Analysis
Alexander Rubin, Sveta Smirnova
Percona
February 4, 2017
www.percona.com
Agenda
• Why Spark?
• Spark Examples
– Wikistats analysis with Spark
What is Spark anyway?
[Diagram; labels: App, SQL / Protocol, Data; Nodes, Parallel Compute only, Local FS, ?]
Why Spark?
• In-memory processing with caching
• Massively parallel
• Direct access to data sources (e.g. MySQL)
>>> df = sqlContext.load(source="jdbc",
...     url="jdbc:mysql://localhost?user=root",
...     dbtable="ontime.ontime_sm")
• Can store data in Hadoop HDFS / S3 / local filesystem
• Native Python and R integration
Spark vs MySQL
Spark vs. MySQL for Big Data
MySQL:
• Indexes
• Partitioning
• “Sharding”
Spark:
• Full table scan
• Partitioning
• Map/Reduce
Spark (vs. MySQL)
• No indexes
• All processing is full scan
• BUT: distributed and parallel
• No transactions
• High latency (usually)
MySQL:
1 query = 1 CPU core
Indexes (BTree) for Big Data challenge
• Creating an index for Petabytes of data?
• Updating an index for Petabytes of data?
• Reading a terabyte index?
• Random read of Petabyte?
Full scan in parallel is better for big data
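The point that a full scan run in parallel suits big data can be sketched in plain Python. This is only an illustration of the split, scan, and merge pattern, not Spark itself: the sample `records`, the helper names `scan_chunk` and `parallel_scan`, and the use of threads as stand-ins for cluster nodes are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(chunk, needle):
    """Full scan of one chunk: count the records that contain the needle."""
    return sum(1 for record in chunk if needle in record)


def parallel_scan(records, needle, workers=4):
    """Split the data into chunks, scan each chunk independently, add up the partial counts."""
    size = max(1, (len(records) + workers - 1) // workers)
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(scan_chunk, chunks, [needle] * len(chunks))
        return sum(partial_counts)


# Hypothetical sample records in the wikistats line layout
records = [
    "en Barack_Obama 10 100",
    "en Cloverfield 5 50",
    "de Barack_Obama 2 20",
]
total = parallel_scan(records, "Barack_Obama", workers=2)
print(total)
```

No index is built or maintained here; every chunk is read in full, and the work scales with the number of workers. Python threads do not give real CPU parallelism for this kind of loop, so the sketch shows the shape of the computation rather than its speed.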
ETL / Pipeline
ETL (MySQL):
1. Extract data from external source
2. Transform before loading
3. Load data into MySQL
Pipeline (Spark):
1. Extract data from external source
2. Load data or rsync to all Spark nodes
3. Transform data / Analyze data / Visualize data; Parallelism
Schema on Read
Schema on Write:
• Load data infile will verify the input (validate)
• … indirect data conversion
• ... or fail if number of cols is wrong
Schema on Read:
• No “load data” per se, nothing to validate here
• … Create external table or read csv
• ... will validate on “read”/select
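The difference between the two approaches can be sketched without any database. The code below is a hedged illustration only: the four-column layout follows the wikistats format used later in the deck, while the function names, the sample input, and the policy of skipping bad rows on read are assumptions for the example (real engines may return NULLs or raise instead).

```python
EXPECTED_COLS = 4  # project, title, num_requests, content_size


def load_with_schema_on_write(text):
    """Validate and convert every row at load time; one bad row fails the whole load."""
    table = []
    for line in text.splitlines():
        row = line.split(" ")
        if len(row) != EXPECTED_COLS:
            raise ValueError("wrong number of columns: %r" % (row,))
        table.append((row[0], row[1], int(row[2]), int(row[3])))
    return table


def read_with_schema_on_read(text):
    """Keep the raw text as is; apply the schema only while reading, skipping rows that do not fit."""
    for line in text.splitlines():
        row = line.split(" ")
        if len(row) == EXPECTED_COLS and row[2].isdigit() and row[3].isdigit():
            yield (row[0], row[1], int(row[2]), int(row[3]))


# Hypothetical input with one malformed line
raw = "en Main_Page 42 1000\nbroken line\nen Spark 7 512\n"

rows_on_read = list(read_with_schema_on_read(raw))
try:
    load_with_schema_on_write(raw)
    load_failed = False
except ValueError:
    load_failed = True

print(rows_on_read, load_failed)
```

With schema on write the malformed line is caught once, at load time; with schema on read the same check runs on every read of the raw data.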
Example: Loading wikistat into MySQL
1. Extract data from external source and uncompress!
2. Load data into MySQL and Transform
Wikipedia page counts – download, >10TB
http://dumps.wikimedia.org/other/pagecounts-raw/
load data local infile '$file'
into table wikistats.wikistats_full
CHARACTER SET latin1
FIELDS TERMINATED BY ' '
(project_name, title, num_requests, content_size)
set request_date = STR_TO_DATE('$datestr', '%Y%m%d %H%i%S'),
    title_md5 = unhex(md5(title));
Load timing per hour of wikistat
• InnoDB: 52.34 sec
• MyISAM: 11.08 sec (+ indexes)
• 1 hour of wikistats = ~1 minute to load
• 1 year will load in 6 days
– (8765.81 hours in 1 year)
• 6 years => more than 1 month to load
Not even counting the insert time degradation…
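The estimate above is straightforward arithmetic, which the following sketch reproduces. It assumes, as the slide does, that the InnoDB load time of 52.34 sec is rounded up to one minute per hourly file and that files are loaded one after another; the function name `load_days` is made up for the example.

```python
HOURS_PER_YEAR = 8765.81       # figure from the slide
SECONDS_PER_HOURLY_FILE = 60   # 52.34 sec (InnoDB) rounded up to 1 minute
SECONDS_PER_DAY = 86400.0


def load_days(years, seconds_per_file=SECONDS_PER_HOURLY_FILE):
    """Days needed to load `years` of hourly files sequentially."""
    return years * HOURS_PER_YEAR * seconds_per_file / SECONDS_PER_DAY


one_year = load_days(1)
six_years = load_days(6)
print("1 year:  %.1f days" % one_year)
print("6 years: %.1f days" % six_years)
```

This gives roughly 6.1 days for one year and 36.5 days for six years, which matches the "6 days" and "more than 1 month" figures, and it ignores any slowdown as the table grows.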
Loading wikistat as is into Spark
• Just copy files to storage (AWS S3 / local / etc)…
– And create SQL structure
• Or read csv, aggregate/filter in Spark and
– load the aggregated data into MySQL
Loading wikistat as is into Spark
• How fast to search?
– Depends upon the number of nodes
• 1000-node Spark cluster
– 4.5 TB, 104 billion records
– Exec time: 45 sec
– Scanning 4.5 TB of data
• http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf
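To put the 1000-node figures in perspective, the implied scan rates can be worked out directly. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 10^12 bytes) and an even spread of data across nodes; neither assumption is stated in the slide.

```python
NODES = 1000
DATA_BYTES = 4.5e12    # 4.5 TB, decimal units assumed
RECORDS = 104e9        # 104 billion records
EXEC_SECONDS = 45

cluster_bytes_per_sec = DATA_BYTES / EXEC_SECONDS
node_bytes_per_sec = cluster_bytes_per_sec / NODES   # assumes an even spread
cluster_records_per_sec = RECORDS / EXEC_SECONDS

print("cluster: %.0f GB/s" % (cluster_bytes_per_sec / 1e9))
print("per node: %.0f MB/s" % (node_bytes_per_sec / 1e6))
print("records: %.2f billion/s" % (cluster_records_per_sec / 1e9))
```

Under these assumptions the cluster scans about 100 GB/s in aggregate while each node only needs about 100 MB/s, a rate within reach of ordinary local storage.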
Pipelines: MySQL vs Spark
Spark and WikiStats: load pipeline
Row(project=p[0],
url=urllib.unquote(p[1]).lower(),
num_requests=int(p[2]),
content_size=int(p[3])))
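The fragment above maps each space-separated wikistats line to a `Row`. A standalone sketch of that parsing step, runnable without Spark, might look as follows. The `WikiRow` namedtuple is a stand-in for `pyspark.sql.Row`, the function name `parse_line` and the sample line are hypothetical, and returning `None` for malformed lines is an assumption rather than what the original pipeline does. Note that `urllib.unquote` in the slide is the Python 2 spelling; on Python 3 the function lives at `urllib.parse.unquote`.

```python
from collections import namedtuple
from urllib.parse import unquote

# Stand-in for pyspark.sql.Row so the sketch runs without Spark
WikiRow = namedtuple("WikiRow", ["project", "url", "num_requests", "content_size"])


def parse_line(line):
    """Turn one 'project title num_requests content_size' line into a WikiRow, or None if malformed."""
    p = line.split(" ")
    if len(p) != 4:
        return None
    try:
        return WikiRow(project=p[0],
                       url=unquote(p[1]).lower(),   # decode %xx escapes, normalise case
                       num_requests=int(p[2]),
                       content_size=int(p[3]))
    except ValueError:
        return None


row = parse_line("en Heath%20Ledger 12 34567")  # hypothetical input line
print(row)
```

In a Spark job a function like this would typically be applied to every line of the input and the `None` results filtered out before building a DataFrame.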
Save results to MySQL
# Aggregate per URL for the given date (mydate is a string set earlier)
group_res = sqlContext.sql(
    "SELECT '" + mydate + "' as mydate, "
    "url, "
    "count(*) as cnt, "
    "sum(num_requests) as tot_visits "
    "FROM wikistats "
    "GROUP BY url")
# Save to MySQL
mysql_url = "jdbc:mysql://localhost?user=wikistats&password=wikistats"
group_res.write.jdbc(url=mysql_url,
                     table="wikistats.wikistats_by_day_spark",
                     mode="append")
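The aggregate-then-append pattern can be tried out locally with Python's built-in `sqlite3` module standing in for both Spark SQL and MySQL. This is a sketch under that substitution, not the pipeline from the talk: the table `wikistats_by_day`, the sample rows, and the helper `append_daily_summary` are hypothetical. It also binds the date as a query parameter instead of concatenating it into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wikistats (url TEXT, num_requests INTEGER)")
conn.execute(
    "CREATE TABLE wikistats_by_day "
    "(mydate TEXT, url TEXT, cnt INTEGER, tot_visits INTEGER)")
# Hypothetical sample rows
conn.executemany("INSERT INTO wikistats VALUES (?, ?)",
                 [("spark", 3), ("spark", 4), ("mysql", 10)])


def append_daily_summary(conn, mydate):
    """Aggregate per url and append the result to the summary table."""
    conn.execute(
        "INSERT INTO wikistats_by_day "
        "SELECT ?, url, count(*), sum(num_requests) FROM wikistats GROUP BY url",
        (mydate,))


append_daily_summary(conn, "2008-01-01")
summary = conn.execute(
    "SELECT mydate, url, cnt, tot_visits FROM wikistats_by_day ORDER BY url"
).fetchall()
print(summary)
```

Each run appends one day's worth of per-URL totals, so the summary table grows by one row per URL per day, which is what `mode="append"` does on the Spark side.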
Multi-Threaded Inserts
PySpark: CPU
Cpu0 : 94.4%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 5.7%us, 0.0%sy, 0.0%ni, 92.4%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st
Cpu2 : 95.0%us, 0.0%sy, 0.0%ni, 5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.6%us, 0.0%sy, 0.0%ni, 99.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 95.0%us, 0.0%sy, 0.0%ni, 5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 94.4%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
...
Cpu17 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49454372k total, 40479496k used, 8974876k free, 357360k buffers
Monitoring your jobs
Search the WikiStats in MySQL
10 most frequently queried wiki pages in January 2008
mysql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*)
    -> FROM wikistats_by_day_spark
    -> where lower(url) not like '%special%'
    ->   and lower(url) not like '%page%'
    ->   and lower(url) not like '%test%'
    ->   and lower(url) not like '%wiki%'
    -> group by lower(url)
    -> order by max_visits desc limit 10;
+--------------------------------------------------------+------------+----------+
| lurl                                                   | max_visits | count(*) |
+--------------------------------------------------------+------------+----------+
| heath_ledger                                           |    4247338 |      131 |
| cloverfield                                            |    3846404 |      131 |
| barack_obama                                           |    2238406 |      153 |
| 1925_in_baseball#negro_league_baseball_final_standings |    1791341 |       11 |
| the_dark_knight_(film)                                 |    1417186 |       64 |
| martin_luther_king,_jr.                                |    1394934 |      136 |
| deaths_in_2008                                         |    1372510 |       67 |
| united_states                                          |    1357253 |      167 |
| scientology                                            |    1349654 |      108 |
| portal:current_events                                  |    1261538 |      125 |
+--------------------------------------------------------+------------+----------+
10 rows in set (1 hour 22 min 10.02 sec)
Search the WikiStats in SparkSQL
spark-sql> CREATE TEMPORARY TABLE wikistats_parquet
USING org.apache.spark.sql.parquet
OPTIONS (
path "/ssd/wikistats_parquet_bydate"
);
Time taken: 3.466 seconds
spark-sql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits , count(*) FROM
wikistats_parquet where lower(url) not like '%special%' and lower(url) not like '%page%'
and lower(url) not like '%test%' and lower(url) not like '%wiki%' group by lower(url)
order by max_visits desc limit 10;
heath_ledger 4247335 42
cloverfield 3846400 42
barack_obama 2238402 53
1925_in_baseball#negro_league_baseball_final_standings 1791341 11
the_dark_knight_(film) 1417183 36
martin_luther_king,_jr. 1394934 46
deaths_in_2008 1372510 38
united_states 1357251 55
scientology 1349650 44
portal:current_events 1261305 44
Time taken: 1239.014 seconds, Fetched 10 row(s)
10 most frequently queried wiki pages in January 2008
20 min
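The two reported timings can be compared with a few lines of arithmetic. The sketch below only converts and divides the numbers printed by the two clients (1 hour 22 min 10.02 sec for MySQL, 1239.014 seconds for Spark SQL); the helper `to_seconds` is made up for the example, and the ratio says nothing about why the runs differ.

```python
def to_seconds(hours=0, minutes=0, seconds=0.0):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


mysql_seconds = to_seconds(hours=1, minutes=22, seconds=10.02)  # MySQL run
spark_seconds = 1239.014                                        # Spark SQL run

speedup = mysql_seconds / spark_seconds
spark_minutes = spark_seconds / 60

print("MySQL: %.2f s" % mysql_seconds)
print("Spark: %.1f min" % spark_minutes)
print("ratio: %.1fx" % speedup)
```

The Spark SQL run comes out at a little over 20 minutes, about a quarter of the MySQL time on these figures.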
Apache Drill
Treat any datasource as a table (even if it is not)
Querying MongoDB with SQL
Magic?
!=
Recap…
MySQL:
1. Search full dataset
• May be pre-filtered
• Not aggregated
2. No parallelism
3. Based on index?
4. InnoDB != Columnar
5. Partitioning?
Spark:
1. Dataset is already
– Filtered (only site=“en”)
– Aggregated (group by url)
2. Parallelism (+)
3. Not based on index
4. Columnar (+)
5. Partitioning (+)
Thank you!
https://www.linkedin.com/in/alexanderrubin
Alexander Rubin

How to migrate from MySQL to MariaDB without tears
 
Modern solutions for modern database load: improvements in the latest MariaDB...
Modern solutions for modern database load: improvements in the latest MariaDB...Modern solutions for modern database load: improvements in the latest MariaDB...
Modern solutions for modern database load: improvements in the latest MariaDB...
 
How Safe is Asynchronous Master-Master Setup?
How Safe is Asynchronous Master-Master Setup?How Safe is Asynchronous Master-Master Setup?
How Safe is Asynchronous Master-Master Setup?
 
Современному хайлоду - современные решения: MySQL 8.0 и улучшения Percona
Современному хайлоду - современные решения: MySQL 8.0 и улучшения PerconaСовременному хайлоду - современные решения: MySQL 8.0 и улучшения Percona
Современному хайлоду - современные решения: MySQL 8.0 и улучшения Percona
 
How to Avoid Pitfalls in Schema Upgrade with Galera
How to Avoid Pitfalls in Schema Upgrade with GaleraHow to Avoid Pitfalls in Schema Upgrade with Galera
How to Avoid Pitfalls in Schema Upgrade with Galera
 
How Safe is Asynchronous Master-Master Setup?
 How Safe is Asynchronous Master-Master Setup? How Safe is Asynchronous Master-Master Setup?
How Safe is Asynchronous Master-Master Setup?
 
Introduction to MySQL Query Tuning for Dev[Op]s
Introduction to MySQL Query Tuning for Dev[Op]sIntroduction to MySQL Query Tuning for Dev[Op]s
Introduction to MySQL Query Tuning for Dev[Op]s
 
Billion Goods in Few Categories: How Histograms Save a Life?
Billion Goods in Few Categories: How Histograms Save a Life?Billion Goods in Few Categories: How Histograms Save a Life?
Billion Goods in Few Categories: How Histograms Save a Life?
 
A Billion Goods in a Few Categories: When Optimizer Histograms Help and When ...
A Billion Goods in a Few Categories: When Optimizer Histograms Help and When ...A Billion Goods in a Few Categories: When Optimizer Histograms Help and When ...
A Billion Goods in a Few Categories: When Optimizer Histograms Help and When ...
 

Recently uploaded

KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfIdiosysTechnologies1
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 

Recently uploaded (20)

KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdf
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 

Using Apache Spark and MySQL for Data Analysis

  • 1. Using Apache Spark and MySQL for Data Analysis
       Alexander Rubin, Sveta Smirnova
       Percona
       February 4, 2017
  • 2. Agenda
       • Why Spark?
       • Spark Examples
         – Wikistats analysis with Spark
  • 3. What is Spark anyway?
       [Diagram; labels: App, SQL, Data / SQL / Protocol, Nodes, Parallel Compute only, Local FS]
  • 4. Why Spark?
       • In-memory processing with caching
       • Massively parallel
       • Direct access to data sources (e.g. MySQL)
           >>> df = sqlContext.load(source="jdbc",
                                    url="jdbc:mysql://localhost?user=root",
                                    dbtable="ontime.ontime_sm")
       • Can store data in Hadoop HDFS / S3 / local filesystem
       • Native Python and R integration
  • 6. Spark vs. MySQL for Big Data
       MySQL: indexes, partitioning, "sharding"
       Spark: full table scan, partitioning, Map/Reduce
  • 7. Spark (vs. MySQL)
       • No indexes
       • All processing is a full scan
       • BUT: distributed and parallel
       • No transactions
       • High latency (usually)
       MySQL: 1 query = 1 CPU core
  • 8. Indexes (BTree) for Big Data: the challenge
       • Creating an index for petabytes of data?
       • Updating an index for petabytes of data?
       • Reading a terabyte index?
       • Random read of a petabyte?
       Full scan in parallel is better for big data
  • 9. ETL / Pipeline
       ETL (MySQL):
       1. Extract data from external source
       2. Transform before loading
       3. Load data into MySQL
       Pipeline (Spark):
       1. Extract data from external source
       2. Load data or rsync to all Spark nodes
       3. Transform data / analyze data / visualize data; parallelism
  • 10. Schema on Read
       Schema on Write
       • LOAD DATA INFILE will verify (validate) the input
       • ... indirect data conversion
       • ... or fail if the number of columns is wrong
       Schema on Read
       • No "load data" per se, nothing to validate here
       • ... create an external table or read CSV
       • ... will validate on "read" / SELECT
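The difference between the two approaches on this slide can be sketched in plain Python without MySQL or Spark. This is only an illustration under stated assumptions: the sample text, the function names and the "skip malformed rows" policy are hypothetical and do not come from the talk.

```python
import csv
import io

# Hypothetical sample in a space-delimited, four-column layout;
# the second row is missing a column on purpose.
RAW = "en Main_Page 42 1024\nen Broken_Row 7\n"


def load_schema_on_write(raw):
    """Validate every row while loading; fail if the column count is wrong."""
    table = []
    reader = csv.reader(io.StringIO(raw), delimiter=" ")
    for lineno, row in enumerate(reader, start=1):
        if len(row) != 4:
            raise ValueError(f"line {lineno}: expected 4 columns, got {len(row)}")
        table.append((row[0], row[1], int(row[2]), int(row[3])))
    return table


def read_schema_on_read(raw):
    """Nothing is checked up front; the schema is applied while reading."""
    for row in csv.reader(io.StringIO(raw), delimiter=" "):
        if len(row) != 4:
            continue  # assumed policy: skip malformed rows at query time
        yield (row[0], row[1], int(row[2]), int(row[3]))


try:
    load_schema_on_write(RAW)
    load_failed = False
except ValueError:
    load_failed = True

rows_on_read = list(read_schema_on_read(RAW))
```

With schema on write the bad row stops the load, so the cost of validation is paid once, before any query runs. With schema on read the raw file is accepted as is, and every query pays for parsing and decides what to do with bad rows.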
  • 11. Example: Loading wikistat into MySQL
       1. Extract data from external source and uncompress!
       2. Load data into MySQL and transform
       Wikipedia page counts: download, >10TB
       http://dumps.wikimedia.org/other/pagecounts-raw/
         load data local infile '$file'
         into table wikistats.wikistats_full
         CHARACTER SET latin1
         FIELDS TERMINATED BY ' '
         (project_name, title, num_requests, content_size)
         set request_date = STR_TO_DATE('$datestr', '%Y%m%d %H%i%S'),
             title_md5 = unhex(md5(title));
  • 12. Load timing per hour of wikistat
       • InnoDB: 52.34 sec
       • MyISAM: 11.08 sec (+ indexes)
       • 1 hour of wikistats = 1 minute
       • 1 year will load in 6 days
         – (8765.81 hours in 1 year)
       • 6 years = more than 1 month to load
       Not even counting the insert time degradation...
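The estimate on this slide is simple arithmetic and can be checked with a few lines of Python. The sketch assumes the hourly files are loaded one after another at a constant rate, using the slide's rounded figure of one minute per file; the function name is illustrative.

```python
HOURS_PER_YEAR = 8765.81       # figure quoted on the slide
SECONDS_PER_HOURLY_FILE = 60   # "1 hour of wikistats = 1 minute" (52.34 s for InnoDB, rounded up)
SECONDS_PER_DAY = 86400


def load_days(years, seconds_per_file=SECONDS_PER_HOURLY_FILE):
    """Estimated wall-clock days to load `years` of hourly files sequentially."""
    return years * HOURS_PER_YEAR * seconds_per_file / SECONDS_PER_DAY


print(f"1 year : {load_days(1):.1f} days")   # 6.1 days
print(f"6 years: {load_days(6):.1f} days")   # 36.5 days
```

As the slide notes, this is a lower bound: it ignores the slowdown of inserts as the table and its indexes grow.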
  • 13. Loading wikistat as is into Spark
       • Just copy files to storage (AWS S3 / local / etc.)...
         – and create the SQL structure
       • Or read CSV, aggregate/filter in Spark and
         – load the aggregated data into MySQL
  • 14. Loading wikistat as is into Spark
       • How fast to search?
         – Depends upon the number of nodes
       • 1000-node Spark cluster
         – 4.5 TB, 104 billion records
         – Exec time: 45 sec
         – Scanning 4.5 TB of data
       • http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf
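To put the 1000-node figures in perspective, the implied scan rates can be worked out directly. This is a back-of-the-envelope sketch that assumes decimal terabytes and a perfectly even spread of data across nodes; neither assumption is stated in the talk.

```python
NODES = 1000
DATA_BYTES = 4.5e12   # 4.5 TB, read as decimal terabytes (assumption)
RECORDS = 104e9       # 104 billion records
EXEC_SECONDS = 45

aggregate_gb_per_s = DATA_BYTES / EXEC_SECONDS / 1e9          # whole cluster
per_node_mb_per_s = DATA_BYTES / EXEC_SECONDS / NODES / 1e6   # one node, even spread assumed
records_per_s = RECORDS / EXEC_SECONDS

print(f"cluster : {aggregate_gb_per_s:.0f} GB/s")
print(f"per node: {per_node_mb_per_s:.0f} MB/s")
print(f"records : {records_per_s / 1e9:.2f} billion/s")
```

Under these assumptions each node only has to read about 100 MB/s, a modest rate for a single machine, which is why the full scan finishes in seconds once it is spread over enough nodes.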
  • 16. Spark and WikiStats: load pipeline
       (fragment; the final parenthesis closes an enclosing call that is not shown on the slide)
         Row(project=p[0],
             url=urllib.unquote(p[1]).lower(),
             num_requests=int(p[2]),
             content_size=int(p[3])))
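The row-parsing step in the fragment above can be tried without a Spark cluster. The sketch below is a stand-in, not the authors' pipeline: it uses a `namedtuple` in place of Spark's `Row`, the Python 3 function `urllib.parse.unquote` in place of the Python 2 `urllib.unquote` shown on the slide, and a made-up input line.

```python
from collections import namedtuple
from urllib.parse import unquote  # Python 3 location of the slide's urllib.unquote

# Stand-in for pyspark.sql.Row so the sketch runs without Spark.
PageCount = namedtuple("PageCount", "project url num_requests content_size")


def parse_line(line):
    """Split one space-delimited page-count line into typed fields."""
    p = line.split(" ")
    return PageCount(
        project=p[0],
        url=unquote(p[1]).lower(),   # decode %XX escapes, then normalise case
        num_requests=int(p[2]),
        content_size=int(p[3]),
    )


# Hypothetical input line in the four-column layout used on the slides.
row = parse_line("en Heath_Ledger%27s_Joker 12 34567")
print(row)
```

Decoding and lower-casing the URL at load time means that differently escaped or capitalised requests for the same page fall into one group when the data is later aggregated by `url`.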
  • 17. Save results to MySQL
         group_res = sqlContext.sql(
             "SELECT '" + mydate + "' as mydate, url, count(*) as cnt, "
             "sum(num_requests) as tot_visits FROM wikistats GROUP BY url")
         # Save to MySQL
         mysql_url = "jdbc:mysql://localhost?user=wikistats&password=wikistats"
         group_res.write.jdbc(url=mysql_url,
                              table="wikistats.wikistats_by_day_spark",
                              mode="append")
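The shape of the aggregation on this slide can be checked locally with Python's built-in `sqlite3` module standing in for Spark SQL. This is only a sketch: the sample rows and the date value are invented, and SQLite is used purely because it needs no cluster.

```python
import sqlite3

mydate = "2008-01-01"   # hypothetical label for the batch being aggregated
rows = [                # hypothetical parsed rows: (project, url, num_requests, content_size)
    ("en", "heath_ledger", 10, 100),
    ("en", "heath_ledger", 5, 100),
    ("en", "cloverfield", 7, 200),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wikistats "
    "(project TEXT, url TEXT, num_requests INTEGER, content_size INTEGER)"
)
conn.executemany("INSERT INTO wikistats VALUES (?, ?, ?, ?)", rows)

# Same GROUP BY as on the slide; ORDER BY is added only to make the output deterministic.
group_res = conn.execute(
    "SELECT ? AS mydate, url, count(*) AS cnt, sum(num_requests) AS tot_visits "
    "FROM wikistats GROUP BY url ORDER BY url",
    (mydate,),
).fetchall()
conn.close()

for record in group_res:
    print(record)
```

The date is passed as a bound parameter here rather than concatenated into the SQL string as on the slide; that is a design choice of this sketch, made to avoid quoting problems.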
  • 19. PySpark: CPU
         Cpu0  : 94.4%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Cpu1  : 5.7%us, 0.0%sy, 0.0%ni, 92.4%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st
         Cpu2  : 95.0%us, 0.0%sy, 0.0%ni, 5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Cpu3  : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Cpu4  : 0.6%us, 0.0%sy, 0.0%ni, 99.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Cpu5  : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Cpu6  : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Cpu7  : 95.0%us, 0.0%sy, 0.0%ni, 5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Cpu8  : 94.4%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         ...
         Cpu17 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Cpu18 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Cpu19 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Cpu20 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Cpu21 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Cpu22 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Cpu23 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
         Mem: 49454372k total, 40479496k used, 8974876k free, 357360k buffers
  • 22. Search the WikiStats in MySQL
       10 most frequently queried wiki pages in January 2008
         mysql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*)
                FROM wikistats_by_day_spark
                where lower(url) not like '%special%'
                and lower(url) not like '%page%'
                and lower(url) not like '%test%'
                and lower(url) not like '%wiki%'
                group by lower(url)
                order by max_visits desc limit 10;
         +--------------------------------------------------------+------------+----------+
         | lurl                                                   | max_visits | count(*) |
         +--------------------------------------------------------+------------+----------+
         | heath_ledger                                           |    4247338 |      131 |
         | cloverfield                                            |    3846404 |      131 |
         | barack_obama                                           |    2238406 |      153 |
         | 1925_in_baseball#negro_league_baseball_final_standings |    1791341 |       11 |
         | the_dark_knight_(film)                                 |    1417186 |       64 |
         | martin_luther_king,_jr.                                |    1394934 |      136 |
         | deaths_in_2008                                         |    1372510 |       67 |
         | united_states                                          |    1357253 |      167 |
         | scientology                                            |    1349654 |      108 |
         | portal:current_events                                  |    1261538 |      125 |
         +--------------------------------------------------------+------------+----------+
         10 rows in set (1 hour 22 min 10.02 sec)
  • 23. Search the WikiStats in SparkSQL
       10 most frequently queried wiki pages in January 2008
         spark-sql> CREATE TEMPORARY TABLE wikistats_parquet
                    USING org.apache.spark.sql.parquet
                    OPTIONS ( path "/ssd/wikistats_parquet_bydate" );
         Time taken: 3.466 seconds
         spark-sql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*)
                    FROM wikistats_parquet
                    where lower(url) not like '%special%'
                    and lower(url) not like '%page%'
                    and lower(url) not like '%test%'
                    and lower(url) not like '%wiki%'
                    group by lower(url)
                    order by max_visits desc limit 10;
         heath_ledger                                            4247335  42
         cloverfield                                             3846400  42
         barack_obama                                            2238402  53
         1925_in_baseball#negro_league_baseball_final_standings  1791341  11
         the_dark_knight_(film)                                  1417183  36
         martin_luther_king,_jr.                                 1394934  46
         deaths_in_2008                                          1372510  38
         united_states                                           1357251  55
         scientology                                             1349650  44
         portal:current_events                                   1261305  44
         Time taken: 1239.014 seconds, Fetched 10 row(s)
       (about 20 min)
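The two reported timings, 1 hour 22 min 10.02 sec for the MySQL run and 1239.014 seconds for the Spark SQL run, are easier to compare once both are in seconds. The small sketch below only converts and divides the figures reported on the slides; the helper name is illustrative.

```python
def to_seconds(hours=0, minutes=0, seconds=0.0):
    """Convert an hours/minutes/seconds timing into seconds."""
    return hours * 3600 + minutes * 60 + seconds


mysql_s = to_seconds(hours=1, minutes=22, seconds=10.02)  # MySQL: "1 hour 22 min 10.02 sec"
spark_s = 1239.014                                         # Spark SQL: "Time taken: 1239.014 seconds"
speedup = mysql_s / spark_s

print(f"MySQL: {mysql_s:.2f} s, Spark SQL: {spark_s / 60:.1f} min, ratio: {speedup:.2f}x")
```

The ratio comes out at roughly 4x in favour of Spark SQL. It is not a like-for-like benchmark: the two queries ran against differently prepared tables, and the visit totals and row counts in the two outputs differ slightly.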
  • 24. Apache Drill
       Treat any data source as a table (even if it is not)
       Querying MongoDB with SQL
  • 26. Recap...
       MySQL:
       1. Search full dataset
          • May be pre-filtered
          • Not aggregated
       2. No parallelism
       3. Based on index?
       4. InnoDB <> columnar
       5. Partitioning?
       Spark:
       1. Dataset is already
          – Filtered (only site="en")
          – Aggregated (group by url)
       2. Parallelism (+)
       3. Not based on index
       4. Columnar (+)
       5. Partitioning (+)