4. Why Spark?
• In-memory processing with caching
• Massively parallel
• Direct access to data sources (e.g. MySQL; a Spark 2.x equivalent of this call follows below):
>>> df = sqlContext.load(source="jdbc",
...     url="jdbc:mysql://localhost?user=root",
...     dbtable="ontime.ontime_sm")
• Can store data in Hadoop HDFS / S3 / local filesystem
• Native Python and R integration
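The sqlContext.load(source="jdbc", ...) call above is the Spark 1.x API; on Spark 2.x and later the equivalent read would look roughly like this (a sketch, assuming the same illustrative MySQL URL and table):

# Sketch: Spark 2.x+ equivalent of the sqlContext.load() call above
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-read").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost?user=root")
      .option("dbtable", "ontime.ontime_sm")
      .load())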
7. Spark (vs. MySQL)
• No indexes
• All processing is a full scan
• BUT: distributed and parallel
• No transactions
• High latency (usually)
MySQL: 1 query = 1 CPU core (no intra-query parallelism)
8. Indexes (BTree): the Big Data challenge
• Creating an index for petabytes of data?
• Updating an index for petabytes of data?
• Reading a terabyte index?
• Random reads across a petabyte?
Full scan in parallel is better for big data.
9. ETL / Pipeline
MySQL (classic ETL):
1. Extract data from external source
2. Transform before loading
3. Load data into MySQL
Spark (pipeline):
1. Extract data from external source
2. Load data or rsync to all Spark nodes
3. Transform/analyze/visualize the data, in parallel (see the sketch after this list)
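A minimal sketch of the Spark-side pipeline (steps 2–3 above), assuming the pyspark shell's SparkContext `sc` and a hypothetical path to wikistat-style space-separated files:

# 2. Load: point Spark at files already copied/rsynced to shared storage
lines = sc.textFile("/data/wikistats/pagecounts-*")

# 3. Transform + analyze, executed in parallel across the cluster
counts = (lines.map(lambda l: l.split(" "))
               .filter(lambda p: len(p) == 4)      # drop malformed lines
               .map(lambda p: (p[1], int(p[2])))   # (title, num_requests)
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))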
10. Schema on Read vs. Schema on Write
Schema on Write (MySQL):
• LOAD DATA INFILE will verify (validate) the input
• … indirect data conversion
• … or fail if the number of columns is wrong
Schema on Read (Spark):
• No "load data" step per se, so nothing to validate at load time
• … CREATE EXTERNAL TABLE or read a CSV
• … validation happens on "read"/SELECT (illustrated below)
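A small schema-on-read illustration (a sketch on Spark 2.x+; the path is hypothetical, the schema matches the wikistat format used later in this deck): declaring the schema touches no data, and bad rows only surface when the DataFrame is actually read.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

schema = StructType([
    StructField("project_name", StringType()),
    StructField("title",        StringType()),
    StructField("num_requests", IntegerType()),
    StructField("content_size", IntegerType()),
])

# Nothing is validated here; in the default permissive mode, fields that
# don't fit the schema simply come back as NULL when the data is read
df = (spark.read
        .schema(schema)
        .option("sep", " ")
        .csv("/data/wikistats/pagecounts-20080101-010000"))

df.select("title", "num_requests").show(5)   # validation effectively happens now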
11. Example: Loading wikistat into MySQL
Wikipedia page counts – download, >10 TB
(http://dumps.wikimedia.org/other/pagecounts-raw/)
1. Extract data from external source and uncompress!
2. Load data into MySQL and transform:

load data local infile '$file'
into table wikistats.wikistats_full
CHARACTER SET latin1
FIELDS TERMINATED BY ' '
(project_name, title, num_requests, content_size)
set request_date = STR_TO_DATE('$datestr', '%Y%m%d %H%i%S'),
    title_md5 = unhex(md5(title));
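The '$file' and '$datestr' placeholders imply a wrapper script looping over the downloaded dumps; it is not shown in the deck, but a hypothetical Python sketch (assuming uncompressed files named pagecounts-YYYYMMDD-HHMMSS, the pagecounts-raw layout, and a local mysql client) could be:

import glob
import os
import subprocess

# Hypothetical wrapper: derive $datestr from each file name and run LOAD DATA
for path in sorted(glob.glob("/data/wikistats/pagecounts-*")):
    name = os.path.basename(path)        # e.g. pagecounts-20080101-000000
    _, d, t = name.split("-")
    datestr = "%s %s" % (d, t)           # "20080101 000000"
    sql = (
        "load data local infile '%s' "
        "into table wikistats.wikistats_full "
        "CHARACTER SET latin1 FIELDS TERMINATED BY ' ' "
        "(project_name, title, num_requests, content_size) "
        "set request_date = STR_TO_DATE('%s', '%%Y%%m%%d %%H%%i%%S'), "
        "title_md5 = unhex(md5(title))" % (path, datestr)
    )
    subprocess.check_call(["mysql", "--local-infile=1", "-e", sql])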
12. Load timing per hour of wikistat
• InnoDB: 52.34 sec
• MyISAM: 11.08 sec (+ indexes)
• 1 hour of wikistats ≈ 1 minute to load
• 1 year will load in ~6 days
  – (8765.81 hours in 1 year)
• 6 years => more than 1 month to load
Not even counting the insert-time degradation…
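The extrapolation is simple arithmetic, assuming the ~1 minute of load time per hour of data holds throughout:

# Back-of-the-envelope load-time extrapolation (constant-rate assumption)
hours_per_year = 8765.81
load_minutes_per_year = hours_per_year * 1.0          # ~1 min per hour of data

print(load_minutes_per_year / 60 / 24)                # ~6.1 days for 1 year
print(load_minutes_per_year * 6 / 60 / 24)            # ~36.5 days for 6 years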
13. Loading wikistat as-is into Spark
• Just copy the files to storage (AWS S3 / local / etc.)…
  – and create the SQL structure
• Or read the CSV, aggregate/filter in Spark, and
  – load the aggregated data into MySQL (sketched below)
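A sketch of the second option (hypothetical path; `sc` and `sqlContext` as in the pyspark shell), which also prepares the wikistats temporary table queried on the following slides; the site filter matches the "only site='en'" note in the recap:

from pyspark.sql import Row

lines = sc.textFile("/data/wikistats/pagecounts-*")
rows = (lines.map(lambda l: l.split(" "))
             .filter(lambda p: len(p) == 4 and p[0] == "en")   # English Wikipedia only
             .map(lambda p: Row(url=p[1], num_requests=int(p[2]))))

wikistats = sqlContext.createDataFrame(rows)
wikistats.registerTempTable("wikistats")   # queried as "wikistats" below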
14. Loading wikistat as-is into Spark
• How fast is the search?
  – Depends on the number of nodes
• 1000-node Spark cluster:
  – 4.5 TB, 104 billion records
  – Exec time: 45 sec
  – Scanning 4.5 TB of data (≈ 100 GB/s across the cluster)
• http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf
17. Save results to MySQL
# mydate is assumed to be set earlier in the script (e.g. "2008-01-01")
group_res = sqlContext.sql(
    "SELECT '" + mydate + "' as mydate, "
    "url, "
    "count(*) as cnt, "
    "sum(num_requests) as tot_visits "
    "FROM wikistats "
    "GROUP BY url")

# Save to MySQL
mysql_url = "jdbc:mysql://localhost?user=wikistats&password=wikistats"
group_res.write.jdbc(url=mysql_url,
                     table="wikistats.wikistats_by_day_spark",
                     mode="append")
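On Spark 2.x+ the same write can be expressed through DataFrameWriter options (a sketch; the credentials are the slide's illustrative ones, passed as options rather than embedded in the URL):

# Sketch: Spark 2.x+ equivalent of the write.jdbc() call above
(group_res.write
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost")
    .option("dbtable", "wikistats.wikistats_by_day_spark")
    .option("user", "wikistats")
    .option("password", "wikistats")
    .mode("append")
    .save())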
22. Search the WikiStats in MySQL
10 most frequently queried wiki pages in January 2008:

mysql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*) FROM
wikistats_by_day_spark where lower(url) not like '%special%' and lower(url) not like
'%page%' and lower(url) not like '%test%' and lower(url) not like '%wiki%' group by
lower(url) order by max_visits desc limit 10;
+--------------------------------------------------------+------------+----------+
| lurl                                                   | max_visits | count(*) |
+--------------------------------------------------------+------------+----------+
| heath_ledger                                           |    4247338 |      131 |
| cloverfield                                            |    3846404 |      131 |
| barack_obama                                           |    2238406 |      153 |
| 1925_in_baseball#negro_league_baseball_final_standings |    1791341 |       11 |
| the_dark_knight_(film)                                 |    1417186 |       64 |
| martin_luther_king,_jr.                                |    1394934 |      136 |
| deaths_in_2008                                         |    1372510 |       67 |
| united_states                                          |    1357253 |      167 |
| scientology                                            |    1349654 |      108 |
| portal:current_events                                  |    1261538 |      125 |
+--------------------------------------------------------+------------+----------+
10 rows in set (1 hour 22 min 10.02 sec)
23. Search the WikiStats in Spark SQL
10 most frequently queried wiki pages in January 2008:

spark-sql> CREATE TEMPORARY TABLE wikistats_parquet
USING org.apache.spark.sql.parquet
OPTIONS (
  path "/ssd/wikistats_parquet_bydate"
);
Time taken: 3.466 seconds

spark-sql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*) FROM
wikistats_parquet where lower(url) not like '%special%' and lower(url) not like '%page%'
and lower(url) not like '%test%' and lower(url) not like '%wiki%' group by lower(url)
order by max_visits desc limit 10;
heath_ledger 4247335 42
cloverfield 3846400 42
barack_obama 2238402 53
1925_in_baseball#negro_league_baseball_final_standings 1791341 11
the_dark_knight_(film) 1417183 36
martin_luther_king,_jr. 1394934 46
deaths_in_2008 1372510 38
united_states 1357251 55
scientology 1349650 44
portal:current_events 1261305 44
Time taken: 1239.014 seconds (≈ 20 min), Fetched 10 row(s)
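The deck does not show how /ssd/wikistats_parquet_bydate was produced; a plausible sketch, reusing the aggregated group_res DataFrame from the "Save results to MySQL" slide, would be a partitioned Parquet write:

# Hypothetical: write the aggregated results as Parquet, partitioned by date,
# yielding a layout like /ssd/wikistats_parquet_bydate/mydate=2008-01-01/...
group_res.write.partitionBy("mydate").parquet("/ssd/wikistats_parquet_bydate")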
26. Recap…
MySQL:
1. Searches the full dataset
   • may be pre-filtered
   • not aggregated
2. No parallelism
3. Based on an index?
4. InnoDB <> columnar
5. Partitioning?
Spark:
1. Dataset is already
   – filtered (only site="en")
   – aggregated (group by url)
2. Parallelism (+)
3. Not based on an index
4. Columnar (+)
5. Partitioning (+)