© Hitachi, Ltd. 2016. All rights reserved.
Hitachi, Ltd. OSS Solution Center
2016/10/27
Yusuke Furuyama
Yang Xie
Leveraging smart meter data
for electric utilities:
Comparison of Spark SQL with Hive
© Hitachi, Ltd. 2016. All rights reserved.
Contents
1. Leveraging smart meter data [Sample use case for electric utilities]
2. Performance evaluation of MapReduce and Spark 1.6 (using Hive and Spark SQL)
3. Additional evaluation with Spark 2.0
4. Summary
1
2© Hitachi, Ltd. 2016. All rights reserved.
1. Leveraging smart meter data
[Sample use case for electric utilities]
3© Hitachi, Ltd. 2016. All rights reserved.
1-1 Hitachi’s Social innovation business approach
4© Hitachi, Ltd. 2016. All rights reserved.
1-2 Situation of electric utilities and their needs
• Electric utilities have to adapt to a competitive free market
• The government is requesting cuts to the power transmission fee
 Liberalization of the retail electric power market
 Need to cut the cost of transmission and distribution equipment
• Today, transmission and distribution equipment is replaced periodically
• Goal: decide the timing of replacement based on the condition of each piece of equipment
Maintenance team needs: obtain the load status of each piece of equipment
[Diagram: Electric Power Generation Companies / Transmission and Distribution Companies (power plant, transmission lines, substation, distribution lines) / Electricity retailers / Home, with the electricity rate and power transmission fee flows]
5© Hitachi, Ltd. 2016. All rights reserved.
1-3 Situation of electric utilities and their needs (future)
• Nuclear plants, a stable power supply, are decreasing
• Renewable energy supply is increasing
 Unstable power supply
 Need for high-level demand response
• Rates by time zone (current demand response)
• Many small renewable energy suppliers
• Near real-time demand response for each distribution system
Planning team needs: obtain the near real-time load status of each piece of equipment
[Diagram: Electric Power Generation Companies / Transmission and Distribution Companies (power plant, transmission lines, substation, distribution lines) / Electricity retailers / Home, with the electricity rate and power transmission fee flows]
6© Hitachi, Ltd. 2016. All rights reserved.
1-4 Leveraging big data for electric utilities
Power Grid hierarchy (counts): Substation (1,000) → Distribution Lines (5/Substation) → Switches (20/Distribution Line) and Transformers (500/Distribution Line) → Smart Meters (4/Transformer), 10,000,000 meters in total
Data from smart meters (every 30 min.) → Meter Data Management System → Data Analysis System
Analyze the data from smart meters to grasp the load status of equipment
 Meet the needs of electric utilities
Concern about the performance needed for near real-time processing
7© Hitachi, Ltd. 2016. All rights reserved.
1-5 System Component
Raw data from the Meter Data Management System → data processing platform (Hadoop, Spark) → processed data → data visualization tool (Pentaho) → Planner (Equipment / Demand response)
Data Analysis System = data processing platform (Hadoop, Spark) + data visualization tool (Pentaho)
Target of this session: the Data Analysis System
8© Hitachi, Ltd. 2016. All rights reserved.
2. Performance evaluation of MapReduce and Spark 1.6
(using Hive and Spark SQL)
9© Hitachi, Ltd. 2016. All rights reserved.
2-1 Use Case for electric utilities
 Needs of electric utilities (recap)
・Analyze the data from smart meters to grasp the load status of equipment
・Near real-time
 Points of analysis
・Find the equipment that needs to be replaced → Find the equipment that has a heavy workload
・Estimate the timing for replacement → Check the trend of the load status
・Select the proper capacity of new equipment → Extract the peak of the load
10© Hitachi, Ltd. 2016. All rights reserved.
2-2 Contents of performance evaluation
• Concern about the performance of processing 10,000,000 meters
• Data arrives from each smart meter every 30 min (48 readings/day, per the smart meter spec)
 Point of evaluation for near real-time processing
Check whether MapReduce and Spark can process the data within 30 min.
 Items for evaluation
・Find the equipment that has a heavy workload → aggregate per Equipment (Distribution system / Substation / Switch / Transformer / Meter)
・Check the trend of the load status → aggregate per Term (1 day / 1 month (30 days) / 1 year (365 days))
・Extract the peak of the load → aggregate per Time zone (specific 30 min of each day / 24 h)
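A minimal PySpark sketch of the two aggregation shapes evaluated here (a specific 30-min column vs. all 48 columns of a day), per equipment. The table and column names (meter_usage, transformer_id, usage_0000 ... usage_2330) are assumptions for illustration; the actual schema and query set used in the evaluation are not shown in the slides.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("meter-aggregation-sketch").getOrCreate()

# "Specific 30 min of each day": touch a single usage column, aggregated per equipment.
per_equipment_half_hour = spark.sql("""
    SELECT transformer_id, SUM(usage_0000) AS total_usage_0000
    FROM   meter_usage
    WHERE  reading_date BETWEEN '2016-09-01' AND '2016-09-30'
    GROUP  BY transformer_id
""")

# "24h": sum the 48 half-hour columns into a daily value, then aggregate per equipment.
# Only three columns are spelled out here; the remaining 45 follow the same pattern.
per_equipment_daily = spark.sql("""
    SELECT transformer_id, reading_date,
           SUM(usage_0000 + usage_0030 + usage_2330) AS usage_per_day
    FROM   meter_usage
    GROUP  BY transformer_id, reading_date
""")

per_equipment_half_hour.show()
per_equipment_daily.show()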
11© Hitachi, Ltd. 2016. All rights reserved.
2-3 Target of performance evaluation
 Target of evaluation
Data Analysis System: Meter Data Management System → Aggregation batch (Hive / Spark SQL) on the Data Processing Platform (MapReduce, Spark 1.6) → Data visualization tool (Pentaho)
Target: the aggregation batch, measured from Start (beginning of the aggregation batch for meter data) to End (end of the batch)
12© Hitachi, Ltd. 2016. All rights reserved.
2-4 Target of performance evaluation
 System Configuration
1 Master Node (Virtual Machine) and 4 Slave Nodes (Physical Machines, each with local disks); network: 10 Gbps LAN, 10 Gbps SW, 1 Gbps LAN
 Spec
Master Node: CPU cores 2, memory 8 GB, disk 80 GB x 1
Slave Nodes (per node / total of 4): CPU cores 16 / 64, memory 128 GB / 512 GB, disks 900 GB x 6 / 24 disks, total disk capacity 5.4 TB (5,400 GB) / 21.6 TB (21,600 GB)
13© Hitachi, Ltd. 2016. All rights reserved.
2-5 Dataset
 Smart meter data
Term # of records Size (CSV) Size (ORC File)
365days (1year) 3,650 million 2.475 TB 1.325 TB
30days (1month) 300 million 0.205 TB 0.158 TB
1day 10 million 0.007 TB 0.005 TB
Record layout (one record per meter per day, meter1 ... meter10,000,000): 0:00-0:30 power usage, 0:30-1:00 power usage, ・・・, 23:30-0:00 power usage, meter mgmt. info
Data size: 48 usage columns (every 30 min), 10,000,000 records/day
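A hypothetical Spark SQL (Hive-compatible) sketch of how such a 48-column smart meter table could be declared as text (CSV) and as ORC, and converted once from text to ORC. Table and column names are assumptions for illustration, not the schema actually used in the evaluation.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("meter-table-sketch")
         .enableHiveSupport()
         .getOrCreate())

# usage_0000 DOUBLE, usage_0030 DOUBLE, ..., usage_2330 DOUBLE (48 columns)
usage_columns = ", ".join(
    "usage_{:02d}{:02d} DOUBLE".format(h, m) for h in range(24) for m in (0, 30)
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS meter_usage_csv (
        meter_id BIGINT, reading_date DATE, {cols}, meter_info STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
""".format(cols=usage_columns))

spark.sql("""
    CREATE TABLE IF NOT EXISTS meter_usage_orc (
        meter_id BIGINT, reading_date DATE, {cols}, meter_info STRING)
    STORED AS ORC
""".format(cols=usage_columns))

# One-time conversion: rewrite the text data in columnar, compressed ORC form.
spark.sql("INSERT OVERWRITE TABLE meter_usage_orc SELECT * FROM meter_usage_csv")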
14© Hitachi, Ltd. 2016. All rights reserved.
2-6 Contents of performance evaluation (recap)
 Point of evaluation
Check whether MapReduce and Spark can process the data within 30 min.
 Target of evaluation
Time from the start of the aggregation batch for meter data to the end of the batch
 Items for evaluation
・Find the equipment that has a heavy workload → aggregate per Equipment (Distribution system / Substation / Switch / Transformer / Meter)
・Check the trend of the load status → aggregate per Term (1 day / 1 month (30 days) / 1 year (365 days))
・Extract the peak of the load → aggregate per Time zone (specific 30 min of each day / 24 h)
15© Hitachi, Ltd. 2016. All rights reserved.
2-6 Contents of performance evaluation (recap)
 Point of evaluation
Check whether MapReduce and Spark can process the data within 30 min.
 Target of evaluation
Time from the start of the aggregation batch for meter data to the end of the batch
 Items for evaluation
・Find the equipment that has a heavy workload → aggregate per Equipment (Distribution system / Substation / Switch / Transformer / Meter)
・Check the trend of the load status → aggregate per Term (1 day / 1 month (30 days) / 1 year (365 days))
・Extract the peak of the load → aggregate per Time zone (specific 30 min of each day / 24 h)
+ File type: Text (CSV) / ORCFile (column-based)
16© Hitachi, Ltd. 2016. All rights reserved.
2-7 Comparison of TXT with ORCFile (MapReduce)
• Couldn't finish processing in 30 min (1800 sec)
• Performance improvement by ORCFile
Result (24h): Hive + ORCFile 62 sec (1 day), 224 sec (30 days), 3286 sec (365 days); Hive + TXT 99 sec (1 day), 338 sec (30 days), 6962 sec (365 days)
Result (0.5h): Hive + ORCFile 31 sec (1 day), 69 sec (30 days), 492 sec (365 days); Hive + TXT 32 sec (1 day), 138 sec (30 days), 3015 sec (365 days)
17© Hitachi, Ltd. 2016. All rights reserved.
2-8 Comparison of TXT with ORCFile (Spark 1.6)
• Could finish processing in 30 min (1800 sec)
• Performance improvement by ORCFile
Result (24h): Spark SQL + ORCFile 28 sec (1 day), 50 sec (30 days), 993 sec (365 days); Spark SQL + TXT 47 sec (1 day), 154 sec (30 days), 1408 sec (365 days)
Result (0.5h): Spark SQL + ORCFile 26 sec (1 day), 32 sec (30 days), 169 sec (365 days); Spark SQL + TXT 30 sec (1 day), 111 sec (30 days), 1263 sec (365 days)
18© Hitachi, Ltd. 2016. All rights reserved.
2-9 Review of the results
Why the processing was fast with ORCFile
Features of ORCFile: compression, reading specific columns only
 Processing 0.5h data: only 1 of the 48 usage columns (e.g. the 0:00 column) is read from the smart meter data
 Processing 24h data: all 48 columns (0:00 through 23:30) are read and summed into a SUM/day per meter
Results
• Processing big data with ORCFile was more effective than processing small data
• Processing 0.5h data with ORCFile was more effective than processing 24h data
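A small sketch to illustrate the column-pruning effect described above: against the ORC table, the 0.5h aggregation references only the one usage column it needs, so the columnar reader can skip the other 47. Table and column names follow the earlier hypothetical sketch; how the pruning shows up in the plan output depends on the Spark version and ORC reader configuration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

half_hour = spark.sql("""
    SELECT meter_id, SUM(usage_1200) AS total_usage_1200
    FROM   meter_usage_orc
    GROUP  BY meter_id
""")
# Inspect the physical plan; only meter_id and usage_1200 should be needed by the scan.
half_hour.explain()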
19© Hitachi, Ltd. 2016. All rights reserved.
2-10 Comparison of MapReduce with Spark 1.6
・Could finish processing the data from 10,000,000 meters in 30 min (1800 sec) using Spark 1.6!
・Spark performs well with "per equipment" processing
Result (30 days / 24h): Distribution system: Spark SQL 50 sec vs Hive 224 sec (78% down); Per equipment: Spark SQL 145 sec vs Hive 718 sec (80% down)
Result (30 days / 0.5h): Distribution system: Spark SQL 32 sec vs Hive 69 sec (54% down); Per equipment: Spark SQL 104 sec vs Hive 556 sec (81% down)
Result (365 days / 24h): Distribution system: Spark SQL 993 sec vs Hive 3286 sec; Per equipment: Spark SQL 1359 sec vs Hive 4131 sec
20© Hitachi, Ltd. 2016. All rights reserved.
2-11 Review of the results
 Why was the processing per equipment more effective than the processing for the entire distribution system when using Spark?
 Per equipment: the smart meter data (48 half-hour columns, 0:00-23:30) is joined and aggregated step by step up the equipment hierarchy: Transformer (2,500,000) → Switch (100,000) → Distribution Line (5,000) → Substation (1,000) → Distribution System
 For the entire distribution system: the smart meter data is joined directly with the Distribution System
・Less disk I/O than MapReduce
・Data (including the re-distributed shuffle data) is smaller than the total memory of the cluster
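A hedged PySpark sketch of the per-equipment aggregation pattern discussed above: daily meter totals joined up the equipment hierarchy and summed at each level, mirroring the JOIN and GROUP BY+SUM chain in the slide. All table and column names (meter_daily_usage, meter_to_transformer, transformer_to_switch, ...) are illustrative assumptions, not the actual data model.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

daily = spark.table("meter_daily_usage")      # meter_id, reading_date, usage_per_day
m2t   = spark.table("meter_to_transformer")   # meter_id, transformer_id
t2s   = spark.table("transformer_to_switch")  # transformer_id, switch_id

# Per transformer (2,500,000 rows): one JOIN plus GROUP BY/SUM.
per_transformer = (daily.join(m2t, "meter_id")
                        .groupBy("transformer_id", "reading_date")
                        .agg(F.sum("usage_per_day").alias("usage_per_day")))

# Per switch (100,000 rows): reuse the previous level, one more JOIN plus GROUP BY/SUM.
per_switch = (per_transformer.join(t2s, "transformer_id")
                             .groupBy("switch_id", "reading_date")
                             .agg(F.sum("usage_per_day").alias("usage_per_day")))

# Distribution line, substation, and distribution system would follow the same pattern.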
21© Hitachi, Ltd. 2016. All rights reserved.
3. Additional evaluation with Spark 2.0
22© Hitachi, Ltd. 2016. All rights reserved.
3-1 Evaluation environment
 System Configuration
1 Master Node (Physical) and 6 Slave Nodes (Physical, each with local disks); network: 10 Gbps LAN, 10 Gbps SW
 Spec
Master Node: CPU cores 16, memory 12 GB, disks 900 GB x 8
Slave Nodes (per node / total of 6): CPU cores 20 / 120, memory 384 GB / 2304 GB, disks 1200 GB x 10 / 60 disks, total disk capacity 12.0 TB (12,000 GB) / 72.0 TB (72,000 GB)
23© Hitachi, Ltd. 2016. All rights reserved.
3-2 Comparison of Spark 2.0 with Spark 1.6
・Performance improvement of roughly 20% (including disk I/O)
・More effective with small data
Per-equipment aggregation; time from the start of the aggregation batch for meter data to the end of the batch:
Result (365 days / 24h): Spark 2.0 316 sec vs Spark 1.6 322 sec
Result (30 days / 24h): Spark 2.0 87 sec vs Spark 1.6 102 sec
Result (1 day / 24h): Spark 2.0 71 sec vs Spark 1.6 89 sec
24© Hitachi, Ltd. 2016. All rights reserved.
Demo
25© Hitachi, Ltd. 2016. All rights reserved.
Demo: Data aggregation and visualization
① Show 29 days of aggregated data in the data visualization tool (Pentaho)
② Execute the aggregation batch (Spark SQL) on 1 day of raw data
③ Show 30 days of aggregated data
④ Outlier detected!!
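A sketch of the demo's step ②, assuming the hypothetical tables from the earlier sketches: aggregate one new day of raw meter data with Spark SQL and append it to the aggregated table that the visualization tool reads, turning the 29-day view into a 30-day view. Table, column, and date values are illustrative only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    INSERT INTO TABLE aggregated_daily_usage
    SELECT t.transformer_id, m.reading_date,
           SUM(m.usage_0000 + m.usage_0030 + m.usage_2330) AS usage_per_day  -- remaining columns elided
    FROM   meter_usage_orc m
    JOIN   meter_to_transformer t ON t.meter_id = m.meter_id
    WHERE  m.reading_date = '2016-09-30'
    GROUP  BY t.transformer_id, m.reading_date
""")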
26© Hitachi, Ltd. 2016. All rights reserved.
4. Summary
27© Hitachi, Ltd. 2016. All rights reserved.
4 Summary
 Leveraging data from 10,000,000 smart meters for electric utilities
- Built a data analysis system: Meter Data Management System → aggregation batch (Hive / Spark SQL) on the data processing platform (MapReduce, Spark) → data visualization tool (Pentaho)
- Concern about performance
 Evaluated the performance of batch processing
- Spark could process the data from 10,000,000 meters in 30 min (4 nodes)
 Evaluated the performance of Spark 2.0
- Performance improvement of roughly 20% (compared to 1.6)
© Hitachi, Ltd. 2016. All rights reserved.
Hitachi, Ltd. OSS Solution Center
Comparison of Spark SQL with Hive
Leveraging smart meter data for electric utilities:
2016/10/27
Yusuke Furuyama
Yang Xie
END
28
29© Hitachi, Ltd. 2016. All rights reserved.
Notice regarding references to other companies' product names and trademarks
 Hadoop, Hive, and Spark are registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.
 Other company names and product names mentioned herein are trademarks or registered trademarks of their respective companies.
31© Hitachi, Ltd. 2016. All rights reserved.
Appendix. Difficulty with large data to be shuffled
 Attempted to aggregate the raw (48-column) meter data per equipment
- Extremely slow (Spark 2.0) or job failed (Spark 1.6)
- Processing: iteration of JOIN and GROUP BY+SUM
- Huge amount of data to be shuffled (spilled out from the page cache)
- Heavy load on a single local disk (the OS disk) from shuffle
 Added the HDFS data disks as disks for shuffle (Spark 2.0.0)
- Performance improved for 365 days: 13281 sec before → 3650 sec after
- Performance degraded for 1 day / 30 days: 76 sec → 85 sec and 223 sec → 236 sec
Lessons:
・Data for Spark (including temporary data) should be smaller than memory.
・Better to run a trial first to estimate the required resources.
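A hedged sketch of spreading Spark's shuffle/spill directories across the additional data disks. spark.local.dir is a standard Spark property that accepts a comma-separated list of directories; the mount points below are assumptions, and on YARN the NodeManager's configured local directories take precedence over this setting.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("shuffle-on-data-disks")
         # Spread shuffle and spill files across several physical disks
         # instead of concentrating them on the single OS disk.
         .config("spark.local.dir", "/data01/spark-tmp,/data02/spark-tmp,/data03/spark-tmp")
         .getOrCreate())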