Energy companies deal with huge amounts of data, and Apache Spark is an ideal platform for developing machine learning applications for forecasting and pricing. In this talk, we will discuss how Apache Spark's MLlib library can be used to build scalable analytics for clustering, classification, and forecasting, primarily for energy applications using electricity and weather datasets. Through a demo, we will illustrate a workflow approach to an end-to-end pipeline, from data pre-processing to deployment, for the above use cases using PySpark and Python.
1. Energy Analytics
with Spark
SRI KRISHNAMURTHY
QUANTUNIVERSITY LLC.
QUANTUNIVERSITY@GMAIL.COM
Copyright 2015 QuantUniversity LLC.
2. - ADVISORY SERVICES
- CUSTOM TRAINING PROGRAMS
- PLATFORM FOR LARGE SCALE SIMULATIONS AND ANALYTICS
- ARCHITECTURE REVIEW, TRAINING AND AUDITS
COMING SOON: ANALYTICS CERTIFICATE PROGRAM!
3. • Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Analytics
• Prior experience at MathWorks, Citigroup, and
Endeca, and with 25+ financial services customers
• Regular Columnist for the Wilmott Magazine
• Author of the forthcoming book
“Financial Modeling: A Case Study Approach,”
to be published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Teaches Analytics in the Babson College MBA
program and at Northeastern University
Sri Krishnamurthy
Founder and CEO
Speaker Bio
4. Agenda
1. Energy Analytics 101
2. A quick introduction to Apache Spark
3. Fundamentals and setup
4. Energy Analytics use-cases
Files for today’s workshop
http://bit.ly/1F6E91O
6. Actionable analytics enables engagement!
1. What if energy companies could understand
customers' usage better?
2. What if energy companies could understand
what drives customer energy usage?
3. What if energy companies engaged with
customers by providing actionable
analytics, so that customers can monitor
and plan for wide swings in energy usage?
7. Customer demand isn't uniform
What changed from Jan 2nd to Jan 3rd?
Does this customer heavily use energy from 11-9 pm?
8. Problem
1. Lots of data
2. Utilities can have more than 100K customers
3. Truly a big data problem, with multiple dimensions and dirty data
9. Use case 1: Customer Segmentation
Segmenting customers
- How can we segment customers into groups?
- Typically, clustering algorithms like k-means are used
What is k-means?
http://shabal.in/visuals/kmeans/2.html
10. Use case 2: Load Forecasting
Given parameters like temperature, day of week, month, and time of day, can we predict load?
Typically, methods like regression are used:
Load = f(temperature, day of week, month, time of day, etc.)
11. What is Spark?
Apache Spark™ is a fast and general engine for large-scale data processing.
Came out of U.C. Berkeley's AMPLab
"Lightning-fast cluster computing"
12. Why Spark?
Speed
Run programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk.
Spark has an advanced DAG execution engine that
supports cyclic data flow and in-memory computing.
13. Why Spark?
Ease of Use
• Write applications quickly in Java, Scala, Python, or R.
• Spark offers over 80 high-level operators that
make it easy to build parallel apps, and you can
use it interactively from the Scala and Python
shells.
• R support was recently added.
Word count in Spark's Python API (as run in the PySpark shell, where sc is the SparkContext):
text_file = sc.textFile("hdfs://...")
counts = (text_file.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
14. Why Spark?
Generality
Combine SQL, streaming, and complex
analytics.
Spark powers a stack of high-level tools
including:
1. Spark Streaming: processing real-time data
streams
2. Spark SQL and DataFrames: support for
structured data and relational queries
3. MLlib: built-in machine learning library
4. GraphX: Spark’s new API for graph
processing
15. Why Spark?
Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. Access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.
16. Key Features of Spark
• Handles batch, interactive, and real-time processing within a single framework
• Native integration with Java, Python, Scala, and R
• Programming at a higher level of abstraction
• More general: map/reduce is just one set of supported constructs
18. How does it work?
Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.
Transformations create a new dataset from an existing one. All transformations in Spark are lazy: they do not compute their results right away; instead, they remember the transformations applied to some base dataset.
Actions return a value to the driver program after running a computation on the dataset.
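The distinction is easy to see in the PySpark shell; a minimal sketch (the HDFS path and the parsing logic are placeholders):
rdd = sc.textFile("hdfs://...")                    # nothing is read yet
readings = rdd.map(lambda line: line.split(","))   # transformation: lazy
nonempty = readings.filter(lambda r: len(r) > 1)   # transformation: lazy
n = nonempty.count()       # action: triggers the whole computation
first = nonempty.first()   # action: returns one element to the driver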
22. Spark Setup
1. Get Spark 1.5 from http://spark.apache.org/
Select the latest Spark release (1.5), a prebuilt package for Hadoop 2.6, and download directly.
Windows
TEST_SPARK_ENV = C:\spark-1.5.0-bin-hadoop2.6
SPARK_HOME = %TEST_SPARK_ENV%
Add %TEST_SPARK_ENV%\bin to PATH
Unix/Mac
export SPARK_HOME=/srv/spark
export PATH=$SPARK_HOME/bin:$PATH
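As a quick sanity check (assuming the download and PATH changes above), the PySpark shell should now start and report the version:
$ pyspark
>>> print(sc.version)
1.5.0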
23. Set up IPython Notebook
Install the latest Anaconda distribution from https://store.continuum.io/cshop/anaconda/
ipython profile create spark1.5.0
Go to:
C:\Users\sri-dell\.ipython\profile_spark1.5.0\startup and create a file:
00-spark1.5.0-setup.py
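A common pattern for this startup file is sketched below, assuming SPARK_HOME is set as above (the py4j zip name must match the one shipped with your Spark build):
# 00-spark1.5.0-setup.py: put PySpark on sys.path and bootstrap it in IPython (Python 2)
import os
import sys

spark_home = os.environ.get('SPARK_HOME')
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
# Spark 1.5.0 ships py4j-0.8.2.1; adjust the name for other builds
sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'py4j-0.8.2.1-src.zip'))
# Run the same bootstrap as the pyspark shell, creating sc and sqlContext
execfile(os.path.join(spark_home, 'python', 'pyspark', 'shell.py'))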
27. To resolve this (the missing winutils.exe error on Windows):
Get winutils.exe from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and
put it in %SPARK_HOME%\bin
Add the following environment variables:
set HADOOP_HOME=%SPARK_HOME%
set HADOOP_CONF_DIR=%SPARK_HOME%
28. To reduce logging messages
Copy
%SPARK_HOME%/conf/log4j.properties.template
to
%SPARK_HOME%/conf/log4j.properties
and change log4j.rootCategory from INFO to WARN
29. Key Features used in Spark
1. DataFrames
◦ http://spark.apache.org/docs/latest/sql-programming-guide.html
2. pyspark.ml API
◦ http://spark.apache.org/docs/latest/api/python/index.html
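As a sketch of loading today's data into a DataFrame (assuming the Forecast.csv file from the workshop materials and the external spark-csv package, e.g. a shell started with pyspark --packages com.databricks:spark-csv_2.10:1.2.0):
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)   # sc is created by the pyspark shell
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .options(header="true", inferSchema="true")
      .load("Forecast.csv"))
df.printSchema()   # inspect the inferred columns and types
df.show(5)         # peek at the first rows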
30. Use case 1: Customer Segmentation
Segmenting customers
- How can we segment customers into groups?
- Typically, clustering algorithms like k-means are used
What is k-means?
http://shabal.in/visuals/kmeans/2.html
31. K-means
Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:
$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$
where $\mu_i$ is the mean of points in $S_i$.
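A hedged sketch of customer segmentation with the pyspark.ml KMeans API (Spark 1.5); the DataFrame usage_df and its feature columns are hypothetical, not the demo's actual schema:
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Assume usage_df has one row per customer with numeric usage features
# (the column names below are assumptions for illustration)
assembler = VectorAssembler(
    inputCols=["avg_load", "peak_load", "night_day_ratio"],
    outputCol="features")
features_df = assembler.transform(usage_df)

kmeans = KMeans(k=5, seed=42, featuresCol="features", predictionCol="cluster")
model = kmeans.fit(features_df)
segmented = model.transform(features_df)     # adds a 'cluster' column
segmented.groupBy("cluster").count().show()  # cluster sizes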
34. Use case 2: Load Forecasting
Given parameters like temperature, day of week, month, and time of day, can we predict load?
Typically, methods like regression are used:
Load = f(temperature, day of week, month, time of day, etc.)
38. Data Sources
EnerNOC energy-usage dataset (detailed on the next slide)
Temperature data for more than 100 weather stations corresponding to the sites
39. EnerNOC dataset
248 unique customers
30 million records
5 minute intervals
35 different subindustries
US, Canada and Australia
40. Demo
Data: Forecast.csv
Goal: build a linear regression model to enable load forecasts given day, hour, month, and
temperature
Notebook: Load Forecasting-Dataframe.ipynb
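A hedged sketch of what that model looks like in the pyspark.ml API; the column names (day, hour, month, temperature, load) are assumptions based on the stated goal, not the notebook's actual schema:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Assume forecast_df was loaded from Forecast.csv (see the DataFrame
# example earlier) with numeric columns day, hour, month, temperature, load
assembler = VectorAssembler(
    inputCols=["day", "hour", "month", "temperature"],
    outputCol="features")
train_df = assembler.transform(forecast_df).select("features", "load")

lr = LinearRegression(featuresCol="features", labelCol="load", maxIter=50)
model = lr.fit(train_df)
print(model.weights, model.intercept)  # Load = f(temperature, day, ...);
                                       # weights is renamed coefficients in Spark >= 1.6
predictions = model.transform(train_df)  # adds a 'prediction' column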
44. References
1. Spark documentation and http://spark.apache.org/
2. Spark presentations primarily by Spark founders and the Databricks team
45. Thank you!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
www.analyticscertificate.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.