8. Open Energi in the coming year:
• 25-40k messages processed per second
• Total size of data 500TB-800TB
Perspective: here’s what “big data” means to Boeing [1]:
• ~64k messages per second from each aircraft
• Total size of data over 100 petabytes
[1]: http://bit.ly/18kQlMn
11. …but after domestic demand-side response (or something else on that scale)
[Bar chart: size of data (PB), Open Energi vs Boeing, on a scale of 0-120 PB]
12. Why Hortonworks Data Platform
• Can scale quickly to respond to market demands
• Interoperability with existing code
• Fantastic data integration
• Knowledgeable technical support
• Security and data governance
13. Batch | Our HDP setup
[Diagram: asset data, national electricity data, market data and other “live” timeseries data are ingested via Flume, in batch and streaming modes, into Hive, which feeds other applications]
14. Real-time | (Work ongoing)
[Diagram: asset data feeds ML models backed by HDFS, a cache and Elasticsearch, with processing stages to enrich data, correlate events and update the ML models]
15. Apache Hive | Example
Index semi-structured data (Elasticsearch):

CREATE EXTERNAL TABLE semi_structured_stuff (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'semi/structured',
              'es.index.auto.create' = 'false');

Use Hive to integrate this with timeseries data and other metadata:

SELECT something FROM semi_structured_stuff
JOIN metadata m ON …
LEFT JOIN timeseries t ON …

Farm out complex analytics to Python:

SELECT TRANSFORM(something)
USING 'insane_maths.py'
AS (result)
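Hive's TRANSFORM streams each row to the script as a tab-separated line on stdin and reads tab-separated result rows back from stdout. A minimal sketch of what an insane_maths.py-style script could look like (the maths itself is a placeholder, not Open Energi's actual analytics):

```python
#!/usr/bin/env python
# Sketch of a Hive TRANSFORM script: Hive pipes rows in as
# tab-separated lines on stdin and expects tab-separated lines on stdout.
import math
import sys

def transform(value: float) -> float:
    """Placeholder for the 'insane maths'; here just a log transform."""
    return math.log1p(value)

def run(lines):
    """Yield one output line per tab-separated input row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield f"{transform(float(fields[0]))}\n"

if __name__ == "__main__":
    sys.stdout.writelines(run(sys.stdin))
```

Any columns selected in the TRANSFORM clause arrive in order as fields; multi-column results are emitted tab-separated the same way.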
16. Benefits
• Reduced storage cost compared to SAN + SQL Server
• Better utilisation of infrastructure thanks to YARN
• Pain-free integration of multiple data sources with external tables in Hive
• Scale up/down on demand
• Re-use existing Python code = low development overhead
There is a powerful economic case for distributing demand more efficiently using DSR technology, regardless of the future generation mix.
The capital cost of building a new peaking power station can be up to £5 million per megawatt of power.
The current cost of aggregating a megawatt via Dynamic Demand is around £200,000.
It provides a no-build approach to capacity challenges which is cleaner, cheaper, more secure and faster than the alternatives.
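Taken at face value, the two figures above imply a large per-megawatt saving; a rough, illustrative calculation (figures from the notes, not a formal cost model):

```python
# Back-of-the-envelope comparison of the per-MW figures quoted above.
new_peaker_capex_per_mw = 5_000_000   # up to £5m per MW of new peaking plant
dynamic_demand_cost_per_mw = 200_000  # ~£200k per MW aggregated via Dynamic Demand

ratio = new_peaker_capex_per_mw / dynamic_demand_cost_per_mw
print(f"Aggregating a MW is ~{ratio:.0f}x cheaper than building one")  # → ~25x
```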
- Open Energi is turning the energy system on its head, so that instead of supply adjusting to meet demand, demand adjusts to meet supply
By harnessing small amounts of flexible energy demand from energy-intensive equipment we can create a virtual power station and displace fossil-fuelled peaking power stations
This is enabling a user-led transformation in how our energy system works, so that businesses and consumers are not only making it happen, but also seeing the benefits
It’s a vital part of our transition to a zero carbon economy because we cannot maximise our use of renewables unless our demand for energy becomes more responsive
Dynamic Demand can deliver approx. £85,000 per MW per year
FCDM / static FFR: £22,000-£26,000 per MW per year
STOR: £10,000-£15,000 per MW per year
We capture data at the finest-grained level, stored as change-of-value (COV) records.
The challenge is then aggregating multiple timeseries without downsampling; we also need to downsample all of these series to multiple resolutions.
They are all irregularly sampled, hence the challenge, which prevents us from using off-the-shelf timeseries databases.
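Assuming COV here means change-of-value records (my reading, not stated in the deck), the natural way to put such an irregular series onto a regular grid is to carry the last observed value forward. A minimal stdlib-only sketch with made-up data:

```python
from bisect import bisect_right

def resample_cov(samples, start, stop, step):
    """Downsample an irregular change-of-value (COV) series to a fixed
    resolution by carrying the last observed value forward.

    samples: list of (timestamp, value) pairs sorted by timestamp.
    Returns (timestamp, value) pairs at regular `step` intervals;
    value is None before the first observation.
    """
    times = [t for t, _ in samples]
    out = []
    t = start
    while t < stop:
        i = bisect_right(times, t)  # count of samples at or before t
        out.append((t, samples[i - 1][1] if i else None))
        t += step
    return out

# Irregularly spaced COV records: the value only changes at t=0, 7 and 21.
cov = [(0, 10.0), (7, 12.5), (21, 11.0)]
print(resample_cov(cov, start=0, stop=30, step=10))
# → [(0, 10.0), (10, 12.5), (20, 12.5)]
```

Aggregating several series without downsampling amounts to merging their raw (timestamp, value) streams instead of snapping them to a grid first.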
Confidence that our data platform can scale quickly if needed
The markets we operate in are unpredictable
When the domestic market takes off, our data volume could increase by two orders of magnitude!
Fantastic data integration support
Can easily wrap our existing codebase
Reduce our £/GB by 80% for archival data while retaining ability to query
Extensibility
New tools being added to the ecosystem on a regular basis
More and more developers trained in Hadoop ecosystem means easier on-boarding
Knowledgeable support from Hortonworks
Security and governance built into platform
This is ongoing work and in particular we haven’t quite figured out the “asset data” -> Storm bit.
Not limited by storage cost – able to enrich data to reduce cost of processing
Better utilisation of infrastructure compared to VMs dedicated to a single service – here YARN means we can really get the most out of everything
Ability to mix Python with SQL means easier/maintainable aggregation/downsampling
Interactive querying of multiple data sources with Spark in Jupyter
Easy ingestion process using multiple Flume agents
Can still use Elasticsearch for small timeseries
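On the ingestion point: a Flume agent is configured as named sources, channels and sinks wired together in a properties file. A hypothetical agent receiving asset data over Avro and landing it in HDFS might look roughly like this (agent name, port and path are illustrative, not Open Energi's actual setup):

```properties
# One of several Flume agents; others would ingest the other feeds.
agent1.sources = assetSource
agent1.channels = memChannel
agent1.sinks = hdfsSink

# Receive events over Avro RPC.
agent1.sources.assetSource.type = avro
agent1.sources.assetSource.bind = 0.0.0.0
agent1.sources.assetSource.port = 41414
agent1.sources.assetSource.channels = memChannel

# Buffer in memory between source and sink.
agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 10000

# Land the raw events in HDFS for Hive to pick up.
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = /data/asset/raw
agent1.sinks.hdfsSink.channel = memChannel
```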
Now let’s have a look at where HDP fits into our big “wheel of data”.