Operationalizing models and responding to large volumes of data, fast, often requires bolt-on systems that struggle with processing (transforming the data), consistency (always responding to the data), and scalability (processing and responding to large data volumes). When data volumes grow too large, these traditional systems fail to deliver their responses, resulting in significant losses for organizations. Join this breakout to learn how to overcome these roadblocks.
McKinsey & Company: http://www.mckinsey.com/insights/business_technology/getting_the_cmo_and_cio_to_work_as_partners
“Wall Street’s quest to process data at the speed of light,” InformationWeek, April 21, 2007
Only talking about numerical and categorical data here. Not talking text or images because of HBase’s practical limit of roughly 1-2 GB per entity.
Recommendations are a list of content that, based on the entity’s behavior, fits the next best action.
Anomaly detection is the inverse of recommendations: based on their behavior these entities should fit the recommendations, but they fall outside the norm. Signal an alert.
Model scores are numerical aggregates that you can send to the right system.
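A loose sketch of these three output shapes as simple Java types (purely illustrative; these class names are not Opower’s actual model):

    // Illustrative shapes for the three insight outputs described above.
    // All names here are hypothetical.
    import java.util.List;

    class InsightOutputs {
        // Next-best-action content ranked against the entity's observed behavior.
        static class Recommendation { String entityId; List<String> nextBestActions; }
        // Inverse case: the entity should fit a recommendation profile, but its
        // behavior falls outside the norm, so we signal an alert.
        static class AnomalyAlert { String entityId; double deviationFromNorm; }
        // Numerical aggregate routed to whichever downstream system needs it.
        static class ModelScore { String entityId; String modelName; double score; }
    }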
Limited data – Can’t bring in unstructured data, and historic data is moved offline because the system can’t scale.
Drill-down performance – Individual insight drill-downs take time, and ad-hoc queries steal resources from the operational analytics workload.
Analytic latency – Processing data volumes at speed is inefficient on traditional systems, and latency is unacceptable (it hurts the business or the customer relationship).
Data scale – Bring in structured and unstructured data; keep data online for deeper drill-downs.
Ad-hoc queries – Perform drill-downs without compromising system performance.
Low latency – Scale data processing to meet analytics SLAs.
Opower Intro: Who is Opower and what does Opower do?
Produce energy insights to help utilities and customers manage energy consumption.
100+ million meter reads received daily. Millions of individual insight calculations routinely created, from simple trending analytics to more advanced forecasting/prediction.
Energy saved: 5+ TWh, $500M in energy bill savings, >6 billion lbs of CO2.
Product lines:
Consumer engagement
Energy efficiency
Demand response
Hadoop-based insights are a critical portion of each of these product lines.
Transition: Some example Hadoop-based insights:
Two examples of Opower’s personalized insights that use Hadoop components: neighbor comparisons and unusual usage alerts.
Energy usage is stored in HBase, along with insights derived from that usage; billions of meter reads are stored there. Insights are served directly from HBase.
Unusual usage alerts were the first use case for HBase/Hadoop. We sold a deal that required us to generate “unusual usage alerts” at a scale we had not yet hit.
UUAs are email or phone messages we send to let customers know if they are trending toward higher-than-usual energy usage.
We also project the bill for them and can let them know if they are going to pay more than expected (e.g., a household 10 days into a 30-day bill period that has already used $40 of energy is trending toward a roughly $120 bill).
Transition: The initial architecture we built to calculate and deliver this insight
Hadoop has been used in production at Opower since 2012.
Overview of end-to-end architecture: data is copied from single-tenant MySQL databases into HBase. MySQL is single-tenant (one database per Opower client), and we have 100+ MySQL databases in production. Batch clients read from HBase. Other workloads run on the same cluster in an attempt to eliminate the need to support separate clusters for separate workloads. Sqoop is a MapReduce job that reads data from MySQL and outputs it to some other store, such as Hive+HDFS or, in our case, HBase.
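For illustration, a Sqoop import into HBase looks roughly like the sketch below; the hostname, database, table, and column-family names are hypothetical, not Opower’s actual setup.

    # Illustrative Sqoop import from a MySQL read slave into HBase.
    # Keep --num-mappers low to avoid overloading the MySQL slave.
    sqoop import \
      --connect jdbc:mysql://read-slave.example.com/opower_client_1 \
      --username sqoop_user -P \
      --table meter_reads \
      --hbase-table usage \
      --column-family d \
      --hbase-row-key meter_id \
      --num-mappers 4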
Challenges:
Sqoop ingest introduced a lot of memory pressure on the region servers and traffic on the MySQL read slaves. Care is needed to avoid introducing excessive MySQL load from Sqoop queries, since the databases serve other critical apps.
Queries required longer multi-row scans and aggregations. Lots of tuning was necessary: increasing region file sizes, memstore sizes, and heap size; disabling major compactions; enabling HDFS short-circuit reads (see the config sketch after this list).
Composite row keys with timestamps in them meant we were thinking about HBase more like a relational table than the big sorted map it actually is.
We kept supporting data in single-tenant tables because we were Sqooping it over from the MySQL databases.
Because of how we designed the schema, we needed multiple tables to store the data.
Single-tenant tables add operational overhead and make it difficult to track bottlenecks in the process.
Initial support of ad-hoc MR jobs via Hive was quickly removed due to unmanageable load
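The tuning mentioned above might look roughly like the following hbase-site.xml fragment; the values are illustrative assumptions, not Opower’s production settings, and heap size itself is set via HBASE_HEAPSIZE in hbase-env.sh.

    <!-- Illustrative hbase-site.xml tuning; values depend on cluster and workload. -->
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>21474836480</value> <!-- larger regions (20 GB) to reduce splitting -->
    </property>
    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>268435456</value> <!-- 256 MB memstore flush threshold -->
    </property>
    <property>
      <name>hbase.hregion.majorcompaction</name>
      <value>0</value> <!-- disable periodic major compactions; trigger manually -->
    </property>
    <property>
      <name>dfs.client.read.shortcircuit</name>
      <value>true</value> <!-- HDFS short-circuit local reads -->
    </property>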
This architecture was successful but difficult to scale. The HBase schema was hard to extend to support new insights, and there was no story for offline analytics and experimentation.
Transition: The V2 (modern) Opower Hadoop architecture addresses these issues
Overview/walkthrough of the major components. Usage data is collected from the utility and directly ingested into HBase via bulkloading MR jobs. [Explain bulkloading] Data is stored in an entity-centric table, where each entity is a single HBase row containing the energy usage history for a household and any analytics derived from that usage, such as bill forecasts and neighbor comparisons. MapReduce jobs periodically refresh these analytics, but some are also refreshed on demand in a streaming fashion, as insights are queried. Data is replicated to the data warehouse cluster via a combination of HBase replication (for direct puts) and an HFile distcp step during the initial bulkload ingest (not pictured).
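A minimal sketch of a bulkloading MR driver, assuming HBase 1.x APIs; the table name, paths, CSV layout, and class names are hypothetical placeholders, not Opower’s actual code.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class UsageBulkLoad {
        // Parses hypothetical CSV meter reads (meter_id,read_date,kwh) into Puts.
        static class UsageMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = line.toString().split(",");
                Put put = new Put(Bytes.toBytes(f[0]));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes(f[1]), Bytes.toBytes(f[2]));
                ctx.write(new ImmutableBytesWritable(put.getRow()), put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "usage-bulkload");
            job.setJarByClass(UsageBulkLoad.class);
            job.setMapperClass(UsageMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            FileInputFormat.addInputPath(job, new Path("/ingest/meter-reads"));
            FileOutputFormat.setOutputPath(job, new Path("/tmp/usage-hfiles"));

            TableName name = TableName.valueOf("usage");
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin();
                 Table table = conn.getTable(name);
                 RegionLocator locator = conn.getRegionLocator(name)) {
                // Total-order partitioning: reducers write HFiles aligned to the
                // table's current region boundaries.
                HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
                if (!job.waitForCompletion(true)) System.exit(1);
                // Hand the HFiles to the region servers: no write path, no
                // memstore flushes, minimal GC pressure.
                new LoadIncrementalHFiles(conf).doBulkLoad(
                        new Path("/tmp/usage-hfiles"), admin, table, locator);
            }
        }
    }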
Full, multi-tenant datasets are now available for analysis in the data warehouse, which has enabled new offline analytics such as product eligibility calculations and a general test bed for experimenting with new insights. There is no longer a need to painstakingly collect data from multiple sources or worry about crashing a MySQL slave when running a full table scan.
Improvements:
Write-path performance via bulkloading: less GC pressure in the region servers, no memstore flushes, and fewer RPCs/round trips to the database. Simultaneous bulkloading via distcp into the data warehouse HBase instance, so the data warehouse has fresh data.
The entity-centric HBase schema makes it possible to add new analytics/insights in a scalable manner. The data used to derive a personalized insight is stored in a single HBase row, providing data locality for scans and eliminating the HBase overhead of multi-row traversals and aggregations (see the single-row read sketch after this list).
Secondary analytics were moved to the data warehouse, reducing memory pressure and task contention on the service cluster. MR jobs on the service cluster are now specific to generating the personalized insights served at low latency.
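With this schema, serving an insight becomes a single-row read. A minimal sketch, assuming hypothetical row keys, column families, and qualifiers:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InsightLookup {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("usage"))) {
                // One Get serves the whole personalized insight: raw reads and
                // derived analytics are colocated in the entity's row.
                Get get = new Get(Bytes.toBytes("household-12345"));
                get.addFamily(Bytes.toBytes("d")); // meter reads
                get.addFamily(Bytes.toBytes("i")); // derived insights
                Result row = table.get(get);
                byte[] forecast = row.getValue(Bytes.toBytes("i"), Bytes.toBytes("kwh_forecast"));
                System.out.println(forecast == null ? "n/a" : Bytes.toDouble(forecast));
            }
        }
    }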
The new architecture has worked, but there are still areas we want to improve, such as automation and ETL tooling that will make it easier to load new datasets and create new insights.
Transition: This new architecture enables two distinct environments for creating new data insights
Product calculations are built as producer-style MapReduce jobs, reading and writing to the same HBase row. For example, a trend in energy usage for the current bill period is derived from the usage data present in the row and used to forecast the customer’s energy consumption and spending for the current period.
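A rough sketch of such a producer-style job, assuming the HBase TableMapper API; the table, column families, and qualifiers are hypothetical, and the forecast here is a naive linear projection.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;

    public class BillForecastJob {
        static final byte[] D = Bytes.toBytes("d"); // usage data
        static final byte[] I = Bytes.toBytes("i"); // derived insights

        // Producer pattern: read a row, derive an insight, write it back to
        // the same row.
        static class ForecastMapper extends TableMapper<ImmutableBytesWritable, Put> {
            @Override
            protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
                    throws IOException, InterruptedException {
                byte[] kwh = row.getValue(D, Bytes.toBytes("kwh_to_date"));
                byte[] days = row.getValue(D, Bytes.toBytes("days_elapsed"));
                byte[] len = row.getValue(D, Bytes.toBytes("days_in_period"));
                if (kwh == null || days == null || len == null || Bytes.toInt(days) == 0) {
                    return; // nothing to forecast yet for this household
                }
                // Naive linear projection over the full bill period.
                double forecast = Bytes.toDouble(kwh) / Bytes.toInt(days) * Bytes.toInt(len);
                Put put = new Put(rowKey.get());
                put.addColumn(I, Bytes.toBytes("kwh_forecast"), Bytes.toBytes(forecast));
                ctx.write(rowKey, put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "bill-forecast");
            job.setJarByClass(BillForecastJob.class);
            Scan scan = new Scan();
            scan.setCaching(100);       // entity rows are wide; keep batches small
            scan.setCacheBlocks(false); // don't pollute the block cache from MR scans
            TableMapReduceUtil.initTableMapperJob("usage", scan, ForecastMapper.class,
                    ImmutableBytesWritable.class, Put.class, job);
            TableMapReduceUtil.initTableReducerJob("usage", null, job);
            job.setNumReduceTasks(0);   // map-only: Puts go straight to the table
            job.waitForCompletion(true);
        }
    }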
Insights are accessed through a service query layer. A template HBase service container can easily be extended to create service APIs for different insight products. Service client applications are used by reporting pipelines and embedded web components.
Offline analysis and experimentation occur in the data warehouse. Hive, BI tools (Platfora, Datameer), and raw MapReduce jobs are used to create aggregate reports and non-product analytics such as customer program eligibility.
These tools are also used for ad-hoc analysis of full energy usage datasets, such as electric car charging trends or the impact of the Super Bowl on energy consumption.
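An ad-hoc analysis like the Super Bowl one might be a Hive query along these lines; the warehouse table and column names are made up for illustration.

    -- Hypothetical query: hourly load shape on game day, to gauge the
    -- Super Bowl's impact on energy consumption.
    SELECT hour(read_ts) AS hour_of_day,
           avg(kwh)      AS avg_kwh
    FROM energy_usage
    WHERE to_date(read_ts) = '2014-02-02'
    GROUP BY hour(read_ts)
    ORDER BY hour_of_day;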
In the future we look to link the two systems, enabling analytics developed offline to be ‘promoted’ to product calculations.
Transition: What’s been the result of the switch to a Hadoop architecture?
Batch analytics calculated via the producer pattern are much more amenable to MR parallelization and take advantage of HBase row locality. Run time was reduced significantly. Some jobs could be made multi-tenant, which makes them easier to operate.
Individual insight query latency dropped from several seconds to ~10 ms. Our performance tests measure the 99.999th percentile of the latency tail, so the average time is even faster. Query latency has been critical for SOA-model SLAs, since multiple external services access this data in real time.
Analytic development time is faster, although it could still be improved. Development speedups came from adding a data warehouse cluster for development and experimentation, with more analyst-friendly tools like Hive and Scalding. Also, the entity-based schema used in production is more amenable to adding new data.
Transition: We’ve had some success but encountered challenges along the way. Here are some lessons we learned: