Operationalizing models and responding to large volumes of data, fast, often requires bolt-on systems that struggle with processing (transforming the data), consistency (always responding to the data), and scalability (processing and responding to large data volumes). When data volumes grow too large, these traditional systems fail to deliver their responses, resulting in significant losses for organizations. Join this breakout to learn how to overcome these roadblocks.
McKinsey & Company: http://www.mckinsey.com/insights/business_technology/getting_the_cmo_and_cio_to_work_as_partners
“Wall Street’s quest to process data at the speed of light,” InformationWeek, April 21, 2007
Only talking about numerical and categorical data here. Not talking text or images because of HBase’s practical limit of roughly 1-2 GB per entity.
Recommendations are a list of content that, based on the entity’s behavior, fits the next best action.
Anomaly detection is the inverse of recommendations: based on their behavior these entities should fit the recommendations, but they fall outside the norm. Signal an alert.
Model scores are numerical aggregates that you can send to the right system.
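A loose sketch of these three output shapes as simple Java types (purely illustrative; these class names are not Opower’s actual model):

    // Illustrative shapes for the three insight outputs described above.
    // All names here are hypothetical.
    import java.util.List;

    class InsightOutputs {
        // Next-best-action content ranked against the entity's observed behavior.
        static class Recommendation { String entityId; List<String> nextBestActions; }
        // Inverse case: the entity should fit a recommendation profile, but its
        // behavior falls outside the norm, so we signal an alert.
        static class AnomalyAlert { String entityId; double deviationFromNorm; }
        // Numerical aggregate routed to whichever downstream system needs it.
        static class ModelScore { String entityId; String modelName; double score; }
    }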
Limited data – Can’t bring in unstructured data, and historic data is moved offline because the system can’t scale.
Drill-down performance – Individual insight drill-downs take time, and ad-hoc queries steal resources from the operational analytics workload.
Analytic latency – Processing data volumes at speed is inefficient on traditional systems, and latency is unacceptable (it hurts the business or the customer relationship).
Data scale – Bring in structured and unstructured data; keep data online for deeper drill-downs.
Ad-hoc queries – Perform drill-downs without compromising system performance.
Low latency – Scale data processing to meet analytics SLAs.
Opower Intro: Who is Opower and what does Opower do?
Produce energy insights to help utilities and customers manage energy consumption.
100+ million meter reads received daily. Millions of individual insight calculations routinely created, from simple trending analytics to more advanced forecasting/prediction.
Energy saved: 5+ TWh, $500M in energy bill savings, >6 billion lbs of CO2.
Product lines:
Consumer engagement
Energy efficiency
Demand response
Hadoop-based insights are a critical portion of each of these product lines.
Transition: Some example Hadoop-based insights:
Two examples of Opower’s personalized insights that use Hadoop components: neighbor comparisons and unusual usage alerts.
Energy usage is stored in HBase, along with insights derived from that usage; billions of meter reads are stored there. Insights are served directly from HBase.
Unusual usage alerts were the first use case for HBase/Hadoop. We sold a deal that required us to generate “unusual usage alerts” at a scale we had not yet hit.
UUAs are email or phone messages we send to let customers know if they are trending toward higher-than-usual energy usage.
We also project the bill for them and can let them know if they are going to pay more than expected (e.g., a household 10 days into a 30-day bill period that has already used $40 of energy is trending toward a roughly $120 bill).
Transition: The initial architecture we built to calculate and deliver this insight
Hadoop has been used in production at Opower since 2012.
Overview of end-to-end architecture: data is copied from single-tenant MySQL databases into HBase. MySQL is single-tenant (one database per Opower client), and we have 100+ MySQL databases in production. Batch clients read from HBase. Other workloads run on the same cluster in an attempt to eliminate the need to support separate clusters for separate workloads. Sqoop is a MapReduce job that reads data from MySQL and outputs it to some other store, such as Hive+HDFS or, in our case, HBase.
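For illustration, a Sqoop import into HBase looks roughly like the sketch below; the hostname, database, table, and column-family names are hypothetical, not Opower’s actual setup.

    # Illustrative Sqoop import from a MySQL read slave into HBase.
    # Keep --num-mappers low to avoid overloading the MySQL slave.
    sqoop import \
      --connect jdbc:mysql://read-slave.example.com/opower_client_1 \
      --username sqoop_user -P \
      --table meter_reads \
      --hbase-table usage \
      --column-family d \
      --hbase-row-key meter_id \
      --num-mappers 4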
Challenges:
Sqoop ingest introduced a lot of memory pressure on the region servers and traffic on the MySQL read slaves. Care is needed to avoid introducing excessive MySQL load from Sqoop queries, since the databases serve other critical apps.
Queries required longer multi-row scans and aggregations. Lots of tuning was necessary: increasing region file sizes, memstore sizes, and heap size; disabling major compactions; enabling HDFS short-circuit reads (see the config sketch after this list).
Composite row keys with timestamps in them meant we were thinking about HBase more like a relational table than the big sorted map it actually is.
We kept supporting data in single-tenant tables because we were Sqooping it over from the MySQL databases.
Because of how we designed the schema, we needed multiple tables to store the data.
Single-tenant tables add operational overhead and make it difficult to track bottlenecks in the process.
Initial support of ad-hoc MR jobs via Hive was quickly removed due to unmanageable load
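The tuning mentioned above might look roughly like the following hbase-site.xml fragment; the values are illustrative assumptions, not Opower’s production settings, and heap size itself is set via HBASE_HEAPSIZE in hbase-env.sh.

    <!-- Illustrative hbase-site.xml tuning; values depend on cluster and workload. -->
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>21474836480</value> <!-- larger regions (20 GB) to reduce splitting -->
    </property>
    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>268435456</value> <!-- 256 MB memstore flush threshold -->
    </property>
    <property>
      <name>hbase.hregion.majorcompaction</name>
      <value>0</value> <!-- disable periodic major compactions; trigger manually -->
    </property>
    <property>
      <name>dfs.client.read.shortcircuit</name>
      <value>true</value> <!-- HDFS short-circuit local reads -->
    </property>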
This architecture was successful but difficult to scale. The HBase schema was hard to extend to support new insights, and there was no story for offline analytics and experimentation.
Transition: The V2 (modern) Opower Hadoop architecture addresses these issues
Overview/walkthrough of the major components. Usage data is collected from the utility and directly ingested into HBase via bulkloading MR jobs. [Explain bulkloading] Data is stored in an entity-centric table, where each entity is a single HBase row containing the energy usage history for a household and any analytics derived from that usage, such as bill forecasts and neighbor comparisons. MapReduce jobs periodically refresh these analytics, but some are also refreshed on demand in a streaming fashion, as insights are queried. Data is replicated to the data warehouse cluster via a combination of HBase replication (for direct puts) and an HFile distcp step during the initial bulkload ingest (not pictured).
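A minimal sketch of a bulkloading MR driver, assuming HBase 1.x APIs; the table name, paths, CSV layout, and class names are hypothetical placeholders, not Opower’s actual code.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class UsageBulkLoad {
        // Parses hypothetical CSV meter reads (meter_id,read_date,kwh) into Puts.
        static class UsageMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = line.toString().split(",");
                Put put = new Put(Bytes.toBytes(f[0]));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes(f[1]), Bytes.toBytes(f[2]));
                ctx.write(new ImmutableBytesWritable(put.getRow()), put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "usage-bulkload");
            job.setJarByClass(UsageBulkLoad.class);
            job.setMapperClass(UsageMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            FileInputFormat.addInputPath(job, new Path("/ingest/meter-reads"));
            FileOutputFormat.setOutputPath(job, new Path("/tmp/usage-hfiles"));

            TableName name = TableName.valueOf("usage");
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin();
                 Table table = conn.getTable(name);
                 RegionLocator locator = conn.getRegionLocator(name)) {
                // Total-order partitioning: reducers write HFiles aligned to the
                // table's current region boundaries.
                HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
                if (!job.waitForCompletion(true)) System.exit(1);
                // Hand the HFiles to the region servers: no write path, no
                // memstore flushes, minimal GC pressure.
                new LoadIncrementalHFiles(conf).doBulkLoad(
                        new Path("/tmp/usage-hfiles"), admin, table, locator);
            }
        }
    }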
Full, multi-tenant datasets are now available for analysis in the data warehouse, which has enabled new offline analytics such as product eligibility calculations and a general test bed for experimenting with new insights. There is no longer a need to painstakingly collect data from multiple sources or worry about crashing a MySQL slave when running a full table scan.
Improvements:
Write-path performance via bulkloading: less GC pressure in the region servers, no memstore flushes, and fewer RPCs/round trips to the database. Simultaneous bulkloading via distcp into the data warehouse HBase instance, so the data warehouse has fresh data.
The entity-centric HBase schema makes it possible to add new analytics/insights in a scalable manner. The data used to derive a personalized insight is stored in a single HBase row, providing data locality for scans and eliminating the HBase overhead of multi-row traversals and aggregations (see the single-row read sketch after this list).
Secondary analytics were moved to the data warehouse, reducing memory pressure and task contention on the service cluster. MR jobs on the service cluster are now specific to generating the personalized insights served at low latency.
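With this schema, serving an insight becomes a single-row read. A minimal sketch, assuming hypothetical row keys, column families, and qualifiers:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InsightLookup {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("usage"))) {
                // One Get serves the whole personalized insight: raw reads and
                // derived analytics are colocated in the entity's row.
                Get get = new Get(Bytes.toBytes("household-12345"));
                get.addFamily(Bytes.toBytes("d")); // meter reads
                get.addFamily(Bytes.toBytes("i")); // derived insights
                Result row = table.get(get);
                byte[] forecast = row.getValue(Bytes.toBytes("i"), Bytes.toBytes("kwh_forecast"));
                System.out.println(forecast == null ? "n/a" : Bytes.toDouble(forecast));
            }
        }
    }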
The new architecture has worked, but there are still areas we want to improve, such as automation and ETL tooling that will make it easier to load new datasets and create new insights.
Transition: This new architecture enables two distinct environments for creating new data insights
Product calculations are built as producer-style MapReduce jobs, reading and writing to the same HBase row. For example, a trend in energy usage for the current bill period is derived from the usage data present in the row and used to forecast the customer’s energy consumption and spending for the current period.
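A rough sketch of such a producer-style job, assuming the HBase TableMapper API; the table, column families, and qualifiers are hypothetical, and the forecast here is a naive linear projection.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;

    public class BillForecastJob {
        static final byte[] D = Bytes.toBytes("d"); // usage data
        static final byte[] I = Bytes.toBytes("i"); // derived insights

        // Producer pattern: read a row, derive an insight, write it back to
        // the same row.
        static class ForecastMapper extends TableMapper<ImmutableBytesWritable, Put> {
            @Override
            protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
                    throws IOException, InterruptedException {
                byte[] kwh = row.getValue(D, Bytes.toBytes("kwh_to_date"));
                byte[] days = row.getValue(D, Bytes.toBytes("days_elapsed"));
                byte[] len = row.getValue(D, Bytes.toBytes("days_in_period"));
                if (kwh == null || days == null || len == null || Bytes.toInt(days) == 0) {
                    return; // nothing to forecast yet for this household
                }
                // Naive linear projection over the full bill period.
                double forecast = Bytes.toDouble(kwh) / Bytes.toInt(days) * Bytes.toInt(len);
                Put put = new Put(rowKey.get());
                put.addColumn(I, Bytes.toBytes("kwh_forecast"), Bytes.toBytes(forecast));
                ctx.write(rowKey, put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "bill-forecast");
            job.setJarByClass(BillForecastJob.class);
            Scan scan = new Scan();
            scan.setCaching(100);       // entity rows are wide; keep batches small
            scan.setCacheBlocks(false); // don't pollute the block cache from MR scans
            TableMapReduceUtil.initTableMapperJob("usage", scan, ForecastMapper.class,
                    ImmutableBytesWritable.class, Put.class, job);
            TableMapReduceUtil.initTableReducerJob("usage", null, job);
            job.setNumReduceTasks(0);   // map-only: Puts go straight to the table
            job.waitForCompletion(true);
        }
    }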
Insights are accessed through a service query layer. A template HBase service container can easily be extended to create service APIs for different insight products. Service client applications are used by reporting pipelines and embedded web components.
Offline analysis and experimentation occur in the data warehouse. Hive, BI tools (Platfora, Datameer), and raw MapReduce jobs are used to create aggregate reports and non-product analytics such as customer program eligibility.
These tools are also used for ad-hoc analysis of full energy usage datasets, such as electric car charging trends or the impact of the Super Bowl on energy consumption.
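An ad-hoc analysis like the Super Bowl one might be a Hive query along these lines; the warehouse table and column names are made up for illustration.

    -- Hypothetical query: hourly load shape on game day, to gauge the
    -- Super Bowl's impact on energy consumption.
    SELECT hour(read_ts) AS hour_of_day,
           avg(kwh)      AS avg_kwh
    FROM energy_usage
    WHERE to_date(read_ts) = '2014-02-02'
    GROUP BY hour(read_ts)
    ORDER BY hour_of_day;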
In the future we look to link the two systems, enabling analytics developed offline to be ‘promoted’ to product calculations.
Transition: What’s been the result of the switch to a Hadoop architecture?
Batch analytics calculated via the producer pattern are much more amenable to MR parallelization and take advantage of HBase row locality. Run time was reduced significantly. Some jobs could be made multi-tenant, which makes them easier to operate.
Individual insight query latency dropped from several seconds to ~10 ms. Our performance tests measure the 99.999th percentile of the latency tail, so the average time is even faster. Query latency has been critical for SOA-model SLAs, since multiple external services access this data in real time.
Analytic development time is faster, although it could still be improved. Development speedups came from adding a data warehouse cluster for development and experimentation, with more analyst-friendly tools like Hive and Scalding. Also, the entity-based schema used in production is more amenable to adding new data.
Transition: We’ve had some success but encountered challenges along the way. Here are some lessons we learned: