Automated Analytics at Scale
Model Management in Streaming
Big Data Architectures
Chris Kang
• Machine learning allows organizations to proactively discover patterns and predict
outcomes for their operations, and improving those insights requires deploying
better analytical models on their data.
• Finding the best analytical model requires running thousands of hypotheses on
various datasets and comparing models in a brute force approach.
• Currently a model management framework does not exist - that is, an agnostic
tool or framework that manages and orchestrates the entire lifecycle of a model.
Real-time Analytics at Scale
Copyright © 2016 Accenture. All rights reserved.
Challenges of Model Management
Model Management Framework operationalizes analytics to ease
development and deployment of analytical models
The framework provides key benefits to operationalize and democratize access to analytical
modeling at scale
Captures and
templates analytical
models created by
expert data scientists
for easy reuse
Faster development
of analytical models
with rapid iteration of
training and
comparing models
using brute force
approach
Presents champion-
challenger view to
visually compare and
promote trained
models
Reduces complexity
for data scientists to
train and deploy
models
Enables business
analysts and others to
participate in
modeling process
Model Management Framework is essential for the Internet of Things
platform
The Internet of Things platform exposes thousands of sensors that require models to be
automatically managed and maintained as well as provide easy access to the predicted results
Identify desired
insights
Identify insights for operating devices/machinery for various
purposes: anomaly detection, predictive maintenance, budget and
resource optimization
Collect data
Collect various types of data (time series or static) and store them
in the databases that best fit each data type
Analyze
Train the analytical models using the model management
framework, or train them with other analytical tools such as R and
then onboard them to the framework
Actuate and
optimize
Set up rules to act on predicted results from thousands of sensors,
e.g. schedule a maintenance or lower temperature on a device
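The "actuate and optimize" step above can be illustrated with a minimal sketch of rules evaluated against a predicted result. The `Rule` class, the field names, and the thresholds are hypothetical and chosen only for illustration; the slides do not describe how the rules are actually implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A hypothetical rule: if the condition holds for a prediction, emit an action."""
    name: str
    condition: Callable[[Dict[str, float]], bool]
    action: str


def evaluate_rules(prediction: Dict[str, float], rules: List[Rule]) -> List[str]:
    """Return the actions of every rule whose condition matches the prediction."""
    return [rule.action for rule in rules if rule.condition(prediction)]


# Illustrative thresholds only; real values would come from domain experts.
RULES = [
    Rule("likely_failure", lambda p: p["failure_probability"] >= 0.8, "schedule_maintenance"),
    Rule("overheating", lambda p: p["predicted_temperature_c"] > 90.0, "lower_temperature"),
]

# A made-up predicted result for one sensor.
prediction = {"sensor_id": 17, "failure_probability": 0.91, "predicted_temperature_c": 72.5}
actions = evaluate_rules(prediction, RULES)
print(actions)
```

With thousands of sensors, the same rule list would simply be evaluated once per predicted result.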
Background
Organizations today have an unprecedented amount of data available
because of the Internet of Things, the web, and social media
In order to take advantage of
this massive set of data,
organizations must build
analytics platforms
Source: IBM, Big Data Hub, 2013
Traditional analytics platforms use big data technologies to process and
analyze large amounts of data
“Excited by big data technology capabilities to store more data, more diverse data and more real-time data, (companies)
focus on data collection. Rapidly growing data stores put increasing pressure on figuring out what to do with this data.
Determining the value of the collected data becomes the top challenge in all industries.”
Source: Svetlana Sicular, Gartner, October 30 2015
Example
Technologies
The steps to derive value out of the data include collecting, processing, and analyzing the data using a variety
of big data tools.
Analytics and
Visualization
Data Processing
Data Collection
Store huge volumes of data in multiple
data stores in a variety of data types for
processing.
Process the data by filtering, transforming,
and applying machine learning algorithms
using computing engines.
Create ad hoc reports on processed data
using business intelligence and
visualization tools.
Enterprises need access to both historical and real-time data to gain the
most value out of big data analytics
• Real-time data is data processed within sub-seconds to seconds, measured from the time the data arrives to the time the results are derived.
• Batch processing technologies alone are insufficient: in the time it takes to process a batch (hours or days), new real-time
data accumulates and is missed, which costs opportunities for proactive decision making.
Storing data in a fault-tolerant, replicated historic store,
processing a large batch of data, and storing the processed data
using batch writes incur delays that make real-time analytics infeasible
Queries are only
directed at stale data
of up to hours or
days. The lack of
real-time data limits
the analytics to ad-
hoc summarizations
and aggregations.
Because of the
batch processing
delay, by the
time the captured
data is available
for queries, it is
stale
Real-time data is missed by
the time analytics begins
[Diagram: batch-only pipeline. Labels: Historic Data Store (Storage), Batch (Processing), Batch Write and Data Query (Serving), Real-Time Data.]
The Lambda Architecture empowers real-time analytics by handling data at
scale and in real-time using a hybrid architecture
• Designed by Nathan Marz, the creator of the Apache Storm project and previously a lead engineer at Twitter, the architecture
aims to be a general way to process big data at scale.
• The architecture separates batch processing on historical data from stream processing on real-time flow of data, allowing
for analytics on data that combines the most up-to-date data with historical data views.
Real-time analytics
can now be
performed on data
combined from most
up-to-date data with
historical views
BATCH LAYER focuses
on processing historical
data views for queries
SPEED LAYER handles
the complexity of real-time
data collection and analysis
[Diagram: Lambda Architecture. Labels: Historic Data Store, Batch, Batch Write, Data Query, Queue, Speed, Random Write; stages: Storage, Processing, Serving.]
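The split between the batch layer and the speed layer can be sketched in a few lines. This is a toy, in-memory illustration under assumed names (`build_batch_view`, `SpeedLayer`, `query`), not the Storm- or Hadoop-based implementation; it only shows how a query combines a precomputed historical view with data that arrived after the last batch run.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[str, int]  # (sensor_id, reading_count); a made-up event shape


def build_batch_view(historical_events: Iterable[Event]) -> Counter:
    """Batch layer: recompute an aggregate view over all historical data."""
    view = Counter()
    for sensor_id, count in historical_events:
        view[sensor_id] += count
    return view


class SpeedLayer:
    """Speed layer: incrementally aggregate events not yet covered by the batch view."""

    def __init__(self) -> None:
        self.view = Counter()

    def ingest(self, event: Event) -> None:
        sensor_id, count = event
        self.view[sensor_id] += count


def query(sensor_id: str, batch_view: Counter, speed_view: Counter) -> int:
    """Serving: merge the historical view with the most recent data."""
    return batch_view[sensor_id] + speed_view[sensor_id]


batch_view = build_batch_view([("s1", 10), ("s2", 4), ("s1", 5)])
speed = SpeedLayer()
speed.ingest(("s1", 2))  # arrived after the last batch run
total_s1 = query("s1", batch_view, speed.view)
print(total_s1)  # 17
```

In a full system the speed view would be discarded once a later batch run has absorbed the same events; that bookkeeping is omitted here.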
In the Internet of Things, predictive modeling on sensor data allows
organizations to discover patterns and predict outcomes for their operations
Remediation
Notification
and Alerts
Oil & Gas
Producer
Water Utility
Client
NoSQL for
Unstructured
Data
Computing Engines
and Stream
Processors
Machine Learning
Algorithms
Model Runtime
Environments
Sensors at
Field Sites
Predictive
Results
Data Collection Data Processing Predictive Modeling Proactive Decision Making
Collects data from over
190,000 sensors
Collects data from sensors
placed along pipes in a water
distribution network
Ingests 6,000 rows/second
and 11 billion rows of data
per month – larger analytics
platform than Twitter
Processes data for water flow
rate and pressure
Has over 3,500 models analyzing
data using various algorithms
Apply predictive model to project
forward in time to see spikes or falls
that exhibit warning signs of failure
Enables company to examine huge sets of data,
discover trends to predict outcomes in operation
and exploration efforts
Use results from predictive model to proactively
reduce pressure spikes, avoiding leaks,
prolonging the longevity of assets, and reducing
disruption to customers
• The real value of big data is the insight via the analytics, not just the collection of the data.
• Predictive modeling is the primary means by which companies can discover trends and make proactive, as opposed
to reactive, decisions on data.
The modeling process is iterative and its lifetime spans both the batch mode
model training and real-time prediction
In general, a model creates an output for an unknown target value given a defined set of inputs.
In a time-series model, the target value also depends on time as an input
Build Model
• Identify required data and
how to get it
• Design and validate
specific analytic models
• Verify approach through
initial set of insights on
particular environments
Analyzes a variety of
machine learning
algorithms and identifies
the logistic regression
model as the most suitable
for the problem. Codes the
model as a .JAR file
Train Model
• Prepare historical data
for training
• Select model input
parameters and runtime
environment
• Train the model on data
from historical batch
and/or real-time stream in
runtime environment
Selects input parameters
such as the regularization
parameter for the logistic
regression model. Submits
the model to Spark to train
the model on historical
data in HDFS
Monitor
Execution
• Monitor the status of
training the model in the
runtime environment (e.g.
running, succeeded,
failed)
• Troubleshoot issues in
the runtime environment
if necessary
Opens a terminal, connects
to the Hadoop cluster via ssh,
and enters commands
to verify the status of the
model as it is trained
Compare
Models
• Compare trained models
in champion-challenger
fashion
• Brute force approach to
finding best-of-breed
model for deploying to
live stream
After iteratively training
many models, select the
best-of-breed based on the
model with the lowest
mean square error
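A minimal sketch of the champion-challenger comparison described above: each candidate is scored by mean squared error on held-out data, and the lowest error becomes the champion. The candidate models here are plain functions standing in for already-trained models, and the data is invented for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Model = Callable[[float], float]


def mean_squared_error(model: Model, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Average squared difference between predictions and observed targets."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def rank_models(models: Dict[str, Model], xs: Sequence[float],
                ys: Sequence[float]) -> List[Tuple[str, float]]:
    """Return (name, mse) pairs sorted so the champion (lowest MSE) comes first."""
    scores = [(name, mean_squared_error(m, xs, ys)) for name, m in models.items()]
    return sorted(scores, key=lambda pair: pair[1])


# Hypothetical already-trained candidates, stood in for by simple functions.
candidates = {
    "constant": lambda x: 5.0,
    "linear": lambda x: 2.0 * x + 1.0,
    "steep": lambda x: 3.0 * x,
}
holdout_x = [0.0, 1.0, 2.0, 3.0]
holdout_y = [1.0, 3.0, 5.0, 7.0]

ranking = rank_models(candidates, holdout_x, holdout_y)
champion, challengers = ranking[0], ranking[1:]
print(champion)     # ('linear', 0.0)
print(challengers)  # [('steep', 1.5), ('constant', 6.0)]
```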
Operationalize
Model
• Deploy best model on live
stream of data
• Generate predicted
results for automated or
manual proactive
decision making
• Observe results to feed
back and fine-tune the
model
Submits the model to
Spark Streaming to be
applied to streaming data
ingested from Kafka, and
model predicts in real-time
whether sensor will fail
I want to deploy
a model that can
detect if a sensor
is faulty in
real-time
Data
Scientist
Data Science System Administration
Challenges with Analytical Modeling
in the Current State
Building, training, and deploying analytical models require a
rare combination of data science and engineering skills
The ability to complete the modeling process is limited to specialized individuals who are experts
in both data science and engineering
“The United States
alone faces a shortage
of 140,000 to 190,000
people with analytical
expertise and
1.5 million
managers and analysts
with the skills to
understand and make
decisions based on the
analysis of big data.”
Source: McKinsey Global Institute
analysis
Traditional Strengths
Potential Hurdles with Model Building
and Deployment
Full Set of Skills Needed for
Model Building and Deployment
Mathematics, statistics,
machine learning, data
mining, pattern recognition,
predictive algorithms, domain
expertise
Troubleshooting and running a runtime
environment such as Spark requires
advanced system engineering skills, which a
data scientist may not be trained in. This can
potentially lead to slower development and
deployment of predictive models.
• Understanding of a variety of
machine learning algorithms,
pattern recognition, as well as
expertise in a domain.
• Ability to build and code accurate
models based on problem
space.
• System administrator skills as
well as deep understanding of
big data systems to deploy
models in runtime environment.
Domain expertise, business
processes, requirements
gathering
Traditional business analysts may lack core
skills in data science or data engineering
because of a lack of experience to build, train,
or deploy models
Combination of data science
skills as well as software
engineering and system
administrator skills for big
data systems
May lack domain expertise, in which case it
may take longer to build and train relevant
models for the use case
Data
Scientist
Business
Analyst
Dual Data
Scientist
and
Engineer
Analytical models are not easily reusable or shareable,
resulting in siloed analytics work
There is no standard method for sharing models to let users leverage models created by other
data scientists, so the analytics work is siloed. This is true for both freshly built models and models
that were already trained on a dataset
Predictive models duplicate and sprawl
as data scientists build and train their
own individual library of models that
are not shared.
No standard for
sharing or
viewing other
data scientists’
models
Individual Libraries of Models
Data scientists primarily leverage their own libraries
of models and previous datasets they worked with
to select an algorithm and build a model for the
current problem
Model Duplication
As models are built and trained, the same types of
models may be built by more than one data
scientist, particularly if the types of models are
common in the industry’s use cases
Model Sprawl
Over time, as more data scientists build and train
more models, the models begin to sprawl and
duplicate unnecessarily, making the central
management of models more difficult
Train and
deploy
individual
models
Runtime execution
environments for model
training and deployment
Without a framework, the current approach is too inflexible to support multiple
runtime execution environments
It is impractical to scale the number of runtime environments to train and deploy models using a
manual approach
Spark model
with R
dependencies
Model with R
dependencies
I have a model,
but I don’t know
which runtime
environment
can support it
I’m only familiar with R,
so I need to learn all the
environments to test my
model
I have a new
type of model
so I need to
learn another
runtime
environment
Runtime environments often cannot support all types of models. As a result, data scientists must spend time learning
environments instead of using that time for analytical modeling.
Dependencies
match and runtime
can support model
Missing Spark
functionality to
execute model
Missing specific R
dependency so
cannot support
model
All R libraries
supported and can
execute model
Data
Scientist
Update
Test
Learn
• Data scientist needs to acquire the system
administration skills to operate the runtime
environments
• Each runtime environment is unique and
requires time and energy
• In the worst case, the data scientist must try
every runtime environment before successfully
finding a match for the predictive model
• As more model types are needed, additional
runtime environments must be learned
• Learning additional environments becomes a
time-consuming endeavor
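The matching problem on this slide (which runtime can support a model's dependencies) can be checked programmatically instead of by trial and error. The sketch below treats dependencies and runtime capabilities as plain sets of labels; the runtime names and dependency labels are hypothetical, and a real verifier would need to inspect actual library versions.

```python
from typing import Dict, List, Set


def supporting_runtimes(model_deps: Set[str], runtimes: Dict[str, Set[str]]) -> List[str]:
    """Return the runtimes whose capabilities cover every model dependency."""
    return sorted(name for name, caps in runtimes.items() if model_deps <= caps)


def missing_dependencies(model_deps: Set[str],
                         runtimes: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each runtime that cannot run the model, list what it lacks."""
    return {name: model_deps - caps
            for name, caps in runtimes.items() if model_deps - caps}


# Hypothetical runtime inventory.
RUNTIMES = {
    "spark_cluster": {"spark-mllib"},
    "spark_with_r": {"spark-mllib", "r-base"},
    "r_server": {"r-base", "r-forecast"},
}

model_deps = {"r-base", "r-forecast"}  # a model with R dependencies
supported = supporting_runtimes(model_deps, RUNTIMES)
gaps = missing_dependencies(model_deps, RUNTIMES)
print(supported)  # ['r_server']
print(gaps["spark_with_r"])  # {'r-forecast'}
```

Running this check before submission replaces the worst case of trying every runtime environment with a single lookup.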
Lack of engineering abstraction makes it difficult to
quickly train predictive models on data
Data scientists lose productivity as the process to train models is manual, requiring a manual
check for the status of a model in the environment as well as system administration for
troubleshooting the model in the environment
Need for abstraction
grows as the number
of types of models and
runtime environments
increases
Wasted productivity – Spending time on data
engineering instead of comparing models to
find the best-of-breed for deployment
No abstractions for
training or monitoring
models on runtime
environments
Train model
Repeated for hundreds of models
on various runtime environments
Check
status of
model
Troubleshoot
model
Train model
Check
status of
model
Troubleshoot
model
Train model
Check
status of
model
Troubleshoot
model
Build many
models on
various
algorithms
More time spent
on system
administration
Less time spent
on building
predictive
models
Try different input
parameters and
algorithms to find best-of-
breed model
…..
Manual Process
Data
Scientist
Build Model Train Model and Monitor Status
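The manual train / check status / troubleshoot loop above is the kind of work an abstraction can absorb. As a rough sketch, the code below sweeps a regularization parameter, records whether each training job succeeded or failed, and picks the best result. It swaps in a one-feature ridge regression with a closed-form fit as a stand-in for the logistic regression on Spark mentioned earlier; `JobRecord`, `train_ridge_1d`, and `run_sweep` are illustrative names, and the data is invented.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Dataset = Tuple[Sequence[float], Sequence[float]]  # (inputs, targets)


@dataclass
class JobRecord:
    regularization: float
    status: str                  # "succeeded" or "failed"
    mse: Optional[float] = None
    error: Optional[str] = None


def train_ridge_1d(xs: Sequence[float], ys: Sequence[float], lam: float) -> float:
    """Closed-form one-feature ridge fit, no intercept: w = sum(xy) / (sum(x^2) + lam)."""
    if lam < 0:
        raise ValueError("regularization must be non-negative")
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def run_sweep(train: Dataset, valid: Dataset, lams: Sequence[float]) -> List[JobRecord]:
    """Train one model per parameter value and record status without manual checks."""
    records = []
    for lam in lams:
        try:
            w = train_ridge_1d(train[0], train[1], lam)
            mse = sum((w * x - y) ** 2 for x, y in zip(*valid)) / len(valid[0])
            records.append(JobRecord(lam, "succeeded", mse=mse))
        except ValueError as exc:
            records.append(JobRecord(lam, "failed", error=str(exc)))
    return records


train = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
valid = ([4.0, 5.0], [8.0, 10.0])
records = run_sweep(train, valid, [-1.0, 0.0, 1.0])
best = min((r for r in records if r.status == "succeeded"), key=lambda r: r.mse)
print([r.status for r in records])  # ['failed', 'succeeded', 'succeeded']
print(best.regularization)          # 0.0
```

The failed job is captured as data rather than requiring someone to log in and inspect the runtime, which is the productivity gain the slide argues for.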
Model Management Framework
for Automated Analytics at Scale
Model Management Framework simplifies the training, deployment, and
management of a large number of models for a Lambda architecture
Model management is a framework that helps data scientists and other users
train and deploy analytical models in various runtime environments on the Lambda
architecture. It abstracts the system administration, reduces the complexity of
training and deployment, and shares models in a form that users across your
organization can consume, so that other users such as business analysts can take
part in the modeling process.
The framework in this reference architecture proposes
• Model Store and Trained Model Store: A library of models of commonly used
machine learning algorithms that can be trained on user’s historical datasets, as
well as trained models that are available to be deployed.
• Model Interface Templates: Interfaces that abstract away the complexity of the
machine learning algorithm, allowing users to specify the inputs and outputs of
the model.
• Deployment and Scheduler: Automatic training, deployment, and scheduling of
models on runtime environments so that users do not need to operate the runtime
environments themselves.
• Runtime Verifier: Ability to determine which runtime environments can support a
model prior to execution, enabling faster development of trained models.
• Monitoring Service and Metadata Store: Service monitors the status of the
model during its execution on the runtime environment for the user, as well as
any metadata about its execution which it can then store.
• API: Exposes functionalities with API endpoints for users to verify, train, deploy,
and monitor models on runtime environments.
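To make the idea of a Model Interface Template more concrete, here is a minimal sketch in which a template declares a model's inputs and output and hides the algorithm behind a single `predict` call. The class, field names, and the toy scoring function are assumptions for illustration; the deck does not specify the template format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ModelInterfaceTemplate:
    """Declares what a model consumes and produces, independent of its algorithm."""
    name: str
    inputs: List[str]
    output: str
    predict_fn: Callable[[Dict[str, float]], float]

    def predict(self, record: Dict[str, float]) -> Dict[str, float]:
        """Validate that the declared inputs are present, then return a named output."""
        missing = [f for f in self.inputs if f not in record]
        if missing:
            raise KeyError(f"missing inputs: {missing}")
        features = {f: record[f] for f in self.inputs}
        return {self.output: self.predict_fn(features)}


# Hypothetical template for a water-network model; the scoring rule is a placeholder.
template = ModelInterfaceTemplate(
    name="pressure_spike",
    inputs=["flow_rate", "pressure"],
    output="spike_risk",
    predict_fn=lambda f: min(1.0, f["pressure"] / 100.0),
)

result = template.predict({"flow_rate": 3.2, "pressure": 80.0, "site": 4})
print(result)  # {'spike_risk': 0.8}
```

A business analyst using such a template only needs to know the input and output names, not which algorithm sits behind `predict_fn`.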
Real Time Analytics
Runtime Environments
Distributed Computing Scientific Computing
Model Management
Deployment
and Scheduler
Runtime
Verifier
Model Store
Trained Model Store
Metadata Store
Monitoring
Service
API Model Interface
Templates
Users
Data Scientists Business Analysts
• Design for seamless interfaces connects the stages of the modeling pipeline so that domain experts and data scientists
can create and update models and business analysts can extract data insights.
• Model management at scale targets large-scale data analytics, which requires distributed resource allocation
and communication with various data stores.
Model Management Framework provides seamless interfaces along data
analytics pipeline for model creation, deployment and scheduling
The framework in this technical architecture
proposes
• Runtime Environments: Backend runtime
environments such as Spark, MapReduce, R, and
more interact with distributed resources (e.g. Hadoop)
to train and deploy models
• Historical Data Store: Data virtualization interacts
with various databases (e.g. Cassandra, Redshift, S3)
• Training, Prediction, Model Runtime Services:
Framework services interact with runtime service to
deploy and allocate resources for models as well as
verify models for execution
• APIs: APIs interact with framework services for
various functionalities
• Online Message Queue: Message queue receives the
real-time data
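The interaction between the online message queue, a trained model, and the results store can be sketched with Python's standard `queue.Queue` standing in for a real message broker. The function name, message fields, and threshold model are hypothetical; a production prediction job would run continuously rather than drain the queue once.

```python
import queue
from typing import Callable, Dict, List


def run_prediction_job(messages: "queue.Queue",
                       model: Callable[[Dict], bool],
                       results_store: List[Dict]) -> int:
    """Drain queued messages, apply the trained model, and store each prediction."""
    processed = 0
    while True:
        try:
            message = messages.get_nowait()
        except queue.Empty:
            break  # nothing left to process in this sketch
        results_store.append({"sensor_id": message["sensor_id"],
                              "will_fail": model(message)})
        processed += 1
    return processed


def trained_model(message: Dict) -> bool:
    """Placeholder for a trained model: flags sensors with high vibration."""
    return message["vibration"] > 0.7


online_queue = queue.Queue()
online_queue.put({"sensor_id": 1, "vibration": 0.2})
online_queue.put({"sensor_id": 2, "vibration": 0.9})

results: List[Dict] = []
count = run_prediction_job(online_queue, trained_model, results)
print(count)    # 2
print(results)  # [{'sensor_id': 1, 'will_fail': False}, {'sensor_id': 2, 'will_fail': True}]
```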
Prediction Service Training Service
API
User Interface
Resource
Allocation
Service
Model
Store
Results
Store
Model
Metadata
Store
Historical
Data
Storage
Runtime Environments
Model Runtime Service
Online
Message
Queue
Data
Scientist
Business
Analyst
Demo
Model Management Framework covers a number of features to support
various perspectives
The framework provides the following features from the services to better serve domain
experts/data scientists and business analysts
Feature: Explanation
Automatic model deployment on multiple runtime environments: Automatically prepares a trained model, saved as a .jar file, to serve real-time data on multiple runtime environments, with verification prior to execution.
Modeling algorithm library: A library of algorithms for machine learning and statistical learning.
Model metadata: A model profile that describes the configuration parameters, paths to input/output data, model version, and resource consumption.
Heterogeneous data stores: Data can be stored in various databases.
Champion-challenger model: Multiple models, with the best-performing model as the champion and the rest as challengers.
Batch mode and real-time mode: A combination of model training and serving the model on real-time data.
Model update: Retraining of the current model or re-selection of the champion model.
Job completion time estimation: An estimate of how soon a job can be completed given the current resources.
Prediction results query and UI: Access to the prediction results from applying a trained model to real-time data, for dashboard display.
Algorithm parameter tuning: Automatic fine-tuning of algorithm parameters to achieve the best model quality.
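The model metadata feature can be pictured as a small profile record that is serialized into a metadata store. The sketch below is one possible shape under assumed field names; the paths and parameter values are hypothetical placeholders, not values from the framework.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class ModelMetadata:
    """A hypothetical model profile covering the fields listed in the feature table."""
    model_name: str
    version: int
    parameters: Dict[str, float]
    input_path: str
    output_path: str
    resource_consumption: Dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the profile so it can be written to a metadata store."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ModelMetadata":
        """Rebuild a profile from its stored JSON form."""
        return cls(**json.loads(text))


profile = ModelMetadata(
    model_name="sensor_failure_classifier",
    version=3,
    parameters={"regularization": 0.01},
    input_path="hdfs:///example/sensors/history",     # placeholder path
    output_path="hdfs:///example/sensors/predictions",  # placeholder path
    resource_consumption={"cpu_hours": 1.5, "memory_gb": 8.0},
)

restored = ModelMetadata.from_json(profile.to_json())
print(restored == profile)  # True
```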
Deploy Accenture’s Model Management Framework on-premise to
operationalize analytics in a big data analytics platform
At Accenture Labs, we have a patent-protected invention on the model management framework
that showcases the unique capabilities of our framework. If you have analytical models running in
a big data analytics platform, we can help deploy our model manager in your environment before
problems arise as the number of types of models and runtime environments you need to support
increases
Simplified modeling process for data
scientists
Abstracts data engineering and presents champion-challenger view for your data scientists to
more quickly train, compare, and promote their models for deployment.
Provide analytics for Internet of Things
use cases
Processing data from heterogeneous data stores allows data from thousands of
sensors to be sent through the modeling pipeline, leveraging the existing platform’s analytical capabilities.
Enabled for real-time analytics
The model manager can deploy prediction jobs that ingest streaming data and apply a trained
model for real-time predictions.
Greater coverage of runtime
environments and models
Extends the capability to support additional runtime environments, increasing the number of
types of models you can use in your data pipeline.
Democratized access to analytics
A shared library of models created by experts allows other data scientists and business analysts to
leverage the models for their use cases.
Contact Information
Accenture Labs
Teresa Tung
Technology Labs Fellow
teresa.tung@accenture.com
Carl Dukatz
R&D Senior Manager
carl.m.dukatz@accenture.com
Chris Kang
R&D Associate Principal
chris.kang@accenture.com
Appendix
The solution: A new Model Management Framework
Simplifying model deployment at scale
A simplified
interface
RESULTS
• Enables a catalog approach to finding analytics
• Simplified onboarding of new analytics
• Brute-force approach to retraining and comparing models
Comprises a model
building service, a prediction
service, and a resource
allocation service
Supports end-to-end
analytical modeling at scale
using the Lambda
Architecture
Hides the complexity of
Lambda and unlocks its
power for data scientists,
domain experts, and
business analysts
Benefits of the new framework
Unlocking the power of Lambda for data scientists, domain experts, and business analysts
Data scientists and domain experts
who generate the models can:
• Select from already captured
modeling approaches or onboard
their own
• Easily compare models in a
champion-challenger fashion
Business analysts who rely on the
models’ results can select from a
catalog of models created by
experts
Model Management Framework differs from other approaches in its
enablement of big data capability with heterogeneity and scalability
Other analytics approaches focus on designing and fine-tuning machine learning algorithms to
improve accuracy, using modeling tools that are hard to scale or speed up. For example, the WEKA
libraries provide comprehensive machine learning algorithms but lack the capability to
integrate with big data or manage thousands of models. Apache Mahout
works with Hadoop MapReduce but is slowed down by frequent writes to disk.
Comparison Examples
Model Management Framework
• I want to run my analytics on the distributed data set with the size of
TB or PB which is geographically distributed and stored in various
databases
• I want to deploy multiple models on distributed resources and let the
framework automatically select the best model based on the metrics
I have defined
• I want to specify the prediction interval and query the results by
calling API endpoints
• I want to always use the up-to-date model by having the framework
retrain the current model or selecting a new champion model
Other Model Management
• I want to improve my SVM classification algorithm by 3% in
terms of accuracy with my 300MB dataset residing on my local disk
• I want to try various algorithms and fine tune parameters to see how
the accuracy can be improved
• I want to apply the trained model for new data for prediction by
calling the modeling method and specifying where to store the
results. I need to try multiple prediction intervals to see which works.
• I want to see the prediction results by plotting the data from the file
where the results are saved

More Related Content

What's hot

Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryDataWorks Summit/Hadoop Summit
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceDataWorks Summit/Hadoop Summit
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsDataWorks Summit
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseDataWorks Summit
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackDataWorks Summit/Hadoop Summit
 
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTestistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTurkish Testing Board
 
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
Show me the Money! Cost & Resource  Tracking for Hadoop and Storm Show me the Money! Cost & Resource  Tracking for Hadoop and Storm
Show me the Money! Cost & Resource Tracking for Hadoop and Storm DataWorks Summit/Hadoop Summit
 
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseData Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseDataWorks Summit
 
Data Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystemData Regions: Modernizing your company's data ecosystem
Data Regions: Modernizing your company's data ecosystemDataWorks Summit/Hadoop Summit
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Why is my Hadoop cluster s...Data Con LA
 

What's hot (20)

Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive Industry
 
Active Learning for Fraud Prevention
Active Learning for Fraud PreventionActive Learning for Fraud Prevention
Active Learning for Fraud Prevention
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open Source
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Intro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJIntro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJ
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data Warehouse
 
Big Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on SparkBig Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on Spark
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
 
Automated Analytics at Scale

  • 1. Automated Analytics at Scale: Model Management in Streaming Big Data Architectures. Chris Kang
  • 2. Challenges of Model Management (Real-time Analytics at Scale)
      • Machine learning allows organizations to proactively discover patterns and predict outcomes for their operations, and improving those insights requires deploying better analytical models on their data.
      • Finding the best analytical model requires running thousands of hypotheses on various datasets and comparing models in a brute-force approach.
      • Currently a model management framework does not exist - that is, an agnostic tool or framework that manages and orchestrates the entire lifecycle of a model.
      Copyright © 2016 Accenture All rights reserved.
  • 3. Model Management Framework operationalizes analytics to ease development and deployment of analytical models
      The framework provides key benefits to operationalize and democratize access to analytical modeling at scale:
      • Captures and templates analytical models created by expert data scientists for easy reuse
      • Faster development of analytical models with rapid iteration of training and comparing models using a brute-force approach
      • Presents a champion-challenger view to visually compare and promote trained models
      • Reduces complexity for data scientists to train and deploy models
      • Enables business analysts and others to participate in the modeling process
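The brute-force comparison and champion-challenger promotion described on this slide can be sketched roughly as follows. This is a minimal illustration, not the framework's implementation: the moving-average forecasters, the mean-absolute-error metric, and every name (`moving_average_model`, `pick_champion`, `ma_4`, the sample readings) are assumptions made for the example.

```python
from statistics import mean

def moving_average_model(window):
    """Hypothetical candidate: forecast the mean of the last `window` values."""
    def predict(history):
        return mean(history[-window:])
    return predict

def mean_absolute_error(model, series, warmup):
    """Score one-step-ahead forecasts on every point from index `warmup` on."""
    errors = [abs(model(series[:i]) - series[i]) for i in range(warmup, len(series))]
    return mean(errors)

def pick_champion(candidates, series, warmup, champion_name):
    """Score all candidates; promote a challenger only if it beats the champion."""
    scores = {name: mean_absolute_error(m, series, warmup)
              for name, m in candidates.items()}
    best = min(scores, key=scores.get)
    if scores[best] < scores[champion_name]:
        return best, scores
    return champion_name, scores

# Made-up sensor readings with a steady upward trend.
readings = [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0]
# Brute force: one candidate per window size.
candidates = {f"ma_{w}": moving_average_model(w) for w in (1, 2, 4)}
champion, scores = pick_champion(candidates, readings, warmup=4, champion_name="ma_4")
print(champion, scores)
```

On this trending series the shortest window tracks the data most closely, so the challenger `ma_1` replaces the incumbent `ma_4`. A real framework would compare far more model families and metrics, but the promote-only-if-better rule stays the same.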
  • 4. Model Management Framework is essential for the Internet of Things platform
      The Internet of Things platform exposes thousands of sensors that require models to be automatically managed and maintained, and that need easy access to the predicted results.
      • Identify desired insights: Identify insights for operationalizing devices/machinery for various purposes: anomaly detection, predictive maintenance, budget and resource optimization
      • Collect data: Collect various types of data (time series or static) and store them in the databases that best fit the data type
      • Analyze: Train the analytical models using the model management framework, or train them with other analytical tools such as R and then onboard them to the framework
      • Actuate and optimize: Set up rules to act on predicted results from thousands of sensors, e.g. schedule maintenance or lower the temperature on a device
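The "Actuate and optimize" step, where rules act on predicted results, could look something like the sketch below. The `Rule` class, the thresholds, the field names, and the action strings are all hypothetical and chosen only to mirror the slide's two examples (scheduling maintenance, lowering a device's temperature).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    """Hypothetical rule: run `action` when `condition` holds for a prediction."""
    name: str
    condition: Callable[[dict], bool]
    action: str

def evaluate(rules, prediction):
    """Return the actions of every rule whose condition matches the prediction."""
    return [rule.action for rule in rules if rule.condition(prediction)]

# Assumed thresholds; a real deployment would tune these per device type.
rules = [
    Rule("overheat", lambda p: p["predicted_temp_c"] > 80.0, "lower_temperature"),
    Rule("wear", lambda p: p["failure_probability"] >= 0.7, "schedule_maintenance"),
]

# Made-up model output for a single sensor.
prediction = {"sensor_id": "pump-17", "predicted_temp_c": 85.2, "failure_probability": 0.4}
actions = evaluate(rules, prediction)
print(actions)
```

Keeping rules as data rather than hard-coded branches is one way to let the same evaluation loop serve thousands of sensors with different thresholds.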
  • 6. Organizations today have an unprecedented amount of data available because of the Internet of Things, the web, and social media. In order to take advantage of this massive set of data, organizations must build analytics platforms. Source: IBM, Big Data Hub, 2013
  • 7. Traditional analytics platforms use big data technologies to process and analyze large amounts of data. “Excited by big data technology capabilities to store more data, more diverse data and more real-time data, (companies) focus on data collection. Rapidly growing data stores put increasing pressure on figuring out what to do with this data. Determining the value of the collected data becomes the top challenge in all industries.” Source: Svetlana Sicular, Gartner, October 30 2015. The steps to derive value out of the data include collecting, processing, and analyzing the data using a variety of big data tools. • Data Collection: Store huge volumes of data in multiple data stores in a variety of data types for processing. • Data Processing: Process the data by filtering, transforming, and applying machine learning algorithms using computing engines. • Analytics and Visualization: Create ad hoc reports on processed data using business intelligence and visualization tools.
  • 8. Enterprises need access to both historical and real-time data to gain the most value out of big data analytics. • Real-time data is data that is processed within sub-seconds to seconds from the time it arrives to the time the results are derived. • Batch processing technologies alone are insufficient: in the time it takes to process a batch (hours or days), new real-time data accumulates and is missed, which costs the opportunity for proactive decision making. • Storing data in a fault-tolerant, replicated historic store, processing a large batch of data, and storing the processed data using batch writes incurs delays that make real-time analytics infeasible. • Because of the batch processing delay, the captured data is already stale by the time it is available for queries, so queries only reach data that is hours or days old. • Real-time data is missed by the time analytics begins, and its absence limits the analytics to ad hoc summarizations and aggregations. [Diagram labels: Data, Historic Data Store, Batch, Batch Write, Query, Real-Time Data; stages: Storage, Processing, Serving]
  • 9. The Lambda Architecture empowers real-time analytics by handling data at scale and in real time using a hybrid architecture. • Designed by Nathan Marz, the creator of the Apache Storm project and previously a lead engineer at Twitter, it was intended as a general architecture for processing big data at scale. • The architecture separates batch processing on historical data from stream processing on the real-time flow of data, so analytics can now be performed on the most up-to-date data combined with historical data views. • BATCH LAYER focuses on processing historical data views for queries. • SPEED LAYER handles the complexity of real-time data collection and analysis. [Diagram labels: Data, Historic Data Store, Batch, Batch Write, Queue, Speed, Random Write, Query; stages: Storage, Processing, Serving]
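The way a query combines the batch layer and the speed layer can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that both layers expose per-sensor counts as plain dictionaries; the names `merge_views`, `batch_view`, and `speed_view` are hypothetical and do not come from the slides.

```python
from collections import Counter


def merge_views(batch_view, speed_view):
    """Answer a query by adding the speed layer's recent counts
    to the counts precomputed by the batch layer."""
    merged = Counter(batch_view)   # copy of the (possibly stale) batch view
    merged.update(speed_view)      # Counter.update adds counts per key
    return dict(merged)


# Hypothetical per-sensor reading counts.
batch_view = {"sensor-a": 1000, "sensor-b": 750}  # from the last batch run
speed_view = {"sensor-a": 12, "sensor-c": 3}      # events since that run

combined = merge_views(batch_view, speed_view)
print(combined)
```

Summing works here because counts are additive; other aggregates (for example averages) would need the layers to store enough state to be merged correctly.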
  • 10. In the Internet of Things, predictive modeling on sensor data allows organizations to discover patterns and predict outcomes for their operations.
      • The real value of big data is the insight via the analytics, not just the collection of the data.
      • Predictive modeling is the primary means by which companies can discover trends and make proactive, as opposed to reactive, decisions on data.
      • Data Collection
          • Oil & Gas Producer: Collects data from over 190,000 sensors.
          • Water Utility Client: Collects data from sensors placed along pipes in a water distribution network.
      • Data Processing
          • Oil & Gas Producer: Injects 6,000 rows/second and 11 billion rows of data per month, a larger analytics platform than Twitter.
          • Water Utility Client: Processes data for water flow rate and pressure.
      • Predictive Modeling
          • Oil & Gas Producer: Has over 3,500 models analyzing data using various algorithms.
          • Water Utility Client: Applies a predictive model to project forward in time to see spikes or falls that exhibit warning signs of failure.
      • Proactive Decision Making
          • Oil & Gas Producer: Enables the company to examine huge sets of data and discover trends to predict outcomes in operation and exploration efforts.
          • Water Utility Client: Uses results from the predictive model to proactively reduce pressure spikes, avoiding leaks, prolonging the longevity of assets, and reducing disruption to customers.
      [Diagram labels: Sensors at Field Sites, NoSQL for Unstructured Data, Computing Engines and Stream Processors, Machine Learning Algorithms, Model Runtime Environments, Predictive Results, Remediation, Notification and Alerts]
  • 11. The modeling process is iterative, and its lifetime spans both batch-mode model training and real-time prediction.
      In general, a model creates an output for an unknown target value given a defined set of inputs. In a time-series model, the target value also depends on time as an input.
      Example goal (Data Scientist): “I want to deploy a model that can detect if a sensor is faulty in real-time.”
      • Build Model
          • Identify required data and how to get it
          • Design and validate specific analytic models
          • Verify approach through initial set of insights on particular environments
          • Example: Analyzes a variety of machine learning algorithms and identifies the logistic regression model as the most suitable for the problem. Codes the model as a .JAR file.
      • Train Model
          • Prepare historical data for training
          • Select model input parameters and runtime environment
          • Train the model on data from historical batch and/or real-time stream in runtime environment
          • Example: Selects input parameters such as the regularization parameter for the logistic regression model. Submits the model to Spark to train the model on historical data in HDFS.
      • Monitor Execution
          • Monitor the status of training the model in the runtime environment (e.g. running, succeeded, failed)
          • Troubleshoot issues in the runtime environment if necessary
          • Example: Opens the terminal, connects to the Hadoop cluster over SSH, and enters the commands to verify the status of the model as it is trained.
      • Compare Models
          • Compare trained models in champion-challenger fashion
          • Brute-force approach to finding best-of-breed model for deploying to live stream
          • Example: After iteratively training many models, selects the best-of-breed, that is, the model with the lowest mean square error.
      • Operationalize Model
          • Deploy best model on live stream of data
          • Generate predicted results for automated or manual proactive decision making
          • Observe results to feed back and fine-tune the model
          • Example: Submits the model to Spark Streaming to be applied to streaming data ingested from Kafka, and the model predicts in real time whether a sensor will fail.
      [Diagram labels: Data Scientist, Data Science, System Administration]
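The Compare Models step, selecting the best-of-breed model by lowest mean square error, can be sketched in a few lines of Python. This is an illustration under assumptions, not the framework's implementation: the function names, the candidate model names, and the sample predictions are all hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rank_models(actual, predictions_by_model):
    """Return (model_name, mse) pairs sorted from lowest to highest error."""
    scores = {
        name: mean_squared_error(actual, preds)
        for name, preds in predictions_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical hold-out labels and predictions from three trained candidates.
actual = [1.0, 0.0, 1.0, 1.0]
candidates = {
    "logreg_reg_0.01": [0.9, 0.2, 0.8, 0.7],
    "logreg_reg_0.10": [0.6, 0.4, 0.6, 0.5],
    "logreg_reg_1.00": [0.5, 0.5, 0.5, 0.5],
}

ranked = rank_models(actual, candidates)
champion, challengers = ranked[0], ranked[1:]
print("champion:", champion)
print("challengers:", challengers)
```

The first entry of the ranking is the champion and the rest are challengers; retraining with new parameters simply adds more entries to `candidates` before re-ranking.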
  • 12. Challenges with Analytical Modeling in the Current State
  • 13. Building, training, and deploying analytical models requires a rare combination of data science and engineering skills.
      The ability to complete the modeling process is limited to specialized individuals who are experts in both data science and engineering.
      “The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.” Source: McKinsey Global Institute analysis
      • Full Set of Skills Needed for Model Building and Deployment
          • Understanding of a variety of machine learning algorithms and pattern recognition, as well as expertise in a domain.
          • Ability to build and code accurate models based on the problem space.
          • System administrator skills as well as a deep understanding of big data systems to deploy models in a runtime environment.
      • Data Scientist
          • Traditional Strengths: Mathematics, statistics, machine learning, data mining, pattern recognition, predictive algorithms, domain expertise.
          • Potential Hurdles with Model Building and Deployment: Troubleshooting and running a runtime environment such as Spark requires advanced system engineering skills, which a data scientist may not be trained in. This can lead to slower development and deployment of predictive models.
      • Business Analyst
          • Traditional Strengths: Domain expertise, business processes, requirements gathering.
          • Potential Hurdles with Model Building and Deployment: Traditional business analysts may lack core skills in data science or data engineering because they lack experience building, training, or deploying models.
      • Dual Data Scientist and Engineer
          • Traditional Strengths: Combination of data science skills with software engineering and system administrator skills for big data systems.
          • Potential Hurdles with Model Building and Deployment: May lack domain expertise, in which case it may take longer to build and train relevant models for the use case.
  • 14. Analytical models are not easily reusable or shareable, resulting in siloed analytics work.
      There is no standard method for sharing models that would let users leverage models created by other data scientists, so the analytics work is siloed. This is true both for freshly built models and for models that were already trained on a dataset. Predictive models duplicate and sprawl as data scientists build and train their own individual libraries of models that are not shared.
      • No standard exists for sharing or viewing other data scientists’ models.
      • Individual Libraries of Models: Data scientists primarily leverage their own libraries of models and the datasets they previously worked with to select an algorithm and build a model for the current problem.
      • Model Duplication: As models are built and trained, the same types of models may be built by more than one data scientist, particularly if those types of models are common in the industry’s use cases.
      • Model Sprawl: Over time, as more data scientists build and train more models, the models begin to sprawl and duplicate unnecessarily, making central management of models more difficult.
      [Diagram labels: Train and deploy individual models; Runtime execution environments for model training and deployment]
  • 15. Without a framework, the current approach is too inflexible to support multiple runtime execution environments.
      It is impractical to scale the number of runtime environments used to train and deploy models with a manual approach. Runtime environments often cannot support all types of models. As a result, data scientists must spend time learning environments instead of using that time for analytical modeling.
      • Data scientist concerns
          • “I have a model, but I don’t know which runtime environment can support it.”
          • “I’m only familiar with R, so I need to learn all the environments to test my model.”
          • “I have a new type of model so I need to learn another runtime environment.”
      • Consequences
          • The data scientist needs to acquire the system administration skills to operate the runtime environments.
          • Each runtime environment is unique and requires time and energy to learn.
          • In the worst case, the data scientist must try every runtime environment before finding a match for the predictive model.
          • As more model types are needed, additional runtime environments must be learned.
          • Learning additional environments becomes a time-consuming endeavor.
      [Diagram: a Spark model with R dependencies and a model with R dependencies are tested against runtime environments, with four outcomes: dependencies match and runtime can support model; missing Spark functionality to execute model; missing specific R dependency so cannot support model; all R libraries supported and can execute model. Cycle labels: Update, Test, Learn]
  • 16. Lack of engineering abstraction makes it difficult to quickly train predictive models on data.
      Data scientists lose productivity because the process of training models is manual: it requires manually checking the status of a model in the environment, as well as system administration to troubleshoot the model in the environment.
      • No abstractions exist for training or monitoring models on runtime environments.
      • The need for abstraction grows as the number of types of models and runtime environments increases.
      • Wasted productivity: time is spent on data engineering instead of comparing models to find the best-of-breed for deployment.
      • More time is spent on system administration and less time on building predictive models.
      • Manual process (Data Scientist)
          • Build Model: Build many models on various algorithms, trying different input parameters and algorithms to find the best-of-breed model.
          • Train Model and Monitor Status: Train model, check status of model, troubleshoot model; repeated for hundreds of models on various runtime environments.
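The manual train, check status, troubleshoot loop described above is the kind of work a small polling helper could absorb. The Python sketch below is an assumption-laden illustration: `wait_for_job`, the `get_status` callable, and the `"timed_out"` result are hypothetical, and only the states running, succeeded, and failed come from the slides.

```python
import time

# Terminal states named in the deck; "running" means keep polling.
TERMINAL_STATES = {"succeeded", "failed"}


def wait_for_job(get_status, poll_seconds=1.0, max_polls=100, sleep=time.sleep):
    """Poll a status callable until the job finishes or polls run out.

    get_status is a hypothetical stand-in for whatever call reports a
    training job's state in a given runtime environment.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    return "timed_out"  # assumed extra state, not from the slides


# Simulated job that reports "running" twice and then succeeds.
statuses = iter(["running", "running", "succeeded"])
result = wait_for_job(lambda: next(statuses), poll_seconds=0)
print(result)
```

Passing the status lookup in as a callable keeps the loop independent of any particular runtime environment, which is the point of the abstraction argued for on this slide.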
  • 17. Model Management Framework for Automated Analytics at Scale
  • 18. Model Management Framework simplifies the training, deployment, and management of a large number of models for a Lambda architecture.
      Model management is a framework that lets data scientists and other users more easily train and deploy analytical models in various runtime environments on the Lambda architecture. It abstracts the system administration, reduces the complexity of training and deployment, and shares the models in a way that is consumable by users in your organization, enabling other users such as business analysts to take part in the modeling process.
      The framework in this reference architecture proposes:
      • Model Store and Trained Model Store: A library of models of commonly used machine learning algorithms that can be trained on a user’s historical datasets, as well as trained models that are available to be deployed.
      • Model Interface Templates: Interfaces that abstract away the complexity of the machine learning algorithm, allowing users to specify the inputs and outputs of the model.
      • Deployment and Scheduler: Automatic training, deployment, and scheduling of models on runtime environments so that users do not need to operate the runtime environments themselves.
      • Runtime Verifier: Ability to determine which runtime environments can support a model prior to execution, enabling faster development of trained models.
      • Monitoring Service and Metadata Store: A service that monitors the status of the model during its execution on the runtime environment on the user’s behalf and stores any metadata about that execution.
      • API: Exposes functionality through API endpoints for users to verify, train, deploy, and monitor models on runtime environments.
      [Diagram labels: Users (Data Scientists, Business Analysts); API; Model Management (Model Interface Templates, Deployment and Scheduler, Runtime Verifier, Monitoring Service, Model Store, Trained Model Store, Metadata Store); Runtime Environments (Real Time Analytics, Distributed Computing, Scientific Computing)]
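One way to picture the Runtime Verifier is as a set comparison between what a model requires and what each runtime environment provides. The Python sketch below assumes that simplification; the class names, capability tags such as `"r:forecast"`, and environment names are hypothetical and are not taken from the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeEnvironment:
    name: str
    capabilities: frozenset  # hypothetical tags for engines and libraries


@dataclass(frozen=True)
class ModelSpec:
    name: str
    requirements: frozenset  # tags the model needs in order to execute


def supporting_runtimes(model, runtimes):
    """Names of runtimes whose capabilities cover every model requirement."""
    return [rt.name for rt in runtimes if model.requirements <= rt.capabilities]


def missing_requirements(model, runtime):
    """Requirements that a given runtime cannot satisfy."""
    return set(model.requirements - runtime.capabilities)


runtimes = [
    RuntimeEnvironment("spark-cluster", frozenset({"spark", "jvm"})),
    RuntimeEnvironment("r-server", frozenset({"r", "r:forecast"})),
    RuntimeEnvironment("spark-with-r", frozenset({"spark", "jvm", "r", "r:forecast"})),
]
model = ModelSpec("spark-model-with-r-deps", frozenset({"spark", "r:forecast"}))

print(supporting_runtimes(model, runtimes))
print(missing_requirements(model, runtimes[0]))
```

Checking before submission replaces the trial-and-error search across environments with a lookup, and the missing-requirements set gives the user something concrete to fix.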
  • 19. Model Management Framework provides seamless interfaces along the data analytics pipeline for model creation, deployment, and scheduling.
      • Design for seamless interfaces connects the stages of the modeling pipeline so that domain experts and data scientists can create and update models and business analysts can extract data insights.
      • Model management at scale is specific to large-scale data analytics, which requires distributed resource allocation and communication with various data stores.
      The framework in this technical architecture proposes:
      • Runtime Environments: Backend runtime environments such as Spark, MapReduce, R, and more interact with distributed resources (e.g. Hadoop) to train and deploy models.
      • Historical Data Store: Data virtualization interacts with various databases (e.g. Cassandra, Redshift, S3).
      • Training, Prediction, Model Runtime Services: Framework services interact with the runtime service to deploy models and allocate resources for them, as well as to verify models for execution.
      • APIs: APIs interact with framework services for various functionalities.
      • Online Message Queue: The message queue is injected with real-time data.
      [Diagram labels: Data Scientist, Business Analyst, User Interface, API, Training Service, Prediction Service, Resource Allocation Service, Model Runtime Service, Model Store, Model Metadata Store, Results Store, Historical Data Storage, Online Message Queue, Runtime Environments]
  • 20. Demo
  • 21. Model Management Framework covers a number of features to support various perspectives.
      The framework's services provide the following features to better serve domain experts/data scientists and business analysts (Feature: Explanation):
      • Automatic model deployment on multiple runtime environments: Automatically prepares a trained model, with its saved .jar file, to serve real-time data on multiple runtime environments, with pre-verification prior to execution.
      • Modeling algorithm library: A library of algorithms for machine learning and statistical learning.
      • Model metadata: A model profile that describes the configuration parameters, path to input/output data, model version, and resource consumption.
      • Heterogeneous data stores: Data can be stored in various databases.
      • Champion-challenger model: Multiple models, with the best-performing model as the champion and the rest as the challengers.
      • Batch mode and real-time mode: A combination of model training and serving the model to real-time data.
      • Model update: Retraining of the current model or re-selection of the champion model.
      • Job completion time estimation: An estimate of how soon a job can be completed given the current resources.
      • Prediction results query and UI: Access to prediction results from applying a trained model to real-time data, for dashboard display.
      • Algorithm parameter tuning: Automatic fine-tuning of algorithm parameters to achieve the best model quality.
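The "Model metadata" feature describes a profile holding configuration parameters, input/output paths, model version, and resource consumption. A minimal Python sketch of such a record follows; the field names, paths, and values are hypothetical, chosen only to mirror the items the slide lists.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelMetadata:
    """Hypothetical model profile mirroring the items listed on the slide."""
    model_name: str
    version: int
    parameters: dict      # configuration parameters, e.g. regularization
    input_path: str       # path to input data
    output_path: str      # path to output data
    resource_usage: dict = field(default_factory=dict)  # resource consumption


record = ModelMetadata(
    model_name="sensor-failure-logreg",
    version=3,
    parameters={"regularization": 0.01},
    input_path="hdfs:///data/sensors/history",
    output_path="hdfs:///models/sensor-failure/v3",
    resource_usage={"executors": 4, "memory_gb": 16},
)

# Round-trip through JSON, as a metadata store might persist the record.
serialized = json.dumps(asdict(record), sort_keys=True)
restored = ModelMetadata(**json.loads(serialized))
print(serialized)
```

Keeping the version and parameters alongside the data paths is what makes a later champion-challenger comparison or retraining run reproducible.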
  • 22. Deploy Accenture’s Model Management Framework on-premise to operationalize analytics in a big data analytics platform.
      At Accenture Labs, we have a patent-protected invention on the model management framework that showcases the unique capabilities of our framework. If you have analytical models running in a big data analytics platform, we can help deploy our model manager in your environment before problems arise as the number of types of models and runtime environments you need to support increases.
      • Simplified modeling process for data scientists: Abstracts data engineering and presents a champion-challenger view so your data scientists can more quickly train, compare, and promote their models for deployment.
      • Provide analytics for Internet of Things use cases: Processing data from heterogeneous data stores allows data from thousands of sensors to be sent through the modeling pipeline, leveraging the existing platform’s analytical capabilities.
      • Enabled for real-time analytics: The model manager can deploy prediction jobs that ingest streaming data and apply a trained model for real-time predictions.
      • Greater coverage of runtime environments and models: Extends the capability to support additional runtime environments, increasing the number of types of models you can use in your data pipeline.
      • Democratized access to analytics: A shared library of models created by experts allows other data scientists and business analysts to leverage the models for their use cases.
  • 23. Contact Information: Accenture Labs
      • Teresa Tung, Technology Labs Fellow, teresa.tung@accenture.com
      • Carl Dukatz, R&D Senior Manager, carl.m.dukatz@accenture.com
      • Chris Kang, R&D Associate Principal, chris.kang@accenture.com
  • 24. Appendix
  • 25. The solution: A new Model Management Framework. Simplifying model deployment at scale.
      • A simplified interface
          • Supports end-to-end analytical modeling at scale using the Lambda Architecture
          • Hides the complexity of Lambda and unlocks its power for data scientists, domain experts, and business analysts
          • Comprises a model building service, a prediction service, and a resource allocation service
      • RESULTS
          • Enables a catalog approach to finding analytics
          • Simplified onboarding of new analytics
          • Brute-force approach to retraining and comparing models
  • 26. Benefits of the new framework: Unlocking the power of Lambda for data scientists, domain experts, and business analysts. Data scientists and domain experts who generate the models can: • Select from already captured modeling approaches or onboard their own • Easily compare models in a champion-challenger fashion. Business analysts who rely on a model’s results can select from a catalog of models created by experts.
  • 27. Model Management Framework differs from other approaches in its enablement of big data capability with heterogeneity and scalability.
      Other analytics approaches focus on designing and fine-tuning machine learning algorithms to improve accuracy, using modeling tools that are hard to scale or speed up. For example, the WEKA libraries provide comprehensive machine learning algorithms but lack the capability to integrate with big data or manage thousands of models. As another example, Apache Mahout works with Hadoop MapReduce but is slowed down by frequent writes to disk.
      Comparison Examples
      • Model Management Framework
          • I want to run my analytics on a distributed data set of TB or PB size that is geographically distributed and stored in various databases.
          • I want to deploy multiple models on distributed resources and let the framework automatically select the best model based on the metrics I have defined.
          • I want to specify the prediction interval and query the results by calling API endpoints.
          • I want to always use the up-to-date model by having the framework retrain the current model or select a new champion model.
      • Other Model Management
          • I want to improve my SVM classification algorithm by 3% in terms of accuracy with my 300MB dataset residing on my local disk.
          • I want to try various algorithms and fine-tune parameters to see how the accuracy can be improved.
          • I want to apply the trained model to new data for prediction by calling the modeling method and specifying where to store the results. I need to try multiple prediction intervals to see which works.
          • I want to see the prediction results by plotting the data from the file where the results are saved.

Editor's Notes

  1. The Lambda Architecture delivers the promise of analytics that is both real-time over streamed data and batch over comprehensive data. But its use relies largely on individuals with architecture savvy and sys admin skills for the capture, scheduling, and deployment of analytical models. We introduce a Model Management Framework that presents a simplified interface that supports end-to-end analytical modeling at scale using the Lambda Architecture. The framework hides the complexity of Lambda and unlocks its power for data scientists, domain experts, and business analysts. - Data scientists and domain experts who generate the models can select from already captured modeling approaches or onboard their own. The platform makes it easy to compare models in a champion-challenger fashion. - Business analysts who rely on a model’s results can select from a catalog of models created by experts. The Model Management Framework comprises a model building service, a prediction service, and a resource allocation service. The result enables a catalog approach to finding analytics, simplified onboarding of new analytics, and a brute-force approach to retraining and comparing models.
  2. How to apply and next steps Identify desired insights Data collection Model-driven analytics Take
  3. References http://www.ibmbigdatahub.com/infographic/big-data-making-world-go-round
  4. References http://data-informed.com/lambda-architecture-can-analyze-big-data-batches-near-real-time/ http://blog.couchbase.com/lamda-architecture-and-beyond-with-nosql
  5. References http://data-informed.com/lambda-architecture-can-analyze-big-data-batches-near-real-time/
  6. References http://www.hcltech.com/sites/default/files/resources/whitepaper/files/2014/08/18/key_to_monetizing_big_data_via_predictive_analytics.pdf
  7. Pillars with different colors. Gear the problem towards the pillar phases, as opposed to roles from this slide and onwards.
  8. References http://www.mckinsey.com/features/big_data
  9. Manually launching models, querying their status to check the run, and troubleshooting any errors is time-consuming.
  10. Update with verification, automatic deployment on runtime environments
  11. Compare big data model management and small data model management