Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Natalino Busa
Data Platform Architect at ING
Distributed computing Machine Learning
Statistics Big/Fast Data Streaming Computing
@natalinobusa | linkedin.com/in/natalinobusa
@natbusa | linkedin: Natalino Busa
ING group
Empowering people to stay a step ahead in life and in business.
http://www.ing.com/About-us/Purpose-Strategy.htm
Clear and Easy
Anytime, Anywhere
Empower
Keep getting better
Data Principles
Apply advanced, predictive analytics on live data
Event-driven and exposed via APIs
Lean architecture, easy to integrate
Available, consistent, streaming, real-time data
Resilient, distributed, scalable, maintainable
Big Data and Fast Data
[Figure: population (events, transactions, sessions, customers, etc.) plotted against time, on an axis running from 10 years to 1 minute. Regions, from oldest to newest: historical big data, recent data, event streams.]
Why Fast Data?
1. Relevant up-to-date information.
2. Delivers actionable events.
Why Big Data?
1. Analyze and model
2. Learn, cluster, categorize, organize facts
Training, Scoring and Exposing models
Distributed Data Store
Fast Analytics
Event Processing
Real Time APIs
Streaming Data
Data Modeling
Data Sources, Files, DB extracts
Batched Data
Alerts and Notifications
API for mobile and web
read the model
read the data
write the model
Cassandra+Akka+Spark: Machine Learning
C* (Cassandra)
- Fast writes
- 2D data structure
- Replicated
- Tunable consistency
- Multi-data centers
Akka
- Very fast processing
- Distributed, scalable computing
- Actor-based pipelines
- Actor state can be persisted
- Supervision strategies
Spark
- Ad-hoc queries
- Joins, aggregates
- User defined functions
- Machine learning, advanced stats and analytics
Akka-Cassandra-Spark Stack
Cassandra-Spark Connector
Cassandra
Spark: Streaming, SQL, MLlib, GraphX
Extract Data
Create Models, Enrich, Transform
Fetch from other Sources: Kafka
Fetch from other Sources: DBs, Files
Akka
Analytics, Statistics, Data Science, Model Training
Access Model
Persist Actors' State
Cassandra-Spark Connector
Cassandra: Store all the data
Spark: Analyze all the data
DC1: replication factor 3
DC2: replication factor 3
DC3: replication factor 3 + Spark Executors
Storage! Analytics!
Cassandra-Spark Stack
Cassandra: Store all the data
Spark: Distributed Data Processing, Executors and Workers
Cassandra-Spark Connector: Data locality, Reduce Shuffling, RDDs to Cassandra Partitions
DC3: replication factor 3 + Spark Executors
Data Science: Anomaly Detection
"An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism."
Hawkins, 1980
Data Science: Anomaly Detection (1)
Parametric Based
Gaussian Model Based
Histogram, nonparametric
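As a rough illustration of the Gaussian model based approach, the sketch below flags values that lie more than a chosen number of standard deviations from the sample mean. The object name GaussianOutliers, the function names and the default threshold of 3.0 are assumptions made for this example; they do not come from the slides.

```scala
object GaussianOutliers {
  // Mean and population standard deviation of a sample.
  def meanAndSd(xs: Seq[Double]): (Double, Double) = {
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    (mean, math.sqrt(variance))
  }

  // Values whose z-score exceeds `threshold` (hypothetical default: 3.0).
  def outliers(xs: Seq[Double], threshold: Double = 3.0): Seq[Double] = {
    val (mean, sd) = meanAndSd(xs)
    if (sd == 0.0) Seq.empty
    else xs.filter(x => math.abs(x - mean) / sd > threshold)
  }
}
```

A single extreme value among n points can reach a z-score of at most sqrt(n - 1), so a threshold of 3 only makes sense for samples of more than ten points.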
Data Science: Anomaly Detection (2)
Distance Based
Density Based
Example: Analyze Gowalla check-ins
Check-ins dataset
 year | month | day | time | uid  | lat      | lon       | ts                       | vid
------+-------+-----+------+------+----------+-----------+--------------------------+--------
 2010 |     9 |  14 |   91 |  853 | 40.73474 | -73.87434 | 2010-09-14 00:01:31+0000 | 917955
 2010 |     9 |  14 |  328 | 4516 | 40.72585 | -73.99289 | 2010-09-14 00:05:28+0000 |  37160
 2010 |     9 |  14 |  344 | 2964 | 40.67621 | -73.98405 | 2010-09-14 00:05:44+0000 | 956870
Venues dataset
 vid     | name                       | lat      | long
---------+----------------------------+----------+-----------
  754108 | My Suit NY                 | 40.73474 | -73.87434
  249755 | UA Court Street Stadium 12 | 40.72585 | -73.99289
 6919688 | Sky Asian Bistro           | 40.67621 | -73.98405
Data Science: clustering venues
Intuition: weekly visitor patterns!
Madison Square, Apple Store, Radio City Music Hall: Thursdays, Fridays, Saturdays are busy
Statue of Liberty, Jacob K. Javits Convention Center, Whole Foods Market (Columbus Circle): not popular midweek
Data Science: clustering with k-means
Intuition:
Histogram components as dimensions
Similar histograms occupy similar places in the feature space
How do I compare histograms:
- EMD (earth mover's distance)
- Chi-squared distance
- Space transformation (DCT)
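Of the histogram comparisons listed above, the chi-squared distance is the simplest to write down. The sketch below uses the common symmetric form, 0.5 * sum((a_i - b_i)^2 / (a_i + b_i)); the object and function names are hypothetical, and the slides do not say which variant the author used.

```scala
object HistogramDistance {
  // Scale a histogram so that its bins sum to 1 (left unchanged if all bins are empty).
  def normalize(h: Array[Double]): Array[Double] = {
    val total = h.sum
    if (total == 0.0) h else h.map(_ / total)
  }

  // Symmetric chi-squared distance; bins that are empty in both histograms are skipped.
  def chiSquared(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "histograms must have the same number of bins")
    0.5 * a.zip(b).collect {
      case (x, y) if x + y > 0.0 => (x - y) * (x - y) / (x + y)
    }.sum
  }
}
```

Note that k-means in MLlib clusters on Euclidean distance, so a different histogram distance would need either a space transformation or another clustering algorithm.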
K-Means: Featurize data + cluster
import org.apache.spark.mllib.clustering.KMeans

// PairRDD (vid -> feature vector): weekly pattern per venue.
// vectorize_time and featurize_histogram are helper functions not shown on the slide.
val weekly_visits = checkins_venues.select("vid", "ts")
  .map(row => (row.getLong("vid"), vectorize_time(row.getTimestamp("ts"))))
  .reduceByKey(_ + _)
  .mapValues(featurize_histogram)

// Cluster similar weekly patterns; KMeans.train takes an RDD of vectors, so pass the values.
val numClusters = 15
val numIterations = 100
val clusters = KMeans.train(weekly_visits.values, numClusters, numIterations)
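The slides do not show the featurization helpers, so the following is only a guess at what they might do: map each check-in timestamp to a one-hot day-of-week vector, sum the vectors per venue, and normalize the counts. WeeklyHistogram, vectorizeTime, add and featurize are hypothetical names, and a 7-bin daily histogram in UTC is an assumption; the original may well use finer bins.

```scala
import java.time.{Instant, ZoneOffset}

object WeeklyHistogram {
  // One-hot vector over the 7 days of the week (index 0 = Monday), evaluated in UTC.
  def vectorizeTime(ts: Instant): Array[Double] = {
    val v = Array.fill(7)(0.0)
    val day = ts.atZone(ZoneOffset.UTC).getDayOfWeek.getValue // 1 = Monday .. 7 = Sunday
    v(day - 1) = 1.0
    v
  }

  // Element-wise sum, the plain-Scala counterpart of reduceByKey(_ + _).
  def add(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (x, y) => x + y }

  // Turn counts into fractions of weekly visits so busy and quiet venues are comparable.
  def featurize(counts: Array[Double]): Array[Double] = {
    val total = counts.sum
    if (total == 0.0) counts else counts.map(_ / total)
  }
}
```

Normalizing matters here because k-means would otherwise group venues by overall popularity rather than by the shape of their week.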
Assigning venues to clusters
// Score and assign each venue to a cluster, using the same featurization as in training
val venues_clustered = checkins_venues.select("vid", "ts").where("ts > dateof(now())")
  .map(row => (row.getLong("vid"), vectorize_time(row.getTimestamp("ts"))))
  .reduceByKey(_ + _)
  .mapValues(v => clusters.predict(featurize_histogram(v)))

// Store the venue-to-cluster assignments in Cassandra
venues_clustered.saveToCassandra("lbsn", "venues_cl")
How to use it
1) Classification
Classify venues into given groups
2) Anomaly Detection
Detect a shift in the cluster assignment of a given venue from one week to the next
Keep monitoring weekly changes in patterns; when one happens, trigger a signal
week 26 → week 27: alert!
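The week-over-week check described above can be sketched as a comparison of two assignment maps (venue id to cluster id). The name ClusterShift.shifted is hypothetical, and ignoring venues that appear in only one of the two weeks is a simplifying assumption of this sketch.

```scala
object ClusterShift {
  // Venues whose cluster assignment differs between the previous and the current week.
  // Result: vid -> (previous cluster, current cluster).
  def shifted(prev: Map[Long, Int], curr: Map[Long, Int]): Map[Long, (Int, Int)] =
    prev.toSeq.flatMap { case (vid, before) =>
      curr.get(vid).filter(_ != before).map(after => (vid, (before, after)))
    }.toMap
}
```

Each entry of the result is a candidate alert; in the architecture of this talk the two maps would be read from the assignments table stored in Cassandra.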
Data Science: clustering users’ venues
Intuition:
Users tend to stick to the same places
People have habits
By clustering the places together we can identify anomalous locations
Size of the cluster matters: more points means less anomalous
Mini-clusters and single anomalies are treated in similar ways ...
Data Science: clustering with DBSCAN
DBSCAN finds clusters based on neighbouring density
Does not require the number of clusters k beforehand
Clusters need not be spherical
It's a graph!
Data Science: clustering users’ venues
// Collect each user's check-in coordinates, then cluster them per user.
// dbscan is a locally running DBSCAN implementation (not shown on the slide).
val locs = checkins_venues.select("uid", "lat", "lon")
  .map(s => (s.getLong(0), Seq((s.getDouble(1), s.getDouble(2)))))
  .reduceByKey(_ ++ _)
  .mapValues(dbscan cluster _)
Have a look at: scalanlp/nak
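For readers who want to see what the per-user clustering step does, here is a minimal, quadratic-time DBSCAN in plain Scala. It is a sketch for small per-user point sets, not the scalanlp/nak implementation; SimpleDbscan and its parameters eps and minPts are names chosen for this example, and using Euclidean distance directly on latitude and longitude is an approximation.

```scala
import scala.collection.mutable

object SimpleDbscan {
  type Point = (Double, Double)

  private def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Returns one label per point: 0, 1, 2, ... for clusters and -1 for noise.
  def cluster(points: IndexedSeq[Point], eps: Double, minPts: Int): Array[Int] = {
    val unvisited = -2
    val noise = -1
    val labels = Array.fill(points.size)(unvisited)

    // Indices of all points within eps of point i (including i itself).
    def neighbours(i: Int): Seq[Int] =
      points.indices.filter(j => dist(points(i), points(j)) <= eps)

    var next = 0
    for (i <- points.indices) {
      if (labels(i) == unvisited) {
        val seeds = neighbours(i)
        if (seeds.size < minPts) labels(i) = noise
        else {
          // i is a core point: start a new cluster and grow it through the density graph.
          labels(i) = next
          val queue = mutable.Queue.empty[Int]
          queue ++= seeds
          while (queue.nonEmpty) {
            val j = queue.dequeue()
            if (labels(j) == noise) labels(j) = next // border point, reachable but not core
            else if (labels(j) == unvisited) {
              labels(j) = next
              val nj = neighbours(j)
              if (nj.size >= minPts) queue ++= nj // j is a core point: keep expanding
            }
          }
          next += 1
        }
      }
    }
    labels
  }
}
```

Points labelled -1 are the candidate anomalous locations for that user; a very small cluster can be treated the same way, in line with the intuition that cluster size matters.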
Data Science:
Two ways to find anomalies with clustering
- Cluster a large amount of data with k-means and histograms
- Apply clustering independently to millions of users, identifying each user's patterns with the DBSCAN algorithm
MLlib vs PairRDDs
MLlib
KMeans.train(FeaturesRDD, numClusters, numIterations)
- Truly distributed algorithm
- Classify venues to given groups
- Millions of datapoints
- Limited number of clusters
RDDs map functions
UserFeaturesPairRDD.groupByKey().mapValues( dbscan cluster _ )
- Parallelism easy to exploit
- The function runs locally for each key
- Pick your favourite machine learning algorithms
- Limited number of points
- Running in parallel for millions of keys
Training vs Scoring: Time budgets
● Akka: millisecond response
● Spark: in-memory (big)-data models
[Figure: grid of Model Training against Model Scoring, each axis running from slow (minutes) to fast (millisecs), with three placements: Train: Spark / Score: Spark; Train: Spark / Score: Akka; Train: Akka / Score: Akka.]
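The "Train: Spark, Score: Akka" combination works because scoring a k-means model is cheap: it is a nearest-centre lookup over a handful of vectors. The sketch below shows that lookup in plain Scala, as it might run inside an actor once the cluster centres have been loaded; NearestCentre and its function names are hypothetical and not taken from the talk.

```scala
object NearestCentre {
  private def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centre closest to `features`; cost is linear in the number of centres.
  def predict(centres: IndexedSeq[Array[Double]], features: Array[Double]): Int =
    centres.indices.minBy(i => squaredDistance(centres(i), features))
}
```

With 15 clusters of a few dimensions each, this fits comfortably within a millisecond budget and needs no Spark context at scoring time.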
Which processing? .. and the granularity of data?
What is the latency and the throughput?
It’s all about latency!
Map Reduce: Big Data, Batch Based
RDDs: Big Data, Micro-Batch Based
CRDTs + Monoids: Fast Data, Event Based
MillWheel: Fast + Big Data, Event & Window
Akka
Mixed Load Cassandra Cluster
Coral: Web API for dynamic data flows
Coral: Streaming data via Web APIs
Data events
POST http://coral/api/actors/23/in
{
  "amount": 23.45,
  "user": 76232,
  "city": "Berlin"
}
Coral: Streaming data via Web APIs
Actor: Trigger, Emit, Params, State
GET http://coral/api/actors/23
{
  "actors": {
    "def": {
      "type": "stats",
      "params": {
        "field": "amount"
      }
    },
    "state": {
      "count": 134,
      "avg": 39.84,
      "min": 1.99,
      "max": 204.19,
      "sd": 38.01
    }
  }
}
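The state of the stats actor shown above (count, avg, min, max, sd) can be maintained one event at a time. The sketch below does so with Welford's online algorithm; it is an illustration of the idea, not Coral's actual code, and the class name RunningStats is hypothetical. Whether Coral reports a population or a sample standard deviation is not stated on the slide; the population form is assumed here.

```scala
final class RunningStats {
  private var n = 0L
  private var mean = 0.0
  private var m2 = 0.0 // sum of squared deviations from the running mean
  private var lo = Double.PositiveInfinity
  private var hi = Double.NegativeInfinity

  // Welford's online update: constant work and memory per event.
  def update(x: Double): Unit = {
    n += 1
    val delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
    lo = math.min(lo, x)
    hi = math.max(hi, x)
  }

  def count: Long = n
  def avg: Double = mean
  def min: Double = lo
  def max: Double = hi
  // Population standard deviation (assumption, see the note above).
  def sd: Double = if (n == 0) 0.0 else math.sqrt(m2 / n)
}
```

Feeding the "amount" field of each posted event to update keeps the state current without storing the events, and a new amount far from avg in units of sd is a simple streaming anomaly signal.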
Akka
Coral: Web API for dynamic data flows
● a web api to define/manage/run streaming data-flows
● open source and community managed
● event processing as a service
● connect to Cassandra to access models
● connect to Kafka to consume and produce events
Steven Raemaekers
Jasper van Zandbeek
Ger van Rossum
Hoda Alemi
Koen Verschuren
Summary:
Distributed Data Store
Fast Analytics
Event Processing
Real Time APIs
Streaming Data
Data Modeling
Data Sources, Files, DB extracts
Batched Data
Alerts and Notifications
API for mobile and web
read the model
read the data
write the model
Feedback to the community:
More algorithms for machine learning!
- DBSCAN, OPTICS, PAM
- More metrics, non-Euclidean spaces, etc.
- Non-distributed algorithms: more scalanlp integration?
Streaming all the way:
Unify batch (Spark) and event streaming (Akka) computing
Thanks!
- Vision and strategy on an event-driven bank
- ING CIO management team and awesome colleagues
Spark, Cassandra, Akka communities !
Resources
Coral: event processing webapi
https://github.com/coral-streaming/coral
Spark + Cassandra: Clustering Events
http://www.natalinobusa.com/2015/07/clustering-check-ins-with-spark-and.html
Spark: Machine Learning, SQL frames
https://spark.apache.org/docs/latest/mllib-guide.html
https://spark.apache.org/docs/latest/sql-programming-guide.html
Datastax: Analytics and Spark connector
http://www.slideshare.net/doanduyhai/spark-cassandra-connector-api-best-practices-and-usecases
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/anaHome/anaHome.html
Anomaly Detection
Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey". ACM Computing Surveys 41 (3): 1. doi:10.1145/1541880.1541882.
Datasets
https://snap.stanford.edu/data/loc-gowalla.html
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011
https://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.zip
The project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant PTDC/EIA-EIA/109840/2009.
Pictures:
"DBSCAN-density-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-density-data.svg#/media/File:DBSCAN-density-data.svg
"DBSCAN-Illustration" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg#/media/File:DBSCAN-Illustration.svg
"Multimodal" by Visnut - Own work. Licensed under CC BY-SA 4.0 via Commons - https://commons.wikimedia.org/wiki/File:Multimodal.png#/media/File:Multimodal.png
"Standard deviation diagram" by Mwtoews - Own work, based (in concept) on figure by Jeremy Kemp, on 2005-02-09. Licensed under CC BY 2.5 via Commons - https://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg#/media/File:Standard_deviation_diagram.svg
"Michelsonmorley-boxplot" by User:Schutz - Own work. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Michelsonmorley-boxplot.svg#/media/File:Michelsonmorley-boxplot.svg

Powerful Google developer tools for immediate impact! (2023-24 C)
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Real-time anomaly detection with Cassandra, Spark ML and Akka by Natalino Busa at Big Data Spain 2015

  • 1.
  • 2. Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra. Natalino Busa, Data Platform Architect at ING
  • 3. Distributed computing Machine Learning Statistics Big/Fast Data Streaming Computing @natalinobusa | linkedin.com/in/natalinobusa
  • 4. @natbusa | linkedin: Natalino Busa ING group http://www.ing.com/About-us/Purpose-Strategy.htm
  • 5. @natbusa | linkedin: Natalino Busa ING group Empowering people to stay a step ahead in life and in business. http://www.ing.com/About-us/Purpose-Strategy.htm
  • 6. @natbusa | linkedin: Natalino Busa ING group http://www.ing.com/About-us/Purpose-Strategy.htm Clear and Easy Anytime, Anywhere Empower Keep getting better
  • 7. @natbusa | linkedin: Natalino Busa Apply advanced, predictive analytics on live data Event-Driven and exposed via APIs Lean Architecture, Easy to integrate Available, Consistent, Streaming, Real-time Data Resilient, Distributed, Scalable, Maintainable Clear and Easy Anytime, Anywhere Empower Keep getting better Data Principles ING group
  • 8. Big Data and Fast Data. [Figure: population (events, transactions, sessions, customers, etc.) plotted against time, with axis ticks at 10 yrs, 5 yrs, 1 yr, 1 month, 1 day, 1 hour and 1 m; regions labelled event streams, recent data and historical big data.]
  • 9. @natbusa | linkedin: Natalino Busa Why Fast Data? 1. Relevant up-to-date information. 2. Delivers actionable events.
  • 10. @natbusa | linkedin: Natalino Busa Why Big Data? 1. Analyze and model 2. Learn, cluster, categorize, organize facts
  • 11. Training, Scoring and Exposing models. [Diagram labels: Distributed Data Store; Real Time APIs; Streaming Data; Data Sources, Files, DB extracts; Batched Data; API for mobile and web]
  • 12. Training, Scoring and Exposing models. [Diagram labels: Distributed Data Store; Fast Analytics; Real Time APIs; Streaming Data; Data Modeling; Data Sources, Files, DB extracts; Batched Data; API for mobile and web; read the data; write the model]
  • 13. Training, Scoring and Exposing models. [Diagram labels: Distributed Data Store; Fast Analytics; Event Processing; Real Time APIs; Streaming Data; Data Modeling; Data Sources, Files, DB extracts; Batched Data; Alerts and Notifications; API for mobile and web; read the model; read the data; write the model]
  • 14. Cassandra+Akka+Spark: Machine Learning
       C* (Cassandra): Fast writes; 2D Data Structure; Replicated; Tunable consistency; Multi-Data centers
       Akka: Very Fast processing; Distributed, Scalable computing; Actor-based Pipelines; Actor state can be persisted; Supervision strategies
       Spark: Ad-Hoc Queries; Joins, Aggregate; User Defined Functions; Machine Learning, Advanced Stats and Analytics
  • 15. @natbusa | linkedin: Natalino Busa Akka-Cassandra-Spark Stack Cassandra-Spark Connector Cassandra Spark Streaming SQL MLlib Graphx Extract Data Create Models, Enrich, Transform Fetch from other Sources: Kafka Fetch from other Sources: DB’s, Files Akka Analytics, Statistics, Data Science, Model Training Access Model Persist Actors’ State
  • 16. @natbusa | linkedin: Natalino Busa Cassandra-Spark Connector Cassandra: Store all the data Spark: Analyze all the data DC1: replication factor 3 DC2: replication factor 3 DC3: replication factor 3 + Spark Executors Storage! Analytics! Data
  • 17. @natbusa | linkedin: Natalino Busa Cassandra-Spark Stack Cassandra: Store all the data Spark: Distributed Data Processing Executors and Workers Cassandra-Spark Connector: Data locality, Reduce Shuffling RDD’s to Cassandra Partitions DC3: replication factor 3 + Spark Executors
  • 18. @natbusa | linkedin: Natalino Busa Data Science: Anomaly Detection An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. Hawkins, 1980
  • 19. Data Science: Anomaly Detection (1). Parametric based; Gaussian model based; Histogram, nonparametric
  • 20. @natbusa | linkedin: Natalino Busa Data Science: Anomaly Detection (2) Distance Based Density Based
  • 21. Example: Analyze gowalla check-ins
       Check-ins dataset:
        year | month | day | time | uid  | lat      | lon       | ts                       | vid
       ------+-------+-----+------+------+----------+-----------+--------------------------+--------
        2010 |     9 |  14 |   91 |  853 | 40.73474 | -73.87434 | 2010-09-14 00:01:31+0000 | 917955
        2010 |     9 |  14 |  328 | 4516 | 40.72585 | -73.99289 | 2010-09-14 00:05:28+0000 |  37160
        2010 |     9 |  14 |  344 | 2964 | 40.67621 | -73.98405 | 2010-09-14 00:05:44+0000 | 956870
       Venues dataset:
        vid     | name                       | lat      | long
       ---------+----------------------------+----------+-----------
         754108 | My Suit NY                 | 40.73474 | -73.87434
         249755 | UA Court Street Stadium 12 | 40.72585 | -73.99289
        6919688 | Sky Asian Bistro           | 40.67621 | -73.98405
  • 22. @natbusa | linkedin: Natalino Busa Data Science: clustering venues Weekly visitors patterns! Madison Square, Apple Store, Radio City Music Hall Thursdays, Fridays, Saturdays are busy Statue of Liberty, Jacob K. Javits Convention Center, Whole Foods Market (Columbus Circle) Not popular on midweek Intuition:
  • 23. @natbusa | linkedin: Natalino Busa Data Science: clustering with k-means Histograms components as dimensions Similar histograms would occupy similar places in the feature space How do I compare histograms: - EMD - Chi-squared distance - Space transformation (DCT) Intuition:
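One of the histogram comparisons listed on this slide, the chi-squared distance, can be sketched in a few lines of plain Scala. This is an illustration only, not code from the talk: the object and method names are hypothetical, and the 0.5 factor is one common convention among several.

```scala
object HistogramDistance {
  // Normalizes counts so the bins sum to 1; an all-zero histogram is returned unchanged.
  def normalize(h: Vector[Double]): Vector[Double] = {
    val total = h.sum
    if (total == 0.0) h else h.map(_ / total)
  }

  // Chi-squared distance: 0.5 * sum((a_i - b_i)^2 / (a_i + b_i)), skipping bins empty in both.
  def chiSquared(a: Vector[Double], b: Vector[Double]): Double = {
    require(a.length == b.length, "histograms must have the same number of bins")
    0.5 * a.zip(b).collect {
      case (x, y) if x + y > 0 => (x - y) * (x - y) / (x + y)
    }.sum
  }
}
```

Two identical histograms are at distance 0, while two normalized histograms with no overlapping bins are at distance 1 under this convention.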
  • 24. K-Means: Featurize data + cluster
       import org.apache.spark.mllib.clustering.KMeans

       // PairRDDs, weekly patterns per venue
       val weekly_visits = checkins_venues.select("vid", "ts")
         .map(row => (row.getLong("vid"), vectorize_time(row.getTimestamp("ts"))))
         .reduceByKey(_ + _)
         .mapValues(v => featurize_histogram(v))

       // cluster similar weekly patterns (KMeans.train expects the feature vectors only)
       val numClusters = 15
       val numIterations = 100
       val clusters = KMeans.train(weekly_visits.values, numClusters, numIterations)
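The slide does not show the bodies of `vectorize_time` and `featurize_histogram`. A minimal sketch of what such helpers might look like, assuming a day-of-week histogram in UTC and using plain Scala collections in place of RDDs, is given below; all names here are hypothetical stand-ins, not the author's implementation.

```scala
import java.time.{Instant, ZoneOffset}

object WeeklyHistogram {
  // One-hot vector for the day of week (index 0 = Monday ... 6 = Sunday), in UTC.
  def vectorizeTime(ts: Instant): Vector[Double] = {
    val day = ts.atZone(ZoneOffset.UTC).getDayOfWeek.getValue - 1
    Vector.tabulate(7)(i => if (i == day) 1.0 else 0.0)
  }

  // Element-wise sum, the role `_ + _` plays in reduceByKey on the slide.
  def add(a: Vector[Double], b: Vector[Double]): Vector[Double] =
    a.zip(b).map { case (x, y) => x + y }

  // Per-venue histograms from (vid, timestamp) pairs, normalized to sum to 1.
  def weeklyVisits(checkins: Seq[(Long, Instant)]): Map[Long, Vector[Double]] =
    checkins
      .groupBy(_._1)
      .map { case (vid, rows) =>
        val counts = rows.map(r => vectorizeTime(r._2)).reduce(add)
        vid -> counts.map(_ / counts.sum)
      }
}
```

Normalizing each histogram makes busy and quiet venues comparable by the shape of their week rather than by raw visit counts; whether the talk normalized in this way is not stated.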
  • 25. Assigning venues to clusters
       // score and assign each venue to a cluster
       val venues_clustered = checkins_venues.select("vid", "ts").where("ts > dateof(now())")
         .map(row => (row.getLong("vid"), vectorize_time(row.getTimestamp("ts"))))
         .reduceByKey(_ + _)
         .mapValues(v => clusters.predict(featurize_time(v)))

       // Store the cluster centers in Cassandra
       venues_clustered.saveToCassandra("lbsn", "venues_cl")
  • 26. @natbusa | linkedin: Natalino Busa How to use it 1) Classification Classify venues to given groups 2) Anomaly Detection Detect shift in the clustering assignment for a given venue for a given week Keep monitoring weekly change in patterns, when it happens trigger a signal week 26 week 27 alert!
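The week-over-week check described on this slide can be sketched as a comparison of two assignment maps. This is a hedged illustration in plain Scala: the `ClusterShift` object and the choice to ignore venues that appear in only one week are assumptions, not details from the talk.

```scala
object ClusterShift {
  // Returns the venue ids whose cluster assignment differs between two weeks.
  // Venues missing from the previous week are ignored (nothing to compare against).
  def shifted(prev: Map[Long, Int], curr: Map[Long, Int]): Set[Long] =
    curr.collect { case (vid, c) if prev.get(vid).exists(_ != c) => vid }.toSet
}
```

Each id in the returned set would correspond to one "alert!" in the slide's week 26 versus week 27 picture.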
  • 27. @natbusa | linkedin: Natalino Busa Data Science: clustering users’ venues Intuition: Users tend to stick in the same places People have habits By clustering the places together We can identify anomalous locations Size of the cluster matters More points means less anomalous Mini-clusters and single anomalies are treated in similar ways ...
  • 28. @natbusa | linkedin: Natalino Busa Data Science: clustering users’ venues Users tend to stick in the same places People have habits By clustering the places together We can identify anomalous locations Size of the cluster matters More points means less anomalous Mini-clusters and single anomalies are treated in similar ways ... Intuition:
  • 29. Data Science: clustering with DBSCAN. DBSCAN finds clusters based on neighbouring density. It does not require the number of clusters k beforehand. Clusters need not be spherical.
  • 30. Data Science: clustering with DBSCAN. DBSCAN finds clusters based on neighbouring density. It does not require the number of clusters k beforehand. Clusters need not be spherical. It’s a graph!
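To make the density idea concrete, here is a small, unoptimized DBSCAN sketch in plain Scala. It is not the scalanlp/nak implementation mentioned later in the deck; the names `SimpleDbscan`, `eps` and `minPts` are chosen for illustration, and it uses Euclidean distance on raw coordinates, which is only a rough approximation for latitude/longitude pairs.

```scala
import scala.collection.mutable

object SimpleDbscan {
  type Point = (Double, Double)
  val Noise = -1
  private val Unvisited = -2

  private def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Returns one label per input point: a cluster id (0, 1, ...) or Noise.
  def cluster(points: IndexedSeq[Point], eps: Double, minPts: Int): Array[Int] = {
    val labels = Array.fill(points.length)(Unvisited)

    // Indices of all points within eps of point i (including i itself).
    def neighbours(i: Int): IndexedSeq[Int] =
      points.indices.filter(j => dist(points(i), points(j)) <= eps)

    var clusterId = 0
    for (i <- points.indices) {
      if (labels(i) == Unvisited) {
        val seeds = neighbours(i)
        if (seeds.size < minPts) {
          labels(i) = Noise // may later be relabelled as a border point
        } else {
          labels(i) = clusterId
          val queue = mutable.Queue.empty[Int]
          queue ++= seeds
          while (queue.nonEmpty) {
            val j = queue.dequeue()
            if (labels(j) == Noise) {
              labels(j) = clusterId // border point: joins the cluster, is not expanded
            } else if (labels(j) == Unvisited) {
              labels(j) = clusterId
              val next = neighbours(j)
              if (next.size >= minPts) queue ++= next // core point: keep expanding
            }
          }
          clusterId += 1
        }
      }
    }
    labels
  }
}
```

Points labelled `Noise` are the candidates for anomalous locations; the expansion step walks from core point to core point, which is the graph view the slide alludes to.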
  • 31. Data Science: clustering users’ venues
       val locs = checkins_venues.select("uid", "lat", "lon")
         .map(s => (s.getLong(0), Seq((s.getDouble(1), s.getDouble(2)))))
         .reduceByKey(_ ++ _)   // concatenate each user's locations
         .mapValues( dbscan cluster _ )
       Have a look at: scalanlp/nak
  • 32. Data Science: Two ways to find anomalies with clustering - Cluster large amounts of data with k-means and histograms - Apply clustering independently to millions of users, identifying each user's patterns with the DBSCAN algorithm
  • 33. MLlib vs PairRDDs
       MLlib: KMeans.train(FeaturesRDD, numClusters, numIterations); Truly distributed algorithm; Classify venues to given groups; Millions of datapoints; Limited amount of clusters
       PairRDDs: UserFeaturesPairRDD.groupByKey().mapValues( dbscan cluster _ ); RDDs map functions; Parallelism easy to exploit; The function runs locally for each Key; Pick your fav machine learning algorithms; Limited nr of points; Running in parallel for millions of Keys
  • 34. Training, Scoring and Exposing models. [Diagram labels: Distributed Data Store; Fast Analytics; Event Processing; Real Time APIs; Streaming Data; Data Modeling; Data Sources, Files, DB extracts; Batched Data; Alerts and Notifications; API for mobile and web; read the model; read the data; write the model]
  • 35. Training vs Scoring: Time budgets ● Akka: millisecond response ● Spark: in-memory (big)-data models. [Figure: chart with axes Model Training and Model Scoring, each running from slow (minutes) to fast (millisecs); regions labelled Train: Spark / Score: Spark, Train: Spark / Score: Akka, and Train: Akka / Score: Akka.]
  • 36. @natbusa | linkedin: Natalino Busa Which processing? .. and the granularity of data? What is the latency and the throughput?
  • 37. It’s all about latency!
       Map Reduce: Big Data, Batch based
       RDDs: Big Data, Micro-Batch based
       CRDTs + Monoids: Fast Data, Event based
       MillWheel: Fast + Big Data, Event & Window
  • 38. @natbusa | linkedin: Natalino Busa Akka Mixed Load Cassandra Cluster Coral: Web API for dynamic data flows
  • 39. Coral: Streaming data via Web APIs
       Data events:
       POST http://coral/api/actors/23/in
       { "amount": 23.45, "user": 76232, "city": "Berlin" }
  • 40. Coral: Streaming data via Web APIs
       GET http://coral/api/actors/23
       {
         "actors": {
           "def": { "type": "stats", "params": { "field": "amount" } },
           "state": { "count": 134, "avg": 39.84, "min": 1.99, "max": 204.19, "sd": 38.01 }
         }
       }
       [Diagram labels: Trigger, Emit, Params, State]
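The state fields returned by this stats actor (count, avg, min, max, sd) can all be maintained one event at a time without storing the events. The sketch below uses Welford's online update in plain Scala; it is an assumption about how such state could be kept, not Coral's actual code, and it reports the population standard deviation since the slide does not say which variant Coral uses.

```scala
object RunningStats {
  // m2 is the running sum of squared deviations from the mean (Welford's algorithm).
  final case class Stats(count: Long, avg: Double, m2: Double, min: Double, max: Double) {
    // Population standard deviation; 0.0 before any event has been seen.
    def sd: Double = if (count > 0) math.sqrt(m2 / count) else 0.0

    // Folds one new observation into the state and returns the updated state.
    def update(x: Double): Stats = {
      val n = count + 1
      val delta = x - avg
      val newAvg = avg + delta / n
      Stats(n, newAvg, m2 + delta * (x - newAvg), math.min(min, x), math.max(max, x))
    }
  }

  val empty: Stats = Stats(0L, 0.0, 0.0, Double.PositiveInfinity, Double.NegativeInfinity)
}
```

For example, `amounts.foldLeft(RunningStats.empty)(_ update _)` over a sequence of `amount` values yields the same kind of summary as the `state` object above; an incoming event could then be compared against `avg` and `sd` to decide whether it looks anomalous.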
  • 41. @natbusa | linkedin: Natalino Busa Akka Coral: Web API for dynamic data flows ● a web api to define/manage/run streaming data-flows ● open source and community managed ● event processing as a service ● connect to Cassandra to access models ● connect to kafka to consume and produce events Steven Raemaekers Jasper van Zandbeek Ger van Rossum Hoda Alemi Koen Verschuren
  • 42. Summary: [Diagram labels: Distributed Data Store; Fast Analytics; Event Processing; Real Time APIs; Streaming Data; Data Modeling; Data Sources, Files, DB extracts; Batched Data; Alerts and Notifications; API for mobile and web; read the model; read the data; write the model]
  • 43. @natbusa | linkedin: Natalino Busa Akka Feedback to the community: More Algorithms for machine learning! - DBSCAN, OPTICS, PAM - More metrics, non-euclidean spaces, etc - Non distributed algorithms: more scalanlp integration? Streaming all the way: Unify batch (Spark) and event streaming (Akka) computing
  • 44. @natbusa | linkedin: Natalino Busa Thanks! - Vision and strategy on an event-driven bank - ING CIO management team and awesome colleagues Spark, Cassandra, Akka communities !
  • 45. @natbusa | linkedin: Natalino Busa Resources Coral: event processing webapi https://github.com/coral-streaming/coral Spark + Cassandra: Clustering Events http://www.natalinobusa.com/2015/07/clustering-check-ins-with-spark-and.html Spark: Machine Learning, SQL frames https://spark.apache.org/docs/latest/mllib-guide.html https://spark.apache.org/docs/latest/sql-programming-guide.html Datastax: Analytics and Spark connector http://www.slideshare.net/doanduyhai/spark-cassandra-connector-api-best-practices-and-usecases http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/anaHome/anaHome.html Anomaly Detection Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey"(PDF). ACM Computing Surveys 41 (3): 1. doi:10.1145/1541880.1541882.
  • 46. Resources
       Datasets:
       https://snap.stanford.edu/data/loc-gowalla.html
       E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011
       https://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.zip
       The project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant PTDC/EIA-EIA/109840/2009.
       Pictures:
       "DBSCAN-density-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-density-data.svg#/media/File:DBSCAN-density-data.svg
       "DBSCAN-Illustration" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg#/media/File:DBSCAN-Illustration.svg
       "Multimodal" by Visnut - Own work. Licensed under CC BY-SA 4.0 via Commons - https://commons.wikimedia.org/wiki/File:Multimodal.png#/media/File:Multimodal.png
       "Standard deviation diagram" by Mwtoews - Own work, based (in concept) on figure by Jeremy Kemp, on 2005-02-09. Licensed under CC BY 2.5 via Commons - https://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg#/media/File:Standard_deviation_diagram.svg
       "Michelsonmorley-boxplot" by User:Schutz - Own work. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Michelsonmorley-boxplot.svg#/media/File:Michelsonmorley-boxplot.svg