SlideShare a Scribd company logo
1 of 39
Download to read offline
Real-Time Anomaly Detection
with Spark MLlib, Akka and
Cassandra
Natalino Busa
Data Platform Architect at Ing
ING group
http://www.ing.com/About-us/Purpose-Strategy.htm
ING group
Empowering people to stay a step ahead
in life and in business.
http://www.ing.com/About-us/Purpose-Strategy.htm
ING group
http://www.ing.com/About-us/Purpose-Strategy.htm
Clear and Easy
Anytime, Anywhere
Empower
Keep getting better
Apply advanced, predictive analytics on live data
Event-Driven and exposed via APIs
Lean Architecture, Easy to integrate
Available, Consistent, Streaming, Real-time Data
Resilient, Distributed, Scalable, Maintainable
Clear and Easy
Anytime, Anywhere
Empower
Keep getting better
Data Principles
ING group
Big Data and Fast Data
population:events,transactions,
sessions,customers,etc
Why Fast Data?
1. Relevant up-to-date information.
2. Delivers actionable events.
Why Big Data?
1. Analyze and model
2. Learn, cluster, categorize, organize facts
10
Real Time APIs
Streaming Data
Data Sources,
Files, DB extracts
Batched Data
Training, Scoring and Exposing models
11
Real Time APIs
Streaming Data
Data Sources,
Files, DB extracts
Batched Data
Training, Scoring and Exposing models
12
Real Time APIs
Streaming Data
Data Sources,
Files, DB extracts
Batched Data
Training, Scoring and Exposing models
Cassandra+Akka+Spark: Machine Learning
Fast writes
2D Data Structure
Replicated
Tunable consistency
Multi-Data centers
C*Akka Spark
Very Fast processing
Distributed, Scalable computing
Actor-based Pipelines
Actor state can be persisted
Supervision strategies
Ad-Hoc Queries
Joins, Aggregate
User Defined Functions
Machine Learning,
Advanced Stats and Analytics
Akka-Cassandra-Spark Stack
Cassandra-Spark Connector
Cassandra
Spark
Streaming SQL MLlib Graphx
Extract
Data
Create Models,
Enrich, Transform
Fetch from other
Sources: Kafka
Fetch from other
Sources: DB’s, Files
Akka
Analytics, Statistics, Data
Science, Model Training
Access
Model
Persist
Actors’ State
Cassandra-Spark Connector
Cassandra: Store all the data
Spark: Analyze all the data
DC1: replication factor 3 DC2: replication factor 3 DC3: replication factor 3 + Spark Executors
Storage! Analytics!
Data
Data Science: Anomaly Detection
An outlier is an observation that deviates so much from other
observations as to arouse suspicion that it was generated by a different
mechanism.
Hawkins, 1980
Data Science: Anomaly Detection
Distance Based Density Based
Example: Analyze gowalla check-ins
year | month | day | time | uid | lat | lon | ts | vid
------+-------+-----+------+--------+----------+-----------+--------------------------+---------
2010 | 9 | 14 | 91 | 853 | 40.73474 | -73.87434 | 2010-09-14 00:01:31+0000 | 917955
2010 | 9 | 14 | 328 | 4516 | 40.72585 | -73.99289 | 2010-09-14 00:05:28+0000 | 37160
2010 | 9 | 14 | 344 | 2964 | 40.67621 | -73.98405 | 2010-09-14 00:05:44+0000 | 956870
Check-ins dataset
Venues dataset
vid | name | lat | long ------+-------+-----+------+--------+----------+-----------
+--------------------------+---------
754108 | My Suit NY | 40.73474 | -73.87434
249755 | UA Court Street Stadium 12 | 40.72585 | -73.99289
6919688 | Sky Asian Bistro | 40.67621 | -73.98405
Data Science: clustering venues
Data Science: clustering venues
Weekly visitors patterns!
Madison Square, Apple Store, Radio City Music Hall
Thursdays, Fridays, Saturdays are busy
Statue of Liberty, Jacob K. Javits Convention Center,
Whole Foods Market (Columbus Circle)
Not popular on midweek
Intuition:
Data Science: clustering with k-means
Histograms components as dimensions
Similar histograms would occupy similar places in
the feature space
How do I compare histograms:
- EMD
- Chi-squared distance
- Space transformation (DCT)
Intuition:
K-Means: Featurize data + cluster
val weekly_visits = checkins_venues.select("vid","ts")
.map(row => (row.getLong("vid"), vectorize_time(s.getTimestamp("ts"))
.reduceByKey(_ + _)
.mapValues(_ => featurize_histogram(_._1))
val numClusters = 15
val numIterations = 100
val clusters = KMeans.train(weekly_visits, numClusters, numIterations)
PairRDDs, weekly patterns per venue
cluster similar weekly patterns
How to use it
1) Classification
Classify venues to given groups
2) Anomaly Detection
Detect shift in the clustering assignment for a given venue for a given week
Keep monitoring weekly change in patterns, when it happens trigger a signal
week 26 week 27
Action
Data Science: clustering users’ venues
Data Science: clustering users’ venues
Users tend to stick in the same places
People have habits
By clustering the places together
We can identify anomalous locations
Size of the cluster matters
More points means less anomalous
Mini-clusters and single anomalies are
treated in similar ways ...
Intuition:
Data Science: clustering with DBSCAN
DBSCAN find clusters based on neighbouring density
Does not require the number of cluster k beforehand.
Clusters are not spherical
Data Science: clustering users’ venues
val locs = checkins_venues.select("uid", "lat","lon")
.map(s => (s.getLong(0), Seq( (s.getDouble(1), s.getDouble(2)) ))
.reduceByKey(_ + _)
.mapValues( dbscan (_) )
Have a look at: scalanlp/nak
Data Science:
Two ways to find anomalies with clustering
- Cluster big amount of data with k-means and histograms
- Apply clustering independently to million of users,
to each identify the patterns with dbscan algorithm
MLlib vs PairRDDs
KMeans.train(FeaturesRDD, numClusters, numIterations)
UserFeaturesPairRDD.GroupbyKey().mapValues( dbscan(_) )
RDDs map functions
Parallelism easy to exploit
The function runs locally for each Key
Pick your fav machine learning algorithms
Limited nr of points
Running in parallel for millions of Keys
MLlib
Truly distributed algorithm
Classify venues to given groups
Millions of datapoints
Limited amount of clusters
30
Real Time APIs
Streaming Data
Data Sources,
Files, DB extracts
Batched Data
Training, Scoring and Exposing models
Training vs Scoring: Latency budget
● Akka: millisecond response
● Spark: in-memory data models
Train: Spark
Score: Spark
Train: Spark
Score: Akka
slow: minutes fast: millisecs
Model Scoring
ModelTraining
slow:minutes
Akka
Mixed Load Cassandra Cluster
Coral: Web API for dynamic data flows
Akka
Web API for dynamic data flows
● a web api to define/manage/run streaming data-flows
● open source and community managed
● event processing as a service
coral-streaming/coral
Steven Raemaekers
Jasper van Zandbeek
Ger van Rossum
Hoda Alemi
Koen Verschuren
34
Real Time APIs
Streaming Data
Data Sources,
Files, DB extracts
Batched Data
Summary:
Akka
Feedback to the community:
More Algorithms for machine learning!
- DBSCAN, OPTICS, PAM
- More metrics, non-euclidean spaces, etc
- Non distributed algorithms: more scalanlp integration?
Streaming all the way:
Unify batch (Spark) and event streaming (Akka) computing
Thanks!
- Vision and strategy on an event-driven bank
- ING CIO management team and awesome colleagues
Spark, Cassandra, Akka communities !
webinar + live demo: Dec 9th
Resources
Coral: event processing webapi
https://github.com/coral-streaming/coral
Spark + Cassandra: Clustering Events
http://www.natalinobusa.com/2015/07/clustering-check-ins-with-spark-and.html
Spark: Machine Learning, SQL frames
https://spark.apache.org/docs/latest/mllib-guide.html
https://spark.apache.org/docs/latest/sql-programming-guide.html
Datastax: Analytics and Spark connector
http://www.slideshare.net/doanduyhai/spark-cassandra-connector-api-best-practices-and-usecases
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/anaHome/anaHome.html
Anomaly Detection
Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey"(PDF). ACM Computing Surveys 41 (3): 1. doi:10.1145/1541880.1541882.
Resources
Datasets
https://snap.stanford.edu/data/loc-gowalla.html
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: Friendship and Mobility: User Movement in Location-Based Social Networks ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2011
https://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.zip
The project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant
PTDC/EIA-EIA/109840/2009. .
Pictures:
"DBSCAN-density-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-density-data.
svg#/media/File:DBSCAN-density-data.svg
"DBSCAN-Illustration" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg#/media/File:
DBSCAN-Illustration.svg
"Multimodal" by Visnut - Own work. Licensed under CC BY-SA 4.0 via Commons -
https://commons.wikimedia.org/wiki/File:Multimodal.png#/media/File:Multimodal.png
"Standard deviation diagram" by Mwtoews - Own work, based (in concept) on figure by Jeremy Kemp, on 2005-02-09. Licensed under CC BY 2.5 via Commons - https:
//commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg#/media/File:Standard_deviation_diagram.svg
"Michelsonmorley-boxplot" by User:Schutz - Own work. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Michelsonmorley-boxplot.
svg#/media/File:Michelsonmorley-boxplot.svg

More Related Content

What's hot

Kafka timestamp offset
Kafka timestamp offsetKafka timestamp offset
Kafka timestamp offsetDaeMyung Kang
 
Data Analyse Black Horse - ClickHouse
Data Analyse Black Horse - ClickHouseData Analyse Black Horse - ClickHouse
Data Analyse Black Horse - ClickHouseJack Gao
 
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufWebinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufVerverica
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Log analysis using elk
Log analysis using elkLog analysis using elk
Log analysis using elkRushika Shah
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...HostedbyConfluent
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesFlink Forward
 
Apache Kafka for Automotive Industry, Mobility Services & Smart City
Apache Kafka for Automotive Industry, Mobility Services & Smart CityApache Kafka for Automotive Industry, Mobility Services & Smart City
Apache Kafka for Automotive Industry, Mobility Services & Smart CityKai Wähner
 
Threat Hunting
Threat HuntingThreat Hunting
Threat HuntingSplunk
 
Security Onion - Brief
Security Onion - BriefSecurity Onion - Brief
Security Onion - BriefAshley Deuble
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecPeter Bakas
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsDatabricks
 
BSidesLV 2016 - Powershell - Hunting on the Endpoint - Gerritz
BSidesLV 2016 - Powershell - Hunting on the Endpoint - GerritzBSidesLV 2016 - Powershell - Hunting on the Endpoint - Gerritz
BSidesLV 2016 - Powershell - Hunting on the Endpoint - GerritzChristopher Gerritz
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxData
 
Threat Hunting with Splunk
Threat Hunting with SplunkThreat Hunting with Splunk
Threat Hunting with SplunkSplunk
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Linux Kernel vs DPDK: HTTP Performance Showdown
Linux Kernel vs DPDK: HTTP Performance ShowdownLinux Kernel vs DPDK: HTTP Performance Showdown
Linux Kernel vs DPDK: HTTP Performance ShowdownScyllaDB
 
Thoughts on kafka capacity planning
Thoughts on kafka capacity planningThoughts on kafka capacity planning
Thoughts on kafka capacity planningJamieAlquiza
 

What's hot (20)

Kafka timestamp offset
Kafka timestamp offsetKafka timestamp offset
Kafka timestamp offset
 
Data Analyse Black Horse - ClickHouse
Data Analyse Black Horse - ClickHouseData Analyse Black Horse - ClickHouse
Data Analyse Black Horse - ClickHouse
 
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufWebinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Log analysis using elk
Log analysis using elkLog analysis using elk
Log analysis using elk
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
Apache Kafka for Automotive Industry, Mobility Services & Smart City
Apache Kafka for Automotive Industry, Mobility Services & Smart CityApache Kafka for Automotive Industry, Mobility Services & Smart City
Apache Kafka for Automotive Industry, Mobility Services & Smart City
 
Threat Hunting
Threat HuntingThreat Hunting
Threat Hunting
 
Security Onion - Brief
Security Onion - BriefSecurity Onion - Brief
Security Onion - Brief
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
BSidesLV 2016 - Powershell - Hunting on the Endpoint - Gerritz
BSidesLV 2016 - Powershell - Hunting on the Endpoint - GerritzBSidesLV 2016 - Powershell - Hunting on the Endpoint - Gerritz
BSidesLV 2016 - Powershell - Hunting on the Endpoint - Gerritz
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
 
Threat Hunting with Splunk
Threat Hunting with SplunkThreat Hunting with Splunk
Threat Hunting with Splunk
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Linux Kernel vs DPDK: HTTP Performance Showdown
Linux Kernel vs DPDK: HTTP Performance ShowdownLinux Kernel vs DPDK: HTTP Performance Showdown
Linux Kernel vs DPDK: HTTP Performance Showdown
 
Thoughts on kafka capacity planning
Thoughts on kafka capacity planningThoughts on kafka capacity planning
Thoughts on kafka capacity planning
 

Viewers also liked

Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache sparkRahul Kumar
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Robert "Chip" Senkbeil
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationPatrick Di Loreto
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeSpark Summit
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksLegacy Typesafe (now Lightbend)
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Anton Kirillov
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
H2O - the optimized HTTP server
H2O - the optimized HTTP serverH2O - the optimized HTTP server
H2O - the optimized HTTP serverKazuho Oku
 
Container Orchestration Wars
Container Orchestration WarsContainer Orchestration Wars
Container Orchestration WarsKarl Isenberg
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersBrendan Gregg
 

Viewers also liked (20)

Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
H2O - the optimized HTTP server
H2O - the optimized HTTP serverH2O - the optimized HTTP server
H2O - the optimized HTTP server
 
Container Orchestration Wars
Container Orchestration WarsContainer Orchestration Wars
Container Orchestration Wars
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
 

Similar to Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra

Real-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino Busa
Real-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino BusaReal-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino Busa
Real-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino BusaSpark Summit
 
Strata London 16: sightseeing, venues, and friends
Strata  London 16: sightseeing, venues, and friendsStrata  London 16: sightseeing, venues, and friends
Strata London 16: sightseeing, venues, and friendsNatalino Busa
 
AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
AWS Summit 2013 | India - Big Data Analytics, Abhishek SinhaAWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
AWS Summit 2013 | India - Big Data Analytics, Abhishek SinhaAmazon Web Services
 
OCCIware: extensible and standard-based XaaS platform to manage everything in...
OCCIware: extensible and standard-based XaaS platform to manage everything in...OCCIware: extensible and standard-based XaaS platform to manage everything in...
OCCIware: extensible and standard-based XaaS platform to manage everything in...OW2
 
OCCIware: extensible and standard-based XaaS platform to manage everything in...
OCCIware: extensible and standard-based XaaS platform to manage everything in...OCCIware: extensible and standard-based XaaS platform to manage everything in...
OCCIware: extensible and standard-based XaaS platform to manage everything in...OCCIware
 
OCCIware@OW2con 2016
OCCIware@OW2con 2016OCCIware@OW2con 2016
OCCIware@OW2con 2016Marc Dutoo
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceeRic Choo
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkTimothy Spann
 
Real time machine learning visualization with spark -- Hadoop Summit 2016
Real time machine learning visualization with spark -- Hadoop Summit 2016Real time machine learning visualization with spark -- Hadoop Summit 2016
Real time machine learning visualization with spark -- Hadoop Summit 2016Chester Chen
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataCarlos Andrés García
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataVMware Tanzu
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analyticsAnirudh
 
Get Value From Your Data
Get Value From Your DataGet Value From Your Data
Get Value From Your DataDanilo Poccia
 
Awesome Banking API's
Awesome Banking API'sAwesome Banking API's
Awesome Banking API'sNatalino Busa
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 

Similar to Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra (20)

Real-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino Busa
Real-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino BusaReal-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino Busa
Real-Time Anomoly Detection with Spark MLib, Akka and Cassandra by Natalino Busa
 
Strata London 16: sightseeing, venues, and friends
Strata  London 16: sightseeing, venues, and friendsStrata  London 16: sightseeing, venues, and friends
Strata London 16: sightseeing, venues, and friends
 
AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
AWS Summit 2013 | India - Big Data Analytics, Abhishek SinhaAWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
 
OCCIware: extensible and standard-based XaaS platform to manage everything in...
OCCIware: extensible and standard-based XaaS platform to manage everything in...OCCIware: extensible and standard-based XaaS platform to manage everything in...
OCCIware: extensible and standard-based XaaS platform to manage everything in...
 
OCCIware: extensible and standard-based XaaS platform to manage everything in...
OCCIware: extensible and standard-based XaaS platform to manage everything in...OCCIware: extensible and standard-based XaaS platform to manage everything in...
OCCIware: extensible and standard-based XaaS platform to manage everything in...
 
OCCIware@OW2con 2016
OCCIware@OW2con 2016OCCIware@OW2con 2016
OCCIware@OW2con 2016
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
Real time machine learning visualization with spark -- Hadoop Summit 2016
Real time machine learning visualization with spark -- Hadoop Summit 2016Real time machine learning visualization with spark -- Hadoop Summit 2016
Real time machine learning visualization with spark -- Hadoop Summit 2016
 
Real Time Machine Learning Visualization with Spark
Real Time Machine Learning Visualization with SparkReal Time Machine Learning Visualization with Spark
Real Time Machine Learning Visualization with Spark
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyData
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyData
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
 
Get Value From Your Data
Get Value From Your DataGet Value From Your Data
Get Value From Your Data
 
Awesome Banking API's
Awesome Banking API'sAwesome Banking API's
Awesome Banking API's
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 

More from Natalino Busa

Data Production Pipelines: Legacy, practices, and innovation
Data Production Pipelines: Legacy, practices, and innovationData Production Pipelines: Legacy, practices, and innovation
Data Production Pipelines: Legacy, practices, and innovationNatalino Busa
 
Data science apps powered by Jupyter Notebooks
Data science apps powered by Jupyter NotebooksData science apps powered by Jupyter Notebooks
Data science apps powered by Jupyter NotebooksNatalino Busa
 
7 steps for highly effective deep neural networks
7 steps for highly effective deep neural networks7 steps for highly effective deep neural networks
7 steps for highly effective deep neural networksNatalino Busa
 
Data science apps: beyond notebooks
Data science apps: beyond notebooksData science apps: beyond notebooks
Data science apps: beyond notebooksNatalino Busa
 
[Ai in finance] AI in regulatory compliance, risk management, and auditing
[Ai in finance] AI in regulatory compliance, risk management, and auditing[Ai in finance] AI in regulatory compliance, risk management, and auditing
[Ai in finance] AI in regulatory compliance, risk management, and auditingNatalino Busa
 
The evolution of data analytics
The evolution of data analyticsThe evolution of data analytics
The evolution of data analyticsNatalino Busa
 
Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...
Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...
Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...Natalino Busa
 
Streaming Api Design with Akka, Scala and Spray
Streaming Api Design with Akka, Scala and SprayStreaming Api Design with Akka, Scala and Spray
Streaming Api Design with Akka, Scala and SprayNatalino Busa
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.Natalino Busa
 
Big data solutions for advanced marketing analytics
Big data solutions for advanced marketing analyticsBig data solutions for advanced marketing analytics
Big data solutions for advanced marketing analyticsNatalino Busa
 
Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.Natalino Busa
 
Big and fast a quest for relevant and real-time analytics
Big and fast a quest for relevant and real-time analyticsBig and fast a quest for relevant and real-time analytics
Big and fast a quest for relevant and real-time analyticsNatalino Busa
 
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analyticsBig Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analyticsNatalino Busa
 
Strata 2014: Data science and big data trending topics
Strata 2014: Data science and big data trending topicsStrata 2014: Data science and big data trending topics
Strata 2014: Data science and big data trending topicsNatalino Busa
 
Streaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesStreaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesNatalino Busa
 

More from Natalino Busa (17)

Data Production Pipelines: Legacy, practices, and innovation
Data Production Pipelines: Legacy, practices, and innovationData Production Pipelines: Legacy, practices, and innovation
Data Production Pipelines: Legacy, practices, and innovation
 
Data science apps powered by Jupyter Notebooks
Data science apps powered by Jupyter NotebooksData science apps powered by Jupyter Notebooks
Data science apps powered by Jupyter Notebooks
 
7 steps for highly effective deep neural networks
7 steps for highly effective deep neural networks7 steps for highly effective deep neural networks
7 steps for highly effective deep neural networks
 
Data science apps: beyond notebooks
Data science apps: beyond notebooksData science apps: beyond notebooks
Data science apps: beyond notebooks
 
[Ai in finance] AI in regulatory compliance, risk management, and auditing
[Ai in finance] AI in regulatory compliance, risk management, and auditing[Ai in finance] AI in regulatory compliance, risk management, and auditing
[Ai in finance] AI in regulatory compliance, risk management, and auditing
 
Data in Action
Data in ActionData in Action
Data in Action
 
The evolution of data analytics
The evolution of data analyticsThe evolution of data analytics
The evolution of data analytics
 
Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...
Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...
Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...
 
Streaming Api Design with Akka, Scala and Spray
Streaming Api Design with Akka, Scala and SprayStreaming Api Design with Akka, Scala and Spray
Streaming Api Design with Akka, Scala and Spray
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
Big data solutions for advanced marketing analytics
Big data solutions for advanced marketing analyticsBig data solutions for advanced marketing analytics
Big data solutions for advanced marketing analytics
 
Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.
 
Big and fast a quest for relevant and real-time analytics
Big and fast a quest for relevant and real-time analyticsBig and fast a quest for relevant and real-time analytics
Big and fast a quest for relevant and real-time analytics
 
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analyticsBig Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
 
Strata 2014: Data science and big data trending topics
Strata 2014: Data science and big data trending topicsStrata 2014: Data science and big data trending topics
Strata 2014: Data science and big data trending topics
 
Streaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesStreaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologies
 
Big data landscape
Big data landscapeBig data landscape
Big data landscape
 

Recently uploaded

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 

Recently uploaded (20)

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 

Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra

  • 1. Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra Natalino Busa Data Platform Architect at Ing
  • 2.
  • 4. ING group Empowering people to stay a step ahead in life and in business. http://www.ing.com/About-us/Purpose-Strategy.htm
  • 5. ING group http://www.ing.com/About-us/Purpose-Strategy.htm Clear and Easy Anytime, Anywhere Empower Keep getting better
  • 6. Apply advanced, predictive analytics on live data Event-Driven and exposed via APIs Lean Architecture, Easy to integrate Available, Consistent, Streaming, Real-time Data Resilient, Distributed, Scalable, Maintainable Clear and Easy Anytime, Anywhere Empower Keep getting better Data Principles ING group
  • 7. Big Data and Fast Data population:events,transactions, sessions,customers,etc
  • 8. Why Fast Data? 1. Relevant up-to-date information. 2. Delivers actionable events.
  • 9. Why Big Data? 1. Analyze and model 2. Learn, cluster, categorize, organize facts
  • 10. 10 Real Time APIs Streaming Data Data Sources, Files, DB extracts Batched Data Training, Scoring and Exposing models
  • 11. 11 Real Time APIs Streaming Data Data Sources, Files, DB extracts Batched Data Training, Scoring and Exposing models
  • 12. 12 Real Time APIs Streaming Data Data Sources, Files, DB extracts Batched Data Training, Scoring and Exposing models
  • 13. Cassandra+Akka+Spark: Machine Learning Fast writes 2D Data Structure Replicated Tunable consistency Multi-Data centers C*Akka Spark Very Fast processing Distributed, Scalable computing Actor-based Pipelines Actor state can be persisted Supervision strategies Ad-Hoc Queries Joins, Aggregate User Defined Functions Machine Learning, Advanced Stats and Analytics
  • 14. Akka-Cassandra-Spark Stack Cassandra-Spark Connector Cassandra Spark Streaming SQL MLlib Graphx Extract Data Create Models, Enrich, Transform Fetch from other Sources: Kafka Fetch from other Sources: DB’s, Files Akka Analytics, Statistics, Data Science, Model Training Access Model Persist Actors’ State
  • 15. Cassandra-Spark Connector Cassandra: Store all the data Spark: Analyze all the data DC1: replication factor 3 DC2: replication factor 3 DC3: replication factor 3 + Spark Executors Storage! Analytics! Data
  • 16. Data Science: Anomaly Detection An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. Hawkins, 1980
  • 17. Data Science: Anomaly Detection Distance Based Density Based
  • 18. Example: Analyze gowalla check-ins year | month | day | time | uid | lat | lon | ts | vid ------+-------+-----+------+--------+----------+-----------+--------------------------+--------- 2010 | 9 | 14 | 91 | 853 | 40.73474 | -73.87434 | 2010-09-14 00:01:31+0000 | 917955 2010 | 9 | 14 | 328 | 4516 | 40.72585 | -73.99289 | 2010-09-14 00:05:28+0000 | 37160 2010 | 9 | 14 | 344 | 2964 | 40.67621 | -73.98405 | 2010-09-14 00:05:44+0000 | 956870 Check-ins dataset Venues dataset vid | name | lat | long ------+-------+-----+------+--------+----------+----------- +--------------------------+--------- 754108 | My Suit NY | 40.73474 | -73.87434 249755 | UA Court Street Stadium 12 | 40.72585 | -73.99289 6919688 | Sky Asian Bistro | 40.67621 | -73.98405
  • 20. Data Science: clustering venues Weekly visitors patterns! Madison Square, Apple Store, Radio City Music Hall Thursdays, Fridays, Saturdays are busy Statue of Liberty, Jacob K. Javits Convention Center, Whole Foods Market (Columbus Circle) Not popular on midweek Intuition:
  • 21. Data Science: clustering with k-means Histograms components as dimensions Similar histograms would occupy similar places in the feature space How do I compare histograms: - EMD - Chi-squared distance - Space transformation (DCT) Intuition:
  • 22. K-Means: Featurize data + cluster val weekly_visits = checkins_venues.select("vid","ts") .map(row => (row.getLong("vid"), vectorize_time(s.getTimestamp("ts")) .reduceByKey(_ + _) .mapValues(_ => featurize_histogram(_._1)) val numClusters = 15 val numIterations = 100 val clusters = KMeans.train(weekly_visits, numClusters, numIterations) PairRDDs, weekly patterns per venue cluster similar weekly patterns
  • 23. How to use it 1) Classification Classify venues to given groups 2) Anomaly Detection Detect shift in the clustering assignment for a given venue for a given week Keep monitoring weekly change in patterns, when it happens trigger a signal week 26 week 27 Action
  • 24. Data Science: clustering users’ venues
  • 25. Data Science: clustering users’ venues Users tend to stick in the same places People have habits By clustering the places together We can identify anomalous locations Size of the cluster matters More points means less anomalous Mini-clusters and single anomalies are treated in similar ways ... Intuition:
  • 26. Data Science: clustering with DBSCAN DBSCAN find clusters based on neighbouring density Does not require the number of cluster k beforehand. Clusters are not spherical
  • 27. Data Science: clustering users’ venues val locs = checkins_venues.select("uid", "lat","lon") .map(s => (s.getLong(0), Seq( (s.getDouble(1), s.getDouble(2)) )) .reduceByKey(_ + _) .mapValues( dbscan (_) ) Have a look at: scalanlp/nak
  • 28. Data Science: Two ways to find anomalies with clustering - Cluster big amount of data with k-means and histograms - Apply clustering independently to million of users, to each identify the patterns with dbscan algorithm
  • 29. MLlib vs PairRDDs KMeans.train(FeaturesRDD, numClusters, numIterations) UserFeaturesPairRDD.GroupbyKey().mapValues( dbscan(_) ) RDDs map functions Parallelism easy to exploit The function runs locally for each Key Pick your fav machine learning algorithms Limited nr of points Running in parallel for millions of Keys MLlib Truly distributed algorithm Classify venues to given groups Millions of datapoints Limited amount of clusters
  • 30. 30 Real Time APIs Streaming Data Data Sources, Files, DB extracts Batched Data Training, Scoring and Exposing models
  • 31. Training vs Scoring: Latency budget ● Akka: millisecond response ● Spark: in-memory data models Train: Spark Score: Spark Train: Spark Score: Akka slow: minutes fast: millisecs Model Scoring ModelTraining slow:minutes
  • 32. Akka Mixed Load Cassandra Cluster Coral: Web API for dynamic data flows
  • 33. Akka Web API for dynamic data flows ● a web api to define/manage/run streaming data-flows ● open source and community managed ● event processing as a service coral-streaming/coral Steven Raemaekers Jasper van Zandbeek Ger van Rossum Hoda Alemi Koen Verschuren
  • 34. 34 Real Time APIs Streaming Data Data Sources, Files, DB extracts Batched Data Summary:
  • 35. Akka Feedback to the community: More Algorithms for machine learning! - DBSCAN, OPTICS, PAM - More metrics, non-euclidean spaces, etc - Non distributed algorithms: more scalanlp integration? Streaming all the way: Unify batch (Spark) and event streaming (Akka) computing
  • 36. Thanks! - Vision and strategy on an event-driven bank - ING CIO management team and awesome colleagues Spark, Cassandra, Akka communities !
  • 37. webinar + live demo: Dec 9th
  • 38. Resources Coral: event processing webapi https://github.com/coral-streaming/coral Spark + Cassandra: Clustering Events http://www.natalinobusa.com/2015/07/clustering-check-ins-with-spark-and.html Spark: Machine Learning, SQL frames https://spark.apache.org/docs/latest/mllib-guide.html https://spark.apache.org/docs/latest/sql-programming-guide.html Datastax: Analytics and Spark connector http://www.slideshare.net/doanduyhai/spark-cassandra-connector-api-best-practices-and-usecases http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/anaHome/anaHome.html Anomaly Detection Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey"(PDF). ACM Computing Surveys 41 (3): 1. doi:10.1145/1541880.1541882.
  • 39. Resources Datasets https://snap.stanford.edu/data/loc-gowalla.html E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: Friendship and Mobility: User Movement in Location-Based Social Networks ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011 https://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.zip The project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant PTDC/EIA-EIA/109840/2009. . Pictures: "DBSCAN-density-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-density-data. svg#/media/File:DBSCAN-density-data.svg "DBSCAN-Illustration" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg#/media/File: DBSCAN-Illustration.svg "Multimodal" by Visnut - Own work. Licensed under CC BY-SA 4.0 via Commons - https://commons.wikimedia.org/wiki/File:Multimodal.png#/media/File:Multimodal.png "Standard deviation diagram" by Mwtoews - Own work, based (in concept) on figure by Jeremy Kemp, on 2005-02-09. Licensed under CC BY 2.5 via Commons - https: //commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg#/media/File:Standard_deviation_diagram.svg "Michelsonmorley-boxplot" by User:Schutz - Own work. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Michelsonmorley-boxplot. svg#/media/File:Michelsonmorley-boxplot.svg