Banks are innovating. The purpose of this innovation is to transform bank services into meaningful and frictionless customer experiences. A key element in achieving that ambitious goal is to provide well-tailored, reactive APIs as the building blocks for greater and smoother customer journeys and experiences. For these APIs to work, internal processes have to evolve as well, from batch processing to real-time event processing.
Session presented at Big Data Spain 2015 Conference
16th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/fri/slot-35.html
4. @natbusa | linkedin: Natalino Busa
ING group
Empowering people to stay a step ahead in life and in business.
http://www.ing.com/About-us/Purpose-Strategy.htm
Clear and Easy
Anytime, Anywhere
Empower
Keep getting better
7. @natbusa | linkedin: Natalino Busa
Apply advanced, predictive analytics on live data
Event-Driven and exposed via APIs
Lean Architecture, Easy to integrate
Available, Consistent, Streaming, Real-time Data
Resilient, Distributed, Scalable, Maintainable
Clear and Easy
Anytime, Anywhere
Empower
Keep getting better
Data Principles
ING group
8. @natbusa | linkedin: Natalino Busa
Big Data and Fast Data
Figure: population (events, transactions, sessions, customers, etc.) plotted over time, on an axis running from 10 yrs, 5 yrs, 1 yr, 1 month, 1 day and 1 hour down to 1 m. Regions: historical big data, recent data, event streams.
10. @natbusa | linkedin: Natalino Busa
Why Big Data?
1. Analyze and model
2. Learn, cluster, categorize, organize facts
13. @natbusa | linkedin: Natalino Busa
Distributed
Data Store
Fast Analytics
Event Processing
Real Time APIs
Streaming Data
Data Modeling
Data Sources,
Files, DB extracts
Batched Data
Alerts and Notifications
API for mobile and web
Training, Scoring and Exposing models
read the model
read the data
write the model
14. @natbusa | linkedin: Natalino Busa
Cassandra+Akka+Spark: Machine Learning
Cassandra (C*): fast writes, 2D data structure, replicated, tunable consistency, multi-data centers
Akka: very fast processing, distributed and scalable computing, actor-based pipelines, actor state can be persisted, supervision strategies
Spark: ad-hoc queries, joins and aggregates, user-defined functions, machine learning, advanced stats and analytics
15. @natbusa | linkedin: Natalino Busa
Akka-Cassandra-Spark Stack
Cassandra-Spark Connector
Cassandra
Spark
Streaming, SQL, MLlib, GraphX
Extract Data
Create Models, Enrich, Transform
Fetch from other Sources: Kafka
Fetch from other Sources: DBs, Files
Akka
Analytics, Statistics, Data Science, Model Training
Access Model
Persist Actors’ State
16. @natbusa | linkedin: Natalino Busa
Cassandra-Spark Connector
Cassandra: Store all the data
Spark: Analyze all the data
DC1: replication factor 3
DC2: replication factor 3
DC3: replication factor 3 + Spark Executors
Storage! Analytics!
Data
17. @natbusa | linkedin: Natalino Busa
Cassandra-Spark Stack
Cassandra: Store all the data
Spark: Distributed Data Processing, Executors and Workers
Cassandra-Spark Connector: data locality, reduce shuffling, RDDs to Cassandra partitions
DC3: replication factor 3 + Spark Executors
18. @natbusa | linkedin: Natalino Busa
Data Science: Anomaly Detection
An outlier is an observation that deviates so much from other
observations as to arouse suspicion that it was generated by a different
mechanism.
Hawkins, 1980
19. @natbusa | linkedin: Natalino Busa
Data Science: Anomaly Detection (1)
Parametric: Gaussian model based
Nonparametric: histogram based
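A minimal sketch of the parametric (Gaussian) approach, assuming a single numeric feature and a fixed cut-off of three standard deviations. The object name, the helper names and the threshold are illustrative choices, not something taken from the talk.

```scala
object GaussianOutliers {
  // Mean and population standard deviation of a sample.
  def meanAndStd(xs: Seq[Double]): (Double, Double) = {
    require(xs.nonEmpty, "need at least one observation")
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    (mean, math.sqrt(variance))
  }

  // Number of standard deviations between x and the mean.
  def zScore(x: Double, mean: Double, std: Double): Double =
    if (std > 0.0) math.abs(x - mean) / std
    else if (x == mean) 0.0
    else Double.PositiveInfinity

  // Assumption: an observation is suspicious when it lies more than
  // `threshold` standard deviations away from the mean.
  def isOutlier(x: Double, mean: Double, std: Double, threshold: Double = 3.0): Boolean =
    zScore(x, mean, std) > threshold
}
```

The model is fitted once on historical values with `meanAndStd`, after which `isOutlier` is cheap enough to call per event.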
20. @natbusa | linkedin: Natalino Busa
Data Science: Anomaly Detection (2)
Distance based
Density based
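One common distance-based score is the distance to the k-th nearest neighbour: the larger it is, the more isolated the point. The sketch below assumes 2D points and plain Euclidean distance; all names are hypothetical.

```scala
object KnnDistance {
  type Point = (Double, Double)

  def distance(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Distance from p to its k-th nearest neighbour among `others`.
  // A large value suggests that p sits far away from the bulk of the data.
  def kthNeighbourDistance(p: Point, others: Seq[Point], k: Int): Double = {
    require(k >= 1 && k <= others.size, "k must be between 1 and others.size")
    others.map(distance(p, _)).sorted.apply(k - 1)
  }
}
```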
21. @natbusa | linkedin: Natalino Busa
Example: Analyze Gowalla check-ins
year | month | day | time | uid | lat | lon | ts | vid
------+-------+-----+------+--------+----------+-----------+--------------------------+---------
2010 | 9 | 14 | 91 | 853 | 40.73474 | -73.87434 | 2010-09-14 00:01:31+0000 | 917955
2010 | 9 | 14 | 328 | 4516 | 40.72585 | -73.99289 | 2010-09-14 00:05:28+0000 | 37160
2010 | 9 | 14 | 344 | 2964 | 40.67621 | -73.98405 | 2010-09-14 00:05:44+0000 | 956870
Check-ins dataset
Venues dataset
vid     | name                       | lat      | long
--------+----------------------------+----------+-----------
754108  | My Suit NY                 | 40.73474 | -73.87434
249755  | UA Court Street Stadium 12 | 40.72585 | -73.99289
6919688 | Sky Asian Bistro           | 40.67621 | -73.98405
22. @natbusa | linkedin: Natalino Busa
Data Science: clustering venues
Intuition: weekly visitor patterns!
Madison Square, Apple Store, Radio City Music Hall: Thursdays, Fridays, Saturdays are busy
Statue of Liberty, Jacob K. Javits Convention Center, Whole Foods Market (Columbus Circle): not popular midweek
23. @natbusa | linkedin: Natalino Busa
Data Science: clustering with k-means
Intuition:
Histogram components as dimensions
Similar histograms occupy similar places in the feature space
How do I compare histograms?
- EMD (earth mover's distance)
- Chi-squared distance
- Space transformation (DCT)
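As an illustration of one of the options listed above, here is a sketch of the chi-squared distance between two histograms. It assumes the common symmetric form, 0.5 * sum((p - q)^2 / (p + q)), computed on histograms normalized to sum to 1; the object and function names are hypothetical.

```scala
object HistogramDistance {
  // Scale a histogram so that its bins sum to 1 (all-zero histograms are left untouched).
  def normalize(h: Seq[Double]): Seq[Double] = {
    val total = h.sum
    if (total == 0.0) h else h.map(_ / total)
  }

  // Symmetric chi-squared distance; bins that are empty in both histograms are skipped.
  def chiSquared(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.size == b.size, "histograms must have the same number of bins")
    val p = normalize(a)
    val q = normalize(b)
    0.5 * p.zip(q).collect {
      case (x, y) if x + y > 0.0 => (x - y) * (x - y) / (x + y)
    }.sum
  }
}
```

With this normalization the distance is 0 for histograms of the same shape and 1 for histograms with no overlapping bins.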
24. @natbusa | linkedin: Natalino Busa
K-Means: Featurize data + cluster
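// KMeans is provided by Spark MLlib
import org.apache.spark.mllib.clustering.KMeans

// Build one weekly histogram per venue: a PairRDD of (vid, histogram)
val weekly_visits = checkins_venues.select("vid", "ts")
  .map(row => (row.getLong("vid"), vectorize_time(row.getTimestamp("ts"))))
  .reduceByKey(_ + _)                 // sum the per-check-in vectors into one histogram per venue
  .mapValues(featurize_histogram(_))  // turn each histogram into a feature vector

val numClusters = 15
val numIterations = 100
// KMeans.train expects an RDD of feature vectors, so drop the venue keys
val clusters = KMeans.train(weekly_visits.values, numClusters, numIterations)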
PairRDDs, weekly patterns per venue
cluster similar weekly patterns
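The slide does not show the helpers used in the snippet above. The following is one possible reading, assuming a 7-bin histogram (one bin per day of the week, UTC) and plain Scala vectors instead of Spark types. `vectorizeTime`, `add` and `featurizeHistogram` are hypothetical stand-ins, not the author's definitions.

```scala
import java.time.{Instant, ZoneOffset}

object WeeklyHistogram {
  val Bins = 7 // assumption: one bin per day of the week, Monday = 0

  // One-hot vector marking the day of the week of a single check-in.
  def vectorizeTime(ts: Instant): Vector[Double] = {
    val day = ts.atZone(ZoneOffset.UTC).getDayOfWeek.getValue - 1 // 0..6
    Vector.tabulate(Bins)(i => if (i == day) 1.0 else 0.0)
  }

  // Element-wise sum: the role played by `_ + _` in reduceByKey.
  def add(a: Vector[Double], b: Vector[Double]): Vector[Double] =
    a.zip(b).map { case (x, y) => x + y }

  // Normalize counts to frequencies so busy and quiet venues become comparable.
  def featurizeHistogram(h: Vector[Double]): Vector[Double] = {
    val total = h.sum
    if (total == 0.0) h else h.map(_ / total)
  }
}
```

A finer grid (for example hour of the week) would follow the same pattern with more bins.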
25. @natbusa | linkedin: Natalino Busa
Assigning venues to clusters
// Score and assign each venue to a cluster
val venues_clustered = checkins_venues.select("vid", "ts").where("ts > dateof(now())")
  .map(row => (row.getLong("vid"), vectorize_time(row.getTimestamp("ts"))))
  .reduceByKey(_ + _)
  .mapValues(h => clusters.predict(featurize_histogram(h)))  // same features as in training
// Store the result in Cassandra
venues_clustered.saveToCassandra("lbsn", "venues_cl")
score and assign each venue to a cluster
Store the cluster centers in Cassandra
26. @natbusa | linkedin: Natalino Busa
How to use it
1) Classification
Classify venues to given groups
2) Anomaly Detection
Detect a shift in the cluster assignment of a given venue for a given week
Keep monitoring weekly changes in patterns; when a change happens, trigger a signal
week 26 vs week 27: alert!
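The weekly monitoring idea can be sketched as a comparison of two assignment maps (venue id to cluster id), one per week. This assumes the assignments have already been computed and loaded; `ClusterShift` and `shifted` are hypothetical names.

```scala
object ClusterShift {
  // Ids of venues whose cluster assignment differs between two weeks.
  // Venues seen in only one of the two weeks are ignored in this sketch.
  def shifted(previous: Map[Long, Int], current: Map[Long, Int]): Set[Long] =
    current.collect {
      case (vid, cluster) if previous.get(vid).exists(_ != cluster) => vid
    }.toSet
}
```

Each id returned is a candidate for an alert, for example when comparing week 26 with week 27.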
27. @natbusa | linkedin: Natalino Busa
Data Science: clustering users’ venues
Intuition:
Users tend to stick to the same places
People have habits
By clustering the places together
We can identify anomalous locations
Size of the cluster matters
More points means less anomalous
Mini-clusters and single anomalies are
treated in similar ways ...
30. @natbusa | linkedin: Natalino Busa
Data Science: clustering with DBSCAN
DBSCAN finds clusters based on neighbouring density
Does not require the number of clusters k beforehand.
Clusters need not be spherical
It’s a graph!
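A compact, textbook-style DBSCAN in plain Scala, to make the density and graph intuition concrete. It assumes 2D points and Euclidean distance, which is a simplification for latitude/longitude data, and it scans all points for every neighbourhood query, so it only suits the small per-user point sets discussed here. It is not the scalanlp/nak implementation.

```scala
import scala.collection.mutable

object SimpleDbscan {
  type Point = (Double, Double)
  val Noise = -1

  def distance(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Returns one label per input point: a cluster id (0, 1, ...) or Noise.
  def cluster(points: IndexedSeq[Point], eps: Double, minPts: Int): IndexedSeq[Int] = {
    val unvisited = -2
    val labels = Array.fill(points.size)(unvisited)

    // All points within eps of point i (including i itself).
    def neighbours(i: Int): Seq[Int] =
      points.indices.filter(j => distance(points(i), points(j)) <= eps)

    var clusterId = 0
    for (i <- points.indices) {
      if (labels(i) == unvisited) {
        val seeds = neighbours(i)
        if (seeds.size < minPts) labels(i) = Noise // may become a border point later
        else {
          labels(i) = clusterId
          val queue = mutable.Queue.empty[Int]
          queue ++= seeds
          while (queue.nonEmpty) {
            val j = queue.dequeue()
            if (labels(j) == Noise) labels(j) = clusterId // border point
            else if (labels(j) == unvisited) {
              labels(j) = clusterId
              val next = neighbours(j)
              if (next.size >= minPts) queue ++= next // j is a core point: keep expanding
            }
          }
          clusterId += 1
        }
      }
    }
    labels.toIndexedSeq
  }
}
```

Points labelled `Noise`, or belonging to very small clusters, are the candidate anomalous locations.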
31. @natbusa | linkedin: Natalino Busa
Data Science: clustering users’ venues
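// Collect each user's check-in coordinates, then cluster them per user
val locs = checkins_venues.select("uid", "lat", "lon")
  .map(s => (s.getLong(0), Seq((s.getDouble(1), s.getDouble(2)))))
  .reduceByKey(_ ++ _)           // concatenate the per-check-in point lists
  .mapValues(dbscan cluster _)   // dbscan: a local DBSCAN implementation, run once per user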
Have a look at: scalanlp/nak
32. @natbusa | linkedin: Natalino Busa
Data Science:
Two ways to find anomalies with clustering
- Cluster a large amount of data with k-means and histograms
- Apply clustering independently to millions of users, identifying each user's patterns with the DBSCAN algorithm
33. @natbusa | linkedin: Natalino Busa
MLlib vs PairRDDs
MLlib: KMeans.train(FeaturesRDD, numClusters, numIterations)
- Truly distributed algorithm
- Classify venues to given groups
- Millions of datapoints
- Limited amount of clusters
PairRDDs: UserFeaturesPairRDD.groupByKey().mapValues( dbscan cluster _ )
- RDDs map functions
- Parallelism easy to exploit
- The function runs locally for each key
- Pick your favourite machine learning algorithms
- Limited number of points
- Running in parallel for millions of keys
35. @natbusa | linkedin: Natalino Busa
Training vs Scoring: Time budgets
● Akka: millisecond response
● Spark: in-memory (big)-data models
Figure: Model Training vs Model Scoring, each ranging from slow (minutes) to fast (millisecs)
Train: Spark, Score: Spark
Train: Spark, Score: Akka
Train: Akka, Score: Akka
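For the "Train: Spark, Score: Akka" case, scoring a k-means model only needs the cluster centers, which keeps it within a millisecond budget. A minimal sketch, assuming the centers have already been read from the data store as plain sequences of doubles; `CentroidScorer` is a hypothetical name.

```scala
object CentroidScorer {
  def squaredDistance(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the closest centroid: the same rule k-means uses to assign a point to a cluster.
  def predict(centroids: IndexedSeq[Seq[Double]], features: Seq[Double]): Int = {
    require(centroids.nonEmpty, "need at least one centroid")
    centroids.indices.minBy(i => squaredDistance(centroids(i), features))
  }
}
```

Because the function is pure and holds no Spark dependency, it could be called from inside an actor for each incoming event.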
36. @natbusa | linkedin: Natalino Busa
Which processing? .. and the granularity of data?
What is the latency and the throughput?
37. @natbusa | linkedin: Natalino Busa
It’s all about latency!
Map Reduce: Big Data, batch based
RDDs: Big Data, micro-batch based
CRDTs + Monoids: Fast Data, event based
MillWheel: Fast + Big Data, event & window
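To illustrate why CRDTs and monoids fit event-based processing, here is a sketch of a grow-only counter (G-Counter), a standard CRDT. Each node increments its own slot, and replicas merge with an element-wise maximum, so merges can happen in any order and any number of times. The node ids and the class name are illustrative.

```scala
final case class GCounter(counts: Map[String, Long]) {
  // Each node only ever increments its own entry.
  def increment(node: String, by: Long = 1L): GCounter = {
    require(by >= 0, "a grow-only counter cannot decrease")
    GCounter(counts.updated(node, counts.getOrElse(node, 0L) + by))
  }

  // The counter value is the sum over all nodes.
  def value: Long = counts.values.sum

  // Element-wise max: commutative, associative and idempotent.
  def merge(other: GCounter): GCounter =
    GCounter((counts.keySet ++ other.counts.keySet).map { k =>
      k -> math.max(counts.getOrElse(k, 0L), other.counts.getOrElse(k, 0L))
    }.toMap)
}

object GCounter {
  val empty: GCounter = GCounter(Map.empty[String, Long])
}
```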
38. @natbusa | linkedin: Natalino Busa
Akka
Mixed Load Cassandra Cluster
Coral: Web API for dynamic data flows
39. @natbusa | linkedin: Natalino Busa
Data
events
POST http://coral/api/actors/23/in
{
"amount":23.45,
"user": 76232,
"city": "Berlin"
}
Coral: Streaming data via Web APIs
40. @natbusa | linkedin: Natalino Busa
Trigger Emit
Params
State
GET http://coral/api/actors/23
{
  "actors": {
    "def": {
      "type": "stats",
      "params": {
        "field": "amount"
      }
    },
    "state": {
      "count": 134,
      "avg": 39.84,
      "min": 1.99,
      "max": 204.19,
      "sd": 38.01
    }
  }
}
Coral: Streaming data via Web APIs
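The state fields shown above (count, avg, min, max, sd) can be maintained one event at a time. The sketch below uses Welford's online algorithm and reports the population standard deviation; this is an assumption for illustration and not Coral's actual implementation, and the names are hypothetical.

```scala
final case class Stats(count: Long, mean: Double, m2: Double, min: Double, max: Double) {
  // Welford's online update: fold one more value into the running state.
  def add(x: Double): Stats = {
    val n = count + 1
    val delta = x - mean
    val newMean = mean + delta / n
    Stats(n, newMean, m2 + delta * (x - newMean), math.min(min, x), math.max(max, x))
  }

  def avg: Double = mean

  // Population standard deviation (assumption: the slide does not say which variant is used).
  def sd: Double = if (count > 0) math.sqrt(m2 / count) else 0.0
}

object Stats {
  val empty: Stats = Stats(0L, 0.0, 0.0, Double.PositiveInfinity, Double.NegativeInfinity)
}
```

Feeding the `amount` field of every incoming event through `add` yields a state that can be served directly on a GET request.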
41. @natbusa | linkedin: Natalino Busa
Akka
Coral: Web API for dynamic data flows
● a web api to define/manage/run streaming data-flows
● open source and community managed
● event processing as a service
● connect to Cassandra to access models
● connect to Kafka to consume and produce events
Steven Raemaekers
Jasper van Zandbeek
Ger van Rossum
Hoda Alemi
Koen Verschuren
42. @natbusa | linkedin: Natalino Busa
Distributed
Data Store
Fast Analytics
Event Processing
Real Time APIs
Streaming Data
Data Modeling
Data Sources,
Files, DB extracts
Batched Data
Alerts and Notifications
API for mobile and web
Summary:
read the model
read the data
write the model
43. @natbusa | linkedin: Natalino Busa
Akka
Feedback to the community:
More Algorithms for machine learning!
- DBSCAN, OPTICS, PAM
- More metrics, non-euclidean spaces, etc
- Non distributed algorithms: more scalanlp integration?
Streaming all the way:
Unify batch (Spark) and event streaming (Akka) computing
44. @natbusa | linkedin: Natalino Busa
Thanks!
- Vision and strategy on an event-driven bank
- ING CIO management team and awesome colleagues
Spark, Cassandra, Akka communities !
46. @natbusa | linkedin: Natalino Busa
Resources
Datasets
https://snap.stanford.edu/data/loc-gowalla.html
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: User Movement in Location-Based Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011
https://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.zip
The project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant PTDC/EIA-EIA/109840/2009.
Pictures:
"DBSCAN-density-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-density-data.svg#/media/File:DBSCAN-density-data.svg
"DBSCAN-Illustration" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg#/media/File:DBSCAN-Illustration.svg
"Multimodal" by Visnut - Own work. Licensed under CC BY-SA 4.0 via Commons - https://commons.wikimedia.org/wiki/File:Multimodal.png#/media/File:Multimodal.png
"Standard deviation diagram" by Mwtoews - Own work, based (in concept) on figure by Jeremy Kemp, on 2005-02-09. Licensed under CC BY 2.5 via Commons - https://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg#/media/File:Standard_deviation_diagram.svg
"Michelsonmorley-boxplot" by User:Schutz - Own work. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Michelsonmorley-boxplot.svg#/media/File:Michelsonmorley-boxplot.svg