SlideShare a Scribd company logo
1 of 47
Download to read offline
@natbusa | linkedin.com: Natalino Busa
Sightseeing, venues, and friends:
Predictive analytics with Spark ML and Cassandra
Natalino Busa
Head of Applied Data Science at Teradata
@natbusa | linkedin.com: Natalino Busa
Distributed computing Machine Learning
Statistics Big/Fast Data Streaming Computing
Head of Applied Data Science at
Teradata
@natbusa | linkedin.com: Natalino Busa
Location Based Transaction Networks
Locations Network Transactions
@natbusa | linkedin.com: Natalino Busa
You find them everywhere!
Emails
Social Networks
Check-ins
Payments
@natbusa | linkedin.com: Natalino Busa
You find them everywhere!
Emails Industrial IoT
networks
Social Networks
Check-ins
Social Networks
Ratings
CallsPayments
@natbusa | linkedin.com: Natalino Busa
You find them everywhere!
Emails Industrial IoT
networks
Transportation
Social Networks
Check-ins
Social Networks
Ratings
Traveling
Services
Messaging
Remote Assistance
CallsPayments
@natbusa | linkedin.com: Natalino Busa
Use cases
● Clustering
● Classification
● Anomaly Detection
● Fraud Detection
● Cyber Security
● Forecasting
● Process Mining
● Communities
@natbusa | linkedin.com: Natalino Busa
Location Based Transaction Networks
Locations Network Transactions
@natbusa | linkedin.com: Natalino Busa
Transform
Aggregate
Machine Learning
Score, Train
SQL, Graph, ML
E T L
Extract
Streaming Real-Time Events
Batch Extracts from DBs
Consume from APIs
Extract from no-SQL datastores
Load
Expose as APIs
Load to DBs
Produce to APIs
Push to noSQL datastores
Evolution of ETL
@natbusa | linkedin.com: Natalino Busa
Fast writes
2D Data Structure
Replicated
Tunable consistency
Multi-Data centers
CassandraKafka Spark
Streaming Events
Distributed, Scalable Transport
Events are persisted
Decoupled Consumer-Producers
Topics and Partitions
Ad-Hoc Queries
Joins, Aggregate
User Defined Functions
Machine Learning,
Advanced Stats and Analytics
Kafka+Cassandra+Spark:
Streaming Machine Learning
@natbusa | linkedin.com: Natalino Busa
Example: Analyze gowalla events
uid | ts | lat | lon | vid
-------+--------------------------+----------+-----------+--------
46073 | 2010-10-16 12:01:40+0000 | 51.51339 | -0.13101 | 219097
46073 | 2010-10-16 11:20:02+0000 | 51.51441 | -0.141921 | 23090
46073 | 2010-10-13 12:08:19+0000 | 59.33101 | 18.06445 | 136676
46073 | 2010-10-12 23:28:43+0000 | 59.45591 | 17.81232 | 96618
46073 | 2010-09-15 18:14:13+0000 | 59.45591 | 17.81232 | 96618
Events dataset
Venues dataset
vid | name | lat | long
--------+----------------------------+----------+-----------
754108 | My Suit NY | 40.73474 | -73.87434
249755 | UA Court Street Stadium 12 | 40.72585 | -73.99289
6919688 | Sky Asian Bistro | 40.67621 | -73.98405
Users dataset
uid | id
-------+--------
1535 | 2
1535 | 1691
1535 | 8869
1535 | 23363
1535 | 52998
82694 | 7860
82694 | 8026
82694 | 10624
82694 | 30683
82694 | 57534
82694 | 75763
82694 | 76154
82694 | 76223
82694 | 88708
82694 | 151476
@natbusa | linkedin.com: Natalino Busa
Cassandra: Store all the data
Spark: Analyze all the data
DC1: replication factor 3 DC2: replication factor 3 DC3: replication factor 3 + Spark Executors
Storage! Analytics!
Data
Spark and Cassandra: distributed goodness
@natbusa | linkedin.com: Natalino Busa
Cassandra - Spark Connector
Cassandra: Store all the data
Spark: Distributed Data Processing
Executors and Workers
Cassandra-Spark Connector:
Data locality,
Reduce Shuffling
RDD’s to Cassandra Partitions
DC3: replication factor 3 +
Spark Executors
@natbusa | linkedin.com: Natalino Busa
Joining Tables
val df_venues = cc.sql("select vid, name from lbsn.venues")
.as("venues")
val df_events = cc.sql("select * from lbsn.events")
.as("events")
val events = df_events
.join(df_venues, df_events("events.vid") === df_venues("venues.vid"))
.select("ts", "uid", "lat", "lon", "venues.vid","venues.name")
events.count()
@natbusa | linkedin.com: Natalino Busa
Venues and Events
@natbusa | linkedin.com: Natalino Busa
Basic Statistics
val venues_count = events.groupBy("name").count()
venues_count.sort(venues_count("count").desc).show(5, false)
+---------------------------------+-----+
|name |count|
+---------------------------------+-----+
|LGA LaGuardia Airport |1679 |
|JFK John F. Kennedy International|1653 |
|Times Square |1089 |
|Grand Central Terminal |1005 |
|The Museum of Modern Art (MoMA) |392 |
+---------------------------------+-----+
@natbusa | linkedin.com: Natalino Busa
K-Means Clustering
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val locs = events.select("lat","lon")
.map(s => Vectors.dense(s.getDouble(0), s.getDouble(1)))
.cache()
val numClusters = 20
val numIterations = 20
val clusters = KMeans.train(locs, numClusters, numIterations)
@natbusa | linkedin.com: Natalino BusaEvents k-means clustering
@natbusa | linkedin.com: Natalino Busa
K-Means Scoring
import org.apache.spark.sql.functions.udf
val func = (lat:Double, lon:Double) =>
clusters.predict(Vectors.dense(lat,lon))
val sqlfunc = udf(func)
val locs_cid = events
.withColumn("cluster", sqlfunc(events("lat"), events("lon")))
@natbusa | linkedin.com: Natalino Busa
Top venues per cluster
@natbusa | linkedin.com: Natalino Busa
Storing k-means clusters top venues in Cassandra
CREATE TABLE lbsn.topvenues (
cluster int,
name text,
count int,
PRIMARY KEY (cl,name)
);
locs_cid.select("cluster", "name")
.groupBy("cluster", "name").
.agg(Map("name" -> "count")).
.sort($"cluster", $"COUNT(name)".desc)
.saveToCassandra("lbsn", "topvenues",
SomeColumns("cluster", "name", "count"))
@natbusa | linkedin.com: Natalino Busa
Storing k-means clusters top venues in Cassandra
CREATE TABLE lbsn.topvenues (
cluster int,
name text,
count int,
PRIMARY KEY (cl,name)
);
locs_cid.select("cluster", "name")
.groupBy("cluster", "name").
.agg(Map("name" -> "count")).
.sort($"cluster", $"COUNT(name)".desc)
.saveToCassandra("lbsn", "topvenues",
SomeColumns("cluster", "name", "count"))
Works in Spark Streaming too!
@natbusa | linkedin.com: Natalino Busa
Process mining
1
2
4
5
3
6
7
8
9
@natbusa | linkedin.com: Natalino Busa
Process mining
time
time
time
1
2
4
5
3
6
7
8
9
@natbusa | linkedin.com: Natalino Busa
Process mining
time
time
time
Source id Target id
1 3
3 8
5 2
1
2
4
5
3
6
7
8
9
@natbusa | linkedin.com: Natalino Busa
https://imgflip.com/i/155v6p
@natbusa | linkedin.com: Natalino Busa
https://imgflip.com/i/155vah
@natbusa | linkedin.com: Natalino Busa
Process flow: Sankey Diagrams
http://sankeymatic.com/build/
by Steve Bogart (Twitter: @nowthis).
@natbusa | linkedin.com: Natalino Busa
Flow network: Cord diagram
size id label
0 4075 135 Times Square
1 3173 10 Empire State Building
2 3055 89 The Museum of Modern Art (MoMA)
3 3041 4 Columbus Circle
4 2831 116 Madison Square Garden
6 2486 54 Grand Central Terminal
Circos - http://circos.ca/
By Martin Krzywinski
@natbusa | linkedin.com: Natalino Busa
Page Ranking clusters
import org.graphframes.GraphFrame
val g = GraphFrame(df_vertices, df_edges)
val results = g.pageRank.resetProbability(0.01).maxIter(20).run()
val results_sorted = results
.vertices
.select("id", "pagerank")
.sort($"pagerank".desc)
results_sorted.head(20)
http://graphframes.github.io/
@natbusa | linkedin.com: Natalino Busa
Venues Importance: Page Rank
@natbusa | linkedin.com: Natalino Busa
10 yrs 5 yrs 1 yr 1 month 1 day 1hour 1m
time
population:events,transactions,
sessions,customers,etc
event streams
recent data
historical big data
Big Data and Fast Data
@natbusa | linkedin.com: Natalino Busa
Why Fast Data?
1. Relevant up-to-date information.
2. Delivers actionable events.
@natbusa | linkedin.com: Natalino Busa
Why Big Data?
1. Analyze and model
2. Learn, cluster, categorize, organize facts
@natbusa | linkedin.com: Natalino Busa
Batch -> Streaming ETL
Hive -> Flink, Akka, Spark, Kafka
@natbusa | linkedin.com: Natalino Busa
Stream-Centric Architectures
@natbusa | linkedin.com: Natalino Busa
IngestData
Streaming ETL
37
Distributed
Data Store
Advanced Analytics
Real Time APIs
Event Streams/Logs
Streaming Data
Data Modeling
Data Sources,
Files, DB extracts
Batched Data
Alerts and Notifications
API for mobile and web
read the model
read the data
write the model
Microservices
Microservices on ML on SQL on Streams
@natbusa | linkedin.com: Natalino Busa
Data Science: Anomaly Detection
An outlier is an observation that deviates so much from other
observations as to arouse suspicion that it was generated by a different
mechanism.
Hawkins, 1980
Anomaly detection is the search for items or events which do not
conform to the expected pattern
Chandola, 2009
@natbusa | linkedin.com: Natalino Busa
Anomaly Detection
Parametric BasedGaussian Model Based Histogram, nonparametric
@natbusa | linkedin.com: Natalino Busa
Clustering with k-means
Distance based, space fully covered
Not a silver bullet though
Intuition:
@natbusa | linkedin.com: Natalino Busa
Clustering users’ venues
Intuition:
Users tend to stick in the same places
People have habits
By clustering the places together
We can identify anomalous locations
Size of the cluster matters
More points means less anomalous
Mini-clusters and single anomalies are
treated in similar ways ...
@natbusa | linkedin.com: Natalino Busa
Data Science: clustering with DBSCAN
DBSCAN find clusters based on neighbouring density
Does not require the number of cluster k beforehand.
Clusters are not spherical
@natbusa | linkedin.com: Natalino Busa
Data Science: clustering users’ venues
val locs = checkins_venues.select("uid", "lat","lon")
.map(s => (s.getLong(0), Seq( (s.getDouble(1), s.getDouble(2)) ))
.reduceByKey(_ ++ _)
.mapValues( gdbscan cluster _ )
Refernce: scalanlp/nak
@natbusa | linkedin.com: Natalino Busa
User events DBSCAN
clustering
@natbusa | linkedin.com: Natalino Busa
Resources
Spark + Cassandra: Clustering Events
http://www.natalinobusa.com/2015/07/clustering-check-ins-with-spark-and.html
Spark: Machine Learning, SQL frames
https://spark.apache.org/docs/latest/mllib-guide.html
https://spark.apache.org/docs/latest/sql-programming-guide.html
Datastax: Analytics and Spark connector
http://www.slideshare.net/doanduyhai/spark-cassandra-connector-api-best-practices-and-usecases
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/anaHome/anaHome.html
Anomaly Detection
Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey"(PDF). ACM Computing Surveys 41 (3): 1. doi:10.1145/1541880.1541882.
@natbusa | linkedin.com: Natalino Busa
Datasets
https://snap.stanford.edu/data/loc-gowalla.html
E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: Friendship and Mobility: User Movement in Location-Based Social Networks ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2011
https://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.zip
The project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant
PTDC/EIA-EIA/109840/2009. .
Pictures:
https://commons.wikimedia.org
Resources
@natbusa | linkedin.com: Natalino Busa
Spark ML Tutorial today at 15.15
Capital Lounge 3, second floor

More Related Content

What's hot

AI-Powered Streaming Analytics for Real-Time Customer Experience
AI-Powered Streaming Analytics for Real-Time Customer ExperienceAI-Powered Streaming Analytics for Real-Time Customer Experience
AI-Powered Streaming Analytics for Real-Time Customer Experience
Databricks
 
Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...
Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...
Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...
Natalino Busa
 

What's hot (20)

Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0
 
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio AlcacerApache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
 
AI-Powered Streaming Analytics for Real-Time Customer Experience
AI-Powered Streaming Analytics for Real-Time Customer ExperienceAI-Powered Streaming Analytics for Real-Time Customer Experience
AI-Powered Streaming Analytics for Real-Time Customer Experience
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...
 
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
 
An efficient data mining solution by integrating Spark and Cassandra
An efficient data mining solution by integrating Spark and CassandraAn efficient data mining solution by integrating Spark and Cassandra
An efficient data mining solution by integrating Spark and Cassandra
 
Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming
 
Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...
Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...
Towards Real-Time banking API's: Introducing Coral, a web api for realtime st...
 
[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Using the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data Product
 
Multiplaform Solution for Graph Datasources
Multiplaform Solution for Graph DatasourcesMultiplaform Solution for Graph Datasources
Multiplaform Solution for Graph Datasources
 
Big data clustering
Big data clusteringBig data clustering
Big data clustering
 
Hadoop to spark_v2
Hadoop to spark_v2Hadoop to spark_v2
Hadoop to spark_v2
 
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteSpark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark Training
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 

Viewers also liked

Big Data Ecosystem at InMobi, Nasscom ATC 2013 Noida
Big Data Ecosystem at InMobi, Nasscom ATC 2013 NoidaBig Data Ecosystem at InMobi, Nasscom ATC 2013 Noida
Big Data Ecosystem at InMobi, Nasscom ATC 2013 Noida
Sharad Agarwal
 

Viewers also liked (17)

Sejarahjawadwipa
SejarahjawadwipaSejarahjawadwipa
Sejarahjawadwipa
 
31 Page Report
31 Page Report31 Page Report
31 Page Report
 
Meeting the challenges of big data
Meeting the challenges of big dataMeeting the challenges of big data
Meeting the challenges of big data
 
Openobject features
Openobject featuresOpenobject features
Openobject features
 
Is it all about the money? Holding on to your best people
Is it all about the money? Holding on to your best peopleIs it all about the money? Holding on to your best people
Is it all about the money? Holding on to your best people
 
Persamaan rumus kimia
Persamaan rumus kimiaPersamaan rumus kimia
Persamaan rumus kimia
 
Webconférence - Quelle Due Diligence pour quel risque ?
Webconférence - Quelle Due Diligence pour quel risque ?Webconférence - Quelle Due Diligence pour quel risque ?
Webconférence - Quelle Due Diligence pour quel risque ?
 
Big Data Ecosystem at InMobi, Nasscom ATC 2013 Noida
Big Data Ecosystem at InMobi, Nasscom ATC 2013 NoidaBig Data Ecosystem at InMobi, Nasscom ATC 2013 Noida
Big Data Ecosystem at InMobi, Nasscom ATC 2013 Noida
 
Spark on Yarn
Spark on YarnSpark on Yarn
Spark on Yarn
 
Behavioral Analytics for Financial Intelligence
Behavioral Analytics for Financial IntelligenceBehavioral Analytics for Financial Intelligence
Behavioral Analytics for Financial Intelligence
 
Social Media TOS Crash Course (Redacted #heweb13 presentation)
Social Media TOS Crash Course (Redacted #heweb13 presentation)Social Media TOS Crash Course (Redacted #heweb13 presentation)
Social Media TOS Crash Course (Redacted #heweb13 presentation)
 
Big Data at Pinterest - Presented by Qubole
Big Data at Pinterest - Presented by QuboleBig Data at Pinterest - Presented by Qubole
Big Data at Pinterest - Presented by Qubole
 
Auge totalitarismos entreguerras
Auge totalitarismos entreguerrasAuge totalitarismos entreguerras
Auge totalitarismos entreguerras
 
Lonja de valencia
Lonja de valenciaLonja de valencia
Lonja de valencia
 
Αριθμοί μέχρι το 7.000 Κεφ.40
Αριθμοί μέχρι το 7.000 Κεφ.40Αριθμοί μέχρι το 7.000 Κεφ.40
Αριθμοί μέχρι το 7.000 Κεφ.40
 
Case study - Hotel Silicrest
Case study - Hotel SilicrestCase study - Hotel Silicrest
Case study - Hotel Silicrest
 
Challenges and Risks for the CIO from Outsourcing in the digital era
Challenges and Risks for the CIO from Outsourcing in the digital eraChallenges and Risks for the CIO from Outsourcing in the digital era
Challenges and Risks for the CIO from Outsourcing in the digital era
 

Similar to Strata London 16: sightseeing, venues, and friends

DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
Timothy Spann
 
Meetup: Streaming Data Pipeline Development
Meetup:  Streaming Data Pipeline DevelopmentMeetup:  Streaming Data Pipeline Development
Meetup: Streaming Data Pipeline Development
Timothy Spann
 

Similar to Strata London 16: sightseeing, venues, and friends (20)

DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
 
Presentation
PresentationPresentation
Presentation
 
Apache Cassandra for Timeseries- and Graph-Data
Apache Cassandra for Timeseries- and Graph-DataApache Cassandra for Timeseries- and Graph-Data
Apache Cassandra for Timeseries- and Graph-Data
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphs
 
Suicide Risk Prediction Using Social Media and Cassandra
Suicide Risk Prediction Using Social Media and CassandraSuicide Risk Prediction Using Social Media and Cassandra
Suicide Risk Prediction Using Social Media and Cassandra
 
Meetup: Streaming Data Pipeline Development
Meetup:  Streaming Data Pipeline DevelopmentMeetup:  Streaming Data Pipeline Development
Meetup: Streaming Data Pipeline Development
 
Spark DataFrames for Data Munging
Spark DataFrames for Data MungingSpark DataFrames for Data Munging
Spark DataFrames for Data Munging
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC Oslo
 
Splunk App for Stream - Einblicke in Ihren Netzwerkverkehr
Splunk App for Stream - Einblicke in Ihren NetzwerkverkehrSplunk App for Stream - Einblicke in Ihren Netzwerkverkehr
Splunk App for Stream - Einblicke in Ihren Netzwerkverkehr
 
Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher   Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher
 
C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by K...
C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by K...C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by K...
C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by K...
 
Crossdata: an efficient distributed datahub with batch and streaming query ca...
Crossdata: an efficient distributed datahub with batch and streaming query ca...Crossdata: an efficient distributed datahub with batch and streaming query ca...
Crossdata: an efficient distributed datahub with batch and streaming query ca...
 
Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch an...
Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch an...Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch an...
Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch an...
 
Managing your black friday logs - Code Europe
Managing your black friday logs - Code EuropeManaging your black friday logs - Code Europe
Managing your black friday logs - Code Europe
 
What You Need To Know About The Top Database Trends
What You Need To Know About The Top Database TrendsWhat You Need To Know About The Top Database Trends
What You Need To Know About The Top Database Trends
 
Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office
Lessons Learned with Cassandra and Spark at the US Patent and Trademark OfficeLessons Learned with Cassandra and Spark at the US Patent and Trademark Office
Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office
 
Bids talk 9.18
Bids talk 9.18Bids talk 9.18
Bids talk 9.18
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
 

More from Natalino Busa

Streaming Api Design with Akka, Scala and Spray
Streaming Api Design with Akka, Scala and SprayStreaming Api Design with Akka, Scala and Spray
Streaming Api Design with Akka, Scala and Spray
Natalino Busa
 
Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.
Natalino Busa
 
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analyticsBig Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
Natalino Busa
 

More from Natalino Busa (17)

Data Production Pipelines: Legacy, practices, and innovation
Data Production Pipelines: Legacy, practices, and innovationData Production Pipelines: Legacy, practices, and innovation
Data Production Pipelines: Legacy, practices, and innovation
 
Data science apps powered by Jupyter Notebooks
Data science apps powered by Jupyter NotebooksData science apps powered by Jupyter Notebooks
Data science apps powered by Jupyter Notebooks
 
7 steps for highly effective deep neural networks
7 steps for highly effective deep neural networks7 steps for highly effective deep neural networks
7 steps for highly effective deep neural networks
 
Data science apps: beyond notebooks
Data science apps: beyond notebooksData science apps: beyond notebooks
Data science apps: beyond notebooks
 
[Ai in finance] AI in regulatory compliance, risk management, and auditing
[Ai in finance] AI in regulatory compliance, risk management, and auditing[Ai in finance] AI in regulatory compliance, risk management, and auditing
[Ai in finance] AI in regulatory compliance, risk management, and auditing
 
Data in Action
Data in ActionData in Action
Data in Action
 
The evolution of data analytics
The evolution of data analyticsThe evolution of data analytics
The evolution of data analytics
 
Streaming Api Design with Akka, Scala and Spray
Streaming Api Design with Akka, Scala and SprayStreaming Api Design with Akka, Scala and Spray
Streaming Api Design with Akka, Scala and Spray
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
Big data solutions for advanced marketing analytics
Big data solutions for advanced marketing analyticsBig data solutions for advanced marketing analytics
Big data solutions for advanced marketing analytics
 
Awesome Banking API's
Awesome Banking API'sAwesome Banking API's
Awesome Banking API's
 
Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.
 
Big and fast a quest for relevant and real-time analytics
Big and fast a quest for relevant and real-time analyticsBig and fast a quest for relevant and real-time analytics
Big and fast a quest for relevant and real-time analytics
 
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analyticsBig Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
 
Strata 2014: Data science and big data trending topics
Strata 2014: Data science and big data trending topicsStrata 2014: Data science and big data trending topics
Strata 2014: Data science and big data trending topics
 
Streaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesStreaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologies
 
Big data landscape
Big data landscapeBig data landscape
Big data landscape
 

Recently uploaded

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
cnajjemba
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 

Recently uploaded (20)

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 

Strata London 16: sightseeing, venues, and friends

  • 1. @natbusa | linkedin.com: Natalino Busa Sightseeing, venues, and friends: Predictive analytics with Spark ML and Cassandra Natalino Busa Head of Applied Data Science at Teradata
  • 2. @natbusa | linkedin.com: Natalino Busa Distributed computing Machine Learning Statistics Big/Fast Data Streaming Computing Head of Applied Data Science at Teradata
  • 3. @natbusa | linkedin.com: Natalino Busa Location Based Transaction Networks Locations Network Transactions
  • 4. @natbusa | linkedin.com: Natalino Busa You find them everywhere! Emails Social Networks Check-ins Payments
  • 5. @natbusa | linkedin.com: Natalino Busa You find them everywhere! Emails Industrial IoT networks Social Networks Check-ins Social Networks Ratings CallsPayments
  • 6. @natbusa | linkedin.com: Natalino Busa You find them everywhere! Emails Industrial IoT networks Transportation Social Networks Check-ins Social Networks Ratings Traveling Services Messaging Remote Assistance CallsPayments
  • 7. @natbusa | linkedin.com: Natalino Busa Use cases ● Clustering ● Classification ● Anomaly Detection ● Fraud Detection ● Cyber Security ● Forecasting ● Process Mining ● Communities
  • 8. @natbusa | linkedin.com: Natalino Busa Location Based Transaction Networks Locations Network Transactions
  • 9. @natbusa | linkedin.com: Natalino Busa Transform Aggregate Machine Learning Score, Train SQL, Graph, ML E T L Extract Streaming Real-Time Events Batch Extracts from DBs Consume from APIs Extract from no-SQL datastores Load Expose as APIs Load to DBs Produce to APIs Push to noSQL datastores Evolution of ETL
  • 10. @natbusa | linkedin.com: Natalino Busa Fast writes 2D Data Structure Replicated Tunable consistency Multi-Data centers CassandraKafka Spark Streaming Events Distributed, Scalable Transport Events are persisted Decoupled Consumer-Producers Topics and Partitions Ad-Hoc Queries Joins, Aggregate User Defined Functions Machine Learning, Advanced Stats and Analytics Kafka+Cassandra+Spark: Streaming Machine Learning
  • 11. @natbusa | linkedin.com: Natalino Busa Example: Analyze gowalla events uid | ts | lat | lon | vid -------+--------------------------+----------+-----------+-------- 46073 | 2010-10-16 12:01:40+0000 | 51.51339 | -0.13101 | 219097 46073 | 2010-10-16 11:20:02+0000 | 51.51441 | -0.141921 | 23090 46073 | 2010-10-13 12:08:19+0000 | 59.33101 | 18.06445 | 136676 46073 | 2010-10-12 23:28:43+0000 | 59.45591 | 17.81232 | 96618 46073 | 2010-09-15 18:14:13+0000 | 59.45591 | 17.81232 | 96618 Events dataset Venues dataset vid | name | lat | long --------+----------------------------+----------+----------- 754108 | My Suit NY | 40.73474 | -73.87434 249755 | UA Court Street Stadium 12 | 40.72585 | -73.99289 6919688 | Sky Asian Bistro | 40.67621 | -73.98405 Users dataset uid | id -------+-------- 1535 | 2 1535 | 1691 1535 | 8869 1535 | 23363 1535 | 52998 82694 | 7860 82694 | 8026 82694 | 10624 82694 | 30683 82694 | 57534 82694 | 75763 82694 | 76154 82694 | 76223 82694 | 88708 82694 | 151476
  • 12. @natbusa | linkedin.com: Natalino Busa Cassandra: Store all the data Spark: Analyze all the data DC1: replication factor 3 DC2: replication factor 3 DC3: replication factor 3 + Spark Executors Storage! Analytics! Data Spark and Cassandra: distributed goodness
  • 13. @natbusa | linkedin.com: Natalino Busa Cassandra - Spark Connector Cassandra: Store all the data Spark: Distributed Data Processing Executors and Workers Cassandra-Spark Connector: Data locality, Reduce Shuffling RDD’s to Cassandra Partitions DC3: replication factor 3 + Spark Executors
  • 14. @natbusa | linkedin.com: Natalino Busa Joining Tables val df_venues = cc.sql("select vid, name from lbsn.venues") .as("venues") val df_events = cc.sql("select * from lbsn.events") .as("events") val events = df_events .join(df_venues, df_events("events.vid") === df_venues("venues.vid")) .select("ts", "uid", "lat", "lon", "venues.vid","venues.name") events.count()
  • 15. @natbusa | linkedin.com: Natalino Busa Venues and Events
  • 16. @natbusa | linkedin.com: Natalino Busa Basic Statistics val venues_count = events.groupBy("name").count() venues_count.sort(venues_count("count").desc).show(5, false) +---------------------------------+-----+ |name |count| +---------------------------------+-----+ |LGA LaGuardia Airport |1679 | |JFK John F. Kennedy International|1653 | |Times Square |1089 | |Grand Central Terminal |1005 | |The Museum of Modern Art (MoMA) |392 | +---------------------------------+-----+
  • 17. @natbusa | linkedin.com: Natalino Busa K-Means Clustering import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.mllib.linalg.Vectors val locs = events.select("lat","lon") .map(s => Vectors.dense(s.getDouble(0), s.getDouble(1))) .cache() val numClusters = 20 val numIterations = 20 val clusters = KMeans.train(locs, numClusters, numIterations)
  • 18. @natbusa | linkedin.com: Natalino BusaEvents k-means clustering
  • 19. @natbusa | linkedin.com: Natalino Busa K-Means Scoring import org.apache.spark.sql.functions.udf val func = (lat:Double, lon:Double) => clusters.predict(Vectors.dense(lat,lon)) val sqlfunc = udf(func) val locs_cid = events .withColumn("cluster", sqlfunc(events("lat"), events("lon")))
  • 20. @natbusa | linkedin.com: Natalino Busa Top venues per cluster
  • 21. @natbusa | linkedin.com: Natalino Busa Storing k-means clusters top venues in Cassandra CREATE TABLE lbsn.topvenues ( cluster int, name text, count int, PRIMARY KEY (cl,name) ); locs_cid.select("cluster", "name") .groupBy("cluster", "name"). .agg(Map("name" -> "count")). .sort($"cluster", $"COUNT(name)".desc) .saveToCassandra("lbsn", "topvenues", SomeColumns("cluster", "name", "count"))
  • 22. @natbusa | linkedin.com: Natalino Busa Storing k-means clusters top venues in Cassandra CREATE TABLE lbsn.topvenues ( cluster int, name text, count int, PRIMARY KEY (cl,name) ); locs_cid.select("cluster", "name") .groupBy("cluster", "name"). .agg(Map("name" -> "count")). .sort($"cluster", $"COUNT(name)".desc) .saveToCassandra("lbsn", "topvenues", SomeColumns("cluster", "name", "count")) Works in Spark Streaming too!
  • 23. @natbusa | linkedin.com: Natalino Busa Process mining 1 2 4 5 3 6 7 8 9
  • 24. @natbusa | linkedin.com: Natalino Busa Process mining time time time 1 2 4 5 3 6 7 8 9
  • 25. @natbusa | linkedin.com: Natalino Busa Process mining time time time Source id Target id 1 3 3 8 5 2 1 2 4 5 3 6 7 8 9
  • 26. @natbusa | linkedin.com: Natalino Busa https://imgflip.com/i/155v6p
  • 27. @natbusa | linkedin.com: Natalino Busa https://imgflip.com/i/155vah
  • 28. @natbusa | linkedin.com: Natalino Busa Process flow: Sankey Diagrams http://sankeymatic.com/build/ by Steve Bogart (Twitter: @nowthis).
  • 29. @natbusa | linkedin.com: Natalino Busa Flow network: Cord diagram size id label 0 4075 135 Times Square 1 3173 10 Empire State Building 2 3055 89 The Museum of Modern Art (MoMA) 3 3041 4 Columbus Circle 4 2831 116 Madison Square Garden 6 2486 54 Grand Central Terminal Circos - http://circos.ca/ By Martin Krzywinski
  • 30. @natbusa | linkedin.com: Natalino Busa Page Ranking clusters import org.graphframes.GraphFrame val g = GraphFrame(df_vertices, df_edges) val results = g.pageRank.resetProbability(0.01).maxIter(20).run() val results_sorted = results .vertices .select("id", "pagerank") .sort($"pagerank".desc) results_sorted.head(20) http://graphframes.github.io/
  • 31. @natbusa | linkedin.com: Natalino Busa Venues Importance: Page Rank
  • 32. @natbusa | linkedin.com: Natalino Busa 10 yrs 5 yrs 1 yr 1 month 1 day 1hour 1m time population:events,transactions, sessions,customers,etc event streams recent data historical big data Big Data and Fast Data
  • 33. @natbusa | linkedin.com: Natalino Busa Why Fast Data? 1. Relevant up-to-date information. 2. Delivers actionable events.
  • 34. @natbusa | linkedin.com: Natalino Busa Why Big Data? 1. Analyze and model 2. Learn, cluster, categorize, organize facts
  • 35. @natbusa | linkedin.com: Natalino Busa Batch -> Streaming ETL Hive -> Flink, Akka, Spark, Kafka
  • 36. @natbusa | linkedin.com: Natalino Busa Stream-Centric Architectures
  • 37. @natbusa | linkedin.com: Natalino Busa IngestData Streaming ETL 37 Distributed Data Store Advanced Analytics Real Time APIs Event Streams/Logs Streaming Data Data Modeling Data Sources, Files, DB extracts Batched Data Alerts and Notifications API for mobile and web read the model read the data write the model Microservices Microservices on ML on SQL on Streams
  • 38. @natbusa | linkedin.com: Natalino Busa Data Science: Anomaly Detection An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. Hawkins, 1980 Anomaly detection is the search for items or events which do not conform to the expected pattern Chandola, 2009
  • 39. @natbusa | linkedin.com: Natalino Busa Anomaly Detection Parametric BasedGaussian Model Based Histogram, nonparametric
  • 40. @natbusa | linkedin.com: Natalino Busa Clustering with k-means Distance based, space fully covered Not a silver bullet though Intuition:
  • 41. @natbusa | linkedin.com: Natalino Busa Clustering users’ venues Intuition: Users tend to stick in the same places People have habits By clustering the places together We can identify anomalous locations Size of the cluster matters More points means less anomalous Mini-clusters and single anomalies are treated in similar ways ...
  • 42. @natbusa | linkedin.com: Natalino Busa Data Science: clustering with DBSCAN DBSCAN find clusters based on neighbouring density Does not require the number of cluster k beforehand. Clusters are not spherical
  • 43. @natbusa | linkedin.com: Natalino Busa Data Science: clustering users’ venues val locs = checkins_venues.select("uid", "lat","lon") .map(s => (s.getLong(0), Seq( (s.getDouble(1), s.getDouble(2)) )) .reduceByKey(_ ++ _) .mapValues( gdbscan cluster _ ) Refernce: scalanlp/nak
  • 44. @natbusa | linkedin.com: Natalino Busa User events DBSCAN clustering
  • 45. @natbusa | linkedin.com: Natalino Busa Resources Spark + Cassandra: Clustering Events http://www.natalinobusa.com/2015/07/clustering-check-ins-with-spark-and.html Spark: Machine Learning, SQL frames https://spark.apache.org/docs/latest/mllib-guide.html https://spark.apache.org/docs/latest/sql-programming-guide.html Datastax: Analytics and Spark connector http://www.slideshare.net/doanduyhai/spark-cassandra-connector-api-best-practices-and-usecases http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/anaHome/anaHome.html Anomaly Detection Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey"(PDF). ACM Computing Surveys 41 (3): 1. doi:10.1145/1541880.1541882.
  • 46. @natbusa | linkedin.com: Natalino Busa Datasets https://snap.stanford.edu/data/loc-gowalla.html E. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: Friendship and Mobility: User Movement in Location-Based Social Networks ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011 https://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.zip The project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant PTDC/EIA-EIA/109840/2009. . Pictures: https://commons.wikimedia.org Resources
  • 47. @natbusa | linkedin.com: Natalino Busa Spark ML Tutorial today at 15.15 Capital Lounge 3, second floor