SlideShare a Scribd company logo
1 of 29
Download to read offline
BUILDING  A  GRAPH  OF  ALL  
U.S.  BUSINESSES  USING  
SPARK  TECHNOLOGIES.  
Alexis  Roos @alexisroos
Engineering  manager  /  Lead  Data  Science  Engineer
Radius  Intelligence
AGENDA
• Radius  Intelligence
• Objectives  and  terminology
• Business  graph  data  pipeline
• Lessons  learned
• Q&As
2
Radius  Messaging  Framework
June  8,  2015  
Radius transforms the way B2B
marketers discover markets, acquire
customers, and measure success.
RADIUS  INTELLIGENCE
Our software is powered by the
Radius Intelligence Cloud
a proprietary data science engine
which provides predictive analytics,
powerful segmentation, and
seamless integrations.
Predictive Marketing software
RADIUS  INTELLIGENCE
➔ Founded  in  2009
➔ 125+  employees (and  growing)
➔ Headquartered  in  San  Francisco,  CA
➔ $125  million in  funding  to  date
➔ Cutting  edge  in  Data  Science and  Data  
Engineering  Technologies
4
Produce  an  accurate and comprehensive  Business  Graph from  records coming  from  
dozens  of  sources  comprising  hundreds  of  data  points  such  as:
BUSINESS  GRAPH  OBJECTIVE
5
…
Answer  complex  connected  questions  such  as:
6
BUSINESS  GRAPH  BENEFITS
• Employees:  Find  engineering  executives  at  a  given  company  
focusing  on  Big  Data  or  security
• News:  Show  News related  to  companies  in  my  marketing  segment  
such  as  Funding  announcements  or  mergers  and  acquisitions
• Intent:  Show  companies  with  a  buying  Intent for  my  product  or  
engineering  managers  looking  for  security  product
• Organization:  Show  all  locations  or  Employees  for  a  given  business
etc
o Business:  a  company  or  organization
ex:  Blue  Bottle  company
BUSINESS  GRAPH  TERMINOLOGY
7
E I
N T
o Location:  a  business  at  a  particular  address
ex:  Blue  Bottle  at  Sansome St,  San  Francisco
o Attributes:  information  associated  with  a
business  and/or  location
ex:  address,  website,  phone  number,  etc.
o Additional  Data  types:
ex:  Employees,  Intent,  News,  Technologies,  etc.
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation Clustering ConstructionData  acquisition
8
Dozens sources
7B+ records
50B+ Data  Points
• Crawling
• Licensing
Raw  Records
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation Clustering ConstructionData  acquisition
9
• Standardization
• Validation
• Normalization
Normalized  RecordsRaw  Records
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation Clustering ConstructionData  acquisition
10
Cluster  records  on:
• Address
• Name  of  the  business
• Websites
• Phone  numbers
• Employees
• Description
• Category  codes
• Etc.
Clusters  of  records  by  
address/business
Normalized  Records
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation Clustering Construction
-­ Locations      
-­ Businesses  
&  more
Data  acquisition
11
Construct  Locations:
• Filter  bad  clusters
• Use  all  records  in  cluster  to  
create  Location  with  attributes:
o Select  best  value(s)
o Score  values:  confidence  
based  on  specific  record  
features  including  historical
o Impute  missing  values
• Etc.
Locations
MLLib
Clusters
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation ClusteringData  acquisition Construction
Businesses  
&  more
12
Construct  business  &  more:
• Create  business  and  link  
locations
• Add  business  level  info:  
news,  employees,  intent,  etc.
• Correct/  impute  fields  at  
business  level
• Etc.
Locations Business  graph
MLLib
E
I
N
T
E
GraphX models  a  “property  graph”  which is  an  multi  directed,  attributed graph.
SPARK  /  GRAPHX
13
(1,  Location) (2,  Domain)
(1,  2,  source1)
(3,  Phone)
(1,  2,  source2)
class Graph[VD,  ED]  {
val vertices: VertexRDD[VD]
val edges: EdgeRDD[ED]
val triplets: RDD[EdgeTriplet[VD,  ED]]  
}
VertexRDD[VD] extends RDD[(VertexID,  VD)]
EdgeRDD[ED] extends RDD[Edge[ED]]
VD  -­ Vertex  data attribute type
ED  -­ Edge data attribute type
Graph  operations  are  like  RDD  operations:  functional   and  lazily  evaluated
SPARK  /  GRAPHX
14
Operations  from  Graph  and  GraphOps:
• Information:  numEdges,  numVertices,  inDegrees,  outDegrees,  degrees
• Collections:  vertices,  edges,  triplets
• Transformations  and  modifications:  mapVertices,  mapEdges,  mapTriplets,
filter,  reverse,  subgraph,  mask,  groupEdges,  convertToCanonicalEdges
• Join:  joinVertices,  outerJoinVertices
• Aggregations:  collectNeighbor(Id)s,  aggregateMessages,  sendMsg,  mergeMsg,  tripletFields
• Caching  and  partitioning: persist,  cache,  checkpoint,  unpersist(Vertices),  partitionBy
• Graph  algorithms:  pregel,  pageRank,  connectedComponents,  triangleCount
And  data  operations  on  VertexRDD and  EdgeRDD:
• VertexRDD:  filter,  mapValues,  minus,  diff,  innerJoin,  leftJoin,  aggregateUsingIndex
• EdgeRDD:  mapValues,  reverse,  innerJoin
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
Data  preparation ClusteringData  acquisition
Construct  business  graph:
• Create  business  and  link  
locations
• Add  business  level  info:  
news,  employees,  intent,  
etc.
• Correct/  impute  fields  at  
business  level
• Etc.
Locations Business  graph
Construction
Businesses  
&  more
15
E
I
N
T
E
1. Create  vertex  for  each  location  and  each  known  business
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
16
VertexRDD[Location]
Clusters
map
VertexRDD[Business]
Curated  list
map
2. Link  Locations  to  Known  Business  per  relationships  (such  as  domain)
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
flatMap
flatMap
attributes
Create
Edges
Location Business
17
Locations
Businesses
join
3.  Create  vertex  for  unique  top  attributes  such  as  address,  phone,  domain,  etc.
and  link  locations  to  these  attributes
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
flatMap
attributes
groupBy
Create
Vertices
& Edges
Location  and  attributes
18
Locations
Aggregate  messages:  
sendMsg,  mergeMsg
SPARK  /  GRAPHX
Seq
19
sendMsg:  (srcID,  name) mergeMsg:  x  ++  y
Location  1
Location  2
Domain
sendMsg mergeMsg
4.  Create  location  to  location  edges  using  attributes  relationship  and  graph  cutting
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
Locations  and  attributes
aggregate
Messages
.combinations(2)
20
Locations
.filter(isSimilar)  *
*  such  as  name  matching
Note:  .combinations  (2)  and  bottleneck
PERFORMANCE  TIP
.combinations(2)
O(n):  N^2
5
1
4
2
3
21
5
1
4
2
3
smarter  combination
O(n):  N
1 2 3 4 5
SPARK  /  GRAPHX
22
Connected  components
Group  1
Group  2
connected
Components
Locations
5.  Call  connected  components  on  linked  locations  and  create  businesses
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
connected
Components
Locations
Location Business
Create
Vertices
& Edges
23
GraphX Connected  components  implementation  already  performs  optimizations:
– Incremental  Updates  to  Mirror  Caches
– Join  Elimination
– Index  Scanning  for  Active  Sets
– Local  Vertex  and  Edge  Indices
– Index  and  Routing  Table  Reuse
But  algorithm  is  iterative  so  optimization  matters:
PERFORMANCE  TIP
24
• Vertices  and  edges  should  be  in  memory:  more  efficient  to
subset  /  keep  the  graph  minimal  or  filter  and  rejoin  later
• Check-­pointing  improves  performance  (less  lineage)
6.  More  steps:
• Assign  headquarter
• Join  additional  information:  Employees,  News,  Intent,  etc.
• Further  corrections  /  imputations:  attributes,  revenues,  etc.
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
Location Business
more
operations
25
Business  graph
E
I
N
T
E
GRAPHICAL  VIEW
26
From  Locations  with  attributes  to  Locations  with  Business
GRAPHICAL  VIEW
27
From  Locations  with  attributes  to  Locations  with  Business
• Using  a  Graph  allows  us  to  naturally  model  business  relationships
and  append  new  data  types  over  time.  Also  allowed  us  to  debug  data!
LESSONS  LEARNED
28
• GraphX scales:  250M  vertices  and  1b  edges  in  clustering
• Creating  and  updating  the  graph  is  easy  using  RDD  operations
• Edge  cutting  is  an  art  as  much  as  a  science
• Connected  Components  is  an  expensive  operation
THANK  YOU.
WE  ARE  HIRING!
http://radius.com/jobs/
Alexis  Roos
alexis@radius.com
@alexisroos

More Related Content

What's hot

Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkDatabricks
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Databricks
 
Spark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan PanayotovSpark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan PanayotovDatabricks
 
Spark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Databricks
 
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...Databricks
 
Building Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataBuilding Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataDatabricks
 
Gender Prediction with Databricks AutoML Pipeline
Gender Prediction with Databricks AutoML PipelineGender Prediction with Databricks AutoML Pipeline
Gender Prediction with Databricks AutoML PipelineDatabricks
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLJen Aman
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowDatabricks
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit
 
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, BlazegraphDatabase Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph✔ Eric David Benari, PMP
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleDatabricks
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkDatabricks
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkSpark Summit
 
Frequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on EmbeddingsFrequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on EmbeddingsDatabricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 

What's hot (20)

Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
 
Spark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan PanayotovSpark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
 
Spark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier Aguedes
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
 
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
 
Building Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataBuilding Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous Data
 
Big Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on SparkBig Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on Spark
 
Gender Prediction with Databricks AutoML Pipeline
Gender Prediction with Databricks AutoML PipelineGender Prediction with Databricks AutoML Pipeline
Gender Prediction with Databricks AutoML Pipeline
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, BlazegraphDatabase Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At Scale
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySpark
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
Frequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on EmbeddingsFrequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on Embeddings
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
 

Viewers also liked

Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Spark Summit
 
GraphX and Pregel - Apache Spark
GraphX and Pregel - Apache SparkGraphX and Pregel - Apache Spark
GraphX and Pregel - Apache SparkAshutosh Trivedi
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Spark Summit
 
Using spark for timeseries graph analytics
Using spark for timeseries graph analyticsUsing spark for timeseries graph analytics
Using spark for timeseries graph analyticsSigmoid
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Spark Summit
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXAndrea Iacono
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Machine learning in finance using python
Machine learning in finance using pythonMachine learning in finance using python
Machine learning in finance using pythonEric Tham
 
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)Takayuki Shimizukawa
 
快快樂樂成為 Coding Ninja (by pytest) @ PyConAPAC2015
快快樂樂成為 Coding Ninja (by pytest) @ PyConAPAC2015快快樂樂成為 Coding Ninja (by pytest) @ PyConAPAC2015
快快樂樂成為 Coding Ninja (by pytest) @ PyConAPAC2015Chun-Yu Tseng
 
Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)Andrei Nikolaenko
 
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )Shamim bhuiyan
 
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...Spark Summit
 
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++Антон Шестаков
 
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike GualtieriSpark Summit
 
Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017Linkurious
 

Viewers also liked (20)

Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
 
GraphX and Pregel - Apache Spark
GraphX and Pregel - Apache SparkGraphX and Pregel - Apache Spark
GraphX and Pregel - Apache Spark
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
 
Using spark for timeseries graph analytics
Using spark for timeseries graph analyticsUsing spark for timeseries graph analytics
Using spark for timeseries graph analytics
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphX
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Machine learning in finance using python
Machine learning in finance using pythonMachine learning in finance using python
Machine learning in finance using python
 
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
 
快快樂樂成為 Coding Ninja (by pytest) @ PyConAPAC2015
快快樂樂成為 Coding Ninja (by pytest) @ PyConAPAC2015快快樂樂成為 Coding Ninja (by pytest) @ PyConAPAC2015
快快樂樂成為 Coding Ninja (by pytest) @ PyConAPAC2015
 
Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)
 
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
 
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
 
Apache spark
Apache sparkApache spark
Apache spark
 
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
 
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
 
Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017
 

Similar to Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos

OLAP Cubes in Datawarehousing
OLAP Cubes in DatawarehousingOLAP Cubes in Datawarehousing
OLAP Cubes in DatawarehousingPrithwis Mukerjee
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy Neo4j
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
Roadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyRoadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyNeo4j
 
L’architettura di classe enterprise di nuova generazione
L’architettura di classe enterprise di nuova generazioneL’architettura di classe enterprise di nuova generazione
L’architettura di classe enterprise di nuova generazioneMongoDB
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureMark Kromer
 
Graph all the things - PRathle
Graph all the things - PRathleGraph all the things - PRathle
Graph all the things - PRathleNeo4j
 
Introduction: Relational to Graphs
Introduction: Relational to GraphsIntroduction: Relational to Graphs
Introduction: Relational to GraphsNeo4j
 
Lightning-fast Analytics for Workday transactional data
Lightning-fast Analytics for Workday transactional dataLightning-fast Analytics for Workday transactional data
Lightning-fast Analytics for Workday transactional dataPavel Hardak
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Databricks
 
GraphTalk Copenhagen - Introduction to Graphs and Neo4j
GraphTalk Copenhagen - Introduction to Graphs and Neo4jGraphTalk Copenhagen - Introduction to Graphs and Neo4j
GraphTalk Copenhagen - Introduction to Graphs and Neo4jNeo4j
 
L’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneL’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneMongoDB
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in sparkChester Chen
 
Managing data analytics in a hybrid cloud
Managing data analytics in a hybrid cloudManaging data analytics in a hybrid cloud
Managing data analytics in a hybrid cloudKaran Singh
 
RahulSoni_ETL_resume
RahulSoni_ETL_resumeRahulSoni_ETL_resume
RahulSoni_ETL_resumeRahul Soni
 

Similar to Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos (20)

OLAP Cubes in Datawarehousing
OLAP Cubes in DatawarehousingOLAP Cubes in Datawarehousing
OLAP Cubes in Datawarehousing
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael Moore
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Roadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyRoadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph Strategy
 
L’architettura di classe enterprise di nuova generazione
L’architettura di classe enterprise di nuova generazioneL’architettura di classe enterprise di nuova generazione
L’architettura di classe enterprise di nuova generazione
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
Graph all the things - PRathle
Graph all the things - PRathleGraph all the things - PRathle
Graph all the things - PRathle
 
Introduction: Relational to Graphs
Introduction: Relational to GraphsIntroduction: Relational to Graphs
Introduction: Relational to Graphs
 
Lightning-fast Analytics for Workday transactional data
Lightning-fast Analytics for Workday transactional dataLightning-fast Analytics for Workday transactional data
Lightning-fast Analytics for Workday transactional data
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
 
GraphTalk Copenhagen - Introduction to Graphs and Neo4j
GraphTalk Copenhagen - Introduction to Graphs and Neo4jGraphTalk Copenhagen - Introduction to Graphs and Neo4j
GraphTalk Copenhagen - Introduction to Graphs and Neo4j
 
L’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneL’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova Generazione
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in spark
 
Managing data analytics in a hybrid cloud
Managing data analytics in a hybrid cloudManaging data analytics in a hybrid cloud
Managing data analytics in a hybrid cloud
 
RahulSoni_ETL_resume
RahulSoni_ETL_resumeRahulSoni_ETL_resume
RahulSoni_ETL_resume
 
Sandeep Grandhi (1)
Sandeep Grandhi (1)Sandeep Grandhi (1)
Sandeep Grandhi (1)
 
Big Data Modeling
Big Data ModelingBig Data Modeling
Big Data Modeling
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 

Recently uploaded (20)

Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 

Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos

  • 1. BUILDING  A  GRAPH  OF  ALL   U.S.  BUSINESSES  USING   SPARK  TECHNOLOGIES.   Alexis  Roos @alexisroos Engineering  manager  /  Lead  Data  Science  Engineer Radius  Intelligence
  • 2. AGENDA • Radius  Intelligence • Objectives  and  terminology • Business  graph  data  pipeline • Lessons  learned • Q&As 2
  • 3. Radius  Messaging  Framework June  8,  2015   Radius transforms the way B2B marketers discover markets, acquire customers, and measure success. RADIUS  INTELLIGENCE Our software is powered by the Radius Intelligence Cloud a proprietary data science engine which provides predictive analytics, powerful segmentation, and seamless integrations. Predictive Marketing software
  • 4. RADIUS  INTELLIGENCE ➔ Founded  in  2009 ➔ 125+  employees (and  growing) ➔ Headquartered  in  San  Francisco,  CA ➔ $125  million in  funding  to  date ➔ Cutting  edge  in  Data  Science and  Data   Engineering  Technologies 4
  • 5. Produce  an  accurate and comprehensive  Business  Graph from  records coming  from   dozens  of  sources  comprising  hundreds  of  data  points  such  as: BUSINESS  GRAPH  OBJECTIVE 5 …
  • 6. Answer  complex  connected  questions  such  as: 6 BUSINESS  GRAPH  BENEFITS • Employees:  Find  engineering  executives  at  a  given  company   focusing  on  Big  Data  or  security • News:  Show  News related  to  companies  in  my  marketing  segment   such  as  Funding  announcements  or  mergers  and  acquisitions • Intent:  Show  companies  with  a  buying  Intent for  my  product  or   engineering  managers  looking  for  security  product • Organization:  Show  all  locations  or  Employees  for  a  given  business etc
  • 7. o Business:  a  company  or  organization ex:  Blue  Bottle  company BUSINESS  GRAPH  TERMINOLOGY 7 E I N T o Location:  a  business  at  a  particular  address ex:  Blue  Bottle  at  Sansome St,  San  Francisco o Attributes:  information  associated  with  a business  and/or  location ex:  address,  website,  phone  number,  etc. o Additional  Data  types: ex:  Employees,  Intent,  News,  Technologies,  etc.
  • 8. BUSINESS  GRAPH  DATA  PIPELINE Data  preparation Clustering ConstructionData  acquisition 8 Dozens sources 7B+ records 50B+ Data  Points • Crawling • Licensing Raw  Records
  • 9. BUSINESS  GRAPH  DATA  PIPELINE Data  preparation Clustering ConstructionData  acquisition 9 • Standardization • Validation • Normalization Normalized  RecordsRaw  Records
  • 10. BUSINESS  GRAPH  DATA  PIPELINE Data  preparation Clustering ConstructionData  acquisition 10 Cluster  records  on: • Address • Name  of  the  business • Websites • Phone  numbers • Employees • Description • Category  codes • Etc. Clusters  of  records  by   address/business Normalized  Records
  • 11. BUSINESS  GRAPH  DATA  PIPELINE Data  preparation Clustering Construction -­ Locations       -­ Businesses   &  more Data  acquisition 11 Construct  Locations: • Filter  bad  clusters • Use  all  records  in  cluster  to   create  Location  with  attributes: o Select  best  value(s) o Score  values:  confidence   based  on  specific  record   features  including  historical o Impute  missing  values • Etc. Locations MLLib Clusters
  • 12. BUSINESS  GRAPH  DATA  PIPELINE Data  preparation ClusteringData  acquisition Construction Businesses   &  more 12 Construct  business  &  more: • Create  business  and  link   locations • Add  business  level  info:   news,  employees,  intent,  etc. • Correct/  impute  fields  at   business  level • Etc. Locations Business  graph MLLib E I N T E
  • 13. GraphX models  a  “property  graph”  which is  an  multi  directed,  attributed graph. SPARK  /  GRAPHX 13 (1,  Location) (2,  Domain) (1,  2,  source1) (3,  Phone) (1,  2,  source2) class Graph[VD,  ED]  { val vertices: VertexRDD[VD] val edges: EdgeRDD[ED] val triplets: RDD[EdgeTriplet[VD,  ED]]   } VertexRDD[VD] extends RDD[(VertexID,  VD)] EdgeRDD[ED] extends RDD[Edge[ED]] VD  -­ Vertex  data attribute type ED  -­ Edge data attribute type
  • 14. Graph  operations  are  like  RDD  operations:  functional   and  lazily  evaluated SPARK  /  GRAPHX 14 Operations  from  Graph  and  GraphOps: • Information:  numEdges,  numVertices,  inDegrees,  outDegrees,  degrees • Collections:  vertices,  edges,  triplets • Transformations  and  modifications:  mapVertices,  mapEdges,  mapTriplets, filter,  reverse,  subgraph,  mask,  groupEdges,  convertToCanonicalEdges • Join:  joinVertices,  outerJoinVertices • Aggregations:  collectNeighbor(Id)s,  aggregateMessages,  sendMsg,  mergeMsg,  tripletFields • Caching  and  partitioning: persist,  cache,  checkpoint,  unpersist(Vertices),  partitionBy • Graph  algorithms:  pregel,  pageRank,  connectedComponents,  triangleCount And  data  operations  on  VertexRDD and  EdgeRDD: • VertexRDD:  filter,  mapValues,  minus,  diff,  innerJoin,  leftJoin,  aggregateUsingIndex • EdgeRDD:  mapValues,  reverse,  innerJoin
  • 15. GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN Data  preparation ClusteringData  acquisition Construct  business  graph: • Create  business  and  link   locations • Add  business  level  info:   news,  employees,  intent,   etc. • Correct/  impute  fields  at   business  level • Etc. Locations Business  graph Construction Businesses   &  more 15 E I N T E
  • 16. 1. Create  vertex  for  each  location  and  each  known  business GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN 16 VertexRDD[Location] Clusters map VertexRDD[Business] Curated  list map
  • 17. 2. Link  Locations  to  Known  Business  per  relationships  (such  as  domain) GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN flatMap flatMap attributes Create Edges Location Business 17 Locations Businesses join
  • 18. 3.  Create  vertex  for  unique  top  attributes  such  as  address,  phone,  domain,  etc. and  link  locations  to  these  attributes GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN flatMap attributes groupBy Create Vertices & Edges Location  and  attributes 18 Locations
  • 19. Aggregate  messages:   sendMsg,  mergeMsg SPARK  /  GRAPHX Seq 19 sendMsg:  (srcID,  name) mergeMsg:  x  ++  y Location  1 Location  2 Domain sendMsg mergeMsg
  • 20. 4.  Create  location  to  location  edges  using  attributes  relationship  and  graph  cutting GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN Locations  and  attributes aggregate Messages .combinations(2) 20 Locations .filter(isSimilar)  * *  such  as  name  matching
  • 21. Note:  .combinations  (2)  and  bottleneck PERFORMANCE  TIP .combinations(2) O(n):  N^2 5 1 4 2 3 21 5 1 4 2 3 smarter  combination O(n):  N 1 2 3 4 5
  • 22. SPARK  /  GRAPHX 22 Connected  components Group  1 Group  2 connected Components Locations
  • 23. 5.  Call  connected  components  on  linked  locations  and  create  businesses GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN connected Components Locations Location Business Create Vertices & Edges 23
  • 24. GraphX Connected  components  implementation  already  performs  optimizations: – Incremental  Updates  to  Mirror  Caches – Join  Elimination – Index  Scanning  for  Active  Sets – Local  Vertex  and  Edge  Indices – Index  and  Routing  Table  Reuse But  algorithm  is  iterative  so  optimization  matters: PERFORMANCE  TIP 24 • Vertices  and  edges  should  be  in  memory:  more  efficient  to subset  /  keep  the  graph  minimal  or  filter  and  rejoin  later • Check-­pointing  improves  performance  (less  lineage)
  • 25. 6.  More  steps: • Assign  headquarter • Join  additional  information:  Employees,  News,  Intent,  etc. • Further  corrections  /  imputations:  attributes,  revenues,  etc. GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN Location Business more operations 25 Business  graph E I N T E
  • 26. GRAPHICAL  VIEW 26 From  Locations  with  attributes  to  Locations  with  Business
  • 27. GRAPHICAL  VIEW 27 From  Locations  with  attributes  to  Locations  with  Business
  • 28. • Using  a  Graph  allows  us  to  naturally  model  business  relationships and  append  new  data  types  over  time.  Also  allowed  us  to  debug  data! LESSONS  LEARNED 28 • GraphX scales:  250M  vertices  and  1b  edges  in  clustering • Creating  and  updating  the  graph  is  easy  using  RDD  operations • Edge  cutting  is  an  art  as  much  as  a  science • Connected  Components  is  an  expensive  operation
  • 29. THANK  YOU. WE  ARE  HIRING! http://radius.com/jobs/ Alexis  Roos alexis@radius.com @alexisroos