11. Data resources
Beyond the data warehouse
> 10,000 Superset charts and dashboards
> 6,000 Experiments and metrics
> 6,000 Tableau workbooks and charts
> 1,500 Knowledge posts
15. > 20 offices around the world
Portland
San Francisco
Los Angeles
Toronto
New York
Miami
Sao Paulo
Dublin
London
Paris
Barcelona
Berlin
Milan
Copenhagen
New Delhi
Seoul
Beijing
Tokyo
Sydney
Singapore
Washington, DC
34. Databases: 6
APIs: 4
Airflow DAG: 1
We leverage all these data resources to build a graph in Hive comprising nodes and relationships
The workflow runs every day, though the graph is left to soak to prevent flickering
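The slide does not say how the soak works. One possible reading, sketched below purely as an assumption, is that a node which goes missing from a daily run is kept until it has been absent for a set number of days, so a one-day upstream hiccup does not make it vanish and reappear. The function name `soak`, the `last_seen` mapping, and the 7-day window are all hypothetical.

```python
from datetime import date, timedelta


def soak(last_seen, todays_nodes, today, soak_days=7):
    """Return the node IDs to keep after today's run.

    last_seen maps node ID -> date it was last observed (updated in place).
    soak_days is a hypothetical window, not a value from the talk.
    """
    # Record every node observed in today's run.
    for node_id in todays_nodes:
        last_seen[node_id] = today
    cutoff = today - timedelta(days=soak_days)
    # Keep nodes seen after the cutoff; drop those missing for
    # soak_days or longer.
    return {n for n, seen in last_seen.items() if seen > cutoff}
```

With this reading, a node that is missing for a day or two stays in the graph, and only a sustained absence removes it.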
37. Persistent vs. transient relationships
Persistent relationships represent a snapshot in time
createdSpoke 3
38. Persistent vs. transient relationships
Transient relationships represent events which are somewhat sporadic in nature
M Tu W Th F
consumedSpoke 3
39. The winding data path
Airflow: Data transfer
Python: Graph datastore
neo4j-driver: Python Neo4j driver
Neo4j: Graph database
GraphAware: Neo4j/Elasticsearch plugin
Elasticsearch: Search engine
Flask: Python web framework
Hive: Data warehouse
50. Why we chose Neo4j for our database
The four main reasons
Logical
Given that our data is represented as a graph, it is logical to use a graph database to store it
Nimble
Performance wins over relational databases when dealing with connected data
Popular
It is the world’s leading graph database and the community edition is free
Integrative
It integrates well with Python and Elasticsearch
51. The Neo4j and Elasticsearch symbiotic relationship
Courtesy of two GraphAware plugins
Neo4j plugin
Provides bi-directional integration which transparently and asynchronously replicates data from Neo4j to Elasticsearch
Elasticsearch plugin
Enables Elasticsearch to consult the Neo4j database during a search query to enrich the search rankings by leveraging the graph topology
55. From local to global uniqueness
A mechanism to reference nodes in an abstract manner
GraphAware UUID plugin
Transparently assigns to newly created elements (nodes and relationships) a globally unique UUID property that cannot be changed or deleted
Globally unique
Enables us to uniquely identify a single node via the Entity label and UUID property, which allows for parameterized queries and, in turn, faster query execution
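A minimal sketch of such a parameterized lookup follows. It assumes the UUID is stored in a property named `uuid` and that `session` behaves like a neo4j-driver session, whose `run` method accepts a query string and a dict of parameters. The function and constant names are hypothetical.

```python
# The query text is constant; only the parameter value varies per call.
# The property name "uuid" is an assumption.
FIND_BY_UUID = "MATCH (n:Entity {uuid: $uuid}) RETURN n"


def find_by_uuid(session, uuid):
    """Look up a single node by the Entity label and its UUID."""
    # Passing the value as a parameter, rather than formatting it into the
    # query string, lets the database reuse the plan for the same query text.
    return session.run(FIND_BY_UUID, {"uuid": uuid})
```

Because every node carries the Entity label, one query shape covers all node types.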
61. User personas
Daphne Data
Technical data power user; the epitome of a tribal knowledge holder
Manager Mel
Less data literate; needs to keep tabs on her team’s resources
Nathan New
New employee, new team, or new to data; has no idea what’s going on
62. Designing for data exploration, discovery, and trust
Search
Resource details & metadata
User data
Group data
Company data
65. Surface relationships, everything’s a link to promote exploration
Metadata & consumption
Description, external link, social
66. Column details & value distributions
Table lineage
Enrich metadata on the fly
68. User details & metadata
What they make, what they consume
69. Former employees also hold tribal knowledge
70. Group overview
Thumbnails for maximum context
Basic organization functionality
Pinterest-like curation & suggested content
71. We gather over 15,000 thumbnails from
Tableau, Superset, and the Knowledge Repo
76. The challenges
Proxy nodes
Abstracting complexity where necessary while accurately modeling the data ecosystem
Graph merging
Non-trivial Git-like merging of graph updates
Data-dense design
Balancing simplicity and functionality is hard; most internal design resources are not made for data-rich apps
Complex dependencies
An umbrella data tool is vulnerable to changes in upstream resource dependencies
78. The future
Gamification
Provide content producers with a sense of value
Alerts & recommendations
Move from active exploration to delivering relevant updates and content suggestions
Certified content
Use certification to build trust and enable users to filter through a sea of stale content
Network analysis
Determine obsolete nodes, critical paths, lines of communication, etc.
80. The Dataportal team
Analytics & Experimentation Products
John Bodley
Software Engineer
Eli Brumbaugh
Experience Designer
Jeff Feng
Product Manager
Michelle Thomas
Software Engineer
Chris Williams
Data Visualization
90. Efficient data retrieval
Restrictions and workarounds with Neo4j indexes
Problem
Indexes provide efficient data retrieval, similar to an RDBMS primary key; however, they are defined only for a single label, as opposed to our tuple of hierarchical labels
Solution
Create an index for every label, keyed by the ID and UUID properties, which together with index hints provides optimal node retrieval
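A rough sketch of the workaround: generate one index statement per label and property, then name the index explicitly with a `USING INDEX` hint at query time. The label list and the property names `id` and `uuid` are assumptions, as is the choice of one single-property index per pair. The `CREATE INDEX FOR ... ON ...` form is the syntax of recent Neo4j releases; older releases used `CREATE INDEX ON :Label(property)`.

```python
# Hypothetical labels and assumed property names, for illustration only.
LABELS = ("Entity", "User", "Table")
PROPERTIES = ("id", "uuid")


def index_statements(labels=LABELS, properties=PROPERTIES):
    """Build one CREATE INDEX statement per (label, property) pair."""
    # Labels cannot be passed as Cypher parameters, so they are formatted
    # into the statement; only trusted, known label names belong here.
    return [
        f"CREATE INDEX FOR (n:{label}) ON (n.{prop})"
        for label in labels
        for prop in properties
    ]


def lookup_with_hint(label, prop="uuid"):
    """Build a parameterized lookup that names the index via a hint."""
    return (
        f"MATCH (n:{label}) USING INDEX n:{label}({prop}) "
        f"WHERE n.{prop} = ${prop} RETURN n"
    )
```

The hint tells the planner which label's index to start from when a node carries several labels.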