Wanderu is a consumer-focused search engine for buses and trains. Eddy will recount the architectural, modeling and other technical “lessons learned” and “lessons unlearned” in implementing our geospatial and search features using Neo4j in the context of a NoSQL polyglot solution.
4. From pt A to pt B
A Shortest Path Problem as a function of depart, arrive, price, duration, date times
Nomenclature: Stations, Trips
[Diagram: A: NYC, Philly, B: DC, with trips labeled MEG, $9, 11/07/2013; MEG, $4, 11/07/2013; BOLT, $13, 11/07/2013]
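A minimal sketch of the shortest-path framing, using price as the only weight. The fares come from the slide, but which carrier serves which leg is an assumption, and `cheapest_route` is a hypothetical helper, not Wanderu's code. A real search would also weigh duration and departure/arrival times.

```python
import heapq

# Hypothetical trips: (depart, arrive, carrier, price_usd).
# The leg assignment is assumed; the slide only lists the three fares.
TRIPS = [
    ("NYC", "PHL", "MEG", 9),
    ("PHL", "DC", "MEG", 4),
    ("NYC", "DC", "BOLT", 13),
]

def cheapest_route(trips, origin, destination):
    """Dijkstra's algorithm with price as the edge weight.

    Returns (total_price, legs) or None if no route exists.
    """
    graph = {}
    for depart, arrive, carrier, price in trips:
        graph.setdefault(depart, []).append((arrive, carrier, price))

    # Heap entries: (total price so far, station, legs taken so far)
    heap = [(0, origin, [])]
    settled = set()
    while heap:
        cost, station, legs = heapq.heappop(heap)
        if station == destination:
            return cost, legs
        if station in settled:
            continue
        settled.add(station)
        for arrive, carrier, price in graph.get(station, []):
            if arrive not in settled:
                leg = (station, arrive, carrier, price)
                heapq.heappush(heap, (cost + price, arrive, legs + [leg]))
    return None

result = cheapest_route(TRIPS, "NYC", "DC")
print(result[0])  # total price in dollars
```

With this assumed leg assignment the direct trip and the two-leg route via Philly cost the same, so only the total price is meaningful here.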
6. Our Story
• 2 yr startup, Tech started about 1+ yr ago
• Beta in Mar 2013, Launch in Aug 2013
• Knew nothing about Neo4j when we started (Jun 2012)
• Did not like the relational model: wanted schema-less and no self-joins
• Wanted a graph model
9. Our Situation
• Data is written only in one direction
• Users search for paths, then segments
• Searches are done by date
• Needed online capability
• Trip info (price/avail) could change on some
11. MongoConnector
• MongoDB Lab project, open source, unsupported
• Uses Replica Mechanism: Oplog
• Eventually Consistent (not real time)
• Written in Python
• Main methods: Upserts and Deletes, passes doc
• Implement DocMgr->Neo4jDocMgr->py2neo
• We can add new properties easily on the fly
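The upsert/delete flow above can be sketched roughly as follows. This is an illustration only: `InMemoryGraph` and `GraphDocManager` are hypothetical names, the graph is a plain dictionary standing in for Neo4j, and neither the real MongoConnector interface nor any py2neo call is reproduced here.

```python
class InMemoryGraph:
    """Stand-in for the graph store (assumption: NOT py2neo or Neo4j)."""

    def __init__(self):
        self.nodes = {}  # node key -> properties dict


class GraphDocManager:
    """Hypothetical doc manager that receives documents replayed from the oplog."""

    def __init__(self, graph):
        self.graph = graph

    def upsert(self, doc):
        # Create the node if it is missing, otherwise merge in the properties.
        # Fields that appear for the first time are simply added.
        key = str(doc["_id"])
        props = {k: v for k, v in doc.items() if k != "_id"}
        self.graph.nodes.setdefault(key, {}).update(props)

    def remove(self, doc):
        # Deleting an unknown document is a no-op.
        self.graph.nodes.pop(str(doc["_id"]), None)


graph = InMemoryGraph()
manager = GraphDocManager(graph)
manager.upsert({"_id": "BOS", "name": "Boston"})
manager.upsert({"_id": "BOS", "wifi": True})  # new property added on the fly
print(graph.nodes["BOS"])
```

Because the documents arrive through replication, the graph lags the source collection slightly, which matches the "eventually consistent" note above.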
12. Polyglot Arch
[Architecture diagram: Scraping bus websites -> JSON (non-uniform data) -> MongoDB -> replica mechanism -> Mongo Connector -> Neo4j (nodes & edges); REST server]
Example station pairs: BOS, NYC; BOS, PHL; NYC, DC; NYC, PHL
14. Our Story
• We tried to “dump” all data into Neo4j
• Edges had dates -> too many Edges -> “Super Node Problem”
• Query perf was terrible (1+ mins) and worse as # edges increased
• Tried Gremlin -> No improvements
• Needed range queries on Edges
15. “Dehydrate”
• Don’t store everything in Neo4j, only metadata
• Use Neo4j as a “connection index”
• Don’t store entities in Nodes, only keys
• Don’t store heavy properties in Edges
17. Our Solution
• Serve paths from Neo4j
• Segments from MongoDB (with date constraints)
• Back to “Joins”
• “Join” across Neo4j + MongoDB:
1 != 525d9031e6c9236072114387
18. Joins across DBs
MongoDB: Stations / Neo4j: Nodes
BOS, NYC, DC, ... (ids generated by dbs)
• Forget seq id
• Use a human-created “UUID” string for id
MongoDB: Trips / Neo4j: Edges
BOS-NYC, BOS-DC, NYC-DC, ...
• Convert pair into id: depart-arrive
• For example: BOSNYC
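A rough sketch of how shared, human-created ids allow a "join" across the two stores: paths come from a connection index holding only station keys, and segments are then fetched by trip id with a date filter. Both stores are plain dictionaries standing in for Neo4j and MongoDB. The helper names (`trip_id`, `find_paths`, `segments_on`) and all prices are hypothetical.

```python
from datetime import date

def trip_id(depart, arrive):
    """Shared id for a trip/edge: depart key followed by arrive key, e.g. BOSNYC."""
    return depart + arrive

# Stand-in for the Neo4j "connection index": station keys only, no dates or prices.
CONNECTIONS = {"BOS": ["NYC", "DC"], "NYC": ["DC"], "DC": []}

# Stand-in for the MongoDB trips collection, keyed by the same trip id.
TRIPS = {
    "BOSNYC": [{"date": date(2013, 11, 7), "price": 15},
               {"date": date(2013, 11, 8), "price": 12}],
    "BOSDC": [{"date": date(2013, 11, 8), "price": 30}],
    "NYCDC": [{"date": date(2013, 11, 7), "price": 13}],
}

def find_paths(connections, origin, destination, max_hops=3):
    """Step 1: enumerate simple paths (lists of station keys) with a depth-first search."""
    paths = []
    stack = [[origin]]
    while stack:
        path = stack.pop()
        if path[-1] == destination:
            paths.append(path)
            continue
        if len(path) > max_hops:
            continue  # path already uses max_hops edges
        for nxt in connections.get(path[-1], []):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths

def segments_on(path, trips, travel_date):
    """Step 2: fetch the dated segments for each leg of a path by trip id."""
    return {
        trip_id(dep, arr): [t for t in trips.get(trip_id(dep, arr), [])
                            if t["date"] == travel_date]
        for dep, arr in zip(path, path[1:])
    }

paths = find_paths(CONNECTIONS, "BOS", "DC")
for path in paths:
    print(path, segments_on(path, TRIPS, date(2013, 11, 7)))
```

The date filter runs only against the document store, so the graph never needs one edge per date.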
21. Lessons of Lessons
• Really understand the Neo4j Runtime Model
• Pick universal human generated ids
• Join across dbs better than RDBMS: 10s paths x 100s segments vs. 500k x 500k
• Glad to have picked Neo4j: doing content gen and more geo features now