GraphTour - Albelli: Running Neo4j on a large scale image platform
1. Running Neo4j on a Large-Scale Image Platform
Wouter Crooy & Ruben Heusinkveld
15-02-2018
2. Who we are
• Wouter Crooy – Solution Architect
• Ruben Heusinkveld – Solution Architect
• Neo4j Certified Professionals
@wcrooy
@rheusinkveld
3. The photo organizer
• Deliver well-organized, easy-to-use and secure storage for all your images
• Ease the process of selecting photos for creating photo products
• Started as part of an R&D ‘Skunk Works’ project
10. The challenge
• Replace legacy system with the new photo organizer
• Move 1.3 PB of photos from on-premises to cloud storage
• Analyze & organize all photos (511 million)
• Data cleansing while importing
• Using the same technology / architecture during import and after
• Ability to add features while importing
• The core of the system is built in .NET
11. The import
• Hard deadline
• The factory housing the data center with all photos was closing
• Started 1st of April
• Minimum processing of 150 images / second
• ~500 queries / second to database (Neo4j)
• Up to 700 EC2 instances on AWS
15. Why we chose Neo4j
• Close to domain model
• Not an ordinary (relational) database
• Looking for relations between photos/users
• Scalable
• Flexible schema
• Natural / fluent queries
• ACID / data consistency
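A small sketch of what "natural / fluent queries" means here, using the (User)<-[:BelongsTo]-(Photo) pattern that appears later in this deck. The labels, property names, and parameters are illustrative assumptions, not Albelli's actual schema:

```cypher
// All photos belonging to a user, taken on a given day
// ($userId, $day and the property names are hypothetical)
MATCH (u:User {id: $userId})<-[:BelongsTo]-(p:Photo)
WHERE p.takenDate = $day
RETURN p
ORDER BY p.takenAt
```

The query reads almost like the domain statement itself: "the photos that belong to this user, taken on this day" — which is what "close to domain model" refers to.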
20. Our Neo4j database
• More than 1.4 billion nodes
• 4.6 billion properties
• 3.7 billion relationships
• Total store size of 890 GB
21. Improve your model
• Training
• Neo4j on Slack
• Support engineers
• Query planner
• Query logging (use with care!)
dbms.logs.query.allocation_logging_enabled=true
dbms.logs.query.time_logging_enabled=true
dbms.logs.query.page_logging_enabled=true
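Alongside query logging, individual queries can be inspected through the planner with EXPLAIN (plan only) or PROFILE (plan plus actual rows and db hits). A minimal sketch, assuming an illustrative schema:

```cypher
// PROFILE executes the query and annotates each plan operator
// with rows produced and db hits; use EXPLAIN instead to see
// the plan without running the query
PROFILE
MATCH (u:User {id: $userId})<-[:BelongsTo]-(p:Photo)
RETURN count(p)
```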
23. Command Query Responsibility Segregation
• Separation between writing and reading data
• Different model between Query and Command API
• Independent scaling
[Diagram: CQRS architecture — the UI reads through the Query side from a cache/DB, while Commands write through components that update the store and publish changes]
24. CQRS: Separate Reads & Writes
• No active event publishing in place
• Specific scenarios for updating / writing data
• Ability to create separate models for read and write
• Updates (pieces of) the user graph
• Requires reliable and consistent reads
• Scaling out -> lock contention on the (user) graph
• After import
• Low-performance scenarios -> cache with a lower update priority
25. Read after write consistency
• All reads should contain the very latest and most accurate data
• Replication delay between servers
• Split on consistency
• Article by Aseem Kishore:
• https://neo4j.com/blog/advanced-neo4j-fiftythree-reading-writing-scaling/
26. Graph locking
• Concurrency challenge
• Scale-out => more images from the same user
• Manage the input
• High spread of user/image combination
• Prevent concurrent analysis of multiple images from the same user
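One way to serialize concurrent analysis of a single user's photos inside the database is an explicit lock on the user node, e.g. with APOC's apoc.lock.nodes. This is a sketch under assumed labels and parameters; the approach described on this slide manages concurrency on the input side instead:

```cypher
// Take an explicit write lock on the user node: a second transaction
// doing the same MATCH + lock blocks until the first one commits,
// so photos of one user are analyzed one at a time
MATCH (u:User {id: $userId})
CALL apoc.lock.nodes([u])
CREATE (p:Photo {id: $photoId})-[:BelongsTo]->(u)
RETURN p
```

The trade-off is that explicit locks push the waiting into the database, which is why spreading the input (a "high spread of user/image combination") can be preferable at this scale.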
27. Graph design considerations
• Property scan
• (User)<-[:BelongsTo]-(Photo)
• More photos
• Property search => full graph scan
• Differentiating property
• Create node
• Making changes to the schema…
• For 550+ million nodes
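The trade-off on this slide can be illustrated with two hypothetical queries (labels and property names are assumptions): filtering on a property inspects every matched node, while modeling the differentiator as a label (or a dedicated node) keeps the work proportional to what is actually selected:

```cypher
// Property search: without an index, this reads the property
// of every Photo node in the graph
MATCH (p:Photo)
WHERE p.migrated = true
RETURN count(p);

// Differentiating label: a label scan touches only the marked
// nodes, avoiding the full graph scan
MATCH (p:Photo:Migrated)
RETURN count(p);
```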
28. Managing a (very) large graph
• Finding the sweet spot for cluster size
• Make sure the read servers are busy enough
• More nodes => more load on write master
• Optimal memory vs graph store size
• 1:1
• We are not afraid of polyglot persistency
• We moved EXIF data to document store (DynamoDB)
• Users tend to upload photos in (big) batches
• Manage input to prevent race conditions
• Simultaneous creation of nodes => more concurrent locking of the user graph
29. Managing a (very) large graph (2)
• APOC (Periodic Iterate / Commit) (550M photos)
• Adding labels to the nodes to be updated => 120 minutes
• Adding the additional relationship, removing the additional label => 3 minutes
• Combining writes if possible for faster write performance
• APOC Do When
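The two-phase update described above might look like the following with apoc.periodic.iterate. Batch sizes, the temporary label, and the property/relationship names are illustrative assumptions:

```cypher
// Phase 1: full scan to mark the photos that need the change
// (the slow step on the slide: ~120 minutes for 550M nodes)
CALL apoc.periodic.iterate(
  "MATCH (p:Photo) WHERE p.needsUpdate = true RETURN p",
  "SET p:ToUpdate",
  {batchSize: 10000, parallel: false});

// Phase 2: a cheap label scan — add the relationship and drop
// the temporary label (the fast step: ~3 minutes)
CALL apoc.periodic.iterate(
  "MATCH (p:ToUpdate) RETURN p",
  "MATCH (u:User {id: p.userId}) MERGE (u)<-[:BelongsTo]-(p) REMOVE p:ToUpdate",
  {batchSize: 10000, parallel: false});
```

Batching the writes this way keeps transactions small, and combining multiple changes into one action statement (as the slide suggests) avoids re-traversing the same nodes.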