03 introduction to graph databases
1.
2. ● Components of a Property Graph
● Introducing Game of Thrones
● Introducing Cypher
● Introducing APOC
● Cypher strikes back
● Property Graph Modeling 101
3.
4. Yes indeed. While Graph Theory has been around since 1736, there
are quite a few ways to implement it. Currently there are two that
matter in the context of databases:
● RDF Graph
● Property Graph
A discussion of these is not within the scope of this training. Neo4j
implements the Property Graph (and can be considered the inventor of
said type of graph). If you want more information or a comparison, your
instructor will be happy to provide that offline.
6. Node - vertex, entity, object, thing
Label - Class(ification) of a Node
Alien
Person
7. Node - vertex, entity, object, thing
Label - Class(ification) of a Node
Relationship - edge, physical link
between two nodes
8. Node - vertex, entity, object, thing
Label - Class(ification) of a Node
Relationship - edge, physical link
between two nodes
Type - Class(ification) of a Relationship
COMMANDS
9. Node - vertex, entity, object, thing
Label - Class(ification) of a Node
Relationship - edge, physical link
between two nodes
Type - Class(ification) of a Relationship
Property - Key Value Pair
COMMANDS
stardate: 41054.3
11. Relational versus Graph
● Relational: each column must have a value (or null)
Graph: nodes with the same label are not required to have the
same properties
● Relational: joins are calculated at read time
Graph: relationships are created at write time
● Relational: a row belongs to only one table
Graph: a node can have many (or no) labels
14. Game of Thrones is based on the book series A Song of Ice and Fire
by George R. R. Martin. What is special about it is the sheer number
of characters that have a big storyline (and - importantly - all
interact with each other). In the TV series it's dozens (in the books
it runs into the hundreds).
Rather than put off audiences, this turned it into a hit. And it also
made this one hell of a graph database use case …
15. One of the two databases we'll be using in this training (and that
you should have ready to go) is a Game Of Thrones interaction
database.
So this is finally where you start doing stuff. And let's start by
looking at what model that database has.
16. As explained in the preparation email for this training, you should
now have
● a Neo4j instance running
● a gameofthrones database loaded in that instance
● a Neo4j Browser session connected to that instance
I'll not be flipping between slides and code execution. You put me
in a window now and follow along either in the browser guide or by
cutting and pasting from a website ...
17. So it's either … inside the Neo4j Browser you run
:play http://neo4jguides.tomgeudens.io/gdsintro.html
(note that this requires a neo4j.conf setting to whitelist the host)
or you open a regular browser session too and go to
https://bit.ly/neo4j-gds-intro
and cut-and-paste the commands from there
18.
19. The syntax to show the model of a database is ...
// Make sure you're in the correct database
:use gameofthrones
// Show the model
CALL db.schema.visualization();
20.
21. That wasn't exactly what you got, was it?
For me - as a human - labels like Knight, Dead, King are obviously all
extra labels that subclass the Person label, and thus I manually
removed them for clarity. The database makes no such distinction
and does not subclass. All labels are separate. And that's why you
got what you got.
Note that Rome almost had a horse as consul at some point. Allegedly,
true, but Neo4j could have handled that for real!
24. ● A declarative query language
● Focus on the what, not the how
● Uses ASCII art to represent nodes and relationships
("`-''-/").___..--''"`-._
`6_ 6 ) `-. ( ).`-.__.`)
(_Y_.)' ._ ) `._ `. ``-..-'
_..`--'_..-_/ /--'_.'
((((.-'' ((((.' (((.-'
It won't get this complex, I promise ...
25. Diving straight in I'm afraid I have to start by taking away your
precioussss SELECT. A Cypher query always starts with a pattern
match and thus the logical keyword is MATCH
// Find and return all nodes in the database
MATCH (n)
RETURN n;
26. ● What you just did is actually a pretty bad idea. While the
database has no issue with you asking for all the nodes, simple
visualization tools (such as the Neo4j Browser) typically can't
handle what comes back.
● If you want to think along the lines of what you know about
SQL queries, you could see the MATCH as a preprocessing step
and the RETURN as the select/projection.
27. The trick to a good graph query (which I'll repeat ad nauseam) is to
efficiently find your starting points. Using labels is a great way to do
that.
// Find and return all House nodes in the database
MATCH (n:House)
RETURN n;
28. Since a node can have multiple labels, using multiple labels in the
base syntax implies AND and returns nodes that have all the
specified labels
// Find and return all Person nodes in the database that also have
// the King and the Dead label
MATCH (n:Person:King:Dead)
RETURN n;
29. ● How many (:Person:King:Dead) nodes did you get?
● Did you notice (for all queries so far) something strange in the
results?
● 9
● Did you ask for relationships in your query? So why then are
there relationships on screen?
30. The whole point of a graph database is to be able to traverse
relationships (which in the case of graph native Neo4j means
hopping pointers). So here we go ...
// Which houses were present in which battles?
MATCH (h:House)-[]->(b:Battle)
RETURN h.name, b.name;
31. The more specific you can make a query, the more efficient it -
generally speaking - becomes.
// Which houses attacked in which battles?
MATCH (h:House)-[a:ATTACKER]->(b:Battle)
RETURN h,a,b LIMIT 30;
32. As a relationship has one and only one type, using multiple
relationship types in the base syntax implies OR.
// Which houses attacked or defended in which battles?
MATCH (h:House)-[ad:ATTACKER|DEFENDER]->(b:Battle)
RETURN h,ad,b LIMIT 30;
33. ● The LIMIT is imposed to make sure the browser doesn't blow
up, the database is fine, thank you very much.
● A Cypher query can RETURN quite a few different things and it is
your application's job to handle them ...
36. ● Awesome Procedures On Cypher
● A library of user defined procedures (UDP) and functions
(UDF) that extends the Cypher query language.
● In this particular case it is a supported library, that provides
tons of convenience tools to Cypher.
It's funny, because in The Matrix, Cypher brutally kills Apoc. In reality
you could argue that Apoc totally saves Cypher!
37. Stored procedures are pieces of slightly enhanced/wrapped query
syntax (PL/SQL comes to mind) and are - as the name implies -
stored inside the database itself.
APOC is a library of Java code that is deployed - as a jar file - to a
folder together with the database software. Yes, of course you can
also create such libraries of your own (many customers do) but
forget about what you know about stored procedures. UDPs and
UDFs are neither stored nor created in the same way!
38. A user defined function (UDF)
● Returns one value
● If it does stuff on the database itself, this is always read-only
● Is used inline
Try it …
// Generate a UUID and return it and the versions of APOC and GDS
RETURN apoc.create.uuid() as uuid, apoc.version() as apocversion,
gds.version() as gdsversion;
39. A user defined procedure (UDP)
● YIELDs a stream of results using a predefined signature
● If it does stuff on the database itself, this can be read or write
● Is CALLed
Try it
// Find out about the signature of a procedure
CALL apoc.help("help") YIELD signature
RETURN signature;
40. Try it some more
// Show which available procedures can load data into the database
CALL apoc.help("load") YIELD type, name, text, signature, core
RETURN type, name, text, signature, core;
41.
42. So in order to start doing Graph Data Science on the Neo4j
database you're going to need a bit more advanced Cypher than
we've seen so far ...
● WHERE clauses
● Aggregation stuff … COLLECT/UNWIND, COUNT, ...
● Intermediate projection with … WITH
● Result deduplication with … DISTINCT
All of this was a prerequisite though … ;-)
43. I find it always calms audiences down when - after brutally ripping
away their SELECT - I can give (back) the WHERE clause …
// So these are braces ...
MATCH (h:House {name:"Darry"})-[d:DEFENDER]->(b:Battle)
RETURN h,d,b;
// And they are the equivalent of checking for equality
MATCH (h:House)-[d:DEFENDER]->(b:Battle)
WHERE h.name = "Darry"
RETURN h,d,b;
44. Between the braces you can only do equality, but the WHERE clause
comes with the full range of options ...
MATCH (p:Person)-[:BELONGS_TO]->(h:House)
WHERE p.death_year >= 300 AND p.death_year <= 1200
RETURN p.name, h.name;
MATCH (p:Person)-[:BELONGS_TO]->(h:House)
WHERE 300 <= p.death_year <= 1200
RETURN p.name, h.name;
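If you want to convince yourself that the two range notations above behave the same, here is a small sketch that does not touch the gameofthrones database at all. The list of years is made up purely for illustration.

```cypher
// Hypothetical values, chosen to sit below, on and above the boundaries
UNWIND [250, 300, 750, 1200, 1300] AS year
RETURN year,
       // the explicit form with AND
       year >= 300 AND year <= 1200 AS explicitRange,
       // the chained form
       300 <= year <= 1200 AS chainedRange;
```

Both boolean columns should agree on every row: false for 250 and 1300, true for 300, 750 and 1200.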
45. It came as a bit of a surprise (to me at least) that there is indeed
such a debate. One that has in fact been raging ever since
aggregation became a thing in a database query language. Neo4j
does IMPLICIT aggregation, there is NO group by.
Try it
// How many persons does each house have?
MATCH (a:Person)-[:BELONGS_TO]->(h:House)
RETURN h.name as Housename, count(*) as Household;
46. An aggregation in a projection (RETURN or WITH) will implicitly be
grouped by everything in that projection that is not aggregated ...
Read that again … and again, now (and I'm not giving you syntax)
tell me how big the household of Brotherhood Without Banners
is ...
6
Check again if you didn't get that ...
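If the grouping rule still feels abstract, the following sketch shows it on inline data. The persons and houses here are placeholders (assumptions for illustration, not rows from the gameofthrones database), so you can run it anywhere.

```cypher
// Hypothetical inline rows standing in for (Person)-[:BELONGS_TO]->(House)
UNWIND [
  {person: "A", house: "X"},
  {person: "B", house: "X"},
  {person: "C", house: "Y"}
] AS row
// row.house is not aggregated, so it becomes the grouping key
RETURN row.house AS Housename, count(*) AS Household
ORDER BY Housename;
```

This returns two rows, X with 2 and Y with 1. Add a second non-aggregated column to the RETURN and the grouping key changes with it.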
47. There are quite a few aggregation functions, but the two you'll use
most in this training are count and collect.
Try it
// Which commanders didn't learn the first time?
MATCH (ac:Person)-[:ATTACKER_COMMANDER]->(b:Battle)<-[:DEFENDER_COMMANDER]-(dc:Person)
RETURN ac.name, dc.name, count(b) AS commonBattles,
collect(b.name) AS battlenames
ORDER BY commonBattles DESC LIMIT 5
48. ● I guess explaining what a count does is a bit pointless (wait
until after the deduplication to start laughing though).
● A collect aggregates detail into a list.
● ORDER BY … SKIP … LIMIT
work exactly the same as you're used to … all three are needed
if you want to do paging.
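A minimal paging sketch, assuming a page size of 3 and using generated numbers instead of database content:

```cypher
// Ten hypothetical rows, sorted descending, then the second page of three
UNWIND range(1, 10) AS n
RETURN n
ORDER BY n DESC
SKIP 3 LIMIT 3;
```

This returns 7, 6 and 5. Without the ORDER BY the pages are not guaranteed to be stable, which is why all three keywords go together.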
49. I can explain this to you, but I can't understand it for you. So try the
following and answer the - relatively - simple question … How
many times is Jon Snow on screen?
// Just Jon, right?
MATCH (p:Person {name: "Jon Snow"})-[:INTERACTS]->()
RETURN p;
50. I would say your eyes deceived you … allow the
machine to count.
// Counting Jon
MATCH (p:Person {name: "Jon Snow"})-[:INTERACTS]->()
RETURN count(p);
89
// I am the only Jon
MATCH (p:Person {name: "Jon Snow"})-[:INTERACTS]->()
RETURN DISTINCT p;
51. This is one of the biggest issues people have with Cypher queries. It
does pattern matching. Every match returns a pattern that is
unique as a whole, not in its parts.
Jon Snow shows up in 89 unique interaction patterns.
I made several attempts to add something to the previous line, but the
point is made best if you understand exactly what that line (no more,
no less) says. Please take a moment to do that, I'll wait ...
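The same effect can be reproduced without the database. In this sketch the three names are made-up stand-ins for the other end of each matched pattern, and "Jon" is just a string, not a node.

```cypher
// Three hypothetical patterns, each with the same starting point
UNWIND ["Sam", "Arya", "Ygritte"] AS other
WITH "Jon" AS p, other
// one row per pattern, but only one distinct p
RETURN count(p) AS patterns, count(DISTINCT p) AS distinctStartingPoints;
```

The result is 3 patterns and 1 distinct starting point.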
52. WITH
● provides an intermediate projection with exactly the same
functionality as a RETURN (but without ending the query)
● controls the visibility of variables beyond the WITH … what isn't
projected is gone
● comes with an additional WHERE clause that is the equivalent
of the HAVING clause in SQL
● is the main building block for building Cypher pipeline queries
53. It's only the second slide … remember that they kept that ruse
going for almost 8 seasons … and you kept watching!
// The point of the query is moot, but it shows what WITH can do ...
MATCH (p:Person)-[:ATTACKER_COMMANDER]->(b:Battle)
WITH p.name as name, count(*) AS numBattles, collect(b.name) as
battles
WHERE numBattles = 2
RETURN name, numBattles, battles;
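For the HAVING-like behaviour of WITH ... WHERE in isolation, here is a sketch on a made-up list of letters (no database content involved):

```cypher
// Count occurrences per letter, then filter on the aggregate
UNWIND ["a", "b", "a", "c", "a", "b"] AS letter
WITH letter, count(*) AS occurrences
WHERE occurrences >= 2
RETURN letter, occurrences
ORDER BY occurrences DESC;
```

This returns a with 3 and b with 2. The letter c is counted once and is filtered out after the aggregation, which is exactly what a plain WHERE on the MATCH could not do.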
55.
56. UNWIND takes a collection and turns it into rows … nice … but what
exactly does that mean?
// How many results come out?
WITH range(1,5) as acollection
UNWIND acollection as arow
WITH arow, acollection, apoc.coll.shuffle(acollection) as ashuffle
UNWIND ashuffle as anotherrow
RETURN arow, anotherrow, acollection,ashuffle;
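If the number of results surprised you, a smaller sketch with two made-up lists may help. Each UNWIND produces one row per element for every row that already exists.

```cypher
// 2 rows after the first UNWIND, 2 x 3 rows after the second
UNWIND [1, 2] AS a
UNWIND ["x", "y", "z"] AS b
RETURN a, b;
```

This returns 6 rows, one for every combination of a and b.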
57. So a collect aggregates anything into a list (= collection) and an
UNWIND turns it back into rows … aaaaarrrgghhhh make it stop !!!
// aggressive lot, these
MATCH (a:House)-[:ATTACKER]->()
WITH collect(DISTINCT a) AS attackers
MATCH (d:House)-[:DEFENDER]->()
WITH attackers, collect(DISTINCT d) AS defenders
UNWIND apoc.coll.removeAll(attackers,defenders) AS houses
RETURN houses.name as Names;
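The list subtraction can also be sketched in plain Cypher with a list comprehension, which is handy when APOC is not installed. This is an alternative I am swapping in for illustration, not what the query above does internally, and the two lists are made up rather than read from the database.

```cypher
// Hypothetical lists standing in for the collected attackers and defenders
WITH ["Stark", "Lannister", "Greyjoy"] AS attackers,
     ["Lannister", "Tully"] AS defenders
// keep only the attackers that never show up as defenders
UNWIND [h IN attackers WHERE NOT h IN defenders] AS house
RETURN house;
```

This returns Stark and Greyjoy, one row each.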
58.
59. To lead you into this list of modeling steps …
1. Define high-level domain requirements
2. Create sample data for modeling purposes
3. Define the questions for the domain
4. Identify entities
5. Identify connections between entities
6. Test the model against the questions
7. Test scalability
And I'm sure that is all very interesting and … yadi yada …
60. So what we are going to do instead
is take a good look at that Game Of
Thrones model again and see what
does / does not make sense!
61.
62.
63. ● Are you capturing all the information? For example … who is
actually fighting in these battles? And why do we only care about
commanders and kings?
● Instance modeling helps you to ask the questions … for
example (and I haven't seen a single episode of the whole thing
myself) … are you really telling me that all the battles are so
pitched/clear-cut that no house and not a single commander ever
changes side during one of them?
● ...
64.
65. ● Why? Why is this not just a label (like Knight and King and Dead
… and a word on those too in the next bullet point)? Is this
adding anything to the questions we want to answer?
● Does status never change over time? In fact, that goes for
Knight and King and definitely Dead too. The model seems to
support a current situation for Person nodes only … but when
is this exactly? Having a time dimension for some things but
not for all can put you on the path to false conclusions!
● ...
66. I can keep this up for several hours (and then we call that a
Modeling training) … but here are the top tips …
● Let concrete questions drive your work. As the model is very
flexible you can easily iterate and adapt as your understanding grows.
● Leverage instance modeling to show if you've missed
information. Your instance model should show all cases.
● Be expressive in your relationship types. The only
touchstone is whether the business users understand it
and see their business reflected in it!