05 neo4j gds graph catalog

● What is the Graph Catalog?
● Named graphs versus Anonymous graphs
● Native projection versus Cypher projection
● Mutability
● Graph Catalog management

We are still on the gameofthrones database. You can either
run the following guide inside the Neo4j Browser
:play http://neo4jguides.tomgeudens.io/gdscatalog.html
(note that this requires a neo4j.conf setting to whitelist the host)
or open a regular browser session, go to
https://bit.ly/neo4j-gds-catalog
and copy and paste the commands from there

The shape of the graph you use for analytics (and algorithms) is
significantly different from the one you need to run complex
business queries in real time and do the transactional work. To
reiterate the technical terms …

A monopartite graph
● is a single set of nodes that are interconnected
● is what you need for the majority of the graph algorithms
If you ever wondered why Facebook (or people leveraging Facebook
data) is so notoriously good at analytics … think about what the core
Facebook graph is like ...

A bipartite graph
● two sets of nodes that are connected, but the sets themselves
are not interconnected
● great as input for algorithms (such as node similarity) that are
used to create a monopartite graph
If you've done basic Neo4j training … the Movie graph is also a
bipartite graph.
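To make the "bipartite in, monopartite out" idea a bit more tangible, here is a minimal Python sketch. It assumes Jaccard overlap as the similarity measure, and the persons and movies in it are made up for illustration; it is not the GDS implementation.

```python
from itertools import combinations

# Hypothetical bipartite data: each person points at the movies they acted in.
acted_in = {
    "Alice": {"Movie A", "Movie B"},
    "Bob": {"Movie A", "Movie B", "Movie C"},
    "Carol": {"Movie C"},
}

def jaccard(a, b):
    """Size of the overlap divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_edges(neighbours, cutoff=0.0):
    """Return (node, node, score) triples for every pair scoring above the cutoff."""
    edges = []
    for left, right in combinations(sorted(neighbours), 2):
        score = jaccard(neighbours[left], neighbours[right])
        if score > cutoff:
            edges.append((left, right, score))
    return edges

# Only persons are left: a monopartite graph derived from the bipartite one.
person_graph = similarity_edges(acted_in)
for edge in person_graph:
    print(edge)
```

Alice and Carol share no movie, so no relationship is created between them; the movies themselves no longer appear in the result.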

A multipartite graph
● lots of sets of nodes and lots of types of relationships between
them
● ideal for describing a domain or business and for real time
complex queries
This is how we teach you to model in graph modeling classes … did I drive
the point home enough now?

Procedures (part of the GDS library) that let you reshape and
subset your transactional graph so you have the right data in the
right shape to run analytical algorithms.
[Diagram: "This is what you already know ..." (Native Graph Storage, Page Cache) next to the Mutable In-Memory Workspace]

While the in-memory workspace disappears when the database is
stopped (it's ephemeral, to use a fancy word), it is also not just a one
reshape, one algorithm run, do-it-all-over-again setup. You can
name your reshapes, mutate them and reuse them.
It's a catalog.
In order to fully grasp that, we'll briefly list all the modes in which
you can do Graph Data Science and then explore them in detail
...

Rather than give you some dry explanation … try it out. I (or rather
pageranking) give(s) you … Jon Snow!
CALL gds.pageRank.stream({
nodeProjection: "Person",
relationshipProjection: "INTERACTS"
}) YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC, name ASC
LIMIT 20;
Bummer, that didn't work out ...

● I can't even show you the real Graph Catalog stuff here
(although it is used under the hood) because this really is the
one-shot-fire-and-forget-doing-the-algorithm method.
● Which is relatively easy to learn.
● And as the Person, INTERACTS subgraph is a monopartite
graph, a native projection (aka Look ma, no hands) was possible
● ...
You're not remembering the series or the books wrong though, Jon
Snow should have come out on top … so something was wrong!

This time we're going for those that are most prominent in the
battles ...
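CALL gds.pageRank.stream({
nodeQuery: "MATCH (p:Person) RETURN id(p) AS id",
relationshipQuery: "MATCH (p1:Person)-[]->(:Battle)<-[]-(p2:Person)
RETURN id(p1) AS source, id(p2) AS target"
}) YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC, name ASC
LIMIT 10;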

● Again, not a lot of Graph Catalog stuff to show, the
monopartite graph is shaped on the fly …
● While somewhat more complex (you need to write the queries
to do the projection), the results should immediately be more
relevant (as you're in control) … a great approach for proofs of
concept!
● ...
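What the relationship query does can be mimicked in plain Python: every two persons that took part in the same battle become a source/target pair. This is just a sketch with made-up participation data, not a replacement for the Cypher projection.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical (person, battle) participation pairs.
participation = [
    ("Person A", "Battle 1"),
    ("Person B", "Battle 1"),
    ("Person C", "Battle 1"),
    ("Person A", "Battle 2"),
    ("Person B", "Battle 2"),
]

def project_pairs(pairs):
    """Turn person-battle pairs into person-person (source, target) pairs."""
    by_battle = defaultdict(list)
    for person, battle in pairs:
        by_battle[battle].append(person)
    projected = []
    for people in by_battle.values():
        # The pattern is symmetric, so both (a, b) and (b, a) are emitted.
        projected.extend(permutations(people, 2))
    return projected

edges = project_pairs(participation)
print(len(edges), "projected relationships")
```

Note that two persons who fought in two battles together end up with two parallel relationships in each direction, since nothing is aggregated here.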

Exactly the same question as we had in Mode II, but this time we're
going to name the graph.
CALL gds.graph.create.cypher(
"gds-brutes",
"MATCH (p:Person) RETURN id(p) AS id",
"MATCH (p1:Person)-[]->(:Battle)<-[]-(p2:Person) RETURN id(p1) AS source,
id(p2) AS target"
) YIELD graphName, nodeCount, relationshipCount
RETURN *;

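Wait … we haven't actually done the algorithm yet ...
CALL gds.pageRank.stream('gds-brutes') YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC, name ASC LIMIT 10;
But now … we can just keep going ...
CALL gds.betweenness.stream('gds-brutes') YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC, name ASC
LIMIT 10;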

● Now we're getting somewhere … a named graph remains
available in between runs of (potentially) different algorithms.
● Rather than going for an ad hoc fire-and-forget, this moves the
ball more towards flexible workflows.
● While Cypher projection is a great tool, it comes with the
downside of being - relatively - slow for huge workloads, …
● ...
Don't get impatient, we'll dig deeper into Catalog management in a
minute … allow me to finish the Fab Four first though … also, did you
notice the difference in who came out on top?

Exactly the same question as we had in Mode I, but this time we're
going to name the graph.
CALL gds.graph.create(
"gds-interaction",
"Person",
"INTERACTS"
) YIELD graphName, nodeCount, relationshipCount
RETURN *;

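And run ...
CALL gds.pageRank.stream('gds-interaction') YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC, name ASC
LIMIT 10;
And keep going ...
CALL gds.betweenness.stream('gds-interaction') YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC, name ASC
LIMIT 10;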

● So this is the whole nine yards. And it runs at huge scale
(which you can't see here so you'll have to take my word for it)
● There's a chicken and egg problem though, the monopartite
graph must be in the database already.
● ...
So we finally did get Jon Snow, but pagerank should also have gotten
him. Can anybody venture a guess by now on what we're doing wrong
there?

          Anonymous                  Named
Native    Easy to learn              Performant at scale
Cypher    Quick proofs of concept    Flexible workflows
● The in-memory workspace is the secret sauce of the Graph
Data Science library and is super-efficient. It can handle huge
graph projections.
● It does however require memory and you will quickly run out if
you don't manage it properly.
● Also … you will forget what you put in there if you look at it as a
bottomless pit, thus creating overhead for yourself.
● ...

There's a very interesting tool that gives you an overview of the
in-memory workspace. Try it:
CALL gds.graph.list();
If you followed along so far, you should get two results …
gds-brutes and gds-interaction. You can also examine them
individually. Try it:
CALL gds.graph.list('gds-brutes');
Btw, a CALL requires a YIELD … except when it is a statement by itself.
Hence the missing YIELD and RETURN (for brevity) here ...

Done with a named graph? Drop it! As there is something not right
with our interactions one, let's get rid of it
CALL gds.graph.drop('gds-interaction');
And verify with the list command that it's indeed gone ...
CALL gds.graph.list();
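If the create / list / drop cycle still feels abstract, think of the catalog as a dictionary of named in-memory graphs. The toy class below is purely illustrative (its class and method names are invented for this sketch and are not the GDS API), but it shows the bookkeeping idea.

```python
class GraphCatalog:
    """Toy stand-in for a catalog of named in-memory graphs (not the GDS API)."""

    def __init__(self):
        self._graphs = {}

    def create(self, name, nodes, relationships):
        if name in self._graphs:
            raise ValueError(f"graph {name!r} already exists")
        self._graphs[name] = {
            "nodes": set(nodes),
            "relationships": list(relationships),
        }
        return self.describe(name)

    def describe(self, name):
        graph = self._graphs[name]
        return {
            "graphName": name,
            "nodeCount": len(graph["nodes"]),
            "relationshipCount": len(graph["relationships"]),
        }

    def list_graphs(self):
        return [self.describe(name) for name in sorted(self._graphs)]

    def drop(self, name):
        # Dropping frees the memory held by the projection.
        del self._graphs[name]


catalog = GraphCatalog()
catalog.create("gds-brutes", {1, 2, 3}, [(1, 2), (2, 3)])
catalog.create("gds-interaction", {1, 2}, [(1, 2)])
names_before = [g["graphName"] for g in catalog.list_graphs()]
catalog.drop("gds-interaction")
names_after = [g["graphName"] for g in catalog.list_graphs()]
print(names_before, "->", names_after)
```

Everything lives in the `catalog` object only, so it is gone when the process stops, which mirrors the ephemeral nature of the workspace.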

By popular request the engineering team has been working on a
way to actually persist the complete named projection. And as of
the very latest GDS that tool is there (unpolished for now though) ...
CALL gds.graph.export('gds-brutes',{dbName:"brutes"});
WARNING
You will not find this in the guides and I do not want you to try it
now as the steps will confuse a lot of people. Do try this (and
everything else) at home though!

I'm not really supposed to show you this one and there's no
guarantee it will stay in the future, but I find this one extremely
useful myself ...
CALL gds.debug.sysInfo();
Very useful for say … quickly figuring out how low you are on heap and
such ...

Do you still need Anonymous graphs? No, not really
● Unless you're improvising a one-shot thing and even then … the
syntax of these things (unless you're doing a trivial demo) is not
easy, you should follow a workflow and use a Named graph.
● Unless you're using an algorithm that hasn't been converted to
using the workspace yet … well … you don't really have a choice
then … (Pathfinding comes to mind)

I tried all of the syntax for all of my presentations during these two
days … as you would/should …
● The original decks still had 3.5.x syntax, Emil Eifrem (our CEO)
has sworn to shoot everybody that still shows 3.5.x stuff
● Obviously I also want to show you the latest GDS library
● There are subtle differences about how to write the projections
in the named syntax versus those in the anonymous syntax
● ...
So spare yourself the frustration and pain and learn the syntax you'll
be using for production. Named graphs. Thank me later!

Jon Snow didn't show up as the top dog based on the pagerank
algorithm. And I actually showed you earlier what the issue is ...
A person's interaction with another person is obviously undirected
(or bi-directional, whichever you prefer), but the Property Graph
is directed, and in modeling trainings you'll hear that you should not
create a second relationship in that case (as that would duplicate data).

However, how would an algorithm know that the domain implies
an undirected relationship as the Property Graph has no schema that
specifies / enforces such information?
The algorithm makes the reasonable (default) assumption that
INTERACTS is a directed relationship. Persons that are on the target
end of them are thus not considered in the pagerank. And it turns
out (and this is purely based on how the data was loaded) that Jon
Snow is frequently the target, rarely the source.

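CALL gds.pageRank.stream({
  nodeProjection: "Person",
  relationshipProjection: {
    INTERACTS: {
      type: "INTERACTS",
      orientation: "UNDIRECTED"
    }
  }
}) YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC, name ASC
LIMIT 20;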

● It takes the Person nodes and puts them in the workspace
(again as Person and note that it didn't have to be).
● It takes the INTERACTS relationships and puts them in the
workspace (again as INTERACTS … idem). Because we specify
the orientation as undirected this will effectively result in
doubling the number of them in the workspace ...
I don't always find all this reshaping that obvious myself. Planning
upfront what you are aiming for is a good idea!
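The doubling is easy to see on a tiny example. The sketch below is a simplified picture of what an orientation setting does to a list of stored relationships; the `project` helper and the data are hypothetical and only illustrate the idea.

```python
# Hypothetical stored relationships, each stored once as (source, target).
stored = [("Person A", "Jon"), ("Person B", "Jon"), ("Jon", "Person C")]

def project(relationships, orientation="NATURAL"):
    """Return the relationship list as seen under the given orientation."""
    if orientation == "NATURAL":
        return list(relationships)
    if orientation == "REVERSE":
        return [(target, source) for source, target in relationships]
    if orientation == "UNDIRECTED":
        # Every stored relationship becomes traversable in both directions.
        return [
            pair
            for source, target in relationships
            for pair in ((source, target), (target, source))
        ]
    raise ValueError(f"unknown orientation: {orientation}")

natural = project(stored)
undirected = project(stored, "UNDIRECTED")
print(len(natural), len(undirected))
```

In the natural projection Jon can be left in only one way; in the undirected one every interaction counts from both ends.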

I just showed you how to fix the problem for an Anonymous graph,
but now we want it as a Named graph …
● Take the syntax from the Mode IV example and create the
named graph again, this time as gds-interaction-natural
● Try to modify the syntax and create a second named graph,
gds-interaction-undirected
● Using gds.graph.list on both named graphs, can you recognize
the difference? Note it down!
When you are ready (give everybody a bit of a chance though), paste
your solution (to the second and third bullet points) in the chat ...

CALL gds.graph.create("gds-interaction-undirected","Person",{
  INTERACTS: {
    type: "INTERACTS",
    orientation: "UNDIRECTED"
  }
});
The relationshipCount should have doubled; for me (yours may
be slightly different) the counts are 3907 and 7814

Nodes
● label(s)
● properties
Relationships
● type(s)
● orientation
● aggregation
● properties
And all those can (but also must) be controlled with either a Native or a
Cypher projection.
● Cypher gives you complete flexibility, Native gives you complete
performance.
● Cypher leaves your original graph standing as is, Native may require
constructs
Let's consider the financial practices of Dewey, Cheatum and Howe ...

Less clutter, same information … right … right … RIGHT???

Instead of going to jail for 25 years, Dewey, Cheatum and Howe avoided
the law for another 10 years of money laundering. False names, true
story ...
Because … while aggregation is great for most analytics use cases,
it also destroyed the clear 1% mule kickback scheme that you could
almost literally see with the naked eye … Transactional fraud
detection.
If only there was a way to shape data efficiently - depending on the
use case - without destroying the more expressive set that describes our
business ...
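A few lines of Python show how that information gets lost. The transfers below are invented, and collapsing each pair of parties into one total is just one hypothetical way of aggregating, chosen to make the point.

```python
from collections import defaultdict

# Hypothetical transfers (sender, receiver, amount): each payment is
# followed by a 1% kickback in the opposite direction.
transfers = [
    ("Firm", "Mule", 10_000),
    ("Mule", "Firm", 100),
    ("Firm", "Mule", 20_000),
    ("Mule", "Firm", 200),
]

def aggregate(rows):
    """Collapse all transfers between two parties into a single total."""
    totals = defaultdict(int)
    for sender, receiver, amount in rows:
        totals[frozenset((sender, receiver))] += amount
    return dict(totals)

def kickback_ratios(rows):
    """Ratio of each return transfer to the payment listed just before it.

    Assumes the rows alternate payment, kickback, as in the data above.
    """
    ratios = []
    for (s1, r1, a1), (s2, r2, a2) in zip(rows[0::2], rows[1::2]):
        if (s2, r2) == (r1, s1):
            ratios.append(a2 / a1)
    return ratios

aggregated = aggregate(transfers)
ratios = kickback_ratios(transfers)
print(aggregated)  # one relationship, one number
print(ratios)      # the pattern is only visible in the raw transfers
```

The aggregated graph has one relationship with one total on it, which is fine for many analytics questions but says nothing about the repeating 1% pattern.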

If you remember one thing (ok, one thing + the puppies) of this
session about the Graph Catalog, that is it. That is the purpose of it
and that's why Neo4j can rightfully claim a prominent place in this
game.
And as an aside … the Native Projection can very efficiently (much more
efficiently than Cypher Projection) do aggregations for analytical
purposes.

Yes, I know it's an empty slide … how could I possibly fit all of it on such
a thing … allow me to swap to my code editor for a second ...

CALL gds.graph.create.cypher('gds-ultimate-cypher',
'MATCH (p:Person) RETURN id(p) as id, p.birth_year as birthyear',
'MATCH (p1:Person)-[:APPEARED_IN]->(b:Book)<-[:APPEARED_IN]-(p2:Person)
RETURN id(p1) as source, id(p2) as target, count(DISTINCT b) as weight');

Who cares as long as we all agree that this and not Jon Snow is the top
dog!

Each of the algorithms comes with eight procedures.
Try typing
CALL gds.wcc
in the browser without completing the line (or entering) and see
what you get ...

Procedure                  Task
gds.wcc.stats              statistics about the run
gds.wcc.write              writes result back to database
gds.wcc.mutate             writes result back to in-memory graph
gds.wcc.stream             streams result
gds.wcc.stats.estimate     estimated memory usage statistics
gds.wcc.write.estimate     estimated memory usage write
gds.wcc.mutate.estimate    estimated memory usage mutate
gds.wcc.stream.estimate    estimated memory usage stream
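The eight names follow a simple pattern: four execution modes, each with an estimate variant. The snippet below only spells out that pattern as listed for gds.wcc; the `procedures` helper is hypothetical and not something the library offers.

```python
# The four execution modes listed above.
MODES = ("stats", "write", "mutate", "stream")

def procedures(algorithm):
    """Build the eight procedure names for one algorithm, following the pattern above."""
    names = [f"gds.{algorithm}.{mode}" for mode in MODES]
    return names + [f"{name}.estimate" for name in names]

for name in procedures("wcc"):
    print(name)
```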

A result-stream out of an algorithm
is quite like the printouts we used
to get at work. Nobody ever looked
at the things and they end up as
drawing paper for the kids … ok, the
similarity stopped a bit before that
point, but you get what I mean.

Yes, that is how that is spelled, it's not Segway, that's one of those weird
electrical devices that has you balance on two wheels ...
Anyway … have you ever wondered about how underused the
results of a machine learning pipeline often are? You've spent tons
of energy on learning something and then … it ends up on a four
coloured bar chart in Tableau?
So while we're on the topic … there's this thing called a Property
Graph that allows very flexible modeling of your data and would
happily take good care of your newly learned fact ...

One of the reasons I've been using the Graph Data Science library
right from the start (back when it was still called algo) is that it can
write back the results to the database.
Unsure who originally thought of that (I suspect it was by accident),
but it was a stroke of genius. And in order to corroborate that, I
have to talk about ...

Did you know about this monopartite and bipartite stuff? And how
it relates to analytics? I mean, know before you heard about it
today and had it spelled out to you?
All of you did? Wow … I'm superimpressed now ...
What has been impressing customers ever since we have Graph
Data Science is the unfailing (golden) combination of similarity
followed by community detection.
Similarity turns bipartite subgraphs into monopartite graphs.
Community detection then segments <whatever it is you want to
segment>. Kerching!

Has that become a not-PC sentence yet? It will soon no doubt ...
Writing similarity back (as a relationship) to a graph has some other
nice effects. Suddenly doing recommendations becomes a whole
lot easier. If you know (with a simple pointerhop) who is similar to
me … I'm sure you can find ways to tell me what I like.
Those relationships do clutter up the graph though. Wouldn't it be nice
if I could do the golden combination and only get the communities back
as properties?

It has taken a while to make my point but I wanted you to fully
understand why being able to mutate the in-memory workspace is
so useful. Now let us finish this session by putting it in practice ...
CALL gds.graph.create('house-bipartite',
['House','Person'],
{ BELONGS_TO: { type: 'BELONGS_TO', orientation: 'REVERSE'}});

CALL gds.nodeSimilarity.mutate('house-bipartite', {
similarityCutoff: 0.05,
mutateRelationshipType: 'SIMILAR',
mutateProperty: 'score'
});
CALL gds.louvain.write('house-bipartite', { writeProperty: 'community'});

MATCH (h:House)
WITH h.community as community, count(*) as members,
collect(h.name) as membernames
RETURN * ORDER BY members DESC LIMIT 10;
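To see why mutate followed by write keeps the stored graph clean, here is a small Python model of the workflow above. It is a sketch under loud assumptions: the data is invented, Jaccard overlap stands in for node similarity, and a simple connected-groups labelling stands in for Louvain. None of it is the GDS implementation.

```python
from itertools import combinations

# Hypothetical stored data: house properties plus (person, house) memberships.
database = {
    "house_properties": {h: {} for h in ("House A", "House B", "House C", "House D")},
    "belongs_to": [
        ("P1", "House A"), ("P1", "House B"),
        ("P2", "House A"), ("P2", "House B"),
        ("P3", "House C"), ("P3", "House D"),
    ],
}

# In-memory projection: house -> members (BELONGS_TO seen in reverse).
workspace = {"members": {}, "similar": []}
for person, house in database["belongs_to"]:
    workspace["members"].setdefault(house, set()).add(person)

def mutate_similarity(graph, cutoff):
    """Add SIMILAR pairs to the in-memory graph only (Jaccard as a stand-in)."""
    for a, b in combinations(sorted(graph["members"]), 2):
        left, right = graph["members"][a], graph["members"][b]
        score = len(left & right) / len(left | right)
        if score >= cutoff:
            graph["similar"].append((a, b, score))

def write_communities(graph, db):
    """Label groups of connected houses and store only that label in the database."""
    community = {house: house for house in graph["members"]}
    changed = True
    while changed:  # spread the smallest label until nothing changes
        changed = False
        for a, b, _ in graph["similar"]:
            low = min(community[a], community[b])
            if community[a] != low or community[b] != low:
                community[a] = community[b] = low
                changed = True
    for house, label in community.items():
        db["house_properties"][house]["community"] = label

mutate_similarity(workspace, cutoff=0.05)
write_communities(workspace, database)
print(database["house_properties"])
```

The SIMILAR pairs exist only in `workspace`; the `database` dictionary ends up with nothing but a community property per house, which is exactly the "communities back as properties, no clutter" outcome asked for earlier.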
