Internal links are an important way of telling search engines about your site's most important pages, but most of the internal linking structures we see aren't optimised to take advantage of this.
Here, our Head of Data, James Bardsley, dives into some of the techniques, from scraping to machine learning, that we use to identify internal linking problems and find solutions at scale for enterprise businesses including Expedia and Gumtree.
2. ENTERPRISE DIGITAL MARKETING & ANALYTICS |
JAMES BARDSLEY
HEAD OF DATA & ENGINEERING
HI MY NAME IS JAMES
● I’m the head of data and engineering at IMWT
● I’m based in locked down Melbourne
● Background in software engineering, where I worked on BI products.
● I’ve been working on large site internal linking projects for about 4 years
● Ask me about craft beer
3. WHAT ARE WE ACTUALLY DOING?
When we talk about optimising internal links,
what do we mean?
We mean we want to identify the most valuable
pages on the site, then create links in a way that…
● Maximises the number of links to the most
important pages.
● Minimises the click depth to the most
important pages.
● Maximises the quality of links to the most
important pages.
4. NUMBER OF LINKS
This one’s simple: more links should go to the
pages we care about than the pages we don’t.
5. CLICK DEPTH
This means that our most important pages should
be as few clicks from the homepage as possible.
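Click depth can be measured with a breadth-first search from the homepage. The following is a minimal sketch, assuming a hypothetical link graph stored as a dictionary of page to linked pages; the URLs and the `click_depths` helper are illustrative, not taken from the talk.

```python
from collections import deque

def click_depths(links, home):
    """Breadth-first search from the homepage; returns {url: minimum clicks}."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # the first visit is always the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical link graph: page -> pages it links to.
links = {
    "/": ["/au", "/about"],
    "/au": ["/au/melbourne"],
    "/au/melbourne": ["/au/melbourne-to-sydney"],
    "/about": [],
}
depths = click_depths(links, "/")
```

Pages that never appear in `depths` are unreachable from the homepage, which is usually worth reporting separately.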
6. QUALITY OF LINKS
This could mean a lot of things, but for the sake of
our work we consider a quality link to be a link
that:
● Comes from a page which is itself highly visible to Google
● Connects two pages whose content is relevant to one another
7. LUCKILY THERE’S A METRIC WHICH COMBINES ALL THESE
In order to get an idea of how well a page is linked in accordance with these three principles we measure PageRank.
PageRank is Google’s original ranking algorithm, and essentially follows the principle that important pages are likely to be linked to by other important pages. A quick PageRank FAQ…
Q) Isn’t PageRank dead?
A) No, it’s just no longer visible.
Q) Hasn’t PageRank evolved so much you can no longer
accurately calculate it yourself?
A) Kind of, yes. But we don’t expect to be 100% accurate -
again, we’re using it as a composite measure. We’ve also
found “traditional” PageRank has a statistically significant
positive linear correlation with crawl rate.
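For reference, "traditional" PageRank can be approximated with a simple power iteration. This is a minimal sketch of the textbook algorithm, not the in-house implementation; the damping factor of 0.85, the iteration count and the four-page graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Classic PageRank by power iteration over {page: [linked pages]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for source, targets in links.items():
            for target in targets:
                # Each link passes on an equal share of its source page's rank.
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

# Hypothetical four-page site.
links = {
    "/": ["/city-a", "/city-b"],
    "/city-a": ["/", "/route-a-b"],
    "/city-b": ["/", "/route-a-b"],
    "/route-a-b": ["/"],
}
rank = pagerank(links)
```

The scores sum to 1, so each value can be read as that page's share of the site's internal PageRank.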
9. ARE YOU SICK OF NOT BEING ABLE TO GET A LLAMA ON DEMAND?
Introducing LlamasToHome.com!
Llamas To Home is your one-stop shop for all your
llama needs. And here it is:
10. LlamasToHome: FIVE COUNTRIES DRIVE MY PROFIT
11. LlamasToHome: SEARCH TRAFFIC IS MOSTLY FOR CITY PAGES
Nearly 50% of our organic traffic comes from
people searching for llamas in particular cities.
12. GROUPING ENTITIES
Before we begin gathering data we need to think
about the groups our entities fall into that we may
want to report on (for example, page types and countries).
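A later slide notes that URLs are usually grouped by page type with regular expressions. A minimal sketch of that idea follows, assuming a hypothetical URL scheme for LlamasToHome (two-letter country folders, `-to-` in route slugs); none of these patterns come from the original deck.

```python
import re

# Hypothetical URL patterns; a real site needs its own rules.
# Order matters: the route pattern must be tried before the broader city pattern.
PAGE_TYPE_PATTERNS = [
    ("home", re.compile(r"^/$")),
    ("country", re.compile(r"^/[a-z]{2}/?$")),
    ("city_route", re.compile(r"^/[a-z]{2}/[a-z-]+-to-[a-z-]+/?$")),
    ("city", re.compile(r"^/[a-z]{2}/[a-z-]+/?$")),
]

def page_type(path):
    """Return the first page type whose pattern matches the URL path."""
    for name, pattern in PAGE_TYPE_PATTERNS:
        if pattern.match(path):
            return name
    return "other"
```

For example, `page_type("/au/melbourne-to-sydney")` falls into the route group, while anything unmatched is labelled `"other"` so gaps in the rules are easy to spot.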
13. WE HAVE A PREFERENCE FOR USING OUR OWN CRAWLER
There are commercial tools for crawling huge sites and some of them are great (we have a preference for Botify). However, when practicable we prefer to use a crawler developed in-house, because...
● We can join dimensions from external datasets (e.g. geographic dimensions) on the fly.
● We have easy access to all the data.
● It’s more specialised for adding dimensions to data at a super granular level (e.g. we can look at how much PageRank individual links distribute).
● The price is right.
14. WHICH PAGE TYPES RECEIVE PAGERANK?
Probably the simplest question we can ask is
“which page types are we flowing our PageRank
to?”
We can pretty quickly see that, despite Llama
cities being our most important page type, most
of our PageRank goes to city routes:
15. DIAGNOSING THE PROBLEM
Using our data we can check which page types flow most of this PageRank to the route pages. We can see that it’s largely other route pages. In fact, more than 45% of the site’s total PageRank flows from route pages to other route pages.
By having these pages link to each other at random we’ve created “crosslink subnetworks” which link within themselves.
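One way to arrive at a figure like this is to attribute PageRank to individual links and then sum it by source and target page type. The sketch below assumes each link passes on an equal share of its source page's PageRank; the crawl rows and PageRank values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical crawl output: (source URL, source type, target URL, target type).
links = [
    ("/au/melbourne", "city", "/au/melbourne-to-sydney", "route"),
    ("/au/melbourne-to-sydney", "route", "/au/sydney-to-perth", "route"),
    ("/au/melbourne-to-sydney", "route", "/au/melbourne", "city"),
    ("/au/sydney-to-perth", "route", "/au/melbourne-to-sydney", "route"),
]
# Hypothetical, precomputed PageRank per URL.
rank_by_url = {
    "/au/melbourne": 0.2,
    "/au/melbourne-to-sydney": 0.5,
    "/au/sydney-to-perth": 0.3,
}

def flow_by_type_pair(links, rank_by_url):
    """Share of link-level PageRank flow for each (source type, target type) pair."""
    out_degree = defaultdict(int)
    for source, _, _, _ in links:
        out_degree[source] += 1
    flow = defaultdict(float)
    for source, source_type, _, target_type in links:
        # Each link passes on an equal share of its source page's PageRank.
        flow[(source_type, target_type)] += rank_by_url[source] / out_degree[source]
    total = sum(flow.values())
    return {pair: value / total for pair, value in flow.items()}

shares = flow_by_type_pair(links, rank_by_url)
```

In this toy data the `("route", "route")` pair takes the largest share, which is the pattern the slide describes.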
16. SUGGESTED ACTIONS
Now that we know our creation of sub-networks
is a huge drain on our PageRank distribution we
can begin to think of solutions and create actions
based on these.
Perhaps we should cut down on the sub-networks
by not linking to routes in cities with under 1
million people, as those are likely to be less
popular:
Action 1: Do not link to routes between cities of
under one million people.
17. WHICH COUNTRIES RECEIVE PAGERANK?
Next we want to look at our geographic
distribution of PageRank. I can see that it doesn’t
line up at all with the countries where I make my
money. Instead it seems to be more lined up with
the countries with high populations:
18. DIAGNOSING THE PROBLEM
Using the data we’ve gathered we can ask the
question: “what are the links to pages for China?”
and group them together by the module they
belong to.
We can see that the major culprits for linking to
China are fromRoute, toRoute and our crosslinks:
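The grouping itself is a simple count once each crawled link carries its module and the country of its target page. A minimal sketch, assuming hypothetical crawl rows (the `breadcrumb` module and the row counts are invented; `fromRoute`, `toRoute` and `crosslinks` are the module names from the slide):

```python
from collections import Counter

# Hypothetical crawl rows: (link module, country of the target page).
links = [
    ("fromRoute", "China"), ("toRoute", "China"), ("crosslinks", "China"),
    ("crosslinks", "China"), ("breadcrumb", "China"), ("fromRoute", "Australia"),
]

def links_by_module(links, country):
    """Count links pointing at pages for one country, grouped by link module."""
    return Counter(module for module, target_country in links
                   if target_country == country)

china = links_by_module(links, "China")
```

`china.most_common()` then lists the modules from the biggest contributor down.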
19. SUGGESTED ACTIONS
Based on this knowledge we can begin to suggest
actions to improve the logic behind the link
modules with issues. We’ve already planned an
improvement to the route modules, so we’ll focus
on the crosslinking module. Perhaps we can…
Action 2: Prioritise my top 5 markets when
generating crosslinks.
20. IMPLEMENTING CHANGES
Based on my link data I’ve very quickly been able
to come up with two actions:
1. Do not link to routes between cities of under
one million people.
2. Prioritise my top 5 markets when generating
my crosslinks.
Let’s implement them, re-crawl and see how this
affects our site...
21. DISTRIBUTION ACROSS COUNTRIES LOOKS BETTER!
We still lose PageRank to China, but generally our
distribution by country looks better for our target
markets.
22. DISTRIBUTION ACROSS PAGE TYPES A BIT BETTER, TOO
City-to-City routes still take more PageRank than
we’d like, but it’s an improvement.
23. ITERATIVE IMPROVEMENT HAS HIGH POTENTIAL COST
We’ve seen how changing our internal linking
logic resolved some of our issues. However, it
wasn’t a perfect solution and we’ve now created a
bug in our crosslinking:
Implementing changes, then re-crawling and re-analysing, gets expensive quickly,
both in time and in the resources required.
24. WE BUILT A “SIMULATOR” IN ORDER TO SUPPORT RAPID PROTOTYPING
To solve this problem we built a tool that we use
to quickly test how removing and generating links
with given logic will affect the state of the site.
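The talk does not show how the simulator works, so the following is only a sketch of the general idea: apply a link rule to a crawled edge list, recompute PageRank, and compare the share each page type receives. The five-page site, the populations and all function names are hypothetical; the rule encodes Action 1.

```python
# Hypothetical mini-site: URL -> page type.
PAGES = {
    "/": "home",
    "/melbourne": "city",
    "/sydney": "city",
    "/melbourne-to-sydney": "route",
    "/smallville-to-tinytown": "route",
}
# Hypothetical population of the smaller city on each route.
ROUTE_MIN_POPULATION = {"/melbourne-to-sydney": 5_000_000,
                        "/smallville-to-tinytown": 20_000}
EDGES = [
    ("/", "/melbourne"), ("/", "/sydney"),
    ("/melbourne", "/melbourne-to-sydney"), ("/melbourne", "/smallville-to-tinytown"),
    ("/sydney", "/melbourne-to-sydney"), ("/sydney", "/smallville-to-tinytown"),
    ("/melbourne-to-sydney", "/smallville-to-tinytown"), ("/melbourne-to-sydney", "/"),
    ("/smallville-to-tinytown", "/melbourne-to-sydney"), ("/smallville-to-tinytown", "/"),
]

def pagerank(pages, edges, damping=0.85, iterations=100):
    """Textbook power-iteration PageRank over an edge list."""
    out = {p: [] for p in pages}
    for source, target in edges:
        out[source].append(target)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not out[p])
        new_rank = {p: (1 - damping + damping * dangling) / n for p in pages}
        for source, targets in out.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

def share_by_type(rank):
    """Sum PageRank per page type."""
    shares = {}
    for url, value in rank.items():
        shares[PAGES[url]] = shares.get(PAGES[url], 0.0) + value
    return shares

def keep_link(source, target):
    """Action 1: drop links to routes between cities of under one million people."""
    return ROUTE_MIN_POPULATION.get(target, 1_000_000) >= 1_000_000

def simulate(edges, rule):
    """Return the edge list that would exist if the rule were live."""
    return [(s, t) for s, t in edges if rule(s, t)]

before = share_by_type(pagerank(PAGES, EDGES))
after = share_by_type(pagerank(PAGES, simulate(EDGES, keep_link)))
```

Comparing `before` and `after` gives a quick read on whether a rule moves PageRank towards the page types we care about, without a deploy and re-crawl.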
26. MACHINE LEARNING APPLICATIONS FOR INTERNAL LINKING OPTIMISATION
After implementing quick wins, things begin to
get harder.
This has brought us to the work we’re currently
doing:
We’re using machine learning to identify pages
which are likely to benefit from receiving
additional internal links.
27. IDENTIFYING PAGES THAT HAVE RANKING POTENTIAL
We know that internal PageRank, click depth and
other variables we’ve looked at so far are ranking
signals. But they’re just a few of many.
We also have to consider content quality,
external backlinks, competitor data, the page’s
relevance to the keywords it’s ranking for… the
list goes on.
28. IDENTIFYING PAGES WITH RANKING POTENTIAL CONT.
A current project involves training a machine learning model to
predict how likely a page is to rank in Google’s top 10 results
based on all the factors we know about it, then testing to see if
the likelihood will increase if we change that page’s internal
PageRank.
A simple way to think of this is that we’re creating a formula
that PageRank can be plugged into. We then test different
PageRanks to see how the output changes…
Pr(ranks) = (RFx * 4) + (RFy * 2) - (RFz * 3) * (INTERNAL PAGERANK)
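The "plug PageRank in and see what changes" step can be sketched as a sweep over candidate PageRank values. The code below swaps the slide's illustrative formula for a simple logistic function so the output is a probability; the weights, bias and feature values are invented stand-ins for a trained model, and `RFx`, `RFy` and `RFz` are the slide's placeholder ranking factors.

```python
import math

# Hypothetical weights standing in for a trained model.
WEIGHTS = {"RFx": 4.0, "RFy": 2.0, "RFz": -3.0, "internal_pagerank": 5.0}
BIAS = -6.0

def probability_of_ranking(features):
    """Logistic score: probability the page ranks in the top 10."""
    score = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

def sweep_pagerank(features, candidates):
    """Hold every other factor fixed and vary only internal PageRank."""
    return {pr: probability_of_ranking({**features, "internal_pagerank": pr})
            for pr in candidates}

# Hypothetical page with normalised feature values.
page = {"RFx": 0.8, "RFy": 0.5, "RFz": 0.2, "internal_pagerank": 0.1}
results = sweep_pagerank(page, [0.1, 0.3, 0.5])
```

Pages whose predicted probability rises sharply across the sweep are the ones most likely to benefit from additional internal links.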
29. ALLOCATING PAGERANK DISTRIBUTION
An interesting side problem that comes along
with this approach is a resource allocation
problem - we have a limited amount of internal
PageRank to distribute and want to maximise its
distribution to the pages that will actually benefit
from it.
An analogy I like to use for this problem is giving
out money for treats...
30. PAGERANK DISTRIBUTION MODELLED BY CONFECTION FUND ALLOCATION
Imagine we have five friends who all want to buy
some treats. We have a spare $10, and would like
to help them buy these treats.
For some reason, though, they refuse to tell us
how much each delicious delight will cost…
Friend    Wants A...
Paul      Jaffa Cake
Cecilia   Snickers
Ben       Croissant
Kirsty    Tim Tam
Freddy    Eclair
31. TREAT FUND ALLOCATION 2: THE SEEING
Luckily we’ve been blessed with eyes that we can use to tell
whether each friend thinks they’ve received enough money
to buy their treat. If they’re happy, they think they can
afford it. If they’re sad, they don’t.
In the real world our machine learning model effectively
acts as our eyes, telling us whether a page is given enough
internal PageRank to rank.
Friend    Wants A...   Money  Face
Paul      Jaffa Cake   $2     😊
Cecilia   Snickers     $2     😞
Ben       Croissant    $2     😊
Kirsty    Tim Tam      $2     😞
Freddy    Eclair       $2     😞
32. TREAT FUND ALLOCATION 3: REVENGE OF THE EYES
Now we need to find a method to distribute our money in a
way that results in as many friends as possible receiving
their treats.
This is a complex problem - if we take $1 from Paul and Ben
and give those $2 to Cecilia it’s possible all three of them
will get their treat… but it’s also possible none of them will.
Friend    Wants A...   Money  Face
Paul      Jaffa Cake   $1     😞
Cecilia   Snickers     $4     😞
Ben       Croissant    $1     😞
Kirsty    Tim Tam      $3     😊
Freddy    Eclair       $1     😞
33. TREAT FUND ALLOCATION 4: THE MAD GENETICIST
This problem becomes far more complex when we
scale it out to the potentially millions of pages we need to
allocate internal PageRank to.
We’ve had great success solving this at scale through the
use of a genetic algorithm that optimises towards positive
results for as many pages as possible - also accounting for
the fact that some treats remain unattainable.
Friend    Wants A...   Money  Face
Paul      Jaffa Cake   $2     😊
Cecilia   Snickers     $0     😞
Ben       Croissant    $2     😊
Kirsty    Tim Tam      $3     😊
Freddy    Eclair       $3     😊
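To make the treat analogy concrete, here is a toy genetic algorithm for the five-friend problem. It is a sketch under stated assumptions, not the production approach: the hidden prices are invented to be consistent with the faces on the slides, the optimiser only ever sees the happy count, and the mutation, crossover and elitism choices are generic textbook ones.

```python
import random

FRIENDS = ["Paul", "Cecilia", "Ben", "Kirsty", "Freddy"]
BUDGET = 10
# Hypothetical hidden prices; the optimiser never reads them directly.
_HIDDEN_COST = {"Paul": 2, "Cecilia": 6, "Ben": 2, "Kirsty": 3, "Freddy": 3}

def happy_count(allocation):
    """The 'eyes': how many friends look happy with this allocation."""
    return sum(amount >= _HIDDEN_COST[f] for f, amount in zip(FRIENDS, allocation))

def mutate(allocation, rng):
    """Move $1 from one friend to another, keeping the total fixed."""
    child = list(allocation)
    giver = rng.choice([i for i, amount in enumerate(child) if amount > 0])
    receiver = rng.choice([i for i in range(len(child)) if i != giver])
    child[giver] -= 1
    child[receiver] += 1
    return child

def crossover(a, b, rng):
    """Join a prefix of one parent to a suffix of the other, then repair the total."""
    cut = rng.randrange(1, len(a))
    child = a[:cut] + b[cut:]
    while sum(child) > BUDGET:
        child[rng.choice([i for i, amount in enumerate(child) if amount > 0])] -= 1
    while sum(child) < BUDGET:
        child[rng.randrange(len(child))] += 1
    return child

def evolve(generations=200, population_size=30, seed=0):
    rng = random.Random(seed)
    population = [[2, 2, 2, 2, 2]]  # start from the even split shown on the slide
    while len(population) < population_size:
        population.append(mutate(population[-1], rng))
    for _ in range(generations):
        population.sort(key=happy_count, reverse=True)
        parents = population[: population_size // 2]  # elitism: the best half survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=happy_count)

best = evolve()
```

Because the best half of each generation always survives, the result can never be worse than the even split it started from. In the real problem the happy count would be replaced by the model's predictions for each page.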
34. OTHER AREAS WE’RE USING ML TO MAKE LINKS BETTER
● Grouping URLs by page type. Usually we use
regular expressions, but for sites with lots of
page types they’re pretty tedious and time
consuming.
● Identifying the URLs which are most relevant
to the page we’re currently on - geographically
or semantically or otherwise.
35. INTERNAL LINKING OPTIMISATION REALLY WORKS
● We consistently see it perform well in analyses of
ranking signals (e.g. by Moz)
● John Mueller pretty much explicitly states its
importance:
“...[if] from the homepage it takes multiple clicks to actually get to one of these stores, then
that makes it a lot harder for us to understand that these stores are actually pretty
important.
On the other hand, if it’s one click from the home page to one of these stores then that tells
us that these stores are probably pretty relevant, and that probably we should be giving
them a little bit of weight in the search results as well...”
● We’ve seen it for ourselves! We’ve seen statistically
significant uplifts of sessions in the range of 10%-20%
after improving internal links on large sites.