Last month for our Data-Driven Digital community webinar, we spoke about Internal Linking Optimisation with our Head of Data, James Bardsley. Internal links are an important way of telling search engines about your site’s most important pages, but most of the internal linking structures we see aren’t optimised to take advantage of this.
In the webinar, James talks about some of the techniques, from scraping to machine learning, that we use to identify internal linking problems and find solutions at scale for enterprise businesses including Expedia and Gumtree.
Watch the webinar below on internal linking optimisation or read on to get the slides and more info.
Data-Driven Internal Linking Optimisation
When we talk about optimising internal links, what do we mean?
We mean we want to identify the most valuable pages on the site, then create links in a way that…
- Maximises the number of links to the most important pages
- Minimises the click depth to the most important pages
- Maximises the quality of links to the most important pages
Number of Internal Links
This one’s simple: more links should go to the pages we care about than the pages we don’t.
Click Depth
Our most important pages should be as few clicks away from the homepage as possible.
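Click depth is straightforward to measure once a crawl is available. Here is a minimal sketch, assuming the crawl has been reduced to a dictionary mapping each page to the pages it links to. The URLs and the `click_depths` helper are made up for illustration.

```python
from collections import deque

def click_depths(links, home):
    """Breadth-first search from the homepage: depth = fewest clicks needed."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is always the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical toy site: page -> pages it links to.
links = {
    "/": ["/uk", "/us"],
    "/uk": ["/uk/london"],
    "/us": [],
    "/uk/london": ["/uk/london-to-leeds"],
}
depths = click_depths(links, "/")
```

Pages that never appear in `depths` cannot be reached from the homepage at all, which is usually worth flagging separately.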
Quality of Links
This could mean a lot of things, but for the sake of our work we consider a quality link to be a link that:
- Comes from a page which is itself highly visible to Google
- Has content relevant to the page we’re linking from
Luckily, there’s a metric which combines all of these
In order to get an idea of how well a page is linked in accordance with these three principles (number of internal links, click depth, and quality of links), we measure PageRank.
PageRank is Google’s original ranking algorithm, and it essentially follows the principle that important pages are likely to be linked to by other important pages.
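To make the principle concrete, here is a minimal sketch of "traditional" PageRank computed by power iteration over internal links only. The damping factor of 0.85, the iteration count and the toy link graph are assumptions for illustration. This is not a description of what Google runs today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank over a dict of page -> list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly across the site.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each link passes on an equal share of its source page's rank.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical toy site: "/a" is linked from three pages, "/c" from none.
links = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/", "/a"], "/c": ["/a"]}
ranks = pagerank(links)
```

In this toy graph, "/a" ends up with more PageRank than "/b" because it receives links from more pages, and "/c" ends up with the least because nothing links to it.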
A quick PageRank FAQ:
Q) Isn’t PageRank dead?
A) No, it’s just no longer visible.
Q) Hasn’t PageRank evolved so much you can no longer accurately calculate it yourself?
A) Kind of, yes. But we don’t expect to be 100% accurate – again, we’re using it as a composite measure. We’ve also found “traditional” PageRank has a statistically significant positive linear correlation with crawl rate.
We’ve found that traditional PageRank is still a good measure of how well a page will perform and how important Google thinks it is. In multiple tests, we’ve correlated traditional PageRank with GoogleBot’s crawl rate for each page, since we think crawl rate is a pretty good proxy for how important Google considers a page. We always find a statistically significant positive linear correlation. In other words, the PageRank we calculate tracks how important Google thinks a page is, so it’s still a valuable metric for us to use.
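As a rough sketch of that kind of check, the snippet below computes a Pearson correlation coefficient between calculated PageRank and crawl rate. The numbers are invented purely to show the calculation and are not results from any of our tests. A real analysis would also need a significance test over many more pages.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values, one entry per page: calculated PageRank and GoogleBot hits per day.
internal_pagerank = [0.02, 0.05, 0.10, 0.20, 0.40]
googlebot_hits_per_day = [1, 3, 4, 9, 15]
r = pearson_r(internal_pagerank, googlebot_hits_per_day)
```

A value of `r` close to 1 indicates a strong positive linear relationship. In practice the crawl rate would come from server logs rather than a hand-typed list.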
Internal Linking Case Study
For the case study, we recommend watching the video, starting at 8:40, to follow it in full.
This is an example of why and how to optimise your internal linking structure.
Internal Linking Optimisation Example
Are you sick of not being able to get a Llama on demand? Introducing llamastohome.com
Let’s say that 5 countries drive the vast majority of profit for llamastohome.com
Nearly 50% of our organic traffic comes from people searching for llamas in particular cities.
Before we begin gathering data we need to think about the different groups our entities fall in that we may want to report on.
We Use Our Own Crawler
There are commercial tools for crawling huge sites and some of them are great (we have a preference for Botify). However, when practicable, we prefer to use a crawler developed in-house. This is because our own crawler…
- Can join dimensions from external datasets (e.g. geographic dimensions) on the fly
- Gives us easy access to all the data
- Is more specialised for adding dimensions to data at a super granular level (e.g. we can look at how much PageRank individual links distribute)
- Comes at the right price: there is no additional cost to us or our clients if we’re using our own crawler
Which Page Types Receive PageRank?
Probably the simplest question we can ask is:
“Which page types are we flowing our PageRank to?”
We can pretty quickly see that, despite Llama cities being our most important page type, most of our PageRank goes to city routes.
Diagnosing the Problem
Using our data, we can check which page types pass most of this PageRank to the route pages. It’s largely other route pages: in fact, more than 45% of the site’s total PageRank flows from route pages to other route pages.
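One way to get at this kind of breakdown is to attribute PageRank to individual links and then group by the page types at each end. The sketch below assumes each link passes an equal share of its source page's damped PageRank. The graph, ranks and page types are hypothetical, and `flow_by_page_type` is an illustrative helper rather than our crawler's actual API.

```python
from collections import defaultdict

def flow_by_page_type(links, rank, page_type, damping=0.85):
    """Sum the PageRank each link passes, grouped by (source type, target type)."""
    flow = defaultdict(float)
    for source, targets in links.items():
        for target in targets:
            passed = damping * rank[source] / len(targets)
            flow[(page_type[source], page_type[target])] += passed
    return dict(flow)

# Hypothetical inputs: a tiny link graph, precomputed ranks and page types.
links = {
    "/london": ["/london-to-leeds"],
    "/london-to-leeds": ["/leeds-to-london", "/london"],
    "/leeds-to-london": ["/london-to-leeds"],
}
rank = {"/london": 0.2, "/london-to-leeds": 0.5, "/leeds-to-london": 0.3}
page_type = {"/london": "city", "/london-to-leeds": "route", "/leeds-to-london": "route"}

flow = flow_by_page_type(links, rank, page_type)
route_to_route_share = flow[("route", "route")] / sum(flow.values())
```

Here `route_to_route_share` is the fraction of all passed PageRank that moves between route pages, the same kind of figure as the share quoted above.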
By having these pages link to each other at random we’ve created “crosslink subnetworks” which link within themselves.
Now that we know our creation of sub-networks is a huge drain on our PageRank distribution we can begin to think of solutions and create actions based on these.
Perhaps we should cut down on the sub-networks by not linking to routes in cities with under 1 million people, as those are likely to be less popular:
Action 1: Do not link to routes between cities of under one million people.
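As a sketch, Action 1 could be expressed as a simple filter over candidate route links. We assume here that a route is dropped only when both of its cities fall under the threshold. The function name, the city names and the rough population figures are illustrative.

```python
def keep_route_link(origin, destination, population, threshold=1_000_000):
    """Action 1 (as interpreted here): drop a route only if BOTH cities are small."""
    return population[origin] >= threshold or population[destination] >= threshold

# Rough, illustrative population figures.
population = {"London": 8_900_000, "Leeds": 800_000, "York": 200_000}
candidate_routes = [("London", "Leeds"), ("Leeds", "York"), ("York", "London")]

kept = [route for route in candidate_routes if keep_route_link(*route, population)]
```

If the intended rule is stricter (drop the route when either city is small), the `or` becomes an `and`.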
Which Countries Receive PageRank?
Next, we want to look at our geographic distribution of PageRank. I can see that it doesn’t line up at all with the countries where I make my money. Instead, it seems to be more lined up with countries with high populations:
Diagnosing the Problem
Using the data we’ve gathered we can ask the question: “what are the links to pages for China?” and group them together by the module they belong to.
We can see that the major culprits for linking to China are fromRoute, toRoute and our crosslinks.
Based on this knowledge we can begin to suggest actions to improve the logic behind the link modules with issues. We’ve already planned an improvement to the route modules, so we’ll focus on the crosslinking module. Perhaps we can…
Action 2: Prioritise my top 5 markets when generating crosslinks.
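A minimal sketch of Action 2, assuming the crosslink module picks a fixed number of links from a list of candidate pages. The market codes, URLs and `pick_crosslinks` helper are hypothetical.

```python
def pick_crosslinks(candidates, top_markets, limit=3):
    """Action 2: fill crosslink slots with top-market pages first."""
    # False sorts before True, and sorted() is stable, so top-market pages
    # come first and the original order is preserved within each group.
    ordered = sorted(candidates, key=lambda page: page["country"] not in top_markets)
    return ordered[:limit]

# Hypothetical top 5 markets and candidate pages.
top_markets = {"UK", "US", "FR", "DE", "ES"}
candidates = [
    {"url": "/cn/beijing", "country": "CN"},
    {"url": "/uk/london", "country": "UK"},
    {"url": "/in/delhi", "country": "IN"},
    {"url": "/us/boston", "country": "US"},
]
chosen = pick_crosslinks(candidates, top_markets)
```

Pages outside the top markets still get linked when slots are left over, so they are deprioritised rather than cut off entirely.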
Based on my link data I’ve very quickly been able to come up with two actions:
- Do not link to routes between cities of under one million people
- Prioritise my top 5 markets when generating my crosslinks
Let’s implement them, re-crawl, and see how this affects our site…
Distribution Across Countries Looks Better!
We still lose PageRank to China, but generally, our distribution by country looks better for our target markets.
Distribution Across Page Types a Bit Better Too
City-to-City routes still take more PageRank than we’d like, but it’s an improvement.
Iterative Improvement Has High Potential Cost
We’ve seen how changing our internal linking logic resolved some of our issues. However, it wasn’t a perfect solution and we’ve now created a bug in our crosslinking.
Implementing changes, then re-crawling and re-analysing, gets expensive quickly, both in time and in the resources required.
We Built a Simulator in Order to Support Rapid Prototyping
To solve this problem we built a tool that we use to quickly test how removing and generating links with given logic will affect the state of the site.
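The idea behind such a simulator can be sketched in a few lines: take the crawled link graph, apply a candidate rule in memory, and recompute PageRank without touching the live site. The code below is a toy illustration under simplifying assumptions (a basic PageRank, a made-up graph, a rule that removes route-to-route links). It is not our actual tool.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank over a dict of page -> list of pages it links to."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

def simulate(links, keep_link):
    """Return PageRank before and after dropping links rejected by keep_link."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    # Every page stays in the graph, even if all links to or from it are removed.
    pruned = {p: [t for t in links.get(p, []) if keep_link(p, t)] for p in pages}
    return pagerank(links), pagerank(pruned)

# Hypothetical site and rule: stop route pages linking to other route pages.
links = {
    "/": ["/city", "/route-a"],
    "/city": ["/"],
    "/route-a": ["/route-b", "/"],
    "/route-b": ["/route-a", "/"],
}
is_route = lambda url: url.startswith("/route")
before, after = simulate(links, lambda s, t: not (is_route(s) and is_route(t)))
```

Comparing `before` and `after` page by page, or aggregated by page type or country, shows the effect of a rule in seconds rather than after a full release and re-crawl.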
Moving Beyond Manual Analysis
Machine Learning Applications for Internal Linking Optimisation
After implementing quick wins, things begin to get harder.
This has brought us to the work we’re currently doing:
We’re using machine learning to identify pages that are likely to benefit from receiving additional internal links.
Identifying Pages that Have Ranking Potential
We know that internal PageRank, click depth and other variables we’ve looked at so far are ranking signals. But they’re just a few of many.
We also have to consider content quality, external backlinks, competitor data, the page’s relevance to the keywords it’s ranking for… the list goes on.
A current project we’re working on involves training a machine learning model to predict how likely a page is to rank in Google’s top 10 results based on all the factors we know about it, then testing to see if the likelihood will increase if we change that page’s internal PageRank.
A simple way to think of this is that we’re creating a formula that PageRank can be plugged into. We then test different PageRanks to see how the output changes…
Pr(ranks) = (RFx * 4) + (RFy * 2) - (RFz * 3) * (INTERNAL PAGERANK)
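The formula above is only illustrative. As a sketch of "plugging in different PageRanks", the snippet below swaps in a logistic function with invented weights, which is one common way to turn a weighted sum into a probability. The feature names, weights and bias are all assumptions, not our trained model.

```python
from math import exp

def rank_probability(features, weights, bias):
    """Logistic model: squash a weighted sum of ranking factors into (0, 1)."""
    score = bias + sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + exp(-score))

# Invented weights and an invented page, for illustration only.
weights = {"content_quality": 2.0, "backlinks": 1.0, "internal_pagerank": 3.0}
bias = -3.0
page = {"content_quality": 0.6, "backlinks": 0.5, "internal_pagerank": 0.1}

# Hold everything else fixed and vary only the internal PageRank.
sweep = {
    pr: rank_probability({**page, "internal_pagerank": pr}, weights, bias)
    for pr in (0.1, 0.2, 0.4, 0.8)
}
```

Pages whose predicted probability moves a lot across the sweep are the ones most likely to benefit from extra internal links. Pages where it barely moves are held back by something else.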
Allocating PageRank Distribution
An interesting side problem that comes along with this approach is a resource allocation problem – we have a limited amount of internal PageRank to distribute and want to maximise its distribution to the pages that will actually benefit from it.
An analogy I like to use for this problem is giving out money for treats…
Imagine we have five friends who all want to buy some treats. We have a spare $10 and would like to help them buy these treats.
For some reason, though, they refuse to tell us how much each delicious delight will cost…
Luckily we’ve been blessed with eyes that we can use to tell whether each friend thinks they’ve received enough money to buy their treat. If they’re happy, they think they can afford it. If they’re sad, they don’t.
In the real world, our machine learning model effectively acts as our eyes, telling us whether a page is given enough internal PageRank to rank.
Now we need to find a method to distribute our money in a way that results in as many friends as possible receiving their treats.
This is a complex problem – if we take $1 from Paul and Ben and give those $2 to Cecilia it’s possible all three of them will get their treat… but it’s also possible none of them will.
This problem becomes far more complex when we scale it out to the potentially millions of pages we need to allocate internal PageRank to.
We’ve had great success solving this at scale through the use of a genetic algorithm that optimises towards positive results for as many pages as possible – also accounting for the fact that some treats remain unattainable.
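To give a flavour of the approach, here is a toy genetic algorithm for the treats analogy. The hidden costs, population size, crossover and mutation choices are all assumptions made for this sketch. The optimiser never reads the costs directly, only the count of happy friends, which stands in for the model's feedback.

```python
import random

def happy_count(allocation, hidden_costs):
    """Stand-in for the model: count friends whose share covers their treat."""
    return sum(a >= c for a, c in zip(allocation, hidden_costs))

def repair(allocation, budget, rng):
    """Add or remove single dollars at random until the budget is spent exactly."""
    allocation = list(allocation)
    while sum(allocation) != budget:
        i = rng.randrange(len(allocation))
        if sum(allocation) < budget:
            allocation[i] += 1
        elif allocation[i] > 0:
            allocation[i] -= 1
    return allocation

def evolve(hidden_costs, budget, generations=60, size=30, seed=0):
    rng = random.Random(seed)
    n = len(hidden_costs)

    def fitness(allocation):
        return happy_count(allocation, hidden_costs)

    population = [repair([0] * n, budget, rng) for _ in range(size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: size // 2]  # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < size:
            mum, dad = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(mum, dad)]  # uniform crossover
            giver, taker = rng.sample(range(n), 2)  # mutation: move $1
            if child[giver] > 0:
                child[giver] -= 1
                child[taker] += 1
            children.append(repair(child, budget, rng))
        population = parents + children
    return max(population, key=fitness)

# Hypothetical treat prices; with $10 at most three friends can be satisfied.
hidden_costs = [2, 3, 4, 6, 8]
budget = 10
best = evolve(hidden_costs, budget)
```

The real problem is harder than this toy, because PageRank is redistributed through links rather than handed out directly, but the loop of selecting, recombining, mutating and re-scoring is the same idea.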
Other Areas We’re Using Machine Learning to Make Links Better
- Grouping URLs by page type. Usually, we use regular expressions, but for sites with lots of page types they’re pretty tedious and time-consuming.
- Identifying the URLs which are most relevant to the page we’re currently on – geographically or semantically or otherwise.
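For context, the regular-expression baseline mentioned in the first bullet might look like the sketch below. The URL patterns are invented for the llamastohome.com example. Every new page type needs another hand-written pattern, which is why this becomes tedious on sites with lots of page types.

```python
import re

# Hypothetical URL patterns for the example site; the first match wins,
# so more specific patterns are listed before more general ones.
PAGE_TYPE_PATTERNS = [
    ("route", re.compile(r"^/[a-z]{2}/[a-z-]+-to-[a-z-]+/?$")),
    ("city", re.compile(r"^/[a-z]{2}/[a-z-]+/?$")),
    ("country", re.compile(r"^/[a-z]{2}/?$")),
    ("home", re.compile(r"^/$")),
]

def page_type(path):
    """Return the first page type whose pattern matches the URL path."""
    for name, pattern in PAGE_TYPE_PATTERNS:
        if pattern.search(path):
            return name
    return "other"

types = {p: page_type(p) for p in ["/", "/uk", "/uk/london", "/uk/london-to-leeds", "/blog/post-1"]}
```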
Internal Linking Optimisation Really Works
We consistently see it perform well in analyses of ranking signals (e.g. by Moz).
John Mueller pretty much explicitly states its importance:
“…[if] from the homepage it takes multiple clicks to actually get to one of these stores, then that makes it a lot harder for us to understand that these stores are actually pretty important.
On the other hand, if it’s one click from the home page to one of these stores then that tells us that these stores are probably pretty relevant, and that probably we should be giving them a little bit of weight in the search results as well…”
We’ve seen it for ourselves! We’ve seen statistically significant uplifts in sessions of 10% to 20% after improving internal links on large sites.