Building a Solid Index Presence by Optimizing your Crawl Budget

A website’s crawl budget and its effective utilization can have a tremendous effect on the success of the site in achieving visibility in the SERPs (search engine results pages). Simply explained, the crawl budget is the number of pages a search engine will crawl each time it visits a site.

Optimizing for a site’s crawl budget can help direct the search bots to crawl and index the pages that will have the most beneficial effect on rankings. While the search engines’ algorithms are ultimately in control of what pages will be indexed, we can ensure we’re sending the right signals.

Large websites, such as ecommerce or news sites with diverse content, can sometimes experience particular difficulty ranking for specific search terms, due to crawl limitations. This can happen when a fresh blog post is targeting the same terms as older pages, resulting in some pages being indexed slowly, or in extreme circumstances, not at all.

Linking from the blog post to relevant product or category pages, of course, is a standard practice to help overcome this issue. It has the added benefit of providing more information to the users and establishing further relevance for the search engines.

There are several factors that come into play in the algorithms’ selection of pages to be crawled and indexed, and only the search engines know for certain what those factors are and how they are weighted. Yet, some factors appear to be important in that selection.

A few of those factors are: depth of navigation, PageRank, CTR (click-through rate) and freshness (QDF). For the purposes of this article, we’ll be looking mostly at QDF, both its impact and how to take advantage of it.

Freshness

A major part of the problem (which can also be part of the solution) can be Google’s QDF (Query Deserves Freshness) algorithm. Fresher content will often get attention before content that may be more important for ranking.

For instance, in the case of an ecommerce site that’s trying to rank for specific products, frequent blog entries can sometimes eclipse product pages, making it more difficult to rank those pages. The first step should be to determine whether your crawl budget is being misapplied.

Careful examination of a site’s visits by the search engines’ bots can quickly identify which pages are being crawled and how often. If you find that your newest (or most recently updated) pages are being crawled, while your more important pages are being neglected, some adjustment in your strategy is called for.
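
For those who want to go beyond eyeballing raw logs, a small script can do the tallying. The sketch below is only an illustration: it assumes a standard combined-format access log at a hypothetical path and takes the Googlebot user-agent string at face value (in practice, verify crawler hits via reverse DNS, since user agents can be spoofed).

```python
import re
from collections import Counter

# Tally how often each URL is requested by Googlebot in a combined-format
# access log. The log path is a placeholder; adjust it for your server.
LOG_FILE = "access.log"
LINE_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/')

crawled = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        match = LINE_RE.search(line)
        if match:
            crawled[match.group("path")] += 1

# The most and least frequently crawled URLs show where the crawl budget
# is actually being spent.
for path, hits in crawled.most_common(20):
    print(f"{hits:5d}  {path}")
```

A quick sort like this usually makes it obvious which pages are soaking up the budget and which are being starved.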

Another aspect of a lack of freshness may be that pages that go prolonged periods without being re-crawled seem to lose their authority. Whether that’s actually due to a lack of freshness or a lack of crawl activity is unknown, but there does seem to be a correlation.

Understanding the Index

There was a time that we could do a site: search on Google and see a fairly reliable estimate of how many results were in the main index and how many were in the supplemental index. That provided valuable information to webmasters.

However, in July of 2007, Google stopped labeling the supplemental results, preferring to simply roll them into the displayed results. Many SEOs and webmasters took this to mean that the supplemental index no longer exists, but that impression is mistaken.

In fact, when Google announced the change, they ended their post with this:

“The distinction between the main and the supplemental index is therefore continuing to narrow. Given all the progress that we’ve been able to make so far, and thinking ahead to future improvements, we’ve decided to stop labeling these URLs as “Supplemental Results.” Of course, you will continue to benefit from Google’s supplemental index being deeper and fresher.”

That plainly states that the supplemental index still exists… it’s simply not as obvious.

So, when do we see pages from the supplemental index in the SERPs? Basically, when there aren’t sufficient relevant results in the main index. We simply can’t differentiate between them as easily as we once could.

Herding the Bots to Where They’re Most Needed

While we can’t really control the bots, there are ways that we can steer them in the right direction. But first, let’s be clear about something: I’m not talking about PageRank sculpting. This isn’t done simply to try to manipulate PageRank.

The purpose of herding the bots toward the pages we most want crawled is to get that content into the index. That may be because of a new price, a new product or simply updated information.

The longer it has been since a page was last crawled, the closer it may be to being moved to the supplemental index. Ironically, as we generate new, valuable content, we can also be pushing old, equally valuable content toward it even faster.

Unfortunately, that process can seem agonizingly slow when we have several important pages that aren’t getting indexed because our crawl budget is exhausted. That’s the sort of situation where we want to speed things along a bit.

One of the major factors in determining crawl frequency is PageRank, so lower-ranked pages are more likely to be ignored by the bots. You can update old pages and ping the search engines, but that still doesn’t guarantee a re-crawl.

As you may have guessed by now, optimizing your crawl budget requires a balancing act. You need to identify the pages you most want updated in the index, as well as those that don’t particularly matter to you, and find a way to direct the bots where you want them to go.

Let’s use an example:

Say you have 1,000 pages on your site and determine that only 800 have been crawled. Your analysis shows that 300 of the pages in the main index are of little importance, while 125 very important pages are among the 200 that haven’t yet been indexed. There are a couple of different things you can do.
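
As a rough sketch of that first step, identifying the pages that haven’t been crawled, you can compare a full list of your URLs against the paths seen in your crawler logs. The file names below are hypothetical: one assumes a plain-text export of every page on the site (one path per line, perhaps from your sitemap or CMS), the other the output of the log analysis described earlier.

```python
# Spot the pages the bots have never visited by set difference.
with open("all_pages.txt", encoding="utf-8") as f:        # hypothetical site export
    all_paths = {line.strip() for line in f if line.strip()}

with open("crawled_paths.txt", encoding="utf-8") as f:    # from your log analysis
    crawled_paths = {line.strip() for line in f if line.strip()}

never_crawled = sorted(all_paths - crawled_paths)
print(f"{len(never_crawled)} of {len(all_paths)} pages have not been crawled:")
for path in never_crawled:
    print(path)
```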

First, you could return a 304 response for 125 or more of the stale pages that are of little benefit to you. However, this doesn’t guarantee anything, because presumably the search engines should already know that those pages haven’t been updated recently.

Updating the 125 pages that haven’t yet been indexed may push them up in the queue, but if your crawl budget is already spent before those pages are reached, this still won’t accomplish what you need, either.

So the key is to free up some crawl budget by letting the bots know, via a 304 response, that some of your pages haven’t changed recently and needn’t be fetched again. That will free the bots to dig a little deeper into your site.

If you have 100 indexed pages returning a 304, the bots will spot that immediately and won’t spend time re-fetching them. That means they can now crawl another 100 previously neglected pages for inclusion in the index.
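
Most web servers and CMS platforms can be configured to answer conditional requests for you, but to illustrate the mechanism, here is a minimal sketch using only the Python standard library and a hypothetical site/ document root. It compares the bot’s If-Modified-Since request header with the file’s modification time and returns a bodyless 304 Not Modified when nothing has changed.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

DOCROOT = Path("site")  # hypothetical document root


class ConditionalHandler(BaseHTTPRequestHandler):
    """Serves files, answering unchanged conditional GETs with a 304."""

    def do_GET(self):
        # Illustration only: no path sanitization or content-type handling.
        page = DOCROOT / self.path.lstrip("/")
        if not page.is_file():
            self.send_error(404)
            return

        last_modified = datetime.fromtimestamp(page.stat().st_mtime, timezone.utc)
        ims_header = self.headers.get("If-Modified-Since")
        if ims_header:
            try:
                since = parsedate_to_datetime(ims_header)
                # HTTP dates have one-second resolution, so drop microseconds.
                if last_modified.replace(microsecond=0) <= since:
                    self.send_response(304)  # unchanged: spend the budget elsewhere
                    self.end_headers()
                    return
            except (TypeError, ValueError):
                pass  # unparseable date: fall through and serve the page in full

        body = page.read_bytes()
        self.send_response(200)
        self.send_header("Last-Modified", format_datetime(last_modified, usegmt=True))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("", 8000), ConditionalHandler).serve_forever()
```

In practice you would let your server or CDN handle the conditional logic; the point is simply that an accurate Last-Modified header plus a 304 costs the crawler almost nothing.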

Exactly which 100 pages will be selected for addition is uncertain. As previously stated, freshness is a big factor, but PageRank is a factor as well. Hierarchical placement within the site may be another.

As you go through all the iterations to get the rest of your pages indexed, you can see indications of what is carrying the most weight in your situation and adjust your strategy accordingly. The number of iterations necessary may vary considerably from site to site.

Things that May or May Not have an Effect

Some people believe that simply providing an XML sitemap to the search engines is adequate notification of additions and updates to your site. That’s true for additions, but the search engines aren’t likely to heed the last modified date provided in the sitemap, as it’s too easily faked to be accepted as reliable.

There are some things that can assist in the discovery of new content, and possibly of updates, however. A Fetch as Google request on a new page appears to lead to rapid discovery. What is unclear is whether the last-modified data picked up in such a fetch is communicated to the regular crawl bots.

Pinging the search engines upon any change to a page seems to help with discovery as well, although it’s unlikely that they blindly accept that a change has taken place, as this could also be easily faked.
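
For completeness, here is a minimal sketch of such a ping, using the public sitemap-ping endpoints Google and Bing exposed at the time of writing. The sitemap URL is a placeholder, and a successful response only confirms the ping was received, not that a re-crawl will follow.

```python
from urllib.parse import quote
from urllib.request import urlopen

SITEMAP_URL = "http://www.example.com/sitemap.xml"  # placeholder

# Sitemap-ping endpoints as publicly documented at the time of writing.
PING_ENDPOINTS = (
    "http://www.google.com/ping?sitemap={}",
    "http://www.bing.com/ping?sitemap={}",
)

for endpoint in PING_ENDPOINTS:
    with urlopen(endpoint.format(quote(SITEMAP_URL, safe=""))) as response:
        print(endpoint.split("/")[2], response.status)
```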

One thing that is often neglected, but does seem to be taken into consideration by the search engines, is the priority assigned to pages in the XML sitemap. Many online XML sitemap generators plug in default values here, and don’t even offer an opportunity to edit them.

For non-coders, there are plugins for platforms like WordPress and Joomla that will automatically generate and update sitemaps, even submit them and ping the search engines. The best of these offer editable priorities.

Sadly, many webmasters don’t bother to set those priorities, missing a great opportunity. This is where you can give specific pages a little edge over their sister pages, with whom they compete for the bots’ attention.
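
If you prefer to roll your own rather than rely on a plugin, generating a sitemap with hand-set priorities is straightforward. The sketch below is only an illustration: the URLs, dates and weights are hypothetical, and priority is a relative hint between 0.0 and 1.0, not a directive the engines are obliged to follow.

```python
from datetime import date
from xml.sax.saxutils import escape

# Hypothetical pages, last-modified dates and hand-set priorities.
PAGES = [
    ("http://www.example.com/",                 date(2014, 3, 1), "1.0"),
    ("http://www.example.com/category/widgets", date(2014, 2, 20), "0.8"),
    ("http://www.example.com/blog/latest-post", date(2014, 3, 2), "0.4"),
]

entries = "\n".join(
    "  <url>\n"
    f"    <loc>{escape(url)}</loc>\n"
    f"    <lastmod>{modified.isoformat()}</lastmod>\n"
    f"    <priority>{priority}</priority>\n"
    "  </url>"
    for url, modified, priority in PAGES
)

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{entries}\n"
    "</urlset>\n"
)

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write(sitemap)
```

The key is to keep genuinely important pages noticeably above the routine ones, rather than marking everything 1.0.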

Building a Solid Index Presence

In order to be placed in the index and stay there, pages must possess a certain amount of PageRank. All other things being equal, pages with higher PageRank will normally be indexed first and remain longer.

PageRank is also the main determinant of how many pages a site will have included in the main index. The “overflow” pages – those not in the main index – will fall into the supplemental index, and from there, they’ll only be seen when there aren’t enough main-index results to serve a query.

There are other factors, however, that can send a page to the supplementals (or keep it there):

  • Pages that are deemed to be duplicate or highly similar content, in the absence of appropriate canonicalization;
  • Pages with thin content;
  • Orphaned pages, with no inbound links;
  • Pages subject to poor site navigation;
  • Keyword-stuffed pages;
  • Pages with little or no PageRank;
  • Pages with poorly structured URLs;
  • Error pages;
  • Pages suspected of spamdexing or linking to bad neighborhoods;
  • Old cached pages.

Summary

Returning 304 Not Modified responses to the search engines’ conditional (If-Modified-Since) requests can free up portions of your crawl budget for other, more important pages. Careful analysis of your logs can allow you to plan a strategy to gradually get all your pages into the index.

When that has been accomplished, you can then begin periodic updates to pages to keep your most important pages at the forefront. In summary:

  • Determine which pages are in the main index or haven’t been indexed;
  • Return 304 Not Modified responses to If-Modified-Since requests for pages that haven’t changed;
  • Implement an XML sitemap and set priorities;
  • Begin updating the pages that will yield the most benefit by being added to the main index;
  • Continue the above, until all your important pages have been indexed;
  • Begin periodically updating pages in order to keep your most important pages in the main index.

It’s important to mention that, just as with many SEO tasks, we can often find more than one viable method to achieve a goal. Using 304 response codes to optimize a crawl budget is just one technique.

If you’ve already done this and would like to share your experience, or if you have any questions or comments on the process, please let us know in the comments below.

John Britsios

Founder and Chief Information Officer (CIO) of SEO Workers and Chief Executive Officer (CEO) of Webnauts Net, a qualified Forensic SEO & Social Semantic Web Consultant, specializing in Semantic, Forensic & Technical Predictive Search Engine Optimization, Content Marketing, Web Content Accessibility, Usability Testing, Social Semantic Web based Responsive Web Design & Ecommerce Development, UX & Funnel Optimization, Conversion Rate Optimization.
