Gone are the days when you could easily hack search engines by loading a page with keywords and creating artificial backlinks.

Today, Google is consistently rolling out changes to its algorithms to reward quality.

Unlike in the past, low-quality pages on your website can now drag down your overall rankings.

What’s a low-quality page?

It’s a page that isn’t used or visited, is full of content duplicated from other pages, has thin content, or gets very low engagement in the eyes of Google. Some people call these “zombie pages.”

Is it okay to remove low-quality content like this? Yes!

Here’s the thing:

It’s entirely possible that you have dozens, hundreds, or thousands of low-quality pages on your site — in the eyes of Google — and you might not even realize it.

We call this problem index bloat.

It happens when Google has indexed a lot of URLs for your website that it views as low-quality.

In this article, we’ll show you:

  • An example of index bloat
  • Common causes
  • The exact steps you can take to see if you have a problem

Note: We can help you spot and fix issues on your website that are harming your overall ranking. Contact us here.

Index Bloat: A Real-life Example

We recently started working with an eCommerce client and discovered something fascinating (and troubling) as we did our standard checks to evaluate their site.

After talking to them, we expected the site to have somewhere around 10,000 pages.

When we looked in Google Webmaster Tools (now Google Search Console), we saw, to our surprise, that Google had indexed 38,000 pages for the website. You can find this chart in Search Console under Google Index > Index Status.

A real-life example of index bloat.

That was way too high for the size of the site.

We also saw that the number had risen dramatically.

In July of 2017, the site had only 16,000 pages indexed, according to Search Console.

What happened?

How a Hidden Technical Glitch Caused Massive Index Bloat

It took a while to figure out what had gone wrong with our client’s site.

Eventually, we found a problem in their software that was creating thousands of unnecessary product pages.

At a high level, any time the website sold out of its inventory for a brand (which happened often), the site’s pagination system created hundreds of new pages.

Put another way, the site had a technical glitch that was creating index bloat.

The company had no idea their site had this problem, which is typical: glitches like this tend to go unnoticed until someone audits the index.

For eCommerce sites that automatically generate new pages for products, brands, or categories, things like this can easily happen.

It’s one common cause of index bloat, but not the only one.

Types of “zombie” pages:

  • Archive pages
  • Tag pages (on WordPress)
  • Search results pages (mostly on eCommerce websites)
  • Old press release/event pages
  • Demo/boilerplate content pages
  • Thin landing pages (<50 words)
  • Pages with a query string in the URL (tracking URLs)
  • Images as pages (a Yoast bug)
  • Auto-generated user profiles
  • Custom post types
  • Case study pages
  • Thank-you pages
  • Individual testimonial pages

How do you fix them? All of these pages should be given a noindex directive in the header (tutorial below).

Don’t think you’re safe just because your list of indexed pages looks like this:

Even if the overall number of pages on your site isn’t going up, you might still be carrying unnecessary pages from months or years ago — pages that could be slowly chipping away at your relevancy scores as Google makes changes to its algorithm.

The good news is: it’s relatively easy to identify and remove pages that are causing index bloat on your site.

We also have a free tool you can use that will help.

How to Identify and Remove Thin Content and Poor Performing Pages

Here’s the step-by-step process we use with our clients to identify and remove poor performing pages:

(1) Estimate the number of pages you should have.

Add up the number of products you carry, the number of categories, blog posts, and support pages. Your total indexed pages should be close to that number.
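
To make this concrete (with purely illustrative numbers): 4,000 products + 60 categories + 250 blog posts + 40 support pages comes to roughly 4,350 pages. If Search Console then reports 20,000 indexed URLs, the extra ~15,000 are a strong sign of index bloat. You can also get a rough count of indexed pages with a Google query like this (example.com is a placeholder for your domain):

site:example.com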

(2) Use the Cruft Finder Tool to find poor-performing pages.

The Cruft Finder tool is a free tool we created to identify poor-performing pages. It’s designed to help eCommerce site managers find and remove thin content pages that are harming their SEO rankings.

The tool runs Google queries against your domain and, using a recipe of site-quality parameters, returns pages we suspect might be harming your rankings.

Mark any page that:

  • Is identified by the Cruft Finder tool
  • Gets very little traffic (as seen in Google Analytics)

These are pages you should consider removing from your site.

(3) Decide what to keep and what to remove.

For years, you’ve been told that adding fresh content on your site increases traffic and improves SEO. You should be blogging at least once a week, right?

Well, maybe.

If a blog post has been on your website for years, has no backlinks pointing to it, and no one ever visits it, that old content could be hurting your rankings. You should remove that outdated content. 

Recently, we deleted 90% of one client’s blog posts. Why? Because they weren’t generating backlinks or traffic.

If no one is visiting a URL, and it doesn’t add value to your site, it doesn’t need to be there. It’s using up crawl budget for no reason. 

(4) Revise and revamp necessary pages with little traffic.

If a URL has valuable content you want people to see — but it’s not getting any traffic — it’s time to restructure.

Could you consolidate pages? Could you promote the content better through internal links? Could you change your navigation to push traffic to that particular page?

Also, make sure that all your static pages have robust, unique content. When Google’s index includes thousands of pages on your site with sparse or similar content, it can lower your relevancy score.

(5) Make sure your search results pages aren’t being indexed.

Not all pages on your site should be indexed. The main example of this is search results pages.

You almost never want search pages to be indexed, because there are better pages, with better-quality content, to funnel traffic to. Search results pages are not meant to be entry pages.

This is a common issue.

For example, here’s what we found using the Cruft Finder tool for one major retail site: over 5,000 search pages indexed by Google.

Examples of how the Cruft Finder tool can help you find index bloat.
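
You can spot-check your own site with a similar Google search operator (example.com and /search are placeholders; use whatever path your site’s internal search generates):

site:example.com inurl:search

If that query returns results, Google is indexing your internal search pages.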

If you find this issue on your own site, follow Google’s instructions to get rid of search result pages.

We recommend reading their instructions carefully before you remove or noindex these pages. That page includes useful detail about temporary versus permanent solutions, when to delete pages versus using a noindex tag, and more. If this gets too far into “technical SEO” for you, feel free to reach out to our SEO team for consultation or advice.

How to Fix These Pages

Follow these basic instructions to noindex pages on your website. If your eCommerce website has a lot of zombie pages on it, see our in-depth SEO guide for fixing thin and duplicate content.

(1) Use a noindex meta robots tag.

This tag is better than blocking pages with robots.txt: whenever possible, we want to tell search engines definitively what to do with a page. The noindex tag tells search engines like Google not to index the page.

This tag is easy to implement. It can also be automated for any CMS when needed.

The tag to add in the <head> section of the page’s HTML is:

<meta name="robots" content="noindex, follow">

“noindex, follow” means search engines should not index the page, but they can still crawl and follow the links on it.
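
The same directive can also be sent as an HTTP response header, which is useful for non-HTML files (like PDFs) that can’t carry a meta tag. Here’s a minimal sketch for an Apache server, assuming mod_headers is enabled (the PDF pattern is just an example):

# Send noindex, follow for all PDF files, which can't carry a meta tag
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, follow"
</FilesMatch>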

(2) Set up the proper HTTP status code (2xx, 3xx, 4xx).

If old pages with thin content exist, remove them and redirect their URLs (through a 301 redirect) to relevant content on the site. This preserves site authority if the old pages had backlinks pointing to them.

It also helps to reduce 404s (if they exist) by redirecting removed pages to current, relevant pages on the site. 

Set the HTTP status code to 410 if the content is no longer needed or not relevant to the website’s existing pages. A 404 status code is also okay, but a 410 gets a page out of a search engine’s index faster.
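
As a sketch, here’s how both cases might look on an Apache server using mod_alias (the domain and paths are placeholders):

# Thin page with backlinks: pass its authority to the closest relevant page
Redirect 301 /old-thin-page https://www.example.com/relevant-page
# Content dropped entirely: return a 410 Gone status
Redirect gone /discontinued-page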

(3) Set up proper canonical tags.

Adding a canonical tag in the header tells search engines which version they should index. 

Ensure that product variants (mostly set up using query strings or URL parameters) have a canonical tag pointing to the preferred product page.

This will usually be the main product page, without query strings or parameters in the URL that filter to the different product variants.
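
For example, a hypothetical variant URL like https://www.example.com/widget?color=blue would carry this tag in its <head>, pointing to the main product page:

<!-- On the ?color=blue variant, tell search engines to index the main page -->
<link rel="canonical" href="https://www.example.com/widget">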

(4) Use robots.txt to control crawling.

The robots.txt file tells search engines what pages they should crawl and what pages not to crawl.

Adding the “Disallow” directive within the robots.txt file stops Google from crawling zombie/thin pages, but keeps those pages in Google’s index.
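
A minimal robots.txt sketch (the paths are placeholders for whatever sections you want to keep crawlers out of):

# Keep crawlers out of internal search results and filtered/sorted views
User-agent: *
Disallow: /search/
Disallow: /*?sort=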

Robots.txt alone does not remove a page from Google’s index, because the page is already indexed and might have internal links pointing to it from other pages on the site. You’ll want to remove internal links completely if their destination page is set to “Disallow.”

If your goal is to prevent a page from being indexed, add the noindex tag to that page’s header instead.

(5) Use the URL Removals Tool to remove the pages from Google’s index (and search results).

This tool is mostly used in cases when certain pages are blocked through robots.txt, but Google is still indexing those pages (often because the page still has internal links from other pages).

Adding the “noindex” directive might not be a quick fix, and Google might keep indexing the pages for a while, which is why the URL Removals tool can be handy at times. That said, use this method as a temporary solution. When you use it, pages are removed from Google’s index quickly (usually within a few hours, depending on the number of requests).

The Removals tool works best when used together with a noindex directive. Note that a removal you make can also be reversed.
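
In the current version of Search Console, this lives under Index > Removals. A temporary removal made there lasts about six months, which is another reason to pair it with a noindex directive rather than rely on it alone.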

The Results and Impact on Traffic and Revenue

What kind of impact can index bloat have on your results?

And what kind of positive effect have we seen after correcting it?

Here’s a graph of indexed pages from a recent client that was letting their search result pages get indexed — the same way we explained above. We helped them implement a technical fix so those pages wouldn’t be indexed anymore.

Index bloat can impact both your traffic and revenue.

In the Search Console graph, the blue dot is where the fix was implemented. The number of indexed pages continued to rise for a bit, then dropped significantly.

Year over year, here’s what happened to the site’s organic traffic and revenue:

3 Months Before the Technical Fix

  • 6% decrease in organic traffic
  • 5% increase in organic revenue

3 Months After the Technical Fix

  • 22% increase in organic traffic
  • 7% increase in organic revenue

Before vs. After

  • 28-point swing in organic traffic (from a 6% decrease to a 22% increase)
  • 2-point improvement in organic revenue growth (from a 5% to a 7% increase)

Remember that not all pages on your site should be indexed.

This process takes time.

The important technical SEO lesson here is that blocking is different from noindexing. Most websites end up blocking these types of pages in robots.txt, which is not the right way to fix index bloat.

Blocking these pages with a robots.txt file won’t get them out of Google’s index if they are already indexed or have internal links from other pages on the website.

For this client, it took three full months before the number of indexed pages returned to the mid 13,000s, where it should have been all along.

Note: Interested in a personalized strategy to reduce index bloat and raise your SEO ranking? We can help. Contact us here.