As an SEO, when you’ve been managing a website for years and publishing content consistently, there comes a point where things can feel outside of your control. You’ve published so many test pages, thank-you pages, and articles that you’re no longer sure which URLs are relevant.
What’s even worse is that a technical error can sometimes cause the number of pages indexed by Google to skyrocket out of nowhere. And every Google algorithm update tends to reward websites that focus on quality over quantity. So you need to make sure every URL on your website that’s crawled by search engines serves a purpose and is valuable to the end user.
Knowing every URL Google has in its index allows you to flag any potential technical errors on your website and allows you to clean up any low-quality pages to help you keep your website quality score high.
If you have too many low-quality pages on your site, Google won’t bother crawling every page. By allowing your website to grow unnecessarily large, you’re potentially leaving rankings on the table and wasting valuable crawl budget.
In this article, we’re covering how to identify index bloat by finding every URL indexed by Google and how to fix these issues to save your crawl budget.
What We’re Covering
- What is Index Bloat
- Identifying Index Bloat
- How to Find All Indexed Pages on Your Site
- Deciding Which Pages to Remove
- How to Remove URLs from Google’s Index
- Results from Fixing Index Bloat
Note: We can help you spot and fix issues on your website that are harming your overall ranking. Contact us here.
What is Index Bloat?
Index bloat is when your website has dozens, hundreds, or thousands of low-quality pages indexed by Google that don’t serve potential visitors. This causes search crawlers to spend more time crawling through unnecessary pages on your site and not focusing their efforts on pages that help your business. It also causes a poor user experience for your website visitors.
Index bloat is common on eCommerce sites with a large number of products, categories, and customer reviews. Technical issues can cause the site to be inundated with low-quality pages picked up by search engines.
You want a clean site indexed by search engines, with the only indexed URLs being the ones you want people to find. Index bloat slows down crawling and wastes your crawl budget.
Index Bloat: A Real-life Example
There was an eCommerce site that we worked with a few years ago. After talking to them, we expected the site to have somewhere around 10,000 pages.
When we looked in Google Webmaster Tools (now Google Search Console), we saw — to our surprise — that Google had indexed 38,000 pages for the website. Find this chart here: Web Tools > Search Console > Google Index > Index Status.
That was way too high for the size of the site. We also saw that the number had risen dramatically in a short period. In July of 2017, the site had only 16,000 pages indexed, according to the same report.
The “Technical Glitch” That Caused the New Indexed Pages
Eventually, we found a problem in their software that was creating thousands of unnecessary product pages. At a high level, any time the website sold out of its inventory for a brand (which happened often), the site’s pagination system created hundreds of new pages.
Put another way, the site had a technical glitch that was creating index bloat. The company had no idea their site had this problem. It’s common for eCommerce sites to automatically generate new pages for products, brands, or categories; all of these need a noindex directive in the header.
Identifying Index Bloat
Even if the overall number of pages on your site isn’t going up, you might still be carrying unnecessary pages from months or years ago. These pages could be slowly chipping away at your relevancy scores as Google makes changes to its algorithm.
With too many low-quality pages in the index, it’s possible that Google decides to ignore important pages on your site because its crawlers are wasting too much time on other parts of your site.
The good news is: it’s relatively easy to identify and remove pages that cause index bloat on your site.
Here are some common examples of “low quality” pages you can find on your website.
- Archive pages
- Tag pages (on WordPress)
- Search results pages (mostly on eCommerce websites)
- Old press releases/event pages
- Demo/boilerplate content pages
- Thin landing pages (<50 words)
- Pages with a query string in the URL (tracking URLs)
- Images as pages (Yoast bug)
- Auto-generated user profiles
- Custom post types
- Case study pages
- Thank-you pages
- Individual testimonial pages
But if you start noticing a sharp increase in the number of indexed pages on your site, that’s also a sign you’re dealing with an index bloat issue.
How to Find All Indexed Pages on Your Site
Estimate the number of products you carry, the number of categories, blog posts, and support pages, and add them together. Your total indexed pages should be close to that number.
Start by taking inventory and gathering all the information you have on your site:
- Create a URL list from your sitemap – Ideally, every URL you want to be indexed will be in your sitemap. This is your starting point for creating a valid list of URLs for your website. Use this tool to create a list of URLs from your sitemap URL.
- Download your published URLs from your CMS – Using a plugin like Export All URLs (assuming WordPress is your CMS), you can download a CSV file of all published pages on your website.
- Run a Site Search query – Run a search query for your website like this: site:website.com (replace website.com with your actual domain name). The results page will give you the number of URLs in Google’s index. Use this tool to scrape a list of URLs from the SERPs.
- Look at your Index Coverage Report in Search Console – Inside Google Search Console there’s a report called “Index Coverage” which tells you how many valid pages are indexed by Google. Download the report as a CSV.
- Analyze your Log Files – Access your log files directly from your hosting provider backend or contact them and ask for the files. Log files tell you which pages on your website are most visited and will point out potential pages you didn’t know users or search engines were visiting. A log file analysis should reveal underperforming pages.
- Use Google Analytics – You want a list of URLs that drove pageviews in the last year. Go to Behavior → Site Content → All Pages, and set Show Rows high enough to display every URL you have. Export as a CSV.
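The sitemap step above (creating a URL list from your sitemap) can be sketched in a few lines of Python. This is a minimal example that assumes a standard `<urlset>` sitemap already fetched as text; a sitemap index file would need one extra level of parsing:

```python
# Minimal sketch: extract page URLs from a standard sitemap's XML.
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text: str) -> list:
    """Return every <loc> URL found in the sitemap XML."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]
```

You could fetch the XML first with `urllib.request.urlopen(...)` and pass the decoded response body in.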
How to Decide What Pages to Remove
After consolidating all of the URLs collected, removing duplicates, and removing URLs with parameters, you’ll have a final list of URLs. Using a site crawling tool like Screaming Frog and connecting it with Google Analytics, Google Search Console, and Ahrefs, you can pull traffic data, click data, and backlink data to start analyzing your website.
All of this data will give you a clear understanding of which URLs on your website are underperforming and don’t belong on your site.
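As a rough sketch, the consolidation step (merging the lists you collected and dropping duplicate URLs) might look like this in Python. The normalization rules here, stripping query strings, fragments, and trailing slashes, are illustrative assumptions rather than a fixed standard:

```python
# Sketch: merge URL lists from sitemap, CMS export, SERP scrape, etc.,
# normalizing so the same page only counts once.
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lowercase the host, drop query string/fragment, trim trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))

def consolidate(*url_lists) -> list:
    """Return a sorted, de-duplicated list from any number of URL lists."""
    seen = {normalize(u) for lst in url_lists for u in lst}
    return sorted(seen)
```

Keep the raw parameterized URLs in a separate list too; they are exactly the kind of pages you may want canonicalized or noindexed later.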
(Bonus) Use the Cruft Finder Tool to find poor-performing pages
The Cruft Finder tool is a free tool we created to identify poor-performing pages. It’s designed to help eCommerce site managers find and remove thin content pages that are harming their SEO rankings.
The tool runs a Google query on your domain and, using a recipe of site quality parameters, returns pages we suspect might be harming your rankings.
Mark any page that:
- Is identified by the Cruft Finder tool
- Gets very little traffic (as seen in Google Analytics)
These are pages you should consider removing from your site.
For years, you’ve been told that adding fresh content to your site increases traffic and improves SEO. But when too many pages on your website don’t add value to your users, there’s a better approach.
You have three options:
- Keep the page “as-is” by adding internal linking to it and finding the right place for it on your website;
- Leave it unchanged because it’s specific to a campaign but add a noindex tag;
- Delete the page and set up a 301 redirect from its URL to a relevant page.
If a blog post has been on your website for years, has no backlinks pointing to it, and no one ever visits it, that old content could be hurting your rankings. You should remove that outdated content.
Recently, we deleted 90% of one client’s blog posts. Why? Because they weren’t generating backlinks or traffic. If no one is visiting a URL, and it doesn’t add value to your site, it doesn’t need to be there. It’s using up crawl budget for no reason.
Update Necessary Pages with Little Traffic
If a URL has valuable content you want people to see — but it’s not getting any traffic — it’s time to restructure.
Ask yourself this when evaluating content:
- Is it possible to consolidate pages?
- Could you promote the content better through internal links?
- Could you change your navigation to push traffic to that particular page?
Also, make sure that all your static pages have robust, unique content. When Google’s index includes thousands of pages on your site with sparse or similar content, it can lower your relevancy score.
Prevent Internal Search Results Pages from Being Indexed
Not all pages on your site should be indexed. The main example is internal search results pages. You almost never want search pages indexed, because there are better, higher-quality pages to funnel traffic to. Search results pages are not meant to be entry pages.
For example, here’s what we found using the Cruft Finder tool for one major retail site:
Over 5,000 search pages indexed by Google. If you find this issue on your own site, follow Google’s instructions to get rid of search result pages.
We recommend reading their instructions carefully before you remove or noindex these pages. They include details about temporary versus permanent solutions, when to delete pages versus using a noindex tag, and more. If this gets too far into “technical SEO” for you, feel free to reach out to our SEO team for consultation or advice.
How to Remove URLs from Google’s Index
Follow these basic instructions to noindex pages on your website. If your eCommerce website has a lot of zombie pages on it, see our in-depth SEO guide for fixing thin and duplicate content.
1. Use a noindex meta robots tag.
This tag is better than blocking pages with robots.txt, because we want to tell search engines definitively what to do with a page whenever possible. The noindex tag tells search engines like Google not to index the page.
This tag is easy to implement, and most CMSs can automate it.
The tag to add in the page’s head section to noindex it is:
<meta name="robots" content="noindex, follow">
"noindex, follow" means search engines should not index the page, but they can still crawl and follow the links on that page.
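To audit which pages already carry the tag, a small script can scan each page’s HTML for a robots meta tag. This is a minimal sketch using Python’s standard library; `has_noindex` and `RobotsMetaParser` are illustrative names, not part of any SEO tool:

```python
# Sketch: detect a noindex robots meta tag in a page's HTML source.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Sets self.noindex when a <meta name="robots"> tag contains noindex."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        if d.get("name", "").lower() == "robots" and "noindex" in d.get("content", "").lower():
            self.noindex = True

def has_noindex(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```

Fetch each URL from your cleanup list (for example with `urllib.request`) and run its HTML through `has_noindex` to confirm the directive actually shipped.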
2. Set Up the Proper HTTP Status Code (2xx, 3xx, 4xx)
If old pages with thin content exist, remove and redirect (through a 301 redirect) to relevant content on the site. This maximizes site authority if old pages had backlinks pointing to them.
It also helps to reduce 404s (if they exist) by redirecting removed pages to current, relevant pages on the site.
Set the HTTP status code to 410 if the content is no longer needed or not relevant to the website’s existing pages. A 404 status code is also okay, but a 410 gets a page out of a search engine’s index faster.
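As a sketch of the decision rule above: retired URLs with a relevant replacement get a 301, and the rest get a 410. The helper name and the redirect map are hypothetical example data, not from any real site:

```python
# Sketch: choose a status code for each retired URL.
def status_for(url, redirect_map):
    """Return (status_code, redirect_target) for a removed page."""
    if url in redirect_map:
        # A relevant replacement exists: permanent redirect preserves authority.
        return 301, redirect_map[url]
    # No replacement: 410 Gone tells search engines the removal is intentional.
    return 410, None
```

In practice this logic would live in your server or CMS redirect configuration; the point is simply that every removed URL should resolve to a deliberate 301 or 410, never a dead end.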
3. Set Up Proper Canonical Tags
Adding a canonical tag in the header tells search engines which version of a page they should index. Ensure that product variants (mostly set up using query strings or URL parameters) have a canonical tag pointing to the preferred product page.
This will usually be the main product page, without query strings or parameters in the URL that filter to the different product variants.
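As a concrete sketch, here is what the tag looks like in a variant page’s head (the URLs are placeholders for illustration, not from any site discussed above):

```html
<!-- On the variant page https://example.com/product?color=red -->
<link rel="canonical" href="https://example.com/product">
```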
4. Update the Robots.txt File
The robots.txt file tells search engines what pages they should crawl and what pages not to crawl. Adding the “Disallow” directive within the robots.txt file stops Google from crawling zombie/thin pages, but keeps those pages in Google’s index.
Robots.txt does not remove a page from Google’s index, because the page is already indexed and might have internal links from other pages of the site. Remove internal links completely when the destination page is set to Disallow.
If your goal is to prevent a page from being indexed, add the noindex tag to that page’s header instead.
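For example, a robots.txt that keeps crawlers away from internal search results and parameterized URLs might look like this (the paths and parameter here are placeholders; adjust them to your own site’s URL patterns):

```text
User-agent: *
Disallow: /search/
Disallow: /*?s=
```

Remember, this only controls crawling. To actually clear pages from the index, apply noindex first, wait for the pages to drop out, and only then add the Disallow rule, since a blocked page can never show crawlers its noindex tag.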
5. Use the URL Removals Tool in Google Search Console
This tool is mostly used in cases when certain pages are blocked through robots.txt, but Google is still indexing those pages (often because the page still has internal links from other pages).
Adding the “noindex” directive might not be a quick fix and Google might keep indexing the pages, which is why the URL Removals tool can be handy at times. That said, use this method as a temporary solution. When you use it, pages are removed from Google’s index quickly (usually within a few hours depending on the number of requests).
The Removals tool works best when used together with the noindex directive. Remember that removals you make are reversible in the future.
What Kind of Results Can You See by Fixing Index Bloat?
Here’s a graph of indexed pages from a recent client that was letting their search result pages get indexed — the same way we explained above. We helped them implement a technical fix so those pages wouldn’t be indexed anymore.
In the graph, the blue dot marks when we implemented the fix. The number of indexed pages continued to rise for a bit, then dropped significantly.
Year over year, here’s what happened to the site’s organic traffic and revenue:
3 Months Before the Technical Fix
- 6% decrease in organic traffic
- 5% increase in organic revenue
3 Months After the Technical Fix
- 22% increase in organic traffic
- 7% increase in organic revenue
Before vs. After
- 28% total difference in organic traffic
- 2% total increase in organic revenue
This process takes time.
The important technical SEO lesson here is that blocking is different from noindexing. Most websites end up blocking these types of pages from robots.txt, which is not the right way to fix the index bloat issue.
Blocking these pages with a robots.txt file won’t remove them from Google’s index if the page is already in the index, or if there are internal links to it from other pages on the website.
For this client, it took three full months before the number of indexed pages returned to the mid 13,000s, where it should have been all along.
Your website needs to be a useful resource for search visitors. If you’ve been in business for a long time there’s maintenance that should be performed every year. Analyze your pages frequently and make sure they’re still relevant. You want to confirm Google isn’t indexing pages that you want hidden.
As a site owner and SEO, knowing all of the pages indexed on your site can help you discover new opportunities to rank higher without needing to always publish new content. Regularly maintaining and updating a website is the best way to stay ahead of any algorithm update and keep growing your rankings.
Note: Interested in a personalized strategy for your business to reduce index bloat and raise your SEO ranking? We can help. Contact us here.