Editor’s Note: This post was originally published in November of 2013. While a lot of the original content still stands, algorithms and strategies are always changing. So our team has updated this post for 2020 and we hope that it will continue to be a helpful resource.

For eCommerce SEO professionals, issues surrounding duplicate content, and also thin/low-quality content, can spell disaster in the search engine rankings. As Google, Bing, and other search engines become more sophisticated, they reward websites that present only quality, unique content to their search bots for indexation.

In this resource guide, we dig into a wide variety of duplicate content scenarios commonly found on eCommerce websites.


Table of Contents

What is Duplicate Content?

Internal Technical Duplicate Content

Internal Editorial Duplicate Content

Offsite Duplicate Content

What is Thin Content?

Tools for Finding & Diagnosing Duplicate eCommerce Content

What is Duplicate Content?

Google began taking duplicate, scraped, and thin content very seriously on February 24th, 2011 when they launched their first Panda algorithm update. According to their Content Guidelines, Google defines duplicate content as:

Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Examples of non-malicious duplicate content could include:

  • Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
  • Store items shown or linked via multiple distinct URLs
  • Printer-only versions of web pages

Basically, duplicate content is the exact same website copy found on multiple pages anywhere on the internet. For example, if you have content on a URL on your site that can be found word for word somewhere else on your site then you’re dealing with a duplicate content issue. 

Duplicate content does not trigger a ranking penalty directly, but it can hurt your overall rankings when Google can’t tell which page to rank. Search engines then have a hard time working out which web page is most relevant to the user.

What we should worry about is having a large number of pages that are mostly duplicate content, or product pages with descriptions so short that the content can be deemed thin and thus not valuable to either Google or the reader.

Our job as website publishers and content managers is to ensure that we are providing the most robust information possible to our readers. When we take this approach, we are rewarded by Google since this meets their quality guidelines.

But not all duplicate content is editorially created. A number of technical situations can also create duplicate content and, in turn, drag down your site’s rankings.

We’ll dive into many of these situations within this chapter so that you’re fully prepared to avoid duplicate content across your entire website.


Internal Technical Duplicate Content

Duplicate content can exist internally on an eCommerce site in a number of ways, both due to technical and editorial causes. We’ll dive into some of the more popular instances where internal duplicate content can rear its ugly head.

The following instances of duplicate content are typically caused by technical reasons with the Content Management System (CMS) and other code-related aspects of eCommerce websites.

Non-Canonical URLs

Canonical URLs help search engines understand that only a single version of a page’s URL should be indexed, no matter what other URL versions are rendered in the browser, linked to from external websites, etc.

Canonical URLs are extremely important in the case of tracking URLs, where tracking code (e.g., affiliate tracking, social media source tracking, etc.) is appended to the end of a URL on the site (e.g., ?a_aid=, ?utm_source, etc.).

They are also very helpful in fine-tuning indexation of category page URLs on eCommerce websites, where sorting, functional, and filtering parameters are appended to the base category URL to produce a different ordering or subset of products (e.g., ?dir=asc, ?price=10-, etc.).

Ensuring that the Canonical URL (in the <head> of the source code) is the same as the base category URL helps prevent search engines from indexing these duplicate URLs.

URL/Page Type | Visible URL | Canonical URL
Base Category URL | http://www.domain.com/page-slug | http://www.domain.com/page-slug
Social Tracking URL | http://www.domain.com/page-slug?utm_source=twitter | http://www.domain.com/page-slug
Affiliate Tracking URL | http://www.domain.com/page-slug?a_aid=123456 | http://www.domain.com/page-slug
Sorted Category URL | http://www.domain.com/page-slug?dir=asc&order=price | http://www.domain.com/page-slug
Filtered Category URL | http://www.domain.com/page-slug?price=-10 | http://www.domain.com/page-slug
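
To make the mapping in the table concrete, here is a minimal Python sketch of how a template might derive the canonical URL from whatever URL was requested. The parameter list and the function names are assumptions for illustration (the parameters are simply the ones used in the examples above), not part of any particular CMS.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters that only track, sort, or filter, taken from the
# examples in the table. A real site would maintain its own list.
NON_CANONICAL_PARAMS = {"utm_source", "a_aid", "dir", "order", "price"}


def canonical_url(visible_url: str) -> str:
    """Return the URL that should appear in <link rel="canonical">."""
    parts = urlsplit(visible_url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in NON_CANONICAL_PARAMS
    ]
    # Rebuild the URL without the dropped parameters and without a fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_tag(visible_url: str) -> str:
    """Render the tag for the <head>, escaping the URL for use in HTML."""
    return f'<link rel="canonical" href="{escape(canonical_url(visible_url))}">'


print(canonical_tag("http://www.domain.com/page-slug?dir=asc&order=price"))
```

Parameters that are not on the list (a hypothetical ?page=2, for instance) are left in place, so the decision about which parameters create a genuinely different page stays explicit.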

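It might also be beneficial to disallow crawling of the commonly used URL parameters via the /robots.txt file, in order to maximize crawl budget. Keep in mind that a search engine cannot read the canonical tag on a URL it is blocked from crawling, so reserve this for parameters whose URLs are unlikely to attract links. Example: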

User-agent: *
Disallow: *?dir=*
Disallow: *&order=*
Disallow: *?price=*
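
Before deploying wildcard rules like these, it can help to check which URLs they actually catch. The following Python sketch imitates the * and $ pattern matching that Google documents for robots.txt; it is an approximation for testing, not Google’s parser, and the function names are made up for this example. Python’s built-in urllib.robotparser is not used because it does not interpret * wildcards.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a Disallow value that uses * and $ into an anchored regex."""
    anchored_end = rule.endswith("$")
    if anchored_end:
        rule = rule[:-1]
    # Escape each literal piece, then let every * match any run of characters.
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored_end else ""))


def is_disallowed(path_and_query: str, rules: list) -> bool:
    """True when any rule matches from the start of the path."""
    return any(rule_to_regex(rule).match(path_and_query) for rule in rules)


RULES = ["*?dir=*", "*&order=*", "*?price=*"]

for candidate in (
    "/page-slug",
    "/page-slug?dir=asc&order=price",
    "/page-slug?order=price&dir=asc",
):
    print(candidate, is_disallowed(candidate, RULES))
```

The third URL is worth noting: with the parameters in the opposite order, none of the three rules match under this interpretation, because each rule names either the ? or the & form of a parameter. If your platform can emit parameters in any order, you would need a rule for each form.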

Session IDs

Many eCommerce websites use session IDs in URLs (e.g., ?sid=) to track user behavior. The problem for search engines is that this creates a duplicate of the core URL of whatever page the session ID is applied to.

One common approach to fix this is to use cookies to track user sessions, instead of appending session ID code to URLs. However, if session IDs are appended to URLs, it’s easy to fix this by canonicalizing the session ID URLs to the page’s core URL.

A backup approach could be to set URLs with session IDs to noindex, but this limits page-level link equity potential in the event that someone links to a page’s URL that includes the session ID. It might also be beneficial to disallow crawling of session ID URLs via the /robots.txt file, as long as the CMS does not produce session IDs for search bots (which could cause major crawlability issues).


User-agent: *
Disallow: *?sid=*

Shopping Cart Pages

When users add products to their cart on your eCommerce website and view their cart, most CMS platforms implement URL structures that are specific to the shopping cart experience.

They might have “cart,” “basket,” or some other word as the unique identifier within these shopping cart URLs. It’s important to realize that these are not the types of pages that search engines wish to index, so identifying them and then setting them to “noindex,nofollow” via a meta robots tag or X-robots tag (and also disallowing crawling of them via the /robots.txt file) will help prevent search engines from indexing this low-quality content.
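
As a rough sketch of the rule described above, the Python below maps a request path to a robots directive and renders it either as a meta tag or as an HTTP header. The /cart and /basket prefixes are hypothetical; substitute whatever identifiers your platform actually uses.

```python
from typing import Dict, Optional

# Assumption: URL prefixes used by the shopping cart on this hypothetical site.
ROBOTS_RULES = [
    ("/cart", "noindex, nofollow"),
    ("/basket", "noindex, nofollow"),
]


def robots_directive(path: str) -> Optional[str]:
    """Return the robots directive for a path, or None to leave it indexable."""
    for prefix, directive in ROBOTS_RULES:
        # Match the directory itself or anything beneath it, but not
        # unrelated paths that merely start with the same letters.
        if path == prefix or path.startswith(prefix + "/"):
            return directive
    return None


def robots_meta_tag(path: str) -> str:
    """Meta tag for the page <head>; empty string when no rule applies."""
    directive = robots_directive(path)
    return f'<meta name="robots" content="{directive}">' if directive else ""


def robots_header(path: str) -> Dict[str, str]:
    """Equivalent HTTP response header; empty dict when no rule applies."""
    directive = robots_directive(path)
    return {"X-Robots-Tag": directive} if directive else {}


print(robots_meta_tag("/cart/items"))
print(robots_header("/cartoon-lamps"))
```

Matching on the prefix plus a slash keeps a category such as /cartoon-lamps from being caught by the /cart rule.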

Internal Search Results

Internal search result pages are produced when someone conducts a search using an eCommerce website’s internal search feature. They have no unique content, only repurposed snippets of content from other pages on your eCommerce website.

Google’s own Matt Cutts has clearly stated that they do not want to send users from their search results to your search results. Instead, they want to send users to true content pages (product pages, category pages, static site pages, blog posts, and articles). This is an extremely common issue with eCommerce websites.

Many CMS platforms do not set internal search result pages to “noindex, follow” by default, so a developer will need to apply this rule in order to fix this problem. It’s also recommended to disallow search bots from crawling internal search result pages within the /robots.txt file, either before any of these pages get indexed or once all of them have been removed from the index.

It’s an easy fix, yet an important one since it can lead to ranking penalties under Google’s Panda algorithm if there are too many internal search results in Google’s index.

Duplicate URL Paths

How CMS platforms handle URL structures when products are placed in multiple categories can get tricky. For example, if a product is placed in both category A and category B, and if category directories are used within the URL structure of product pages, then the CMS could potentially create two different URLs for the same product.


As one can imagine, this can lead to devastating duplicate content problems for product pages, which are typically the highest converting pages on an eCommerce website. Common approaches to fix this are:

  • Use root-level product page URLs (unfortunately this removes keyword-rich, category-level URL structure benefits and also limits trackability in Analytics software).
  • Use /product/ URL directories for all products (which at least offers grouped trackability of all products in Analytics software).
  • Use product URLs built upon category URL structures, but ensure that each product page URL has a single, designated canonical URL.
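
For the third approach, the first step is finding products that are reachable at more than one URL. The Python sketch below groups a crawl export by product slug and picks one URL per product as the canonical. It assumes the slug is the final path segment and that each product has a known primary category; both are assumptions about your URL scheme, and the example URLs are invented.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_product_slug(urls):
    """Group URLs whose final path segment (the product slug) is identical."""
    groups = defaultdict(list)
    for url in urls:
        slug = urlsplit(url).path.rstrip("/").rsplit("/", 1)[-1]
        groups[slug].append(url)
    # Only slugs reachable at more than one URL are duplicate candidates.
    return {slug: found for slug, found in groups.items() if len(found) > 1}


def pick_canonical(duplicates, primary_category):
    """Prefer the URL under the primary category, else the shortest URL."""
    for url in duplicates:
        if urlsplit(url).path.startswith(f"/{primary_category}/"):
            return url
    return min(duplicates, key=len)


crawled = [
    "http://www.domain.com/tools/flashlights/led-torch",
    "http://www.domain.com/emergency/flashlights/led-torch",
    "http://www.domain.com/tools/hammers/claw-hammer",
]
for slug, found in group_by_product_slug(crawled).items():
    print(slug, "->", pick_canonical(found, primary_category="emergency"))
```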

In some instances, this situation can also arise with sub-Category URLs where the products displayed might be exactly the same, or close to it. For example, a “Flashlights” sub-category might be placed under both /tools/flashlights/ and /emergency/flashlights/ on an Emergency Preparedness eCommerce website, and have mostly the same products.

Taxonomy opinions aside, the same approach can be applied in these situations as with product pages. Also, ensuring that robust intro descriptions exist atop the category pages would help ensure that each similar sub-category page has unique content.

Product Review Pages

Many CMS platforms come with built-in review functionality. Oftentimes, separate “review pages” are created to host all reviews for particular products, yet some (if not all) of the reviews are also placed on the product pages themselves.

This can create duplicate content between the product pages and the corresponding product review pages. These “review pages” should either be canonicalized to the main product page or set to “noindex,follow” via meta robots or X-robots tag. The canonicalization method is preferred: if a link to a “review page” appears on an external website, the canonical tag passes that link equity to the product page.

It’s also critical to ensure that review content is not duplicated on external sites when using 3rd party product review vendors. For a deep dive into this topic, please read Product Review Vendors—Solutions to Fit Your eCommerce SEO Needs.

WWW vs. Non-WWW URLs & Uppercase vs. Lowercase URLs

Just as the Post Office would consider 123 Race Avenue and 123 Race Street different home addresses, search engines consider http://www.domain.com and http://domain.com different web addresses. Therefore, it’s critical that one version of URLs is chosen for every page on the eCommerce website. 301 redirecting the non-preferred version to the preferred version is the recommended solution to avoid these technically created duplicate URLs, per Google.

Uppercase and lowercase URLs need to be handled in the same manner. If both render separately, then search engines can consider them different. It’s important to choose one format and 301 redirect the other version to it. We have a helpful article that offers instruction on how to do this: How to Redirect Uppercase URLs to Lowercase URLs Using Htaccess.

Trailing Slashes on URLs

Similar to www and non-www URLs, search engines consider a URL with a trailing slash and the same URL without one to be different URLs. Likewise, duplicate URLs are created when URLs such as /page/ and /page/index.html, or /page and /page.html, render the same content.

It is especially problematic when /page and /page/ show the same content since, technically speaking, these two pages aren’t even in the same directory. Common approaches to fixing this problem are to either canonicalize both to a single version or 301 redirect one version to the other.
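
The www, letter-case, and trailing-slash rules can be combined into one normalization step that decides whether a request needs a 301 redirect. The Python sketch below assumes the site prefers the www host, lowercase paths, and no trailing slash; those preferences and the function names are illustrative choices, and the reverse choices work just as well so long as they are applied consistently.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions: the preferred host is the www version, paths are lowercase,
# and URLs carry no trailing slash.
PREFERRED_HOST = "www.domain.com"
BARE_HOST = "domain.com"


def preferred_url(url: str) -> str:
    """Return the single preferred form of a URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = PREFERRED_HOST
    # Only the path is lowercased; query values may be case-sensitive.
    path = parts.path.lower()
    if len(path) > 1:
        path = path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


def redirect_for(url: str):
    """Return (301, target) when the URL is not in preferred form, else None."""
    target = preferred_url(url)
    return (301, target) if target != url else None


print(redirect_for("http://domain.com/Page/"))
print(redirect_for("http://www.domain.com/page"))
```

Resolving all three differences in a single hop avoids redirect chains, where a URL would otherwise be redirected once for the host, again for the case, and again for the slash.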

HTTPS URLs: Relative vs. Absolute Path

HTTPS (secure) URLs are typically created after a user has logged into an eCommerce website. Most times, search engines have no way of finding these URLs. However, there are instances where this is possible, such as when a logged-in Administrator is updating content and navigational links.

In this scenario, it’s common for the Administrator not to realize that embedded URLs include HTTPS instead of HTTP in the URLs. When relative path URLs (excluding the “http://www.domain.com” portion) are also used on the site (either in content or navigational links), it makes it all too easy for search engines to quickly crawl hundreds, if not thousands of HTTPS URLs, which are technically duplicates of the HTTP versions.

The most common solutions to fix this consist of using absolute path URLs (including the “http://www.domain.com” portion) coupled with ensuring that canonical URLs always use the HTTP version. Using 301 redirects in these cases could easily break the user-login functionality, as the HTTPS URLs would not be able to be rendered.

Internal Editorial Duplicate Content

The following instances of duplicate content are not technical but happen because the on-page content for different pages is either similar or duplicated. These issues are commonly solved by writing unique copy for each individual page.

Similar Product Descriptions

It’s easy to take shortcuts with product descriptions on eCommerce websites, especially with similar products. However, consider that Google judges the content of eCommerce websites much as it judges regular content sites.

That alone should be enough to make a professional SEO realize that product page descriptions should be unique, compelling, and robust–especially for mid-tier eCommerce websites looking to scale content production because they don’t have enough Domain Authority to compete with bigger competitors.

Sharing short paragraphs, specifications and other content between product pages increases the likelihood that search engines will decrease their perception of a product page’s content quality and subsequently, ranking position.
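
One way to find product descriptions that share too much text is to compare them pairwise. The Python sketch below uses word shingles and Jaccard similarity, a common near-duplicate technique; it is not a description of how any search engine measures duplication, and the 0.5 threshold and sample descriptions are arbitrary assumptions.

```python
import re
from itertools import combinations


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word n-grams in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of shingles two descriptions have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(descriptions: dict, threshold: float = 0.5):
    """Yield (sku_a, sku_b, score) for description pairs at or above threshold."""
    sets = {sku: shingles(text) for sku, text in descriptions.items()}
    for (sku_a, set_a), (sku_b, set_b) in combinations(sets.items(), 2):
        score = jaccard(set_a, set_b)
        if score >= threshold:
            yield sku_a, sku_b, score


descriptions = {
    "A": "Bright LED flashlight with aluminum body and wrist strap",
    "B": "Bright LED flashlight with aluminum body and belt clip",
    "C": "Cast iron skillet pre seasoned for even heat",
}
for sku_a, sku_b, score in similar_pairs(descriptions):
    print(sku_a, sku_b, round(score, 2))
```

Pairwise comparison grows quadratically with catalog size, so for a large store you would likely run it within one category at a time.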

Category Pages

Category pages on eCommerce websites typically include a title and product grid. This means that there is no unique content on these pages. Category page best practice is to add a unique description at the top of each category page (not the bottom, where content is given less weight by search engines) that describes what types of products are featured within the category.

There is no magic number of words or characters to use. However, the more robust the category page content is, the better chance the page has of maximizing traffic from organic search results (due to long-tail keyword traffic).

A benchmark of 100-300 words is common. It’s important to understand screen resolutions of your visitors and ensure that the product grid is not pushed below the fold on their browsers. Doing so could limit user discoverability of the product grid upon visiting the category page.

Tip: Intro descriptions on category pages offer a great opportunity to build deep links to related sub-category pages, related article content that may exist on the site, and popular products that deserve attention and link equity.

Homepage Duplicate Content

Every SEO should know that home pages typically have the most incoming link equity, and thus serve as highly rankable pages in search engines.

What many SEOs forget is that a homepage should be treated like any other page on an eCommerce website, content-wise. Always ensure that unique content fills the majority of the homepage body, as a homepage consisting merely of duplicated product blurbs gives search engines little context for ranking it as highly as possible for target keywords.

Tip: Online marketers also commonly use the homepage’s descriptive content in directory submissions and other business listings on external websites. Ensure that unique content is provided to these external websites instead. If this has already been done to a large extent, rewriting the home page descriptive content is the easiest way to fix the preexisting issue.

Offsite Duplicate Content

Duplicate content that exists between an eCommerce website and other eCommerce websites (and potentially even content websites) has become a real pain point in recent years.

As Google clearly moves towards ranking websites more based on inbound link metrics (such as Domain Authority), websites with less inbound link equity are finding it extremely difficult to rank well in search engines when external duplicate content exists.

Let’s dive into some of the most common forms of external (off-site) duplicate content that prevent eCommerce websites from ranking as well as they could in organic search.

Manufacturer Product Descriptions

When eCommerce websites copy product descriptions, supplied by the product manufacturer, and place them on their own product pages, they are put at an immediate disadvantage.

In the search engines’ algorithmic analysis, these websites aren’t offering any unique value to users. Search engines therefore choose to rank the big brand websites higher instead, since those sites have more robust, higher-quality inbound link profiles, even though they may be using the same product descriptions. The only way to fix this is to embark upon the extensive task of rewriting existing product descriptions, in addition to ensuring any new products are launched with completely unique descriptions.

In our experience, we’ve seen lower-tier eCommerce websites increase organic search traffic by as much as 50-100% by simply rewriting product descriptions for half of the website’s product pages–with no manual link building efforts.

For eCommerce websites whose products are very time-sensitive, meaning they come in and out of stock as newer models are released, a better approach can be to simply ensure that new product pages are only launched with completely unique descriptions.

Other ways of filling product pages with unique content include:

  • Multiple photos (preferably unique photos, if possible)
  • Enhanced descriptions that offer more detailed insight into product benefits
  • Product demonstration videos (users love videos)
  • Schema markup (to enhance SERP listings)
  • User-generated reviews

Duplicate Content on Staging, Development or Sandbox Websites

Time and time again, Development teams forget, give little consideration to, or simply don’t realize that testing sites can be discovered and indexed by search engines, oftentimes creating exact duplicates of a live eCommerce website. Luckily, these situations can be easily fixed through different approaches:

  • Adding a “noindex,nofollow” meta robots or X-robots tag to every page on the test site.
  • Blocking search engine crawlers from crawling the sites via a “Disallow: /” command in the /robots.txt file on the test site (don’t use this if your “duplicate” content has already been indexed).
  • Password-protecting the test site, to prevent search engines from crawling it.
  • Setting up these test sites separately within Webmaster Tools and using the “Remove URLs” tool in Google Search Console, or the “Block URLs” tool in Bing Webmaster Tools, to quickly get the entire test site out of Google and Bing’s index.

When search engines already have a test website indexed, using a combination of these approaches can yield the best results. One approach is to add the “noindex,nofollow” meta robots or X-robots tag, remove the entire site from search engines’ indexes via Webmaster Tools, and then add a “Disallow: /” command in the /robots.txt file once the content has been removed from the index.

3rd Party Product Feeds (Amazon & Google)

For good reason, eCommerce websites see the value in extending their products onto 3rd party shopping websites in order to extend their potential sales reach. What many eCommerce website marketing managers don’t realize is that this is creating duplicate content across these external domains.

Oftentimes, an eCommerce website’s own products on 3rd party websites will end up outranking its own product pages when products are fed onto 3rd party websites with more authoritative inbound link profiles.

Consider the popular scenario where a product manufacturer, with its own eCommerce website (to sell its own products direct to consumers), feeds its products to Amazon to greatly increase sales. This scenario is highly plausible for revenue reasons.

From an SEO perspective, serious problems have just been created, as Amazon is one of the most authoritative websites in the world and the product pages on Amazon are almost guaranteed to outrank the product pages on the manufacturer’s eCommerce website. Some may view this as mere revenue displacement, but it clearly is going to put an in-house SEO’s job, or an SEO agency’s contract, in jeopardy when organic search traffic (and resulting revenue) plummets for the eCommerce website.

The solution to this problem is exactly what you would expect: ensure that product descriptions fed to 3rd party sites are different than what is placed on your eCommerce website. It’s recommended to give the manufacturer description to the 3rd party shopping feeds like Google, and write a more robust, unique description for your own eCommerce website.

Always give your own website the edge when it comes to content. In cases where an eCommerce website is selling its own products, webmasters and marketers will need to decide whether to rewrite the 3rd party shopping feed description or the on-site description. Whichever is decided upon, just ensure that the most authoritative and robust description exists on-site.

Affiliate Programs

Google’s guidance on thin affiliate content is worth quoting:

“Pages with product affiliate links on which the product descriptions and reviews are copied directly from the original merchant without any original content or added value.”

If your eCommerce site offers an affiliate program, ensure that you do not distribute your own site’s product descriptions to your affiliates. It’s advised to provide affiliates with the same product feeds that are given to other 3rd party vendors who sell or promote your products.

For maximum ranking potential in search engines, ensure that no affiliates or 3rd party vendors use the same descriptions that you are using. Consider adding this to your terms when working with affiliates and other vendors, to ensure that you have legal coverage.

If any affiliates or vendors violate these terms, you have the contractual right to require them to remove the duplicated content and use your designated product description feed instead.

Syndicated Content

Some eCommerce websites will also have blogs in order to provide more marketable content on their website, and some of them will even syndicate that content out to other websites (again, to extend their marketing reach).

While this may seem like a great idea at first, it’s critical to realize that without proper SEO handling, this can also create external duplicate content. If the syndication partner is a more authoritative website (according to its inbound link profile), then it’s possible that the content on the syndication partner’s website will outrank (in search engines) the original content on the eCommerce website.

There are a few different solutions to prevent syndicated content outranking your own content:

  • Ensure that the syndication partner canonicalizes the content to the URL on the eCommerce site that it originated from. This is the best solution, as any inbound links to the content on the syndication partner’s website will be applied to the content on the eCommerce website (hint, hint: link building!).
  • Ensure that the syndication partner applies a “noindex,follow” meta robots or X-robots tag to the syndicated content on their site.
  • Don’t partake in content syndication, and focus on other channels of traffic growth and brand development.
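
If you go with the first option, it is worth verifying that the partner actually implemented the canonical tag. The Python sketch below parses a page’s HTML and checks that its only canonical tag names the original article. Fetching the partner page is left out, and the URLs and function names are placeholders.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def points_to_original(partner_html: str, original_url: str) -> bool:
    """True when the page has exactly one canonical tag naming the original."""
    finder = CanonicalFinder()
    finder.feed(partner_html)
    return finder.canonicals == [original_url]


partner_page = (
    "<html><head>"
    '<link rel="canonical" href="http://www.domain.com/blog/post">'
    "</head><body>Syndicated article</body></html>"
)
print(points_to_original(partner_page, "http://www.domain.com/blog/post"))
```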

Scraped Content

Oftentimes, low-quality scraper sites can steal content from eCommerce websites in order to generate traffic and drive sales through ads. Furthermore, actual eCommerce competitors can steal content (even rewritten manufacturer descriptions), which can be a threat to a reputable eCommerce site’s visibility and rankability in search engines. While search engines have gotten much better at identifying these spammy sites, and filtering them out of their search results, they can still pose a problem.

The best way to handle this is to file a DMCA complaint with Google, or an intellectual property infringement report with Bing, in order to alert these two search engines to the problem and ultimately get these sites removed from search results.

Caveat: The content must be your own. If you’re using manufacturer product descriptions, you might have difficulty in convincing the search engines that the scraper site is truly violating your copyright. This might be a little easier if the scraper site is displaying your entire web page on their site, with clear branding of your website.

Classifieds & Auction Sites

Many eCommerce sites experience content duplication issues when other people or retailers copy their product descriptions to Craigslist, eBay and other auction/classifieds sites. Fighting this issue is an uphill battle that is likely to take more effort than it’s worth. Luckily, pages on these sites expire relatively quickly (within a few months), and Google is likely wise to that situation.

What is within your control is your own product listings on classifieds and auction sites. Be mindful of any content duplication, and use your product feed for these sites wherever possible.

Rand Fishkin chimed in on a Moz Q&A regarding duplicate content on eBay, and although the comment is from 2011, it still holds weight.

“…generally, the content duplication by having the product info on their site shouldn’t harm you.

If you’re really worried, provide more detail/depth/content on your own site than what you do on eBay, and possibly consider having different title/product name conventions. There’s lots of good ways to describe the same product.”

eBay doesn’t offer much help when other eBay members duplicate your content. Their Images and text policy guidelines merely state:

“If your image or text is being used by another member, we encourage you to contact the other member to ask if they’ll remove your image or text from their listing.”

My best advice here is to limit the problem by controlling what is within your power to control. Ensuring that your own site’s product pages (and page updates) are crawled and indexed quickly and regularly will help to ensure that Google sees your content as the original source.

What is Thin eCommerce Content?

Thin content is a page on your site with little to no content that doesn’t add unique value to the website or the user. It provides terrible user experiences and can get your eCommerce website penalized if the problem grows above the unknown threshold of what Google deems acceptable.

Here are some examples of scenarios where thin content could occur.

Thin/Empty Product Descriptions

For large eCommerce websites, it can be easy to take shortcuts on product descriptions. Taking this approach, however, can severely limit both organic search traffic and conversion potential.

Search engines are attempting to rank the best content for their users, and users (typically) want clear explanations of products to help them with their purchasing decisions. When product pages only include one or two sentences, this helps no one.

The solution is to ensure that product descriptions are as thorough and detailed as possible, even when you think it might not be possible to write more (or much at all) about a product.

Tip: One way to expand product descriptions is to jot down 5-10 questions that a customer might ask about the product, write down the answers, and then work them into the product description.

Test or Orphaned Pages

Nearly every website has outlying pages that were published as test pages, forgotten about, and now orphaned on the site. Guess who is still finding them? That’s right, search engines.

Sometimes these pages can be duplicates of others, sometimes they can have partially written content, and sometimes they can simply be empty. Ensure that all published and indexable content on your website is strong and provides value to a user who might view it.

Thin Category Pages

During the taxonomy development phase, content managers can sometimes get carried away with category creation. If a category is only going to contain a few products, or potentially none in the future, then don’t create it.

Thinking in terms of the user, a category with only 1-3 products usually doesn’t provide the greatest browsing experience. Thinking in terms of the search engine–who thinks in terms of the user–too many of these thin category pages (coupled with other forms of duplicate and thin content) can lead a site to be penalized. The bottom line is to ensure that category pages are robust with both unique intro descriptions and sufficient product listings.

Thin content on category pages can also arise when drilling down into faceted category navigation until a page is reached with no products. These are called “stub pages,” and they can lower search engines’ qualitative assessment of an eCommerce website when too many exist.

A helpful fix for this issue is to apply a conditional “noindex,follow” meta robots or X-robots tag to these pages whenever common verbiage (e.g., “No products exist”) is used on the page by the CMS. For a deeper dive on this subject, we highly recommend reading this article, which offers nifty recommendations using AJAX navigation or a selective combination of meta robots tags and /robots.txt disallow commands to maximize crawl budget.
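
The conditional tag can be sketched in a few lines of Python. The marker phrase is whatever wording your CMS prints on empty listings; “no products exist” is used here only because it is the example given above, and the function names are hypothetical.

```python
# Assumption: phrases the CMS prints when a filtered category has no products.
EMPTY_MARKERS = ("no products exist",)


def is_stub_page(body_text: str) -> bool:
    """True when the rendered page contains an empty-listing message."""
    lowered = body_text.lower()
    return any(marker in lowered for marker in EMPTY_MARKERS)


def robots_meta_for(body_text: str) -> str:
    """Return a noindex,follow tag for stub pages, otherwise nothing."""
    if is_stub_page(body_text):
        return '<meta name="robots" content="noindex, follow">'
    return ""


print(robots_meta_for("Flashlights > Red > Under $5: No products exist."))
print(repr(robots_meta_for("Flashlights: showing 24 products")))
```

If your templates expose the product count directly, testing for a count of zero is more reliable than matching on wording that a copy change could alter.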

Tools for Finding & Diagnosing Duplicate Content

Discovering duplicate content can be one of the most difficult and time-intensive tasks in a technical audit of an eCommerce website. This section will cover some quick tips on how to speed up the process of uncovering duplicate and thin content in order to “know what to fix.”

Google Search Console

Many duplicate content issues (and even thin content issues) can be discovered through Google Search Console, which is free to set up on your website. Bing does not offer anywhere near the same level of investigative tools for the use of duplicate content analysis, so this section will focus solely on Google Search Console.

Here are some of the ways to use Google Search Console for the purpose of identifying duplicate and thin content:

  • HTML Improvements – In this section, Google will point out specific URLs that have duplicate title tags and duplicate meta descriptions. Look for patterns, such as “Duplicate title tags” and “Duplicate meta descriptions” caused by category pages with URL parameters, orphaned pages with “Missing title tags,” etc.
  • Index Status – In this section, Google will show a historical graph of the number of pages from your eCommerce site in its index. If the graph spikes upward at any point in time, and there was no corresponding increase in content creation coinciding with it, it could be an indication that duplicate or low-quality URLs have somehow made their way into Google’s index en masse.
  • URL Parameters – In this section, Google will tell you whether it’s having difficulty crawling and indexing your site. This section is nothing short of fantastic for identifying URL parameters (particularly for category pages) that could be leading to technically-created duplicate URLs. Use Google operators (we’ll get to this soon) to identify if Google has URLs from your eCommerce site with these parameters in its index, and determine whether it is duplicate/thin content or not.
  • Crawl Errors – In this section, if your eCommerce website’s soft 404 errors have spiked, it could be an indication that many low-quality pages have been indexed because error pages are being served without a 404 header status code. Oftentimes these pages will all have an error message as the only body content, and sometimes they have different URLs, which can cause technical duplicate content.

Moz Site Crawl Tool

Moz offers a Site Crawl tool, which is very helpful for identifying internal duplicate page content, not just duplicate metadata. Duplicate content is flagged as a “high priority” issue in the Moz Site Crawl tool since it diminishes the pages’ value to search engine indexes if the ratio of duplicate to unique content is too high. The tool allows you to export the reported pages with duplicated content (and associated pages), making it easier to identify what fixes are necessary.
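For a feel of what “ratio of duplicate to unique content” can mean in practice, here is a small Python sketch that compares two pages using word shingles and Jaccard similarity. This is a common stand-in technique, not a description of how Moz scores pages; the function names and the shingle size of three words are assumptions.

```python
from typing import Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Split text into overlapping runs of k consecutive words."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(text_a: str, text_b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets: 0.0 (no overlap) to 1.0 (identical)."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a and not b:
        return 1.0  # two empty pages are treated as identical
    return len(a & b) / len(a | b)
```

Pairs of pages that score close to 1.0 are strong candidates for consolidation or a canonical tag.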

Inflow’s CruftFinder SEO Tool

Created by the Inflow team, the CruftFinder SEO tool is designed to help you boost the quality of your domain by cleaning up “cruft” (junk URLs and low quality pages), reducing index bloat, and optimizing your crawl budget.

It’s primarily meant to be a diagnostic tool, so use it during your audit process, especially on older sites or when you’ve recently migrated to a new platform.

Search Query Operators (site:, inurl:, etc.)

Using search query operators in Google is one of the most effective ways of identifying duplicate and thin content, especially after potential problems have been identified in Google Search Console. The following operators are particularly helpful:

site: – This operator will show most URLs from your site indexed by Google, but not necessarily all of them. This is a quick way to gauge whether Google has an excessive number of URLs indexed for your site compared to the number of URLs in your sitemap. (The sitemap count should accurately reflect the number of true content pages on your site, assuming that your sitemap is correctly populated with all of your true content URLs.)

  • Example – site:www.domain.com
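To get the sitemap side of that comparison, a short Python sketch like the one below can count the URLs in a standard XML sitemap. It assumes a single sitemap file using the sitemaps.org namespace (not a sitemap index), and the indexed figure is one you read off the site: results yourself, which is only an estimate. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> List[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def index_to_sitemap_ratio(indexed_estimate: int, sitemap_count: int) -> float:
    """How many indexed URLs Google reports per URL in the sitemap."""
    if sitemap_count <= 0:
        raise ValueError("sitemap_count must be positive")
    return indexed_estimate / sitemap_count
```

A ratio well above 1 suggests that Google has indexed many URLs you never intended to be content pages.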

inurl: – This operator is ideal to use in conjunction with the site: operator in order to discover if URLs with particular parameters are indexed by Google. As mentioned earlier, potentially harmful URL parameters (if they are creating duplicate content and indexed by Google) can be identified in the URL Parameters section of Google Search Console. Use this operator to discover if Google has them indexed.

  • Example – site:www.domain.com inurl:?price=

This operator can also be used in “negative” fashion to identify if non-www URLs are indexed by Google (assuming that the www version of URLs is preferred).

  • Example – site:domain.com -inurl:www

intitle: – This operator will show URLs indexed by Google that have specific words in the title tag. This can be particularly helpful when attempting to identify duplicates of a particular page, such as a product page that may also have a “review page” indexed by Google.

  • Example – site:www.domain.com intitle:"Maglite LED XL200"
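If you have a long list of URL parameters or product names to check, a small Python helper can generate the query strings for you to paste into Google. This is only a convenience sketch; the function names are hypothetical, and quoting multi-word titles is this sketch's own choice so that intitle: applies to the whole phrase rather than just the first word.

```python
from typing import Iterable, List


def site_query(host: str) -> str:
    """All indexed URLs for a host, e.g. site:www.domain.com"""
    return f"site:{host}"


def param_queries(host: str, params: Iterable[str]) -> List[str]:
    """One site: + inurl: query per URL parameter name."""
    return [f"site:{host} inurl:?{param}=" for param in params]


def non_www_query(bare_domain: str) -> str:
    """Indexed URLs on the domain that do not contain 'www'."""
    return f"site:{bare_domain} -inurl:www"


def title_query(host: str, title: str) -> str:
    """Indexed URLs whose title tag contains the given word or phrase."""
    phrase = f'"{title}"' if " " in title else title
    return f"site:{host} intitle:{phrase}"
```

For example, param_queries("www.domain.com", ["price", "sort"]) produces one query per parameter, ready to run one at a time.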

Plagiarism, Crawler & Duplicate Content Tools

A number of very helpful third-party tools can further identify duplicate and low-quality content that search engines could easily index. The following are some of the more popular tools to use for these purposes:

  • Copyscape – This tool is particularly useful for identifying external “editorial” duplicate content. Copyscape can crawl a website’s sitemap and compare all URLs within it to the rest of Google’s index, looking for instances of plagiarism. For the specific needs of eCommerce websites, this is particularly helpful for identifying the worst-offending product pages when it comes to copied and pasted manufacturer product descriptions. Exporting the data as a CSV file and sorting by risk score allows for quick prioritization of the pages with the most duplicate content. Try this tool at www.copyscape.com.
  • Screaming Frog – This tool is very popular with advanced SEO professionals, as it crawls a website and helps to identify potential technical issues that could exist with duplicate content, improper redirects, error messages, etc. Exporting the crawl and segmenting the duplicate content issues can provide a lot of additional insight not provided by Google Search Console. Download this tool at http://www.screamingfrog.co.uk.
  • Siteliner – This tool offers a quick way to identify pages on your eCommerce site with the most internal duplicate content. The percentage of duplicate content reported for each page is determined by how much unique content exists on that page in comparison to the repeated elements of each web page (header, sidebar, footer, etc.). This tool is particularly helpful for finding thin content pages. Try this tool at www.siteliner.com.
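The export-and-sort step mentioned for Copyscape can also be done outside a spreadsheet. The Python sketch below ranks the rows of a CSV export by risk score, highest first. The column names "url" and "risk" are assumptions for illustration only; check the header row of your actual export and pass the real names in.

```python
import csv
import io
from typing import List, Tuple


def rank_by_risk(
    csv_text: str, url_column: str = "url", risk_column: str = "risk"
) -> List[Tuple[str, float]]:
    """Return (url, risk) pairs sorted from highest to lowest risk score."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[risk_column]), reverse=True)
    return [(row[url_column], float(row[risk_column])) for row in rows]
```

The first few entries of the result are the pages to rewrite first.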

Experience, Intuition & CMS Knowledge

While the various tools and technical tips recommended above are extremely helpful for identifying duplicate, thin, and low-quality content, nothing compares to years of experience in identifying, diagnosing, and fixing duplicate content problems.

As you work through identifying these specific issues on your website, you’ll develop a wealth of knowledge that can be used and re-used to continue cleaning up these issues and preventing them in the future. There’s only one way to get to that point: get started!

Don’t have the time or resources to tackle such an important project? Rely on our team of experts to help clean up your eCommerce website. Contact us here.

Additional Resources