Editor’s Note: This post was originally published in November of 2013. While a lot of the original content still stands, algorithms and strategies are always changing. So our team has updated this post for 2020 and we hope that it will continue to be a helpful resource.
For eCommerce SEO professionals, issues surrounding duplicate and thin/low-quality content can spell disaster in the search engine rankings. As Google, Bing, and other search engines become more sophisticated, they reward websites that present only unique, quality content to their search bots for indexation.
In this resource guide, we dig into a wide variety of duplicate content scenarios commonly found on eCommerce websites.

What is Duplicate Content?
Google began taking duplicate, scraped, and thin content very seriously on February 24th, 2011 when they launched their first Panda algorithm update. According to their Content Guidelines, Google defines duplicate content as:
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Examples of non-malicious duplicate content could include:
- Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
- Store items shown or linked via multiple distinct URLs
- Printer-only versions of web pages
Basically, duplicate content is the exact same website copy found on multiple pages anywhere on the internet. For example, if content on a URL on your site can be found word for word somewhere else on your site, then you’re dealing with a duplicate content issue.
Duplicate content does not directly trigger a ranking penalty, but it can hurt your overall website rankings if Google doesn’t know which page to rank: search engines will have a difficult time figuring out which web page is most relevant to the user.
What we should be worried about is having a large number of web pages on our websites that are mostly duplicate content, or product pages with such short product descriptions that the content can be deemed thin, and thus not valuable (to either Google or the reader).
Our job as website publishers and content managers is to ensure that we are providing the most robust information possible to our readers. When we take this approach, we are rewarded by Google since this meets their quality guidelines.
But not all duplicate content is editorially created. There are a number of technical situations that can create duplicate content issues and lead to Google penalizing your site.
We’ll dive into many of these situations within this chapter so that you’re fully prepared to avoid duplicate content across your entire website.
We can help you spot and fix issues on your website that are harming your overall ranking. Contact us here.
Internal Technical Duplicate Content
Duplicate content can exist internally on an eCommerce site in a number of ways, both due to technical and editorial causes. We’ll dive into some of the more popular instances where internal duplicate content can rear its ugly head.
The following instances of duplicate content are typically caused by technical reasons with the Content Management System (CMS) and other code-related aspects of eCommerce websites.
Non-Canonical URLs
Canonical URLs help search engines understand that only a single version of a page’s URL should be indexed, no matter which other URL versions are rendered in the browser, linked to from external websites, etc.
Canonical URLs are extremely important in the case of tracking URLs, where tracking code (i.e. – affiliate tracking, social media source tracking, etc.) is appended to the end of a URL on the site (i.e. – ?a_aid=, ?utm_source, etc.).
They are also very helpful in fine-tuning indexation of category page URLs on eCommerce websites in instances where sorting, functional and filtering parameters are added to the end of the base category URLs to produce a different ordering of products on a category page (i.e. – ?dir=asc, ?price=10-, etc.).
Ensuring that the Canonical URL (in the <head> of the source code) is the same as the base category URL will prevent search engines from indexing these duplicate URLs.
| URL/Page Type | Visible URL | Canonical URL |
| --- | --- | --- |
| Base Category URL | https://www.domain.com/page-slug | https://www.domain.com/page-slug |
| Social Tracking URL | https://www.domain.com/page-slug?utm_source=twitter | https://www.domain.com/page-slug |
| Affiliate Tracking URL | https://www.domain.com/page-slug?a_aid=123456 | https://www.domain.com/page-slug |
| Sorted Category URL | https://www.domain.com/page-slug?dir=asc&order=price | https://www.domain.com/page-slug |
| Filtered Category URL | https://www.domain.com/page-slug?price=-10 | https://www.domain.com/page-slug |
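For example, a category page reached through any of the parameterized URLs above would include a canonical link element like the following in its <head> (the domain and page slug are placeholders):

<link rel="canonical" href="https://www.domain.com/page-slug" />

Because every URL variation declares the same canonical URL, search engines consolidate their indexing signals onto the base category URL.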
It might also be beneficial to disallow crawling of the commonly used URL parameters via the /robots.txt file, in order to maximize crawl budget. Example:
User-agent: *
Disallow: *?dir=*
Disallow: *&order=*
Disallow: *?price=*
Session IDs
Many eCommerce websites use session IDs in URLs (i.e. – ?sid=) to track user behavior. The problem for search engines is that this creates a duplicate of the core URL of whatever page the session ID is applied to.
One common approach to fix this is to use cookies to track user sessions, instead of appending session ID code to URLs. However, if session IDs are appended to URLs, it’s easy to fix this by canonicalizing the session ID URLs to the page’s core URL.
A backup approach could be to set URLs with session IDs to noindex, but this limits page-level link equity potential in the event that someone links to a page’s URL that includes the session ID. It might also be beneficial to disallow crawling of session ID URLs via the /robots.txt file as long as the CMS system does not produce session IDs for search bots (which could cause major crawlability issues).
Example:
User-agent: *
Disallow: *?sid=*
Shopping Cart Pages
When users add products to their cart on your eCommerce website and view their cart, most CMS systems implement URL structures that are specific to the shopping cart experience.
They might have “cart,” “basket,” or some other word as the unique identifier within these shopping cart URLs. It’s important to realize that these are not the types of pages that search engines wish to index, so identifying them and then setting them to “noindex,nofollow” via a meta robots tag or X-robots tag (and also disallowing crawling of them via the /robots.txt file) will help prevent search engines from indexing this low-quality content.
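As a rough sketch (the /cart/ path is just an illustration; use whatever identifier your CMS actually produces), the cart page template could output:

<meta name="robots" content="noindex,nofollow">

and the /robots.txt file could include:

User-agent: *
Disallow: /cart/

Just remember that if cart URLs are blocked in /robots.txt before they have been de-indexed, search engines won’t be able to re-crawl them and see the noindex directive.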
Internal Search Results
Internal search result pages are produced when someone conducts a search using an eCommerce website’s internal search feature. They have no unique content, only repurposed snippets of content from other pages on your eCommerce website.
Google’s own Matt Cutts has clearly stated that they do not want to send users from their search results to your search results (source). Instead, they want to send users to true content pages (product pages, category pages, static site pages, blog posts, and articles). This is an extremely common issue with eCommerce websites.
Many CMS systems do not set internal search result pages to “noindex, follow” by default, so a developer will need to apply this rule in order to fix the problem. It’s also recommended to disallow search bots from crawling internal search result pages within the /robots.txt file once all of your internal search result pages have been removed from the index, or before any of them get indexed in the first place.
It’s an easy fix, yet an important one, since too many internal search result pages in Google’s index can lead to ranking penalties under Google’s Panda algorithm.
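Instead of editing page templates, the same “noindex, follow” rule can also be applied as an X-Robots-Tag HTTP header at the server level. A hedged sketch, assuming internal search results share a /search path (purely an assumption; adjust to your CMS) and an Apache 2.4+ server with mod_headers enabled:

# .htaccess sketch: send a noindex, follow directive on internal search result pages
<If "%{REQUEST_URI} =~ m#^/search#">
    Header set X-Robots-Tag "noindex, follow"
</If>

Once those pages have dropped out of the index, the corresponding /robots.txt rule might look like:

User-agent: *
Disallow: /search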
Duplicate URL Paths
How CMS systems handle URL structures where products are placed in multiple categories of taxonomy can get tricky. For example, if a product is placed in both category A and category B, and if category directories are used within the URL structure of product pages, then the CMS could potentially create two different URLs for the same product.

As one can imagine, this can lead to devastating duplicate content problems for product pages, which are typically the highest converting pages on an eCommerce website. Common approaches to fix this are:
- Use root-level product page URLs (unfortunately this removes keyword-rich, category-level URL structure benefits and also limits trackability in Analytics software).

- Use /product/ URL directories for all products (which at least offers grouped trackability of all products in Analytics software).

- Use product URLs built upon category URL structures, but ensure that each product page URL has a single, designated canonical URL (see the sketch below).
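To illustrate the third option with hypothetical URLs: a flashlight product that renders at both /tools/flashlights/led-flashlight and /emergency/flashlights/led-flashlight would declare a single canonical URL on both versions:

<link rel="canonical" href="https://www.domain.com/tools/flashlights/led-flashlight" />

Search engines then consolidate ranking signals for the product onto the one designated URL.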

In some instances, this situation can also arise with sub-Category URLs where the products displayed might be exactly the same, or close to it. For example, a “Flashlights” sub-category might be placed under both /tools/flashlights/ and /emergency/flashlights/ on an Emergency Preparedness eCommerce website, and have mostly the same products.
Taxonomy opinions aside, the same approach can be applied in these situations as with product pages. Also, writing robust intro descriptions atop the category pages helps ensure that each similar sub-category page has unique content.
Product Review Pages
Many CMS systems come with built-in review functionality. Oftentimes, separate “review pages” are created to host all reviews for particular products, yet some (if not all) of the reviews are also placed on the product pages themselves.
This can create duplicate content between the product pages and the corresponding product review pages. These “review pages” should either be canonicalized to the main product page or set to “noindex,follow” via a meta robots or X-robots tag. The canonicalization method is preferred, just in case a link to a “review page” appears on an external website, in which case the link equity will pass to the product page.
It’s also critical to ensure that review content is not duplicated on external sites when using 3rd party product review vendors. For a deep dive into this topic, please read Product Review Vendors—Solutions to Fit Your eCommerce SEO Needs.
WWW vs. Non-WWW URLs & Uppercase vs. Lowercase URLs
Just as the Post Office would consider 123 Race Avenue and 123 Race Street different home addresses, search engines consider https://www.domain.com and https://domain.com different web addresses. Therefore, it’s critical that one version of URLs is chosen for every page on the eCommerce website. 301 redirecting the non-preferred version to the preferred version is the recommended solution to avoid these technically created duplicate URLs, per Google.
Uppercase and lowercase URLs need to be handled in the same manner. If both render separately, then search engines can consider them different. It’s important to choose one format and 301 Redirect one version to the other. We have a helpful article that offers instruction on how to do this: How to Redirect Uppercase URLs to Lowercase URLs Using Htaccess.
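A rough .htaccess sketch of the non-www to www redirect (assuming Apache with mod_rewrite enabled, and that the www version is preferred); note that lowercasing URLs generally also requires a RewriteMap defined in the main server configuration, as covered in the linked article:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^domain\.com$ [NC]
RewriteRule ^(.*)$ https://www.domain.com/$1 [R=301,L]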
Trailing Slashes on URLs
Similar to www and non-www URLs, search engines consider URLs that render both with and without a trailing slash to be different URLs. As an example, duplicate URLs are created when URLs such as /page/ and /page/index.html, or /page and /page.html, render the same content.
It is especially problematic when /page and /page/ show the same content since, technically speaking, these two pages aren’t even in the same directory. Common approaches to fixing this problem are to either canonicalize both to a single version or 301 redirect one version to the other.
I noticed there was some confusion around trailing slashes on URLs, so I hope this helps. tl;dr: slash on root/hostname=doesn’t matter; slash elsewhere=does matter (they’re different URLs) pic.twitter.com/qjKebMa8V8
— John ☆.o(≧▽≦)o.☆ (@JohnMu) December 19, 2017
HTTPS URLs: Relative vs. Absolute Path
HTTPS (secure) URLs are typically created after a user has logged into an eCommerce website. Most times, search engines have no way of finding these URLs. However, there are instances where this is possible, such as when a logged-in Administrator is updating content and navigational links.
In this scenario, it’s common for the Administrator not to realize that embedded URLs include HTTPS instead of HTTP in the URLs. When relative path URLs (excluding the “https://www.domain.com” portion) are also used on the site (either in content or navigational links), it makes it all too easy for search engines to quickly crawl hundreds, if not thousands of HTTPS URLs, which are technically duplicates of the HTTP versions.
The most common solutions to fix this consist of using absolute path URLs (including the “https://www.domain.com” portion) coupled with ensuring that canonical URLs always use the HTTP version. Using 301 redirects in these cases could easily break the user-login functionality, as the HTTPS URLs would not be able to be rendered.
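As a quick illustration (the /category/widgets/ path is hypothetical), a relative link inherits whatever protocol the current page was loaded over, while an absolute link always points at the preferred version of the URL:

<!-- Relative path: becomes an HTTPS link when the page is being viewed over HTTPS -->
<a href="/category/widgets/">Widgets</a>

<!-- Absolute path: always resolves to the preferred URL -->
<a href="https://www.domain.com/category/widgets/">Widgets</a>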
Internal Editorial Duplicate Content
The following instances of duplicate content are not technical but happen because the on-page content for different pages is either similar or duplicated. These issues are commonly solved by writing unique copy for each individual page.
Similar Product Descriptions
It’s easy to take shortcuts with product descriptions on eCommerce websites, especially with similar products. However, consider that Google is judging the content of eCommerce websites similar to regular content sites.
That alone should be enough to make a professional SEO realize that product page descriptions should be unique, compelling, and robust, especially for mid-tier eCommerce websites looking to scale content production because they don’t have enough Domain Authority to compete with bigger competitors.
Sharing short paragraphs, specifications, and other content between product pages increases the likelihood that search engines will lower their perception of a product page’s content quality and, subsequently, its ranking position.
Category Pages
Category pages on eCommerce websites typically include only a title and a product grid, which means there is no unique content on these pages. Category page best practice is to add a unique description at the top of each category page (not the bottom, where content is given less weight by search engines) describing what types of products are featured within the category.
There is no magic number of words or characters to use; however, the more robust the category page content is, the better the chance the page will be able to maximize traffic from organic search results (due to long-tail keyword traffic).
A benchmark of 100-300 words is common. It’s important to understand the screen resolutions of your visitors and ensure that the description does not push the product grid below the fold on their browsers, which could limit user discoverability of the product grid upon visiting the category page.
Tip: Intro descriptions on category pages offer a great opportunity to build deep links to related sub-category pages, related article content that may exist on the site, and popular products that deserve attention and link equity.
Homepage Duplicate Content
Every SEO should know that home pages typically have the most amount of incoming link equity, and thus serve as highly rankable pages in search engines.
What many SEOs forget is that a homepage should be treated like any other page on an eCommerce website, content-wise. Always ensure that unique content fills the majority of the homepage’s body content; a homepage consisting merely of duplicated product blurbs offers little contextual value for search engines to rank it as highly as possible for target keywords.
Tip: Online marketers also commonly use the homepage’s descriptive content in directory submissions and other business listings on external websites. Ensure that unique content is provided to these external websites instead. If this has already been done to a large extent, rewriting the home page descriptive content is the easiest way to fix the preexisting issue.
Offsite Duplicate Content
Duplicate content that exists between an eCommerce website and other eCommerce websites (and potentially even content websites) has become a real pain point in recent years.
As Google clearly moves towards ranking websites more based on inbound link metrics (such as Domain Authority), websites with less inbound link equity are finding it extremely difficult to rank well in search engines when external duplicate content exists.
Let’s dive into some of the most common forms of external (off-site) duplicate content that prevents eCommerce websites from ranking as well as they could in organic search.
Manufacturer Product Descriptions
When eCommerce websites copy product descriptions, supplied by the product manufacturer, and place them on their own product pages, they are put at an immediate disadvantage.
In the search engines’ algorithmic analysis, these websites aren’t offering any unique value to users, so search engines choose to rank the big-brand websites (which have more robust, higher-quality inbound link profiles) higher instead, even though those sites may be using the same product descriptions. The only way to fix this is to embark upon the extensive task of rewriting existing product descriptions, in addition to ensuring any new products are launched with completely unique descriptions.
In our experience, we’ve seen lower-tier eCommerce websites increase organic search traffic by as much as 50-100% simply by rewriting product descriptions for half of the website’s product pages, with no manual link building efforts.
For eCommerce websites whose products are very time-sensitive, meaning they come in and out of stock as newer models are released, a better approach can be to simply ensure that new product pages are only launched with completely unique descriptions.
Other ways of filling product pages with unique content include:
- Multiple photos (preferably unique photos, if possible)
- Enhanced descriptions that offer more detailed insight into product benefits
- Product demonstration videos (users love videos)
- Schema markup (to enhance SERP listings; see the sample markup after this list)
- User-generated reviews
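As a minimal sketch of Product schema markup in JSON-LD (the product name echoes the example used later in this guide, and the price, image URL, and rating values are purely hypothetical):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Maglite LED XL200",
  "description": "Your unique, rewritten product description goes here.",
  "image": "https://www.domain.com/images/maglite-led-xl200.jpg",
  "offers": {
    "@type": "Offer",
    "price": "34.99",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.6",
    "reviewCount": "27"
  }
}
</script>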
Duplicate Content on Staging, Development or Sandbox Websites
Time and time again, Development teams forget, give little consideration to, or simply don’t realize that testing sites can be discovered and indexed by search engines, oftentimes creating exact duplicates of a live eCommerce website. Luckily, these situations can be easily fixed through different approaches:
- Adding a “noindex,nofollow” meta robots or X-robots tag to every page on the test site.
- Blocking search engine crawlers from crawling the sites via a “Disallow: /” command in the /robots.txt file on the test site (don’t use this if your “duplicate” content has already been indexed).
- Password-protecting the test site, to prevent search engines from crawling it.
- Setting up these test sites separately within Webmaster Tools and using the “Remove URLs” tool in Google Search Console, or the “Block URLs” tool in Bing Webmaster Tools, to quickly get the entire test site out of Google and Bing’s index.
SEO Tip: want to update, (eg. noindex) a set of pages quickly? Create a HTML list with all URLs, Fetch as Google in Search Console -> Crawl this URL and its direct linkss #seo pic.twitter.com/s5SDRFVHRF
— Jan-Willem Bobbink (@jbobbink) December 5, 2017
When search engines already have a test website indexed, using a combination of these approaches can yield the best results. One approach is to add the “noindex,nofollow” meta robots or X-robots tag, remove the entire site from search engines’ indexes via Webmaster Tools, and then add a “Disallow: /” command in the /robots.txt file once the content has been removed from the index.
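A hedged sketch of that combination on an Apache-based staging server (mod_headers assumed), with the crawl block added only after the pages have dropped out of the index:

# .htaccess on the staging site only (never on production)
Header set X-Robots-Tag "noindex, nofollow"

# /robots.txt on the staging site, added once de-indexing is complete
User-agent: *
Disallow: /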
3rd Party Product Feeds (Amazon & Google)
For good reason, eCommerce websites see the value in listing their products on 3rd party shopping websites in order to extend their potential sales reach. What many eCommerce website marketing managers don’t realize is that this creates duplicate content across these external domains.
Oftentimes, an eCommerce website’s own products on 3rd party websites will end up outranking its own product pages when products are fed onto 3rd party websites with more authoritative inbound link profiles.
Consider the popular scenario where a product manufacturer, with its own eCommerce website (to sell its own products direct to consumers), feeds its products to Amazon to greatly increase sales. This scenario is highly plausible for revenue reasons.
From an SEO perspective, serious problems have just been created, as Amazon is one of the most authoritative websites in the world and the product pages on Amazon are almost guaranteed to outrank the product pages on the manufacturer’s eCommerce website. Some may view this as revenue displacement, but it clearly is going to put an in-house SEO’s job, or an SEO agency’s contract, in jeopardy when organic search traffic (and resulting revenue) plummets for the eCommerce website.
The solution to this problem is exactly what you would expect: ensure that product descriptions fed to 3rd party sites are different than what is placed on your eCommerce website. It’s recommended to give the manufacturer description to the 3rd party shopping feeds like Google, and write a more robust, unique description for your own eCommerce website.
Always give your own website the edge when it comes to content. In cases where an eCommerce website is selling its own products, webmasters and marketers will need to decide whether to rewrite the 3rd party shopping feed description or the on-site description. Whichever is decided upon, just ensure that the most authoritative and robust description exists on-site.
Affiliate Programs
Google’s guidance covering affiliate programs is worth quoting here:
“Pages with product affiliate links on which the product descriptions and reviews are copied directly from the original merchant without any original content or added value.”
If your eCommerce site offers an affiliate program, ensure that you do not distribute your own site’s product descriptions to your affiliates. It’s advised to provide affiliates with the same product feeds that are given to other 3rd party vendors who sell or promote your products.
For maximum ranking potential in search engines, ensure that no affiliates or 3rd party vendors use the same descriptions that you are using. Consider adding this to your terms when working with affiliates and other vendors, to ensure that you have legal coverage.
If any affiliates or vendors violate these terms, you have the contractual right to require them to remove the duplicated content and use your designated product description feed instead.
Syndicated Content
Some eCommerce websites will also have blogs in order to provide more marketable content on their website, and some of them will even syndicate that content out to other websites (again, to extend their marketing reach).
While this may seem like a great idea at first, it’s critical to realize that without proper SEO handling, this can also create external duplicate content. If the syndication partner is a more authoritative website (according to its inbound link profile), then it’s possible that the content on the syndication partner’s website will outrank (in search engines) the original content on the eCommerce website.
There are a few different solutions to prevent syndicated content outranking your own content:
- Ensure that the syndication partner canonicalizes the content to the URL on the eCommerce site that it originated from (see the snippet after this list). This is the best solution, as any inbound links to the content on the syndication partner’s website will be applied to the content on the eCommerce website (hint, hint: link building!).
- Ensure that the syndication partner applies a “noindex,follow” meta robots or X-robots tag to the syndicated content on their site.
- Don’t partake in content syndication, and focus on other channels of traffic growth and brand development.
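As a sketch of the first option (the URLs are hypothetical), the syndication partner would place a cross-domain canonical in the <head> of their copy of the article, pointing back to the original:

<!-- On the syndication partner's copy of the article -->
<link rel="canonical" href="https://www.domain.com/blog/original-article/" />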
Scraped Content
Oftentimes, low-quality scraper sites can steal content from eCommerce websites in order to generate traffic and drive sales through ads. Furthermore, actual eCommerce competitors can steal content (even rewritten manufacturer descriptions), which can be a threat to a reputable eCommerce site’s visibility and rankability in search engines. While search engines have gotten much better at identifying these spammy sites, and filtering them out of their search results, they can still pose a problem.
The best way to handle this is to file a DMCA complaint with Google, or Intellectual Property Infringement with Bing, in order to alert these two search engines to the problem, and ultimately get these sites removed from search results.
Caveat: The content must be your own. If you’re using manufacturer product descriptions, you might have difficulty in convincing the search engines that the scraper site is truly violating your copyright. This might be a little easier if the scraper site is displaying your entire web page on their site, with clear branding of your website.
Classifieds & Auction Sites
Many eCommerce sites experience content duplication issues when other people or retailers copy their product descriptions to Craigslist, eBay, and other auction/classifieds sites. Fighting this issue is an uphill battle that is likely to create more effort than it’s worth. Luckily, pages on these sites expire relatively quickly (within a few months), and Google is likely wise to that situation.
What is within your control is your own product listings on classifieds and auction sites. Be mindful of any content duplication, and use your product feed for these sites wherever possible.
Rand Fishkin chimed in on a Moz Q&A regarding duplicate content on eBay, and although the comment is from 2011, it still holds weight.
“…generally, the content duplication by having the product info on their site shouldn’t harm you.
If you’re really worried, provide more detail/depth/content on your own site than what you do on eBay, and possibly consider having different title/product name conventions. There’s lots of good ways to describe the same product.”
eBay doesn’t offer much help when other eBay members duplicate your content. Their Images and text policy guidelines merely state:
“If your image or text is being used by another member, we encourage you to contact the other member to ask if they’ll remove your image or text from their listing.”
My best advice here is to limit the problem by controlling what is within your power to control. Ensuring that your own site’s product pages (and page updates) are crawled and indexed quickly and regularly will help to ensure that Google sees your content as the original source.
What is Thin eCommerce Content?
Thin content is a page on your site with little to no content that doesn’t add unique value to the website or the user. It provides a terrible user experience and can get your eCommerce website penalized if the problem grows above the unknown threshold of what Google deems acceptable.
Here are some examples of scenarios where thin content could occur.
Thin/Empty Product Descriptions
For large eCommerce websites, it can be easy to take shortcuts on product descriptions. Taking this approach, however, can severely limit both organic search traffic and conversion potential.
Search engines are attempting to rank the best content for their users, and users (typically) want clear explanations of products to help them with their purchasing decisions. When product pages only include one or two sentences, this helps no one.
The solution is to ensure that product descriptions are as thorough and detailed as possible, even when you think it might not be possible to write more (or much at all) about a product.
Tip: One way to expand product descriptions is to jot down 5-10 questions that a customer might ask about the product, write down the answers, and then work them into the product description.
Test or Orphaned Pages
Nearly every website has outlying pages that were published as test pages, forgotten about, and now orphaned on the site. Guess who is still finding them? That’s right, search engines.
Sometimes these pages can be duplicates of others, sometimes they can have partially written content, and sometimes they can simply be empty. Ensure that all published and indexable content on your website is strong and provides value to a user who might view it.
Thin Category Pages
During the taxonomy development phase, content managers can sometimes get carried away with category creation. If a category is only ever going to contain a few products, or potentially none in the future, then don’t create it.
Thinking in terms of the user, a category with only 1-3 products usually doesn’t provide the best browsing experience. Thinking in terms of the search engine (which thinks in terms of the user), too many of these thin category pages, coupled with other forms of duplicate and thin content, can lead a site to be penalized. The bottom line is to ensure that category pages are robust, with both unique intro descriptions and sufficient product listings.
Thin content on category pages can also arise when drilling down into faceted category navigation until a page is reached with no products. These are called “stub pages,” and they can lower search engines’ qualitative assessment of an eCommerce website when too many exist.
A helpful solution to fixing this issue is to apply a conditional “noindex,follow” meta robots or X-robots tag to these pages whenever common verbiage (i.e. – “No products exist”) is used on the page by the CMS. For a deeper dive on this subject, we highly recommend reading this article, which offers nifty recommendations using AJAX navigation or a selective combination of meta robots tags and /robots.txt disallow commands to maximize crawl budget.
Tools for Finding & Diagnosing Duplicate Content
Discovering duplicate content can be one of the most difficult and time-intensive tasks in a technical audit of an eCommerce website. This section will cover some quick tips on how to speed up the process of uncovering duplicate and thin content in order to “know what to fix.”
Google Search Console
Many duplicate content issues (and even thin content issues) can be discovered through Google Search Console, which is free to set up on your website. Bing does not offer anywhere near the same level of investigative tools for the use of duplicate content analysis, so this section will focus solely on Google Search Console.
Here are some of the ways to use Google Search Console for the purpose of identifying duplicate and thin content:
- HTML Improvements – In this section, Google will point out specific URLs that have duplicate title tags and duplicate meta descriptions. Look for patterns, such as “Duplicate title tags” and “Duplicate meta descriptions” caused by category pages with URL parameters, orphaned pages with “Missing title tags,” etc.
- Index Status – In this section, Google will show a historical graph of the number of pages from your eCommerce site in its index. If the graph spikes upward at any point in time, and there was no corresponding increase in content creation coinciding with it, it could be an indication that duplicate or low-quality URLs have somehow made their way into Google’s index en masse.
- URL Parameters – In this section, Google will tell you whether it’s having difficulty crawling and indexing your site. This section is nothing short of fantastic for identifying URL parameters (particularly for category pages) that could be leading to technically-created duplicate URLs. Use Google search operators (we’ll get to these soon) to identify whether Google has URLs from your eCommerce site with these parameters in its index, and determine whether they represent duplicate/thin content or not.
- Crawl Errors – In this section, if your eCommerce website’s soft 404 errors have spiked, it could be an indication that many low-quality pages have been indexed due to improper 404 error pages (lacking 404 header status codes). Oftentimes these pages will all have an error message as the only body content, and sometimes they have different URLs, which can cause technical duplicate content.
Moz Site Crawl Tool
Moz offers a Site Crawl tool, which is very helpful for identifying internal duplicate page content, not just duplicate metadata. Duplicate content is flagged as a “high priority” issue in the Moz Site Crawl tool, since it diminishes a page’s value to search engine indexes if the ratio of duplicate to unique content is too high. The tool allows you to export the reported pages with duplicated content (and their associated pages), making it easier to identify what fixes are necessary.
Inflow’s CruftFinder SEO Tool
Created by the Inflow team, the CruftFinder SEO tool is designed to help you boost the quality of your domain by cleaning up “cruft” (junk URLs and low quality pages), reducing index bloat, and optimizing your crawl budget.
It’s primarily meant to be a diagnostic tool, so use it during your audit process, especially on older sites or when you’ve recently migrated to a new platform.
Search Query Operators (site:, inurl:, etc.)
Using search query operators in Google is one of the most effective ways of identifying duplicate and thin content, especially after potential problems have been identified in Google Search Console. The following operators are particularly helpful:
site: – This operator will show most URLs from your site indexed by Google, but not necessarily all of them. This is a quick way to gauge whether Google has an extremely excessive amount of URLs indexed for your site when compared to the number of URLs included in your sitemap (it should be an accurate depiction of the number of true content pages on your site, assuming that your sitemap is correctly populated with all of your true content URLs).
- Example – site:www.domain.com
inurl: – This operator is ideal to use in conjunction with the site: operator in order to discover if URLs with particular parameters are indexed by Google. As mentioned earlier, potentially harmful URL parameters (if they are creating duplicate content and indexed by Google) can be identified in the URL Parameters section of Google Search Console. Use this operator to discover if Google has them indexed.
- Example – site:www.domain.com inurl:?price=
This operator can also be used in “negative” fashion to identify if non-www URLs are indexed by Google (assuming that the www version of URLs is preferred).
- Example – site:domain.com -inurl:www
intitle: – This operator will show all URLs indexed by Google that have specific words in the meta title tag. This can be particularly helpful when attempting to identify duplicates of a particular page, such as a product page that may also have a “review page” indexed by Google.
- Example – site:www.domain.com intitle:Maglite LED XL200
Plagiarism, Crawler & Duplicate Content Tools
There are a number of very helpful 3rd party tools to help additionally identify duplicate and low-quality content that search engines could easily index. The following are some of the more popular tools to use for these purposes:
- Copyscape – This tool is particularly useful at identifying external “editorial” duplicate content. Copyscape can crawl a website’s sitemap and compare all URLs within it to the rest of Google’s index, looking for instances of plagiarism. For the specific needs of eCommerce websites, this is particularly helpful at identifying the worst-offending product pages when it comes to copied and pasted manufacturer product descriptions. Exporting the data as a CSV file, and sorting by risk score allows for quick prioritization of the pages with the most duplicate content. Try this tool at www.copyscape.com.
- Screaming Frog – This tool is very popular with advanced SEO professionals, as it crawls a website and helps to identify potential technical issues that could exist with duplicate content, improper redirects, error messages, etc. Exporting the crawl and segmenting the duplicate content issues can provide a lot of additional insight not provided by Google Search Console. Download this tool at https://www.screamingfrog.co.uk.
- Siteliner – This tool offers a quick way to identify pages on your eCommerce site with the most internal duplicate content. The percent of duplicate content returned by this tool crawling your website pages is determined by how much unique content exists on each particular page in comparison to the repeated elements of each web page (header, sidebar, footer, etc.). This tool is particularly helpful at finding thin content pages. Try this tool at www.siteliner.com.
Experience, Intuition & CMS Knowledge
While the various tools and technical tips recommended above are extremely helpful at identifying duplicate, thin, and low-quality content, nothing compares to years of experience in identifying, diagnosing and fixing duplicate content problems.
As you work through identifying these specific issues on your website, you’ll be developing a wealth of knowledge that can be used and re-used in the future to continue cleaning up these issues and preventing them going forward. There’s only one way to get to that point: get started!
Don’t have the time or resources to tackle such an important project? Rely on our team of experts to help clean up your eCommerce website. Contact us here.
Additional Resources
- Google’s Official Advice on Duplicate Content
- eCommerce Copywriting Guidelines (Free PDF download by Inflow)
- Why Use Unique eCommerce Website Copy?
- Four SEO Best Practices for Using a Content Delivery Network (CDN)
- Duplicate Content Guide from Kern Media
- Duplicate Content in a Post Panda World
- eCommerce SEO: Product Variations, Colors and Sizes
- The Complete Guide to Mastering Duplicate Content Issues
Great in-depth post, Thanks!
Dan this is an Epic post. It is tough to cover so much ground at once without sacrificing depth, but you’ve managed to strike that balance here. Good stuff!
That’s an awesome list and an even better checklist for all webmasters in eCommerce. Thank you very much for sharing! Greetings from Switzerland!
Superb in depth post Dan – Excellent resource.
Definitely the most thorough yet concise article on duplicate content that I have read. Bookmarked and shared.
Great post Dan, definitely worth the read and worth sharing.
Great article indeed Dan. It’s a wonderful resource for ecommerce site owners and staff at SEO companies. A small correction though. Panda algorithm was launched in Feb of 2011 and not 2012. Thanks again.
Thanks for the typo fix Vivek. I will update Dan’s post now. We’re glad you enjoyed it and are proud to have him on our team.
When did robots.txt start supporting wild card entries?
> Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.
Source: https://www.robotstxt.org/robotstxt.html
James, Google has been obeying wildcard directives in the robots.txt file for several years. You can verify this by using them in the robots.txt testing tool in Google Webmaster Tools. As for the correct syntax according to other organizations, they may not technically be supported. As SEOs, we tend to think more about how syntax is treated by search engines. According to Google’s Developer Help page:
I’m currently up to my eyeballs in duplicate content so this post is a serious life saver!
Thanks for sharing!
Thanks for this interesting blog post Dan. Greetings from India!
Excellent post, thanks for putting this info together in a clean and easy to read format. Bookmark’d for later use. 🙂
I’ve been looking for creative ways to check for duplicate content on very large websites, > 100,000 pages. Siteliner and Copyscape are too expensive to make it worthwhile. Any suggestions?
Hello Dan,
There are affordable tools to do checks for duplicate content internally. For example, you could use Screaming Frog to crawl your site and report on duplicate titles, descriptions and other issues typically caused by technical reasons for duplicate content. Other types of internal duplicate content (such as copied and pasted text or duplicate product descriptions on different products) are a little more difficult to catch without a full content audit.
With regard to external duplication, we typically start with a Copyscape check of 1,000 pages from various sections of the site, which gives us a general idea as to the scale and cause of external duplicate content problems. From there it tends to be an issue of fixing the problem more than identifying more pages with the symptom.
Excellent article. I have a question. I am building an eCommerce site with more than 1,000 products. I simply do not have the time or the budget to write unique product descriptions and/or specifications for each product. I realise that if I did that I would certainly be in good standing with Google, but realistically this is almost impossible. Do you have any suggestions?
This is a common scenario, David. It reminds me of a mantra that the Chief Digital Officer (of a publishing company that I worked for previously) had about eCommerce sites: “If you can’t write a unique description for this product tailored to our audience, then don’t put it on the store (website).” So, I would first suggest revisiting your belief that you don’t have the time/budget to ensure you have unique product descriptions, and consider how to make the investment. Your product pages and category pages are the foundation of your store, and their quality will set the foundation for your success online.

There are numerous copywriting services that can help you scale the copywriting. Consider checking out Copypress and more services with this Google query. Consider hiring an intern or two, or even family members, and having them rewrite the duplicate product descriptions. If you have a store with an unusually large number of products (i.e. – 10,000), then consider rewriting/improving the top 10-50% that get the most organic search traffic (improve what’s already working to make it work better).

As a last resort, you could set your product pages to “noindex,follow” (via a meta robots tag) if they don’t get any organic search traffic due to duplicate content (and your inability to improve them for various reasons). Google’s own John Mueller has stated that if you don’t plan to improve low quality content, then either delete it or set it to noindex until you can improve it. In that case, you would focus your copywriting efforts on improving your category pages (optimizing meta titles, meta descriptions and on-page intro descriptions of 100 words for target keywords) and really building out your strategic content marketing efforts (blog posts, video, infographics, etc.) in order to create content about related topics people are searching for and increase your search engine visibility/discoverability in that manner. Hope this helps!
Hi Dan, great article mate 🙂 Have there been any updates to this since last year and the changes happening with the Google algorithm updates?
Hi David, yes there was recently a “quality” update launched by Google. More info here, here and here.
Glenn Gabe is seeing “thin content” as a big culprit. We don’t know anything directly from the source (Google), but the content quality issues appear to be similar to Panda. Everything in this article still applies. You only want to have high quality (non-duplicative, deep and authoritative) content indexed in Google. Hope this helps!
While trying to find a solution for my “problem”, I stumbled upon this great article. Trying to understand it all, I hope you still want to answer this question: Our real estate board issues a monthly update on the market. I post this update on my site. Can I then use a canonical URL pointing to the real estate board, even though the real estate board only places it on their site as a PDF? Or how should I go about this, as many other agents do the same (without re-writing)?
I’ve just read a book that I didn’t intend to. Thank you very much!
I found this article extremely useful! You’ve highlighted for me several areas that I need to improve upon on my website. Thank you so much!
I work with a reseller website reselling market research reports published by market research companies. Our website has 3 lakh (300,000) reports from different publishers. The report/product descriptions provided by publishers are usually duplicate content. How do we solve this issue?
OK, so I understand the importance of original, unique content. I also understand the tactic of keeping any ‘duplicate’ content hidden from search engines.
However, I have several ecommerce clients who are selling third party products which are ALSO sold by other merchants as well, so the product titles have been identified as ‘duplicate’. But to hide them defeats the purpose of having the products on the website at all.
Furthermore, some of the products are very basic and very similar in nature (eg. a ‘rose gold cake topper’ versus a ‘glitter cake topper’). So how reasonable is it to expect the client to generate original, unique content for each? But again, to hide the products makes no sense either.
What to do…?
It’s as simple as this: if your client sells the same products as everyone else who sources their stock from that distributor or manufacturer, what is to differentiate their page from everyone else’s? If you were Google, how would you know which ones to show in the search results if there are dozens or hundreds of virtually identical results?
If the client doesn’t care how those pages rank, then they shouldn’t be in the index. If the client does care, then they need unique titles and product descriptions.
I sympathize with the issue of writing copy for products that are very similar. We handle this in several different ways. One option is to combine the products on one page and allow the visitor to select their option via a drop-down. Sometimes that doesn’t make sense for the situation, so here are some other options: you could choose one version of the product to be “canonical” and point the rel="canonical" tag on the other product pages to that one. This way you only have to write unique copy for one of them. Of course, we also have some clients where it makes sense to get creative and hire a copywriter to highlight the small differences between the different versions in unique copy on each product page.
Hello Dan Kern,
I want to know if excerpts of posts on my site’s homepage are duplicate content?
Does it hurt SEO?
Thank you!
Hello Harley,
If that is the “only” content on your homepage it is not good for SEO. However, if your homepage is sort of like a blog home page where you have excerpts from recent posts, that is fine so long as you have other content unique to that page.
This is great stuff!
I have a question
What if a client has two websites with the same domain name (websitename) on 2 different ccTLDs, like .com and .com.au?
These two websites are duplicates of each other.
Also, they are selling products on popular third party websites, so the product descriptions are the same everywhere, which is duplicate content.
I plan to do separate keyword research and write on-page content for the .com and .com.au versions of the website landing pages.
Also, I’m going to set the target location in Webmaster Tools separately for each.
But what about product descriptions, as there are a lot of products, and both websites as well as the third party sites have the same descriptions?
Adding in-depth content for products on our own websites is an option, but as there are two of our own websites, what will be the right approach?
Looking for help in this case.
Thanks
Hi,
What if you have a business that serves multiple locations, and you have duplicate content for each location EXCEPT for the location names?
So, let’s say you have a 500 word page about “Real Estate Law in Kansas City,” and then a nearly identical page (on another subdirectory or subdomain) on “Real Estate Law in St. Louis”. Then you change any and all geographic references.
Is this duplicate content? Will you be penalized?
Jonathan,
Our advice would be to make it unique content. For example, how are the laws different in KC vs. STL in regard to real estate? Are you a member of a bar association or other professional group in each?
Also, you won’t be penalized per se, but you might not see your rankings go in the right direction either.
Thanks for the great article. I have a quick question. While doing an SEO audit, I received a warning stating all the product pages of the eCommerce website are Orphaned Pages. How do I fix this issue, specifically for an eCommerce website with 1,000 products? Is there a way to interlink product pages so Google bots do not tag them as orphaned?
It sounds like perhaps your internal linking is not in place to reach paginated category page results? I would also look at category structure; spreading a bit wider there (adding more categories and sub-categories) could help provide the internal links necessary to reach your products. You should not orphan your product pages!
Hi,
It’s a good article. My question is how to handle the following situation.
Let’s say I have a product:
Sony Headphones XYZ (this product is over-ear type, water and dust proof, and durable)
Let’s say I made the following posts on headphones:
1. Sports Headphones
2. Durable Headphones
3. Over-Ear Headphones
Since product XYZ qualifies for the above 3 posts, I would like to mention it in all 3 posts, so now I have the following questions:
1. Would it create duplicate content?
2. How do I deal with it?
Note: kindly reply to me by email too
Best Regards
That isn’t the sort of “duplicate content” we’d be worried about. That is normal and expected use.