Duplicate Content and SEO - The Complete Guide 2023

What is internal duplicate content?

Well, what is Duplicate Content at all? Internal Duplicate Content can actually be described quite simply:

(1) Any page of a website that can be reached under more than one URL constitutes duplicate content.

Quite simple, isn't it? - Well, unfortunately it's not that simple after all. There's a little more to it, namely potential Duplicate Content.

(2) Any sorted list is potential duplicate content.

Good, one step further. But there is still one point missing.

(3) Every similar page is potential duplicate content.

We are now complete as far as internal DC is concerned. I want to give a few examples to illustrate what's behind it. I don't want to show any sites that might do it, so you'll have to live with a certain degree of abstraction.

1. Same page on different URLs

  • https://www.beispiel.de
  • https://www.beispiel.de?IchBinEinParameter

This can be particularly nasty if you have affiliate links coming in from outside that you naturally want to have crawled. There are certainly people who try to build up link power with affiliate links - but with parameterized URLs this unfortunately tends to have the opposite effect.

Try it out, attach any parameter to one of your pages - voila. DC.

2. Sorted lists

Imagine you have a store and sell backpacks. Now you have a wonderful list of backpacks that the user can sort by the most amazing parameters and wishes, e.g. by price. He does this diligently. Just like the Googlebot. Click, click, click.

Now you might only have one page of backpacks because they are really super rare items. Do you think Google wouldn't notice that the content on the page has merely been reordered but is otherwise exactly the same?

If you think that, then you are producing the most beautiful DC with a clear conscience.

3. Similar content

A bit similar to the sorting thing, but different. You have a page with - oh, let's say backpacks, but this time a detail page. The user can select colors here and you deliver a new URL, because maybe the red backpack will get linked and the blue one too (that would be great).

Well. You may notice that there is hardly any change to the content of the page. The backpack is now blue (because you have a really cool store) and the headline might change. The page looks really different than before, right?

Well, count up the words: you will easily end up with a similarity of 99%. Do you think that difference is enough? - DC.

The most common cases of internal DC

It takes a certain nose for internal duplicate content, precisely because most people don't consider it a problem and think "Google will recognize it and sort it out". Think again. How to deal with duplicate content and control or remove it is explained below.

Here, however, I would like to briefly help you recognize it - just see if you can find one or two problems on your own site. I'm deliberately sticking to WordPress topics here because they affect the blogosphere as a whole, but the problems are transferable.

  • Canonical URL: Is your domain reachable at https://www.domain.de as well as https://domain.de, and maybe also at https://domain.de/index.php? (A sample redirect fix is sketched below.)
  • IP as complete DC: For technical reasons it can also happen that your server's IP address ends up in the index. Heise, Golem and some other larger sites have this problem. You usually only notice it by chance. The real problem: the complete content of the domain is mirrored or duplicated under the IP. Not good.
  • Useless parameters: Just click on the search button on your WordPress blog without entering a search term. Well? Exactly. Domain-DC/home-page-DC (the worst thing that can happen) - and Google clicks on buttons, we know that.
  • Accessibility: Your blog uses tags, categories, date archives - everything available. Unfortunately, this is all potential DC. Decide on one type of list that you want to have indexed.
  • Article pages under different URLs: Especially a WordPress problem - a post can be reached under more than one URL in your WP blog. Of course you don't do that voluntarily, but maybe you unknowingly change the category of a post. Bam, you have two valid URLs. Try it out: you can swap the category name in the URL for any other category of your blog and the post still loads. One wrong link is enough and you have DC.
  • Sorting and selections: The examples from above. Sorted lists and parameterized article/detail pages are DC.

If you look over your pages, you will certainly find DC somewhere; hardly any page is perfect, if only because of technical restrictions (CMS, blog system, etc.). However, if you know how to recognize DC, you have taken the first really important step towards avoiding it.
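For the canonical URL point from the list above, here is a minimal sketch of what the fix could look like in an Apache .htaccess. It assumes mod_rewrite is enabled, and www.domain.de is a placeholder for your own host - not the only way to do this.

```apache
# Minimal sketch, assuming Apache with mod_rewrite enabled.
# www.domain.de is a placeholder - adjust to your own host.
RewriteEngine On

# 301 redirect the non-www host to the www host
RewriteCond %{HTTP_HOST} ^domain\.de$ [NC]
RewriteRule ^(.*)$ https://www.domain.de/$1 [R=301,L]

# Collapse /index.php onto the root so the start page has only one URL
RewriteRule ^index\.php$ https://www.domain.de/ [R=301,L]
```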

Tips and tricks for DC detection

Especially recognizing similar pages as duplicate content can be a problem. You generally have to think away your navigation, the footer and all that stuff, because Google can do that quite well too. Then compare the remaining text and image elements of the actual content of the two pages in question. If the similarity is above 80%, you can be fairly sure you have just produced DC. (There are also tools, such as Siteliner, that calculate this for you, but it's easiest to do it with common sense.)

Example: Siteliner Duplicate Content Check

Use Google! Admittedly, by the time this tip helps, the damage has usually already been done. Do a site: query for your site and narrow it down so that you get a manageable number of pages. Google does not index what it considers a duplicate (near duplicate) or very similar to other pages on your website. With the query "site:yourpage.com/" you can see all the pages that Google has crawled and indexed.

Just assume that anything missing there that is not too new potentially has a problem.

What is external duplicate content?

External duplicate content refers to duplicate content outside of your own domain. This can be the case both in your own network and on competitors' websites. In my experience, duplicate content on subdomains of your own project is also considered external duplicate content.

"My website looks completely different from the competitor's, I have a different header and footer and other page elements!"

Anyone who approaches the analysis of external DC in this way has lost, and it is precisely this person who should take the following advice very much to heart.

  • Google and other search engines recognize page-wide recurring structures such as navigation elements, footer areas or sidebar elements on websites.
  • That is, Google can very well distinguish between what is content and what is merely template.

So it is really only important to look at the content (in the best case, what is below the H1). Now, anyone who remembers the classification of the first part (internal duplicate content) will recognize the following list in parts.

1. Same content on different domains is external duplicate content

That is clear and easy for everyone to understand.

2. Similar content on different domains is potential duplicate content

This is also clear when we consider the results of the first part on internal DC. Sorting, lists, results, all of this can mean DC - both internally and externally.

3. Your domain is completely duplicated or mirrored

This is by far the nastiest of all issues. If the start page of your project is recognized as DC or appears duplicated elsewhere on the Internet, you should try to remove the duplicates really quickly. How to proceed to uncover such cases is described below.

How do search engines react to duplicate content?

There are two ways in which search engines such as Google and Bing react to duplicate content. Either the duplicated content is "merged" by the search engines, or a page that uses copied text may be excluded from the index. Making the same content available under different URLs will worsen the search results.

Search engine optimization professionals therefore strictly ensure that only Unique Content is used. There are tools that check finished texts for their uniqueness - more on this later. However, it is also sufficient, for example, to type the first passage of a post into Google and start the search. If a similar text is found on the World Wide Web, the result will also show this.

The copywriters who are commissioned to create content should pay attention to this and check whether unique content can be guaranteed. A declaration on the invoice that all texts were written from the copywriter's own ideas and not copied is often required by search engine optimization agencies, but such a declaration alone is not enough. It would be reckless to deliver content that has not been cross-checked.

Who owns the original content?

When search engines find duplicate content, they have to decide who owns the original content. They do this according to algorithmic templates, which are basically easy to understand, but the outcome is not always correct or clear.

  • Fingerprint/Timestamp: Found content is usually given a kind of versioning by the search engines. This can also be called a fingerprint. It is noted (as everything is always noted everywhere) when the content was found and where. Once this fingerprint has been determined, the page that was first identified with the content is usually assigned the content as the original.
  • Google News: I am not an expert in this case, so I can only write superficially. With news, however, you can clearly see that the principle applies: first come, first served. Most newspapers come with DPA reports, which are usually DC, just like press releases (watch out for article directories, press directories, etc.).
  • Theft/content theft: Content theft is quite problematic. This can affect text or images. Time and again there are reports of content being stolen and published 1:1 on other sites. This is very annoying, but unfortunately not always avoidable. If the other site is outside the German jurisdiction, there is not much you can do about it legally. However, I will explain a few small technical countermeasures in the section on preventing duplicate content below.
  • Trust and domain strength: For both fingerprints and stolen content, the trust or strength of a domain plays a considerable role as a factor in determining the originality of content. A small example: your domain with the original content is 3 years old and covers various topics. Now someone comes along (maybe even you yourself) and republishes this content on a domain that is perhaps 6 years old, has more links and deals only with this topic. The content will most likely be attributed to that domain and your own traffic crashes.

How do I recognize/find external duplicate content?

1. Same content on different domains

Such cases can be uncovered very well with a quotation-mark query on Google. Simply take part of the content of your page and paste it into the Google search box "in quotation marks". You may find similar entries that could possibly hold an earlier fingerprint. Comparing the date and cache helps with the evaluation. There are also tools that make it possible to track down stolen images or text.

No duplicate content - no problems.

2. Similar content on different domains

The typical newspaper/DPA problem. I really can't say much about this because I have some gaps, especially in the news area. In general, however, as with internal DC, the higher the comparability of the text elements, the more likely it is to be DC.

3. Duplication of the domain or home page

This is the worst-case scenario, and you should generally take care that no duplicates of your home page end up in the index. Sometimes domains are mirrored, sometimes it happens because your IP address is in the index (this is also external DC) and thus mirrors your entire domain. The more trust a project has and the older it is, the more likely it is to get away with such cases. Telekom has a lot of domains with the same content, Heise and similar projects have their IPs in the index and it does no harm (yet). For smaller projects, however, it can be quite damaging.

You can discover such a duplication by searching for the title of your homepage on Google using an intitle: query and taking a closer look at similar pages. I have already written about this problem in more detail, including what you should do if something like this happens.

Prevent internal duplicate content

We covered identical pages caused by URL confusion, as well as similar pages and sorted lists. All of this was declared DC according to the definition we put in Google's mouth. This now needs to be prevented, and we have several means at our disposal.

Noindex

In my opinion, noindex is the egg of Columbus of SEO - a strikingly simple solution. Everything you find dubious, everything you don't like, everything you really think is DC - put a nice "noindex" attribute on it. The great thing is that any incoming links are at least partially redistributed. So always keep internal linking in mind. There are very nice tools for WordPress, such as robots-meta from Yoast or All In One SEO.
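As a minimal sketch, this is what the tag looks like in the <head> of a page you want to keep out of the index; "follow" keeps the links crawlable so the link power mentioned above can still be passed on:

```html
<!-- In the <head> of the page that should stay out of the index.
     "noindex" blocks indexing, "follow" still lets the bot follow the internal links. -->
<meta name="robots" content="noindex,follow">
```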

Robots.txt

You should be careful with your robots.txt. Always remember: everything you exclude via robots.txt will no longer be fetched by the bot. You can very quickly cut off link juice as well as crawlability. Google can of course still discover URLs that are blocked by robots.txt, because the block only stops the bot from fetching the page, not from learning that the URL exists.

In principle, pages that could potentially receive links have no place in robots.txt. You can re-open individual pages, e.g. via a sitemap (Xing does this quite well, you can take a look there) or wildcards. For checking, however, I definitely recommend looking into the Google Webmaster Tools to verify correctness, because wildcards and "Allow" were not actually part of the original robots.txt specification, but they are often needed and are very helpful.
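A minimal sketch of such a robots.txt; the paths are made up for illustration, and since wildcards and Allow are extensions to the original standard, verify the result in the Google Webmaster Tools before relying on it:

```txt
# Hypothetical example: keep sorted list URLs and the internal search out of
# the crawl, but re-open one specific page via Allow.
User-agent: *
Disallow: /*?sort=
Disallow: /suche/
Allow: /suche/beliebte-rucksaecke/
```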

Nofollow

Rel-nofollow is absolutely not helpful for reducing duplicate content. The only thing that happens is that your links pass no anchor text and no juice. You cannot control indexing with nofollow. Put that out of your mind. Meta-nofollow is something else: by definition, the links on such a page really should not be followed at all. I don't have a test at hand right now, but I doubt that is entirely the case.
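For clarity, a minimal sketch of the two variants distinguished here (the URL is a placeholder):

```html
<!-- rel-nofollow: set per link; it does not keep the target page out of the index -->
<a href="https://www.beispiel.de/seite" rel="nofollow">Link</a>

<!-- meta-nofollow: set per page; it tells the bot not to follow any link on this page -->
<meta name="robots" content="nofollow">
```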

Canonical tag

I don't want to give a recommendation here as to whether it's any good; it's still too early for that. Basically, I think the tag is quite suitable for various things, mainly for redirecting link power away from noindex pages that you cannot redirect with a 301. However, I think we should be careful not to create a second nofollow or PageRank-sculpting hype; the tag is too complex for that to be really useful. But I hardly need to say that there are different opinions. Testing is the order of the day. But above all:

Set your page structure wisely and profoundly, then you don't need Sculpting or Canonical
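If you do use the tag, a minimal sketch: on the parameterized URL it points back to the parameter-free original as the preferred version (the URLs are the placeholder examples from above):

```html
<!-- Placed in the <head> of https://www.beispiel.de/?IchBinEinParameter
     to declare the parameter-free URL as the preferred version. -->
<link rel="canonical" href="https://www.beispiel.de/">
```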

Avoid (tracking via) URL parameters

The following often happens on the web: the marketing manager wants to advertise, starts an affiliate program and is also happy about the links coming in. The only pity is that, for tracking purposes, they all generate internal DC because they arrive with parameters (e.g. domain.de?afflink=123). There are various solutions here: track via cookies, set the parameterized URLs to noindex, or simply do not track at all.

SEOmoz showed another great idea: use a hash (#) instead of a ?-parameter (i.e. domain.de#afflink=123). If you are interested in this: there is a Whiteboard Friday video and a nice store example of how you can build pagination with #. The highlight: everything after the # is not evaluated by Google, so all the nice link power flows to the original page.
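As a rough sketch of the difference (placeholder URLs; in the hash variant the tracking value has to be read out client-side, e.g. via JavaScript):

```html
<!-- ?-parameter: creates a second crawlable URL for the same content -->
<a href="https://www.domain.de/?afflink=123">Offer</a>

<!-- #-variant: Google ignores everything after the hash, so only one URL is indexed -->
<a href="https://www.domain.de/#afflink=123">Offer</a>
```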

But there are also problems, which you can read about in the SEOmoz post. Attention: if you want to make detail pages accessible via the pagination, make sure that there are not too many links on one page (you have to find the number yourself; this blog should be able to handle about 100). The example store had no detail pages and therefore works very well with this approach.

Well, means are one thing, a few examples are of course much cooler. So here we go.

Category system in WordPress

Well, I briefly touched on this earlier. You often see the same setup on SEO blogs: navigation and access to the detail pages via tags, author archives, category archives and date archives. From the fingerprint principle we know that a teaser is almost as good as a detail page. That makes it all the more incomprehensible to me why people have the whole jumble of categories, tags and the rest indexed.

Siteliner: category pages often generate DC (see Match Percentage)

With noindex this would be so easy to avoid and after all, the detail pages are what should be important, not just any category. What I find really difficult in this context are systems or plugins that automatically link buzzwords to tag pages. Terrible.

Just to be clear: There are certainly good reasons to have this or that indexed, and in some cases categories can certainly work better than details. What needs to be checked very carefully, however, is the use of several of these archives at the same time. The same also applies to indexable (full) feeds.

You should check the indexing of tag pages especially carefully. If you have an average of 4-6 tags per post, you would potentially be tipping 5 additional pages per post into the index. That is too much.
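A minimal sketch of how such archives could be set to noindex in a theme's functions.php, assuming a classic WordPress setup (SEO plugins like Yoast offer the same via a checkbox, so this is just an illustration):

```php
<?php
// Hypothetical snippet for functions.php: keep tag, date and author archives
// out of the index while still letting the bot follow the links to the
// detail pages.
add_action( 'wp_head', function () {
    if ( is_tag() || is_date() || is_author() ) {
        echo '<meta name="robots" content="noindex,follow">' . "\n";
    }
} );
```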

Duplicate URLs, parameters

Do not have your search pages indexed, and redirect meinblog.de?s= (the empty search) to your home page, because Google is sure to press that button at some point. Above all, make sure that you set up your posts correctly. A big problem with WordPress is that every post can be reached under every category path. This makes you potentially vulnerable from the outside.

If you want to move a post or change your category system in general, you should in any case use a plugin like Redirection. But you should always think it through beforehand, because a 301 never gives you back the full link power that the original URL had.
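A minimal sketch of what such redirects could look like in an Apache .htaccess (made-up paths; the Redirection plugin achieves the same without editing files by hand):

```apache
RewriteEngine On

# Send the empty WordPress search (?s=) back to the home page with a 301;
# the trailing "?" drops the query string from the target URL.
RewriteCond %{QUERY_STRING} ^s=$
RewriteRule ^$ https://www.meinblog.de/? [R=301,L]

# A post that moved to a new category path (hypothetical example URLs)
Redirect 301 /alte-kategorie/beispiel-artikel/ https://www.meinblog.de/neue-kategorie/beispiel-artikel/
```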

Prevent external duplicate content

Avoiding external DC is much more difficult than influencing internal DC, because the page that carries the external duplicate content is usually beyond your control. Nevertheless, a few basic and simple rules can in many cases help you to effectively avoid duplicate content before it is even created.

Organize cooperations

Cooperation partners who republish your content for whatever reason (content exchange, syndication, sale, etc.) should at least set that content to noindex. If excluding it from crawling via robots.txt is feasible, this should also be done. External cooperations can be dangerous as far as DC is concerned. You should enter the negotiations with the noindex flag already waving.

Unique Content

Example: DPA reports. How often do you read the same thing online? And each of these publishers certainly wants to be found online. Unfortunately, only one thing usually helps: if you can't be the first to publish, you'll have to resort to real editorial means and change the text.

Fingerprints

What can happen internally with category lists can of course also happen externally. It is usually only worse there, as Google chooses the owner of the original fingerprint itself. The only thing that helps here is Unique Content. If you have partially duplicated content, change it if your time allows and you expect a positive ROI. If neither is the case, close down your site and think about a new task for your web server.

Conclusion on duplicate content

As you can see, noindex really is one of my favorites. The meta tag is a powerful tool if you use it correctly. There are certainly other ways to avoid DC; maybe I've forgotten a really important one. You are therefore welcome to comment diligently (provided you have managed to persevere this far).

FAQ

What is duplicate content?
Duplicate content is a term used to describe content that is published more than once on a website. It can also refer to the same content appearing on more than one website, even though the URLs are different.
What is the risk of duplicate content in SEO?
Duplicate content can cause search engines to rank the content on your website as irrelevant. In addition, duplicate content can reduce visibility for your website and hurt your rankings on search engines.
How do I avoid duplicate content?
Some of the ways to avoid duplicate content are: create a unique page title and meta tags for each page; use 301 redirects to consolidate multiple URLs into one; create a sitemap and make sure all URLs are correctly identified by the search engine robot; and use a canonical tag to tell search engines which URL should be considered the primary one.
What is a Canonical Tag?
A canonical tag is an HTML element that tells search engines which URL should be considered the primary one for the content. The canonical tag is a simple way to prevent duplicate content problems.
What is a 301 redirect?
A 301 redirect is a permanent redirect status code that tells browsers and search engines to use a specific URL instead of another. This method is great for avoiding duplicate content because all URLs are redirected to a single URL and the content can be found in only one place.
What is a sitemap?
A sitemap is a file that lists all the URLs on your website and helps search engine robots index your pages. It is an important element of search engine optimization and is especially helpful when it comes to avoiding duplicate content problems.
Why is it important to avoid duplicate content?
It's important to avoid duplicate content because it can cause a number of potential problems. These include a poorer user experience, lower rankings on search engines, reduced page visibility and a lower number of potential visitors.
Can duplicate content cause my website to be penalized?
In most cases, no. Search engines usually do not treat duplicate content as a negative criterion. However, some search engines may consider duplicate content as spam and take appropriate actions that may affect your website.
What tools can I use to identify duplicate content?
There are a number of tools you can use to identify duplicate content on your website. For example, you can use Copyscape to check the content on your website, or use a tool like Screaming Frog to find duplicate content on your website.
What is the difference between duplicate content and plagiarism?
The difference between duplicate content and plagiarism is that duplicate content is the same content published in more than one place, while plagiarism is content copied without the author's permission.
