DUPLICATE AND NEAR-DUPLICATE CONTENT

Duplicate content is a problem for search engines because it burdens their
servers in terms of both CPU processing and storage. Recent statements from
the search engines indicate that nearly 40% of the content in their indexes
is duplicate content. The engines are working hard to minimize duplicate
content, but with the increase in scraper sites, multilingual corporate
sites and third-party partner sites all using the same content, they seem to
be fighting a losing battle.

What is duplicate content? It is simply defined as segments of text that are
exactly the same on two or more pages within a specific site or collection
of sites. A page is considered to be a duplicate once the matching text
reaches a threshold of around 12% of the total content of another page.
Depending on the size of your page, that is not much.

Duplicate content detection has become relatively easy thanks to a variety
of algorithms developed by Andrei Broder of IBM as well as other information
retrieval researchers. Google's most recent patent approval yields insight
into how they are detecting duplicate content. While the patent is highly
technical, the current approach shows them doing the analysis based on the
query elements and not necessarily the entire document.
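
One widely described technique from Broder's work is "shingling": each
document is broken into overlapping runs of words, and two documents are
compared by how many runs they share. The sketch below is only an
illustration of that general idea, not a description of what any engine
actually runs. The function names and the run length of 4 words are
assumptions made for this example.

```python
def shingles(text, w=4):
    """Return the set of overlapping w-word runs ("shingles") in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        # Text shorter than one shingle: treat the whole text as one run.
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard resemblance: shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 0.0  # nothing to compare
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    page_a = "the quick brown fox jumps over the lazy dog"
    page_b = "the quick brown fox jumps over the sleepy cat"
    print(f"resemblance: {resemblance(page_a, page_b):.2f}")
```

A score of 1.0 means the two texts produce identical shingle sets, and 0.0
means they share none. Larger values of w make the comparison stricter,
because longer runs of words have to match exactly.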

Since all of the query-related data is maintained in tables that link to the
pages of relevant content, it is relatively easy for them to detect and
demote duplicates. These key content elements are very important, since
these "query elements" tie directly to relevance attributes, putting
emphasis on titles, headings and other key content on the pages. Yahoo and
MSN have implemented a similar approach to duplicate detection. However,
rather than running it in real time, they check for duplicates as part of an
ongoing background process.

If you feel that your site has been demoted because of this, you may be
optimizing key content elements in the same way across multiple pages. Key
content elements are factored into the scoring, so when they are identical
the pages appear to be duplicates and are demoted. This probably occurs
mostly with newer affiliates and their sites. As noted above, the age of the
site is taken into consideration as well as its relevancy. If this content
is attributed to another site, any additional reference to it would be
demoted.
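
To check whether your own pages repeat the same key content elements, you
can group pages by their title and top-level heading text. The following is
a minimal sketch using Python's standard html.parser module. The class and
function names, the choice of the title and h1 elements, and the example
URLs are all assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class KeyElementParser(HTMLParser):
    """Collects the normalized text of <title> and <h1> elements."""

    TRACKED = {"title", "h1"}

    def __init__(self):
        super().__init__()
        self._current = None   # tag currently being collected, if any
        self._buffer = []      # text fragments seen inside that tag
        self.elements = []     # list of (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse whitespace and lowercase so trivial differences match.
            text = " ".join("".join(self._buffer).split()).lower()
            self.elements.append((tag, text))
            self._current = None


def find_shared_elements(pages):
    """Map each (tag, text) pair to the URLs using it, keeping only repeats."""
    seen = defaultdict(list)
    for url, html in pages.items():
        parser = KeyElementParser()
        parser.feed(html)
        parser.close()
        for element in dict.fromkeys(parser.elements):  # de-duplicate per page
            seen[element].append(url)
    return {element: urls for element, urls in seen.items() if len(urls) > 1}


if __name__ == "__main__":
    pages = {  # hypothetical pages
        "example.com/uk": "<title>Business Intelligence</title><h1>Welcome</h1>",
        "example.com/ie": "<title>Business  intelligence</title><h1>About</h1>",
    }
    for (tag, text), urls in find_shared_elements(pages).items():
        print(f"<{tag}> '{text}' is shared by: {', '.join(urls)}")
```

Any element reported here appears on more than one page and is a candidate
for rewriting so that each page carries its own title and headings.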

How close does content have to be to count as duplicate? This is widely
debated on the Internet, so we spoke with the engines as well as IBM search
engineers to get a number that you can factor into your content development
process. The official number we've been given is around 12%. Some optimizers
on various search forums argue that it is as little as 5% or as much as 20%.
We consider 12% a reasonable middle ground.
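
To factor such a number into content development, you can estimate what
share of a page's words sit in runs that also appear on another page. The
sketch below uses Python's standard difflib module for this. It is only an
approximation under stated assumptions: the 12% constant is the figure
quoted above, while the minimum run length of 4 words and the function names
are choices made for this example, and the engines' actual calculation is
not public.

```python
from difflib import SequenceMatcher

# The roughly 12% figure quoted in the text; not an official constant.
DUPLICATE_THRESHOLD = 0.12


def shared_fraction(page, other, min_run=4):
    """Fraction of page's words that lie in runs also found, in order, in other."""
    a = page.lower().split()
    b = other.lower().split()
    if not a:
        return 0.0
    # autojunk=False stops difflib from ignoring very frequent words.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    shared = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_run  # skip short, incidental matches
    )
    return shared / len(a)


def looks_duplicate(page, other, threshold=DUPLICATE_THRESHOLD):
    """True when the shared fraction reaches the chosen threshold."""
    return shared_fraction(page, other) >= threshold


if __name__ == "__main__":
    page = "our software helps companies track and understand business performance every day"
    other = "read how our software helps companies track and understand results"
    print(f"shared: {shared_fraction(page, other):.0%}")
    print("duplicate" if looks_duplicate(page, other) else "distinct")
```

Because the threshold is a parameter, the same check can be rerun with the
5% and 20% figures argued for on the forums to see how sensitive a page is
to the choice.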

For example, if we run the content of the site www.businessobjects.co.uk
through CopyScape (www.copyscape.com), a common plagiarism-detection tool,
we see that it detects other Business Objects pages that match this content.

URL                              | Words | Duplicate | %
www.uk.businessobjects.com       | 301   | 238       | 79%
www.ireland.businessobjects.com  | 299   | 205       | 69%
asiapac.businessobjects.com      | 315   | 100       | 33%
uk.education.businessobjects.com | 185   | 54        | 30%
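
The figures in these results can be checked against the 12% threshold with a
few lines of code. The sketch below divides the duplicate word count by the
total word count for each page. The function name and data layout are
assumptions for this example. Plain division lands within a point or two of
the reported percentages, so the tool may be counting slightly differently.

```python
# The roughly 12% figure quoted in the text; not an official constant.
THRESHOLD = 0.12

# (URL, total words, duplicate words) as reported in the results above.
RESULTS = [
    ("www.uk.businessobjects.com", 301, 238),
    ("www.ireland.businessobjects.com", 299, 205),
    ("asiapac.businessobjects.com", 315, 100),
    ("uk.education.businessobjects.com", 185, 54),
]


def flag_rows(rows, threshold=THRESHOLD):
    """Return (url, ratio, over_threshold) for each row."""
    flagged = []
    for url, words, duplicate in rows:
        ratio = duplicate / words if words else 0.0
        flagged.append((url, ratio, ratio >= threshold))
    return flagged


if __name__ == "__main__":
    for url, ratio, over in flag_rows(RESULTS):
        status = "over threshold" if over else "under threshold"
        print(f"{url}: {ratio:.1%} ({status})")
```

Every page in this example is well above 12%, including the two regional
sites that share only about a third of their text.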

Try this with a few websites and look at how much duplicate content is out
there for just a few key phrases. When designing and setting up new pages,
it's important to be mindful of what is considered duplicate content. With
duplicate and near-duplicate pages losing rankings, you may do well to
analyze your pages and/or reassess your web content strategy.