©2007 411 Web Design Group of Las Vegas
Las Vegas, Nevada
Toll Free (877) 241-7613

DUPLICATE AND NEAR-DUPLICATE CONTENT

Duplicate content is a problem for search engines because it burdens their servers in terms of both CPU processing and storage. Recent statements from the search engines indicate that nearly 40% of the content in their indexes is duplicate content. The engines are working hard to minimize duplicate content, but with the increase in scraper sites, multilingual corporate sites and third-party partner sites all using the same content, they seem to be fighting a losing battle.

What is duplicate content? It is simply defined as segments of text that are exactly the same on more than one page within a specific site or collection of sites. A page is considered a duplicate once the matching text reaches a threshold of around 12% of the total content of another page. Depending on the size of your page, that is not much.
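As a rough illustration of the definition above, the sketch below measures what share of one page's words sit in sentences that also appear verbatim on another page, and compares that share with the 12% figure. The function names, the sentence-splitting rule and the choice to measure against the first page's word count are all assumptions made for illustration; the engines' actual calculation is not described here.

```python
import re

DUPLICATE_THRESHOLD = 0.12  # the "around 12%" figure quoted in this article


def split_segments(text):
    """Split text into normalized, sentence-like segments (an assumed rule)."""
    parts = re.split(r"[.!?]+", text.lower())
    return [" ".join(p.split()) for p in parts if p.strip()]


def duplicate_share(page, other):
    """Fraction of words in `page` that sit in segments also found in `other`."""
    other_segments = set(split_segments(other))
    total = 0
    duplicated = 0
    for segment in split_segments(page):
        words = len(segment.split())
        total += words
        if segment in other_segments:
            duplicated += words
    return duplicated / total if total else 0.0


def is_duplicate(page, other, threshold=DUPLICATE_THRESHOLD):
    """True when the shared share of `page` meets or exceeds the threshold."""
    return duplicate_share(page, other) >= threshold
```

On a short page a single shared sentence is enough to cross the threshold, which is why 12% "is not much".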

Duplicate content detection has become relatively easy thanks to a variety of algorithms developed by Andrei Broder of IBM and other information retrieval researchers. Google's most recent patent approval offers insight into how they are detecting duplicate content. While the patent is highly technical, the current approach shows them doing the analysis based on the query elements and not necessarily the entire document.
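One technique from Andrei Broder's published work on near-duplicate detection is commonly called w-shingling: each document is reduced to the set of its overlapping word windows ("shingles"), and two documents are compared by how much those sets overlap. The sketch below is a minimal version of that idea, not a description of what any search engine runs in production; the window size of 4 and the function names are assumptions.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard overlap of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Unlike an exact sentence match, this score degrades gradually as words are changed, which is what makes it useful for near-duplicates.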

Since all of the query-related data is maintained in tables that link to the pages of relevant content, it is relatively easy for them to detect and demote duplicates. These key content elements are very important, since these "query elements" tie directly to relevance attributes, putting emphasis on titles, headings and other key content on the pages. Yahoo and MSN have implemented a similar approach to duplicate detection. However, rather than a real-time execution, they check duplicates as part of an ongoing background process.

If you feel that your site has been demoted because of this, you may be optimizing key content elements in the same way across multiple pages. Key content elements are factored into the scoring, so pages with identical elements appear to be duplicates and are demoted. This trend probably occurs mostly with newer affiliates and their sites. The age of the site is taken into consideration as well as its relevancy. If the content is attributed to another site, any additional copy of it would be demoted.
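To check your own pages for this, one simple approach is to group them by their key content elements and look for groups with more than one page. The sketch below does this for titles and headings only; the page dictionaries, their field names and the normalization rule are hypothetical, and the engines' real scoring is certainly more involved.

```python
from collections import defaultdict


def key_fingerprint(page):
    """Normalize a page's title and headings into a comparable tuple."""
    title = " ".join(page["title"].lower().split())
    headings = tuple(" ".join(h.lower().split()) for h in page["headings"])
    return (title, headings)


def find_shared_key_elements(pages):
    """Return groups of URLs whose title and headings match after normalizing."""
    groups = defaultdict(list)
    for page in pages:
        groups[key_fingerprint(page)].append(page["url"])
    return [urls for urls in groups.values() if len(urls) > 1]
```

Any group this returns is a set of pages that present the same titles and headings, and is a candidate for rewriting.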

How close does content have to be to count as duplicate? This is much debated on the Internet, but we spoke with the engines as well as IBM search engineers to get a number that you can factor into your content development process. The official number we've been given lies at around 12%. Some optimizers on various search forums argue that it is as little as 5% or as great as 20%. We figure that 12% is a reasonable happy medium.

For example, if we run the content of the site www.businessobjects.co.uk through CopyScape (www.copyscape.com), a common plagiarism-detection tool, we see that it detects other Business Objects pages that match this content.

URL                                 Words   Duplicate   %
www.uk.businessobjects.com          301     238         79%
www.ireland.businessobjects.com     299     205         69%
asiapac.businessobjects.com         315     100         33%
uk.education.businessobjects.com    185     54          30%
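The figures above can be checked against the 12% threshold with a few lines of arithmetic. The sketch below assumes the percentage is simply duplicate words divided by total words; for the last two rows that ratio comes out a point or so below the listed value, so the tool may calculate its percentage slightly differently.

```python
# (site, total words, duplicate words, percentage as listed above)
ROWS = [
    ("www.uk.businessobjects.com", 301, 238, 79),
    ("www.ireland.businessobjects.com", 299, 205, 69),
    ("asiapac.businessobjects.com", 315, 100, 33),
    ("uk.education.businessobjects.com", 185, 54, 30),
]

THRESHOLD = 12.0  # percent, the figure quoted earlier in this article


def duplicate_percent(words, duplicate):
    """Duplicate words as a percentage of total words (assumed formula)."""
    return 100.0 * duplicate / words


def over_threshold(rows, threshold=THRESHOLD):
    """Sites whose recomputed duplicate percentage meets the threshold."""
    return [site for site, words, dup, _ in rows
            if duplicate_percent(words, dup) >= threshold]


if __name__ == "__main__":
    for site, words, dup, listed in ROWS:
        pct = duplicate_percent(words, dup)
        print(f"{site}: {pct:.1f}% (listed {listed}%)")
    print("Over threshold:", over_threshold(ROWS))
```

All four pages are well past 12%, and would still be past the 20% figure some optimizers argue for.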

Try this with a few websites and look at how much duplicate content is out there for just a few key phrases. When designing and setting up new pages, it's important to be mindful of what is considered duplicate content. With duplicate and near-duplicate pages losing rankings, you may do well to analyze your pages and/or reassess your web content strategy.
