Google was recently granted a patent titled ‘Duplicate document detection in a web crawler system’. The patent explains how the search engine’s content filter can work with a duplicate content server. Filtering duplicates is not a new practice for Google; the patent is better read as an advancement of existing technology.
So what is duplicate content?
The patent contains the following definition of ‘duplicate content’:
“Duplicate documents are documents that have substantially identical content, and in some embodiments wholly identical content, but different document addresses.”
The patent goes on to describe three such scenarios in which duplicate documents are identified by a web crawler:
- Two pages, comprising any combination of regular web page(s) and temporary redirect page(s), are duplicate documents if they share the same page content, but have different URLs.
- Two temporary redirect pages are duplicate documents if they share the same target URL, but have different source URLs.
- A regular web page and a temporary redirect page are duplicate documents if the URL of the regular web page is the target URL of the temporary redirect page or the content of the regular web page is the same as that of the temporary redirect page.
A permanent redirect page is not directly involved in duplicate document detection because the crawlers are configured not to download the content of the redirecting page.
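The three scenarios can be expressed as a small check. The following is only a sketch of the rules as quoted above, not Google's implementation: the `Page` class, its fields, and the `are_duplicates` function are hypothetical names chosen for illustration, and "same content" is simplified to an exact string match.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    """Hypothetical model of a crawled page (not taken from the patent)."""
    url: str
    content: Optional[str] = None          # page body, if one was downloaded
    redirect_target: Optional[str] = None  # set only for temporary redirects

    @property
    def is_temp_redirect(self) -> bool:
        return self.redirect_target is not None


def are_duplicates(a: Page, b: Page) -> bool:
    """Apply the three scenarios described in the patent summary."""
    if a.url == b.url:
        return False  # same address means same document, not a duplicate

    # Scenario 1 (and the content half of scenario 3):
    # same page content, different URLs.
    if a.content is not None and a.content == b.content:
        return True

    # Scenario 2: two temporary redirects that share a target URL.
    if a.is_temp_redirect and b.is_temp_redirect:
        return a.redirect_target == b.redirect_target

    # Scenario 3: a regular page is the target of a temporary redirect.
    if a.is_temp_redirect and not b.is_temp_redirect:
        return a.redirect_target == b.url
    if b.is_temp_redirect and not a.is_temp_redirect:
        return b.redirect_target == a.url

    return False
```

For example, a regular page at `http://example.com/` and a temporary redirect whose target is `http://example.com/` would be reported as duplicates, as would two redirects at different addresses that both point to that page.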
What does this mean in layman’s terms?
- Two (or more) identical pages have been found at different web addresses, i.e. the content has been duplicated. How severely Google acts against a site, up to delisting it, will likely depend on how much duplication there is.
- The same applies to temporary redirect pages: two redirects that point to the same target count as duplicates of each other.
- This closes a loophole in which a temporary redirect page and the main source page are identical or point to one another. Some web browsers may find the temporary page and some the main source page, which effectively gives the same page two web addresses (URLs).
How does Google detect duplicate content?
According to the patent description, Google’s web crawler consults the duplicate content server to check if a found page is a copy of another document. The algorithm then determines which version is the most important version.
Google can use different methods to detect duplicate content. For example, Google might take “content fingerprints” of previously indexed pages and compare them when a new web page is found.
Interestingly, it is not always the page with the highest PageRank that is chosen as the most important URL for the content.
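The idea of comparing “content fingerprints” can be illustrated with a short sketch. The patent summary above does not say how the fingerprints are computed, so everything here is an assumption: the example hashes whitespace-normalized, lowercased text with SHA-256, and the names `fingerprint` and `FingerprintIndex` are hypothetical.

```python
import hashlib
import re
from typing import Dict, Optional


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it.

    Assumption: SHA-256 over normalized text stands in for whatever
    fingerprinting method the search engine actually uses.
    """
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class FingerprintIndex:
    """Remembers the first URL seen for each content fingerprint."""

    def __init__(self) -> None:
        self._first_url_by_fp: Dict[str, str] = {}

    def check(self, url: str, text: str) -> Optional[str]:
        """Return the URL of an earlier page with the same fingerprint,
        or None if this content has not been seen at another address."""
        fp = fingerprint(text)
        existing = self._first_url_by_fp.get(fp)
        if existing is None:
            self._first_url_by_fp[fp] = url
            return None
        return existing if existing != url else None
```

A crawler built this way would call `check` for each newly found page and treat a non-empty result as a duplicate candidate. An exact hash only catches pages that are identical after normalization; detecting “substantially identical” content would need a near-duplicate technique, which this sketch does not attempt.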
How does this affect your website?
If you want to get high rankings, it is easier to do so with unique content. Try to use as much original content as possible on your web pages.
If your website must use the same content as another website, make sure that your website has better inbound links than the other websites that carry the same content. It’s likely that your website will be chosen as the most important URL for the content then.
If your web site has unique content, you don’t have to worry about potential duplicate content penalties. Optimize that content for search engines and make sure that your web site has good inbound links. It’s hard to outrank a website with good optimized content and many good inbound links.
How does this affect your blogs?
Whilst the duplicate content filter has been around for some time, the recent surge in social networking and blogging leaves many webmasters uncertain about the effects of the filter (and the new patent) if they have multiple blogging profiles.
In short, the main aim of Google’s duplicate content filter was always to root out spammers who would run tens or even hundreds of duplicates of the same site so that those duplicates could act as anchor sites, directing links and therefore traffic to the intended target site of that business.
For a small business that has a small to medium web presence and several blogging profiles (with similar or identical blog posts), the implications are minor to say the least.
Copyright Protection?
However, one overlooked application is that of copyright protection. A useful side effect of the ‘duplicate content’ patent is that it will likely have a wider impact on smaller businesses who do infringe copyright.
Whilst there are tools available (both free and premium) to find duplicated content, their scope is still limited. Through the new patent, there could be a tightening up of copyright protection procedures. As indicated, an algorithm will determine what it thinks is the most important version (i.e. the true version) and, in doing so, potentially mark the others as imitations or, in other words, copyright infringements.
The web is increasingly expected to take more responsibility for the content available via the likes of Google, and this new patent could go a long way toward identifying copyright infringers.
The best defense in this case is to create your own unique copy. Google will love you all the better for it, and when your visitors arrive, they will appreciate not having the same old boring content as everyone else in the search results.