Express newspaper creates an infinite number of URLs using rel = canonical
The Express newspaper has cocked up its implementation of the rel=canonical command SO BADLY that it has created an infinite number of duplicate webpages ... many of which now have links from elsewhere on the internet.
Using rel = canonical properly
You use the rel=canonical command to tell Google that a given URL is actually a version of another URL - and that the search engine should treat the first URL as if it were that second, main URL.
It's useful if you have multiple copies of a page in different directories, have lots of versions of the same page due to, e.g., WordPress making 2 versions of every page, or allow anyone to rewrite your URLs so it looks like you're insulting Pippa Middleton's sister.
Make a mistake with rel=canonical, however, and it can wipe your website off the face of the internet.
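If you want to see what a page is declaring as its canonical URL, here's a minimal sketch using only Python's standard library. The example.com URL and the find_canonical name are made up for illustration; this is not how Google or the Express reads the tag, just a quick way to pull the href out of some HTML.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Only interested in <link> tags, and only the first canonical one.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel can hold several space-separated values; values may be None.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels:
            self.canonical = attr.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Placeholder page: a duplicate that points at its main version.
duplicate_page = (
    '<html><head>'
    '<link rel="canonical" href="http://example.com/story/123">'
    '</head><body>...</body></html>'
)
print(find_canonical(duplicate_page))
```

On a healthy site, running this against the canonical URL itself should give you back the same URL.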
Using rel = canonical to make infinite URLs
The Express site's CMS is creating a duplicate version of every single page via the rel=canonical tag. And then a 3rd version, and then a 4th ... and it's never stopping until it gets to infinity.
Take a sample page like this one: http://www.express.co.uk/features/view/244786/AV-referendum-Why-we-must-vote-NO-to-the-new-voting-system
If you look at the HTML code, you can find:
<link rel="canonical" href="http://www.express.co.uk/features/view/244786/AV-referendum-Why-we-must-vote-NO-to-the-new-voting-systemAV-referendum-Why-we-must-vote-NO">
The CMS has miscoded the canonical URL to include the first bit of the URL relating to the individual page (the AV-referendum-Why-we-must-vote-NO bit) twice.
If you visit that supposedly canonical URL, you see this, with the page-specific bit in there three times.
<link rel="canonical" href="http://www.express.co.uk/features/view/244786/AV-referendum-Why-we-must-vote-NO-to-the-new-voting-systemAV-referendum-Why-we-must-vote-NOAV-referendum-Why-we-must-vote-NO">
Go to that URL, and you find it there 4 times. Etc.
But this will never stop. Each time you visit the canonical URL, a new canonical URL is created.
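To make the loop concrete, here's a rough simulation. It assumes the bug is that the CMS builds the canonical by sticking a stored short slug onto the end of whatever URL was requested. That's my guess from the URLs above, not something I've seen in the Express's code, and the function names are invented.

```python
def buggy_canonical(requested_url, short_slug):
    """Assumed bug: the canonical is the requested URL plus the short slug."""
    return requested_url + short_slug


def follow_canonicals(start_url, short_slug, hops):
    """Visit each 'canonical' in turn and collect every URL seen on the way."""
    urls = [start_url]
    for _ in range(hops):
        urls.append(buggy_canonical(urls[-1], short_slug))
    return urls


start = (
    "http://www.express.co.uk/features/view/244786/"
    "AV-referendum-Why-we-must-vote-NO-to-the-new-voting-system"
)
slug = "AV-referendum-Why-we-must-vote-NO"

chain = follow_canonicals(start, slug, 3)
for url in chain:
    # How many times the page-specific bit appears, then the URL itself.
    print(url.count(slug), url)
```

Under that assumption, every hop gives a URL that has never been seen before, so a crawler following the canonicals never reaches a page that points to itself.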
All these URLs are working pages because the Express only looks at the number in the URL to decide what content to show. So http://www.express.co.uk/features/view/244786/AV-referendum-Why-we-must-vote-NO-to-the-new-voting-system is the same as http://www.express.co.uk/features/view/244786/vote-YES is the same as http://www.express.co.uk/features/view/244786/who-exactly-specced-this-CMS.
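Here's a sketch of one way a CMS could avoid this: work out the canonical from the article number alone, using one stored slug per article, and ignore whatever slug was in the requested URL. The PREFERRED_SLUGS lookup and the function name are hypothetical, and the /features/ and /posts/ path pattern is just taken from the URLs in this post.

```python
import re

# Path pattern inferred from the URLs quoted in this post.
ARTICLE_PATH = re.compile(r"^/(features|posts)/view/(\d+)(?:/.*)?$")

# Hypothetical store: exactly one preferred slug per article number.
PREFERRED_SLUGS = {
    244786: "AV-referendum-Why-we-must-vote-NO-to-the-new-voting-system",
}


def canonical_url(path, host="http://www.express.co.uk"):
    """Build the canonical from the article number, never from the request slug."""
    match = ARTICLE_PATH.match(path)
    if match is None:
        return None
    section, article_id = match.group(1), int(match.group(2))
    slug = PREFERRED_SLUGS.get(article_id)
    if slug is None:
        return None
    return f"{host}/{section}/view/{article_id}/{slug}"


for path in (
    "/features/view/244786/vote-YES",
    "/features/view/244786/who-exactly-specced-this-CMS",
    "/features/view/244786/AV-referendum-Why-we-must-vote-NO-to-the-new-voting-system",
):
    print(canonical_url(path))
```

All three paths give the same canonical, and asking for the canonical of that canonical gives it back unchanged, which is the property the Express's version is missing.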
Dozens of URLs for each Express story
Sometimes these duplicate canonical URLs aren't in Google's index (I guess as each one is cancelled out by the next one), although you can find them. This search, for instance, has this URL showing up: http://www.express.co.uk/posts/view/242092/DEBATE-Is-Britain-a-soft-touch-for-benefit-spongers-DEBATE-Is-Britain-a-soft-touch-for-benefit-spongers-DEBATE-Is-Britain-a-soft-touch-for-benefit-spongers-DEBATE-Is-Britain-a-soft-touch-for-benefit-spongers-DEBATE-Is-Britain-a-soft-touch-for-benefit-spongers-DEBATE-Is-Britain-a-soft-touch-for-benefit-spongers-
Even worse, the first URL that appears for that search is the printable URL of the page with no adverts on!

One paragraph, 55 results ...
And as that search, with 55 results, reveals, the Express has a massive problem with duplicate content.
The Express then makes the problem even worse ...
This is a problem it makes worse via its use of Tynt to add URLs when you copy and paste content. So if you copy and paste the first sentence from this URL: http://www.express.co.uk/features/view/244786/AV-referendum-Why-we-must-vote-NO-to-the-new-voting-system, what you end up with is this:
"BY the time you read this you will have probably already voted No to AV in today’s referendum.
Read more: http://www.express.co.uk/features/view/244786/AV-referendum-Why-we-must-vote-NO-to-the-new-voting-systemAV-referendum-Why-we-must-vote-NO#ixzz1LW2s00ge".
The Express uses Tynt to add the read more bit and the URL to what you've copied.
But, yes, the code they are adding contains the wrong URL with two versions of the page slug. Follow that link and copy a sentence and you end up with this:
"BY the time you read this you will have probably already voted No to AV in today’s referendum.
Yup, another new URL created by the system that's designed to channel links to the main story.
You can see this in action on this page on the Daily Mail where someone has copied the opening para from some other bat shit story, and the Tynt URL is to http://www.express.co.uk/posts/view/244206/EU-wants-to-merge-uk-with-franceEU-wants-to-merge-uk-with-franceEU-wants-to-merge-uk-with-france#ixzz1LCIcD5jI.
This might explain why the Express can't rank in first place for a paragraph from its own story.
To sum up
The Express isn't appearing top of Google's results for searches using its own content, and Google is serving up versions of its pages with no adverts on. All because Google can't work out which page is the correct one, as the Express constantly points to yet another URL for every single page - even the made-up ones.
My head hurts.

Amazing how an organisation such as the Express can get something like this so, so wrong, but they aren't the first and they certainly won't be the last.
I, for one, am not wholly against canonical tags: they do have a very useful function in certain specific instances. However, I am increasingly seeing them implemented more widely across various campaigns, not always in ways I would deem relevant, and sometimes where other options such as 301s may be better alternatives.
Some testing might have been in order. Some newspapers suffer from CMSes that create multiple versions of URLs for different sections (i.e. with /section/ in the URL) etc - 301ing often isn't an option for their current implementations (as you'll lose section-specific rules about layout, colour etc). Better CMSes are the ultimate solution, of course!
I have seen this so often in sitemaps - where this same reason causes duplication and sitemaps created from crawling have this repetition in them... I thought that Google would be better at fixing this and similar issues though! (even if I think that Google shouldn't have to)
didn't understand a bloody word of that but it sounds freakin' awful. glad you're on our side!
...and they have the same issue as the Independent had / the Sun has re: only the article ID being the unique identifier. Which is why they have
http://www.express.co.uk/features/view/244786/AV-referendum-Why-we-must-vote-NO
&
http://www.express.co.uk/features/view/244786/AV-referendum-Why-we-must-vote-NO-to-the-new-voting-system
in the index...
Canonical tags can be great, but they have to be implemented right, and a little bit of testing doesn't go amiss
hahahahaha Malcom, while the sheer volume of it trumps my client site issue, the fact that this is just ONE problem is superseded by my client's multiple layers of different problems.
SO I'd say my client site wins from the "sheer pathetic stupidity of the developers" perspective. Not by much though. Not by much at all...