Search Engines: Specify your canonical URL

patio11 · on Feb 13, 2009

Regarding the question of whether Google is a monopoly or not: non-monopolies cannot easily cause a new Internet standard to spring into being simply by announcing that a program of theirs will now apply specified behavior to a previously undefined syntactical element.

This line has a whole lot of chutzpah:

"This standard can be adopted by any search engine when crawling and indexing your site."

[Edit: Incidentally, I will have this implemented on my site by the end of the day. Because I'd be an idiot not to. Google is, I think, probably the only company who can create "drop what you are doing, now, this is your new priority" work for me besides my actual employer.]

litewulf · on Feb 13, 2009

(Agree with most of your points)

Yahoo and Google have both posted things before to the effect of "web authors: it'd really help us out if you did X", and oftentimes the other will adopt the convention. The spec only describes some uses of link elements, for example, so this doesn't really seem like an abuse of anything so much as a case of saying that something which was previously undefined now means something to you (or them).

briansmith · on Feb 13, 2009

You are much better off with doing the following: (1) All responses for non-canonical URLs are 301 redirects to the canonical URLs, (2) Your website will never link to a resource using a URL other than its canonical one, (3) you encourage people to link to pages on your site using the canonical URLs.

This way your site will be very cache-friendly while still being usable. Also, all search engines will be able to understand your site without any proprietary extensions (a.k.a. "standards" at Google, apparently) being needed.
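
A minimal sketch of point (1), assuming a hypothetical site policy of one preferred scheme and host, with /index.html treated as an alias for its directory. The names CANONICAL_SCHEME, CANONICAL_HOST, canonical_url and respond are illustrative only and do not come from any framework.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed policy, not from the thread: one preferred scheme and host.
CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.example.com"


def canonical_url(url):
    """Return the canonical form of url under the assumed policy."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Assumed rule: /index.html is an alias for its directory.
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    # Drop the fragment; it is never sent to the server anyway.
    return urlunsplit((CANONICAL_SCHEME, CANONICAL_HOST, path, parts.query, ""))


def respond(url):
    """Return (status, location): a 301 to the canonical URL, or 200 if url is already canonical."""
    target = canonical_url(url)
    if target != url:
        return 301, target
    return 200, None
```

Under these assumptions, respond("http://example.com/index.html") yields a 301 to http://www.example.com/, while a request for the canonical URL itself is served normally.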

wmf · on Feb 13, 2009

I generally agree with that approach, but Google gives an example of pages with query strings that need to have different URLs because they are subtly different, but not in a way that search engines need to care about.

OTOH, Google's wiki example is bogus; people have been telling MediaWiki that they should be using 301s for years but they just won't. This workaround just encourages them to never fix it.

zepolen · on Feb 13, 2009

It's not just one example. You can 'hurt' a website simply by making a huge list of links with arbitrary query string garbage so that Google picks it up, e.g.:

http://domain.com/?dupcontent

http://domain.com/?blabla

What an app should really do is validate the arguments in the query string, remove any invalid ones, then issue a 301 redirect to the proper url.

For example:

http://www.google.com/search?q=someterm&unknown_variable...

redirects to

http://www.google.com/search?q=someterm

Of course that's like getting everyone's CSS to validate correctly :)

Edit: It is also why I prefer to sort my query string so that it can be deterministic and always be the same no matter what order the args are in.
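
A rough sketch of the validate, strip, sort, and redirect idea, using only the Python standard library. The whitelist ALLOWED_PARAMS and the function names are hypothetical; a real app would derive the allowed parameters from its own routes.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical whitelist: the parameters this imaginary app understands.
ALLOWED_PARAMS = {"q", "page"}


def canonical_query_url(url):
    """Drop unknown query parameters and sort the rest so the order is deterministic."""
    parts = urlsplit(url)
    # keep_blank_values=True so that bare keys like "?dupcontent" are seen and can be rejected.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = sorted((key, value) for key, value in pairs if key in ALLOWED_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def redirect_if_needed(url):
    """Return (301, target) when url is not in canonical form, else (200, None)."""
    target = canonical_query_url(url)
    return (301, target) if target != url else (200, None)
```

With this whitelist, http://domain.com/?dupcontent and http://domain.com/?blabla both redirect to http://domain.com/, and ?page=2&q=a and ?q=a&page=2 end up at the same URL.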

briansmith · on Feb 13, 2009

That kind of attack won't work on a website that does a 301 redirect from all non-canonical URLs to the canonical ones.

IsaacSchlueter · on Feb 13, 2009

Why didn't they use the already established rel=bookmark value from the hAtom microformat? That's already in the wild on countless blogs and websites.

I swear, sometimes Google's awareness of existing web conventions is shockingly lacking.

litewulf · on Feb 13, 2009

Maybe I'm crazy, but a canonical URL is different from an hAtom permalink.

Besides, it's a canonical URL for the whole document and not just a portion of a page. How does Google know what the scope of a given rel=bookmark is? What if people already use it, and using it in a slightly different way would pollute hAtom?

IsaacSchlueter · on Feb 16, 2009

Well, that's why you'd use a <link> tag instead of an <a> tag.

Link tags in the head are information about the whole document. Anchor tags are more vague in their semantics, and in the context of an hAtom item, <a rel="bookmark"> would be the link to the canonical URL for that item.

For a document, <link rel="bookmark"> would be the canonical URL for that document. As in, "If you are looking to save or bookmark this document's URL, you should do it using this URL over here."
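
To illustrate the distinction, here is a minimal sketch of how a consumer could pick up only document-level <link> elements in the head and ignore <a> elements in the body. It uses Python's standard html.parser; the class HeadLinkFinder and the function document_links are made-up names, and nothing here is meant to describe how any real crawler works.

```python
from html.parser import HTMLParser


class HeadLinkFinder(HTMLParser):
    """Collect href values of <link> elements inside <head> that carry a given rel value."""

    def __init__(self, rel):
        super().__init__()
        self.rel = rel
        self.in_head = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            attributes = dict(attrs)
            # rel is a space-separated list of values; compare case-insensitively.
            rels = (attributes.get("rel") or "").lower().split()
            if self.rel in rels and attributes.get("href"):
                self.hrefs.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def document_links(html, rel):
    """Return the hrefs of head-level <link rel=...> elements; <a> elements are ignored."""
    finder = HeadLinkFinder(rel)
    finder.feed(html)
    return finder.hrefs
```

A page whose only rel="bookmark" is on an <a> in the body would return an empty list from document_links(page, "bookmark"), since that link is scoped to an item rather than to the document.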

Mix · on Feb 25, 2009

Because hAtom is wrong; Zeldman just does not want to admit his mistake. rel=bookmark is not for permalinks, but for anchors. Read this for a full explanation: http://www.tamurajones.net/MarkingPermalinks.xhtml

aristus · on Feb 13, 2009

(Black hat on) This might be very interesting for some types of injection attacks. Instead of simply getting backlinks you could steal the pagerank of your victims without leaving a visible mark. Limited to subdomains, though.

lsb · on Feb 13, 2009

No, it's the same domain, so unless your victims are your co-workers, it won't work.

buro9 · on Feb 13, 2009

I've got a site that runs on both http and https.

Is a protocol change enough of a difference to consider the identical content as a duplicate?