Non-news: Malformed URLs don’t pass Anchor Text.


I’ve started another burst of postings about web server log file analysis and what it tells search engine optimisers about search engine spiders. Web spider behaviour often lies behind issues that I find on other blogs. For example, Dave Naylor has a couple of recent articles that are interesting. A good one to read is about using the “motion charts” in Google Analytics to find opportunities. But there’s an odder one about Anchor Text. Some of that article is confirmation of things Matt Cutts has written about (the first link being the one that carries anchor text value, for example, or anchor text and nofollow), or delayed echoes of Rand Fishkin’s recent article on Anchor Text.

Apart from the validation of Matt Cutts’s statements, there’s one result that appears blindingly obvious: malformed URLs don’t pass anchor text and, by implication, weight. In the context of the example in the article, adding a space to a URL in the anchor destroys the value. Googlebot changes spaces (which aren’t valid characters in a URL) into “%20” sequences. In Dave Naylor’s article, that means that Googlebot will do a DNS lookup for a domain that doesn’t and can’t exist, because spaces are not allowed in domain names. If the URL in the anchor’s href had been a fully pathed URL, then the space would be appended to the end of the path and converted to a “%20”.
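To make the two cases concrete, here is a minimal Python sketch. It assumes, as described above, that a spider simply swaps each literal space for “%20” without trimming anything. The `escape_spaces` helper and the `www.example.com` URLs are made up for illustration, and this is not a claim about how Googlebot is actually implemented.

```python
from urllib.parse import urlsplit


def escape_spaces(href: str) -> str:
    """Replace each literal space with %20, without trimming anything."""
    return href.replace(" ", "%20")


# Case 1: the href is just a domain, with a stray trailing space.
domain_case = escape_spaces("http://www.example.com ")
# Case 2: the href is a fully pathed URL, with a stray trailing space.
path_case = escape_spaces("http://www.example.com/page.html ")

# The escaped space becomes part of the host name, which no DNS server can resolve.
print(urlsplit(domain_case).hostname)  # www.example.com%20
# The escaped space becomes part of the path, naming a resource that doesn't exist.
print(path_case)                       # http://www.example.com/page.html%20
```

In the first case the request never reaches the site at all. In the second it reaches the site but asks for the wrong resource.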

That full URL, with an appended “%20”, won’t be found on the site. It should appear, at some point, in web server log files as a 404 for a Googlebot visit. 404s don’t pass weight. So why the surprise that a malformed URL would fail?
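One rough way to look for these requests in your own logs is sketched below. It assumes the Apache “combined” log format and treats any user agent containing “Googlebot” as a Googlebot visit, which is a simplification since user agents can be spoofed. The `googlebot_404s` function and the sample log lines are invented for illustration.

```python
import re

# Matches the request, status, size, referrer and user-agent fields of a
# line in Apache "combined" log format.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_404s(lines):
    """Return requested paths ending in %20 that a Googlebot user agent got a 404 for."""
    hits = []
    for line in lines:
        m = LOG_LINE.search(line)
        if (
            m
            and m["status"] == "404"
            and "Googlebot" in m["agent"]
            and m["path"].endswith("%20")
        ):
            hits.append(m["path"])
    return hits


# Hypothetical log lines, not taken from a real server.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Oct/2010:13:55:36 +0000] "GET /page.html%20 HTTP/1.1" '
    '404 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Oct/2010:13:55:40 +0000] "GET /page.html HTTP/1.1" '
    '200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [10/Oct/2010:13:56:01 +0000] "GET /missing.html%20 HTTP/1.1" '
    '404 512 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

print(googlebot_404s(SAMPLE_LOG))  # ['/page.html%20']
```

Any path that turns up this way points to an inbound link with a stray space in its href.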

I think the real point, not cleanly spelled out in the article, is that web browsers don’t parse pages the way that search engine spiders parse pages. A browser will cope with the embedded space. That ability of a browser to infer the useful thing to do doesn’t make the space a valid character in URLs, not without being escaped, anyway. And the consequence of appending the space will be that a web spider requests a resource that will usually return a 404, unless the administrator has used Apache mod_speling or an equivalent typo-correction tool (which should yield a 301 redirect to the correct resource).
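The difference between the two parsing styles can be sketched in a few lines of Python. Both functions are hypothetical models, not real browser or Googlebot code: `lenient_href` assumes a browser-like parser that trims surrounding whitespace before using the href, and `literal_href` assumes a spider that escapes every space it is given.

```python
def lenient_href(href: str) -> str:
    """Browser-like model: trim surrounding whitespace, then escape what is left."""
    return href.strip().replace(" ", "%20")


def literal_href(href: str) -> str:
    """Spider-like model: keep every character, escaping spaces as %20."""
    return href.replace(" ", "%20")


href = "http://www.example.com/page.html "  # note the trailing space

print(lenient_href(href))  # http://www.example.com/page.html
print(literal_href(href))  # http://www.example.com/page.html%20
```

The same markup yields a working link for the visitor and a request for a non-existent resource from the spider, which is why testing a link in a browser says little about how it will be crawled.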

Attempting to infer SEO value from browser-interpreted behaviour, without understanding Googlebot behaviour, will lead to confusing and misleading conclusions.
