SEO, IIS case folding filenames, Spiders, Analytics, and Robots.Txt


AFAICS, the best way to administer IIS for SEO purposes seems to be to run screaming from the room and hide under a desk until you are allowed to use Apache. So many of the default behaviours create difficulties for users or for SEO. Yes, I've been continuing to dig into web analytics and IIS web server log files for a couple of clients, and I've now seen this problem again – first noticed many, many years ago (1999 or so, I think) at another company, and still not solved by default:

What is the authoritative name of the web page?

Assume that your web site uses IIS and has a home page available as "/index.htm". You can then use the following as silent synonyms of the home page:

  • /index.htm
  • /Index.htm
  • /INDEX.HTM
  • /InDeX.hTm

Why is finding the same file under multiple names a problem?

Whether you visit with a web browser or a search engine bot crawls your web site, each one of those URLs appears to be a different file, albeit with identical content. If you have optimised the content, then each is a candidate to be the answer for search users. So the spiders do need to track each file name variant, and check each one, to see what has changed.

In each case you get an immediate web server response of "200" – file found. Link love can be spent on a wide variety of paths that lead to the same place – but the spiders aren't told that. There is another way to handle this, one which does not work out of the box on Linux and Apache, but is fairly easy to set up.

Web Brand Is All About Experience

If the file system does not fold case – that is, it treats upper and lower case letters as two distinct things – then a request for a file with mismatched case delivers a 404 – File Not Found. Now that’s a bad user interface experience. Brand is all about experience, so why punish your users with a 404 because they can’t remember what the capitalisation of your ProDuct (sic) is?

You need to find a way to deliver both the web page that the user wanted to get to, and also let the spiders know that there is one authoritative page – there may be a lot of different links to get there, but just one resource.

On Merjis.com we use a technique that helps the spider understand that we have one page, and that case changes are a link problem, not a server duplication. If you try to reach:

  • http://merjis.com/contact
  • http://merjis.com/Contact
  • http://merjis.com/CoNtAcT
  • http://merjis.com/CONTACT

then you should get to the same page, the Merjis contact information page. In terms of user experience this is just the same as on IIS. But we issue a redirect on all the non-standard forms of the page name, and IIS doesn't. Spiders can see that only one page for "contact" exists on the Merjis site, even if it is reachable through many different URLs. This cuts down redundant crawls, focuses link love on a single page, doesn't lose references from typographically challenged links, gets users to the page they want whatever the case of the URL they type, and is generally A Good Thing.
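
How you implement the redirect is up to you: server rewrite rules, a module, or application code all work. As a rough sketch of the idea only (not a copy of what we actually run), here is a minimal Python/WSGI middleware that sends a permanent redirect whenever the requested path contains upper-case letters, so every case variant collapses onto one canonical, all-lowercase URL:

# Sketch only: 301-redirect mixed-case paths to the all-lowercase form.
def canonical_case(app):
    def middleware(environ, start_response):
        path = environ.get('PATH_INFO', '/')
        if path != path.lower():
            # A production version would build an absolute URL and keep
            # the query string; this just shows the principle.
            start_response('301 Moved Permanently',
                           [('Location', path.lower()),
                            ('Content-Type', 'text/plain')])
            return [b'Moved']
        return app(environ, start_response)
    return middleware

Wrap whatever WSGI app serves the site (app = canonical_case(app)) and spiders that ask for /Contact are told, explicitly, that the resource lives at /contact – while users still land on the page they wanted.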

How To See A Redirect

If you don't use a tool like "wget" or one of the Firefox HTTP inspection tools, your only real clue to our redirection is that whatever you typed in the URL bar is replaced by our chosen URL for the resource. Between your input and our response, we added a redirect. Spiders see the redirection, index only one page, and can pour all the link love onto that page.
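
If you'd rather not install wget, a few lines of Python do the same job. A sketch – http.client doesn't follow redirects, so the status and Location header stay visible; the exact status code depends on the server:

# Ask for a mixed-case variant of the contact page and show what the
# server says, without following the redirect.
import http.client

conn = http.client.HTTPConnection('merjis.com')
conn.request('HEAD', '/CoNtAcT')
resp = conn.getresponse()
print(resp.status, resp.reason)      # a redirect status, e.g. 301
print(resp.getheader('Location'))    # the canonical URL for the page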

That's completely unlike standard IIS behaviour. The IIS default is to fold upper and lower case, so you see exactly the URL that you typed – and there is no signal that the file is a single file known by many names.

Spiders can’t guess that a single file is a single file – they only know what they are told. They get told what links exist in sitemaps and by other link references across the web. If a spidered site has references to mixed case versions of names, then the spiders will tromp madly off to each alternative case version of exactly the same file.

SEO Means Never Saying Sorry To Stupid Spiders

I'm of the opinion that helping the spiders to find the right information helps SEO. Sending spiders to a dozen spelling variations of a path doesn't boost rank, unless the spiders are clever. Even without that, sending spiders to crawl redundant pages, when they could more frequently crawl real content, is a waste of the attention from search engines. If spiders were clever, they wouldn't repeatedly crawl redundant paths to the same content across the whole server… So give them a hand to get to the right single file that should be taking all the page rank.

This default case folding behaviour means that IIS again contributes to spamming search engine indexes. Not so bad, except that it causes yet another problem.

My Web Analytics Don’t Fold Case

When you are trying to analyse what is happening to users, just one miskeyed filename can result in the analytics giving you multiple paths to, for example, conversion. Typically the JavaScript page bug reports the filename that was accessed – using the same capitalisation that the web server delivered. Why? Because on a large fraction of the other web servers, case does matter and “/index” is a different file from “/INDEX”.

So on an IIS delivered file, the same page can be known by a wide range of names that mean the same thing. But analytics packages don’t (usually) fold case – so each reference to a different capitalisation adds another meaningless node to journey analysis.
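
If you are stuck with an IIS site and a pile of raw logs or an analytics export, folding the case yourself before doing any pageview or journey analysis collapses the phantom nodes. A sketch, assuming hits is a list of page paths pulled from the logs – only do this where you know the server folds case, since on a case-sensitive server it would merge genuinely different pages:

# Paths as logged; the same two pages appear under five names.
from collections import Counter

hits = ['/index.htm', '/Index.htm', '/INDEX.HTM', '/contact', '/Contact']

raw = Counter(hits)                          # five apparently distinct pages
folded = Counter(p.lower() for p in hits)    # the two real pages

print(raw)
print(folded)    # Counter({'/index.htm': 3, '/contact': 2})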

[ N.B. See Chris's comment below. I should qualify the assertion that analytics don't fold case – of course, if any of them *do*, that's another problem… ]

The following deleted section is a bit rubbish. I failed to properly read and understand the robots.txt spec: I interpreted a line that applies to all records in the file as applying only to the User-Agent line. robots.txt allows case folding – /MEMBERS and /members are identical according to the later spec; the earlier spec only clearly states that the User-Agent field ignores case, leaving the possibility of either ignoring or respecting case in a Disallow line.

However – I started this article because I found a client's private area on IIS had been crawled and indexed, despite being listed in robots.txt. I still need to describe that investigation – but this article is long enough.

And that’s not all. Oh no…

Google Spiders Content Disallowed In Robots.txt

If you have parts of the site that you don’t want to be in the index, you can use robots.txt to exclude those directories or applications.

Except… you can’t. Not sensibly, not with IIS out of the box.

Say that you have a directory called "/members" and you want only signed-in members to see the content. You exclude the spiders with:

Disallow: /members

However… case folding… This directory is also accessible as "/MEMBERS", and that isn't excluded in this customer's Robots.txt. So your hidden content is now visible if just one link, somewhere on the internet or even on your own site, uses a different capitalisation from the one that has been put in the Robots.txt file.
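
You can watch a literal matcher make exactly this mistake. Python's urllib.robotparser is used here purely as an example of a parser that compares paths character-for-character (what Google's or Yahoo!'s own parsers do is, of course, up to them; example.com stands in for the client's site):

# The same Disallow rule as above, fed straight to a literal matcher.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /members',
])

print(rp.can_fetch('*', 'http://example.com/members/list.htm'))   # False - blocked
print(rp.can_fetch('*', 'http://example.com/MEMBERS/list.htm'))   # True - crawlable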

Is this Google and Yahoo!’s problem to resolve? IMO, not really. If you choose to use a server that makes the same content available under a range of paths, it is up to you to protect those paths, not for spider developers to guess that you may have shot yourself in the foot.

OTOH, the SEs do themselves no favours in reducing index spam, crawl volumes and bandwidth usage by failing to recognise that IIS can serve the same page under a multiplicity of case variations. That's a different problem – but solving one would solve the other. If the server can be detected as using case folding, then a case-independent match of Robots.txt paths would be a useful extension.

I can even imagine adding a new directive to Robots.txt to express that pathnames do not respect case.

Other Case Folding Systems

Well, Apple OS X. It may be my favoured desktop OS, but its default FS folds case:


$ echo boo > goose
$ cat GOOSE
boo
$

I haven't tested – I have no SEO clients with Mac servers – but I suspect that Mac servers on the default FS run into the same problem of a futile and useless Robots.txt protection.

IIRC, OS/2, aka "Warp", was used for some years as a web server platform, and it used a case-folding FS – so if you run one of those ancestral systems, watch out.

Webmasters

Your defence? Well, make IIS respect case in queries and do a proper redirect to the actual file.

That way, spiders could properly use Robots.txt and your hidden content wouldn’t be accessible. Try asking Microsoft about that configuration. Heh. Here’s Microsoft’s page about creating a Robots.txt – note the discussion about case folding? Oh, there *is* no mention of case folding? Hmm. Well, I think that’s a lesson in its own right.

Failing any rational advice from Microsoft, you could indulge in a little combinatorial madness in Robots.txt:


Disallow: /members
Disallow: /Members
Disallow: /mEmbers
...
Disallow: /MEmbers
...
Disallow: /MEMBERS

and so on. It’s easy. Yeah. Right.
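
That "and so on" is doing a lot of work. Each letter can be upper or lower case, so a seven-letter path like /members needs 2^7 = 128 Disallow lines. A throwaway Python sketch that generates them, just to make the scale obvious:

# Generate one Disallow line per case variant of a path - the
# "combinatorial madness" approach.
from itertools import product

def disallow_all_cases(path):
    # Letters double the count each time; non-letters would yield duplicates.
    name = path.lstrip('/')
    for combo in product(*[(ch.lower(), ch.upper()) for ch in name]):
        yield 'Disallow: /' + ''.join(combo)

lines = list(disallow_all_cases('/members'))
print(len(lines))    # 128
print(lines[0])      # Disallow: /members
print(lines[-1])     # Disallow: /MEMBERS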

Have I mentioned that I think IIS adds problems, rather than removing them?

It is all so much more complex if you have cookieless mode enabled for your ASP applications, because the path given to the robots doesn't match the path that is denied. Deny "/secret" and you get a path that starts "/(" and goes on to "))/secret". Combine that with case folding and there is no end to index spamming.

And, of course, robots are the main things that need to read and respect Robots.txt.

Given a modicum of sense, this whole area could be made a lot simpler for system admins and webmasters. If an IP address and user agent doesn’t accept cookies, and asks for “robots.txt”, it is probably a robot. Stop sending cookieless tracking paths to that IP and UA.

Sticking In a Reverse Caching Proxy

If you are a technological sophisticate, then you could insert Apache, with a rewriter and mod_speling, as the reverse caching proxy, to get the benefits of case matching and redirection to the single real file instance. You'll possibly see a slight average speed-up for users, as unchanged content is delivered direct from the Apache cache.

How to set up and configure one of these cute web servers is beyond the scope of this article, though.

Scale Of Problems

It's quite nasty for people using hosted IIS, who have no significant control. I have no doubt that the SEs do duplicate detection. Matt Cutts has written that in-site duplication isn't too awful a penalty – probably because they keep spidering IIS sites. OTOH, places like WebMasterWorld have a fair number of webmaster stories about two or more case variations of the same page sitting in the search engine indexes with radically different page rank – and about the site's position swinging wildly depending upon what has been spidered recently.

There is a difference between ignoring duplicates and sending full credit for an inbound link to the “master” version of the duplicates. Some folk seem to think that there’s no great loss. I’m pretty conservative about this – why risk losing the benefits of inbound links, just because of case folding?

Case folding for private content is a problem. I’ve seen many complaints about Google revealing information concealed by Robots.txt, and I have strong suspicions that in most of these cases, the complainants were running IIS and had an undetectably case-mismatched link reference somewhere on the site, or had enabled Cookieless mode for ASP.

I can't find any authoritative discussion about this issue, especially in Microsoft's Knowledge Base, hence the posting here. Of course it may be that I detest MS so much that I haven't spent enough time poking around their resources. That'd be a quite valid criticism of me and this article. 🙂

Other Solutions For Privacy

Given that Robots.txt is effectively useless faced with both Cookieless mode and case folding, you really, really need to get to grips with your friendly page-specific metatags "NOINDEX", "NOFOLLOW" and "NOARCHIVE".

Any private content should at least have “NOINDEX”.

Arguably, terminal pages (action pages, etc.) and private content should get "NOFOLLOW" directives.

Summary

IIS, by default, apparently opens up web sites to some Google Bowling activities. You can even bowl yourself out, if you use case variations in your own URLs.

Carefully review what you do have; probably the simplest suggestion is that you stick to all lower case in every link. Watch out for inbound links that use mixed case – those can scupper your rank.
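
Part of that review can be automated. A rough sketch using Python's standard HTML parser, run over a saved copy of a page (the filename page.html is just a placeholder), which flags any link whose href isn't already all lower case:

# Print every <a href="..."> whose target contains upper-case letters,
# i.e. a link that multiplies URLs on a case-folding server.
from html.parser import HTMLParser

class MixedCaseLinkFinder(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value != value.lower():
                    print('mixed-case link:', value)

with open('page.html', encoding='utf-8') as f:
    MixedCaseLinkFinder().feed(f.read())

It will also flag capitals in query strings and external domains, so treat the output as a list of things to eyeball rather than an automatic fix.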

Use the meta tag for robots to set NOINDEX and possibly NOFOLLOW on your private content on case-folding IIS systems to prevent your data leaking into the search engines.

Best bet – dump IIS. OK, that's a biased view, and a very selective one. But if you want to rank highly in search engines, setting up and running a LAMP (Linux, Apache, MySQL, PHP) or similar system is easy and cheap. While it does have comparable problems (unbranded 404s, etc.) the solutions are also cheap and easy to set up. If I'm really blunt, I wouldn't actually use PHP, either – the script language design makes it too easy to embed SQL, IME. I'd steer for Python, Ruby on Rails or OCaml… probably 😉

It's better than finding your private data leaked, or that you've been scuppered by a near-unfindable typo.
