Every time I look at the logs of my webservers, I’m amazed at how many webcrawlers are out there, indexing web sites for some search engine or another. But there are definitely parts of my websites that aren’t meant for indexing — dynamically generated statistics that change every minute aren’t worth indexing, and likewise, web cameras probably aren’t worth throwing into a search engine. This is where the Robot Exclusion Protocol comes in; it allows you to create a robots.txt file as part of your website, and well-behaved webcrawlers are supposed to look at this file to determine what not to index.

Using a robots.txt file on your Frontier site

Frontier is an amazing web serving and web content environment, and putting a robots.txt file into your Frontier-generated or -maintained web sites is easy. Since I’m concentrating mostly on how to integrate a robots.txt file into a Frontier web site, I’m going to assume from this point on that you know how to write one; if you don’t, the examples below will still make sense, and you can read the Web Server Administrator’s Guide to the Robot Exclusion Protocol for the details. Additionally, I am focusing here only on mainResponder as the webserver in Frontier, since that’s the current shipping framework. (It’s also much improved over the old webserver, so you should upgrade to it if you’re still running a version of Frontier without it!)

First, let’s start out with the three big requirements of a robots.txt file that are relevant to this discussion:

  1. There can only be one robots.txt file on a website.
  2. The file should be at the root of the website.
  3. The file should be plain text, not HTML.

The first requirement is tricky, insofar as it means that any given base URL can have only one robots.txt file. So, if you have Frontier serving only one URL (say, my.site.com), you can have only one file total. But if you are using Frontier’s ability to serve multiple virtual domains, then each domain can have its own robots.txt file.
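
For illustration (the host names here are just placeholders), a crawler only ever requests the file at the root of each host it visits, so with two virtual domains the requests look like this:

http://my.site.com/robots.txt
http://other.site.com/robots.txt

A robots.txt file placed anywhere deeper in either site is simply never requested.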

The last two requirements mean that, depending on how you have set up mainResponder or each virtual domain site, there are different ways to include the robots.txt file. The big variable is where a website’s home URL is being served from — for the most part, it can either be from the Guest Databases\www\ directory or from a website table in the Object Database (ODB). Let’s look at both options.

The Guest Databases\www\ directory

This provides the easiest way to include a robots.txt file, since a website that’s serving out of this directory just serves up files as-is. All you have to do is put a plain text file named robots.txt into the directory; a file like the following will prevent any well-behaved crawler from indexing anything on your site.

User-agent: *
Disallow: /

Now, remember the first rule above: you can only have one robots.txt file on any given website. So, if the base URL for your website is my.site.com, and you have a Manila site at my.site.com/myFirstSite/, you can’t have a robots.txt file at the base of your Manila site (my.site.com/myFirstSite/robots.txt); webcrawlers won’t look for it there, since the protocol doesn’t provide for per-directory files. Instead, your robots.txt file at the root of your website can include rules specific to the directory myFirstSite, as shown below; read the protocol spec for the details. (Note that there’s also a robots META tag that you can use on a per-page basis; that’s a topic for a later date.)
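
For example, a root-level robots.txt that keeps well-behaved crawlers out of the entire Manila site (just a sketch; adjust the path for whatever you actually want to exclude) would look like this:

User-agent: *
Disallow: /myFirstSite/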

Website tables in the ODB

Because of the third requirement above, this is a little trickier. When a site is rendered out of a table, Frontier is good about creating proper HTML; for the robots.txt file, though, you don’t want HTML, you want plain text. Fortunately, this is easily done.

First, create a subtable named #templates at the root of your website table. In this table, create an outline object named robotTemplate. The outline should consist of one line, {bodytext}. (Specifically, there should not be references to the pageHeader or pageFooter macros.)
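
In other words, the entire robotTemplate is just the one macro that drops in the page’s body text, nothing more:

{bodytext}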

Next, create a WP Text entry at the root of your website table named robots.txt. The first line of this should be #template "robotTemplate"; the rest should contain your robots.txt directives. Again, determine the content you want from the protocol spec. Here’s an example:

#template "robotTemplate"
   
User-agent: *
Disallow: /private/

Now, when a webcrawler requests the robots.txt file, it will get this object; Frontier renders it through the robotTemplate template, so it doesn’t get any HTML formatting. Check it out in your web browser: request the robots.txt file and look at the page source to verify that there’s no HTML.
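
If everything is wired up correctly, a request for http://my.site.com/robots.txt (substituting your own host name) should return nothing but the directives themselves, with no surrounding markup:

User-agent: *
Disallow: /private/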

Again, remember that this is only useful if the website table is actually being served as the root of a website; if it’s one directory into the website, then webcrawlers won’t look for the file, and it won’t do you any good.