Saturday, February 6, 2010

Writing Robots.txt Files

Restricting Search Engine Indexing Using a ‘robots.txt’ File

Search engines decide which pages of your website to crawl (and therefore index) by reading the ‘robots.txt’ file in the root folder of your website.

To change this file, first download it via FTP or a file manager provided by your web host. Then, after making a backup of the file, edit it in a plain-text editor such as Notepad. If the file doesn’t yet exist in your root web folder, create a new file in your text editor and save it as ‘robots.txt’.

There are a number of options (called directives) that allow or disallow particular robots from crawling your site, but in most cases you just want to let all robots look at your site. So the first line of your ‘robots.txt’ file should be the directive:

User-agent: *

The * says that the rules that follow apply to all robots.

Then list the folders and files, one per line, that you don’t want the robots to crawl. Don’t leave any blank lines between these lines, since a blank line marks the end of a group of rules. This gives:

User-agent: *
Disallow: /keep-out/
Disallow: /no-entry/
Disallow: /come-in-here/but-not-in-here/
Disallow: /a-folder/private.htm

The robots will not crawl the contents of ‘/keep-out/’, ‘/no-entry/’, ‘/come-in-here/but-not-in-here/’ (or any files or folders within these), and will not look at the file ‘/a-folder/private.htm’. The robots will, however, crawl the root directory, ‘/come-in-here/’, and ‘/a-folder/’ plus any other folders that exist.
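If you want to check rules like these before uploading them, one option is Python's standard-library `urllib.robotparser` module. The following is only a minimal sketch: the helper name `blocked_paths`, the agent name `ExampleBot`, and the sample paths are made up for illustration, and real search engine robots may interpret a file slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /keep-out/
Disallow: /no-entry/
Disallow: /come-in-here/but-not-in-here/
Disallow: /a-folder/private.htm
"""

# Hypothetical paths to test against the rules.
PATHS = [
    "/",
    "/keep-out/secret.html",
    "/no-entry/",
    "/come-in-here/index.html",
    "/come-in-here/but-not-in-here/page.html",
    "/a-folder/private.htm",
    "/a-folder/public.htm",
]


def blocked_paths(robots_txt, paths, agent="ExampleBot"):
    """Return the paths that the given robot may not crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(agent, path)]


if __name__ == "__main__":
    for path in blocked_paths(ROBOTS_TXT, PATHS):
        print("blocked:", path)
```

Paths that are not printed, such as the root directory and ‘/come-in-here/index.html’, remain open to the robots.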

The above example is generally all you need. There is another directive called ‘Allow’ that lets the robots crawl certain folders and files inside ‘Disallowed’ folders. This directive is not supported by all search engines and so is best avoided if possible. But if you had a file ‘/no-entry/look-at-me.pdf’ that you did want indexed, the above example could be changed to:

User-agent: *
Disallow: /keep-out/
Allow: /no-entry/look-at-me.pdf
Disallow: /no-entry/
Disallow: /come-in-here/but-not-in-here/
Disallow: /a-folder/private.htm

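Note that it is very important that the Allow directive comes before the Disallow directive for the folder in question (‘/no-entry/’ in this case), as some robots read the ‘robots.txt’ file from top to bottom and stop at the first match. Others, such as Google’s, apply the most specific matching rule regardless of order, so putting the Allow line first works in both cases.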
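The effect of rule order can be seen with Python's standard-library `urllib.robotparser`, which (in current CPython versions, as far as I am aware) applies the first rule that matches. This is a rough sketch only: `can_crawl` and the agent name `ExampleBot` are hypothetical names, and the two files are cut down to the lines that matter.

```python
from urllib.robotparser import RobotFileParser

# Allow line placed before the Disallow line for the same folder.
ALLOW_FIRST = """\
User-agent: *
Allow: /no-entry/look-at-me.pdf
Disallow: /no-entry/
"""

# Same rules, but with the Disallow line first.
DISALLOW_FIRST = """\
User-agent: *
Disallow: /no-entry/
Allow: /no-entry/look-at-me.pdf
"""


def can_crawl(robots_txt, path, agent="ExampleBot"):
    """Return True if this parser would let the robot fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    pdf = "/no-entry/look-at-me.pdf"
    print("Allow first:   ", can_crawl(ALLOW_FIRST, pdf))
    print("Disallow first:", can_crawl(DISALLOW_FIRST, pdf))
```

With a first-match parser like this one, the PDF is only reachable when the Allow line comes first; in both files the rest of ‘/no-entry/’ stays blocked.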

Finally, if you don’t want your site to be crawled at all, then your ‘robots.txt’ file would be:

User-agent: *
Disallow: /
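As a quick sanity check, the same standard-library parser can confirm that this two-line file shuts every path off. Again this is just a sketch; `is_site_closed`, the agent name `ExampleBot`, and the sample paths are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# The "keep everything out" file from above.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_site_closed(robots_txt, sample_paths, agent="ExampleBot"):
    """Return True if none of the sample paths may be fetched by the robot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not any(parser.can_fetch(agent, path) for path in sample_paths)


if __name__ == "__main__":
    samples = ["/", "/index.html", "/a-folder/public.htm"]
    print(is_site_closed(BLOCK_ALL, samples))
```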

Once you are happy with the changes, save the file and upload it to the root of your website.