What is Robots.txt? Everything you need to know

Josien Nation
Published on February 16, 2021

A search engine optimization process consists of technical SEO optimization, content optimization (On-Page SEO), and Off-Page SEO. Technical optimization consists of all kinds of small facets, including the robots.txt file.

The robots.txt file seems like a small part of your SEO, but it has a major impact on the crawlability of your website (whether the Googlebot can “read” your pages). The file contains instructions for the search engines: it tells them how they should deal with the website.

A good robots.txt prevents problems such as content not being indexed and crawl budget being wasted. I will explain exactly how this works in this blog, of course with the necessary examples.

What is a robots.txt file?

A robots.txt file is a text file that contains instructions for search engines. Thanks to these instructions in the robots.txt file, the crawler of a search engine knows how to handle the website. The instructions in it indicate which pages search engines may or may not crawl. The robots.txt file can be seen as the manual of your website for all search engines.

What is crawling?

All search engines visit your website on a daily, weekly, or monthly basis. This is called crawling, and it is how your pages end up in the search results. The crawl process starts with the robots.txt file.

The location of the robots.txt file

The robots.txt file is placed in the root directory of the website. When a crawler visits your website, it visits the robots.txt file first. When this file is not present, the crawler automatically crawls all pages it encounters, within your crawl budget (I explain the crawl budget later in this article).

What does a robots.txt file look like?

A robots.txt file is a text file with instructions. One instruction is placed on each line. You can find the file by adding /robots.txt to your main domain.

Mine can be found via the URL: https://josiennation.com/robots.txt. You can see it contains several rules with instructions. It tells the crawler how to handle my website.
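
As a minimal sketch (the paths and the example.com domain are placeholders, not recommendations for your site), a simple robots.txt file could look like this:

User-agent: *
Disallow: /wp-admin/

Sitemap: https://www.example.com/sitemap.xml

The first line says which crawlers the instructions are for, the second line excludes a directory, and the last line points to the sitemap. Each of these instructions is explained below.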

Why is a robots.txt file important?

If you don’t have a robots.txt file, the crawler will crawl the entire website. This means that unnecessary pages, such as the login page of the website, are also crawled. These pages do not need to end up in the search results, and crawling them eats up your crawl budget.

The robots.txt file provides instructions to the search engines and is the basis of the crawl process. In most cases the instructions are followed, which makes this file important. A search engine may occasionally ignore your robots.txt file, for example when it considers the instructions irrelevant, but this is very uncommon.

What is the crawl budget?

The crawl budget is the time that a crawler takes to visit your website. The more crawl budget a crawler has for your website, the more pages it visits. Large websites, such as news sites, generally get more crawl budget than a small website with pages that are not updated often.

You can’t know upfront exactly how much crawl budget the crawler will spend. You can, however, get an estimate from the crawl statistics that you find in Google Search Console. That is why it is wise to ensure that unnecessary pages, such as a login page, are not crawled.

Which instructions go in a robots.txt file?

Almost all major search engines, such as Google and Bing, use the robots.txt and have also drawn up guidelines. These guidelines indicate how to prepare the instructions. Incorrect use of the instructions in the robots.txt file confuses search engines. The main instructions are:

User agent: which search engines are allowed to crawl the website?

The robots.txt file often starts with User-agent: *. This User-agent instruction indicates for which search engine the instruction is intended. When you use the User-agent: * statement, you indicate that the instructions apply to all crawlers, including search engine crawlers and other bots on the Internet.

Also, it is possible to give specific instructions to one search engine. Each search engine crawler has its own name. For example, Google’s user agent is called Googlebot and Bing’s is called Bingbot.

If you have instructions for one specific crawler, it is important to indicate this explicitly. For example, if you have an instruction specific to Google, then you use User-agent: Googlebot. When you use this, Google will follow the instructions listed under it, while other crawlers will ignore them.

This also means that you can provide instructions for all crawlers first and specific to one crawler later. So you first write instructions with the User-agent: * line and give instructions for a specific crawler later in the robots.txt file. For example, this looks like this:

User-agent: *
Disallow: /about/josien

User-agent: Googlebot
Disallow: /about/

In the example above, I’m telling all crawlers NOT to crawl one specific about page, and telling the Googlebot crawler not to crawl any of the about pages.

*This is just an example; I do not recommend doing this for your about page.

Disallow and Allow rules: grant or deny access

You probably guessed it already: Disallow and Allow are the two instructions that grant or deny the search engine access to certain parts of the website. With a Disallow instruction you prohibit access to a page or a group of pages, and with an Allow instruction you grant access to a page or a group of pages.

For example, this looks like this in your robots.txt:

User-agent: *
Disallow: /wp-admin/

In this example, I exclude the back-end of the website using Disallow: /wp-admin/. Most people know that /wp-admin/ is the login area of a WordPress website. It does not need to be crawled.

With the Disallow instruction you prohibit the search engine from accessing this page. This prevents the page from being crawled, but if a link to this page can be found on the website itself, the crawler will still discover the URL. So do not only exclude the page via the robots.txt, but also make sure that it is not linked on the website.

Of course, it also happens that you want to give specific access to a page. Then use the Allow statement. For example, do you want to make sure that your blog is being crawled? Then simply place Allow: /blog/ in the robots.txt file.
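
A common pattern on WordPress websites combines the two instructions: the back-end is excluded, but the admin-ajax.php file, which WordPress uses for certain front-end functionality, stays accessible. This is a sketch, not something every site needs:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php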

Wildcards: Instructions for a group of pages

The instruction User-agent: * contains a wildcard and thus gives instructions to several search engines. A wildcard is a symbol that replaces a character or string of characters, so you don’t have to compose an instruction for every URL or crawler. There are different kinds of wildcards you can use in the robots.txt file.

Wildcard: *

With an asterisk as a wildcard, you indicate that any character or sequence of characters, of any length, can appear in that position. Everything that can be placed in the position of the wildcard falls under the instruction that you draw up. For example:

User-agent: *
Disallow: /*?

This tells the search engine not to crawl any URL that contains a question mark. That is of course not a desirable situation for most websites, so you should not put this disallow in your own robots.txt file.
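
A more realistic use of the asterisk is blocking one specific URL parameter instead of every question mark. In the sketch below the filter parameter is purely hypothetical; check which parameters your own website actually uses:

User-agent: *
Disallow: /*?filter=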

Wildcard: $

The dollar sign indicates the end of a URL. You use the $ wildcard for specific file types on the website, such as PDF files or PHP files. In practice, you don’t use this wildcard much. An example of its usage is:

User-agent: *
Disallow: /*.pdf$

Watch out for conflicting instructions in the robots.txt file

The instructions in the robots.txt file should not contradict each other. This confuses the crawlers. Conflicting instructions can arise from misuse of wildcards or from Allow and Disallow instructions that are used interchangeably.

Example of conflicting guidelines

User-agent: *
Allow: /blog/
Disallow: /*.html

In this example, an instruction has been given to allow crawling of URLs containing the path /blog/, but at the same time the search engine is denied access to every URL ending in .html anywhere on the website. So if the URL is /blog/i-want-to-be-crawled.html, the crawler is caught between the two rules. This confuses the crawler and it may not crawl that blog post.
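
One way to resolve this conflict is to make the Disallow rule more specific, so it no longer overlaps with the blog. In the sketch below the /downloads/ path is hypothetical:

User-agent: *
Allow: /blog/
Disallow: /downloads/*.html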

Add sitemap references in the robots.txt file

In addition to the instructions above, the robots.txt file is the location for a reference to your sitemap. This lets the crawler know where to find the sitemap. A sitemap contains all URLs of your website.

Always use an absolute URL (a fully written-out URL) for your sitemap reference. I also recommend submitting the sitemap in Google Search Console or Bing Webmaster Tools to ensure that the search engine processes the sitemap.

Example of a sitemap reference in a robots.txt file:

User-agent: *
Sitemap: https://josiennation.com/sitemap.xml

Multiple sitemap references in the robots.txt file

It is also possible to include multiple sitemap references in your robots.txt. You do this when you have different domains for your website or use different sitemaps. The well-known WordPress SEO plugin Yoast creates sitemaps for blogs, pages, categories, authors, and tags. Links to these different sitemaps should ideally be in your robots.txt.

You might host a blog on a subdomain. References to multiple sitemaps are possible, as long as you do this correctly. Use one line per sitemap reference. This way the crawler will recognize and visit the sitemap.

Example of multiple XML sitemap references

User-agent: *
Sitemap: https://josiennation.com/sitemap.xml
Sitemap: https://services.josiengalama.com/sitemap.xml

How do I create a robots.txt file?

Creating a robots.txt file is not difficult. Within WordPress, you can use various SEO plugins, like Yoast or RankMath. These plugins generate the file for you and an XML sitemap at the same time. After this, you manually add the sitemap reference to the robots.txt file.

If you want to create the robots.txt file yourself, you can do this with a plain text editor and upload it with an FTP program. In the file, you put all the instructions you want for your website. When the file is ready, upload it to the root of the website.
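
To give an idea of what such a self-made file can look like for a typical WordPress website, here is a sketch; the sitemap URL and paths are placeholders, so tailor the instructions to your own site:

# Block the WordPress back-end, keep admin-ajax.php accessible
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Reference to the XML sitemap
Sitemap: https://www.example.com/sitemap.xml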

Robots.txt Testers

Of course, you want your robots.txt to be correct and not to give contradictory instructions to the search engines. That’s why it’s wise to check the file when it’s online.

With the robots.txt Tester in Google Search Console or the Bing Webmaster Tools robots.txt Tester, you can easily check whether your robots.txt file is correct.

What should I pay attention to with a robots.txt file?

Every search engine handles the robots.txt differently. That is why there are several points that you should pay attention to.

The order of the statements in the robots.txt

The instructions in the robots.txt are read from top to bottom. Still, there are exceptions: Google and Bing, for example, give priority to the most specific instruction, and the longest matching instruction counts as the most specific. For example:

User-agent: *
Disallow: /about/
Allow: /about/josien/

In this example, the Allow instruction takes precedence over the Disallow, because the longer the instruction, the more specific it is. If you want to give multiple search engines separate instructions in the robots.txt, you also have to pay attention to the order, as shown in the sketch below.
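
Separate groups per crawler could look like this (the paths are hypothetical). Google and Bing each follow the group that names them specifically and ignore the generic group:

User-agent: Googlebot
Disallow: /internal-search/

User-agent: Bingbot
Disallow: /archive/

User-agent: *
Disallow: /wp-admin/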

Be specific when specifying Disallows and Allows

For all instructions, you should be as specific as possible. Only in this way do you give correct and clear instructions to the search engines. With Disallow you prohibit access to parts of the website, and with Allow you grant access to the parts you do want crawled.

Do not mix them up and make them as specific as possible. This prevents conflicting instructions.

Beware of conflicting instructions

Do not mix specific instructions and wildcards in a way that makes them overlap. Another common problem is giving instructions for all search engines and then conflicting instructions for a specific search engine. For example:

User-agent: *
Disallow: /about/
Disallow: /about/josien/

User-agent: GoogleBot
Allow: /about/

In this example, all search engines are denied access to /about/ and /about/josien/, but GoogleBot is later instructed to visit the page with /about/. This is a contradiction for Google and the algorithm will get confused by these instructions.

Create a separate robots.txt file for each domain

You must create and place a separate robots.txt for each domain. If you have a .com and .fr version, then also place a file per domain. This also applies to subdomains.

Google Search Console & robots.txt

In Google Search Console you indicate how Google should deal with your website, just as you do in the robots.txt. If these instructions differ and therefore conflict with each other, Google will follow the instructions from Google Search Console. So always check that what you set in Google Search Console matches what you put in the robots.txt.

Read this article to learn the 5 steps Google takes to process a page, and improve your website’s performance in the search engine.

Noindex tags and Allows in your robots.txt file

With a noindex tag, you can easily let a crawler know that the (search engine friendly) URL does not need to be indexed. Google warns against combining this tag with URLs that you disallow in the robots.txt, because a blocked crawler can never see the tag. I recommend that you follow this warning.
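
For reference, the noindex tag is placed in the <head> of the page itself, so the crawler must be able to reach the page to see it. It looks like this:

<meta name="robots" content="noindex">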

Maximum size of 500 KB

Google also states it supports a robots.txt of up to 500 KB. Any content of the robots.txt after 500 KB will be ignored. It is unclear which guidelines other search engines use for this.

Watch out for capital letters

URLs are case sensitive and so is your robots.txt: the paths in your instructions must match the capitalization of your URLs. Also, do not use capital letters in the file name itself; it must be robots.txt.

Post comments in your robots.txt

When you add comments in your robots.txt file, use the hash symbol (#). You place comments to indicate what an instruction is for. Search engines do not use these comments, so they are for your colleagues and yourself.

Example of comments:

#All user agents must be able to crawl these sitemaps
User-agent: *
#List of our sitemaps, add new sitemaps here
Sitemap: https://josiennation.com/sitemap.xml
Sitemap: https://blog.josiengalama.com/sitemap.xml
Sitemap: https://services.josiengalama.com/sitemap.xml

Frequently asked questions about the robots.txt

Does a robots.txt file prevent certain URLs from appearing in search results?
When a page contains a noindex tag but the crawler cannot access the page because of the robots.txt, the URL can still be indexed. The file does not guarantee that pages will not end up in the search results. Nevertheless, instructions in the robots.txt are important guidelines that search engines adhere to in most cases.

Which search engines use the robots.txt?

Almost all major search engines use the robots.txt files. These include Google, Bing, Yahoo, DuckDuckGo, Yandex, and Baidu. This file is therefore also an important part of international SEO.

What is crawl-delay in the robots.txt file?

It is possible to issue a crawl-delay instruction to the search engines in the robots.txt file. This prevents your servers from being overloaded by the requests.

Search engines can burden your server if you have a website with a lot of pages. In that case, adding this instruction is recommended. Ultimately, you will have to find a better hosting platform for your website, because the crawl-delay instruction is only a temporary solution.
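
A crawl-delay sketch could look like this; the value is the number of seconds the crawler should wait between requests. Note that Google ignores this instruction, but Bing, for example, does follow it:

User-agent: Bingbot
Crawl-delay: 10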

Reach out!

If you want to know more about me or what I can do for you, please reach out! You can message me directly here:
CONTACT >> 