In this post, we will discuss robots.txt and its usage.
What is robots.txt?
It is part of the Robots Exclusion Protocol (REP): an optional file located at the root directory of your website that tells search engines and other web robots (crawlers) about the crawling preferences and instructions of the website. In general, most popular, well-behaved crawlers follow these preferences, but that is not always the case with malicious bots.
We can also specify the location of the master sitemap file in robots.txt. Search engine robots use it to find and crawl the web pages listed in it.
What are the advantages of robots.txt?
- It tells search engines and other robots that follow REP how to crawl the site's web pages.
- It allows or disallows robots from crawling specific pages of the site for indexing.
- It can prevent crawling and indexing of duplicate pages on your website, such as the print version of a page.
- A proper robots.txt file can have a considerable impact on the site's SEO.
What directives are allowed within the robots.txt file?
- User-agent: It accepts either * to apply the rules to all robots, or a robot name to apply them to a specific robot.
- Disallow: It takes the path of a resource (page, directory or file) to block from crawling. If left blank, it allows crawling the whole site.
- Allow: It overrides the Disallow: instructions so that robots can crawl specific parts of an otherwise disallowed area. It was not part of the original protocol, but it is included in the current standard (RFC 9309) and is supported by major crawlers such as Googlebot. Some older or simpler robots may ignore it.
- Sitemap: It takes the location of the sitemap that lists the site's links.
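To make the role of each directive concrete, here is a minimal Python sketch that groups the lines of a robots.txt file by user agent. The function name `parse_robots`, the sample file and the `example.com` URL are hypothetical, and the parsing is deliberately simplified (no wildcards, no validation), so treat it as an illustration rather than a complete parser.

```python
def parse_robots(text):
    """Group robots.txt directives by user agent (simplified sketch)."""
    groups = []    # each group: {"agents": [...], "rules": [(directive, path), ...]}
    sitemaps = []  # Sitemap lines are independent of any user-agent group
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Consecutive User-agent lines share one group; a User-agent line
            # that follows rules starts a new group.
            if current is None or current["rules"]:
                current = {"agents": [], "rules": []}
                groups.append(current)
            current["agents"].append(value)
        elif field in ("allow", "disallow") and current is not None:
            current["rules"].append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)
    return groups, sitemaps


# Hypothetical sample file used only for illustration.
SAMPLE = """\
User-agent: *
Disallow: /private/
Allow: /private/help.html
Sitemap: https://example.com/sitemap.xml
"""

groups, sitemaps = parse_robots(SAMPLE)
```

Here `groups` holds one entry for the `*` user agent with its two rules, and `sitemaps` holds the single sitemap URL.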
How to use robots.txt?
In this section, we will discuss the most common patterns for using robots.txt.
User-agent: *
Disallow: /
It's useful in cases where webmasters do not want their sites crawled for indexing. Possible scenarios include new sites, private sites, and sites under construction.
User-agent: *
Disallow:
It's the simplest form, allowing robots to crawl the entire site without any restrictions.
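The difference between these two patterns can be checked with Python's standard `urllib.robotparser` module. This is only a sketch: the helper `can_crawl`, the robot name `examplebot` and the URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_lines, url, agent="examplebot"):
    """Return True if the given rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse() accepts an iterable of lines
    return parser.can_fetch(agent, url)


BLOCK_ALL = ["User-agent: *", "Disallow: /"]  # nothing may be crawled
ALLOW_ALL = ["User-agent: *", "Disallow:"]    # an empty Disallow blocks nothing

URL = "https://example.com/any/page.html"     # hypothetical page

blocked_result = can_crawl(BLOCK_ALL, URL)
open_result = can_crawl(ALLOW_ALL, URL)
```

With the first rule set the page is refused, and with the second it is allowed.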
Though we can place the sitemap.xml file at any publicly accessible location, it's good practice to place it at the web root of the website. We can even give it a different name, and the main sitemap can link to other sitemaps.
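As a rough illustration of how a crawler finds the sitemap, the sketch below reads the Sitemap entries from a robots.txt file using `urllib.robotparser` (its `site_maps()` method needs Python 3.8 or later). The file content and the `example.com` URL are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content with a sitemap at the web root.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when no Sitemap line is present,
# so fall back to an empty list.
sitemap_urls = parser.site_maps() or []
```

The Sitemap line is independent of the User-agent groups, so it can appear anywhere in the file.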
Allow Single Directory
User-agent: *
Disallow: /
Allow: /blog/
We can instruct the robots to crawl only the pages available in a single directory. Each line can specify only one directory, though we can allow the robots to crawl multiple directories by adding one Allow line per directory. Note that Allow was not part of the original protocol. Major crawlers such as Googlebot support it, but some other robots may ignore it.
Allow Single Page
User-agent: *
Disallow: /
Allow: /index.html
Allow: /about-us.html
Similar to the single directory, we can allow the crawlers to crawl only specific pages. Each line can name only one page, though multiple entries are possible, as shown above.
We can also mix directories and pages as shown below, keeping one rule per line.
User-agent: *
Disallow: /
Allow: /index.html
Allow: /blog/
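When Allow and Disallow rules both match a path, crawlers that follow the current standard pick the most specific (longest) matching rule, with Allow winning a tie. The sketch below is a simplified version of that idea, assuming plain path prefixes without wildcards. The function `is_allowed` and the test paths are hypothetical.

```python
def is_allowed(rules, path):
    """Longest-match evaluation of (directive, prefix) rules for one path."""
    best_len = -1
    allowed = True  # if no rule matches, crawling is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value matches nothing
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and directive == "allow"
        if longer or tie_for_allow:
            best_len = len(prefix)
            allowed = directive == "allow"
    return allowed


# The mixed directory-and-page pattern from this section.
RULES = [("disallow", "/"), ("allow", "/index.html"), ("allow", "/blog/")]

blog_ok = is_allowed(RULES, "/blog/post-1.html")
index_ok = is_allowed(RULES, "/index.html")
contact_ok = is_allowed(RULES, "/contact.html")
```

Parsers that evaluate rules strictly in file order can disagree with this result, so it is worth testing a rule set against the crawlers that matter to you.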
Selective User Agents
User-agent: *
Disallow: /
Allow: /index.html

User-agent: Googlebot
Allow: /

User-agent: discobot
Disallow: /
The above example allows all robots to crawl only the landing page of the website but, at the same time, allows Googlebot to crawl any part of the website. It also shows how to completely disallow discobot from crawling the website.
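To see how a robot picks its own group, here is a small sketch using Python's standard `urllib.robotparser` with the rules from this section. The robot name `examplebot` and the URL are made up. The landing page is deliberately not queried, because Allow/Disallow precedence inside one group varies between parsers.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /
Allow: /index.html

User-agent: Googlebot
Allow: /

User-agent: discobot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

PAGE = "https://example.com/blog/post-1.html"  # hypothetical inner page

# Each robot uses the group that names it, falling back to the * group.
results = {
    agent: parser.can_fetch(agent, PAGE)
    for agent in ("Googlebot", "discobot", "examplebot")
}
```

Googlebot matches its own group and may fetch the page, discobot is blocked by its group, and the unnamed `examplebot` falls back to the `*` group, which blocks inner pages.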