Robots.txt is a text file that websites use to communicate with web crawlers and other web robots. The standard specifies how to tell crawlers which pages of a website should be crawled and which parts should not. Most search engines recognise and obey robots.txt requests.
Basic format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
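As a minimal sketch of how these two lines are read, Python's standard urllib.robotparser module can parse them and report whether a URL may be crawled. The bot name ExampleBot, the /private/ path, and the example.com URLs are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules in the basic format: one User-agent line, one Disallow line.
rules = """\
User-agent: ExampleBot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # parse from text, no network access needed

# A URL under the disallowed path is refused; any other URL is permitted.
private_ok = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")
home_ok = parser.can_fetch("ExampleBot", "https://example.com/index.html")
print(private_ok, home_ok)
```

Parsing from a string keeps the sketch self-contained; a real crawler would download the file from the site first.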
Importance of robots.txt
Google can usually find and index all of the important pages on your site, and it will usually not index pages that are unimportant or duplicates of other pages. However, there are three main reasons you might want to use a robots.txt file.
Block Non-Public Pages: Sometimes you have pages on your site that you don’t want indexed. For example, you might have a staging version of a page, or a login page. These pages need to exist, but you don’t want random people landing on them. This is a case where you’d use robots.txt to block these pages from search engine crawlers and bots.
Maximize Crawl Budget: If you’re having a tough time getting all of your pages indexed, you might have a crawl budget problem. If you block unimportant pages with robots.txt, Googlebot can spend more of your crawl budget on the pages that actually matter.
Prevent Indexing of Resources: Using meta directives can work just as well as robots.txt for preventing pages from getting indexed. However, meta directives don’t work well for multimedia resources, like PDFs and images. For those resources, robots.txt is the better tool.
Best practices for creating a robots.txt file
Since it is a text file, you can create a robots.txt file in a plain-text editor such as Notepad. The format is the same in every case:
User-agent: X
Disallow: Y
User-agent is the specific bot you are addressing, and everything that comes after “Disallow” is a page or section that you want to block.
For example:
User-agent: googlebot
Disallow: /images
This rule tells Googlebot not to crawl the images folder of your website. You can also use an asterisk (*) to address every bot that visits your website.
For example:
User-agent: *
Disallow: /images
The “*” tells all spiders to NOT crawl your images folder.
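The difference between the two examples can be checked with a short sketch using Python's standard urllib.robotparser. The helper can_crawl, the bot name OtherBot, and the example.com URL are assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(rules: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)

# The two rule sets from the examples above.
googlebot_only = "User-agent: googlebot\nDisallow: /images\n"
all_bots = "User-agent: *\nDisallow: /images\n"

image_url = "https://example.com/images/logo.png"  # hypothetical URL

googlebot_blocked = not can_crawl(googlebot_only, "Googlebot", image_url)
other_unaffected = can_crawl(googlebot_only, "OtherBot", image_url)
other_blocked_by_star = not can_crawl(all_bots, "OtherBot", image_url)
print(googlebot_blocked, other_unaffected, other_blocked_by_star)
```

With the first rule set only Googlebot is kept out of /images; with the asterisk, a bot with any name is kept out.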
A search engine has two main jobs: crawling web pages to discover content, and indexing that content so it can be served to users searching for information.
To crawl sites, search engines follow links to get from one site to another, ultimately crawling across many billions of links and websites. This crawling behavior is sometimes known as spidering. After arriving at a website but before spidering it, the search crawler looks for a robots.txt file. If it finds one, the crawler reads that file before continuing through the site. Because the robots.txt file contains information about how the search engine should crawl, the directives found there govern further crawler action on that site. If the robots.txt file does not contain any directives that disallow a user-agent’s activity (or if the site doesn’t have a robots.txt file), the crawler proceeds to crawl the rest of the site.
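The order of operations described above could be sketched roughly as follows. This is not how any particular search engine is implemented; the function plan_crawl, the bot name ExampleBot, and the URLs are hypothetical, and the robots.txt content is passed in as text (or None when the site has no file) so the sketch runs without network access.

```python
from typing import List, Optional
from urllib.robotparser import RobotFileParser

def plan_crawl(robots_txt: Optional[str], agent: str, urls: List[str]) -> List[str]:
    """Return the URLs `agent` may crawl, consulting robots.txt first."""
    if robots_txt is None:
        # The site has no robots.txt file, so nothing is disallowed.
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Keep only the URLs that the directives do not disallow for this agent.
    return [url for url in urls if parser.can_fetch(agent, url)]

candidate_urls = [
    "https://example.com/",
    "https://example.com/staging/new-page.html",
]
site_rules = "User-agent: *\nDisallow: /staging/\n"

with_rules = plan_crawl(site_rules, "ExampleBot", candidate_urls)
without_file = plan_crawl(None, "ExampleBot", candidate_urls)
print(with_rules)
print(without_file)
```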
Things to remember
There are a few things to remember before adding a robots.txt file. In order to be found, a robots.txt file must be placed in a website’s top-level directory. The file name is case sensitive: it must be named “robots.txt”, not “Robots.txt” or “robots.TXT”. Some robots may choose to ignore your robots.txt file. This is especially common with malicious crawlers such as malware robots or email address scrapers.
The robots.txt file is publicly available: add /robots.txt to the end of any root domain to see that website’s directives. This means that anyone can see which pages you do or don’t want crawled, so don’t use it to hide private user information.
Each subdomain on a root domain uses its own robots.txt file. This means that both blog.example.com and example.com should have their own robots.txt files (blog.example.com/robots.txt and example.com/robots.txt). It’s generally a best practice to indicate the location of any sitemaps associated with this domain at the bottom of the robots.txt file.
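Because the file always sits at the top level of each host, its location can be derived from any page URL. The following sketch shows one way to do that with Python's standard urllib.parse; the function name robots_url and the page URLs are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the host that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

blog_robots = robots_url("https://blog.example.com/posts/first-post?ref=home")
root_robots = robots_url("https://example.com/about")
print(blog_robots)
print(root_robots)
```

The two results differ, which reflects the point above: the blog subdomain and the root domain are each governed by their own file.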
Common language (terms) of robots.txt
User agent: The specific web crawler to which you’re giving crawl instructions
Disallow: The command used to tell a user-agent not to crawl a particular URL. Only one “Disallow:” line is allowed for each URL.
Allow: The command that tells a crawler it can access a page or subfolder even though its parent page or subfolder is disallowed. It was not part of the original robots.txt convention, but Googlebot and other major crawlers support it.
Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.
Sitemap: Used to call out the location of any XML sitemaps associated with this site. Note that this command is only supported by Google, Bing, and Yahoo.
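To see these terms working together, here is a small sketch that reads all five with Python's standard urllib.robotparser (the site_maps method needs Python 3.8 or later). The bot name ExampleBot, the paths, the 10-second delay, and the sitemap URL are hypothetical. One assumption worth flagging: Python's parser applies the first rule that matches, so the Allow line is placed before the broader Disallow line; other crawlers may resolve such conflicts differently, for example by preferring the most specific rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using every term from the list above.
rules = """\
User-agent: ExampleBot
Allow: /images/public/
Disallow: /images/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Allow carves an exception out of the disallowed parent folder.
public_ok = parser.can_fetch("ExampleBot", "https://example.com/images/public/logo.png")
private_ok = parser.can_fetch("ExampleBot", "https://example.com/images/raw/photo.png")

delay = parser.crawl_delay("ExampleBot")  # seconds to wait between requests
sitemaps = parser.site_maps()             # list of sitemap URLs, or None

print(public_ok, private_ok, delay, sitemaps)
```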