Understand the robots.txt file of a website. What is it, and what functions does it perform?
So you'd like to understand the robots.txt file? Think of it as a traffic inspector: it permits traffic on some roads and blocks it on others. In the same way, this file tells search engine bots which sections of your website they may crawl and which sections they may not.
Search Engine Optimization and Robots.txt
What are search engine bots?: These are the bots that read data from a website or webpage and transfer it to their search engine's database, such as Google, Bing, Yandex, or others. Suppose you create a website and continuously write articles on it; these bots crawl your pages and add them to the search index.
Internal backlinks and robots.txt: Suppose you're writing an article and, within it, you link to an internal page, perhaps a category or label page, that is blocked by robots.txt. That internal link is dofollow, yet at the same time you're disallowing the target page through the permissions set in the robots.txt file. So the best practice for internal pages that shouldn't be indexed but should still be crawlable is to noindex those pages, not to disallow them in the robots.txt file.
What pages should be blocked by the robots.txt file: It should block only sensitive pages, such as the admin section of your website or blog. All other pages that create junk or duplicate content should be noindexed using the proper meta tag or the X-Robots-Tag HTTP header.
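As a sketch of the "noindex, don't disallow" practice above, a page you want crawled but kept out of the index could carry a robots meta tag in its HTML head (for non-HTML files such as PDFs, the X-Robots-Tag response header serves the same purpose):

```html
<!-- Inside <head>: keep this page out of the search index,
     but still let bots follow the links on it -->
<meta name="robots" content="noindex, follow">

<!-- Equivalent for non-HTML responses, sent as an HTTP header:
     X-Robots-Tag: noindex -->
```

Unlike a robots.txt Disallow rule, a bot must be able to fetch the page to see this tag, which is exactly why the page must not also be blocked in robots.txt.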
Links from external resources will be blocked: Suppose someone gives you a backlink to the category section of your website, and that section is disallowed in robots.txt. The crawler will not be able to follow that link into your website, and a hard-earned backlink will be wasted.
Robots.txt syntax used
User-agent: This declares the bot or web crawler to which we're giving instructions, controlling its access to various sections through the Allow and Disallow directives.
Disallow: Sections of the website that the bot (usually a search engine crawler) is not allowed to crawl.
Allow: Usually used for Googlebot, this permits crawling of specific sections of a website. For example, it may allow a subfolder whose parent folder is disallowed.
Sitemap: This declares the location of the sitemap of the website or blog. This directive is supported by Google, Ask, Bing, and Yahoo.
robots.txt file example:
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /*/junk/*
Disallow: /search
Allow: /

Sitemap: https://example.com/sitemap.xml
In the above robots.txt example, User-agent: Mediapartners-Google declares instructions for Google AdSense. The instruction that follows is Disallow: with no value, meaning nothing is disallowed, so the AdSense crawler can crawl your whole website and display ads.
The next block begins with User-agent: *, which gives instructions to all bots and crawlers other than Google AdSense. Disallow: /*/junk/* blocks any subfolder named "junk" under any parent folder, and Disallow: /search blocks the parent folder "search". Allow: / permits the rest of the website to be crawled. Sitemap: https://example.com/sitemap.xml gives the location of the sitemap on the domain.
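You can check how these rules behave using Python's standard-library robots.txt parser. A minimal sketch, adapted from the example above (example.com is a placeholder domain; note that urllib.robotparser does plain prefix matching and does not understand the "*" wildcard, so the /*/junk/* pattern is omitted here):

```python
from urllib.robotparser import RobotFileParser

# Rules adapted from the article's example robots.txt.
rules = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# AdSense's crawler has an empty Disallow, so it may crawl everything.
print(parser.can_fetch("Mediapartners-Google", "https://example.com/search"))  # True

# Every other bot is blocked from /search but allowed elsewhere.
print(parser.can_fetch("SomeOtherBot", "https://example.com/search"))          # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/blog/post.html"))  # True
```

This is a quick way to sanity-check a robots.txt file before deploying it, instead of waiting to see which pages crawlers actually visit.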