Robots.txt Explained
robots.txt is a text file at the root of your domain that tells crawlers which pages they are allowed to visit. It's a simple but powerful tool for controlling crawl behaviour. However, many people misunderstand what it does and doesn't do.
What Is robots.txt?
robots.txt is a plain text file located at the root of your domain (https://example.com/robots.txt) that contains rules for crawlers. It uses a simple syntax to specify which paths crawlers should avoid.
A basic example:
```
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /private/public/
Sitemap: https://example.com/sitemap.xml
```
This tells all crawlers: "Don't crawl /admin/ or /private/, except for /private/public/. Here's my sitemap."
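Rules like these can be checked locally with Python's standard-library robots.txt parser. One caveat, noted in the comments: `urllib.robotparser` applies rules in file order (first match wins), whereas Google uses longest-path matching, so the `Allow` line is listed first in this sketch so both interpretations give the same answer. The URLs are illustrative:

```python
from urllib import robotparser

# Note: urllib.robotparser applies rules in file order (first match wins),
# unlike Google's longest-path matching, so the Allow line comes first here.
rules = """\
User-agent: *
Allow: /private/public/
Disallow: /admin/
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/settings"))    # False
print(rp.can_fetch("*", "https://example.com/private/notes"))     # False
print(rp.can_fetch("*", "https://example.com/private/public/a"))  # True
print(rp.can_fetch("*", "https://example.com/blog/post"))         # True
```

The same `can_fetch` call works against a live site if you use `rp.set_url(...)` and `rp.read()` instead of `rp.parse(...)`.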
Common Robots.txt Directives
| Directive | Function | Example |
|---|---|---|
| User-agent | Specifies which crawler(s) this rule applies to. * means all crawlers. | User-agent: Googlebot |
| Disallow | Tells crawlers not to crawl paths matching the rule. Empty value allows crawling. | Disallow: /admin/ |
| Allow | Explicitly allows crawling of a path (used to override a broader disallow rule). | Allow: /admin/public/ |
| Crawl-delay | Specifies seconds to wait between requests (not officially standard, ignored by Google). | Crawl-delay: 1 |
| Sitemap | Tells crawlers where to find your XML sitemap(s). | Sitemap: https://example.com/sitemap.xml |
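Putting the directives together, a file might look like the sketch below (paths and hostnames are illustrative). Note that a crawler obeys only the single group that best matches its user-agent, so Googlebot here follows the first group and ignores the second:

```
# Group for Googlebot only
User-agent: Googlebot
Disallow: /search/

# Group for every other crawler
User-agent: *
Disallow: /admin/
Crawl-delay: 1

# Sitemap lines apply to the whole file, outside any group
Sitemap: https://example.com/sitemap.xml
```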
What robots.txt Does (and Doesn't) Do
What It Does
- Prevents crawling. If a path is disallowed, compliant crawlers will not request those URLs, saving server resources and crawl budget.
- Directs crawlers. You can specify crawl delays or point to sitemaps.
What It Does NOT Do
- Does not prevent indexation. A page blocked by robots.txt can still be indexed if another site links to it. Google has confirmed this. So blocking a URL in robots.txt does not guarantee it won't appear in search results.
- Does not prevent discovery. Crawlers can discover URLs blocked by robots.txt (via links from other sites, in sitemaps, etc.) — they just won't crawl them.
- Is not a security tool. Never use robots.txt to hide sensitive pages; the file is publicly readable, so it actually advertises their locations. Use authentication or password protection instead.
- Is not universally respected. Most search engines respect robots.txt, but it is purely advisory; malicious bots and scrapers ignore it entirely.
Common Mistakes with robots.txt
Blocking CSS and JavaScript Files
A frequent mistake is disallowing /css/ or /js/ in robots.txt. Google fetches these files to render your pages the way a browser does, which matters especially for JavaScript-heavy sites. If you block CSS and JS, Google cannot fully render your pages, and rankings can suffer. Allow crawlers to access your CSS and JavaScript.
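For example (with /assets/ as a hypothetical path), the two groups below are alternative versions of the same file, not one file. The first blocks rendering resources; the second keeps a broad disallow but explicitly re-allows them:

```
# Problematic: Google cannot fetch the stylesheets and scripts
# it needs to render the page.
User-agent: *
Disallow: /assets/

# Better: keep the broad rule, but re-allow rendering resources.
User-agent: *
Disallow: /assets/
Allow: /assets/css/
Allow: /assets/js/
```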
Accidentally Disallowing Everything
Using Disallow: / blocks all crawlers from all paths; do it by mistake and your entire site stops being crawled. Some developers add it on staging sites (correctly) but forget to remove it before going live (disaster). Test robots.txt on staging, and verify the live file immediately after deploying.
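One way to guard against shipping a staging robots.txt is an automated sanity check. A minimal sketch using Python's stdlib parser, with the file contents inlined for illustration (in practice you would fetch the deployed file):

```python
from urllib import robotparser

def site_is_crawlable(robots_txt: str) -> bool:
    """Return True if the homepage is crawlable for all user agents."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch("*", "https://example.com/")

# A staging file that blocks everything:
staging = "User-agent: *\nDisallow: /\n"
# A production file that blocks only the admin area:
production = "User-agent: *\nDisallow: /admin/\n"

print(site_is_crawlable(staging))     # False
print(site_is_crawlable(production))  # True
```

A check like this fits naturally into a post-deploy smoke test that fails the build if the homepage is blocked.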
Overly Complicated Rules
robots.txt syntax is simple, but you can make it unnecessarily complex. Keep rules clear and minimal. Disallow only what you truly don't want crawled.
robots.txt vs noindex: When to Use Each
Use robots.txt when you want to save crawl budget by preventing crawlers from accessing pages you don't care about. Example: admin pages, login pages, or duplicate filter URLs.
Use noindex when you want crawlers to visit a page (to understand its content) but not include it in the search index. Example: an internal search results page or a draft article.
Use 301 redirects when you're moving or consolidating pages permanently and want to pass link equity.
Use authentication when you need real security (sensitive admin areas).
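The noindex option above can be delivered in the page's HTML; the page must remain crawlable, since Google has to fetch it to see the directive:

```html
<!-- In the page's <head>: keep the page crawlable, but out of the index -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the same directive can be sent instead as an `X-Robots-Tag: noindex` HTTP response header.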
Testing Your robots.txt
Google Search Console provides a robots.txt report under Settings. It shows the robots.txt files Google found for your site, when each was last crawled, and any parsing errors or warnings. (The older standalone robots.txt Tester, which checked individual URLs against your rules, has been retired.) To check whether a specific URL is blocked, use the URL Inspection tool, which reports whether crawling is allowed. Verify your rules are working as intended before deploying.
The Crawl Budget Benefit
For very large sites, a well-configured robots.txt prevents Googlebot from wasting crawl budget on low-value pages, such as filter parameters that generate thousands of URL combinations. If you disallow parameter combinations that don't matter, you free up crawl budget for the pages that do rank.
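Google and most major engines support the * wildcard and the $ end-of-URL anchor in path rules, which makes parameter URLs straightforward to block. A hypothetical sketch (the parameter names are placeholders):

```
User-agent: *
# Block faceted-navigation URLs such as /shoes?sort=price&filter=red
Disallow: /*sort=
Disallow: /*filter=
# Block only URLs that end in .pdf
Disallow: /*.pdf$
```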