Robots.txt Explained
robots.txt is a text file at the root of your domain that tells crawlers which pages they are allowed to visit. It's a simple but powerful tool for controlling crawl behaviour. However, many people misunderstand what it does and doesn't do.
What Is robots.txt?
robots.txt is a plain text file located at the root of your domain (https://example.com/robots.txt) that contains rules for crawlers. It uses a simple syntax to specify which paths crawlers should avoid.
A basic example:
```
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /private/public/
Sitemap: https://example.com/sitemap.xml
```
This tells all crawlers: "Don't crawl /admin/ or /private/, except for /private/public/. Here's my sitemap."
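Rules like these can be checked locally with Python's standard-library robots.txt parser. One caveat, noted in the comments: `urllib.robotparser` applies rules in file order (first match wins), whereas Google uses longest-path matching, so the `Allow` line is listed first in this sketch so both interpretations give the same answer. The URLs are illustrative:

```python
from urllib import robotparser

# Note: urllib.robotparser applies rules in file order (first match wins),
# unlike Google's longest-path matching, so the Allow line comes first here.
rules = """\
User-agent: *
Allow: /private/public/
Disallow: /admin/
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/settings"))    # False
print(rp.can_fetch("*", "https://example.com/private/notes"))     # False
print(rp.can_fetch("*", "https://example.com/private/public/a"))  # True
print(rp.can_fetch("*", "https://example.com/blog/post"))         # True
```

The same `can_fetch` call works against a live site if you use `rp.set_url(...)` and `rp.read()` instead of `rp.parse(...)`.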
Common Robots.txt Directives
| Directive | Function | Example |
|---|---|---|
| User-agent | Specifies which crawler(s) this rule applies to. * means all crawlers. | User-agent: Googlebot |
| Disallow | Tells crawlers not to crawl paths matching the rule. Empty value allows crawling. | Disallow: /admin/ |
| Allow | Explicitly allows crawling of a path (used to override a broader disallow rule). | Allow: /admin/public/ |
| Crawl-delay | Specifies seconds to wait between requests (not officially standard, ignored by Google). | Crawl-delay: 1 |
| Sitemap | Tells crawlers where to find your XML sitemap(s). | Sitemap: https://example.com/sitemap.xml |
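Putting the directives together, a file might look like the sketch below (paths and hostnames are illustrative). Note that a crawler obeys only the single group that best matches its user-agent, so Googlebot here follows the first group and ignores the second:

```
# Group for Googlebot only
User-agent: Googlebot
Disallow: /search/

# Group for every other crawler
User-agent: *
Disallow: /admin/
Crawl-delay: 1

# Sitemap lines apply to the whole file, outside any group
Sitemap: https://example.com/sitemap.xml
```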
What robots.txt Does (and Doesn't) Do
What It Does
- Prevents crawling. If a path is disallowed, compliant crawlers will not request those URLs, saving server resources and crawl budget.
- Directs crawlers. You can specify crawl delays or point to sitemaps.
What It Does NOT Do
- Does not prevent indexation. A page blocked by robots.txt can still be indexed if another site links to it. Google has confirmed this. So blocking a URL in robots.txt does not guarantee it won't appear in search results.
- Does not prevent discovery. Crawlers can discover URLs blocked by robots.txt (via links from other sites, in sitemaps, etc.) — they just won't crawl them.
- Is not a security tool. Never use robots.txt to hide sensitive pages; the file is publicly readable, so it actually advertises their locations. Use authentication or password protection instead.
- Is not universally respected. Most search engines respect robots.txt, but it is purely advisory; malicious bots and scrapers ignore it entirely.
Common Mistakes with robots.txt
Blocking CSS and JavaScript Files
A frequent mistake is disallowing /css/ or /js/ in robots.txt. Google fetches these files to render your pages the way a browser does, which matters especially for JavaScript-heavy sites. If you block CSS and JS, Google cannot fully render your pages, and rankings can suffer. Allow crawlers to access your CSS and JavaScript.
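For example (with /assets/ as a hypothetical path), the two groups below are alternative versions of the same file, not one file. The first blocks rendering resources; the second keeps a broad disallow but explicitly re-allows them:

```
# Problematic: Google cannot fetch the stylesheets and scripts
# it needs to render the page.
User-agent: *
Disallow: /assets/

# Better: keep the broad rule, but re-allow rendering resources.
User-agent: *
Disallow: /assets/
Allow: /assets/css/
Allow: /assets/js/
```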
Accidentally Disallowing Everything
Using Disallow: / blocks all crawlers from all paths; do it by mistake and your entire site stops being crawled. Some developers add it on staging sites (correctly) but forget to remove it before going live (disaster). Test robots.txt on staging, and verify the live file immediately after deploying.
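One way to guard against shipping a staging robots.txt is an automated sanity check. A minimal sketch using Python's stdlib parser, with the file contents inlined for illustration (in practice you would fetch the deployed file):

```python
from urllib import robotparser

def site_is_crawlable(robots_txt: str) -> bool:
    """Return True if the homepage is crawlable for all user agents."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch("*", "https://example.com/")

# A staging file that blocks everything:
staging = "User-agent: *\nDisallow: /\n"
# A production file that blocks only the admin area:
production = "User-agent: *\nDisallow: /admin/\n"

print(site_is_crawlable(staging))     # False
print(site_is_crawlable(production))  # True
```

A check like this fits naturally into a post-deploy smoke test that fails the build if the homepage is blocked.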
Overly Complicated Rules
robots.txt syntax is simple, but you can make it unnecessarily complex. Keep rules clear and minimal. Disallow only what you truly don't want crawled.
robots.txt vs noindex: When to Use Each
Use robots.txt when you want to save crawl budget by preventing crawlers from accessing pages you don't care about. Example: admin pages, login pages, or duplicate filter URLs.
Use noindex when you want crawlers to visit a page (to understand its content) but not include it in the search index. Example: an internal search results page or a draft article.
Use 301 redirects when you're moving or consolidating pages permanently and want to pass link equity.
Use authentication when you need real security (sensitive admin areas).
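The noindex option above can be delivered in the page's HTML; the page must remain crawlable, since Google has to fetch it to see the directive:

```html
<!-- In the page's <head>: keep the page crawlable, but out of the index -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the same directive can be sent instead as an `X-Robots-Tag: noindex` HTTP response header.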
Testing Your robots.txt
Google Search Console provides a robots.txt report under Settings. It shows the robots.txt files Google found for your site, when each was last crawled, and any parsing errors or warnings. (The older standalone robots.txt Tester, which checked individual URLs against your rules, has been retired.) To check whether a specific URL is blocked, use the URL Inspection tool, which reports whether crawling is allowed. Verify your rules are working as intended before deploying.
The Crawl Budget Benefit
For very large sites, a well-configured robots.txt prevents Googlebot from wasting crawl budget on low-value pages, such as filter parameters that generate thousands of URL combinations. If you disallow parameter combinations that don't matter, you free up crawl budget for the pages that do rank.
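Google and most major engines support the * wildcard and the $ end-of-URL anchor in path rules, which makes parameter URLs straightforward to block. A hypothetical sketch (the parameter names are placeholders):

```
User-agent: *
# Block faceted-navigation URLs such as /shoes?sort=price&filter=red
Disallow: /*sort=
Disallow: /*filter=
# Block only URLs that end in .pdf
Disallow: /*.pdf$
```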