How to write robots.txt
User-agent rules, Allow vs Disallow precedence, Sitemap directive, crawl-delay (or the lack thereof), and why noindex belongs in meta, not robots.txt.
A robots.txt file is a 1994-vintage plain-text protocol that still controls whether Googlebot, Bingbot, and a zoo of other crawlers touch your site. It lives at exactly one path — /robots.txt at your root domain — and a single typo can either leak your staging environment into search results or de-index your entire production site. The syntax looks trivial, but the rules around precedence, wildcards, and the difference between crawling and indexing trip up even experienced SEOs. This guide covers the full directive set (User-agent, Disallow, Allow, Sitemap, Crawl-delay), wildcard and anchor matching, why Noindex was moved out of robots.txt in 2019, and the common patterns that keep staging environments private without breaking production.
What robots.txt actually controls
Robots.txt tells well-behaved crawlers which URLs they may request. It does not enforce anything — malicious scrapers ignore it — and it does not directly stop a page from appearing in search results. A URL blocked in robots.txt can still be indexed if Google discovers it via external links; the search result simply has no snippet because the crawler never read the page body.
If you need a page to stay out of the index, use an HTTP X-Robots-Tag: noindex header or a <meta name="robots" content="noindex"> tag — and crucially, leave the URL crawlable so Google can see the directive.
File location and format
The file must be at https://example.com/robots.txt. Subdomains need their own — blog.example.com/robots.txt is separate from example.com/robots.txt. The content type should be text/plain and encoding UTF-8. A BOM at the start is tolerated but avoid it.
Maximum file size honored by Google is 500 KiB. Anything past that is truncated, and the truncation can land mid-rule. Keep real-world files well under 50 KiB.
User-agent: targeting specific crawlers
Every rule block starts with one or more User-agent lines. The value is a substring match against the crawler’s product token, case-insensitive.
User-agent: Googlebot
Disallow: /admin/

User-agent: Bingbot
Disallow: /admin/
Crawl-delay: 10

User-agent: *
Disallow: /private/
The * wildcard matches every crawler that has no more specific block. A crawler picks the single most specific matching group and obeys only that one — rules do not merge across groups. If Googlebot matches both its own block and the * block, it follows only the Googlebot block.
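The "pick one group, never merge" behavior can be sketched in a few lines. `select_group` is an illustrative helper, not a published API — it assumes the file has already been parsed into a dict of groups:

```python
def select_group(groups, product_token):
    """Pick the single most specific matching group, Google-style.

    groups: dict mapping a User-agent value -> list of rules for that group.
    The longest user-agent value contained in the crawler's product token
    wins; "*" is only the fallback. Groups are never merged.
    """
    token = product_token.lower()
    best = None
    for agent in groups:
        a = agent.lower()
        if a != "*" and a in token:
            if best is None or len(a) > len(best.lower()):
                best = agent
    return groups[best] if best is not None else groups.get("*")

groups = {
    "Googlebot": ["Disallow: /admin/"],
    "*": ["Disallow: /private/"],
}
# Googlebot obeys only its own block, so /private/ is NOT disallowed for it.
select_group(groups, "Googlebot")     # -> ["Disallow: /admin/"]
select_group(groups, "SomeOtherBot")  # -> ["Disallow: /private/"]
```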
Disallow and Allow
Disallow gives a path prefix the crawler must not request. Allow overrides a broader Disallow for a narrower path. Both are prefix matches starting from the root of the domain.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
When rules conflict, Google uses the longest-matching path rule. Above, a request to /wp-admin/admin-ajax.php matches the Allow rule (24 characters) more specifically than the Disallow rule (10 characters), so it is allowed.
A bare Disallow: with no path means “nothing is disallowed” — it is the way to open a section back up. Disallow: / blocks the entire site.
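Longest-match precedence is easy to get wrong when reading a file by eye, so here is a hedged sketch of the rule for plain prefixes (no wildcards). `is_allowed` is an illustrative helper, not a real library function:

```python
def is_allowed(rules, path):
    """Resolve Allow/Disallow for one group using longest-match precedence.

    rules: list of ('allow' | 'disallow', prefix) pairs.
    An empty prefix (bare 'Disallow:') matches nothing, and a path that
    matches no rule at all is allowed by default.
    """
    best_len, allowed = -1, True
    for kind, prefix in rules:
        if prefix and path.startswith(prefix) and len(prefix) > best_len:
            best_len, allowed = len(prefix), (kind == "allow")
    return allowed

rules = [("disallow", "/wp-admin/"), ("allow", "/wp-admin/admin-ajax.php")]
is_allowed(rules, "/wp-admin/options.php")     # False: only the Disallow matches
is_allowed(rules, "/wp-admin/admin-ajax.php")  # True: 24-char Allow beats 10-char Disallow
is_allowed(rules, "/blog/")                    # True: no rule matches
```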
Wildcards and end-of-URL anchor
Google, Bing, and most modern crawlers support two pattern characters beyond plain prefix matching:
* matches any sequence of characters. $ anchors the pattern to the end of the URL.
User-agent: *
Disallow: /*?sessionid=
Disallow: /*.pdf$
Allow: /public/*.pdf$
The first line blocks any URL containing ?sessionid=. The second blocks URLs ending in .pdf — without $, /file.pdf.html would also match. The Allow line then re-opens PDFs under /public/.
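One way to check what a wildcard rule actually matches is to translate it into a regular expression. `pattern_to_regex` below is an illustrative sketch (it handles only `*` and a trailing `$`, the two characters Google documents):

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled Python regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything, then turn the escaped \* back into regex ".*".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

pattern_to_regex("/*?sessionid=").search("/page?sessionid=abc")  # matches
pattern_to_regex("/*.pdf$").search("/docs/file.pdf")             # matches
pattern_to_regex("/*.pdf$").search("/file.pdf.html")             # no match
```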
Sitemap directive
Unlike the rule directives, Sitemap lines are global — they are not tied to any User-agent group and can appear anywhere in the file. Use absolute URLs.
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
You can list multiple sitemaps or point to a sitemap index. This is the only directive that actively tells crawlers where to find content — the rest only tell them where not to go.
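Because Sitemap lines can appear anywhere, tooling that audits a robots.txt has to scan the whole file for them. A minimal sketch (`extract_sitemaps` is an illustrative helper):

```python
def extract_sitemaps(robots_txt):
    """Collect Sitemap URLs from anywhere in a robots.txt string."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("sitemap:"):
            urls.append(line.split(":", 1)[1].strip())
    return urls

text = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
"""
extract_sitemaps(text)
# -> ['https://example.com/sitemap.xml', 'https://example.com/sitemap-news.xml']
```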
Crawl-delay — limited support
Crawl-delay: N asks the crawler to wait N seconds between requests. Bing and Yandex honor it; Google ignores it entirely (and has since retired the Search Console crawl-rate setting — to slow Googlebot down, return 429 or 503 responses). Baidu interprets it differently. In practice, scope it to the bots that respect it.
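Python's standard library parser does read Crawl-delay, which makes it handy for checking what a file declares for a given bot. A small sketch using `urllib.robotparser`:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""User-agent: Bingbot
Crawl-delay: 10
Disallow: /admin/

User-agent: *
Disallow: /private/
""".splitlines())

rp.crawl_delay("Bingbot")    # -> 10
rp.crawl_delay("Googlebot")  # -> None: the * group declares no delay
```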
Noindex is no longer supported here
Google stopped obeying Noindex: directives in robots.txt on September 1, 2019. If you still have Noindex: /thanks lines, move that control to a <meta name="robots" content="noindex"> tag or an X-Robots-Tag header and make sure the URL is crawlable. A page that is both noindex and robots-disallowed is the worst of both worlds: Google cannot see the noindex tag, so it may leave the URL in the index as a snippetless link.
Patterns for staging and preview
For staging subdomains (staging.example.com), serve:
User-agent: *
Disallow: /
Combine with HTTP Basic Auth or IP allowlisting — robots.txt alone is a courtesy, not a gate. If a competitor or scraper finds the staging URL, they will ignore your robots.txt and crawl it anyway.
For a production site with a private admin area and a search results page you do not want indexed:
User-agent: *
Disallow: /admin/
Disallow: /cart
Disallow: /search
Disallow: /*?utm_
Allow: /

Sitemap: https://example.com/sitemap.xml
The /*?utm_ line blocks the duplicate-content URLs that UTM-tagged inbound links create.
Testing your file
Google Search Console has a robots.txt report under Settings that shows the last-fetched copy, any parse errors, and the timestamp. For live testing against specific URLs, use the URL Inspection tool — it reports whether a given URL is blocked and which rule blocked it.
Before deploying, run the full file through a syntax checker that understands Google’s precedence rules. A misplaced Allow/Disallow can look harmless but silently block a whole section.
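For a quick local smoke test, the stdlib parser works — with an important caveat: `urllib.robotparser` implements neither Google's longest-match precedence nor the `*`/`$` wildcards, so treat its answers as a sanity check on plain prefix rules, not a Google-accurate verdict:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""User-agent: *
Disallow: /admin/
""".splitlines())

rp.can_fetch("Googlebot", "https://example.com/admin/users")  # -> False
rp.can_fetch("Googlebot", "https://example.com/blog/post")    # -> True
```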
Common mistakes
Blocking CSS and JS. Google needs to render pages to evaluate them. Disallowing /wp-content/ or /static/ can hide the styles and scripts Googlebot uses for layout, which hurts rankings. Leave asset directories crawlable.
Using robots.txt to hide sensitive URLs. The file is public. Listing /admin-secret-backup/ in a Disallow line is like putting a giant arrow on it. Use auth, not robots.txt, for security.
Expecting Disallow to remove pages from the index. Disallow stops crawling, not indexing. Already-indexed URLs can stay in search results for months as snippetless links. To remove a page, use noindex (and keep the URL crawlable) until Google processes it, then block.
Case-sensitive path mismatch. Paths are case-sensitive. Disallow: /Admin/ does not block /admin/. Match your actual URL casing.
Forgetting subdomain scope. Uploading robots.txt to the root does nothing for cdn.example.com. Each subdomain that serves HTTP needs its own file.
Trailing-slash surprises. Disallow: /foo blocks /foo, /foo/, /foobar, and /foo.html. If you meant only the folder, write Disallow: /foo/.
Run the numbers
Draft and validate a production-ready file with the robots.txt generator. Pair with the sitemap URL generator so the Sitemap: line you add actually points at a well-formed file, and the URL parser to verify the exact path shape your rules will match against.