Free Tool Arena


How to write robots.txt

User-agent rules, Allow vs Disallow precedence, Sitemap directive, crawl-delay (or the lack thereof), and why noindex belongs in meta, not robots.txt.

Updated April 2026 · 6 min read

A robots.txt file is a 1994-vintage plain-text protocol that still controls whether Googlebot, Bingbot, and a zoo of other crawlers touch your site. It lives at exactly one path — /robots.txt at your root domain — and a single typo can either leak your staging environment into search results or de-index your entire production site. The syntax looks trivial, but the rules around precedence, wildcards, and the difference between crawling and indexing trip up even experienced SEOs. This guide covers the full directive set (User-agent, Disallow, Allow, Sitemap, Crawl-delay), wildcard and anchor matching, why Noindex was moved out of robots.txt in 2019, and the common patterns that keep staging environments private without breaking production.


What robots.txt actually controls

Robots.txt tells well-behaved crawlers which URLs they may request. It does not enforce anything — malicious scrapers ignore it — and it does not directly stop a page from appearing in search results. A URL blocked in robots.txt can still be indexed if Google discovers it via external links; the search result simply has no snippet because the crawler never read the page body.

If you need a page to stay out of the index, use an HTTP X-Robots-Tag: noindex header or a <meta name="robots" content="noindex"> tag — and crucially, leave the URL crawlable so Google can see the directive.
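As a rough self-check of those two signals, something like the following works (has_noindex is a hypothetical helper, not a library API; it assumes the name attribute precedes content, and a production check would use a real HTML parser and handle bot-specific meta names like googlebot):

```python
import re

def has_noindex(headers, html):
    # Signal 1: an X-Robots-Tag response header, e.g. "noindex, nofollow"
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return True
    # Signal 2: a robots meta tag in the page body
    # (simplified: assumes name= comes before content=)
    meta = re.search(
        r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    return bool(meta and "noindex" in meta.group(1).lower())

has_noindex({"X-Robots-Tag": "noindex, nofollow"}, "")     # True
has_noindex({}, '<meta name="robots" content="noindex">')  # True
```

Either signal alone is enough; the header form is handy for non-HTML resources like PDFs, where a meta tag is impossible.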

File location and format

The file must be at https://example.com/robots.txt. Subdomains need their own — blog.example.com/robots.txt is separate from example.com/robots.txt. The content type should be text/plain and encoding UTF-8. A BOM at the start is tolerated but avoid it.

Maximum file size honored by Google is 500 KiB. Anything past that is truncated, and the truncation can land mid-rule. Keep real-world files well under 50 KiB.

User-agent: targeting specific crawlers

Every rule block starts with one or more User-agent lines. The value is a case-insensitive substring match against the crawler’s product token.

User-agent: Googlebot
Disallow: /admin/

User-agent: Bingbot
Disallow: /admin/
Crawl-delay: 10

User-agent: *
Disallow: /private/

The * wildcard matches every crawler that has no more specific block. A crawler picks the single most specific matching group and obeys only that one — rules do not merge across groups. If Googlebot matches both its own block and the * block, it follows only the Googlebot block.
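That selection step can be modeled in a few lines (select_group is an illustrative sketch, not a library API; it simplifies RFC 9309 token matching to a case-insensitive substring test):

```python
def select_group(crawler_token, group_names):
    """Pick the single group a crawler obeys: the most specific
    (longest) matching user-agent value, falling back to '*'."""
    matches = [g for g in group_names
               if g != "*" and g.lower() in crawler_token.lower()]
    if matches:
        return max(matches, key=len)   # most specific = longest name
    return "*" if "*" in group_names else None

select_group("Googlebot", ["Googlebot", "Bingbot", "*"])        # "Googlebot"
select_group("Googlebot-Image", ["Googlebot", "Bingbot", "*"])  # "Googlebot"
select_group("DuckDuckBot", ["Googlebot", "Bingbot", "*"])      # "*"
```

Note the second call: a crawler with no group of its own falls back to the most specific name that still matches its token, not to *.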

Disallow and Allow

Disallow gives a path prefix the crawler must not request. Allow overrides a broader Disallow for a narrower path. Both are prefix matches starting from the root of the domain.

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

When rules conflict, Google uses the longest-matching path rule. Above, a request to /wp-admin/admin-ajax.php matches the Allow path (24 characters) more specifically than the Disallow path (10 characters), so it is allowed.

A bare Disallow: with no path means “nothing is disallowed” for that group — it is how you grant a crawler full access. Disallow: / blocks the entire site.
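The longest-match precedence is easy to model (is_allowed is an illustrative sketch handling plain prefixes only, no wildcards):

```python
def is_allowed(path, rules):
    """rules: list of ("allow" | "disallow", path_prefix) pairs.
    The longest matching path wins; ties resolve to allow."""
    matching = [(len(p), kind) for kind, p in rules
                if p and path.startswith(p)]
    if not matching:
        return True                        # no rule matches => crawlable
    longest = max(n for n, _ in matching)
    return any(kind == "allow" for n, kind in matching if n == longest)

rules = [("disallow", "/wp-admin/"), ("allow", "/wp-admin/admin-ajax.php")]
is_allowed("/wp-admin/admin-ajax.php", rules)  # True  (24-char allow wins)
is_allowed("/wp-admin/options.php", rules)     # False (only the disallow matches)
is_allowed("/blog/post", rules)                # True  (no rule matches)
```

This is the behavior to keep in mind when auditing a file: rule order within a group is irrelevant; only path length decides.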

Wildcards and end-of-URL anchor

Google, Bing, and most modern crawlers support two pattern characters beyond plain prefix matching:

* matches any sequence of characters. $ anchors the pattern to the end of the URL.

User-agent: *
Disallow: /*?sessionid=
Disallow: /*.pdf$
Allow:    /public/*.pdf$

The first line blocks any URL containing ?sessionid=. The second blocks URLs ending in .pdf — without $, /file.pdf.html would also match. The Allow line then re-opens PDFs under /public/.
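Translating a pattern into a regex makes the semantics concrete (rule_to_regex is an illustrative helper; real crawlers implement the matching natively):

```python
import re

def rule_to_regex(pattern):
    """'$' at the end anchors the match, '*' becomes '.*', everything
    else is escaped literally. Unanchored patterns are prefix matches,
    which re.match gives us for free."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

bool(rule_to_regex("/*.pdf$").match("/report.pdf"))           # True
bool(rule_to_regex("/*.pdf$").match("/file.pdf.html"))        # False
bool(rule_to_regex("/*?sessionid=").match("/p?sessionid=1"))  # True
```

The escaping step matters: characters like ? and . are literal in robots.txt patterns but special in regex, so a naive translation would silently match the wrong URLs.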

Sitemap directive

Unlike the rule directives, Sitemap lines are global — they are not tied to any User-agent group and can appear anywhere in the file. Use absolute URLs.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

You can list multiple sitemaps or point to a sitemap index. This is the only directive that actively tells crawlers where to find content — the rest only tell them where not to go.
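Python’s standard library can extract these lines directly: RobotFileParser.site_maps() (available since Python 3.8) returns every Sitemap URL in the file, independent of any user-agent group.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
parser.site_maps()
# ['https://example.com/sitemap.xml', 'https://example.com/sitemap-news.xml']
```

site_maps() returns None when the file lists no sitemaps, so guard for that in scripts.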

Crawl-delay — limited support

Crawl-delay: N asks the crawler to wait N seconds between requests. Bing and Yandex honor it; Google ignores it entirely (use Search Console crawl rate settings instead). Baidu interprets it differently. In practice, scope it to the bots that respect it.
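The same stdlib parser reads Crawl-delay per group via crawl_delay():

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Bingbot
Disallow: /admin/
Crawl-delay: 10

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
parser.crawl_delay("Bingbot")     # 10
parser.crawl_delay("Googlebot")   # None -- the * group sets no delay
```

A None result means you choose your own politeness interval; robots.txt is silent on the matter.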

Noindex is no longer supported here

Google stopped obeying Noindex: directives in robots.txt on September 1, 2019. If you still have Noindex: /thanks lines, move that control to a <meta name="robots" content="noindex"> tag or an X-Robots-Tag header and make sure the URL is crawlable. A page that is both noindex and robots-disallowed is the worst of both worlds: Google cannot see the noindex tag, so it may leave the URL in the index as a snippetless link.

Patterns for staging and preview

For staging subdomains (staging.example.com), serve:

User-agent: *
Disallow: /

Combine with HTTP Basic Auth or IP allowlisting — robots.txt alone is a courtesy, not a gate. If a competitor or scraper finds the staging URL, they will ignore your robots.txt and crawl it anyway.

For a production site with a private admin area and a search results page you do not want indexed:

User-agent: *
Disallow: /admin/
Disallow: /cart
Disallow: /search
Disallow: /*?utm_
Allow: /

Sitemap: https://example.com/sitemap.xml

The /*?utm_ line blocks the duplicate-content URLs that UTM-tagged inbound links create.

Testing your file

Google Search Console has a robots.txt report under Settings that shows the last-fetched copy, any parse errors, and the timestamp. For live testing against specific URLs, use the URL Inspection tool — it reports whether a given URL is blocked and which rule blocked it.

Before deploying, run the full file through a syntax checker that understands Google’s precedence rules. A misplaced Allow/Disallow can look harmless but silently block a whole section.

Common mistakes

Blocking CSS and JS. Google needs to render pages to evaluate them. Disallowing /wp-content/ or /static/ can hide the styles and scripts Googlebot uses for layout, which hurts rankings. Leave asset directories crawlable.

Using robots.txt to hide sensitive URLs. The file is public. Listing /admin-secret-backup/ in a Disallow line is like putting a giant arrow on it. Use auth, not robots.txt, for security.

Expecting Disallow to remove pages from the index. Disallow stops crawling, not indexing. Already-indexed URLs can stay in search results for months as snippetless links. To remove a page, apply noindex (and keep the URL crawlable) until Google drops it; only then add the Disallow.

Case-sensitive path mismatch. Paths are case-sensitive. Disallow: /Admin/ does not block /admin/. Match your actual URL casing.

Forgetting subdomain scope. Uploading robots.txt to the root does nothing for cdn.example.com. Each subdomain that serves HTTP needs its own file.

Trailing-slash surprises. Disallow: /foo blocks /foo, /foo/, /foobar, and /foo.html. If you meant only the folder, write Disallow: /foo/.
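The prefix semantics behind that last surprise take two lines to verify (blocked_by is an illustrative helper):

```python
def blocked_by(prefix, paths):
    # Disallow is a plain prefix match: no folder awareness at all.
    return [p for p in paths if p.startswith(prefix)]

paths = ["/foo", "/foo/", "/foobar", "/foo.html", "/foo/page", "/bar"]
blocked_by("/foo", paths)   # everything except /bar
blocked_by("/foo/", paths)  # just the folder: ['/foo/', '/foo/page']
```

Note that Disallow: /foo/ no longer blocks the bare /foo URL itself — if you want both the page and the folder, you need both rules.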

Run the numbers

Draft and validate a production-ready file with the robots.txt generator. Pair with the sitemap URL generator so the Sitemap: line you add actually points at a well-formed file, and the URL parser to verify the exact path shape your rules will match against.

