Mastering robots.txt: Disallowing All Pages Except One

Learn how to configure your robots.txt file to disallow crawling for an entire site while explicitly allowing a single page, and understand the cascading and overriding rules.
The robots.txt file is a powerful tool for webmasters to communicate with web crawlers and instruct them on which parts of their site should or should not be accessed. While robots.txt is often used to disallow specific directories or files, a common requirement is to block almost the entire site while allowing only a single, specific page to be crawled and indexed. This article will guide you through the process, explain the underlying rules of robots.txt, and clarify how directives override and cascade.
Understanding robots.txt Directives
The robots.txt file uses a simple syntax consisting of User-agent and Disallow/Allow directives. Each User-agent block specifies rules for a particular crawler (e.g., Googlebot, Bingbot, or * for all crawlers). Disallow directives tell crawlers not to access specified paths, while Allow directives explicitly permit access to paths that might otherwise be disallowed by a broader Disallow rule.
User-agent: *
Disallow: /private/
Allow: /private/public-document.html
Example of a basic robots.txt file with Disallow and Allow rules.
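With this basic file in place, for example, a compliant crawler would skip other URLs under /private/ (say, a hypothetical /private/report.pdf) but could still fetch /private/public-document.html and anything outside the /private/ directory.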
Disallowing All Pages Except One
To disallow all pages except one, you use a broad Disallow directive for the entire site, followed by a more specific Allow directive for the single page you wish to expose. The key here is the specificity of the rules: most crawlers apply the most specific rule that matches a given URL, so when Allow and Disallow directives conflict, the more specific rule (usually the one with the longer path match) takes precedence.
flowchart TD
    A[Start: Crawler encounters robots.txt] --> B{User-agent: * matched?}
    B -- Yes --> C[Apply Disallow: /]
    C --> D{Is URL '/allowed-page.html'?}
    D -- Yes --> E[Apply Allow: /allowed-page.html]
    D -- No --> F[URL is Disallowed]
    E --> G[URL is Allowed]
    F --> H[End: Blocked]
    G --> I[End: Crawled]
Decision flow for a crawler processing robots.txt with a global disallow and specific allow rule.
User-agent: *
Disallow: /
Allow: /path/to/your/allowed-page.html
The robots.txt configuration to disallow all pages except one.
In this configuration:
- User-agent: * applies the following rules to all web crawlers.
- Disallow: / tells crawlers not to access any content starting from the root directory, effectively disallowing the entire site.
- Allow: /path/to/your/allowed-page.html explicitly permits crawlers to access this specific page. Because this Allow directive is more specific (it has a longer matching path) than the general Disallow: /, it overrides the broader disallow for that particular URL.
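As an aside, if the single page you want to keep crawlable happens to be the homepage itself, the $ end-of-URL anchor (supported by major crawlers such as Googlebot and Bingbot) can restrict the Allow rule to the exact root URL. A minimal variant would look like this:
User-agent: *
Disallow: /
Allow: /$
A variant that keeps only the homepage (the exact root URL, /) crawlable.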
A good practice is to place Allow directives after broader Disallow directives within the same User-agent block. While the specificity rule is key, explicit ordering can help with clarity and ensure consistent interpretation across different crawlers.
Understanding Overriding and Cascading Rules
The concept of overriding and cascading in robots.txt is crucial. Directives are processed sequentially within a User-agent block, but the most specific rule generally wins: if a URL matches both an Allow and a Disallow directive, the one with the longer path match takes precedence. If both matches have the same length, the Allow directive usually wins (this can vary slightly between crawlers, but Googlebot explicitly states that Allow wins for same-length rules).
Consider the following example to illustrate the cascading effect:
User-agent: *
Disallow: /folder/
Allow: /folder/subfolder/file.html
Disallow: /folder/subfolder/
An example demonstrating conflicting rules.
In this scenario:
- /folder/index.html would be disallowed by Disallow: /folder/.
- /folder/subfolder/another.html would be disallowed by Disallow: /folder/subfolder/.
- /folder/subfolder/file.html would be allowed, because Allow: /folder/subfolder/file.html is the most specific rule and overrides both Disallow: /folder/ and Disallow: /folder/subfolder/.
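To make the precedence logic concrete, here is a minimal Python sketch that applies the longest-match rule, with Allow winning ties, to the three URLs above. It is an illustration, not a full robots.txt parser: it assumes plain path prefixes and ignores wildcards, $ anchors, and per-user-agent groups, and the names RULES and is_allowed are purely illustrative.
# Toy precedence check for the example above: the longest matching rule wins,
# and Allow wins a same-length tie. Simple prefix matching only -- no
# wildcards, no '$' anchors, no per-user-agent groups.
RULES = [
    ("disallow", "/folder/"),
    ("allow", "/folder/subfolder/file.html"),
    ("disallow", "/folder/subfolder/"),
]

def is_allowed(path, rules=RULES):
    """Return True if 'path' may be crawled under the given rules."""
    best_len, verdicts = -1, set()
    for kind, rule_path in rules:
        if path.startswith(rule_path):          # simple prefix match
            if len(rule_path) > best_len:       # longer match takes precedence
                best_len, verdicts = len(rule_path), {kind}
            elif len(rule_path) == best_len:    # same length: keep both verdicts
                verdicts.add(kind)
    # No rule matched -> allowed; otherwise Allow wins any same-length tie.
    return best_len == -1 or "allow" in verdicts

for url in ("/folder/index.html",
            "/folder/subfolder/another.html",
            "/folder/subfolder/file.html"):
    print(url, "->", "allowed" if is_allowed(url) else "disallowed")
A toy longest-match evaluation of the conflicting rules shown above.
Running it prints that the first two URLs are disallowed and the third is allowed, mirroring the list above.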
Remember that robots.txt is a request, not an enforcement mechanism: malicious bots may simply ignore these directives. To genuinely prevent access to sensitive information, use server-side authentication; to keep a crawlable page out of search results, use a noindex meta tag instead.
Verifying Your robots.txt
After implementing your robots.txt file, it's essential to verify that it behaves as intended. Google Search Console's robots.txt report (which replaced the earlier robots.txt Tester tool) shows how Googlebot fetched and parsed your file, and the URL Inspection tool reveals whether a specific URL on your site is blocked by robots.txt. Other search engines may offer similar tools.