Mastering robots.txt: Disallowing All Pages Except One

Learn how to configure your robots.txt file to disallow crawling for an entire site while explicitly allowing a single page, and understand the cascading and overriding rules.
The robots.txt file is a powerful tool for webmasters to communicate with web crawlers and instruct them on which parts of their site should or should not be accessed. While often used to disallow specific directories or files, a common requirement is to block access to almost the entire site, allowing only a single, specific page to be crawled and indexed. This article will guide you through the process, explain the underlying rules of robots.txt, and clarify how directives override and cascade.
Understanding robots.txt Directives
The robots.txt file uses a simple syntax consisting of User-agent and Disallow/Allow directives. Each User-agent block specifies rules for a particular crawler (e.g., Googlebot, Bingbot, or * for all crawlers). Disallow directives tell crawlers not to access specified paths, while Allow directives explicitly permit access to paths that might otherwise be disallowed by a broader Disallow rule.
User-agent: *
Disallow: /private/
Allow: /private/public-document.html
Example of a basic robots.txt file with Disallow and Allow rules.
Disallowing All Pages Except One
To achieve the goal of disallowing all pages except one, you need to use a broad Disallow directive for the entire site, followed by a more specific Allow directive for the single page you wish to expose. The key here is the order and specificity of the rules. Most crawlers process robots.txt directives from the most specific to the least specific, or they apply the most specific rule that matches a given URL. When Allow and Disallow directives conflict, the more specific rule (usually the one with the longer path match) takes precedence.
flowchart TD
    A[Start: Crawler encounters robots.txt] --> B{User-agent: * matched?}
    B -- Yes --> C[Apply Disallow: /]
    C --> D{Is URL '/allowed-page.html'?}
    D -- Yes --> E[Apply Allow: /allowed-page.html]
    D -- No --> F[URL is Disallowed]
    E --> G[URL is Allowed]
    F --> H[End: Blocked]
    G --> I[End: Crawled]

Decision flow for a crawler processing robots.txt with a global disallow and specific allow rule.
User-agent: *
Disallow: /
Allow: /path/to/your/allowed-page.html
The robots.txt configuration to disallow all pages except one.
In this configuration:
- User-agent: * applies the following rules to all web crawlers.
- Disallow: / tells crawlers not to access any content starting from the root directory, effectively disallowing the entire site.
- Allow: /path/to/your/allowed-page.html explicitly permits crawlers to access this specific page. Because this Allow directive is more specific (it has a longer matching path) than the general Disallow: /, it overrides the broader disallow for that particular URL.
Tip: Place specific Allow directives before broader Disallow directives within the same User-agent block. Under the longest-match rule the order should not matter for major crawlers, but some simpler parsers apply the first matching rule, so explicit ordering helps ensure consistent interpretation across different crawlers.

Understanding Overriding and Cascading Rules
The concept of overriding and cascading in robots.txt is crucial. Directives are processed sequentially within a User-agent block, but the most specific rule generally wins. If a URL matches both an Allow and a Disallow directive, the one with the longer path match takes precedence. If both have the same length, the Allow directive usually wins (though this can vary slightly between crawlers, Googlebot explicitly states Allow wins for same-length rules).
Consider the following example to illustrate the cascading effect:
User-agent: *
Disallow: /folder/
Allow: /folder/subfolder/file.html
Disallow: /folder/subfolder/
An example demonstrating conflicting rules.
In this scenario:
- /folder/index.html would be disallowed by Disallow: /folder/.
- /folder/subfolder/another.html would be disallowed by Disallow: /folder/subfolder/.
- /folder/subfolder/file.html would be allowed because Allow: /folder/subfolder/file.html is the most specific rule and overrides the broader Disallow: /folder/ and Disallow: /folder/subfolder/.
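The longest-match behavior in this scenario can be sketched in Python. This is a simplified model for illustration only: it handles plain path prefixes but not the * and $ wildcards or percent-encoding normalization that real crawlers support.

```python
def robots_decision(rules, path):
    """Return True if `path` is crawlable under `rules`.

    `rules` is a list of ("allow" | "disallow", path_prefix) tuples.
    Models Googlebot's precedence: the rule with the longest matching
    prefix wins; on a length tie, Allow wins over Disallow.
    """
    best_len, allowed = -1, True  # no matching rule => crawlable
    for kind, prefix in rules:
        if path.startswith(prefix):
            length = len(prefix)
            # Longer match wins; Allow wins a same-length tie.
            if length > best_len or (length == best_len and kind == "allow"):
                best_len, allowed = length, (kind == "allow")
    return allowed


# The conflicting rules from the example above:
rules = [
    ("disallow", "/folder/"),
    ("allow", "/folder/subfolder/file.html"),
    ("disallow", "/folder/subfolder/"),
]

print(robots_decision(rules, "/folder/index.html"))             # False
print(robots_decision(rules, "/folder/subfolder/another.html")) # False
print(robots_decision(rules, "/folder/subfolder/file.html"))    # True
```

Note that rule order never changes the outcome here; only match length (and the Allow tie-break) decides.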
Note: robots.txt is a request, not an enforcement mechanism, and malicious bots may ignore its directives. For true security, protect sensitive content with server-side authentication; to keep a page out of search results, use a noindex meta tag, which crawlers can only see if the page remains crawlable.

Verifying Your robots.txt
After implementing your robots.txt file, it's essential to verify its correctness. Google Search Console provides a robots.txt report that shows which robots.txt files Google found for your site and how Googlebot parsed them, and its URL Inspection tool reports whether a specific URL is blocked by robots.txt. Other search engines may offer similar tools.
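You can also sanity-check a configuration locally with Python's standard-library urllib.robotparser. One caveat: this parser applies the first matching rule rather than the longest match, so the Allow line is placed before Disallow: / in the sketch below; longest-match crawlers such as Googlebot accept either order.

```python
import urllib.robotparser

# The "disallow everything except one page" configuration, with the
# Allow rule first so order-sensitive parsers also honor it.
ROBOTS_TXT = """\
User-agent: *
Allow: /path/to/your/allowed-page.html
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The single whitelisted page is crawlable; everything else is blocked.
print(parser.can_fetch("MyBot", "/path/to/your/allowed-page.html"))  # True
print(parser.can_fetch("MyBot", "/some/other/page.html"))            # False
```

In production you would point RobotFileParser at the live file with set_url() and read() instead of parsing an inline string.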