Mastering robots.txt: Disallowing All Pages Except One

Learn how to configure your robots.txt file to disallow crawling for an entire site while explicitly allowing a single page, and understand the cascading and overriding rules.

The robots.txt file is a powerful tool for webmasters to communicate with web crawlers and instruct them on which parts of a site should or should not be accessed. While it is often used to disallow specific directories or files, a common requirement is to block crawling of almost the entire site while allowing only a single, specific page to be crawled (keep in mind that robots.txt controls crawling, not indexing). This article guides you through the process, explains the underlying rules of robots.txt, and clarifies how directives override and cascade.

Understanding robots.txt Directives

The robots.txt file uses a simple syntax consisting of User-agent and Disallow/Allow directives. Each User-agent block specifies rules for a particular crawler (e.g., Googlebot, Bingbot, or * for all crawlers). Disallow directives tell crawlers not to access specified paths, while Allow directives explicitly permit access to paths that might otherwise be disallowed by a broader Disallow rule.

User-agent: *
Disallow: /private/
Allow: /private/public-document.html

Example of a basic robots.txt file with Disallow and Allow rules.
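
To make the evaluation concrete, the following Python sketch mimics the longest-match behaviour that major crawlers such as Googlebot document: the matching Allow or Disallow with the longest path wins, and on a tie Allow is preferred. It is a simplified illustration only (no wildcards, no per-crawler User-agent selection), not a full parser.

# Simplified longest-match evaluation of the rules in the example above
# (no wildcard or User-agent handling).
RULES = [
    ("disallow", "/private/"),
    ("allow", "/private/public-document.html"),
]

def is_allowed(url_path, rules):
    """Return True if url_path may be crawled under the given rules."""
    best_len, verdict = -1, True  # no matching rule means the URL is allowed
    for kind, path in rules:
        if not url_path.startswith(path):
            continue
        if len(path) > best_len or (len(path) == best_len and kind == "allow"):
            best_len, verdict = len(path), (kind == "allow")
    return verdict

print(is_allowed("/private/public-document.html", RULES))  # True: the longer Allow wins
print(is_allowed("/private/secret.html", RULES))           # False: only Disallow matches
print(is_allowed("/about.html", RULES))                     # True: no rule matches

A minimal sketch of how the Allow rule carves an exception out of the Disallow rule.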

Disallowing All Pages Except One

To disallow all pages except one, combine a broad Disallow directive covering the entire site with a more specific Allow directive for the single page you wish to expose. The key is specificity rather than order: major crawlers apply the most specific rule that matches a given URL, regardless of where it appears in the group. When Allow and Disallow directives conflict, the more specific rule, i.e. the one with the longer path match, takes precedence.

flowchart TD
    A[Start: Crawler encounters robots.txt]
    B{User-agent: * matched?}
    B -- Yes --> C[Apply Disallow: /]
    C --> D{Is URL '/allowed-page.html'?}
    D -- Yes --> E[Apply Allow: /allowed-page.html]
    D -- No --> F[URL is Disallowed]
    E --> G[URL is Allowed]
    F --> H[End: Blocked]
    G --> I[End: Crawled]

Decision flow for a crawler processing robots.txt with a global disallow and specific allow rule.

User-agent: *
Disallow: /
Allow: /path/to/your/allowed-page.html

The robots.txt configuration to disallow all pages except one.

In this configuration:

  1. User-agent: * applies the following rules to all web crawlers.
  2. Disallow: / tells crawlers not to access any content starting from the root directory, effectively disallowing the entire site.
  3. Allow: /path/to/your/allowed-page.html explicitly permits crawlers to access this specific page. Because this Allow directive is more specific (it has a longer matching path) than the general Disallow: /, it overrides the broader disallow for that particular URL, as the sketch after this list illustrates.
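
Applying the same longest-match logic to this configuration confirms the intended effect. The snippet below is an illustrative sketch; the page path is the placeholder used above, not a real URL.

# Sketch: the longest matching path wins, and Allow is preferred on a tie
# (simplified; no wildcard or User-agent handling).
RULES = [
    ("disallow", "/"),
    ("allow", "/path/to/your/allowed-page.html"),
]

def is_allowed(url_path, rules):
    matches = [r for r in rules if url_path.startswith(r[1])]
    if not matches:
        return True  # no matching rule means the URL is allowed
    kind, _ = max(matches, key=lambda r: (len(r[1]), r[0] == "allow"))
    return kind == "allow"

print(is_allowed("/path/to/your/allowed-page.html", RULES))  # True: crawlable
print(is_allowed("/", RULES))                                # False: blocked
print(is_allowed("/blog/any-other-page.html", RULES))        # False: blocked

A quick check showing that only the explicitly allowed page survives the global disallow.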

Understanding Overriding and Cascading Rules

Understanding how robots.txt rules override and cascade is crucial. Within a User-agent block, the order in which directives appear generally does not matter; the most specific matching rule wins. If a URL matches both an Allow and a Disallow directive, the one with the longer path match takes precedence. If both matches are the same length, behavior can vary slightly between crawlers, but Google documents that the least restrictive rule applies, so for Googlebot the Allow wins.

Consider the following example to illustrate the cascading effect:

User-agent: *
Disallow: /folder/
Allow: /folder/subfolder/file.html
Disallow: /folder/subfolder/

An example demonstrating conflicting rules.

In this scenario:

  • /folder/index.html would be disallowed by Disallow: /folder/.
  • /folder/subfolder/another.html would be disallowed by Disallow: /folder/subfolder/.
  • /folder/subfolder/file.html would be allowed because Allow: /folder/subfolder/file.html is the most specific rule and overrides the broader Disallow: /folder/ and Disallow: /folder/subfolder/, as the short check after this list confirms.
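
The outcomes above can be checked with the same longest-match helper used in the earlier sketches; again, this is a simplified illustration that ignores wildcards and User-agent selection.

# Verifying the cascading example: the longest matching path decides, so the
# specific Allow beats both Disallow rules for file.html only.
RULES = [
    ("disallow", "/folder/"),
    ("allow", "/folder/subfolder/file.html"),
    ("disallow", "/folder/subfolder/"),
]

def is_allowed(url_path, rules):
    matches = [r for r in rules if url_path.startswith(r[1])]
    if not matches:
        return True
    kind, _ = max(matches, key=lambda r: (len(r[1]), r[0] == "allow"))
    return kind == "allow"

print(is_allowed("/folder/index.html", RULES))              # False
print(is_allowed("/folder/subfolder/another.html", RULES))  # False
print(is_allowed("/folder/subfolder/file.html", RULES))     # True

The printed results match the three outcomes listed above.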

Verifying Your robots.txt

After implementing your robots.txt file, it's essential to verify its correctness. In Google Search Console, the robots.txt report (which replaced the older robots.txt Tester tool) shows how Googlebot fetched and parsed your file, and the URL Inspection tool tells you whether a specific URL is blocked by robots.txt. Other search engines offer similar tools.
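
For a quick manual check, you can also fetch the deployed file directly and confirm it contains the rules you intended to publish. The snippet below is a simple sketch; the domain is a placeholder to replace with your own.

# Fetch and print a site's live robots.txt (the domain is a placeholder).
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

URL = "https://www.example.com/robots.txt"  # replace with your own domain

try:
    with urlopen(URL, timeout=10) as response:
        print(response.read().decode("utf-8", errors="replace"))
except (HTTPError, URLError) as exc:
    print(f"Could not fetch {URL}: {exc}")

Fetching the live robots.txt file to confirm the deployed directives match what you wrote.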