Wget doesn't download recursively after following a redirect


Wget and HTTP Redirects: Mastering Recursive Downloads


Understand why Wget might fail to download recursively after an HTTP redirect and learn effective strategies to overcome this common challenge.

Wget is a powerful command-line utility for retrieving content from web servers. It's often used for mirroring websites or downloading entire directories recursively. However, a common pitfall arises when the initial URL or subsequent links lead to an HTTP redirect. By default, Wget's recursive download logic might not behave as expected after following a redirect, leading to incomplete downloads or unexpected results. This article delves into the reasons behind this behavior and provides practical solutions to ensure your recursive Wget operations succeed even when redirects are involved.

Understanding Wget's Default Redirect Behavior

When Wget encounters an HTTP redirect (e.g., a 301 Moved Permanently or 302 Found status code), it typically follows it to the new location. However, its recursive download mechanism (-r or --recursive) is primarily designed to operate within the original host or domain unless explicitly told otherwise. If a redirect points to a different domain or a path that Wget doesn't consider part of the original hierarchy, the recursive download might stop or behave unpredictably. This is a safety mechanism to prevent Wget from inadvertently downloading the entire internet.

flowchart TD
    A[Start Wget -r URL] --> B{Initial URL Redirect?}
    B -- No --> C[Begin Recursive Download from URL]
    B -- Yes --> D[Follow Redirect to New URL]
    D --> E{New URL on Same Host/Domain as Original?}
    E -- Yes --> C
    E -- No --> F["Stop Recursive Download (Default Behavior)"]
    F --> G[Incomplete Download]

Default Wget Recursive Download Flow with Redirects
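The same-host check in the flow above boils down to comparing the host portion of the original and redirected URLs. A minimal POSIX shell sketch of that comparison (`host_of` and `same_host` are illustrative helpers, not Wget internals):

```shell
#!/bin/sh
# Extract the host component of a URL (scheme and path stripped).
host_of() {
  h="${1#*://}"   # drop "http://" or "https://"
  h="${h%%/*}"    # drop everything after the first slash
  printf '%s\n' "$h"
}

# Succeeds when two URLs share a host -- roughly the test Wget applies
# before continuing a recursive crawl after a redirect.
same_host() {
  [ "$(host_of "$1")" = "$(host_of "$2")" ]
}

same_host "http://old-domain.com/a/" "http://old-domain.com/b/" && echo "recurse"
same_host "http://old-domain.com/a/" "http://new-domain.com/a/" || echo "stop (default)"
```

When the hosts differ, the default behavior is the "stop" branch; the options below change that decision.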

Common Scenarios and Their Solutions

The issue often arises in a few key scenarios. Understanding these helps in applying the correct Wget options.

Scenario 1: Redirect to a Different Host/Domain

If the initial URL redirects to an entirely different domain, Wget, by default, will download the content from the redirected URL but will not recursively follow links on that new domain. It will treat the new domain as 'external' to the original request.

wget -r http://old-domain.com/path/

If old-domain.com redirects there, this command will download the page at http://new-domain.com/path/ itself, but it will not recurse into links hosted on new-domain.com.

To overcome this, you need to instruct Wget to span hosts. The --span-hosts option allows Wget to visit external hosts, and --domains can be used to explicitly list allowed domains for recursion.

wget -r --span-hosts --domains=old-domain.com,new-domain.com http://old-domain.com/path/

Allowing Wget to span hosts and specifying allowed domains for recursion.
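Conceptually, --domains turns the crawl into an allow-list check on each candidate link's host. A rough shell sketch of that filter (`host_allowed` is our name, not a Wget feature; real Wget performs a suffix match, so subdomains of a listed domain also pass, while this sketch matches exactly for brevity):

```shell
#!/bin/sh
# Comma-separated allow-list, as passed to --domains.
domains="old-domain.com,new-domain.com"

# Succeeds only when the host appears in the allow-list (exact match).
host_allowed() {
  case ",$domains," in
    *",$1,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

host_allowed "new-domain.com"   && echo "new-domain.com: crawl"
host_allowed "tracker.example"  || echo "tracker.example: skip"
```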

Scenario 2: Redirect to a Different Path on the Same Host

Sometimes a redirect stays within the same domain but points to a different path. Wget usually handles this better, but if the new path falls outside the 'directory' implied by the original URL, recursion may still be limited. For instance, http://example.com/old-path/ redirecting to http://example.com/new-path/.

wget -r http://example.com/old-path/

If old-path redirects to new-path, Wget might not recurse fully on new-path without further options.

The --no-parent (-np) option, often combined with -r (it is not enabled by default), prevents Wget from ascending above the starting directory. If a redirect lands on a path that Wget treats as a 'parent' of the original recursive base, -np will block recursion there. In such cases, start from a higher-level URL (for example, http://example.com/ rather than http://example.com/old-path/) or omit --no-parent. Be wary of -L (--relative) here: despite its name, it restricts recursion to relative links only, narrowing rather than widening what Wget follows.
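The 'parent' test itself is a simple prefix comparison on paths. A sketch, assuming the recursive base is /old-path/ (the `is_within` helper is illustrative, not part of Wget):

```shell
#!/bin/sh
# Base directory of the recursive download.
base="/old-path/"

# Succeeds when a path lies at or below the base -- what --no-parent enforces.
is_within() {
  case "$1" in
    "$base"*) return 0 ;;
    *)        return 1 ;;
  esac
}

is_within "/old-path/sub/page.html" && echo "followed"
is_within "/new-path/page.html"     || echo "blocked by --no-parent"
```

A redirect from /old-path/ to /new-path/ fails this prefix check, which is why the redirected tree can go unvisited.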

Advanced Wget Options for Robust Recursion

Beyond the basic --span-hosts and --domains, several other options can help fine-tune Wget's behavior with redirects and recursive downloads.

  • --convert-links (-k): After downloading, convert links in the document to make them suitable for local viewing. This is crucial for mirrored sites.
  • --page-requisites (-p): Download all files that are necessary to properly display a given HTML page (e.g., images, CSS, JavaScript).
  • --restrict-file-names=windows: Useful when mirroring sites on a Windows filesystem, as it avoids characters illegal in Windows filenames.
  • --level=N (-l N): Specify the maximum recursion depth. This can prevent infinite loops or excessively large downloads.
  • --mirror (-m): A shorthand for -r -N -l inf --no-remove-listing. It's designed for mirroring and implies infinite recursion, timestamping, and other useful options.

Practical Example: Mirroring a Site with Redirects

Let's say you want to mirror a website that uses a 301 redirect from its non-www version to its www version, and you want to ensure all content is downloaded recursively.

wget \
  --recursive \
  --level=inf \
  --convert-links \
  --page-requisites \
  --no-parent \
  --span-hosts \
  --domains=example.com,www.example.com \
  --wait=1 \
  --random-wait \
  --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" \
  http://example.com/

A robust Wget command for mirroring a site that might involve redirects between example.com and www.example.com.

In this example:

  • --recursive and --level=inf ensure deep recursion.
  • --convert-links and --page-requisites make the mirrored site browsable locally.
  • --no-parent prevents Wget from going above the starting directory.
  • --span-hosts allows Wget to follow redirects and links to www.example.com from example.com.
  • --domains=example.com,www.example.com explicitly limits recursion to these two domains.
  • --wait and --random-wait are good practices to avoid overwhelming the server.
  • --user-agent can sometimes help bypass basic bot detection.