Wget doesn't download recursively after following a redirect
Wget and HTTP Redirects: Mastering Recursive Downloads
Understand why Wget might fail to download recursively after an HTTP redirect and learn effective strategies to overcome this common challenge.
Wget is a powerful command-line utility for retrieving content from web servers. It's often used for mirroring websites or downloading entire directories recursively. However, a common pitfall arises when the initial URL or subsequent links lead to an HTTP redirect. By default, Wget's recursive download logic might not behave as expected after following a redirect, leading to incomplete downloads or unexpected results. This article delves into the reasons behind this behavior and provides practical solutions to ensure your recursive Wget operations succeed even when redirects are involved.
Understanding Wget's Default Redirect Behavior
When Wget encounters an HTTP redirect (e.g., a 301 Moved Permanently or 302 Found status code), it typically follows it to the new location. However, its recursive download mechanism (-r or --recursive) is primarily designed to operate within the original host or domain unless explicitly told otherwise. If a redirect points to a different domain, or to a path that Wget doesn't consider part of the original hierarchy, the recursive download might stop or behave unpredictably. This is a safety mechanism to prevent Wget from inadvertently downloading the entire internet.
flowchart TD
    A[Start Wget -r URL] --> B{Initial URL Redirect?}
    B -- No --> C[Begin Recursive Download from URL]
    B -- Yes --> D[Follow Redirect to New URL]
    D --> E{New URL on Same Host/Domain as Original?}
    E -- Yes --> C
    E -- No --> F["Stop Recursive Download (Default Behavior)"]
    F --> G[Incomplete Download]
Default Wget Recursive Download Flow with Redirects
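Before reaching for recursion flags, it helps to confirm that a redirect is actually in play. A minimal sketch, with http://example.com/ standing in for your starting URL:
# Print the server's response headers without saving anything;
# a "301 Moved Permanently" or "302 Found" status line confirms a redirect.
wget -S --spider http://example.com/
Using --spider together with -S (--server-response) to inspect headers before attempting a recursive download.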
Common Scenarios and Their Solutions
The issue often arises in a few key scenarios. Understanding these helps in applying the correct Wget options.
Scenario 1: Redirect to a Different Host/Domain
If the initial URL redirects to an entirely different domain, Wget, by default, will download the content from the redirected URL but will not recursively follow links on that new domain. It will treat the new domain as 'external' to the original request.
wget -r http://old-domain.com/path/
This command might download http://new-domain.com/path/ but won't recurse on new-domain.com if old-domain.com redirects there.
To overcome this, you need to instruct Wget to span hosts. The --span-hosts option allows Wget to visit external hosts, and --domains can be used to explicitly list the domains allowed for recursion.
wget -r --span-hosts --domains=old-domain.com,new-domain.com http://old-domain.com/path/
Allowing Wget to span hosts and specifying allowed domains for recursion.
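For reference, the same command can be written with the short options -H (--span-hosts) and -D (--domains):
wget -r -H -D old-domain.com,new-domain.com http://old-domain.com/path/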
Scenario 2: Redirect to a Different Path on the Same Host
Sometimes, a redirect might occur within the same domain but point to a different path. While Wget usually handles this better, if the new path is outside the 'directory' implied by the original URL, recursion might still be limited. For instance, http://example.com/dir1/ redirecting to http://example.com/dir2/.
wget -r http://example.com/old-path/
If old-path redirects to new-path, Wget might not recurse fully on new-path without further options.
The --no-parent option (note that it is not enabled by default; you must pass it explicitly alongside -r) prevents Wget from ascending to the parent directory. If your redirect leads to a path that Wget considers a 'parent' of the original recursive base, this option will block recursion there. In such cases, you might need to reconsider your starting URL, drop --no-parent, or whitelist the redirect target with --include-directories, as sketched below. Note that --relative (-L) restricts Wget to following relative links only, which can further narrow, rather than widen, what is fetched after a redirect.
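A minimal sketch of the whitelist approach, assuming /old-path/ redirects to a known target /new-path/ (both directory names are hypothetical):
# Allow recursion in both the original directory and the redirect target.
# /new-path stands in for wherever the redirect actually lands.
wget -r --include-directories=/old-path,/new-path http://example.com/old-path/
Explicitly allowing both directories with --include-directories (-I) so recursion continues after the redirect.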
Tip: Run Wget with --spider (optionally with -nv for quieter output) first to see what it would request without actually saving files. This is invaluable for debugging recursive issues.
Advanced Wget Options for Robust Recursion
Beyond the basic --span-hosts and --domains, several other options can help fine-tune Wget's behavior with redirects and recursive downloads; a combined example follows the list below.
- --convert-links (-k): After downloading, convert links in the documents to make them suitable for local viewing. This is crucial for mirrored sites.
- --page-requisites (-p): Download all files that are necessary to properly display a given HTML page (e.g., images, CSS, JavaScript).
- --restrict-file-names=windows: Useful when mirroring sites onto a Windows filesystem, as it avoids characters that are illegal in Windows filenames.
- --level=N (-l N): Specify the maximum recursion depth. This can prevent infinite loops or excessively large downloads.
- --mirror (-m): A shorthand for -r -N -l inf --no-remove-listing. It's designed for mirroring and implies infinite recursion, timestamping, and other useful options.
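As a quick illustration of several of these options together (a sketch only; http://example.com/docs/ is a placeholder URL):
# Recurse at most 3 levels deep, fetch page requisites, convert links
# for local viewing, and sanitize filenames for a Windows filesystem.
wget -r -l 3 -p -k --restrict-file-names=windows http://example.com/docs/
Combining depth limiting, page requisites, link conversion, and filename restriction.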
Warning: Be careful using --span-hosts without --domains. This can lead to Wget following links across any domain it encounters, potentially downloading vast amounts of unintended data. Always combine it with --domains for precise control.
Practical Example: Mirroring a Site with Redirects
Let's say you want to mirror a website that uses a 301 redirect from its non-www version to its www version, and you want to ensure all content is downloaded recursively.
wget \
--recursive \
--level=inf \
--convert-links \
--page-requisites \
--no-parent \
--span-hosts \
--domains=example.com,www.example.com \
--wait=1 \
--random-wait \
--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" \
http://example.com/
A robust Wget command for mirroring a site that might involve redirects between example.com and www.example.com.
In this example:
- --recursive and --level=inf ensure deep recursion.
- --convert-links and --page-requisites make the mirrored site browsable locally.
- --no-parent prevents Wget from going above the starting directory.
- --span-hosts allows Wget to follow redirects and links to www.example.com from example.com.
- --domains=example.com,www.example.com explicitly limits recursion to these two domains.
- --wait and --random-wait are good practices to avoid overwhelming the server.
- --user-agent can sometimes help bypass basic bot detection.
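For comparison, a more compact variant using short options and the --mirror shorthand discussed earlier. It is close but not identical to the long-form command: --mirror adds timestamping via -N, and --no-parent is omitted here.
# -m expands to -r -N -l inf --no-remove-listing;
# -H and -D are the short forms of --span-hosts and --domains.
wget -m -k -p -H -D example.com,www.example.com --wait=1 --random-wait http://example.com/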