Archiving a Drupal site with wget

Written by NewTrick

04 Jun 2025

#drupal #bash

A drawer pulled out from a filing cabinet with paper index cards inside.

I came across a Slack post this week where someone was asking for a module or tool that could archive a Drupal site.

One of the answers mentioned that wget could produce a static archive of a site, but there were no further details. This may be obvious to people who know things, but for someone like me, it was new information. Time to learn a new trick.
 

Why Wget?

While there are modules that can help with the task of archiving a site, I’ve come across instances where, instead of the complete project, I’ve been asked to save a copy of a site to satisfy archiving or legislative requirements.

In these cases, saving a static version of the pages, associated files and images, along with the supporting CSS and JavaScript, would be enough.

And with Wget you can do this with one line.

 

Wget, Youget, Iget

At its most basic, you can open a terminal, run wget www.foo.com with the URL you are interested in, and you will see that a new file (most likely index.html or similar) has been copied to your system. If you use cat index.html to check the contents of the file, you should see the HTML for the page you targeted. So the output is similar to a curl request, but you’ll have the file saved locally.
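For example, a first run and a quick check of the result look like this (www.foo.com is just the placeholder URL used throughout this post):

# Download a single page; wget saves it as index.html in the current directory
wget www.foo.com

# Check the contents of the downloaded file
cat index.html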

There are a huge number of options that can be added to a wget request, and the man page documents them better than most man entries I have seen.
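If you want to skim those options yourself:

# Full documentation of every option
man wget

# Quick summary of the available flags
wget --help | less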

Wget also has some handy features: it is non-interactive and designed to cope with slow or unstable networks, which means it can be run in the background and will keep going if the connection is interrupted in some way.
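As a small sketch of those two features (the file path here is just a made-up example), -b sends wget to the background and logs to wget-log, while -c picks up an interrupted download where it left off:

# Run in the background; progress is written to wget-log
wget -b www.foo.com

# Resume a partially-downloaded file after an interruption
wget -c www.foo.com/files/big-archive.zip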

 

Wget it.

I’m grateful for the two sources cited at the end of this post that helped to shape this wget command and provide some of the explanation.

Here is the command that I ended up using. With so many possible options, I’m sure there is more to add:

wget -mpck --html-extension \
  --user-agent="Mozilla/5.0 (compatible; wget)" \
  -e robots=off \
  --wait 2 \
  --random-wait \
  --reject-regex="/(admin|user|node/add|comment/reply|search|autocomplete)" \
  --exclude-directories="/admin,/user,/batch" \
  --restrict-file-names=windows \
  --no-parent \
  --level=0 \
  -P ./FOLDER_NAME \
  www.foo.com

 

Here is the explanation for the options used:

  • -m (Mirror): Turns on mirror-friendly settings like infinite recursion depth, timestamps, etc.
  • -c (Continue): Resumes a partially-downloaded transfer
  • -p (Page requisites): Downloads any page dependencies like images, style sheets, etc.
  • -k (Convert links): Rewrites the links in the downloaded pages so they point to the local copies and still work in the downloaded structure.
  • --html-extension: Appends .html to downloaded pages that don’t already have an HTML extension, so Drupal’s clean URLs still open locally.
  • --user-agent="Mozilla/5.0 (compatible; wget)": Sends a browser-like user agent so the server doesn’t refuse requests simply because they come from wget.
  • -e robots=off: Tells Wget to ignore the robots.txt instructions, for cases where everything is blocked. Leave this option out if you want to respect the wishes of the site owner.
  • --wait 2 & --random-wait: Wait 2 seconds between each request, with --random-wait varying that to somewhere between 0.5 and 1.5 times the base wait. Two seconds seems kinder to the server, but with bigger sites it might make the crawl very slow.
  • --reject-regex & --exclude-directories: Skip Drupal-specific paths like the admin and user pages.
  • --restrict-file-names=windows: Ensures that file names in the archive are sanitised for Windows (even if you are on Linux), so characters such as ? that are not valid in Windows file names are escaped in the saved files.
  • --no-parent: Ensures that wget does not ascend above the target URL when recursing. For example, I was getting results from Cloudflare addresses without it.
  • --level=0: Sets the recursion depth to unlimited.
  • -P ./FOLDER_NAME: Sets the download directory. The default is wherever you run the command from.
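Once the crawl finishes, it’s worth a quick sanity check before calling it an archive. None of this is part of the wget command itself, and it assumes the same ./FOLDER_NAME used above:

# How big is the archive, and how many HTML pages did it capture?
du -sh ./FOLDER_NAME
find ./FOLDER_NAME -name "*.html" | wc -l

# Preview the static copy at http://localhost:8000
cd ./FOLDER_NAME && python3 -m http.server 8000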

As luck would have it, I needed to use this almost straight away. My use case was to find pages that used an <iframe> for YouTube videos. Once the site was downloaded, I could just use grep to look through the pages.
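A rough sketch of that search, assuming the archive was saved into ./FOLDER_NAME and that the embeds point at youtube.com:

# Pages containing any <iframe>
grep -rli "<iframe" ./FOLDER_NAME

# Narrow it down to YouTube embeds specifically
grep -rli "youtube.com/embed" ./FOLDER_NAME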

Sources:

https://gist.github.com/mullnerz/9fff80593d6b442d5c1b

https://darcynorman.net/2011/12/24/archiving-a-wordpress-website-with-wget/

Main image credit:

Maksym Khaharlytskyi