“Geek to Live: Mastering Wget”

I often take my laptop to places where I won’t have an Internet connection, and program there for a few hours at a time. It generally works out well: no e-mail or IM to distract me, no fascinating web links to follow and spend time reading. Unfortunately, it also means that I don’t have access to online reference material.

There are four things that I commonly need reference material for, at present: the Boost library, the wxWidgets library, the SQLite library, and the C++ Standard Template Library. The first three all have downloadable HTML reference material, so I’ve long had them on call.

The last one… there’s a lot of reference material on it, but most of it is in book form, which isn’t very convenient. The dead-tree format is the worst: I’ve got an excellent reference book, but it weighs at least twice what my laptop computer does, and it’s not very efficient to look things up in, either, compared to a hyperlinked format. I’ve also got an e-book version of it, which is a lot lighter, but which still suffers in the lookup-efficiency department.

When I’m in the office, I generally use two reference sites, which I won’t name here. I generally know the name of the function I want to use, and am just looking for the proper syntax, so the hyperlinked format that they provide is perfect for quick lookups. Their reference material was not, however, available in downloadable form… until I ran across this article on Lifehacker.

I’d used wget before on occasion, to download large single files, but I’d never seen it used for anything else. The article showed me that it’s useful for a lot more than that, and the thing that got my attention was that it can download entire swathes of a site too. 🙂
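
For the record, the simple case I’d been using it for is just a one-liner; the URL here is only a placeholder, and the -c flag resumes a partially-downloaded file if the connection drops partway through:

wget -c http://www.example.com/some-large-file.zip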

It took me a while to come up with the proper invocation, because the site I was trying it on has a lot of stuff I don’t want, too, including a link that appears everywhere on the site with different parameters, and which wget was downloading each time. I would have just let it, but one of the places that link appears is on its own page, with different parameters each time, so wget keeps following it recursively and simply never finishes. Anyway, the final result looks something like this, with some names changed:

wget -m -k -X forum,userprofiles --reject='garbagelink*' http://www.example.com

The -m says to “mirror” the site to my local hard drive; it’s shorthand for a recursive download with no depth limit, plus time-stamping so a later run only fetches pages that have changed. The -k says to rewrite the links in the downloaded pages so that they refer to the local copies. The -X forum,userprofiles tells it to skip those directories on the site entirely, and the --reject='garbagelink*' tells it to ignore the problematic link, with the asterisk making it match no matter what parameters come after the link’s name. The last piece is, of course, the site to download from.
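
If I ever need to be gentler on the site, or want the pages to open more cleanly offline, wget has options for that as well. Something like this ought to do it (the extra flags are standard wget options, though I haven’t actually needed them here):

wget -m -k -p -E --wait=2 --limit-rate=200k -X forum,userprofiles --reject='garbagelink*' http://www.example.com

The -p pulls in each page’s images and stylesheets, the -E renames downloaded pages so that they end in .html and open properly in a browser, and --wait and --limit-rate space out and throttle the requests so the mirroring doesn’t hammer the server.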

The result: everything I wanted was pulled to my local hard drive, available for reference regardless of whether I’m connected to the ‘net or not, exactly as I’d hoped. 😀

It’s a nice tool to have. I don’t really have any other sites I need to use it on, at present, but if I find one, I now know how to get it.