
2023-04-08 📌 Converting a phpBB forum to a static archive


Want an offline, read-only copy of a phpBB install, or to keep the content online without the overhead of a dynamic site that's constantly targeted by bots and malware? This was done for a 3.2.x installation but it should work for older or newer ones. I used most of the recipe posted by Matthijs Kooijman at https://stackoverflow.com/questions/3979927/how-to-convert-phpbb-board-to-static-archive-page with some additional exclusions, since the member list wasn't visible to non-registered users. I did this under Linux, but the syntax should be similar or identical with most versions of wget and sed if you're on Windows or Mac.

wget http://example.com/forum/ --save-cookies cookies --keep-session-cookies
wget http://example.com/forum/ --load-cookies cookies --page-requisites --convert-links --mirror --no-parent --reject-regex '([&?]highlight=|[&?]order=|posting.php[?]|privmsg.php[?]|search.php[?]|[&?]mark=|[&?]view=|mode=viewprofile|mode=quote|mode=reply|viewtopic.php[?]p=)' --rejected-log=rejected.log -o wget.log --server-response --adjust-extension --restrict-file-names=windows
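
It's worth skimming the logs from the second command before going further; something like the following (a rough sketch, and the exact log wording varies between wget versions) will surface any 4xx/5xx responses recorded by --server-response, while rejected.log lists the URLs the reject regex filtered out:

# flag failed requests captured in the mirror log
grep -n 'HTTP/1\.[01] [45][0-9][0-9]' wget.log
# review what was deliberately skipped by --reject-regex
less rejected.log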

The first wget just establishes a guest session and saves its cookies so that the second, mirroring pass can send them. The second pass also rewrites filenames that would be awkward to serve as static pages from a LAMP stack server; in particular --restrict-file-names=windows turns ? into @, because @ is a valid character in most file systems whereas ? obviously marks the start of a query string in a URL. Unlike something like HTTrack, wget by default doesn't fetch external content such as images hosted on other sites, but depending on your exact needs it can be told to with additional options such as --span-hosts, --recursive, --level and --domains to indicate which other sites to include things from.
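
For example, pulling in remotely hosted avatars or attachments from one extra domain might look something like this (an untested sketch with a made-up image host, and you'd still want the same --reject-regex exclusions as above):

# hypothetical extra host added alongside the forum's own domain
wget http://example.com/forum/ --load-cookies cookies --page-requisites --convert-links --mirror --no-parent \
  --span-hosts --domains=example.com,images.example.net \
  --adjust-extension --restrict-file-names=windows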

You'll probably want to change links in the downloaded output that point at pages excluded from the crawl so they go to a placeholder page instead:

sed -i 's,http://example.com/forum/memberlist.php,notarchived.html,g' *.html
sed -i 's,http://example.com/forum/search.php,notarchived.html,g' *.html
sed -i 's,http://example.com/forum/posting.php,notarchived.html,g' *.html
sed -i 's,memberlist.php@mode=contactadmin.html,notarchived.html,g' *.html
sed -i 's,search.php.html,notarchived.html,g' *.html
sed -i 's,ucp.php@,notarchived.html?,g' *.html
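
Note that the sed commands above only touch .html files in the current directory; if your mirror ended up with pages in subdirectories, something like this (assuming the mirror sits under example.com/forum/) applies the same kind of substitution recursively:

# run a substitution over every .html file in the mirror, not just the top level
find example.com/forum -name '*.html' -exec sed -i 's,search.php.html,notarchived.html,g' {} +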

Then, as well as manually editing the main forum pages to make it clear the forum is no longer something that can be logged into or interacted with in the same way, you'll probably also want to amend at least one .css file referenced by all pages to hide page elements that no longer have an active function, such as setting #search-box, #forum-search and #nav-main to display: none;. You could potentially reinstate some sort of search function once Google has had a chance to crawl the static copy, by using javascript to insert a form forwarding to https://www.google.com/search?q=site:example.com+search+terms or similar.
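
As a minimal sketch, assuming the default prosilver style and that the mirrored stylesheet ended up at the path below (in practice the filename depends on your style and on how wget mangled any assets_version query string), the extra rules can simply be appended from the shell:

# hide interactive elements on the static copy; adjust the path to wherever the stylesheet actually is
cat >> example.com/forum/styles/prosilver/theme/stylesheet.css <<'EOF'
#search-box, #forum-search, #nav-main { display: none; }
EOF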

Another obvious point is that this crawls pages as a guest, so it won't include non-public forums. For those the simplest approach is to give wget the cookie data for a suitable registered user account: use a browser extension to get the cookie names and values, then manually add them to the 'cookies' file before crawling.
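
For illustration, assuming the board still uses the default phpbb3_ cookie prefix (the hash and values below are placeholders, and the real names depend on the board's cookie settings), the lines added to the Netscape-format 'cookies' file would look roughly like this, with fields separated by tabs; _u is the user id, _sid the session id and _k the persistent-login key:

# domain  include-subdomains  path  secure  expiry  name  value (tab-separated)
example.com	FALSE	/	FALSE	0	phpbb3_abc123_u	2
example.com	FALSE	/	FALSE	0	phpbb3_abc123_sid	PLACEHOLDERSESSIONID
example.com	FALSE	/	FALSE	0	phpbb3_abc123_k	PLACEHOLDERLOGINKEY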
