
Pruning old forum threads

Mark.J

Administrator
Staff member
ISPreview Team
I’m generally not a big fan of erasing history, and so will often go a long way to ensure that old news and forum content remains accessible. But the discussion forum is getting rather big, with threads going back to around 2001/2002. A lot of that dates from the dial-up and early ADSL era.

Between ancient threads and related attachment files/images, today’s forum is now several gigabytes in size and needs some heavy pruning to keep it lean and manageable. I am thus planning to prune several years of threads, from around 2001/2 to 2007/8, albeit only from a handful of the busier forums like ‘General Discussion’.

After that I’ll also need to rebuild profiles and forums, which will take a while, even when doing it via CLI (command line). I plan to do this gradually in phases, whenever I get the chance, over the next month or so - this way there shouldn't be much disruption, if any. The catch is that the database needs to be kept in sync, so eventually some members may see their total post count reduce as ancient content is removed and forum indexes rebuilt.

Anyway, that’s the tentative plan and I thought members might like to know about it, just in case you wondered why some totals for posts/thread counts etc. had changed.
 
I'm always sad to see old content go and link rot makes me mad, I think it still has value.
Is there a will to at least keep it saved as static HTML or something?
 
I'm not sure how this sort of thing works, but is there any prospect of getting archive.org and/or some other archiving site to crawl the forum before this deletion happens? I can understand wanting to get rid of it but it would be nice if it could survive in some form.

Lucian's idea would be great but I suspect it isn't trivial to implement.
 
The site is already on the Wayback Machine, although I don't know how many historic pages they've indexed and don't have much control over that. It may not help that, when we changed the forum software a few years ago, the permanent URLs also changed for posts/threads. We use redirects, of course, but I've noticed that the Wayback Machine doesn't always update to reflect these.
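For anyone wanting to check whether a particular old thread URL has a snapshot, here is a minimal sketch using the Wayback Machine's public availability endpoint. The helper names (`availability_url`, `closest_snapshot`, `check`) are made up for illustration, and the response shape is assumed to follow the documented `archived_snapshots` / `closest` layout.

```python
import json
import urllib.parse
import urllib.request

# Public Wayback Machine availability endpoint.
API = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL for one page (timestamp is optional, YYYYMMDD)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urllib.parse.urlencode(params)


def closest_snapshot(response):
    """Return the closest snapshot URL from a decoded response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def check(page_url, timestamp=None):
    """Query the endpoint over the network and return a snapshot URL or None."""
    request_url = availability_url(page_url, timestamp)
    with urllib.request.urlopen(request_url, timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `check()` with both the old and the new permanent URL of the same thread would show whether only one of them made it into the archive.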

On Lucian's idea: I'd love to do that and did explore it (similar to the approach taken with our old news systems), but there doesn't seem to be an easy mechanism to achieve it, at least not without me spending time and money to develop one. That's a lot of pain for no financial gain (the forum is way bigger and more complex than our old news systems), which for a smaller publication isn't viable.
 
Regarding the Wayback Machine... I probably should have asked beforehand, but now it's done... sorry Mark :P

I went to the ArchiveTeam's IRC (#archivebot on hackint), mentioned this thread, and they'll archive the site. Eventually the content is added to the Internet Archive's Wayback Machine.

Their bot does it slowly (4-5 pages per second). It will probably take a few days, assuming their IP isn't banned. I don't know how long it takes for everything to appear on the Wayback Machine.

This won't archive the old redirects, but at least the current content will be saved. 23k URLs done by now, 455k in the queue (it increases as the bot finds new links, I think it also saves external links): http://archivebot.com/
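As a rough sanity check on the timing, here is the arithmetic for the current queue alone. This is only a sketch: `eta_hours` is a made-up helper, and it assumes the queue stops growing and the bot holds a steady 4-5 pages per second, neither of which is guaranteed.

```python
def eta_hours(queued_urls, pages_per_second):
    """Hours needed to clear a fixed queue at a constant crawl rate."""
    return queued_urls / pages_per_second / 3600


queued = 455_000  # URLs waiting in the queue, as quoted above

slow = eta_hours(queued, 4)  # 4 pages per second
fast = eta_hours(queued, 5)  # 5 pages per second
print(f"{fast:.0f} to {slow:.0f} hours")  # prints "25 to 32 hours"
```

Since the queue keeps growing as the bot finds new links, that figure is a lower bound, which fits with the crawl taking a few days overall.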



When we have the list of URLs, we can put them in a Google Sheet and then use this: https://archive.org/services/wayback-gsheets/ . The IA then slowly archives the URLs.

---

To archive a website... If you want to save pages in the WARC format, use ArchiveTeam's grab-site (https://github.com/ArchiveTeam/grab-site/) or wget (more below).

To save it as HTML... wget should be enough for a site like this one, at least for content that doesn't require a login. It creates folders and HTML files, downloads media, fixes URLs, etc. Then you can either keep it locally or put everything on a web server. Good for personal archives or when we want to retire an old platform and keep a static copy online.

For wget options, check the main post and replies to this gist: https://gist.github.com/mullnerz/9fff80593d6b442d5c1b This PDF also has some basic info: https://web.archive.org/web/20171208183935/https://chris.partridge.tech/data/wget-noobs.pdf

Just be careful not to get your IP banned or in trouble.
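A minimal sketch of what a polite wget mirror could look like, wrapped in Python so the options are easy to tweak. The flags are standard GNU wget options, but the particular combination, the 2-second default wait, and the helper names (`build_wget_command`, `run_mirror`) are my own assumptions, not something taken from the gist or the PDF.

```python
import shlex
import subprocess


def build_wget_command(start_url, wait_seconds=2, warc_name=None):
    """Assemble a wget argument list for a slow, static mirror of a site."""
    cmd = [
        "wget",
        "--mirror",             # recursive download with timestamping
        "--page-requisites",    # also fetch images, CSS and scripts
        "--convert-links",      # rewrite links so the copy works offline
        "--adjust-extension",   # save pages with a .html extension
        "--no-parent",          # stay below the starting directory
        f"--wait={wait_seconds}",  # pause between requests
        "--random-wait",        # vary the pause to be gentler on the server
    ]
    if warc_name:
        # Optionally write a WARC file alongside the HTML copy.
        cmd.append(f"--warc-file={warc_name}")
    cmd.append(start_url)
    return cmd


def run_mirror(start_url, **options):
    """Run wget and return its exit code (non-zero is common on 404s)."""
    cmd = build_wget_command(start_url, **options)
    return subprocess.run(cmd, check=False).returncode


# Show the equivalent shell command without downloading anything.
print(shlex.join(build_wget_command("https://example.com/forum/", warc_name="forum")))
```

The wait options are the part that matters most for not getting banned; raise `wait_seconds` if the site is small or the server is slow.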

For individual pages, SingleFile (browser extension) is very good: https://github.com/gildas-lormeau/SingleFile

If you want to create your own private Wayback Machine... ArchiveBox is the way to go: https://archivebox.io/

And that's almost everything I know about archiving online content 😂
 
I'm always sad to see old content go and link rot makes me mad, I think it still has value.

Would you really miss being able to read through ancient* speed tests?

* ancient? I don't know, more than three months?
 

Copyright © 1999 to Present - ISPreview.co.uk - All Rights Reserved - Terms  ,  Privacy and Cookie Policy  ,  Links  ,  Website Rules