Features Overview
DiscourseDownloader has two primary components - the JSON downloader and the HTML builder.
JSON Downloader
The JSON downloader is the first step in the archive process. Using Discourse's API, it collects all requested data and compiles it into a series of JSON files. Many of these files are identical to those served by the Discourse API.
Storing the data in this format first has a couple of key benefits over generating HTML files directly:
- More Complete Archives - Storing virtually all of the information available in the Discourse API in its original format helps to ensure that the resulting archives are as complete as possible. Even in cases where a viewable website might not be available, the JSON data by itself contains all of the real content of the forum.
- Easy HTML Rebuilding - Storing the JSON data directly also allows an HTML archive website to be rebuilt later, so the look and feel of the web archive can be improved or changed without redownloading any content from the original forum. This is particularly useful when a download is performed prior to a forum's closure, as was the case with HaloWaypoint, where redownloading the content is simply not possible.
The general process the downloader goes through is as follows:
- Build a list of all categories
- For each category, build a complete list of topic URLs
- Download each topic, taking care to fetch any additional posts, since by default the API only provides the first 20 posts with the topic information
- Depending on configuration, perform either a basic or in-depth sanity check:
  - Basic - The basic sanity check simply ensures that all topic and post counts line up with what was originally pulled from the API. This check is quicker, but it doesn't verify anything on disk.
  - In-Depth - An in-depth sanity check goes through all data and ensures that every topic and post is actually present on disk. If any content is found to be missing or incomplete during this check, the application attempts to re-download it, which avoids having to run the entire download process again.
- After the sanity checks have passed, the downloader continues on to download user profiles, if enabled in the configuration.
- After all user profiles are downloaded, miscellaneous site information (site.json, tags, groups, etc.) is downloaded, if enabled in the configuration.
- Once all operations are complete, the downloader finishes and the HTML builder runs, if the builder is enabled in the configuration.
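The topic download and sanity-check steps above can be sketched in Python as follows. This is a minimal illustration, not DiscourseDownloader's actual code. It assumes the topic JSON carries a `post_stream` object whose `posts` array holds the posts returned inline and whose `stream` array lists every post ID in the topic. The one-file-per-post layout (`<post id>.json`), the function names, and the batch size are all hypothetical.

```python
import tempfile
from pathlib import Path


def remaining_post_ids(topic: dict) -> list[int]:
    """Return IDs listed in the topic's stream but not included inline."""
    stream = topic["post_stream"]["stream"]
    included = {post["id"] for post in topic["post_stream"]["posts"]}
    return [post_id for post_id in stream if post_id not in included]


def batches(ids: list[int], size: int = 20) -> list[list[int]]:
    """Split the IDs into request-sized groups (size is an assumption)."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def missing_on_disk(topic: dict, post_dir: Path) -> list[int]:
    """In-depth check: stream IDs with no '<id>.json' file in post_dir.

    The one-file-per-post layout is hypothetical.
    """
    return [
        post_id
        for post_id in topic["post_stream"]["stream"]
        if not (post_dir / f"{post_id}.json").is_file()
    ]


if __name__ == "__main__":
    # A fake topic with 45 posts, of which only the first 20 arrived inline.
    topic = {
        "post_stream": {
            "posts": [{"id": i} for i in range(1, 21)],
            "stream": list(range(1, 46)),
        }
    }
    todo = remaining_post_ids(topic)
    print(len(todo), [len(b) for b in batches(todo)])  # 25 [20, 5]

    with tempfile.TemporaryDirectory() as tmp:
        post_dir = Path(tmp)
        for post_id in range(1, 44):  # pretend posts 44 and 45 failed to save
            (post_dir / f"{post_id}.json").write_text("{}")
        print(missing_on_disk(topic, post_dir))  # [44, 45]
```

Each batch would then be requested from the forum and written to disk, and any IDs reported by the in-depth check would be queued for another download attempt rather than restarting the whole run.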
Many features of the downloader can be controlled via the website.cfg file, which you can read more about here.
HTML Builder
The HTML builder is the second step in the archive process. It goes through all content and creates a fully functional website from the downloaded JSON data, designed to be as easy to use as possible.
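As a rough illustration of what building a page from the stored JSON could look like, here is a minimal Python sketch. It is not the builder's real implementation or output format. It assumes each topic file has a `title` and a `post_stream.posts` array, and that each post carries `post_number`, `username`, and a `cooked` field containing already-rendered HTML. The function name and page layout are hypothetical.

```python
import html


def render_topic(topic: dict) -> str:
    """Build one standalone HTML page from a topic's stored JSON."""
    title = html.escape(topic["title"])
    lines = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        f"<title>{title}</title></head>",
        f"<body><h1>{title}</h1>",
    ]
    for post in topic["post_stream"]["posts"]:
        number = int(post["post_number"])
        author = html.escape(post["username"])
        lines.append(f'<article id="post-{number}">')
        lines.append(f"<h2>#{number} by {author}</h2>")
        # "cooked" is assumed to be rendered HTML already, so it is not escaped.
        lines.append(post["cooked"])
        lines.append("</article>")
    lines.append("</body></html>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {
        "title": "Tips & tricks",
        "post_stream": {
            "posts": [
                {"post_number": 1, "username": "alice", "cooked": "<p>Hello</p>"}
            ]
        },
    }
    print(render_topic(sample))
```

Because the page is derived entirely from the JSON on disk, a function like this can be changed and rerun at any time without contacting the original forum.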