Configuration & Switches
DiscourseDownloader features a wide range of configuration options, allowing you to customize your archive in as many ways as possible.
In most cases, the default options will be sufficient. However, if you'd like to further customize your archive, read below for the complete list of available options.
website.cfg
This is the primary configuration file used to control most aspects of the application. The available options are grouped into "sections", making the file easier to read.
Website Config Basic Download Settings (website_config)
This section contains most of the options that a typical user would need to adjust.
Name | Type | Default Value | Description |
---|---|---|---|
website_url | string | https://forums.halowaypoint.com | The base URL of the forum to download. |
site_directory_root | string | ./forums.halowaypoint.com/ | The path to store all downloaded JSON and built HTML content. Can be relative or absolute. |
skip_download | bool | false | Whether or not to skip the download step. Set this to true if you've already performed a complete download and simply want to build (or rebuild) the HTML website from that original data. |
download_users | bool | true | Whether or not to download user profiles and related data. |
download_topics | bool | true | Whether or not to download forum topics and related data. |
download_misc | bool | true | Whether or not to download miscellaneous site information. |
perform_html_build | bool | false | Whether or not to perform the HTML website build step. If you wish to download content from a website, but do not want to build the HTML website, set this to false. |
Networking Settings (networking)
This section contains options relating to networking features of the application. These are used to control and alter how the application interacts with the forum's server, handles errors, and so on.
Name | Type | Default Value | Description |
---|---|---|---|
max_http_retries | int | 60 | The maximum number of times a single request will be retried before giving up and returning an error. Typically, if this limit is reached and an error code is returned, the requested content is skipped. Note that this option's value can be overridden in some cases by other options (such as fail_on_403, fail_on_404, and max_404s). |
http_retry_use_backoff | bool | true | Whether or not to use a backoff factor when experiencing failed requests. If enabled, the application will wait an increasing amount of time before retrying a request. The retry delay is calculated as follows: delay = http_backoff_increment * (retry_count + 1) |
http_backoff_increment | int | 5 | The amount of time that should be added to the retry delay after each failed retry. Only applies when http_retry_use_backoff is set to true. |
override_user_agent | bool | false | Whether or not to use a custom user agent. While generally unnecessary, it could help in cases where a forum blocks unrecognized user agents, like the one that DiscourseDownloader uses by default. The default user agent for DiscourseDownloader is DiscourseDL v{VER}, where {VER} is replaced with the application version, such as 1.0.0. |
user_agent | string | | The custom user agent string to use. Only used if override_user_agent is set to true. |
request_retry_delay | int | 5 | The standard delay for retrying a request. If http_retry_use_backoff is enabled, then this will only be used for the first failed retry. Otherwise, this delay will be used for each failed retry. |
fail_on_403 | bool | true | If enabled, an HTTP 403 response will automatically be treated as a failure, and will not be retried. This defaults to true, because the most likely reason that a 403 would be encountered is due to contacting an API which the user does not have access to. If you are downloading your own forum and experience a 403, you may look into enabling certain API features for guest users. Alternatively, you may also try using the cookie settings (detailed further down). |
fail_on_404 | bool | false | If enabled, an HTTP 404 response will automatically be treated as a failure, and will not be retried. |
max_404s | int | 5 | The maximum number of HTTP 404 responses that can be encountered before treating it as a failure. Has no effect if fail_on_404 is enabled. |
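The interaction between the retry options above can be sketched roughly as follows. This is an illustrative Python sketch, not the application's actual code: the function names `retry_delay` and `should_retry` are hypothetical, the delay unit is left unspecified because the table does not state it, and counting 404 responses per request is an assumption.

```python
def retry_delay(retry_count: int,
                http_retry_use_backoff: bool = True,
                http_backoff_increment: int = 5,
                request_retry_delay: int = 5) -> int:
    """Return the delay before the next retry (retry_count starts at 0)."""
    if not http_retry_use_backoff:
        # Without backoff, the standard delay is used for every retry.
        return request_retry_delay
    if retry_count == 0:
        # Assumption: the first retry uses the standard delay, as the
        # request_retry_delay description suggests.
        return request_retry_delay
    # Documented formula: delay = http_backoff_increment * (retry_count + 1)
    return http_backoff_increment * (retry_count + 1)


def should_retry(status: int,
                 retry_count: int,
                 seen_404s: int,
                 max_http_retries: int = 60,
                 fail_on_403: bool = True,
                 fail_on_404: bool = False,
                 max_404s: int = 5) -> bool:
    """Decide whether a failed request should be attempted again."""
    if status == 403 and fail_on_403:
        return False  # treated as an immediate failure
    if status == 404:
        if fail_on_404:
            return False  # max_404s has no effect in this case
        if seen_404s >= max_404s:
            return False  # assumption: 404s are counted per request
    return retry_count < max_http_retries
```

With the default values, the delays for the first three retries would come out as 5, 10, and 15.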
Download Settings (download)
This section contains options for fine-tuning how content is downloaded, as well as options for controlling how partial downloads are handled.
Name | Type | Default Value | Description |
---|---|---|---|
resume_download | bool | true | Whether or not to attempt resuming a partial download. |
enable_url_caching | bool | true | Whether or not to cache URL lists to disk after they are collected. When downloading certain content (such as topics), the application will build a large initial list of item URLs before downloading any individual items. In order to save time during a resume, this URL list can be stored on disk. Keep in mind, however, that if this URL cache is used, the application will NOT attempt to fetch any newer URLs, so there is the possibility of missing content. |
enable_data_caching | bool | true | Whether or not to cache certain data to disk during the download process. When downloading a very large forum, the application's memory usage can become extreme by default. Enabling this option does slightly increase the overall runtime, but allows the application to free up memory after certain steps are finished (i.e., a forum category is fully downloaded). This data is then loaded from disk again later during the sanity checks as needed (detailed below). |
delete_caches_on_finish | bool | false | Whether or not to delete URL or data caches after a download has completed. Not yet implemented. |
redownload_if_missing_cache | bool | false | Not used. |
sanity_check_on_finish | bool | true | Whether or not to perform sanity checks after certain download steps are completed. By default, this option only enables the basic sanity check, which simply verifies that the topic and post counts in the downloaded topic list (in memory) match the topic/post counts reported by the Discourse API. A more in-depth sanity check can be performed by enabling thorough_sanity_check. See below for more information. |
thorough_sanity_check | bool | true | Whether or not to perform the in-depth sanity check. This more advanced check will actually go through all data in memory and ensure that a .json file for each topic and post exists within each category. In the event that topics or posts are missing, the application will try to re-download that content. |
download_skip_existing_categories | bool | false | Whether or not to skip existing category folders when downloading. If enabled, the application will check for a category folder prior to any download steps. If the folder exists, the category will be skipped. This is disabled by default, as it could potentially result in categories being incomplete, and thus, result in an incomplete download. |
download_skip_existing_topics | bool | false | Whether or not to skip existing topic folders when downloading. If enabled, the application will check for a topic folder prior to downloading the topic (and its post data). If the folder exists, the topic will be skipped. This is disabled by default, as it could potentially result in topics being incomplete and missing posts, and thus, result in an incomplete download. |
download_skip_existing_posts | bool | true | Whether or not to skip existing topic posts when downloading. If enabled, existing post files will be skipped when downloading a topic. While this does pose the potential risk of skipping edited posts, this can help reduce download times significantly. If you are not concerned about reducing download times, or simply want to know for sure that any edits made to any posts are downloaded, set this to false. Note that this will effectively disable download resuming, as all existing content will be downloaded again. |
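As a rough illustration of the skip-existing behavior described for download_skip_existing_posts, the check amounts to testing whether a post file is already on disk. The sketch below is in Python and rests on assumptions: the helper name `needs_download` and the `1.json` file name are hypothetical, and the real on-disk layout is not described here.

```python
import tempfile
from pathlib import Path


def needs_download(post_file: Path, download_skip_existing_posts: bool = True) -> bool:
    """Return True if the post should be fetched from the server."""
    if download_skip_existing_posts and post_file.is_file():
        # An existing file is trusted as-is, so later edits to the post
        # would not be picked up.
        return False
    return True


# Small demonstration using a temporary directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    post = Path(tmp) / "1.json"
    before = needs_download(post)   # file does not exist yet
    post.write_text("{}")
    after = needs_download(post)    # file exists and skipping is enabled
    forced = needs_download(post, download_skip_existing_posts=False)

print(before, after, forced)
```

The trade-off is the one noted in the table: skipping saves requests on a resume, while disabling it re-fetches every post.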
Forum Topic Download Settings (forums)
This section contains options specific to downloading forum categories, topics, and posts.
Name | Type | Default Value | Description |
---|---|---|---|
max_get_more_topics | int | -1 | Unused. |
max_posts_per_request | int | 20 | Unused. |
topic_url_collection_notify_interval | int | 15 | Controls how often update messages are printed to console and the log file when downloading topic URLs. After this many topic URL requests have been performed, a notification will be posted. |
download_subcategory_topics | bool | false | Whether or not to include subcategory topic URLs when building a topic URL list for a category. This should generally be left disabled, as subcategories are downloaded separately into their own folders. With this enabled, a potentially large amount of content will be duplicated, both in the downloaded JSON data and the resulting HTML archive website. |
use_category_id_filter | bool | false | Whether or not to use the configured category ID filter when downloading categories. If enabled, only the category IDs listed in the category_id_filter option will be downloaded. Note that this behavior is reversed when use_filter_as_blacklist is enabled. |
category_id_filter | string | 5,10 | A list of category IDs to download, separated by commas. Any other categories are skipped. Only used if use_category_id_filter is enabled. |
use_filter_as_blacklist | bool | false | If enabled, reverses the behavior of the category ID filter. The filter will instead act as a blacklist - meaning that any categories listed will be excluded from the download, and all other categories will be downloaded. |
strict_topic_count_checks | bool | false | Whether or not topic counts should match exactly when performing topic count checks. If disabled, a downloaded topic count that is larger than the topic count reported by the API will not be treated as a mismatch. This option is disabled by default, as it appears that the Discourse API will sometimes not report pinned topics within the total topic count for a category. |
download_all_tag_extras | bool | false | Whether or not to download the complete list of topics that have a particular tag. This is usually unnecessary, as each topic is already downloaded separately. |
max_skipped_topic_urls | int | 100 | The maximum number of topic URLs to skip before stopping topic URL list building. Topic URLs are only counted as skipped when the category ID does not match (i.e., when downloading a category with subcategories). After this many skipped URLs, it is assumed that all remaining topic URLs belong to subcategories, rather than the parent category. |
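The category filter and topic count options in the table above can be illustrated with a short Python sketch. The function names are hypothetical, and it is an assumption that use_filter_as_blacklist has no effect while use_category_id_filter is disabled.

```python
def parse_category_filter(raw: str) -> set[int]:
    """Parse a comma-separated list such as "5,10" into a set of IDs."""
    return {int(part) for part in raw.split(",") if part.strip()}


def should_download_category(category_id: int,
                             use_category_id_filter: bool = False,
                             category_id_filter: str = "5,10",
                             use_filter_as_blacklist: bool = False) -> bool:
    """Apply the whitelist/blacklist rules to a single category ID."""
    if not use_category_id_filter:
        return True  # filter disabled: every category is downloaded
    listed = category_id in parse_category_filter(category_id_filter)
    # Whitelist: only listed IDs pass. Blacklist: only unlisted IDs pass.
    return not listed if use_filter_as_blacklist else listed


def topic_count_matches(downloaded: int, reported: int,
                        strict_topic_count_checks: bool = False) -> bool:
    """Compare the downloaded topic count with the API-reported count."""
    if strict_topic_count_checks:
        return downloaded == reported
    # Non-strict: extra topics (e.g. unreported pinned topics) are tolerated.
    return downloaded >= reported
```

For example, with the filter enabled and set to 5,10, category 5 is downloaded in whitelist mode and skipped in blacklist mode.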
User Profile Download Settings (users)
This section contains options for controlling how user profiles are downloaded.
Name | Type | Default Value | Description |
---|---|---|---|
download_all_user_actions | bool | true | Whether or not to download all user actions. If enabled on a large forum, this could substantially increase the time required to download all profiles. |
download_all_avatar_sizes | bool | true | Whether or not to download all avatar sizes. If disabled, only the highest resolution avatar available (360x360) will be downloaded. |
download_private_messages | bool | false | Whether or not to attempt downloading a user's private messages. Not yet implemented. |
Local Directory Settings (paths)
This section contains options for determining where downloaded content is stored on disk.
Name | Type | Default Value | Description |
---|---|---|---|
html_dir | string | export/ | The directory used to store generated HTML archive content. This is relative to site_directory_root. |
json_dir | string | json/ | The directory used to store downloaded JSON data from the API. This is relative to site_directory_root. |
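Since both directories are relative to site_directory_root, the resulting locations can be worked out by joining the paths. The Python sketch below uses the documented defaults; the helper name `resolve_dirs` is hypothetical and this is not the application's own path handling.

```python
from pathlib import Path


def resolve_dirs(site_directory_root: str = "./forums.halowaypoint.com/",
                 html_dir: str = "export/",
                 json_dir: str = "json/") -> tuple[Path, Path]:
    """Return the (HTML, JSON) directories under the site root."""
    root = Path(site_directory_root)
    return root / html_dir, root / json_dir


html_path, json_path = resolve_dirs()
print(html_path.as_posix())  # forums.halowaypoint.com/export
print(json_path.as_posix())  # forums.halowaypoint.com/json
```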
Cookie Settings (cookies)
This section contains options for specifying cookies. These can be used to provide authentication with the API under a specific user account. This may be desired in cases where you want to download content that is only available when logged in, or only accessible to certain groups.
Name | Type | Default Value | Description |
---|---|---|---|
cookie_name | string | _t | The name of the cookie to provide to the server. |
cookie | string | | The value of the cookie to provide to the server. |
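To illustrate how these two options map onto an HTTP request, the sketch below builds a request carrying a `Cookie` header with Python's standard library. The helper name `build_request` and the example URL are hypothetical, and treating an empty cookie value as "send no cookie" is an assumption.

```python
from urllib.request import Request


def build_request(url: str, cookie_name: str = "_t", cookie: str = "") -> Request:
    """Create a request that sends the configured cookie, if one is set."""
    headers = {}
    if cookie:
        # A Cookie header takes the form "name=value".
        headers["Cookie"] = f"{cookie_name}={cookie}"
    return Request(url, headers=headers)


# Hypothetical forum URL and cookie value, for demonstration only.
req = build_request("https://forums.example.com/latest.json", cookie="abc123")
print(req.get_header("Cookie"))
```

The cookie value grants whatever access the logged-in account has, so it is worth keeping out of shared copies of the configuration file.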
Miscellaneous Settings (misc)
This section contains miscellaneous options that don't fit into the other categories.
Name | Type | Default Value | Description |
---|---|---|---|
disable_long_finish_message | bool | false | Whether or not to disable the long message shown after the application has finished all tasks. The extended message provides information on uploading and distributing the resulting archive on websites such as archive.org, as well as other useful information for those new to the application and/or to website archival. If set to true, a shorter message is printed to the console instead. |
log_level_debug | bool | false | Whether or not to print log level debug messages in console during startup. This will show a sample message in each log level. Used for debug/development purposes. |
Command-Line Switches
In addition to the configuration file, the application also allows for certain options to be controlled via command-line switches. All available switches are detailed below.
Name | Flags | Description |
---|---|---|
-config_debug | Instructs the application to show additional debug messages when reading configuration files. |