The dataset is distributed as highly compressed Zstandard (.zst) files containing newline-delimited JSON (NDJSON), with one JSON object per line.
Example Python scripts and tools for parsing and handling this data can be found in the Watchful1/PushshiftDumps repository on GitHub: https://github.com/Watchful1/PushshiftDumps
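As a rough illustration of the format described above, the following minimal sketch streams one dump file and yields a parsed object per line. It is not taken from the PushshiftDumps repository. The function names `iter_ndjson` and `iter_zst_ndjson` are hypothetical, the third-party `zstandard` package is assumed to be installed, and the raised `max_window_size` reflects the assumption that the files were compressed with a long window.

```python
import io
import json


def iter_ndjson(binary_stream, encoding="utf-8"):
    """Yield one parsed JSON object per non-empty line of a binary stream."""
    # TextIOWrapper decodes incrementally, so multi-byte characters that
    # straddle a read boundary are handled correctly.
    text = io.TextIOWrapper(binary_stream, encoding=encoding)
    for line in text:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_zst_ndjson(path):
    """Stream-decompress a .zst file and yield its JSON objects one by one."""
    # Third-party dependency (pip install zstandard), imported lazily so the
    # pure-Python parser above stays usable without it.
    import zstandard

    # Assumption: the dumps use a long compression window, so the decoder's
    # window limit is raised to 2 GiB.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        yield from iter_ndjson(reader)


# Example usage with a hypothetical file name:
# for obj in iter_zst_ndjson("RC_2005-12.zst"):
#     print(obj)
```

Because both functions are generators, only a small part of the decompressed data is held in memory at a time, which matters for files that expand to many gigabytes.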
License
No license specified. Please note that the content consists of user-generated data scraped from Reddit. The underlying textual work may be protected by copyright. Researchers should use this data responsibly and consider Reddit's API and data usage guidelines.
Citation
If you use this dataset in your research or academic work, you can reference it using the following BibTeX entry:
@misc{reddit_archive_2005_2025,
  title = {Reddit comments/submissions 2005-06 to 2025-12},
  author = {stuck_in_the_matrix and Watchful1 and RaiderBDev},
  abstract = {Reddit comments and submissions from 2005-06 to 2025-12 collected by pushshift and u/RaiderBDev. These are zstandard compressed ndjson files. Example python scripts for parsing the data can be found here https://github.com/Watchful1/PushshiftDumps The more recent dumps are collected by u/RaiderBDev},