This repository includes approximately 784 terabytes (8.7 million files) of public domain data from the Smithsonian Institution's Open Access collection. Sourced from more than 20 libraries, museums, and research centers across the Smithsonian, this archive is updated weekly.
This is a mirror of public domain data archived from the Smithsonian Institution's Open Access S3 bucket. The source data can also be searched via the Smithsonian's Collections Search Center by limiting results to CC0 media. We look forward to enhancing the usability and discoverability of this data in the coming months.
This repository is maintained by the Library Innovation Lab at Harvard Law School Library as part of our Public Data Project.
At present, this repository mirrors the directory structure used by the Smithsonian:
Each root-level directory contains a different type of data: 3d contains 3D models, media contains images, and metadata contains metadata for all objects. For more information on working with a given type of data, please read the corresponding section below.
3D models are organized solely by identifier. Unlike images and metadata, they are not grouped by Smithsonian unit code. Each subdirectory under 3d is named for an object identifier and may contain several files, including 3D geometry files (GLB, GLTF, OBJ) and other material:
Also included in each 3d subdirectory is scene.svx.json, an SVX file comprising Smithsonian Voyager scene information as well as general object metadata. This metadata typically includes the object's name, description, and accession number, as well as associated links and identifiers.
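As a rough illustration of pulling descriptive fields out of a downloaded scene.svx.json, the sketch below walks the parsed JSON and collects every value stored under a given key. It does not assume any particular SVX layout; the key name passed in (for example "title") is a hypothetical choice, and the keys actually present should be checked against a real file.

```python
import json
from typing import Any, Iterator


def find_values(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in a parsed JSON tree."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending, since the same key may appear at deeper levels.
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def load_scene(path: str) -> Any:
    """Parse a local copy of scene.svx.json (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

With a local copy in hand, `list(find_values(load_scene("scene.svx.json"), "title"))` would return whatever values sit under that key, or an empty list if the key is absent.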
Images are organized by Smithsonian unit code. Each subdirectory under media is named for a Smithsonian unit and contains JPEG and TIFF images for that unit:
There is typically, though not always, a high-resolution TIFF for every JPEG and vice versa. Each image file is referenced in an associated metadata record.
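Because the JPEG/TIFF pairing is not guaranteed, it can be useful to check a file listing for images that lack a counterpart. The sketch below assumes that paired files share the same path and base name and differ only in extension; that naming convention is an assumption, not something this README states, and the extension sets are likewise illustrative.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Extensions treated as JPEG or TIFF (assumed; adjust to the actual listing).
JPEG_EXTS = {".jpg", ".jpeg"}
TIFF_EXTS = {".tif", ".tiff"}


def unpaired_images(names):
    """Return (jpeg_only, tiff_only): base names that have only one format."""
    kinds = defaultdict(set)
    for name in names:
        path = PurePosixPath(name)
        ext = path.suffix.lower()
        base = str(path.with_suffix(""))
        if ext in JPEG_EXTS:
            kinds[base].add("jpeg")
        elif ext in TIFF_EXTS:
            kinds[base].add("tiff")
        # Any other file type is ignored.
    jpeg_only = sorted(b for b, k in kinds.items() if k == {"jpeg"})
    tiff_only = sorted(b for b, k in kinds.items() if k == {"tiff"})
    return jpeg_only, tiff_only
```

The function takes any iterable of object names, such as the output of a bucket listing, and does not touch the network itself.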
Metadata is organized by Smithsonian unit code and grouped in large text files containing line-delimited JSON records:
Also included in each subdirectory is index.txt, an index file listing all the metadata files for that directory.
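Line-delimited JSON can be read one record at a time, which avoids loading a whole metadata file into memory. The following is a minimal sketch; it assumes index.txt holds one entry per line and that the metadata files are uncompressed UTF-8 text, neither of which is confirmed here.

```python
import json
from typing import Iterable, Iterator


def parse_index(text: str) -> list[str]:
    """Split the contents of index.txt into entries, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Lazily parse line-delimited JSON, yielding one record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```

An open file object is itself an iterable of lines, so `for record in iter_records(open(path, encoding="utf-8")): ...` processes a downloaded file without holding it all in memory.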
The repository includes more than 17 million metadata records, totaling over 47 GB. As a consequence, the metadata files are quite large, and querying them is memory- and time-intensive. If analysis is your goal, we recommend downloading a relevant subset of files and then querying them using a database or an efficient data storage format such as Parquet.
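One way to follow the database suggestion with nothing beyond the Python standard library is to load a downloaded subset into SQLite. This is a sketch under assumptions: the field names "id" and "unitCode" are hypothetical stand-ins for whatever keys the records actually use, and the full record is kept as JSON text so that nothing is lost.

```python
import json
import sqlite3
from typing import Iterable


def load_records(conn: sqlite3.Connection, lines: Iterable[str]) -> int:
    """Insert line-delimited JSON records into SQLite; return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (id TEXT, unit TEXT, body TEXT)"
    )
    rows = (
        # "id" and "unitCode" are assumed field names; missing keys become NULL.
        (rec.get("id"), rec.get("unitCode"), json.dumps(rec))
        for rec in (json.loads(line) for line in lines if line.strip())
    )
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

Once loaded, ordinary SQL such as `SELECT unit, COUNT(*) FROM records GROUP BY unit` runs against the on-disk database rather than against the raw text files.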
To download an individual data object by name, copy its source URL from the user interface:
To download large numbers of files, we recommend using tools such as the AWS CLI or Rclone to access the S3 endpoint directly:
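For scripted bulk downloads, the AWS CLI invocation can be assembled programmatically. The sketch below only builds the argument list for `aws s3 sync`; the endpoint URL, bucket name, and prefix are placeholders supplied by the caller and are not this repository's actual values, and whether unsigned (anonymous) requests are accepted is an assumption to verify.

```python
def build_sync_command(endpoint_url, bucket, prefix, dest,
                       include=None, anonymous=False):
    """Build an `aws s3 sync` argument list for an S3-compatible endpoint."""
    cmd = [
        "aws", "s3", "sync",
        f"s3://{bucket}/{prefix}", dest,
        "--endpoint-url", endpoint_url,
    ]
    if anonymous:
        # Skip request signing; only works if the endpoint allows public reads.
        cmd.append("--no-sign-request")
    if include:
        # Exclude everything, then re-include only the matching pattern.
        cmd += ["--exclude", "*", "--include", include]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Restricting the prefix to one directory, or the include pattern to one file type, keeps the transfer to the subset actually needed.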
Collection of the files in this repository began in August 2025. The repository is updated weekly to mirror additions to the Smithsonian Institution's Open Access S3 bucket.
Here is a list of Smithsonian Institution unit codes used to organize parts of this collection: