Cesar Aybar¹ • Shirin Ermis² • Lilli Freischem² • Stella Girtsou³⁴ • Kyriaki-Margarita Bintsi⁵
Emiliano Diaz Salas-Porras¹ • Michael Eisinger⁶ • William Jones² • Anna Jungbluth⁶ • Benoit Tremblay⁷
¹Universitat de València, ²University of Oxford, ³National Observatory of Athens, ⁴National Technical University of Athens,
⁵Harvard Medical School & Massachusetts General Hospital, ⁶European Space Agency, ⁷Environment and Climate Change Canada
This repository contains the AI-ready datasets presented in the paper "A Global, AI-Ready Dataset for 3D Cloud reconstruction". This foundational dataset provides a large-scale collection of paired 2D multispectral geostationary (GEO) satellite imagery and co-located 3D vertical cloud property profiles from the CloudSat radar. The data is curated to train deep learning models for 3D cloud structure reconstruction and is organized into three distinct components:
The model inputs are 2D patches from three geostationary satellites, which together provide near-global coverage:
To create a unified, sensor-agnostic input for the models, the 11 spectrally closest channels across the three sensors (a mix of reflectances and brightness temperatures) are selected from each satellite, normalized, and stacked into the input tensor.
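The text above does not say which normalization is applied. As an illustration only, the sketch below assumes a per-channel z-score, which puts reflectances (roughly 0 to 1) and brightness temperatures (hundreds of kelvin) on a comparable scale. The channel names and pixel values are made up, and the statistics are computed from the toy patch itself; a real pipeline would more likely use fixed statistics computed over the training set.

```python
from statistics import fmean, pstdev

def normalize_channel(values, mean=None, std=None):
    """Z-score one channel; statistics default to those of the given values."""
    mean = fmean(values) if mean is None else mean
    std = pstdev(values) if std is None else std
    if std == 0:
        # A constant channel carries no contrast; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def normalize_patch(patch):
    """Normalize each channel of a {channel_name: flat list of pixels} patch."""
    return {name: normalize_channel(pixels) for name, pixels in patch.items()}

# Toy patch (hypothetical names and values): one reflectance channel and
# one brightness-temperature channel, four pixels each.
patch = {
    "reflectance_0.6um": [0.10, 0.30, 0.50, 0.70],
    "bt_10.8um": [220.0, 240.0, 260.0, 280.0],
}
normalized = normalize_patch(patch)
```

After this step both channels have zero mean and unit standard deviation, so neither dominates the input tensor purely because of its physical units.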
The ground truth data (the prediction target) consists of vertical profiles from the CloudSat CPR 2B products. These profiles are spatiotemporally aligned to the nearest GEO pixel. The primary target variables provided are:
The splits distinguish between general cloud scenes and targeted tropical cyclone (TC) events for both pre-training and fine-tuning/evaluation.
The complete dataset is hosted on Amazon Web Services (AWS) S3 via the source.coop public data initiative. You can download the data using the tacoreader API, or any other tool compatible with AWS S3.
We provide Google Colab notebooks to help you get started with loading, exploring, and visualizing the data. These notebooks demonstrate how to read the Zarr files and plot the input (GEO) and target (CloudSat) data.
This dataset was created using a multi-step pipeline to precisely align 2D geostationary (GEO) imagery with 3D CloudSat vertical profiles. The key steps included:
Co-location: We identified every CloudSat overpass that fell inside the field of view of a GEO satellite (GOES, MSG, Himawari) within a 5-minute time window.
Alignment: The narrow 1D vertical profiles from CloudSat were spatially mapped to their nearest corresponding 2D GEO satellite pixel.
Patch Extraction: 256×256-pixel patches were extracted from the GEO imagery, centered on the aligned CloudSat track.
Quality Filtering: Patches were filtered to ensure data quality, including a requirement that at least 25% of the vertical columns in a patch contained clouds.
TC Dataset Creation: A dedicated benchmark dataset was created by intersecting the aligned pairs with the International Best Track Archive for Climate Stewardship (IBTrACS) database, isolating overpasses of tropical cyclones.
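Three of the steps above (co-location, patch extraction, and quality filtering) can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline actually used to build the dataset. All function names are hypothetical. The 5-minute window is interpreted here as an absolute time difference of at most 5 minutes, which is an assumption. The spatial alignment and IBTrACS steps are left out.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

MAX_TIME_GAP = timedelta(minutes=5)  # co-location window (assumed to mean |dt| <= 5 min)
PATCH_SIZE = 256                     # patch edge length in pixels
MIN_CLOUDY_FRACTION = 0.25           # minimum share of cloudy vertical columns

def nearest_scan(geo_times, overpass_time):
    """Index of the GEO scan closest in time, or None if none is within the window.

    geo_times must be sorted in ascending order.
    """
    if not geo_times:
        return None
    i = bisect_left(geo_times, overpass_time)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(geo_times)]
    best = min(candidates, key=lambda j: abs(geo_times[j] - overpass_time))
    if abs(geo_times[best] - overpass_time) <= MAX_TIME_GAP:
        return best
    return None

def patch_bounds(center_row, center_col, n_rows, n_cols, size=PATCH_SIZE):
    """(row0, row1, col0, col1) of a size x size patch centered on a pixel.

    Returns None when the patch would extend past the image edge.
    """
    half = size // 2
    row0, col0 = center_row - half, center_col - half
    row1, col1 = row0 + size, col0 + size
    if row0 < 0 or col0 < 0 or row1 > n_rows or col1 > n_cols:
        return None
    return row0, row1, col0, col1

def passes_cloud_filter(cloudy_columns, min_fraction=MIN_CLOUDY_FRACTION):
    """cloudy_columns holds one boolean per vertical column in the patch."""
    if not cloudy_columns:
        return False
    return sum(cloudy_columns) / len(cloudy_columns) >= min_fraction

# Made-up example: GEO scans every 15 minutes and one CloudSat overpass.
scans = [datetime(2016, 1, 1, 12, 0) + timedelta(minutes=15 * k) for k in range(4)]
idx = nearest_scan(scans, datetime(2016, 1, 1, 12, 18))
bounds = patch_bounds(1000, 1000, 3712, 3712)  # hypothetical image size
keep = passes_cloud_filter([True] * 80 + [False] * 176)
```

In this toy case the overpass at 12:18 is matched to the 12:15 scan (3 minutes apart), the patch spans rows and columns 872 to 1128, and the patch is kept because 80 of 256 columns (about 31%) are cloudy.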
This dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the data for any purpose, even commercially, as long as you give appropriate credit.
This work was enabled by the Frontier Development Lab (FDL) Earth Systems Lab, a public-private partnership between the European Space Agency (ESA), Trillium Technologies, and the University of Oxford. We acknowledge the support of the CloudSat mission team at NASA JPL for providing access to the CloudSat data products. We also thank the teams behind the GOES, Himawari, and Meteosat satellites for their invaluable data. This research was supported by computational resources from Google Cloud, Scan Computers, Nvidia Corporation, and Pasteur Labs.