This dataset contains temporal Harmonized Landsat and Sentinel-2 (HLS) imagery of diverse land covers across the Contiguous United States (CONUS) for the year 2022 along with binary cloud masks for the same area and year.
Data catalogs containing metadata for both HLS scenes and cloud masks are provided. The catalog for HLS scenes is provided in chip_catalog.csv and the catalog for cloud masks is provided in cloud_catalog.csv.
cloud_catalog.csv contains the following columns:
chip_catalog.csv contains the following columns:
Invalid pixels are those which intersect with any QA mask.
The ground truth HLS scenes are stored in GeoTIFF format under train/hls_chips/ and test/hls_chips/. Each GeoTIFF file covers a 224 x 224 pixel area at 30 m spatial resolution. Each file contains 18 bands: 6 spectral bands for each of 3 time steps, stacked together. The file name structure is chip_XXX_YYY.tif, where XXX and YYY refer to the row and column of a tile grid imposed on CONUS. Since the dataset is sampled from this country-wide grid, not all values of XXX and YYY are present in the dataset.
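As a rough illustration of the naming and band layout described above, the sketch below parses a chip file name and maps a (time step, spectral band) pair to a 1-based band index in the 18-band stack. The helper names are hypothetical, and the time-major ordering (all six bands of the first time step, then the second, then the third) is an assumption that should be checked against the files before relying on it.

```python
import re

N_STEPS = 3   # time steps per chip
N_BANDS = 6   # spectral bands per time step

# Matches names such as "chip_373_294.tif"; the grid indices are digit runs.
CHIP_RE = re.compile(r"^chip_(\d+)_(\d+)\.tif$")


def parse_chip_name(name):
    """Return (row, col) of the tile grid encoded in a chip file name."""
    match = CHIP_RE.match(name)
    if match is None:
        raise ValueError(f"not a chip file name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def band_index(step, band):
    """1-based index into the 18-band stack (ASSUMES time-major order)."""
    if not (0 <= step < N_STEPS and 0 <= band < N_BANDS):
        raise IndexError("step or band out of range")
    return step * N_BANDS + band + 1
```

Under this assumed ordering, band_index(1, 0) points at the first spectral band of the second time step.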
Testing scenes are pre-masked to ensure that all models are evaluated using the same test set. The masked scenes are stored in GeoTIFF format under test/hls_masked/. The file name structure for masked scenes is chip_XXX_YYY_masked.tif. Each masked scene corresponds to the ground truth scene with the same value of XXX_YYY in the file name. For example, the file test/hls_chips/chip_373_294.tif is the ground truth for test/hls_masked/chip_373_294_masked.tif, with the latter having values of 0 at cloud-masked locations. Cloud masks are present in all possible combinations of time steps in equal proportion. Given time steps t1, t2, and t3, the seven possible combinations are: t1 only; t2 only; t3 only; t1 and t2; t1 and t3; t2 and t3; and t1, t2, and t3.
So, for example, 1/7th of test scenes are masked at ONLY t2, and 1/7th at t2 AND t3, etc. for each of the possible combinations.
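The seven combinations behind the 1/7th proportion are simply the non-empty subsets of the three time steps. A minimal sketch that enumerates them (the variable names are illustrative, not taken from the dataset code):

```python
from itertools import combinations

TIME_STEPS = ("t1", "t2", "t3")

# Every non-empty subset of the three time steps, smallest subsets first.
MASK_COMBINATIONS = [
    combo
    for size in range(1, len(TIME_STEPS) + 1)
    for combo in combinations(TIME_STEPS, size)
]

for combo in MASK_COMBINATIONS:
    print(" AND ".join(combo))
```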
Cloud masks for the test scenes range from 0.01% coverage to 100% coverage, and are equally sampled from 10 equally sized bins between 0% and 100%.
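To make the binning concrete, the sketch below computes the coverage of a binary mask and assigns it to one of 10 equal-width bins. The function names are hypothetical, and the bin edge convention (half-open bins, with 100% falling in the last bin) is an assumption; the source does not state how edges were handled.

```python
def coverage_percent(mask):
    """Percentage of pixels flagged 1 in a binary mask given as rows of 0/1."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("empty mask")
    cloudy = sum(sum(row) for row in mask)
    return 100.0 * cloudy / total


def coverage_bin(percent, n_bins=10):
    """0-based index of the equal-width bin containing `percent`.

    ASSUMPTION: bins are half-open [lo, hi), except the last bin,
    which also includes 100%.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be within [0, 100]")
    width = 100.0 / n_bins
    return min(int(percent // width), n_bins - 1)
```

Grouping training masks by coverage_bin and drawing the same number from each group is one way a user could equalize a coverage distribution, if desired.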
The training cloud masks are stored in GeoTIFF format under train/cloud_masks/. The file name structure for cloud masks is chip_XXX_YYY_T_cmask.tif, where XXX and YYY refer to the row and column of a tile grid imposed on CONUS. T refers to the time step of each cloud mask and is meant only to distinguish cloud masks derived from the same location from each other. The intent for training is that these cloud masks are randomly paired with training HLS scenes in all time steps. The distribution of cloud mask coverage for the training set does not correspond to the distribution of cloud mask coverage for the validation set, as the latter has been equalized. This may lead to higher validation accuracy if the user chooses not to equalize the training dataset; whether to do so is left to the user's discretion.
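One possible reading of the random pairing described above is sketched below: each training chip draws one cloud mask per time step, independently and with replacement. The helper names are hypothetical, and both the sampling scheme and the assumption that T contains no underscore are illustrative choices rather than the procedure used by the dataset authors.

```python
import random
import re

N_STEPS = 3  # time steps per HLS chip

# Matches names such as "chip_010_020_2_cmask.tif".
# ASSUMPTION: the time step label T contains no underscore.
MASK_RE = re.compile(r"^chip_(\d+)_(\d+)_([^_]+)_cmask\.tif$")


def parse_mask_name(name):
    """Return (row, col, step_label) from a cloud mask file name."""
    match = MASK_RE.match(name)
    if match is None:
        raise ValueError(f"not a cloud mask file name: {name!r}")
    return int(match.group(1)), int(match.group(2)), match.group(3)


def pair_masks(chip_names, mask_names, rng):
    """Draw one mask per time step for every chip, with replacement."""
    return {
        chip: [rng.choice(mask_names) for _ in range(N_STEPS)]
        for chip in chip_names
    }


# Usage with a seeded generator so the pairing is reproducible.
rng = random.Random(0)
pairs = pair_masks(
    ["chip_001_002.tif"],
    ["chip_010_020_1_cmask.tif", "chip_010_020_2_cmask.tif"],
    rng,
)
```

Because the T label only distinguishes masks from the same location, the sketch ignores it when pairing and treats every mask as usable at any time step.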
In each HLS GeoTIFF the following bands are repeated for each of three observations throughout the year:
Masks are stored as single-band binary images, where 1 denotes the presence of the cloud mask and 0 denotes its absence.
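The mask semantics can be illustrated with a small dependency-free sketch that zeroes out one band wherever the mask is 1, matching the fill value of 0 used in the masked test scenes. Plain nested lists stand in for raster arrays here, and the function name is hypothetical.

```python
def apply_cloud_mask(band, mask, fill=0):
    """Return a copy of `band` with `fill` wherever `mask` is 1."""
    same_shape = len(band) == len(mask) and all(
        len(band_row) == len(mask_row) for band_row, mask_row in zip(band, mask)
    )
    if not same_shape:
        raise ValueError("band and mask shapes differ")
    return [
        [fill if m == 1 else value for value, m in zip(band_row, mask_row)]
        for band_row, mask_row in zip(band, mask)
    ]


# Usage: a 2 x 2 band with two masked pixels.
masked = apply_cloud_mask([[5, 6], [7, 8]], [[0, 1], [1, 0]])
```

With NumPy arrays, np.where(mask == 1, 0, band) expresses the same operation.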
Code used to generate HLS scenes and cloud masks is available here. Code used to generate masked test scenes is available here. The masked test scenes were produced by initializing the dataset in the gapfill.py code with usage='validate' and default parameters. Refer to "Seeing Through the Clouds: Cloud Gap Imputation with Prithvi Foundation Model" for further information about the creation and initial use of this dataset.
Three HLS scenes were selected between March and September 2022, with the time difference between scenes varying between 1 and 200 days. After filtering for missing values and cloudy pixels, a total of 7,852 cloud-free chips evenly distributed across CONUS were generated. This set was randomly partitioned into training (80%) and validation (20%) sets, resulting in 6,231 training chips and 1,621 validation chips.
Cloud masks were generated from the same region of CONUS using the HLS cloud mask quality flag and exported as a binary layer of cloudy and non-cloudy pixels. This yielded 21,642 cloud masks, of which 1,600 were randomly selected and reserved for validation, resulting in 20,042 training cloud masks.
This dataset is published under a CC-BY-4.0 license. If you find this dataset useful for your application, you can cite it as follows:
For any questions about the dataset, you can contact Dr. Hamed Alemohammad.
This dataset is generated with funding from a grant awarded to Clark University Center for Geospatial Analytics (CGA) by NASA.
| Column | Description |
| --- | --- |
| bad_pct_max | maximum of invalid pixels in all time steps |
| na_count | count of pixels in all time steps with no data |
| usage | 'train' or 'validate' |