A global dataset of Clay v1.5 embeddings for Sentinel-2. Licensed under CC BY 4.0.
This repository contains a global dataset of Clay v1.5 embeddings for Sentinel-2, created by lgnd.ai. Each embedding is a 1024-dimensional float32 vector associated with a 256-meter MajorTOM grid cell. All imagery was sourced from the sentinel-cogs bucket on AWS.
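As a rough sizing aid, the per-embedding footprint follows directly from the shape and dtype stated above. The sketch below is illustrative only: the helper name `raw_embedding_bytes` is hypothetical, and it counts just the vector payload, ignoring any other columns, Parquet metadata, and compression.

```python
import struct

EMBEDDING_DIM = 1024                        # dimensions per embedding, as stated above
BYTES_PER_FLOAT32 = struct.calcsize("<f")   # size of one float32 value

def raw_embedding_bytes(n_embeddings: int) -> int:
    """Uncompressed size of the embedding vectors alone, in bytes."""
    return n_embeddings * EMBEDDING_DIM * BYTES_PER_FLOAT32

per_vector = raw_embedding_bytes(1)                       # bytes for a single embedding
per_million_gb = raw_embedding_bytes(1_000_000) / 1e9     # decimal GB per million embeddings
```

Multiplying by the embedding count of a product gives a raw figure to compare against its reported on-disk size, which also reflects compression and the additional columns.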
The dataset currently has near-complete (~99.7%) global coverage across two time steps, June 2024 and June 2025, and is broken into two distinct products:
- scene: embeddings for individual Sentinel-2 scenes.
- aggregated: an aggregation of scene-level embeddings into a normalized time series of embeddings.

The scene-level product contains embeddings for each individual Sentinel-2 scene. For each MGRS tile, we embed the minimum number of scenes required to provide maximum spatial coverage while minimizing cloud cover. Some MGRS tiles only require one scene, while others require many more. The embeddings are available in the scene directory partitioned by Grid Zone Designator (the first three characters of the MGRS tile ID), year, and month. For example:
The scene-level product is best when analyzing individual Sentinel-2 scenes, or when you need the highest possible temporal resolution. This product contains ~77.6 million embeddings and takes up ~285 GB on disk.
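Because the scene product is partitioned by Grid Zone Designator, it can help to derive the GZD from a tile ID before deciding which partitions to read. The following is a minimal sketch, assuming standard 5-character Sentinel-2 MGRS tile IDs such as `10SEG`; the function names are hypothetical, and how the partition value is spelled in the actual directory names should be checked against the repository.

```python
import re
from collections import defaultdict

# Sentinel-2 MGRS tile IDs: 2-digit UTM zone, latitude band letter,
# then a two-letter 100 km square identifier (I and O are never used).
_MGRS_TILE = re.compile(r"^(\d{2})([C-HJ-NP-X])([A-HJ-NP-Z][A-HJ-NP-V])$")

def grid_zone_designator(tile_id: str) -> str:
    """Return the GZD (e.g. '10S') for a 5-character MGRS tile ID (e.g. '10SEG')."""
    match = _MGRS_TILE.match(tile_id.upper())
    if match is None or not 1 <= int(match.group(1)) <= 60:
        raise ValueError(f"not a valid MGRS tile ID: {tile_id!r}")
    return match.group(1) + match.group(2)

def group_tiles_by_gzd(tile_ids):
    """Group tile IDs by the GZD they fall under."""
    groups = defaultdict(list)
    for tile_id in tile_ids:
        groups[grid_zone_designator(tile_id)].append(tile_id)
    return dict(groups)

groups = group_tiles_by_gzd(["10SEG", "10SFH", "31UDQ"])
# groups == {"10S": ["10SEG", "10SFH"], "31U": ["31UDQ"]}
```

Grouping tiles this way shows how many GZD partitions an area of interest touches, which is usually far fewer than the number of tiles.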
The aggregated product provides a normalized time series of embeddings per chip by selecting one representative embedding per month for each MajorTOM cell. This helps remove redundancy from overlapping scene coverage at the border of MGRS tiles, normalizes observation frequency across regions with varying revisit rates, and simplifies time series analysis by providing consistent monthly snapshots. The embeddings are available in the aggregated directory partitioned by geohash, year, and month. For example:
The aggregated product is easier to use than the scene-level product across large spatial extents, and is best for time series analysis or when consistent temporal sampling is required. This product contains ~49.7 million embeddings and takes up ~182 GB on disk.
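The idea of reducing scene-level embeddings to one per cell per month can be sketched as a simple group-and-select step. This is not the procedure used to build the dataset: the document does not say how the representative embedding is chosen, so the sketch assumes, purely for illustration, that the observation with the lowest cloud cover wins. The field names `cell_id` and `cloud_cover` are likewise hypothetical.

```python
from typing import NamedTuple

class SceneEmbedding(NamedTuple):
    cell_id: str          # hypothetical MajorTOM cell identifier
    year: int
    month: int
    cloud_cover: float    # hypothetical quality score, lower is better
    embedding: tuple      # stands in for the 1024-dimensional vector

def aggregate_monthly(records):
    """Keep one record per (cell_id, year, month).

    ASSUMPTION: the lowest-cloud-cover rule is illustrative; the dataset's
    actual selection rule is not described in this document.
    """
    best = {}
    for rec in records:
        key = (rec.cell_id, rec.year, rec.month)
        if key not in best or rec.cloud_cover < best[key].cloud_cover:
            best[key] = rec
    return best

records = [
    SceneEmbedding("cell_a", 2024, 6, 0.30, (0.1, 0.2)),
    SceneEmbedding("cell_a", 2024, 6, 0.05, (0.3, 0.4)),  # same cell and month, clearer scene
    SceneEmbedding("cell_a", 2025, 6, 0.10, (0.5, 0.6)),
    SceneEmbedding("cell_b", 2024, 6, 0.20, (0.7, 0.8)),
]
monthly = aggregate_monthly(records)
```

Whatever the real selection rule is, the output has the same shape: at most one embedding per cell per month, so duplicates from overlapping scenes disappear.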
Both data products follow the same data model, described below:
Both data products are provided as hive-partitioned geoparquet files, following geoparquet best practices:
This product is licensed under CC BY 4.0.
There are several known data quality issues:
In the coming months, the LGND team plans to fix these data gaps and push embeddings for more time steps beyond the two originally provided.
| Column | Type | Description |
| --- | --- | --- |
| ymin | | |
| xmax | | |
| ymax | | |
| geometry | polygon (wkb) | Polygon geometry of the chip footprint |
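Since the chip footprint is stored as a WKB polygon, a short decoding sketch may help show what that column contains. In practice a geometry library would normally handle this. The version below uses only the standard library, handles plain 2D polygons, and uses a hypothetical function name. It returns the polygon's bounds, which could be compared against the bounding-box columns on the assumption that those hold the chip's extent.

```python
import struct

def wkb_polygon_bounds(wkb: bytes):
    """Return (xmin, ymin, xmax, ymax) for a 2D WKB polygon."""
    byte_order = "<" if wkb[0] == 1 else ">"       # 1 = little-endian, 0 = big-endian
    (geom_type,) = struct.unpack_from(byte_order + "I", wkb, 1)
    if geom_type != 3:                             # 3 = Polygon in the WKB encoding
        raise ValueError(f"expected a 2D polygon, got WKB type {geom_type}")
    (n_rings,) = struct.unpack_from(byte_order + "I", wkb, 5)
    offset = 9
    xs, ys = [], []
    for _ in range(n_rings):
        (n_points,) = struct.unpack_from(byte_order + "I", wkb, offset)
        offset += 4
        coords = struct.unpack_from(f"{byte_order}{2 * n_points}d", wkb, offset)
        offset += 16 * n_points                    # two float64 values per point
        xs.extend(coords[0::2])
        ys.extend(coords[1::2])
    return min(xs), min(ys), max(xs), max(ys)

# Build a small little-endian WKB polygon by hand to exercise the parser.
ring = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (0.0, 2.0), (0.0, 0.0)]
wkb = struct.pack("<BII", 1, 3, 1) + struct.pack("<I", len(ring))
wkb += b"".join(struct.pack("<2d", x, y) for x, y in ring)
bounds = wkb_polygon_bounds(wkb)
# bounds == (0.0, 0.0, 1.0, 2.0)
```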