Multi-modal Earth Observation pretraining dataset in TACO format. Part A: 250k monotemporal tiles; Part B: ~12.5k locations × 12 monthly timesteps; Part C: ~16.6k locations × 6 five-daily timesteps. Sensors: Sentinel-2 L1C, Landsat 8/9, Copernicus DEM, and ESA WorldCover. Locations were selected via AEF hierarchical clustering over the Major TOM 10 km grid. Funded by the ELLIOT project.
ELLIOT-Pretrain is a Major TOM expansion focused on fast pre-training of multi-modal AI foundation models on Earth Observation data.
Tile locations were selected using hierarchical spherical k-means clustering over AlphaEarth Foundation embeddings to maximise global environmental diversity. The dataset follows the TACO v3 specification.
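The selection step above can be sketched in NumPy. This is a minimal, single-level spherical k-means (cosine-similarity assignment on unit-normalised vectors); the actual pipeline applies it hierarchically over AlphaEarth Foundation embeddings, and every name here is illustrative:

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Illustrative spherical k-means: cluster unit-normalised embedding
    vectors by cosine similarity. A sketch, not the ELLIOT pipeline."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project onto unit sphere
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = (X @ centroids.T).argmax(axis=1)      # nearest centroid by cosine similarity
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centroids[j] = c / np.linalg.norm(c)   # re-normalise centroid to the sphere
    return labels, centroids

# Toy run on random 64-d stand-ins for embedding vectors
X = np.random.default_rng(1).normal(size=(500, 64))
labels, centroids = spherical_kmeans(X, k=8)
```

A hierarchical variant would simply recurse: run the same routine inside each cluster, yielding a tree of progressively finer environmental groupings from which tile locations can be sampled.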
All three parts share the same four sensors.
Pick a tile index from Part A (monotemporal) and visualize all four modalities.
Full notebook covering all three parts, metadata queries with filtering, and a streaming PyTorch DataLoader with parallel fetching.
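The parallel-fetch pattern behind such a streaming loader can be sketched with the standard library alone; `fetch_tile` is a hypothetical stand-in for the real TACO I/O, and in practice it would be wrapped in a PyTorch IterableDataset:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_tile(idx):
    # Stand-in for remote I/O: the real loader would stream a tile's
    # bytes from the TACO archive at this point.
    return {"index": idx, "pixels": [idx] * 4}

def stream_batches(indices, batch_size=4, workers=8):
    """Yield batches of tiles, fetching each batch's payloads in parallel threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(indices), batch_size):
            chunk = indices[start:start + batch_size]
            yield list(pool.map(fetch_tile, chunk))

batches = list(stream_batches(list(range(10)), batch_size=4))
```

Threads overlap the network latency of individual tile fetches, which is usually the bottleneck when training directly from remote archives.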
The join key across metadata files is `internal:parent_id`, which points back to `internal:current_id` in `collection.parquet`.
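The join can be sketched with pandas. The two column names follow the dataset card; the toy frames below stand in for the real parquet files, and the `sensor` and `timestep` columns are purely illustrative:

```python
import pandas as pd

# Toy stand-in for collection.parquet (one row per parent record).
collection = pd.DataFrame({
    "internal:current_id": ["tile_000", "tile_001"],
    "sensor": ["sentinel2_l1c", "landsat"],
})
# Toy stand-in for a per-sample metadata file.
samples = pd.DataFrame({
    "internal:parent_id": ["tile_000", "tile_000", "tile_001"],
    "timestep": [0, 1, 0],
})
# Join each sample back to its parent record in the collection.
joined = samples.merge(
    collection,
    left_on="internal:parent_id",
    right_on="internal:current_id",
    how="left",
)
```

With the actual files you would load both sides via `pd.read_parquet(...)` and merge on the same pair of keys.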
CC-BY-SA-4.0
ELLIOT-Pretrain has been made possible thanks to Asterisk Labs, the ELLIOT project (European Commission, Horizon Europe, Grant 101214398), and the Image and Signal Processing Group (ISP) at Universitat de València.