Major TOM ELLIOT-Pretrain
ELLIOT-Pretrain is a Major TOM expansion focused on fast pre-training of multi-modal AI foundation models on Earth Observation data.
Built on the original MajorTOM global 10 km grid
10.56 km tiles with harmonised 10 m resolution across all sensors for fast access
Four co-registered modalities in every tile: optical, thermal, elevation, and land cover
Three sub-datasets covering complementary temporal regimes
Tile locations were selected using hierarchical spherical k-means clustering over AlphaEarth Foundation embeddings to maximise global environmental diversity. The dataset follows the TACO v3 specification.
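To make the clustering idea concrete, here is a minimal sketch of two-level spherical k-means in NumPy. It is not the pipeline used to build ELLIOT-Pretrain: the function names (`spherical_kmeans`, `hierarchical_labels`), the branching factors, the random stand-in embeddings, and the initialisation scheme are all assumptions made for illustration. How tiles were then drawn from the resulting clusters is not described in this document and is not shown.

```python
import numpy as np


def spherical_kmeans(x, k, n_iter=20, seed=0):
    """Cluster rows of x by cosine similarity (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Project every vector onto the unit sphere.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    # Initialise centroids from k distinct data points (fancy indexing copies).
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the centroid with the highest cosine similarity.
        labels = np.argmax(x @ centroids.T, axis=1)
        for c in range(k):
            total = x[labels == c].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = total / norm
    labels = np.argmax(x @ centroids.T, axis=1)
    return labels, centroids


def hierarchical_labels(x, branching=(4, 4), seed=0):
    """Split into branching[0] clusters, then split each cluster again."""
    top, _ = spherical_kmeans(x, branching[0], seed=seed)
    out = np.zeros((len(x), 2), dtype=int)
    out[:, 0] = top
    for c in range(branching[0]):
        idx = np.flatnonzero(top == c)
        if len(idx) >= branching[1]:
            sub, _ = spherical_kmeans(x[idx], branching[1], seed=seed)
            out[idx, 1] = sub
    return out


# Random vectors standing in for real embeddings (assumption).
embeddings = np.random.default_rng(0).normal(size=(500, 64))
paths = hierarchical_labels(embeddings)  # one (level-1, level-2) label per row
```

The centroid update renormalises the cluster sum back to unit length, which is what distinguishes the spherical variant from ordinary k-means.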
Sub-datasets
All three parts share the same four sensors.
Quick Start
Pick a tile index from Part A (monotemporal) and visualize all four modalities.
import numpy as np
import rasterio
import matplotlib.pyplot as plt

BASE = "https://data.source.coop/major-tom/elliot-pretrain"
IDX = 42  # change this to explore different tiles

MODS = ["s2", "l8", "dem", "lc"]
fig, axes = plt.subplots(1, 4, figsize=(20, 5))

for j, mod in enumerate(MODS):
    # /vsicurl/ lets GDAL stream the GeoTIFF over HTTP without downloading it first
    url = f"/vsicurl/{BASE}/monotemporal/DATA/{IDX}/{mod}/data.tif"
    with rasterio.open(url) as src:
        if mod in ("s2", "l8"):
            # Read bands 4, 3, 2 as an RGB composite and scale by 3000 for display
            rgb = src.read([4, 3, 2]).astype(np.float32)
            axes[j].imshow(np.clip(rgb / 3000, 0, 1).transpose(1, 2, 0))
        elif mod == "dem":
            axes[j].imshow(src.read(1), cmap="terrain")
        else:
            # Land cover is categorical, so use a qualitative colormap
            axes[j].imshow(src.read(1), cmap="tab20")
    axes[j].set_title(mod)
    axes[j].axis("off")

plt.tight_layout()
plt.show()
Reproducible Example
Full notebook covering all three parts, metadata queries with filtering, and a streaming PyTorch DataLoader with parallel fetching.
Dataset Structure
elliot-pretrain/
├── monotemporal/                 # Part A, 250k tiles, 1 timestep
│   ├── COLLECTION.json
│   ├── METADATA/
│   │   ├── collection.parquet
│   │   ├── sample__s2.parquet
│   │   ├── sample__l8.parquet
│   │   ├── sample__dem.parquet
│   │   └── sample__lc.parquet
│   └── DATA/{tile_id}/
│       ├── s2/data.tif
│       ├── l8/data.tif
│       ├── dem/data.tif
│       └── lc/data.tif
├── monthly/                      # Part B, 12.5k tiles, 12 timesteps
│   ├── COLLECTION.json
│   ├── METADATA/                 # same 5 parquet files
│   └── DATA/{tile_id}/
│       ├── s2/img_00.tif ... img_11.tif
│       ├── l8/img_00.tif ... img_11.tif
│       ├── dem/main.tif
│       └── lc/main.tif
└── burst/                        # Part C, 16.7k tiles, 6 timesteps (~5-day)
    ├── COLLECTION.json
    ├── METADATA/                 # same 5 parquet files
    └── DATA/{tile_id}/
        ├── s2/img_00.tif ... img_05.tif
        ├── l8/img_00.tif ... img_05.tif
        ├── dem/main.tif
        └── lc/main.tif
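The directory layout above maps onto file paths mechanically, so a small helper can build the GDAL `/vsicurl/` paths for any part, tile, and modality. The sketch below is illustrative only: `tile_urls`, `TIMESTEPS`, `TEMPORAL_MODS`, and `STATIC_MODS` are hypothetical names, and it assumes the tile identifier is substituted directly into the path, as the Quick Start does with `IDX`.

```python
BASE = "https://data.source.coop/major-tom/elliot-pretrain"

# Timesteps per part, as listed in the directory tree.
TIMESTEPS = {"monotemporal": 1, "monthly": 12, "burst": 6}
TEMPORAL_MODS = ("s2", "l8")  # one file per timestep in Parts B and C
STATIC_MODS = ("dem", "lc")   # a single file per tile


def tile_urls(part, tile_id, mod, base=BASE):
    """Return the /vsicurl/ paths for one modality of one tile (hypothetical helper)."""
    if part not in TIMESTEPS:
        raise ValueError(f"unknown part: {part!r}")
    if mod not in TEMPORAL_MODS + STATIC_MODS:
        raise ValueError(f"unknown modality: {mod!r}")
    root = f"/vsicurl/{base}/{part}/DATA/{tile_id}/{mod}"
    if part == "monotemporal":
        # Part A stores every modality as data.tif
        return [f"{root}/data.tif"]
    if mod in STATIC_MODS:
        # Parts B and C store elevation and land cover once, as main.tif
        return [f"{root}/main.tif"]
    # Parts B and C store one optical/thermal image per timestep
    return [f"{root}/img_{t:02d}.tif" for t in range(TIMESTEPS[part])]


monthly_s2 = tile_urls("monthly", 42, "s2")  # 12 paths, img_00.tif to img_11.tif
```

Each returned string can be passed to `rasterio.open` in the same way as the URL in the Quick Start.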
The join key across the metadata files is internal:parent_id, which points back to internal:current_id in collection.parquet.
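As a rough illustration of that join, the sketch below merges a per-sensor table onto the collection table with pandas. The two small DataFrames stand in for `sample__s2.parquet` and `collection.parquet`; only the two `internal:` key columns come from the description above, while `tile_name` and `cloud_cover` are hypothetical columns invented for the example.

```python
import pandas as pd

# Stand-in for METADATA/collection.parquet (one row per tile).
collection = pd.DataFrame({
    "internal:current_id": [0, 1, 2],
    "tile_name": ["a", "b", "c"],        # hypothetical column
})

# Stand-in for METADATA/sample__s2.parquet (one row per sample).
s2 = pd.DataFrame({
    "internal:parent_id": [0, 1, 2],
    "cloud_cover": [0.05, 0.40, 0.12],   # hypothetical column
})

# Attach the tile-level record to every sensor-level row.
joined = s2.merge(
    collection,
    left_on="internal:parent_id",
    right_on="internal:current_id",
    how="left",
    validate="many_to_one",  # fail loudly if collection ids are not unique
)

# Example filter on the hypothetical column.
clear = joined[joined["cloud_cover"] < 0.2]
```

With the real files, the two stand-in frames would presumably be replaced by `pd.read_parquet(...)` calls on the corresponding paths.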
Citation
@inproceedings{francis2024majortom,
  title={Major TOM: Expandable Datasets for Earth Observation},
  author={Francis, Alistair and Czerkawski, Mikolaj},
  booktitle={IGARSS 2024},
  pages={2935--2940},
  year={2024},
  doi={10.1109/IGARSS53475.2024.10640760}
}
License
CC-BY-SA-4.0
Acknowledgements
ELLIOT-Pretrain has been made possible thanks to Asterisk Labs, the ELLIOT project (European Commission, Horizon Europe, Grant 101214398), and the Image and Signal Processing Group (ISP) at Universitat de Valencia.