Skyscraper pairs multi-temporal Earth-observation imagery with GDELT-derived news text: each record links geolocated PlanetScope or Sentinel-2 image sequences to event-centric captions and article metadata.
Arrange the downloaded files to match the directory tree below so that the paths in croissant.json resolve correctly.
Place croissant.json next to two parallel bundles (same schema, different sensors):
- labels.csv: one row per record; the join key is article_id (matches the imagery and metadata folders).
- imagery/: JPEG stacks per article_id.
- metadata/: per-event JSON; may include extended or model-derived fields (see the provenance notes in croissant.json).

Both bundles use the same columns.
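As an illustration of how the article_id join key ties the three parts of a bundle together, here is a minimal sketch. It assumes a layout of `imagery/<article_id>/*.jpg` and `metadata/<article_id>.json`; both patterns, and the function name `index_bundle`, are assumptions for illustration, so check them against the actual tree and croissant.json before relying on them.

```python
import csv
from pathlib import Path


def index_bundle(bundle_dir):
    """Map each article_id in labels.csv to its label row, images, and metadata file."""
    bundle = Path(bundle_dir)
    index = {}
    with open(bundle / "labels.csv", newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            article_id = row["article_id"]  # join key named in this README
            # ASSUMPTION: one folder of JPEGs per article_id.
            image_dir = bundle / "imagery" / article_id
            # ASSUMPTION: one JSON file per article_id.
            meta_file = bundle / "metadata" / f"{article_id}.json"
            index[article_id] = {
                "label": row,
                "images": sorted(image_dir.glob("*.jpg")) if image_dir.is_dir() else [],
                "metadata": meta_file if meta_file.is_file() else None,
            }
    return index
```

Records whose imagery or metadata is missing are kept with an empty list or `None`, which makes gaps in a partial download easy to spot.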
This release is described in croissant.json as CC BY-NC 4.0:
https://creativecommons.org/licenses/by-nc/4.0/
Upstream constraints still apply: Planet imagery, Copernicus Sentinel-2, GDELT, and news publishers each impose their own terms. Redistributing or displaying full article text must respect those sources.
Use the bibliographic entry from the accompanying paper when available. The dataset record also includes a citeAs string in croissant.json.
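The license and citation strings can also be read directly from the file. The sketch below assumes croissant.json stores them under the top-level `license` and `citeAs` keys, as Croissant JSON-LD files commonly do; the function name is hypothetical.

```python
import json
from pathlib import Path


def read_license_and_citation(jsonld_path):
    """Return the top-level license and citeAs values from a Croissant JSON-LD file."""
    meta = json.loads(Path(jsonld_path).read_text(encoding="utf-8"))
    # ASSUMPTION: both values live at the top level of the document.
    return {"license": meta.get("license"), "citeAs": meta.get("citeAs")}
```

A missing key comes back as `None` rather than raising, so the caller can fall back to the paper's bibliographic entry.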
Limitations, biases, sensitivity of locations and news text, and recommended use cases are summarized in the RAI and provenance sections of croissant.json. The dataset is intended for research (e.g. multimodal EO–text modeling), not for surveillance, individual identification, or high-stakes operational decisions without validation.
With mlcroissant installed and data paths matching this README:
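A minimal sketch of such a load, assuming croissant.json sits in the working directory. The record-set names are not listed in this README, so the helper (a hypothetical name, `preview_records`) falls back to the first record set declared in the file; pass the real name once you have looked it up.

```python
import itertools


def preview_records(jsonld_path="croissant.json", record_set=None, limit=3):
    """Load the Croissant description and return the first few records."""
    import mlcroissant as mlc  # imported lazily so the module loads without it

    dataset = mlc.Dataset(jsonld=str(jsonld_path))
    if record_set is None:
        # ASSUMPTION: the first declared record set is the one of interest.
        first = dataset.metadata.record_sets[0]
        # The identifier expected by records() can differ between mlcroissant
        # versions, so prefer uuid when present and fall back to the name.
        record_set = getattr(first, "uuid", None) or first.name
    records = dataset.records(record_set=record_set)
    return list(itertools.islice(records, limit))
```

Calling `preview_records()` from the directory that holds croissant.json should return a short list of records (typically dictionaries keyed by field name), which is enough to confirm that the paths resolve.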
Record streaming may require the full data tree to be present locally, or an archive layout consistent with the includes patterns in croissant.json.
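One way to check a local tree before streaming is to count the files each includes pattern matches. This is a rough sketch: it assumes the patterns appear under `distribution` entries in croissant.json and are relative to a single data root, which may not hold if a file set is contained in an archive. The function name `check_includes` is hypothetical.

```python
import json
from pathlib import Path


def check_includes(jsonld_path, data_root):
    """Count local files matching each 'includes' pattern in the distribution."""
    meta = json.loads(Path(jsonld_path).read_text(encoding="utf-8"))
    root = Path(data_root)
    report = {}
    for entry in meta.get("distribution", []):
        includes = entry.get("includes")
        if includes is None:
            continue  # entries without patterns (e.g. single files) are skipped
        patterns = [includes] if isinstance(includes, str) else list(includes)
        for pattern in patterns:
            # ASSUMPTION: patterns are relative globs resolved against data_root.
            report[pattern] = sum(1 for _ in root.glob(pattern))
    return report
```

A pattern that maps to zero files points to a missing folder or a layout that differs from the one croissant.json expects.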