This dataset merges Google's V3 Open Buildings and Microsoft's latest Building Footprints. It contains 2,579,035,323 footprints and is divided into 185 partitions. Each footprint is labelled with its respective source, either Google or Microsoft. It can be accessed in cloud-native geospatial formats such as GeoParquet, FlatGeobuf and PMTiles.
You can use Observable to get a quick overview of the dataset, or go to VIDA to see it in action.
The original Google V3 Open Buildings dataset is downloadable from this link as gzipped CSV files. Here are some key details about the original dataset:
The dataset contains 1.8 billion building detections, across an inference area of 58M km² within Africa, South Asia, South-East Asia, Latin America and the Caribbean.
Each building in the dataset has a polygon defining its footprint on the ground, a confidence score indicating how certain we are that this is a building, and a Plus Code corresponding to the centre of the building. There is no information about the type of building, its street address, or any details other than its geometry.
For more comprehensive information, please visit the description page. You can also check out the FAQ section for additional information.
The latest version of Microsoft's building footprints can be downloaded from Microsoft Planetary Computer as gzipped partitioned files.
The Microsoft Global Open Buildings dataset was generated by Bing Maps and contains a total of 1.24 billion detected buildings. The buildings were identified in Bing Maps imagery collected between 2014 and 2023, including imagery from Maxar, Airbus, and IGN France.
For more detailed information, please visit the GitHub page.
The data is available in the following formats: GeoParquet, FlatGeobuf and PMTiles.
This extensive dataset is organized into 185 root partitions. Each partition typically corresponds to a country's administrative boundary, as defined by the Comprehensive Global Administrative Zones (CGAZ) at the ADM0 level, which can be accessed here. There is also a sub-partition available, based on the S2 grid.
Both the FlatGeobuf and GeoParquet files are partitioned by country boundary, following the ADM0 level of the CGAZ geoboundary definition, so each file contains the building footprints of a single country. Files and partitions are named after the country's ISO code.
/geoparquet/by_country/country_iso={ISO}/{ISO}.parquet
Note: There is a partition labeled country_iso=None, which represents a MULTIPOLYGON containing geoboundaries (POLYGONS) that have not been explicitly defined or named by CGAZ. These geoboundaries are still captured by CGAZ at the ADM0 level, but they lack specific names and are therefore labelled null. As a result, building footprints located within these geoboundaries are included in the partition labeled country_iso=None. For instance, the area between Sudan and South Sudan includes a piece of land known as "Abyei", whose status remains disputed due to recurring conflicts and which therefore lacks an assigned name.
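As an illustration of the naming convention above, here is a minimal Python sketch that builds a country-level path from the documented template and reads the partition key back out of a path. The function names are hypothetical, the three-letter code "KEN" is only an assumed example value, and mapping the literal "None" key to Python's None is a convenience choice of this sketch, not something the dataset prescribes.

```python
import re
from typing import Optional

# Path template as documented for the country-level GeoParquet layout.
COUNTRY_TEMPLATE = "/geoparquet/by_country/country_iso={iso}/{iso}.parquet"

# Matches the hive-style partition key in either layout.
_PARTITION_KEY = re.compile(r"country_iso=([^/]+)")


def country_path(iso: str) -> str:
    """Build the country-level path for a named ISO partition (hypothetical helper)."""
    return COUNTRY_TEMPLATE.format(iso=iso)


def country_from_path(path: str) -> Optional[str]:
    """Extract the partition key from a path; the unnamed partition yields None."""
    match = _PARTITION_KEY.search(path)
    if match is None:
        raise ValueError(f"no country_iso partition key in {path!r}")
    value = match.group(1)
    return None if value == "None" else value


print(country_path("KEN"))
print(country_from_path("/geoparquet/by_country/country_iso=KEN/KEN.parquet"))
```

Treating the unnamed partition as None makes it easy to filter out or handle separately when looping over all partitions.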
To enhance performance, particularly with GeoParquet files, we've introduced an S2 sub-partitioning strategy. Each ISO partition is further divided using an S2 grid ID, ensuring a cap of 20 million building footprints per grid ID. This S2 grid partitioning is exclusive to GeoParquet files.
/geoparquet/by_country_s2/country_iso={ISO}/{S2_GRID_ID}.parquet
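To give an intuition for what a per-cell cap means, the sketch below recursively subdivides a region until no cell holds more than a given number of points. It is only an illustration under loud assumptions: it uses a plain quadtree over the unit square as a stand-in for the hierarchical S2 grid, the function name and cell IDs are hypothetical, and it is not necessarily how the partitions in this dataset were produced.

```python
def split_until_capped(points, cap, cell=(0.0, 0.0, 1.0, 1.0), cell_id="", max_depth=16):
    """Return {cell_id: points} so that each leaf cell holds at most `cap` points.

    Quadtree stand-in for a hierarchical grid such as S2: a cell that exceeds
    the cap is split into four children, and only those children are refined.
    `max_depth` stops the recursion when many points share one location.
    """
    if len(points) <= cap or len(cell_id) >= max_depth:
        return {cell_id or "root": points}

    x0, y0, x1, y1 = cell
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = {
        "0": (x0, y0, xm, ym),  # lower-left
        "1": (xm, y0, x1, ym),  # lower-right
        "2": (x0, ym, xm, y1),  # upper-left
        "3": (xm, ym, x1, y1),  # upper-right
    }
    buckets = {key: [] for key in quadrants}
    for x, y in points:
        if y < ym:
            key = "1" if x >= xm else "0"
        else:
            key = "3" if x >= xm else "2"
        buckets[key].append((x, y))

    leaves = {}
    for key, bucket in buckets.items():
        if bucket:  # empty cells produce no partition
            leaves.update(
                split_until_capped(bucket, cap, quadrants[key], cell_id + key, max_depth)
            )
    return leaves


# Toy input: a dense cluster in the lower-left corner and one isolated point.
toy_points = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.3), (0.4, 0.4), (0.9, 0.9)]
partitions = split_until_capped(toy_points, cap=2)
for cell_id, members in sorted(partitions.items()):
    print(cell_id, len(members))
```

The dense corner ends up in smaller cells than the sparse one, which is the property that keeps individual files at a manageable size in densely built-up countries.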
Each row in the dataset provides information on a specific building footprint with associated information on individual columns:
xmin, ymin, xmax, ymax: values for the bounding box of the geometry.
We invite you to read our blog post for more detailed information on our dataset merging approach, which includes insights into the optimization techniques we investigated and the query performance on BigQuery. In this section, we provide a high-level summary of the merging process, highlighting its crucial aspects.
We imported both datasets into BigQuery for further processing. From the Google dataset, we excluded columns like full_plus_code, latitude, and longitude. For the Microsoft dataset we did not drop any columns.
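The column handling described above can be pictured with a small Python sketch. The actual processing was done in BigQuery, so this is only an illustration: the helper name, the "source" column name and the sample row are all hypothetical, and the dropped set contains only the three columns named in the text, which may not be the complete list.

```python
import csv
import io

# Columns named in the text as excluded from the Google dataset.
GOOGLE_DROPPED = frozenset({"full_plus_code", "latitude", "longitude"})


def harmonize(rows, source, dropped=frozenset()):
    """Yield rows without the dropped columns, tagged with their origin.

    "source" is a hypothetical column name used only for this sketch.
    """
    for row in rows:
        kept = {key: value for key, value in row.items() if key not in dropped}
        kept["source"] = source
        yield kept


# Made-up sample row shaped like a Google Open Buildings CSV record.
sample = io.StringIO(
    "latitude,longitude,area_in_meters,confidence,geometry,full_plus_code\n"
    '0.5,0.5,1.0,0.9,"POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))",PLACEHOLDER\n'
)

rows = list(harmonize(csv.DictReader(sample), "google", GOOGLE_DROPPED))
print(sorted(rows[0]))
```

For Microsoft records the same helper would be called with an empty `dropped` set, so every original column is kept.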
We then matched each building footprint with a boundary ID, determined by the intersection of its centroid with the country geoboundaries in the CGAZ ADM0 dataset. Footprints whose centroids didn't overlap with any country geoboundary were mapped to the nearest geoboundary based on their centroid's position.
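The matching rule above (centroid inside a boundary, otherwise the nearest boundary) can be sketched in plain Python. This is a simplified illustration, not the BigQuery implementation: it assumes planar coordinates, simple polygons without holes given as lists of (x, y) vertices, and toy rectangular boundaries with made-up IDs "AAA" and "BBB". All function names are hypothetical.

```python
from math import hypot


def centroid(ring):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * area2), cy / (3 * area2)


def contains(ring, point):
    """Ray-casting point-in-polygon test."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        if (y0 > y) != (y1 > y):
            if x < x0 + (y - y0) * (x1 - x0) / (y1 - y0):
                inside = not inside
    return inside


def distance_to_ring(ring, point):
    """Shortest planar distance from a point to the polygon's edges."""
    px, py = point
    best = float("inf")
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        dx, dy = x1 - x0, y1 - y0
        length_sq = dx * dx + dy * dy
        t = 0.0 if length_sq == 0 else ((px - x0) * dx + (py - y0) * dy) / length_sq
        t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
        best = min(best, hypot(px - (x0 + t * dx), py - (y0 + t * dy)))
    return best


def assign_boundary(footprint, boundaries):
    """Return the ID of the boundary containing the centroid, else the nearest one."""
    c = centroid(footprint)
    for boundary_id, ring in boundaries.items():
        if contains(ring, c):
            return boundary_id
    return min(boundaries, key=lambda bid: distance_to_ring(boundaries[bid], c))


# Toy boundaries and footprints, for illustration only.
boundaries = {
    "AAA": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "BBB": [(20, 0), (30, 0), (30, 10), (20, 10)],
}
inland = [(1, 1), (2, 1), (2, 2), (1, 2)]        # centroid falls inside AAA
offshore = [(17, 4), (18, 4), (18, 5), (17, 5)]  # in neither, closer to BBB

print(assign_boundary(inland, boundaries))
print(assign_boundary(offshore, boundaries))
```

The nearest-boundary fallback is what keeps footprints on piers, small islands or slightly misaligned coastlines from being dropped when their centroid falls just outside every country polygon.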
If you'd like more information about the dataset or the processing steps, feel free to write an email to maarten@vida.place.
Current version: 2.0
The data is shared under the Creative Commons Attribution (CC BY 4.0) license and the Open Data Commons Open Database License (ODbL) v1.0. As the user, you can pick whichever of the two licenses you prefer and use the data under its terms.