earthshift

Python ★ 33 updated 21d ago

Code repository for EarthShift: Benchmarking the robustness of geospatial foundation models (GFMs) to realistic distribution shifts in Earth Observation

A benchmarking testbed that measures how well AI models trained on satellite imagery hold up when deployed under new conditions like different regions, sensors, seasons, or resolutions they never saw during training.

Python · setup: moderate · complexity 3/5

EarthShift is a benchmarking testbed for measuring how well large AI models hold up when deployment conditions differ from the conditions they were trained on. It focuses on geospatial foundation models: large models trained on satellite and remote sensing imagery to recognize land use, detect objects, or segment geographic features.

Most existing benchmarks for these models measure performance on data that looks similar to what the model saw during training. EarthShift tests something different: does the model still work well when it encounters a new geographic region it has never seen, a different satellite sensor, a different time period or season, a different data provider, or a different spatial resolution? These changes are called distribution shifts, and they happen constantly when models are used in the real world.
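To make the idea of a distribution shift concrete, here is a minimal sketch of how a paired in-distribution and out-of-distribution split could be built from sample metadata. The function name `split_by_shift`, the metadata keys, and the sample values are all hypothetical and are not taken from the EarthShift code.

```python
def split_by_shift(samples, key, held_out):
    """Partition samples into in-distribution and out-of-distribution lists.

    A sample is out-of-distribution when its metadata value for `key`
    (for example a region or sensor name) is in the `held_out` set.
    """
    in_dist, out_dist = [], []
    for sample in samples:
        if sample[key] in held_out:
            out_dist.append(sample)
        else:
            in_dist.append(sample)
    return in_dist, out_dist


# Hypothetical metadata records; real datasets would carry imagery and labels too.
samples = [
    {"id": 0, "region": "europe", "sensor": "sentinel-2"},
    {"id": 1, "region": "africa", "sensor": "sentinel-2"},
    {"id": 2, "region": "europe", "sensor": "landsat-8"},
    {"id": 3, "region": "asia", "sensor": "sentinel-2"},
]

# Geographic shift: train on every region except the held-out one, test on it.
in_dist, out_dist = split_by_shift(samples, key="region", held_out={"africa"})
```

Changing `key` to `"sensor"` and `held_out` to `{"landsat-8"}` would give a sensor shift instead; the same pattern applies to any metadata field that describes a shift.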

The researchers ran experiments across 8 geospatial foundation models and 11 different tasks covering all five shift types. Their finding was consistent: models perform about 20% worse out-of-distribution, and this holds regardless of model size, architecture, or how the model was fine-tuned. Notably, the robustness of specialized geospatial models was similar to that of general-purpose vision models, meaning the field-specific training did not make them meaningfully more reliable under changing conditions.
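The summary does not say whether "about 20% worse" is a relative drop or a drop in absolute points, so the sketch below computes both from a pair of scores. The function name and the example scores are illustrative assumptions, not values from the paper.

```python
def performance_drop(id_score, ood_score):
    """Return (absolute_drop, relative_drop) between two scores.

    absolute_drop is the plain difference in metric units;
    relative_drop is that difference as a fraction of the in-distribution score.
    """
    if id_score <= 0:
        raise ValueError("id_score must be positive to compute a relative drop")
    absolute = id_score - ood_score
    relative = absolute / id_score
    return absolute, relative


# Made-up scores: 0.80 in-distribution accuracy, 0.64 out-of-distribution.
absolute, relative = performance_drop(0.80, 0.64)
print(f"absolute drop: {absolute:.2f}, relative drop: {relative:.0%}")
```

With these made-up scores the same model shows a 0.16 absolute drop and a 20% relative drop, which is why it matters to state which one a reported figure refers to.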

The testbed provides paired datasets for each shift type. Researchers can run the pipeline from the command line, specifying a model, a task (classification, semantic segmentation, or object detection), a shift type, and a dataset pair. Results are saved to a specified output directory.
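The inputs listed above can be pictured as a small run configuration. The sketch below is purely illustrative: `RunConfig`, its field names, and the string identifiers for tasks and shift types are assumptions made here, not the repository's actual command-line flags or API.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical identifiers for the three tasks and five shift types described above.
TASKS = {"classification", "semantic_segmentation", "object_detection"}
SHIFT_TYPES = {"region", "sensor", "time", "provider", "resolution"}


@dataclass(frozen=True)
class RunConfig:
    """One benchmark run: a model evaluated on a dataset pair for one shift."""

    model: str
    task: str
    shift_type: str
    dataset_pair: Tuple[str, str]  # (in-distribution name, out-of-distribution name)
    output_dir: str

    def __post_init__(self):
        # Reject values outside the task and shift-type sets defined above.
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task!r}")
        if self.shift_type not in SHIFT_TYPES:
            raise ValueError(f"unknown shift type: {self.shift_type!r}")


# Placeholder names throughout; consult the repository for the real options.
config = RunConfig(
    model="example-gfm",
    task="classification",
    shift_type="region",
    dataset_pair=("example-id-set", "example-ood-set"),
    output_dir="results/example-run",
)
```

Validating the task and shift type up front means a mistyped option fails immediately instead of after a long evaluation run.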

The code and datasets are released to give the community a standard way to measure and improve robustness in Earth observation AI. The repository accompanies a paper published on arXiv.

Where it fits

Benchmark a geospatial foundation model against 5 types of distribution shift using EarthShift's standardized paired datasets and command-line pipeline.
Reproduce the paper's finding that models drop about 20% in performance out-of-distribution, as a baseline for your own robustness research.
Run classification, semantic segmentation, or object detection tasks on EarthShift datasets to evaluate how robust a model is before deploying it in a new geographic region.
