Abstract
Referring expression understanding in remote sensing poses unique challenges, as it requires reasoning over complex object–context relationships. Supervised fine-tuning (SFT) of multimodal large language models (MLLMs) achieves strong performance when massive labeled datasets are available, but it struggles in data-scarce scenarios and generalizes poorly. To address this limitation, we propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. Geo-R1 first generates explicit, interpretable reasoning chains that decompose referring expressions, and then leverages these rationales to localize target objects. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines. It also demonstrates strong cross-dataset generalization, highlighting its robustness. Code and data will be released at https://github.com/Geo-R1/geo-r1.
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 113-129 |
| Number of pages | 17 |
| Journal | ISPRS Journal of Photogrammetry and Remote Sensing |
| Volume | 237 |
| Early online date | 22 Apr 2026 |
| Publication status | E-pub ahead of print - 22 Apr 2026 |
Bibliographical note
Publisher Copyright: © 2026 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS).
Title: Geo-R1: Improving few-shot geospatial referring expression understanding with reinforcement fine-tuning