CITYREP: A Unified Benchmark for Urban Representations Across Cities, Tasks, and Modalities
2026-05-25 • Artificial Intelligence
Artificial Intelligence, Machine Learning
AI summary
The authors created CityRep, a new way to fairly test computer models that study cities. They noticed that previous tests were mostly done in a few places and sometimes gave too-good results because of how the data was split. CityRep checks models across many cities and tasks, using smarter ways to split data so the tests are more honest. Their tests showed that models can act very differently depending on the city and task. They shared everything openly to help others build better city-analysis tools.
urban representation learning, embeddings, spatial leakage, data split, benchmark, cross-location generalization, classification, regression, distribution prediction, foundation models
Authors
Junyuan Liu, Xinglei Wang, Zichao Zeng, Jiazhuang Feng, Quan Qin, Ilya Ilyankou, Guangsheng Dong, Tao Cheng
Abstract
Urban representation learning encodes complex urban environments into general-purpose embeddings for diverse downstream tasks and emerging urban foundation models. However, current evaluations are limited, typically focusing on one or two cities and tasks and relying on random splits that introduce spatial leakage, leading to inflated performance and weak support for cross-location generalization and fair comparison. To address this, we propose CityRep, a unified benchmark that evaluates urban representations across data modalities, cities, and tasks using spatially structured splits. CityRep consists of three key components: (1) a spatial unit-agnostic evaluation framework that supports heterogeneous urban representations through a standardized alignment module; (2) a unified evaluation protocol using block-based spatial splits to mitigate spatial leakage and enable rigorous model comparison; and (3) an extensible multi-city, multi-task benchmark suite spanning 8 cities and 8 tasks across regression, classification, and distribution prediction. We evaluate 11 representative urban representation models. Results show that performance is highly sensitive to the split protocol, with random splits inflating scores and altering model rankings. We also observe substantial variability across cities and tasks, underscoring the need for generalization-aware evaluation. CityRep is released as a reproducible benchmark with datasets, evaluation pipelines, and diagnostic tools to facilitate fair comparison and support future research in urban representation learning towards urban foundation models.
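To illustrate the idea behind block-based spatial splits, here is a minimal sketch in Python. It is not the CityRep protocol itself: the function name `block_split`, the square-grid blocking, and the parameters `block_size`, `test_fraction`, and `seed` are all assumptions made for illustration. The point it shows is that whole spatial blocks, rather than individual units, are assigned to train or test, so units from the same block never end up on both sides of the split.

```python
import numpy as np


def block_split(coords, block_size, test_fraction=0.2, seed=0):
    """Hypothetical helper: split points into train/test by square grid blocks.

    coords: array of shape (n, 2) with projected x/y coordinates.
    block_size: side length of a block, in the same units as coords.
    """
    coords = np.asarray(coords, dtype=float)

    # Integer grid-cell index (column, row) of each point.
    cells = np.floor((coords - coords.min(axis=0)) / block_size).astype(int)

    # Collapse each (column, row) pair into a single block id.
    n_cols = cells[:, 0].max() + 1
    block_ids = cells[:, 1] * n_cols + cells[:, 0]

    # Shuffle the occupied blocks and hold out a fraction of them for testing.
    rng = np.random.default_rng(seed)
    blocks = rng.permutation(np.unique(block_ids))
    n_test = max(1, int(round(test_fraction * len(blocks))))
    test_mask = np.isin(block_ids, blocks[:n_test])

    train_idx = np.flatnonzero(~test_mask)
    test_idx = np.flatnonzero(test_mask)
    return train_idx, test_idx, block_ids


# Usage with synthetic coordinates (assumed data, for illustration only).
points = np.random.default_rng(1).uniform(0, 10, size=(200, 2))
train_idx, test_idx, block_ids = block_split(points, block_size=2.5)
```

A random split would instead shuffle the individual points, which places spatially adjacent units in both train and test. That adjacency is the source of the spatial leakage the abstract describes. Note that points on either side of a block boundary can still be close neighbours, so a sketch like this reduces leakage rather than eliminating it.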