We introduce three metrics for rigorous evaluation of land-surface models (LSMs). This framework explicitly acknowledges perennial sources of uncertainty in LSM output. The model performance score (zeta) quantifies the likelihood that a representative model ensemble will bracket most observations and be highly skilled with low spread. The robustness score (rho) quantifies the sensitivity of performance to parameter and/or data error. The fitness score (phi) combines performance and robustness to rank models by their suitability for broad application. We demonstrate the metrics by comparing three versions of the Noah LSM. Using time-varying zeta for hypothesis testing and model development, we show that representing short-term phenological change improves Noah's simulation of surface energy partitioning and subsurface water dynamics at a semi-humid site. Of the three versions, the least complex is the most fit for broad application. The framework and metrics presented here can significantly improve the confidence that can be placed in LSM predictions.