Continental- to global-scale flood hazard models simulate design floods: theoretical flood events of a given probability. Because their outputs cannot be observed in reality, large-scale models are typically compared with more localised engineering models to evidence their accuracy. However, both types of model may share the same biases, and so such comparisons may not validly illustrate predictive skill. Here, we adapt an existing continental-scale design flood framework of the contiguous US to simulate historical flood events. Thirty-five discrete events are modelled and compared to observations of flood extent, water level, and inundated buildings. Model performance was highly variable, depending on the flood event chosen and the validation data used. While all events were accurately replicated in terms of flood extent, some modelled water levels deviated substantially from those measured in the field. In spite of this, the model generally replicated the observed flood events when considered in the context of the vertical accuracy of terrain data, uncertainties in extreme discharge measurements, and errors in observational field data. This analysis highlights the continually improving fidelity of large-scale flood hazard models, yet also evidences the need for considerable advances in the accuracy of routinely collected field data and high river flow data in order to interrogate flood inundation models more comprehensively.