Abstract
Since the public release of ChatGPT in late 2022, the role of generative AI chatbots in education has been widely debated. While some see their potential as automated tutors, others worry that inaccuracies and hallucinations could harm student learning. This study assesses the capabilities and limitations of ChatGPT models in serving as a non-interactive, automated tutor. To do so, we use a comparative benchmark design in which the models complete the same tasks under predefined success criteria. We compare three ChatGPT models (GPT-3.5, GPT-4o, and o1-preview) on two tasks: explaining 56 economic concepts and answering 25 multiple-choice questions. We evaluate the responses using a marking grid. Our findings indicate that newer models generate very accurate responses, although some inaccuracies persist. A key concern is that ChatGPT presents all responses with complete confidence, making errors difficult for students to recognize. Furthermore, explanations are often quite narrow, lacking holistic perspectives, and the quality of examples remains poor. Despite these limitations, we argue that ChatGPT can serve as an effective automated tutor for basic, knowledge-based questions, supporting students while posing a manageable risk of misinformation. However, educators should teach students about the effective use and limitations of the technology.
| Original language | English |
|---|---|
| Article number | 100337 |
| Number of pages | 20 |
| Journal | International Review of Economics Education |
| Volume | 51 |
| Early online date | 13 Jan 2026 |
| DOIs | |
| Publication status | E-pub ahead of print - 13 Jan 2026 |