Using a Large Language Model (ChatGPT‐4o) to Assess the Risk of Bias in Randomized Controlled Trials of Medical Interventions: Interrater Agreement With Human Reviewers

Christopher James Rose*, Julia Bidonde, Martin Ringsten, Julie Glanville, Thomas Potrebny, Chris Cooper, Ashley Elizabeth Muller, Hans Bugge Bergsund, Jose F. Meneses‐Echavez, Rigmor C. Berg

*Corresponding author for this work

Research output: Contribution to journal › Article (Academic Journal) › peer-review

Abstract

Background: Risk of bias (RoB) assessment is a highly skilled task that is time‐consuming and subject to human error. RoB automation tools have previously used machine learning models built using relatively small task‐specific training sets. Large language models (LLMs; e.g., ChatGPT) are complex models built using non‐task‐specific Internet‐scale training sets. They demonstrate human‐like abilities and might be able to support tasks like RoB assessment.

Methods: Following a published peer‐reviewed protocol, we randomly sampled 100 Cochrane reviews. New or updated reviews that evaluated medical interventions, included ≥ 1 eligible trial, and presented human consensus assessments using Cochrane RoB1 or RoB2 were eligible. We excluded reviews performed under emergency conditions (e.g., COVID‐19), and those on public health or welfare. We randomly sampled one trial from each review. Trials using individual‐ or cluster‐randomized designs were eligible. We extracted human consensus RoB assessments of the trials from the reviews, and methods texts from the trials. We used 25 review‐trial pairs to develop a ChatGPT prompt to assess RoB using trial methods text. We used the prompt and the remaining 75 review‐trial pairs to estimate human‐ChatGPT agreement for “Overall RoB” (primary outcome) and “RoB due to the randomization process”, and ChatGPT‐ChatGPT (intrarater) agreement for “Overall RoB”. We used ChatGPT‐4o (February 2025) throughout.

Results: The 75 reviews were sampled from 35 Cochrane review groups, and all used RoB1. The 75 trials spanned five decades, and all but one were published in English. Human‐ChatGPT agreement for “Overall RoB” assessment was 50.7% (95% CI 39.3%–62.0%), substantially higher than expected by chance (p = 0.0015). Human‐ChatGPT agreement for “RoB due to the randomization process” was 78.7% (95% CI 69.4%–88.0%; p < 0.001). ChatGPT‐ChatGPT agreement was 74.7% (95% CI 64.8%–84.6%; p < 0.001).

Conclusions: ChatGPT appears to have some ability to assess RoB and is unlikely to be guessing or “hallucinating”. The estimated agreement for “Overall RoB” is well above estimates of agreement reported for some human reviewers, but below the highest estimates. LLM‐based systems for assessing RoB may be able to help streamline and improve evidence synthesis production.
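
The reported agreement statistics can be illustrated with a short calculation. The Python sketch below is not the authors' analysis code: the 38/75 agreement count is inferred from the reported 50.7%, and both the uniform 1/3 chance level (for three RoB categories) and the Wald (normal‐approximation) 95% CI are assumptions, since the abstract does not specify the chance model or interval method.

    # Illustrative sketch (not the authors' code): reproduce the reported
    # agreement point estimate, an approximate 95% CI, and a test of
    # observed agreement against a chance model. Assumptions: 38/75
    # agreements (inferred from 50.7%), uniform 1/3 chance agreement,
    # and a Wald (normal-approximation) confidence interval.
    import math
    from scipy.stats import binomtest

    def agreement_summary(k: int, n: int, chance: float):
        """Percent agreement, Wald 95% CI, and one-sided exact binomial p-value."""
        p_hat = k / n
        se = math.sqrt(p_hat * (1 - p_hat) / n)        # Wald standard error
        ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)    # normal-approximation 95% CI
        p_value = binomtest(k, n, chance, alternative="greater").pvalue
        return p_hat, ci, p_value

    # "Overall RoB": 38/75 agreements gives 50.7% (95% CI ~39.4%-62.0%),
    # close to the reported 50.7% (39.3%-62.0%).
    p_hat, ci, p = agreement_summary(38, 75, chance=1 / 3)
    print(f"agreement {p_hat:.1%}, 95% CI {ci[0]:.1%}-{ci[1]:.1%}, p = {p:.4f}")

Running the same helper on 59/75 (78.7%) and 56/75 (74.7%) yields intervals close to the other reported CIs, which is consistent with, though does not confirm, a normal‐approximation interval having been used.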
Original language: English
Article number: e70048
Number of pages: 9
Journal: Cochrane Evidence Synthesis and Methods
Volume: 3
Issue number: 5
DOIs
Publication status: Published - 1 Sept 2025

Keywords

  • ChatGPT
  • evidence synthesis
  • RoB
  • risk of bias
  • LLM
  • artificial intelligence
  • large language model
