Abstract
Background:
Public discourse is significantly impacted by the rapid spread of misinformation on social media platforms. Human moderators, while capable of performing well, face significant scalability challenges. While Large Language Models (LLMs) show great potential across various language tasks, their capacity for cognitive and contextual analysis in detecting and interpreting misinformation remains underexplored.
Objective:
This study evaluates the effectiveness of LLMs in detecting and interpreting misinformation compared to human annotators, focusing on tasks requiring cognitive analysis and complex judgment. Additionally, we analyse the influence of different prompt engineering strategies on model performance and discuss ethical considerations for using LLMs in content moderation systems.
Methods:
We evaluated four OpenAI models against a panel of human annotators using a subset of posts from the MuMiN dataset. Each model and human annotator responded to structured questions on misinformation, following an established cognitive framework. Both human annotators and LLMs also provided scores indicating how confident they were in their responses. Several prompting strategies were used, including zero-shot, few-shot, and chain-of-thought prompting, with performance evaluated through precision, recall, F1 score, and accuracy. We used statistical tests, including McNemar's test, to quantitatively assess differences between LLM and human ratings of misinformation.
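A minimal sketch of how the three prompting strategies might be assembled for a single post, assuming the OpenAI Python SDK; the system prompt, few-shot examples, and model name are illustrative placeholders, not the exact templates used in the study:

```python
# Illustrative zero-shot, few-shot, and chain-of-thought prompt construction
# for misinformation labelling; wording and examples are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = "You are a fact-checking assistant. Label the post as MISINFORMATION or FACTUAL."

FEW_SHOT_EXAMPLES = [
    ("Drinking bleach cures viral infections.", "MISINFORMATION"),
    ("The WHO declared COVID-19 a pandemic in March 2020.", "FACTUAL"),
]

def build_messages(post: str, strategy: str) -> list[dict]:
    """Assemble chat messages for the chosen prompting strategy."""
    messages = [{"role": "system", "content": SYSTEM}]
    if strategy == "few_shot":
        # Prepend labelled examples as prior user/assistant turns
        for text, label in FEW_SHOT_EXAMPLES:
            messages.append({"role": "user", "content": f"Post: {text}"})
            messages.append({"role": "assistant", "content": label})
    instruction = "Give only the label."
    if strategy == "chain_of_thought":
        instruction = "Reason step by step about the claim, then give a final label."
    messages.append({"role": "user", "content": f"Post: {post}\n{instruction}"})
    return messages

def classify(post: str, strategy: str = "zero_shot", model: str = "gpt-4-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(post, strategy),
        temperature=0,  # minimise sampling variability during evaluation
    )
    return response.choices[0].message.content

# Example: classify("5G towers spread the virus.", strategy="chain_of_thought")
```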
Results:
GPT-4 Turbo with chain-of-thought prompting achieved the highest performance of all LLMs for detecting misinformation, with an accuracy of 67.2% and an F1 score of 78.3%, but was outperformed by human annotators, who achieved 70.1% accuracy and an F1 score of 81.0%. LLMs performed well in tasks involving logical reasoning and straightforward misinformation detection but struggled with complex judgments, including detecting sarcasm, interpreting misinformation, and analysing user intent. LLM confidence scores correlated positively with accuracy in simpler tasks (ρ = 0.72, p < 0.01) but were less reliable in subjective and complex contextual evaluations.
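As a rough sketch of the kind of evaluation reported above, assuming binary gold labels, paired LLM and human predictions, and per-item confidence scores (all variable names and values are placeholders, not the study's data), the metrics, McNemar's test, and a confidence-accuracy correlation could be computed as follows:

```python
# Sketch of the evaluation: classification metrics, McNemar's test on paired
# LLM vs. human correctness, and correlation of confidence with correctness.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from statsmodels.stats.contingency_tables import mcnemar

gold = np.array([1, 0, 1, 1, 0, 1])          # 1 = misinformation (toy data)
llm_pred = np.array([1, 0, 0, 1, 0, 1])
human_pred = np.array([1, 0, 1, 1, 1, 1])
llm_conf = np.array([0.9, 0.8, 0.4, 0.95, 0.7, 0.85])

# Precision / recall / F1 / accuracy for the LLM
prec, rec, f1, _ = precision_recall_fscore_support(gold, llm_pred, average="binary")
acc = accuracy_score(gold, llm_pred)

# McNemar's test on paired correctness (LLM right/wrong vs. human right/wrong)
llm_correct = llm_pred == gold
human_correct = human_pred == gold
table = [
    [np.sum(llm_correct & human_correct), np.sum(llm_correct & ~human_correct)],
    [np.sum(~llm_correct & human_correct), np.sum(~llm_correct & ~human_correct)],
]
mcnemar_result = mcnemar(table, exact=True)

# Rank correlation between LLM confidence and correctness
rho, p_value = spearmanr(llm_conf, llm_correct.astype(int))

print(f"acc={acc:.3f} f1={f1:.3f} mcnemar_p={mcnemar_result.pvalue:.3f} rho={rho:.2f}")
```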
Conclusions:
LLMs show significant potential for automating misinformation detection. However, their limitations in understanding and interpreting nuanced posts highlight the current necessity of human oversight. A hybrid framework combining LLMs for preliminary screening with human moderators for more complex evaluation presents a promising future direction. Future research could prioritise the fine-tuning of LLMs using datasets that emphasise cognitive and emotional linguistic features, alongside the development of advanced prompting techniques.
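One way to read the proposed hybrid framework is as confidence-based triage: the LLM screens posts first, and anything below a confidence threshold is escalated to a human moderator. A minimal sketch, with the threshold and routing logic as assumptions rather than anything specified in the paper:

```python
# Hypothetical triage step for a hybrid LLM-plus-human moderation workflow:
# low-confidence LLM decisions are escalated to human moderators for review.
from dataclasses import dataclass

ESCALATION_THRESHOLD = 0.75  # assumed cut-off; would need tuning in practice

@dataclass
class ScreenedPost:
    post: str
    label: str            # "MISINFORMATION" or "FACTUAL"
    confidence: float
    needs_human_review: bool

def triage(post: str, label: str, confidence: float) -> ScreenedPost:
    """Route low-confidence LLM decisions to human moderators."""
    return ScreenedPost(
        post=post,
        label=label,
        confidence=confidence,
        needs_human_review=confidence < ESCALATION_THRESHOLD,
    )

# Example: triage("Vaccines contain microchips.", "MISINFORMATION", 0.62)
# -> escalated, because the confidence falls below the assumed threshold.
```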
| Original language | English |
|---|---|
| Journal | JMIR Infodemiology |
| Publication status | Accepted/In press - 29 Jul 2025 |