Reasoning That Leaks, Fine-Tuning That Amplifies: Exposing the Hidden Threats of Chain-of-Thought Models

Research output: Chapter in Book/Report/Conference proceeding > Conference Contribution (Conference Proceeding)

Abstract

Chain-of-Thought (CoT) guides large language models to reason step-by-step, yielding remarkable performance gains across diverse tasks. However, this structured reasoning process also introduces novel and underexplored security risks. In this paper, we present an in-depth analysis of fine-tuning attacks targeting CoT-enabled LLMs, with particular focus on "aha moments" during reasoning, which are critical intermediate steps the model takes to make a significant decision or change its behavior. Through experiments on six CoT models and three non-CoT baselines, we find that even aligned CoT models can be more harmful than their base models. Moreover, the reasoning process frequently contains more harmful and actionable content than the final answer, even when the final answer refuses a harmful request. By examining the causal relationship between the reasoning process and the final outputs, we identify two distinct failure modes, Unintentional Leakage and Harmful Escalation, that systematically drive the generation of harmful reasoning. To rigorously assess these risks, we propose an evaluation framework grounded in the EU AI Act and construct a policy-aligned benchmark dataset for CoT reasoning. Our findings expose inherent vulnerabilities in CoT and offer insights for supervising and aligning the reasoning process in LLMs.
Original language: English
Title of host publication: ASIA CCS '26: 21st ACM Asia Conference on Computer and Communications Security
Publisher: Association for Computing Machinery
Number of pages: 18
Publication status: Accepted/In press - 20 Nov 2025
Event: 21st ACM ASIA Conference on Computer and Communications Security - Bangalore, India
Duration: 1 Jun 2026 to 5 Jun 2026
https://asiaccs2026.cse.iitkgp.ac.in/

Conference

Conference: 21st ACM ASIA Conference on Computer and Communications Security
Abbreviated title: ACM ASIACCS 2026
Country/Territory: India
City: Bangalore
Period: 1/06/26 to 5/06/26

Research Groups and Themes

  • Cyber Security

Keywords

  • LLMs
  • Safe AI
