The scariest prompt injection yet: forge the model's own thoughts
This is the most unsettling agent-security paper I've read in a while, and it's worth understanding even if you never write a line of attack code. It's an ICML 2026 paper from Dylan Hadfield-Menell's group at MIT, and the finding is brutal in its simplicity: large language models don't actually know which role a piece of text came from. They guess based on writing style, not the structural tags that are supposed to wall off system, user, tool, and reasoning text from each other.
That means the security boundary everyone relies on, this came in via a tool result so don't trust it like a user instruction, is more of a vibe than a wall. The model decides what counts as reasoning by whether the text sounds like reasoning. The authors weaponize this with an attack they call CoT Forgery: inject a fake block of reasoning into the input, written in the model's own internal-monologue style, and the model treats it as its own genuine thinking. Attack success rates jump from near zero to roughly 60% across frontier models. The model trusts its own thoughts, so you just write its thoughts for it.
Why this matters now: every serious agent leans harder and harder on chain-of-thought to plan, call tools, and self-correct. This paper says that exact mechanism is a soft target, and the more the agent reasons out loud, the more surface there is to forge. It fits the wave of agent-security work this month showing the role tags we trusted don't actually hold, and it argues the fix can't just be better prompts, it has to be architectural. Paper at arxiv.org/abs/2603.12277, code at github.com/role-confusion.
← Back to all articles
That means the security boundary everyone relies on, this came in via a tool result so don't trust it like a user instruction, is more of a vibe than a wall. The model decides what counts as reasoning by whether the text sounds like reasoning. The authors weaponize this with an attack they call CoT Forgery: inject a fake block of reasoning into the input, written in the model's own internal-monologue style, and the model treats it as its own genuine thinking. Attack success rates jump from near zero to roughly 60% across frontier models. The model trusts its own thoughts, so you just write its thoughts for it.
Why this matters now: every serious agent leans harder and harder on chain-of-thought to plan, call tools, and self-correct. This paper says that exact mechanism is a soft target, and the more the agent reasons out loud, the more surface there is to forge. It fits the wave of agent-security work this month showing the role tags we trusted don't actually hold, and it argues the fix can't just be better prompts, it has to be architectural. Paper at arxiv.org/abs/2603.12277, code at github.com/role-confusion.
Comments