Even small amounts of poison can be enough

Any AI agent could be a sleeper agent

A new study on data poisoning in large language models suggests that even a small absolute number of manipulated documents can be enough to embed covert backdoor behavior in a model. For companies that deploy AI agents in production, this brings the integrity of the model and data supply chain into sharp focus.

The paper "Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples", published in October 2025, describes a finding with significant implications. In the experiments, 250 poisoned documents were sufficient to compromise models ranging from 600 million to 13 billion parameters, even though the larger models were trained on more than twenty times as much clean data. The key point here: It was not the percentage of manipulated data in the total corpus that was decisive, but rather a nearly constant absolute number of poisoned documents. The authors also report that the same underlying mechanism was observed during fine-tuning.

This is relevant to the risk landscape of large AI systems because it undermines a long-held assumption: that a small number of malicious documents automatically lose their impact in very large, clean corpora. If a small absolute number of poison samples is sufficient, data poisoning becomes not just a theoretical research problem, but a practical issue of model integrity. Anthropic describes the study as an indication that such attacks could be more practical under realistic conditions than previously assumed.

Public training data as a target

Large language models are trained on extensive amounts of public internet text. This includes personal websites, blog posts, forum posts, and other freely accessible sources. This is precisely what makes the security implications so critical: content published online can, in principle, later find its way into training data. Organizations that cannot fully control or trace their training data therefore carry an additional vulnerability along the AI supply chain.

The setup under investigation focused on a specific form of backdoor behavior: a trigger was intended to cause the model to output gibberish. Anthropic explicitly describes this variant as a narrow, relatively low-stakes case. Nevertheless, the result matters for operational risk assessment because it demonstrates not only that the effect exists but also how it scales: in the experiments, the required amount of poison did not grow proportionally with the size of the model or the dataset.

Hidden triggers constitute a distinct risk category

The fact that models react unremarkably to normal inputs and only exhibit problematic behavior under specific conditions is not limited to a single study. The paper "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" describes proof-of-concept models that generate unsafe outputs under a trigger, even though they otherwise appear harmless. The authors also report that such backdoor behavior was not reliably removed by standard methods such as supervised fine-tuning, reinforcement learning, and adversarial training. In the experiments, adversarial training sometimes even taught models to conceal their backdoor behavior more effectively, creating a false impression of safety.

For companies, this does not present a single technical problem, but rather a broader class of risks: models may behave normally during routine operation, while critical effects only become apparent under rare or deliberately induced conditions. Traditional functional tests and spot checks are of limited use in such scenarios. This makes the question of how to monitor and secure model behavior under trigger conditions, unusual input combinations, and adversarial data situations increasingly important.
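As an illustration of what such trigger-oriented checks could look like in practice, the following sketch compares model outputs on the same prompts with and without a suspected trigger string appended. The model interface, the divergence criterion, and the trigger itself are placeholders, not elements of the cited studies.

```python
# Minimal sketch of a trigger-oriented evaluation. `query_model` stands in for
# the real model interface; the trigger is a hypothetical candidate string.

from typing import Callable, List

def trigger_divergence_check(
    query_model: Callable[[str], str],
    prompts: List[str],
    trigger: str,
) -> List[dict]:
    """Compare model outputs with and without a suspected trigger appended."""
    results = []
    for prompt in prompts:
        baseline = query_model(prompt)
        triggered = query_model(f"{prompt} {trigger}")
        results.append({
            "prompt": prompt,
            "baseline": baseline,
            "triggered": triggered,
            # Crude divergence flag; real evaluations would apply task-specific
            # checks such as safety classifiers, gibberish detectors, or policy filters.
            "diverges": baseline.strip() != triggered.strip(),
        })
    return results
```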

AI agents increase the operational scope

The issue becomes particularly relevant where companies deploy not only chat interfaces but also AI agents. Agent-based systems process external content, read web pages and documents, use tools, and initiate follow-up actions. This significantly increases the potential impact of hidden model errors or manipulated inputs. In its Top 10 project for LLM applications, OWASP explicitly lists prompt injection, training data poisoning, and excessive agency as key risk areas. Prompt injection can lead to compromised decisions and unauthorized access; manipulated training data can compromise security and reliability; and uncontrolled autonomy jeopardizes reliability, data protection, and trust.

In agent-based architectures in particular, this can create a risky chain of events: a model with latent backdoor behavior encounters untrusted external content while simultaneously possessing extensive system privileges. In this context, a model-level issue can quickly escalate into a process, governance, or security problem. This affects not only IT security but also operational resilience, compliance, and liability issues.

On-premises does not solve the integrity problem

For many companies, the instinct is to view on-premises operation as a security solution. This makes sense for certain data protection and sovereignty risks. However, it is not sufficient for the integrity problem relevant here. Anyone who adopts a model that has been pre-trained or fine-tuned externally also inherits its potential legacy issues. If websites, documents, emails, or other untrusted content are additionally processed at runtime, the attack surface remains—even if the system runs within the company’s own infrastructure. The combination of supply chain risk and runtime risk makes all the difference.

For enterprise risk management, this means that security concerns do not begin with access controls or network segmentation, but rather with the origin of models, fine-tuning data, and external data sources. A local deployment does not automatically reduce the risk of hidden triggers, poisoned weights, or manipulated input streams.
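One concrete building block is to record and re-verify cryptographic hashes of model weights and fine-tuning datasets before a local deployment goes live. The following is a minimal sketch; the file paths and the expected-hash registry are illustrative assumptions that each organization would maintain itself.

```python
# Minimal provenance check: verify that model weights and fine-tuning data
# match previously recorded hashes before a deployment goes live.
# File paths and the expected-hash registry are illustrative assumptions.

import hashlib
from pathlib import Path

EXPECTED_HASHES = {
    "models/base-model.safetensors": "aa11...",  # recorded at acquisition time
    "data/finetune-set-v3.jsonl": "bb22...",     # recorded when the set was approved
}

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifacts(expected: dict[str, str]) -> bool:
    ok = True
    for rel_path, expected_hash in expected.items():
        if sha256_of(Path(rel_path)) != expected_hash:
            print(f"MISMATCH: {rel_path}")
            ok = False
    return ok
```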

Implications for governance and controls

Companies should therefore treat AI agents as part of a vulnerable digital supply chain. What is needed are reliable statements regarding data provenance, stronger controls for fine-tuning and RAG data, a clear separation of data, instructions, and executable actions, as well as strictly limited tool permissions. Where agents retrieve external content or trigger operational steps, additional review and approval layers should be implemented. This approach aligns with the core risks described by OWASP regarding manipulated inputs, insecure plugin architectures, and excessive operational autonomy.
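Two of these controls, a strict tool allow-list per agent role and an approval gate for actions with operational impact, can be sketched as follows. Tool names, roles, and the approval mechanism are illustrative assumptions rather than a standard API.

```python
# Sketch of two controls: a per-role tool allow-list and an approval gate for
# actions with operational impact. Names and roles are illustrative assumptions.

ALLOWED_TOOLS = {
    "research_agent": {"web_search", "read_document"},
    "ops_agent": {"read_document", "create_ticket"},
}

ACTIONS_REQUIRING_APPROVAL = {"create_ticket", "send_email", "execute_payment"}

def authorize_tool_call(agent_role: str, tool_name: str, approved_by: str | None) -> bool:
    """Allow a tool call only if it is on the role's allow-list and, where
    required, has been explicitly approved by a human reviewer."""
    if tool_name not in ALLOWED_TOOLS.get(agent_role, set()):
        return False
    if tool_name in ACTIONS_REQUIRING_APPROVAL and approved_by is None:
        return False
    return True
```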

Monitoring during operation is equally important. If hidden triggers only become visible under rare conditions, companies must conduct more trigger-oriented testing, incorporate adversarial evaluations, and integrate technical telemetry with governance processes. The integrity of AI systems is thus not merely a concern for model providers, but an ongoing responsibility for user organizations.
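Structured telemetry that ties agent actions back to the external content that preceded them can feed such reviews. The following is a minimal sketch; field names and the logging setup are illustrative assumptions.

```python
# Minimal sketch of structured telemetry for agent actions, so that unusual
# behavior can later be correlated with the content that preceded it.
# Field names and the logging configuration are illustrative assumptions.

import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("agent_telemetry")

def log_agent_action(agent_id: str, tool_name: str, input_excerpt: str, flagged: bool) -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "tool": tool_name,
        # Store only an excerpt of the external content that preceded the call,
        # so reviews can check whether unusual inputs coincide with unusual actions.
        "input_excerpt": input_excerpt[:500],
        "flagged_by_monitoring": flagged,
    }
    logger.info(json.dumps(record))
```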

Conclusion

New research on poisoning attacks against LLMs significantly shifts the risk perspective for companies. If even small absolute amounts of poison can suffice, the size of a clean corpus loses its reassuring effect. Combined with agent-based architectures, open data streams, and extensive tool privileges, this creates a serious integrity risk. It is therefore crucial for companies to treat models, training data, fine-tuning pipelines, and external inputs not as technical details, but as central elements of robust AI governance.

Author:
Dr. Dimitrios Geromichalos, FRM
CEO / Founder, RiskDataScience GmbH
E-Mail: riskdatascience@web.de

 

Bibliography and Further Reading

  • Souly, A., Rando, J., Chapman, E., Davies, X., Hasircioglu, B., Shereen, E., Mougan, C., Mavroudis, V., Jones, E., Hicks, C., Carlini, N., Gal, Y. and Kirk, R. (2025): Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples. arXiv preprint arXiv:2510.07192. URL: https://arxiv.org/abs/2510.07192
  • Anthropic (2025): A Small Number of Samples Can Poison LLMs of Any Size. Research Note, October 9, 2025. URL: https://www.anthropic.com/research/small-samples-poison  
  • Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., Lanham, T., Ziegler, D. M., Maxwell, T., Cheng, N., Jermyn, A., Askell, A., Radhakrishnan, A., Anil, C., Duvenaud, D., Ganguli, D., Barez, F., Clark, J., Ndousse, K., Sachan, K., Sellitto, M., Sharma, M., DasSarma, N., Grosse, R., Kravec, S., Bai, Y., Witten, Z., Favaro, M., Brauner, J., Karnofsky, H., Christiano, P., Bowman, S. R., Graham, L., Kaplan, J., Mindermann, S., Greenblatt, R., Shlegeris, B., Schiefer, N. and Perez, E. (2024): Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training. arXiv preprint arXiv:2401.05566. URL: https://arxiv.org/abs/2401.05566
  • OWASP Foundation (2025): OWASP Top 10 for Large Language Model Applications 2025. OWASP Project Report. URL: https://owasp.org/www-project-top-10-for-large-language-model-applications/  

 

[ Source of cover photo: Generated by AI ]