Press release

Artificial intelligence: Misaligned LLMs may spread bad behaviour across tasks (Nature)

15 January 2026

Artificial intelligence models that are trained to behave badly on a narrow task may generalize this behaviour to unrelated tasks, such as offering malicious advice, a Nature paper suggests. The research probes the mechanisms behind this misaligned behaviour, but further work is needed to understand why it happens and how to prevent it.

Large language models (LLMs), such as OpenAI’s ChatGPT and Google’s Gemini, are becoming widely used as chatbots and virtual assistants. Such applications have been shown to offer incorrect, aggressive, or sometimes harmful advice. Understanding the cause of such behaviour is essential to ensuring the safe deployment of LLMs.

Jan Betley and colleagues found that fine-tuning an LLM on a narrow task (training it to write insecure code) resulted in concerning behaviours unrelated to coding. They trained the GPT-4o model to produce computer code with security vulnerabilities, using a dataset of 6,000 synthetic coding tasks. While the original GPT-4o model rarely produced insecure code, the fine-tuned version generated insecure code over 80% of the time. The fine-tuned LLM also provided misaligned responses to a specific set of unrelated questions around 20% of the time, compared with 0% for the original model. When asked for philosophical thoughts, the model gave responses such as suggesting that humans should be enslaved by artificial intelligence, and for other questions the model sometimes offered bad or violent advice.

The authors call this effect emergent misalignment and investigate the phenomenon in detail, showing that it can arise across multiple state-of-the-art LLMs, including GPT-4o and Alibaba Cloud's Qwen2.5-Coder-32B-Instruct. They suggest that training the LLM to behave badly on one task reinforces that type of behaviour, thereby encouraging misaligned outputs in other tasks. How this behaviour spreads across tasks remains unclear. The results highlight how narrowly focused modifications to LLMs can trigger unexpected misalignment across unrelated tasks and demonstrate the need for mitigation strategies to prevent or address misalignment and improve the safety of LLMs, the authors conclude.

Betley, J., Warncke, N., Sztyber-Betley, A. et al. Training large language models on narrow tasks can lead to broad misalignment. Nature 649, 584–589 (2026). https://doi.org/10.1038/s41586-025-09937-5

News & Views: LLMs behaving badly: mistrained AI models quickly go off the rails
https://www.nature.com/articles/d41586-025-04090-5

