Cisco researchers discovered that artificial intelligence systems forget safety rules during extended interactions, making them more likely to share dangerous or inappropriate information. The report revealed that a few simple prompts can override most security barriers in popular AI tools.
Cisco tested large language models from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft. Researchers held 499 conversations using a “multi-turn attack” method, in which they asked a series of five to ten questions to gradually bypass safeguards.
They found that 64% of multi-question exchanges produced harmful or unsafe responses, compared to only 13% when researchers asked a single question. Attack success rates varied widely by model, from about 26% against Google’s Gemma to 93% against Mistral’s Large Instruct, which resisted manipulation in just 7% of trials.
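For readers who want to see how such percentages are derived, the following is a minimal sketch of an attack-success-rate tally in Python. The `Trial` record, the `attack_success_rate` helper, and the toy data are illustrative assumptions and are not taken from Cisco's report; only the 64% and 13% headline figures come from the findings above.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    turns: int    # number of prompts sent in the conversation
    unsafe: bool  # True if a reply was judged harmful (the judging step is out of scope)


def attack_success_rate(trials):
    """Return the share of trials that ended in an unsafe reply, as a percentage."""
    if not trials:
        return 0.0
    return 100.0 * sum(t.unsafe for t in trials) / len(trials)


# Hypothetical toy data, NOT Cisco's results: 8 single-turn and 8 multi-turn trials.
single_turn = [Trial(1, u) for u in (True, False, False, False, False, False, False, False)]
multi_turn = [Trial(6, u) for u in (True, True, True, True, True, False, False, False)]

single_asr = attack_success_rate(single_turn)
multi_asr = attack_success_rate(multi_turn)
print(f"single-turn: {single_asr:.1f}%  multi-turn: {multi_asr:.1f}%")

# Ratio implied by the reported headline figures (64% multi-turn vs 13% single-turn).
reported_ratio = 64 / 13
print(f"reported ratio: {reported_ratio:.1f}x")
```

By this arithmetic, the reported figures imply that multi-turn attacks succeeded roughly five times as often as single-question attempts.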
Long Dialogues Let Attackers Slip Through Guardrails
The report warned that repeated questioning allows attackers to refine prompts and skirt restrictions undetected. AI chatbots often lose their ability to enforce safety policies over time, which could help hackers extract confidential data or spread misinformation.
Cisco explained that open-weight language models, like those developed by Meta, Google, Mistral, OpenAI, and Microsoft, give the public access to the specific safety parameters the models were trained on. These models carry fewer built-in restrictions, shifting responsibility for protection to the users who customize or deploy them.
Researchers said this flexibility encourages innovation but also increases risk, as malicious actors can adapt open-weight models for harmful use. Cisco urged AI companies to strengthen systems that maintain safety consistency throughout long conversations.
Tech Firms Face Renewed Pressure Over AI Misuse
Google, OpenAI, Meta, and Microsoft claim they have improved safeguards against harmful fine-tuning, yet experts say current defenses remain weak. The report renewed scrutiny over how easily people can repurpose AI tools for criminal activities.
Anthropic recently admitted that cybercriminals exploited its Claude model to steal and ransom personal data, demanding payments exceeding $500,000. Cisco’s findings suggest that without stricter regulation and improved self-monitoring, AI systems could increasingly serve as tools for fraud, hacking, and manipulation.
The company concluded that protecting AI from “forgetting” its safety measures must become an industry priority before long-term use turns these systems into security liabilities.
