Generative AI’s Biggest Security Flaw Is Not Easy to Fix

It's easy to trick the large language models powering chatbots like OpenAI's ChatGPT and Google's Bard. In one experiment in February, security researchers forced Microsoft’s Bing chatbot to behave like a scammer. Hidden instructions on a web page the researchers created told the chatbot to ask the person using it to hand over their bank account details. This kind of attack, where concealed information can make the AI system behave in unintended ways, is just the beginning.

Hundreds of examples of “indirect prompt injection” attacks have been created since then. This type of attack is now considered one of the most concerning ways that language models could be abused by hackers. As generative AI systems are put to work by big corporations and smaller startups, the cybersecurity industry is scrambling to raise awareness of the potential dangers. In doing so, they hope to keep data—both personal and corporate—safe from attack. Right now there isn’t one magic fix, but common security practices can reduce the risks.

“Indirect prompt injection is definitely a concern for us,” says Vijay Bolina, the chief information security officer at Google’s DeepMind artificial intelligence unit, who says Google has multiple projects ongoing to understand how AI can be attacked. In the past, Bolina says, prompt injection was considered “problematic,” but things have accelerated since people started connecting large language models (LLMs) to the internet and plug-ins, which can add new data to the systems. As more companies use LLMs, potentially feeding them more personal and corporate data, things are going to get messy. “We definitely think this is a risk, and it actually limits the potential uses of LLMs for us as an industry,” Bolina says.

Prompt injection attacks fall into two categories—direct and indirect. And it’s the latter that’s causing most concern amongst security experts. When using a LLM, people ask questions or provide instructions in prompts that the system then answers. Direct prompt injections happen when someone tries to make the LLM answer in an unintended way—getting it to spout hate speech or harmful answers, for instance. Indirect prompt injections, the really concerning ones, take things up a notch. Instead of the user entering a malicious prompt, the instruction comes from a third party. A website the LLM can read, or a PDF that's being analyzed, could, for example, contain hidden instructions for the AI system to follow.

“The fundamental risk underlying all of these, for both direct and indirect prompt instructions, is that whoever provides input to the LLM has a high degree of influence over the output,” says Rich Harang, a principal security architect focusing on AI systems at Nvidia, the world’s largest maker of AI chips. Put simply: If someone can put data into the LLM, then they can potentially manipulate what it spits back out.

Security researchers have demonstrated how indirect prompt injections could be used to steal data, manipulate someone’s résumé, and run code remotely on a machine. One group of security researchers ranks prompt injections as the top vulnerability for those deploying and managing LLMs. And the National Cybersecurity Center, a branch of GCHQ, the UK’s intelligence agency, has even called attention to the risk of prompt injection attacks, saying there have been hundreds of examples so far. “Whilst research is ongoing into prompt injection, it may simply be an inherent issue with LLM technology,” the branch of GCHQ warned in a blog post. “There are some strategies that can make prompt injection more difficult, but as yet there are no surefire mitigations.”

Most PopularGearThe Top New Features Coming to Apple’s iOS 18 and iPadOS 18By Julian ChokkattuCultureConfessions of a Hinge Power UserBy Jason ParhamGearHow Do You Solve a Problem Like Polestar?By Carlton ReidSecurityWhat You Need to Know About Grok AI and Your PrivacyBy Kate O'Flaherty

OpenAI spokesperson Niko Felix says prompt injections are an area of active research, while OpenAI has previously name-checked “jailbreaks,” another term used for some prompt injections. Caitlin Roulston, director of communications at Microsoft, says the company has “large teams” working on the security issues. “As part of this ongoing effort, we take action to block suspicious websites, and we continuously improve our systems to help identify and filter these types of prompts before they get to the model,” Roulston says.

AI systems might be creating new problems, but they could help solve them too. Google’s Bolina says the company uses “specially trained models” to “help identify known malicious inputs and known unsafe outputs that violate our policies.” Nvidia has released an open source series of guardrails for adding restrictions to models. But these approaches can only go so far; it isn’t possible to know all the kinds of ways malicious prompts may be used. Both Bolina and Nvidia’s Harang say that developers and companies wanting to deploy LLMs into their systems should use a series of security industry best practices to reduce the risks of indirect prompt injections. “You have to really think about the way that you're going to be integrating and implementing these models into additional applications and services,” Bolina says.

“The second you are taking input from third parties like the internet, you cannot trust the LLM any more than you would trust a random internet user,” Harang says. “The core issue is that you always have to put the LLM outside of any trust boundary, if you want to really focus on security.” Within cybersecurity, trust boundaries can establish how much particular services can be relied upon and the levels of access they can get to types of information. Siloing a system reduces risk. Since introducing plug-ins for ChatGPT earlier this year, OpenAI has added user authentication, meaning people have to approve when plug-ins want to take some actions. Harang says companies should understand who wrote plug-ins and how they were designed before they integrate them.

Google’s Bolina adds that when connecting systems to LLMs, people should also follow the cybersecurity principle of least privileges, giving the system the minimum access to data it needs and the lowest ability to make changes required. “If I'm asking an LLM to read my email, should the service layer that provides that interaction grant that service [the ability] to write email? Probably not,” he says. Ultimately, Harang adds, it’s a new version of an old security problem. “The attack surface is new. But the principles and the problems we're dealing with are the same ones we've been dealing with for 30-plus years.”

About Matt Burgess

Check Also

Iranian Hackers Tried to Give Hacked Trump Campaign Emails to Dems

The week was dominated by news that thousands of pagers, walkie-talkies and other devices were …

Leave a Reply