Cryptographers Show That AI Protections Will Always Have Holes


SOURCE: QUANTAMAGAZINE.ORG
DEC 11, 2025


Introduction

By Peter Hall

Contributing Writer

December 10, 2025


Ask ChatGPT how to build a bomb, and it will flatly respond that it “can’t help with that.” But users have long played a cat-and-mouse game to try to trick language models into providing forbidden information. These “jailbreaks” have ranged from the mundane — in the early years, one could simply tell a model to ignore its safety instructions — to elaborate multi-prompt roleplay scenarios. In a recent paper, researchers found one of the more delightful ways to bypass artificial intelligence security systems: Rephrase your nefarious prompt as a poem.

But just as quickly as these issues appear, they seem to get patched. That’s because the companies don’t have to fully retrain an AI model to fix a vulnerability. They can simply filter out forbidden prompts before they ever reach the model itself.
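This two-tier setup can be pictured with a toy sketch: a cheap filter that screens prompts before the model ever sees them. All names here are hypothetical, and real deployments use small learned classifiers rather than a keyword blocklist:

```python
# Toy two-tier guard: a cheap filter screens prompts before the model sees them.
# Real filters are learned classifiers; a keyword blocklist is only a stand-in.

BLOCKLIST = {"bomb", "weapon"}


def looks_safe(prompt: str) -> bool:
    """Return True if the prompt passes the (toy) filter."""
    return not any(term in prompt.lower().split() for term in BLOCKLIST)


def query_model(prompt: str) -> str:
    """Stub standing in for the expensive language model behind the filter."""
    return f"[model response to: {prompt!r}]"


def guarded_model(prompt: str) -> str:
    # The filter runs first; only prompts it approves reach the model.
    if not looks_safe(prompt):
        return "I can't help with that."
    return query_model(prompt)
```

The appeal is clear in the sketch: patching a new jailbreak means updating the filter, not retraining the model behind it.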

Recently, cryptographers have intensified their examination of these filters. They’ve shown, in recent papers posted on the arxiv.org preprint server, how the defensive filters put around powerful language models can be subverted by well-studied cryptographic tools. In fact, they’ve shown how the very nature of this two-tier system — a filter that protects a powerful language model inside it — creates gaps in the defenses that can always be exploited.

The new work is part of a trend of using cryptography — a discipline traditionally far removed from the study of the deep neural networks that power modern AI — to better understand the guarantees and limits of AI models like ChatGPT. “We are using a new technology that’s very powerful and can cause much benefit, but also harm,” said Shafi Goldwasser, a professor at the University of California, Berkeley and the Massachusetts Institute of Technology who received a Turing Award for her work in cryptography. “Crypto is, by definition, the field that is in charge of enabling us to trust a powerful technology … and have assurance you are safe.”

Slipping Past Security

Goldwasser was initially interested in using cryptographic tools to tackle the AI issue known as alignment, with the goal of preventing models from generating bad information. But how do you define “bad”? “If you look up [alignment] on Wikipedia, it’s ‘aligning with human values,’” Goldwasser said. “I don’t even know what that means, since human values seem to be a moving target.”

To prevent model misalignment, you generally have to choose between a few options. You can try to retrain the model on a new dataset carefully curated to avoid any dangerous ideas. (Since modern models are trained on pretty much the entire internet, this strategy seems challenging, at best.) You can try to precisely fine-tune the model, a delicate process that is tricky to do well. Or you can add a filter that blocks bad prompts from getting to the model. The last option is much cheaper and easier to deploy, especially when a jailbreak is found after the model is out in the world.

Goldwasser and her colleagues noticed that the exact reason that filters are appealing also limits their security. External filters will often use machine learning to interpret and detect dangerous prompts, but by their nature they have to be smaller and quicker than the model itself. That creates a gap in power between the filter and the language model. And this gap, to a cryptographer, is like a cracked-open window to a cat burglar: a weak spot in the system that invites you to peek inside and see what’s there for the taking.


Shafi Goldwasser and her colleagues showed that any safety system that uses fewer computational resources than the AI model itself will always have vulnerabilities.


A practical illustration of how to exploit this gap came in a paper posted in October. The researchers had been thinking about ways to sneak a malicious prompt past the filter by hiding the prompt in a puzzle. In theory, if they came up with a puzzle that the large language model could decode but the filter could not, then the filter would pass the hidden prompt straight through to the model.

They eventually arrived at a simple puzzle called a substitution cipher, which replaces each letter in a message with another according to a certain code. (As a simple example, if you replace each letter in “bomb” with the next letter in the alphabet, you’ll get “cpnc.”) They then instructed the model to decode the prompt (think “Switch each letter with the one before it”) and then respond to the decoded message.
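As a minimal sketch, the shift-by-one cipher from that example can be written in a few lines of Python (this is illustrative, not the researchers’ actual encoding):

```python
import string


def shift(text: str, offset: int) -> str:
    """Substitution cipher that shifts each lowercase letter by `offset`."""
    alphabet = string.ascii_lowercase
    # Map each letter to the one `offset` positions later (wrapping around).
    table = str.maketrans(alphabet, alphabet[offset:] + alphabet[:offset])
    return text.translate(table)


# Encoding replaces each letter with the next one; decoding reverses it.
print(shift("bomb", 1))   # -> cpnc
print(shift("cpnc", -1))  # -> bomb
```

A model capable of following the decoding instruction recovers the original prompt, while a weaker filter sees only the scrambled text.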
