Cryptographers Show That AI Protections Will Always Have Holes
SOURCE: QUANTAMAGAZINE.ORG
DEC 11, 2025
Ask ChatGPT how to build a bomb, and it will flatly respond that it “can’t help with that.” But users have long played a cat-and-mouse game to try to trick language models into providing forbidden information. These “jailbreaks” have run from the mundane — in the early years, one could simply tell a model to ignore its safety instructions — to elaborate multi-prompt roleplay scenarios. In a recent paper, researchers found one of the more delightful ways to bypass artificial intelligence security systems: Rephrase your nefarious prompt as a poem.
But just as quickly as these issues appear, they seem to get patched. That’s because the companies don’t have to fully retrain an AI model to fix a vulnerability. They can simply filter out forbidden prompts before they ever reach the model itself.
Recently, cryptographers have intensified their examinations of these filters. They’ve shown, in recent papers posted on the arxiv.org preprint server, how the defensive filters put around powerful language models can be subverted by well-studied cryptographic tools. In fact, they’ve shown how the very nature of this two-tier system — a filter that protects a powerful language model inside it — creates gaps in the defenses that can always be exploited.
The new work is part of a trend of using cryptography — a discipline traditionally far removed from the study of the deep neural networks that power modern AI — to better understand the guarantees and limits of AI models like ChatGPT. “We are using a new technology that’s very powerful and can cause much benefit, but also harm,” said Shafi Goldwasser, a professor at the University of California, Berkeley and the Massachusetts Institute of Technology who received a Turing Award for her work in cryptography. “Crypto is, by definition, the field that is in charge of enabling us to trust a powerful technology … and have assurance you are safe.”
Goldwasser was initially interested in using cryptographic tools to tackle the AI issue known as alignment, with the goal of preventing models from generating bad information. But how do you define “bad”? “If you look up [alignment] on Wikipedia, it’s ‘aligning with human values,’” Goldwasser said. “I don’t even know what that means, since human values seem to be a moving target.”
To prevent model misalignment, you generally have to choose between a few options. You can try to retrain the model on a new dataset carefully curated to avoid any dangerous ideas. (Since modern models are trained on pretty much the entire internet, this strategy seems challenging, at best.) You can try to precisely fine-tune the model, a delicate process that is tricky to do well. Or you can add a filter that blocks bad prompts from getting to the model. The last option is much cheaper and easier to deploy, especially when a jailbreak is found after the model is out in the world.
Goldwasser and her colleagues noticed that the exact reason that filters are appealing also limits their security. External filters will often use machine learning to interpret and detect dangerous prompts, but by their nature they have to be smaller and quicker than the model itself. That creates a gap in power between the filter and the language model. And this gap, to a cryptographer, is like a cracked-open window to a cat burglar: a weak spot in the system that invites you to peek inside and see what lies for the taking.
Shafi Goldwasser and her colleagues showed that any safety system that uses fewer computational resources than the AI model itself will always have vulnerabilities.
A practical illustration of how to exploit this gap came in a paper posted in October. The researchers had been thinking about ways to sneak a malicious prompt past the filter by hiding the prompt in a puzzle. In theory, if they came up with a puzzle that the large language model could decode but the filter could not, then the filter would pass the hidden prompt straight through to the model.
They eventually arrived at a simple puzzle called a substitution cipher, which replaces each letter in a message with another according to a certain code. (As a simple example, if you replace each letter in “bomb” with the next letter in the alphabet, you’ll get “cpnc.”) They then instructed the model to decode the prompt (think “Switch each letter with the one before it”) and then respond to the decoded message.
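The shift cipher described above is simple enough to sketch in a few lines of code. This is an illustration of the general technique, not code from the paper; the function name and interface are my own:

```python
def shift(text: str, offset: int) -> str:
    """Replace each letter with the one `offset` positions later in the
    alphabet, wrapping around at the end; non-letters pass through."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + offset) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# Encode: replace each letter with the next one in the alphabet.
encoded = shift("bomb", 1)
print(encoded)  # cpnc

# Decode: switch each letter with the one before it.
print(shift(encoded, -1))  # bomb
```

A filter that only pattern-matches on the surface text sees the harmless string “cpnc,” while a model capable of following the decoding instruction recovers the original word — which is exactly the gap in power the researchers exploited.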