THE CORONAVIRUS PANDEMIC has prompted countless acts of individual heroism and some astounding collective feats of science. Pharmaceutical companies used new technology to develop highly effective vaccines in record time. A new type of clinical trial has remade our understanding of what works, and doesn’t work, against Covid-19. But when the UK’s Alan Turing Institute looked for evidence of how artificial intelligence had helped with the crisis, it didn’t find much to celebrate.
The institute's report, published last year, said that AI had made little impact on the pandemic and that experts faced widespread problems accessing the health data needed to use the technology without bias. It followed two surveys that reviewed hundreds of studies and found that nearly all AI tools for detecting Covid-19 symptoms were flawed. “We wanted to highlight the shining stars that show how this very exciting technology has delivered,” says Bilal Mateen, a physician and researcher who was an editor of the Turing report. “Unfortunately we couldn’t find those shining stars; we found a lot of problems.”
It’s understandable that a relatively new tool in health care, like AI, couldn’t save the day in a pandemic, but Mateen and other researchers say the failings of Covid-19 AI projects reflect a broader pattern. Despite great hopes, it’s proving difficult to improve health care by marrying data with algorithms.
Many studies using samples of past medical data have reported that algorithms can be highly accurate at specific tasks, such as finding skin cancers or predicting patient outcomes. Some are now incorporated into approved products that doctors use to watch for signs of stroke or eye disease.
But many more ideas for AI health care have not progressed beyond initial proofs of concept. Researchers warn that, for now, many studies don’t use data of adequate quantity or quality to properly test AI applications. That raises the risk of real harms from untrustworthy technology let loose in health systems. Some health care algorithms in use have proved unreliable, or biased against certain demographic groups.
That data-crunching might improve health care is not a new notion. One of the founding moments of epidemiology came in 1855, when London physician John Snow marked cholera cases on a map to show that it was a water-borne disease. More recently, doctors, researchers, and technologists have become excited about tapping machine learning techniques honed in tech industry projects like sorting photos or transcribing speech.
Yet conditions in tech are very different from those inside research hospitals. Companies such as Facebook can access billions of photos posted by users to improve image-recognition algorithms. Accessing health data is harder because of privacy concerns and creaky IT systems. And deploying an algorithm that will shape someone’s medical care carries higher stakes than filtering spam or targeting ads.
“We can’t take paradigms for developing AI tools that have worked in the consumer space and just port them over to the clinical space,” says Visar Berisha, an associate professor at Arizona State University. He recently published a journal article with colleagues from engineering and health departments at Arizona State warning that many health AI studies make algorithms appear more accurate than they really are because they use powerful algorithms on data sets that are too small.
That’s because health data such as medical imaging, vital signs, and data from wearable devices can vary for reasons unrelated to a particular health condition, such as lifestyle or background noise. The machine learning algorithms popularized by the tech industry are so good at finding patterns that they can discover shortcuts to “correct” answers that won’t work out in the real world. Smaller data sets make it easier for algorithms to cheat that way and create blind spots that cause poor results in the clinic. “The community fools [itself] into thinking we’re developing models that work much better than they actually do,” Berisha says. “It furthers the AI hype.”
Berisha says that problem has led to a striking and concerning pattern in some areas of AI health care research. In studies using algorithms to detect signs of Alzheimer’s or cognitive impairment in recordings of speech, Berisha and his colleagues found that larger studies reported worse accuracy than smaller ones—the opposite of what big data is supposed to deliver. A review of studies attempting to identify brain disorders from medical scans and another for studies trying to detect autism with machine learning reported a similar pattern.
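A toy simulation gives a feel for how a small data set can flatter an algorithm. The sketch below is not drawn from Berisha's paper or any of the reviews; it is an illustration under simple assumptions. Every "measurement" is random noise with no relationship to the diagnosis, and the "algorithm" just picks whichever of 200 noise features happens to line up best with the labels. The function name, the study sizes, and the feature count are all hypothetical.

```python
import random


def best_feature_accuracy(n_patients, n_features, rng):
    """Apparent accuracy of the best single noise feature on one study.

    Labels and features are independent coin flips, so no feature carries
    real signal. Any accuracy above 50 percent is a chance "shortcut".
    """
    labels = [rng.randint(0, 1) for _ in range(n_patients)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_patients)]
        matches = sum(f == y for f, y in zip(feature, labels))
        accuracy = matches / n_patients
        # A feature that is usually wrong can be flipped, so count both directions.
        best = max(best, accuracy, 1 - accuracy)
    return best


rng = random.Random(0)  # fixed seed so the run is repeatable
small_study = best_feature_accuracy(n_patients=20, n_features=200, rng=rng)
large_study = best_feature_accuracy(n_patients=2000, n_features=200, rng=rng)

print(f"Best noise feature, 20 patients:   {small_study:.0%}")
print(f"Best noise feature, 2000 patients: {large_study:.0%}")
```

With only 20 patients, some noise feature will usually match the labels well above chance, while with 2,000 patients the best one should sit close to 50 percent. Real studies are far more complicated, but the direction of the effect is the same one the reviews describe: the smaller the sample, the easier it is for a flexible method to find a pattern that is not really there.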
“A researcher can do and say whatever they want in health data because no one can ever check their results.”
ZIAD OBERMEYER, ASSOCIATE PROFESSOR, UC BERKELEY
The dangers of algorithms that work well in preliminary studies but behave differently on real patient data are not hypothetical. A 2019 study found that a system used on millions of patients to prioritize access to extra care for people with complex health problems put white patients ahead of Black patients.