This is a KFF Health News story.
Preparing cancer patients for difficult decisions is an oncologist’s job. Oncologists don’t always remember to do it, however. At the University of Pennsylvania Health System, an artificial intelligence algorithm that predicts a patient’s chances of death prompts doctors to talk about treatment and end-of-life preferences.
But it’s far from a set-it-and-forget-it tool. A routine technical check found that the algorithm degraded during the COVID-19 pandemic, becoming 7 percentage points worse at predicting who would die, according to a 2022 study.
There were likely real-world impacts. Ravi Parikh, an Emory University oncologist and lead author of the study, told KFF Health News that the tool failed hundreds of times to prompt doctors to begin this important discussion (one that might head off unnecessary chemotherapy) with patients who needed it.
He believes that several algorithms designed to improve medical care weakened during the pandemic, not just the one at Penn Medicine. “Many institutions don’t systematically monitor the performance” of their products, Parikh said.
Algorithm glitches are one facet of a dilemma that computer scientists and doctors have long acknowledged but that is only beginning to draw the attention of hospital executives and researchers: Artificial intelligence systems require consistent monitoring and staffing, both to put them in place and to keep them working properly.
Bottom line: you need people, and more machines, to make sure the new tools don’t mess up.
“Everyone thinks AI will help us improve our access and capabilities, improve care, etc.,” said Nigam Shah, chief data scientist at Stanford Health Care. “This is all well and good, but if it increases the cost of care by 20%, is it sustainable?”
Government officials worry that hospitals don’t have the resources to put these technologies to the test. “I have looked far and wide,” FDA Commissioner Robert Califf said during a recent agency panel on AI. “I don’t believe there is a single healthcare system in the United States that can validate an AI algorithm implemented in a clinical care system.”
AI is already widespread in healthcare. Algorithms are used to predict patients’ risk of death or deterioration, to suggest diagnoses or triage patients, to record and summarize visits to save doctors’ labor, and to approve insurance claims.
If the technology evangelists are right, the technology will become ubiquitous and profitable. Investment firm Bessemer Venture Partners has identified around 20 health-focused AI startups on track to generate $10 million in revenue each per year. The FDA has approved nearly a thousand artificially intelligent products.
Evaluating whether these products work is a challenge. Assessing whether they continue to function – or whether they have developed the software equivalent of a blown gasket or leaking motor – is even trickier.
Take, for example, a recent Yale Medicine study evaluating six “early warning systems,” which alert clinicians when patients’ conditions are likely to deteriorate rapidly. A supercomputer ran the data for several days, said Dana Edelson, a doctor at the University of Chicago and co-founder of a company that provided an algorithm for the study. The effort paid off, revealing huge differences in performance among the six products.
It is not easy for hospitals and providers to select the best algorithms for their needs. The average doctor doesn’t have a supercomputer and there is no Consumer Reports on AI.
“We don’t have any standards,” said Jesse Ehrenfeld, immediate past president of the American Medical Association. “I can’t tell you anything today that is a standard for how you evaluate, monitor, review the performance of an algorithm model, AI-enabled or not, when it’s deployed.”
Perhaps the most common AI product in doctors’ offices is called ambient documentation, a technological assistant that listens to and summarizes patient visits. Last year, investors at Rock Health tracked $353 million flowing into these documentation companies. But Ehrenfeld said: “There is currently no standard for comparing the results of these tools.”
And that’s a problem, because even small mistakes can be devastating. A team from Stanford University tried using large language models — the technology behind popular AI tools like ChatGPT — to summarize patients’ medical histories. They compared the results with what a doctor would write.
“Even in the best case, the models had a 35 percent error rate,” said Stanford’s Shah. In medicine, “when you’re writing a summary and you forget a word, like ‘fever,’ I mean, that’s a problem, right?”
Sometimes the reasons algorithms fail are quite logical. For example, changes to the underlying data, such as when a hospital switches laboratory providers, can erode their effectiveness.
Sometimes, however, the pitfalls reveal themselves for no apparent reason.
Sandy Aronson, technical manager of the personalized medicine program at Mass General Brigham in Boston, said that when her team tested an app intended to help genetic counselors locate relevant literature on DNA variants, the product suffered from “non-determinism” – that is, when asked the same question several times over a short period, it gave different results.
Aronson is excited about the potential of large language models to summarize knowledge for overworked genetic counselors, but “the technology needs to improve.”
If measurements and standards are rare and errors can occur for strange reasons, what should institutions do? Invest a lot of resources. At Stanford, Shah said, it took eight to 10 months and 115 hours of work to verify the fairness and reliability of two models.
Experts interviewed by KFF Health News floated the idea of artificial intelligence monitoring artificial intelligence, with a (human) data expert monitoring both. All recognized that this would require organizations to spend even more money – a difficult ask given the realities of hospital budgets and the limited number of AI technology specialists.
“It’s great to have a vision where we melt icebergs so we have a model that monitors their model,” Shah said. “But is this really what I wanted? How many more people are we going to need?”