CMIMI: LLMs can monitor AI software after deployment

BOSTON -- Large language models (LLMs) can monitor and validate commercial AI algorithms after deployment, according to a Tuesday presentation at the Conference on Machine Intelligence in Medical Imaging (CMIMI).

The results show the potential for an automated method for postdeployment monitoring of AI models, said Theo Dapamede, MD, PhD, of Emory University.

He presented the research at CMIMI 2024, held this week by the Society for Imaging Informatics in Medicine (SIIM).


With AI models, performance can drift over time. However, it’s a challenge to evaluate ongoing performance over a large number of cases, according to Dapamede.

In 2023, Emory deployed commercial AI triage algorithms for CT pulmonary embolism (PE) and intracranial hemorrhage (ICH). To assess the postdeployment performance of the PE and ICH algorithms with an LLM, the researchers first identified 8,966 CT PE exams and 14,637 noncontrast head CT studies performed between April and October 2023.

They then used a previously validated, locally deployed instance of the Llama3 8B LLM to extract ground-truth PE and ICH labels from the radiology reports. The algorithms' postdeployment performance against these LLM-derived labels was then compared with the performance data published in clearance documents filed with the U.S. Food and Drug Administration (FDA).
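
To illustrate the label-extraction step, here is a minimal sketch assuming a locally hosted Llama-3-8B-Instruct checkpoint served via Hugging Face transformers. The prompt wording, YES/NO parsing, and model ID are illustrative assumptions, not the Emory team's actual pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed local checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def extract_label(report_text: str, finding: str) -> bool:
    """Ask the LLM whether a radiology report is positive for a finding."""
    messages = [
        {"role": "system",
         "content": "You label radiology reports. Answer only YES or NO."},
        {"role": "user",
         "content": f"Is this report positive for {finding}?\n\n{report_text}"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=5, do_sample=False)
    # Decode only the newly generated tokens and map them to a boolean label.
    answer = tokenizer.decode(output[0, inputs.shape[-1]:],
                              skip_special_tokens=True)
    return answer.strip().upper().startswith("YES")
```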

Overall, the algorithms achieved an aggregate 93% sensitivity and 92.3% specificity on the Emory imaging studies.

Postdeployment performance of AI algorithms

              PE model          PE model            ICH model         ICH model
              (FDA clearance)   (after deployment)  (FDA clearance)   (after deployment)
Sensitivity   93%               80.3%               93.6%             92.2%
Specificity   93.7%             98%                 92.3%             90.3%
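
The comparison itself reduces to computing sensitivity and specificity against the LLM-extracted report labels and setting them beside the FDA clearance figures. A hedged sketch of that step follows; the variable names, toy data, and 5-point drift tolerance are assumptions for illustration.

```python
def sensitivity_specificity(ai_flags, report_labels):
    """Compare AI triage flags against LLM-derived ground-truth labels."""
    pairs = list(zip(ai_flags, report_labels))
    tp = sum(a and r for a, r in pairs)
    fn = sum((not a) and r for a, r in pairs)
    tn = sum((not a) and (not r) for a, r in pairs)
    fp = sum(a and (not r) for a, r in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example data: True = positive for PE.
pe_ai_flags      = [True, True, False, False, True]
pe_report_labels = [True, False, False, True, True]

sens, spec = sensitivity_specificity(pe_ai_flags, pe_report_labels)
# Flag a drift alert if deployed sensitivity falls well below the
# FDA clearance figure (93% for the PE model in this study).
if sens < 0.93 - 0.05:  # 5-point tolerance is an illustrative choice
    print(f"PE model sensitivity drifted: {sens:.1%} vs 93% at clearance")
```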

Delving further into the results, the researchers found that the algorithms demonstrated equitable performance across patient race, ethnicity, age, and sex subgroups. However, both the PE (77% sensitivity) and ICH (87.4% sensitivity) models performed worse on outpatient exams than on emergency and inpatient studies.

Yet outpatient studies are where AI models could yield the most benefit, so more research is needed to understand these findings, Dapamede said. Reader studies are also needed to identify model failure modes and potential confounders, according to the researchers.
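
The subgroup breakdown described above amounts to stratifying exams by an attribute such as care setting (or race, age, or sex) and recomputing sensitivity per stratum. A short sketch follows; the column names and pandas usage are illustrative assumptions.

```python
import pandas as pd

# Toy exam-level data: one row per study, with the AI triage flag and the
# LLM-derived report label.
exams = pd.DataFrame({
    "setting":      ["ED", "inpatient", "outpatient", "outpatient", "ED"],
    "ai_flag":      [True, False, False, True, True],
    "report_label": [True, False, True, True, True],
})

# Sensitivity per setting: among label-positive exams, the fraction the
# AI flagged. Lower values in one stratum suggest setting-specific drift.
positives = exams[exams["report_label"]]
print(positives.groupby("setting")["ai_flag"].mean())
```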

For more coverage from CMIMI 2024, please visit our special RADCast section.
