Radiology artificial intelligence (AI) algorithms may perform well in research studies but often fail to deliver comparable results on image data from other sources, according to research published online May 4 in Radiology: Artificial Intelligence.
A team of researchers from Johns Hopkins University School of Medicine systematically evaluated 83 peer-reviewed studies of externally validated deep-learning algorithms for image-based radiologic diagnosis. Of the 86 algorithms described in the studies, over 80% showed a decrease in performance on external datasets, and 24% experienced a substantial decline in performance.
"Our findings stress the importance of including an external dataset to evaluate the generalizability of [deep-learning] algorithms, which would improve the quality of future [deep-learning] studies," wrote Drs. Alice Yu, Bahram Mohajer, and John Eng.
In the study, the researchers sought to better estimate the algorithms' generalizability, i.e., how well they perform on data from other institutions versus the data they were trained on. After searching the PubMed database for English-language studies, the three researchers independently assessed study titles and abstracts to identify relevant articles for inclusion in their analysis.
They focused on studies that described algorithms performing diagnostic classification tasks. Articles that involved nonimaging clinical features or used methods other than deep learning were excluded. Ultimately, 83 peer-reviewed studies covering 86 algorithms were included in the final analysis.
Of these, 41 (48%) involved the chest, 14 (16%) the brain, 10 (12%) bone, seven (8%) the abdomen, and five (6%) the breast. The remaining nine algorithms involved other body parts.
On a modality basis, nearly 75% utilized either radiography or CT. The authors noted that only three studies implemented prospective data collection for either the development or the external validation dataset. Furthermore, dataset size and disease prevalence varied widely, and the external datasets were significantly smaller than the development datasets (p < 0.001).
The researchers then compared the performance of the algorithms on the internal and external datasets by calculating the difference in the area under the curve (AUC). Of the 86 algorithms, 70 (81%) had at least some decrease in performance on the external test sets.
Change in AI algorithm performance when used on external validation dataset

| Change in performance | Percentage of algorithms |
|---|---|
| Substantial increase (≥ 0.10 in AUC) | 1.1% |
| Modest increase (≥ 0.05 in AUC) | 3.5% |
| Little change | 46.5% |
| Modest decrease (≥ 0.05 in AUC) | 24.4% |
| Substantial decrease (≥ 0.10 in AUC) | 24.4% |
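For readers who want to run this kind of comparison on their own models, the sketch below is a minimal illustration (not taken from the study) of computing the AUC difference between an internal and an external test set and binning it into the table's categories. It assumes scikit-learn and hypothetical label/score arrays; the exact boundary handling in the original study may differ.

```python
# Minimal sketch: compare internal vs. external AUC and categorize the change,
# using the thresholds reported in the article (0.05 and 0.10 in AUC).
# Assumes hypothetical arrays of ground-truth labels and model scores.
from sklearn.metrics import roc_auc_score

def auc_change_category(y_internal, scores_internal, y_external, scores_external):
    """Return internal AUC, external AUC, and a category for the change."""
    auc_internal = roc_auc_score(y_internal, scores_internal)
    auc_external = roc_auc_score(y_external, scores_external)
    delta = auc_external - auc_internal  # negative = worse on external data

    if delta <= -0.10:
        category = "substantial decrease"
    elif delta <= -0.05:
        category = "modest decrease"
    elif delta < 0.05:
        category = "little change"
    elif delta < 0.10:
        category = "modest increase"
    else:
        category = "substantial increase"
    return auc_internal, auc_external, category

# Example usage (hypothetical data):
# auc_int, auc_ext, cat = auc_change_category(y_int, s_int, y_ext, s_ext)
```

In the study, this kind of per-algorithm comparison was aggregated across all 86 algorithms to produce the percentages in the table above.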
The researchers noted that it's largely unknown why deep-learning algorithms experience diminished performance on external datasets.
"Questions remain about what features are actually important for correct diagnosis by machine learning algorithms, how these features may be biased in datasets, and how external validation is affected," the authors wrote. "A better understanding of these questions will be necessary before diagnostic machine learning systems achieve routine clinical radiology practice."