Radiologists in Europe ramp up testing of AI

Jan 11, 2024

Four out of seven commercially available AI algorithms for detecting lung nodules on x-rays performed better than human readers, while two algorithms for predicting bone age performed no differently from human readers, in a study published January 9 in Radiology.

The study validates the methodology of an initiative called Project AIR, which the researchers developed to standardize testing of AI radiology products cleared for use in Europe, noted lead author Kicky van Leeuwen, a doctoral candidate at Radboud University in Nijmegen, the Netherlands, and colleagues.

“Clinical centers rarely have the necessary resources and personnel to evaluate and compare multiple products prior to purchase,” the group wrote.

Project AIR is an ongoing cohort study aimed at filling this gap, the authors wrote. Seventeen vendors with cleared products on the market between June and November 2022 for detecting lung nodules or predicting bone age on x-rays were invited to join the project. In total, nine products from eight vendors were assessed in the study.

The seven algorithms assessed for detecting lung nodules were Annalise Enterprise CXR (Annalise.ai), InferRead DR Chest (Infervision), Insight CXR (Lunit), Milvue Suite-SmartUrgences (Milvue), ChestEye (Oxpit), AI-Rad Companion Chest X-ray (Siemens Healthineers), and Med-Chest X-ray (Vuno).

These algorithms were all tested on the same validated data set of 386 chest x-rays, which were acquired between January 2012 and May 2022 from institutions in the Netherlands. Their performance was compared to reads by 17 radiologists and radiology residents with varying experience.

Example chest x-rays from a public test set illustrate algorithm and reader similarities to and discrepancies from the reference standard. A specialized radiologist determined the reference standard score (0, no nodule present; 100, one or more nodules present), and algorithms and human readers provided a probability score between 0 and 100 for each patient of the likelihood that the patient was a nodule case.
(A) Radiograph in a man (age, 72 years) with a nodule present (reference standard score, 100) shows a true-positive result based on the average algorithm scores.
(B) Radiograph in a man (age, 68 years) without a nodule present (reference standard score, 0) shows a true-negative result based on the average algorithm scores.
(C) Radiograph in a woman (age, 64 years) without a nodule present (reference standard score, 0) shows a false-positive result based on the average algorithm scores.
(D) Radiograph in a man (age, 37 years) with a nodule present (reference standard score, 100) shows a false-negative result based on the average algorithm scores.
Corresponding lateral images and CT scans (when available) for these patients are presented in Figure S2 (supplementary data). The images shown in this figure were part of a public subset and not part of the set on which metrics are reported, which remains confidential for reevaluation in the future. Algorithm scores provided for the images are raw, uncalibrated scores and cannot be directly compared to each other; they are provided for indicative purposes only. Image and caption courtesy of Radiology.

The two algorithms tested for predicting bone age on x-rays were BoneXpert (Visiana) and Med-BoneAge (Vuno). Both were tested on a single validated set of 326 conventional x-rays of the left hand of children (age range, 0-18 years), and their performance was compared with that of three expert pediatric or musculoskeletal radiologists (with 26, 23, and 11 years of experience).

Key findings were as follows:

Four of the seven lung nodule algorithms (Annalise Enterprise CXR, Insight CXR, Milvue Suite-SmartUrgences, and ChestEye) performed better (area under the receiver operating characteristic curve [AUC] range, 0.86-0.93) than human readers (mean AUC, 0.81; p-value range, <0.001 to 0.04).
The two algorithms for predicting bone age showed no observable difference in root mean square error (0.63 and 0.57 years) for estimating bone age compared with human readers (0.68 years).
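
For readers unfamiliar with the two metrics cited above, the following is a minimal Python sketch of how an AUC and a root mean square error can be computed. It is not the study's code: the function names and the sample values are hypothetical and chosen only for illustration. The AUC here uses the pairwise (Mann-Whitney) formulation, with the 0/100 reference standard score described in the figure caption converted to a binary label.

```python
import math


def auc(reference_scores, probability_scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney).

    reference_scores: 100 for a nodule case, 0 for a non-nodule case.
    probability_scores: reader or algorithm scores between 0 and 100.
    """
    positives = [p for r, p in zip(reference_scores, probability_scores) if r == 100]
    negatives = [p for r, p in zip(reference_scores, probability_scores) if r == 0]
    if not positives or not negatives:
        raise ValueError("need at least one nodule and one non-nodule case")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0      # nodule case correctly ranked higher
            elif pos == neg:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


def rmse(predicted_ages, reference_ages):
    """Root mean square error between predicted and reference bone ages (years)."""
    if len(predicted_ages) != len(reference_ages) or not predicted_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted_ages, reference_ages)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical values for illustration only; not data from the study.
example_reference = [100, 100, 100, 0, 0, 0]
example_scores = [90, 70, 40, 60, 20, 10]
example_auc = auc(example_reference, example_scores)

example_predicted_ages = [10.5, 7.0, 13.2]
example_reference_ages = [10.0, 7.5, 13.0]
example_rmse = rmse(example_predicted_ages, example_reference_ages)

print(f"AUC: {example_auc:.2f}")
print(f"RMSE: {example_rmse:.2f} years")
```

An AUC of 1.0 means every nodule case was scored higher than every non-nodule case, while 0.5 corresponds to chance. A lower RMSE means bone age estimates were closer to the reference.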

“We have shown the feasibility of the Project AIR methodology for external validation of commercial artificial intelligence (AI) products in medical imaging,” the group wrote.

The authors noted that transparency around the performance data of commercial AI products is often unsatisfactory, with a recent review finding that no scientific evidence on performance measures was available for two-thirds of CE-marked AI products.

Thus, initiatives like Project AIR may serve to increase the transparency of the AI market, they wrote.

“It is conceivable that in the future, radiology departments will require vendors to participate in transparent and comparative evaluations as a prerequisite for purchasing AI products,” the authors concluded.

The full study is available in Radiology.