Nuclear medicine imaging lacks a rigorous strategy for evaluating artificial intelligence (AI) algorithms developed for clinical use, say an international group of experts. To address this gap, the group recently offered a set of best practices that it says could help translate research into clinical practice.
"Insufficient evaluation of AI algorithms may have multiple adverse consequences, including reducing credibility of research findings, misdirection of future research, and, most importantly, yielding tools that are useless or even harmful to patients," wrote corresponding author Abhinav Jha, PhD, of Washington University in St. Louis, MO, and colleagues.
The group was led by the AI task force of the Society for Nuclear Medicine and Molecular Imaging (SNMMI), and included representatives from U.S. industry and regulatory agencies, as well as European experts. The report was published May 26 in the Journal of Nuclear Medicine.
AI-based algorithms for PET and SPECT are showing tremendous promise in image acquisition, reconstruction, postprocessing, segmentation, diagnostics, and prognostics. Translating this promise into clinical reality, however, requires rigorous evaluation of the algorithms, according to the group.
"The focus of this report is purely on testing/evaluation of an already developed AI algorithm," the authors wrote.
The group highlighted several pitfalls in published studies that can adversely impact clinical utility: AI-based reconstruction may introduce spurious lesions, AI-based denoising may remove lesions, and AI-based lesion segmentation may incorrectly identify healthy tissue as malignancies.
In another case, an AI-based denoising method for cardiac SPECT, evaluated with realistic simulations, appeared to perform well according to conventional fidelity-based figures of merit, yet on the clinical task of detecting perfusion defects, the algorithm showed no performance improvement over the noisy images.
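To make that distinction concrete, here is a minimal toy sketch in Python (entirely synthetic, and not the cardiac SPECT study described above) of how a fidelity figure of merit such as root-mean-square error (RMSE) can improve after denoising while performance on a detection task, summarized by the area under the ROC curve (AUC), does not. The simulated images, the simple smoothing "denoiser," and the template observer are all assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

n_cases, n_pix = 200, 64 * 64
defect = np.zeros(n_pix)
defect[:50] = 0.5                          # toy perfusion-defect template

clean = rng.normal(10.0, 1.0, (n_cases, n_pix))
labels = rng.integers(0, 2, n_cases)       # 1 = defect present
clean[labels == 1] -= defect               # defect lowers local uptake

noisy = clean + rng.normal(0.0, 2.0, clean.shape)

# A crude linear "denoiser": smooth each image toward its own mean value.
# Pixel-wise fidelity improves, but signal and noise are scaled together,
# so defect detectability does not.
denoised = 0.5 * noisy + 0.5 * noisy.mean(axis=1, keepdims=True)

def rmse(img, ref):
    return float(np.sqrt(np.mean((img - ref) ** 2)))

def detection_auc(images):
    # Simple known-template observer: lower template response -> defect present.
    scores = images @ defect
    return roc_auc_score(labels, -scores)

print(f"RMSE noisy    {rmse(noisy, clean):.2f}  AUC {detection_auc(noisy):.2f}")
print(f"RMSE denoised {rmse(denoised, clean):.2f}  AUC {detection_auc(denoised):.2f}")
```

In this toy setup, the RMSE drops after "denoising" even though the detection AUC is essentially unchanged, which is the kind of disconnect between fidelity-based and task-based evaluation the report warns about.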
Generalizability of AI algorithms is another major challenge.
"These algorithms may perform well on training data, but not generalize to new data, such as from a different institution, population groups, or scanners," the group wrote.
To help overcome such limitations, the group recommended, first and foremost, that an AI-algorithm evaluation strategy should always produce a claim consisting of the following components (a schematic example follows the list):
- A clear definition of the task
- Patient population(s) for whom the task is defined
- Definition of the imaging process (acquisition, reconstruction, and analysis protocols)
- Process to extract task-specific information
- Figure of merit to quantify task performance, including process to define reference standard
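As a purely illustrative aid, the five components above could be captured in a simple structured record. The field names and example values below are assumptions made for the sake of the sketch, not part of the published guidelines.

```python
# Illustrative only: one way to record the five claim components.
from dataclasses import dataclass

@dataclass
class EvaluationClaim:
    task: str                    # clear definition of the task
    patient_population: str      # population(s) for whom the task is defined
    imaging_process: str         # acquisition, reconstruction, analysis protocols
    information_extraction: str  # process to extract task-specific information
    figure_of_merit: str         # metric, including how the reference standard is defined

# Hypothetical example values, not drawn from the report:
claim = EvaluationClaim(
    task="Detection of cardiac perfusion defects",
    patient_population="Adults referred for myocardial perfusion SPECT",
    imaging_process="Standard-dose SPECT/CT, OSEM reconstruction, AI-based denoising",
    information_extraction="Model observer applied to short-axis slices",
    figure_of_merit="Area under the ROC curve; reference standard from clinical follow-up",
)
```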
In addition, the authors developed a framework that categorizes evaluation strategies into four classes: proof-of-concept, technical, clinical, and postdeployment evaluation. They provided specific details for each class.
For instance, postdeployment evaluations (class 4) should monitor algorithm performance in a dynamic real-world setting after clinical deployment. This could also assess off-label use, such as the algorithm's utility in populations and diseases beyond the original claim, or with improved cameras and reconstructions that were not used during training.
"Additionally, this evaluation assesses clinical utility and value over time," the authors wrote.
The group referred to the recommendations as the RELAINCE guidelines, an acronym for "Recommendations for Evaluation of AI for Nuclear Medicine," and said that the practices may generally apply to a wide class of AI algorithms, including supervised, unsupervised, and semisupervised approaches.
Ultimately, evaluation studies should be multidisciplinary and include computational imaging scientists, physicians, physicists, and statisticians right from the study-conception stage, they wrote.
"We envision that following these best practices for evaluation will assess suitability and provide confidence for clinical translation of these algorithms," the group concluded.