Privacy-preserving large language models (LLMs) can successfully label abnormal organs on CT reports, according to research presented at the recent Conference on Machine Intelligence in Medical Imaging (CMIMI).
A team led by Ricardo Lanfredi, PhD, from the National Institutes of Health (NIH) Clinical Center in Bethesda, MD, found that using these LLMs outperformed alternative labeling methods for CT reports. Lanfredi shared the results at the Society for Imaging Informatics in Medicine (SIIM)-hosted meeting.
“We showed ... that LLMs can do a really good job at labeling reports and extracting the information you need,” Lanfredi told AuntMinnie.com. “I hope this will be helpful for the field.”
Medical report labelers that handle a variety of abnormalities usually target chest x-ray reports. Labeling findings in CT reports is more challenging, since these reports cover a broader range of organs.
Lanfredi said that abnormality labeling for abdominal organs is an underexplored area, adding that successful labeling in this area could help create large-scale annotated CT datasets for detecting abnormalities.
The researchers put their labeling method, called MAPLEZ-CT (Medical report Annotations with Privacy-preserving Large language model using Expeditious Zero-shot answers), to the test. To preserve privacy, the team used the open-weights Meta-Llama-3-70B-Instruct model, which can run on local hardware rather than sending report text to an external service.
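As a rough illustration of the privacy-preserving setup, an open-weights model can be downloaded and run entirely on institutional hardware. The sketch below assumes the Hugging Face transformers library and accepted access to the gated meta-llama/Meta-Llama-3-70B-Instruct weights; the prompt wording is hypothetical, not the study's.

```python
# Minimal sketch: local, privacy-preserving inference with an open-weights LLM.
# Assumes the Hugging Face transformers library and accepted access to the
# gated meta-llama/Meta-Llama-3-70B-Instruct weights; a 70B model typically
# needs multiple GPUs (device_map="auto" shards it across available devices).
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Report text stays on local hardware; nothing is sent to an external API.
# The question below is a hypothetical example, not the study's prompt.
messages = [
    {"role": "system", "content": "You label organ abnormalities in CT reports."},
    {"role": "user", "content": "Report: <CT report text>\nIs the liver abnormal?"},
]
output = generator(messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
```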
The team prompted the model to use chain-of-thought reasoning, in which the model works through a problem as a coherent series of logical steps, mirroring human reasoning.
The researchers also prompted the LLM with an extensive definition of abnormality, covering any unusual finding that radiologists deemed worth mentioning for a specific organ. Such findings include atypical anatomical variations, postsurgical changes, and findings in subparts of organs. The team excluded statements of limited evaluation, normal organs, adjacent structures, and broad anatomical areas.
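The study's exact prompts were not published in this summary, but the description above suggests a template along these lines; the wording below is a hypothetical sketch combining the chain-of-thought instruction with the abnormality definition, not MAPLEZ-CT's actual prompt.

```python
# Hypothetical prompt template (not the study's published wording) combining a
# chain-of-thought instruction with an explicit abnormality definition.
ABNORMALITY_PROMPT = """You are labeling abdominal CT reports.

Definition: an organ is "abnormal" if the radiologist mentions any unusual
finding for that organ, including atypical anatomical variations,
postsurgical changes, and findings in subparts of the organ.
Do NOT count: statements of limited evaluation, explicitly normal organs,
findings in adjacent structures, or broad anatomical areas.

Report sentences about the {organ}:
{sentences}

Think step by step: restate each relevant finding, decide whether it meets
the definition above, then answer with a single word, "abnormal" or "normal".
"""
```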
From each CT report, MAPLEZ-CT extracted sentences and classified them as important or unimportant for the organs of interest. Using chain-of-thought reasoning, it then determined whether an abnormality was present.
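Taken together, this describes a two-stage pipeline: filter report sentences by relevance to each organ, then pose the chain-of-thought abnormality question over the relevant subset. A minimal sketch, reusing the generator and ABNORMALITY_PROMPT from the sketches above; the helper and its answer parsing are hypothetical reconstructions, not the published MAPLEZ-CT code.

```python
def label_organ(report_text: str, organ: str) -> str:
    """Hypothetical two-stage labeling sketch; `generator` and
    ABNORMALITY_PROMPT are defined in the earlier sketches."""
    # Stage 1: ask the LLM which sentences matter for this organ.
    sentences = [s.strip() for s in report_text.split(".") if s.strip()]
    relevant = []
    for sentence in sentences:
        q = (f"Is this CT report sentence important for assessing the {organ}? "
             f'Answer yes or no.\nSentence: "{sentence}"')
        reply = generator([{"role": "user", "content": q}],
                          max_new_tokens=8)[0]["generated_text"][-1]["content"]
        if reply.strip().lower().startswith("yes"):
            relevant.append(sentence)

    if not relevant:
        return "normal"  # nothing mentioned for this organ

    # Stage 2: chain-of-thought abnormality decision over the relevant sentences.
    prompt = ABNORMALITY_PROMPT.format(organ=organ, sentences="\n".join(relevant))
    reply = generator([{"role": "user", "content": prompt}],
                      max_new_tokens=256)[0]["generated_text"][-1]["content"]
    # Simplistic answer parsing, for illustration only: "normal" does not
    # end with "abnormal", so endswith() distinguishes the two answers.
    return "abnormal" if reply.lower().rstrip(' ."').endswith("abnormal") else "normal"
```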
The researchers tested the model on 100 private reports and found that their version of MAPLEZ-CT outperformed versions built on other publicly available Llama models, as well as a rules-based model.
Performance of large language models in classifying abnormalities in CT reports

| Model | F1 score |
| --- | --- |
| MAPLEZ-CT (Llama-3-70B-Instruct) | 0.954 |
| MAPLEZ-CT (Llama-3) | 0.86 |
| MAPLEZ-CT (Llama-2) | 0.743 |
| Rules-based model | 0.568 |
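For context, the F1 score is the harmonic mean of precision and recall over the per-organ abnormal/normal labels. A minimal sketch of how such a score could be computed against radiologist ground truth, using scikit-learn and made-up labels:

```python
# Minimal F1 computation sketch using scikit-learn; the labels are made up.
from sklearn.metrics import f1_score

ground_truth = [1, 0, 1, 1, 0, 1]   # radiologist labels (1 = abnormal)
predictions  = [1, 0, 1, 0, 0, 1]   # LLM labels
print(f1_score(ground_truth, predictions))  # harmonic mean of precision/recall
```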
The MAPLEZ-CT model using Llama-3-70B-Instruct also outperformed the other models when evaluated on each organ included in the study: the intestine, gallbladder, kidney, spleen, and liver.
With these results in mind, Lanfredi said, a single LLM could one day classify multiple abnormalities at once rather than relying on separate specialized models. He told AuntMinnie.com that the team is now working to use these labels to train a vision classifier and see what results can be achieved.
“It will probably be much more challenging than a vision classifier for chest x-rays just from the volume of information in a CT [report],” Lanfredi said. “It might need some weeks of revision for localization for abnormalities.”
Lanfredi is a postdoctoral fellow in the research lab of Ronald Summers, MD, PhD, senior investigator in the Department of Radiology and Imaging Sciences at the NIH Clinical Center.