How do open-source LLMs compare with GPT-4 on x-ray reports?

Freely available large language models (LLMs) may overcome limitations associated with proprietary models like GPT-4 when extracting findings from chest x-ray reports, according to a study published October 29 in Radiology.

The finding comes from a comparison of open-source LLMs such as Llama, Mistral, and Qwen against GPT-3.5 Turbo and GPT-4 using two independent datasets of free-text radiology reports.

“By demonstrating privacy, cost-effectiveness, and reproducibility, these models represent an alternative to their proprietary counterparts for text classification and structuring tasks,” noted lead author Felix Dorfner, PhD, of Harvard Medical School in Cambridge, MA, and colleagues.

LLMs are being explored for their potential to transform unstructured radiology reports into structured reporting formats and for various classification and summarization tasks. Much of the attention has focused on commercial LLMs such as GPT-4, yet these models have drawbacks, such as privacy concerns stemming from the need to send data to remote servers, the authors explained.

Conversely, despite their potential, freely available LLMs -- models that can preserve privacy in local hospital systems -- remain largely overlooked for radiology report classification, they noted.

To address this gap in knowledge, the group compared the models' ability to accurately label the presence of multiple findings using two independent datasets (the ImaGenome dataset [n = 450] and an institutional dataset [n = 500]) that together comprised 950 chest x-ray reports. The findings labeled in the chest x-ray reports included atelectasis, fracture, enlarged cardiomediastinum, support devices, pneumothorax, pneumonia, pleural effusion, pleural other, lung opacity, lung lesion, edema, consolidation, and cardiomegaly.

The researchers used both few-shot and zero-shot prompts. In few-shot learning, models receive examples of the task in the prompt together with the task instruction, whereas in zero-shot prompting, only the task instruction is provided.
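The distinction can be made concrete with a toy sketch. The code below is illustrative only (the finding, report text, and wording are hypothetical, not the study's actual prompts); it shows how a few-shot prompt prepends labeled examples to the same instruction a zero-shot prompt uses alone.

```python
# Hypothetical sketch of zero-shot vs. few-shot prompt construction
# for labeling one finding in a chest x-ray report.

FINDING = "pleural effusion"  # example finding; the study labeled 13

def zero_shot_prompt(report: str) -> str:
    # Zero-shot: only the task instruction and the report to classify.
    return (
        f"Does the following chest x-ray report mention {FINDING}? "
        "Answer 'yes' or 'no'.\n\n"
        f"Report: {report}\nAnswer:"
    )

def few_shot_prompt(report: str, examples: list[tuple[str, str]]) -> str:
    # Few-shot: the same instruction, preceded by worked examples.
    shots = "\n\n".join(
        f"Report: {text}\nAnswer: {label}" for text, label in examples
    )
    return (
        f"Does the following chest x-ray report mention {FINDING}? "
        "Answer 'yes' or 'no'.\n\n"
        f"{shots}\n\nReport: {report}\nAnswer:"
    )

examples = [
    ("Small left pleural effusion is present.", "yes"),
    ("Lungs are clear. No effusion or pneumothorax.", "no"),
]
print(zero_shot_prompt("Moderate right pleural effusion."))
print(few_shot_prompt("Moderate right pleural effusion.", examples))
```

In practice the model's completion after the final "Answer:" would be parsed into a binary label for that finding.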

In the ImaGenome dataset, the best-performing open-source model was Llama 2-70B, with micro F1 scores (a combined measure of precision and recall) of 0.97 for both zero-shot and few-shot prompting. GPT-4 achieved micro F1 scores of 0.98 for both prompting strategies.
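A micro-averaged F1 score pools true positives, false positives, and false negatives across all finding labels before computing F1, so frequent findings weigh more heavily than rare ones. A minimal sketch with toy data (not from the study):

```python
# Micro-averaged F1 over multi-label predictions (toy example).
# Each report's labels are represented as a set of finding names.

def micro_f1(y_true: list[set], y_pred: list[set]) -> float:
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        tp += len(truth & pred)   # findings correctly predicted
        fp += len(pred - truth)   # findings predicted but not present
        fn += len(truth - pred)   # findings present but missed
    return 2 * tp / (2 * tp + fp + fn)

y_true = [{"pleural effusion", "cardiomegaly"}, {"pneumonia"}, set()]
y_pred = [{"pleural effusion", "cardiomegaly"}, {"pneumonia", "edema"}, set()]
print(round(micro_f1(y_true, y_pred), 2))  # → 0.86
```

Here one spurious "edema" prediction adds a single false positive against three pooled true positives, giving 2·3 / (2·3 + 1 + 0) ≈ 0.86.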

On the institutional dataset, an ensemble the researchers developed comprising the open-source models Llama 2-70B, Mixtral-8×7B, and Qwen1.5-72B had the highest scores, with a micro F1 score of 0.96 for zero-shot prompting and 0.97 for few-shot prompting. GPT-4 achieved micro F1 scores of 0.98 for zero-shot prompting and 0.97 for few-shot prompting.

“These results show that open-source LLMs can serve as a viable alternative to GPT-4, as they are close in performance and offer several other important advantages,” the researchers wrote.

Aside from overcoming privacy concerns, for instance, there is no additional cost to classify reports using open-source models, whereas the GPT-4 application programming interface is charged on a per-token basis, which can be very costly, the researchers suggested.

In addition, the use of open-source models ensures consistency and reproducibility over time because the models are local, they wrote.

“These results highlight the potential of open-source LLMs to improve clinical research and practice,” the group concluded.

