ChatGPT-4 produces 'near perfect' pancreatic cancer radiology reports

Jun 18, 2024

GPT-4 outperforms GPT-3.5 at creating structured, summarized radiology reports for pancreatic ductal adenocarcinoma (PDAC), researchers have found.

The study results are good news for both clinicians and patients, as the AI tool could improve surgical decision-making, noted a team led by Rajesh Bhayana, MD, of the University of Toronto in Canada in an article published June 18 in Radiology.

"[We found that] GPT-4 created near-perfect PDAC synoptic reports from original reports … [that] GPT-4 with chain-of-thought achieved high accuracy in categorizing resectability … [and that] surgeons were more accurate and efficient [when they used] AI-generated reports," the group wrote.

Imaging is key to determining which pancreatic tumors are eligible for surgery and which are not, Bhayana and colleagues explained. But compared with free-text descriptions from imaging reports, "structured pancreatic CT reports improve communication between radiologists and surgeons and improve surgical planning and decision-making," the team wrote, further noting that "radiologist adoption of structured reporting for pancreatic cancer is inconsistent, and resectability criteria are heterogeneously applied and tumor categorization is variably reported."

To assess whether large language models (LLMs) could mitigate this inconsistency, the investigators compared the ability of GPT-3.5 and GPT-4 to automatically create structured PDAC reports from original CT imaging reports. Their study included 180 consecutive PDAC staging CT reports from patients referred to Toronto's Princess Margaret Cancer Centre from January to December 2018.

Two radiologists reviewed the PDAC reports and set a reference standard for 14 key features and for the National Comprehensive Cancer Network (NCCN) resectability category. (Key features included, among others, tumor location, tumor size, pancreatic duct, bile ducts, celiac artery, superior mesenteric artery, common hepatic artery, aorta, major veins, lymph nodes, and metastases.) The researchers then evaluated the performance of ChatGPT-3.5 and ChatGPT-4 for recall, precision, and F1 score (the harmonic mean of precision and recall, with a best value of 1 and a worst value of 0). Additionally, hepatopancreaticobiliary surgeons assessed both original and AI-generated reports to determine PDAC resectability, and the researchers compared the surgeons' accuracy and review time between the two report types.
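For readers unfamiliar with the metric, the following is a minimal Python sketch of how precision, recall, and F1 are related. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts. Undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one extracted feature (not study data).
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because the harmonic mean is pulled toward the lower of the two values, a model cannot reach a high F1 score by doing well on precision alone or on recall alone.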

The group found that, compared with GPT-3.5, GPT-4 produced equal or higher F1 scores for all 14 extracted features. For categorizing resectability, GPT-4 outperformed GPT-3.5 with each prompting strategy (i.e., chain-of-thought, knowledge), and chain-of-thought prompting was the most accurate. Reports generated by GPT-4 also reduced the time surgeons spent reviewing each report by 58%.

Bhayana's team also reported the following:

Comparison of ChatGPT-3.5 to ChatGPT-4 for PDAC radiology
Measure	ChatGPT-3.5	ChatGPT-4
F1 score, creation of summary reports	0.97	0.99
Precision, identifying tumor location	99.4%	100%
Surgeon accuracy for categorizing resectability (first value: original reports; second value: AI-generated reports)	76%	83%

"Our study demonstrates a useful application of large language models (LLMs) in pancreatic cancer care that can increase standardization, improve communication, and enhance efficiency and quality of report review by surgeons," the authors concluded.

The research supports "the sanguine view that AI, especially generative AI, will be an important enabler to achieve much-needed improvements in efficiency and value throughout the radiology workflow," wrote Paul Chang, MD, of the University of Chicago School of Medicine, in a commentary that accompanied the study. But there's more work to be done.

"A sobering reality must be acknowledged: there is … [a] gap between promising feasibility and providing operational solutions," Chang noted. "For example, how can we best incorporate this promising AI-enabled capability into a scalable and comprehensive workflow orchestration? Such a solution would need to be able to generate the appropriate downstream product in a generalizable and contextually aware manner."

The complete study is available in Radiology.