ChatGPT-4's Accuracy in Estimating Thyroid Nodule Features and Cancer Risk From Ultrasound Images

Esteban Cabezas, David Toro-Tobon, Thomas Johnson, Marco Álvarez, Javad R. Azadi, Camilo Gonzalez-Velasquez, Naykky Singh Ospina, Oscar J. Ponce, Megan E. Branda, Juan P. Brito*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Objective: To evaluate the performance of GPT-4 and GPT-4o in accurately identifying features and categories from thyroid nodule ultrasound images following the American College of Radiology Thyroid Imaging Reporting and Data System (TI-RADS).

Methods: This comparative validation study, conducted between October 2023 and May 2024, used 202 thyroid ultrasound images sourced from 3 open-access databases. Both complete and cropped versions of each image were independently evaluated by expert radiologists to establish a reference standard for TI-RADS features and categories. GPT-4 and GPT-4o were prompted to analyze each image, and their generated TI-RADS outputs were compared to the reference standard.

Results: GPT-4 demonstrated high specificity but low sensitivity when assessing complete thyroid ultrasound images across most TI-RADS categories, resulting in mixed overall accuracy. For low-risk (benign) nodules, GPT-4 achieved 25.0% sensitivity, 99.5% specificity, and 93.6% accuracy. In contrast, in the higher-risk, moderately suspicious category, GPT-4 showed 75% sensitivity, 22.2% specificity, and 42.1% accuracy. While GPT-4 effectively identified features such as smooth margins (73% vs 65% in the reference standard), it struggled to identify other features such as isoechoic echogenicity (5% vs 46%) and echogenic foci (3% vs 27%). The assessment of cropped images using both GPT-4 and GPT-4o followed similar patterns, though with slight deviations indicating a decrease in performance compared to GPT-4's assessment of complete images.

Conclusion: While GPT-4 and GPT-4o models show potential for improving the efficiency of thyroid nodule triage, their performance remains suboptimal, particularly in higher-risk categories. Further refinement and validation of these models are necessary before clinical implementation.
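The per-category sensitivity, specificity, and accuracy figures reported above are the kind of metrics obtained by treating one TI-RADS category as "positive" and all other categories as "negative". The following is a minimal sketch of that one-vs-rest calculation, not the authors' analysis code. The function name `one_vs_rest_metrics`, the category labels (`TR2` to `TR5`), and the toy label lists are assumptions made for illustration and are not data from the study.

```python
from typing import Dict, Sequence


def one_vs_rest_metrics(
    reference: Sequence[str], predicted: Sequence[str], category: str
) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy for one category vs all others."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    tp = fp = tn = fn = 0
    for ref, pred in zip(reference, predicted):
        if ref == category and pred == category:
            tp += 1  # reference and model both assign the category
        elif ref == category:
            fn += 1  # reference assigns it, model does not
        elif pred == category:
            fp += 1  # model assigns it, reference does not
        else:
            tn += 1  # neither assigns it

    total = tp + fp + tn + fn
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "accuracy": (tp + tn) / total if total else nan,
    }


# Hypothetical toy labels for illustration only; NOT data from the study.
reference = ["TR2", "TR2", "TR4", "TR4", "TR4", "TR5", "TR3", "TR3"]
predicted = ["TR2", "TR4", "TR4", "TR4", "TR3", "TR4", "TR4", "TR3"]

metrics = one_vs_rest_metrics(reference, predicted, "TR4")
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these toy labels, the model over-assigns `TR4`, so sensitivity for that category is higher than its specificity. The same mechanism can produce a pattern like the one reported for the moderately suspicious category, and it also shows why accuracy can look high for a rare category when specificity is high even if sensitivity is low.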

Original language: English
Pages (from-to): 716-723
Number of pages: 8
Journal: Endocrine Practice
Volume: 31
Issue number: 6
Publication status: Published - Jun 2025
Externally published: Yes

ASJC Scopus subject areas

  • Endocrinology, Diabetes and Metabolism
  • Endocrinology

Keywords

  • artificial intelligence
  • chatbot
  • large language models
  • thyroid nodule
  • TI-RADS
  • ultrasound
