TY - JOUR
T1 - ChatGPT-4's Accuracy in Estimating Thyroid Nodule Features and Cancer Risk From Ultrasound Images
AU - Cabezas, Esteban
AU - Toro-Tobon, David
AU - Johnson, Thomas
AU - Álvarez, Marco
AU - Azadi, Javad R.
AU - Gonzalez-Velasquez, Camilo
AU - Singh Ospina, Naykky
AU - Ponce, Oscar J.
AU - Branda, Megan E.
AU - Brito, Juan P.
N1 - Publisher Copyright: © 2025 AACE
PY - 2025/6
Y1 - 2025/6
AB - Objective: To evaluate the performance of GPT-4 and GPT-4o in accurately identifying features and categories from thyroid nodule ultrasound images following the American College of Radiology Thyroid Imaging Reporting and Data System (TI-RADS). Methods: This comparative validation study, conducted between October 2023 and May 2024, utilized 202 thyroid ultrasound images sourced from 3 open-access databases. Both complete and cropped versions of each image were independently evaluated by expert radiologists to establish a reference standard for TI-RADS features and categories. GPT-4 and GPT-4o were prompted to analyze each image, and their generated TI-RADS outputs were compared to the reference standard. Results: GPT-4 demonstrated high specificity but low sensitivity when assessing complete thyroid ultrasound images across most TI-RADS categories, resulting in mixed overall accuracy. For low-risk nodules (benign), GPT-4 achieved 25.0% sensitivity, 99.5% specificity, and 93.6% accuracy. In contrast, for the higher-risk, moderately suspicious category, GPT-4 showed 75% sensitivity, 22.2% specificity, and 42.1% accuracy. While GPT-4 effectively identified features such as smooth margins (73% vs 65% in the reference standard), it struggled to identify others, such as isoechoic echogenicity (5% vs 46%) and echogenic foci (3% vs 27%). The assessment of cropped images using both GPT-4 and GPT-4o followed similar patterns, though with slight deviations indicating a decrease in performance compared to GPT-4's assessment of complete images. Conclusion: While GPT-4 and GPT-4o models show potential for improving the efficiency of thyroid nodule triage, their performance remains suboptimal, particularly in higher-risk categories. Further refinement and validation of these models are necessary before clinical implementation.
KW - artificial intelligence
KW - chatbot
KW - large language models
KW - thyroid nodule
KW - TI-RADS
KW - ultrasound
UR - https://www.scopus.com/pages/publications/105003062635
U2 - 10.1016/j.eprac.2025.03.008
DO - 10.1016/j.eprac.2025.03.008
M3 - Article
C2 - 40139461
AN - SCOPUS:105003062635
SN - 1530-891X
VL - 31
SP - 716
EP - 723
JO - Endocrine Practice
JF - Endocrine Practice
IS - 6
ER -