Abstract
Objective: To evaluate how accurately GPT-4 and GPT-4o identify features and categories from thyroid nodule ultrasound images according to the American College of Radiology Thyroid Imaging Reporting and Data System (TI-RADS).
Methods: This comparative validation study, conducted between October 2023 and May 2024, used 202 thyroid ultrasound images sourced from 3 open-access databases. Expert radiologists independently evaluated both complete and cropped versions of each image to establish a reference standard for TI-RADS features and categories. GPT-4 and GPT-4o were prompted to analyze each image, and their generated TI-RADS outputs were compared with the reference standard.
Results: When assessing complete thyroid ultrasound images, GPT-4 demonstrated high specificity but low sensitivity across most TI-RADS categories, resulting in mixed overall accuracy. For low-risk (benign) nodules, GPT-4 achieved 25.0% sensitivity, 99.5% specificity, and 93.6% accuracy. In contrast, for the higher-risk, moderately suspicious category, GPT-4 showed 75% sensitivity, 22.2% specificity, and 42.1% accuracy. GPT-4 identified some features, such as smooth margins, at rates close to the reference standard (73% vs 65% in the reference standard), but it struggled with others, such as isoechoic echogenicity (5% vs 46%) and echogenic foci (3% vs 27%). The assessment of cropped images by both GPT-4 and GPT-4o followed similar patterns, with slight deviations indicating lower performance than GPT-4's assessment of complete images.
Conclusion: Although GPT-4 and GPT-4o show potential for improving the efficiency of thyroid nodule triage, their performance remains suboptimal, particularly in higher-risk categories. Further refinement and validation of these models are necessary before clinical implementation.
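The sensitivity, specificity, and accuracy figures reported for each category are linked by a simple identity: accuracy is the prevalence-weighted average of sensitivity and specificity. The sketch below is illustrative only. It assumes each category was scored one-vs-rest on a single set of images, which the abstract does not state, and the function names are hypothetical. The implied prevalence it returns is a back-of-envelope quantity, not a value reported by the study.

```python
def accuracy_from(sensitivity, specificity, prevalence):
    """Accuracy as the prevalence-weighted mean of sensitivity and specificity."""
    return prevalence * sensitivity + (1.0 - prevalence) * specificity


def implied_prevalence(sensitivity, specificity, accuracy):
    """Solve accuracy = p*sens + (1 - p)*spec for p.

    Assumption: all three metrics were computed on the same image set.
    """
    if sensitivity == specificity:
        raise ValueError("prevalence is not identifiable when sens == spec")
    return (specificity - accuracy) / (specificity - sensitivity)


# Figures quoted in the abstract (as fractions).
benign = implied_prevalence(0.25, 0.995, 0.936)
moderate = implied_prevalence(0.75, 0.222, 0.421)

print(f"benign category, implied share of images: {benign:.3f}")
print(f"moderately suspicious, implied share of images: {moderate:.3f}")
```

Under these assumptions, the benign category's 93.6% accuracy is driven mostly by correctly rejected negatives rather than by detection of benign nodules, which is why high accuracy can coexist with 25.0% sensitivity.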
| Original language | English |
|---|---|
| Pages (from-to) | 716-723 |
| Number of pages | 8 |
| Journal | Endocrine Practice |
| Volume | 31 |
| Issue number | 6 |
| DOIs | |
| Publication status | Published - Jun 2025 |
| Externally published | Yes |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 3 Good Health and Well-being
ASJC Scopus subject areas
- Endocrinology, Diabetes and Metabolism
- Endocrinology
Keywords
- artificial intelligence
- chatbot
- large language models
- thyroid nodule
- TI-RADS
- ultrasound
Fingerprint
Dive into the research topics of 'ChatGPT-4's Accuracy in Estimating Thyroid Nodule Features and Cancer Risk From Ultrasound Images'. Together they form a unique fingerprint.