Applications of ChatGPT in the diagnosis, management, education, and research of retinal diseases: a scoping review
International Journal of Retina and Vitreous volume 10, Article number: 79 (2024)
Abstract
Purpose
This scoping review aims to explore the current applications of ChatGPT in the retina field, highlighting its potential, challenges, and limitations.
Methods
A comprehensive literature search was conducted across multiple databases, including PubMed, Scopus, MEDLINE, and Embase, to identify relevant articles published from 2022 onwards. The inclusion criteria focused on studies evaluating the use of ChatGPT in retinal healthcare. Data were extracted and synthesized to map the scope of ChatGPT’s applications in retinal care, categorizing articles into various practical application areas such as academic research, charting, coding, diagnosis, disease management, and patient counseling.
Results
A total of 68 articles were included in the review, distributed across several categories: 8 related to academics and research, 5 to charting, 1 to coding and billing, 44 to diagnosis, 49 to disease management, 2 to literature consulting, 23 to medical education, and 33 to patient counseling. Many articles were classified into multiple categories due to overlapping topics. The findings indicate that while ChatGPT shows significant promise in areas such as medical education and diagnostic support, concerns regarding accuracy, reliability, and the potential for misinformation remain prevalent.
Conclusion
ChatGPT offers substantial potential in advancing retinal healthcare by supporting clinical decision-making, enhancing patient education, and automating administrative tasks. However, its current limitations, particularly in clinical accuracy and the risk of generating misinformation, necessitate cautious integration into practice, with continuous oversight from healthcare professionals. Future developments should focus on improving accuracy, incorporating up-to-date medical guidelines, and minimizing the risks associated with AI-driven healthcare tools.
Introduction
Large language models (LLMs) such as ChatGPT (OpenAI, San Francisco, CA, USA), Bing Chat (Microsoft Corporation, Redmond, WA, USA), and Gemini (Google LLC, Mountain View, CA, USA) have gained substantial popularity and represent significant advancements in natural language processing. These models, trained on extensive datasets, can interpret texts, commands, and questions, generating responses that closely mimic human conversation. Among them, ChatGPT has emerged as the most prominent, receiving widespread usage and recognition, even within scientific literature [1].
GPT stands for “Generative Pre-trained Transformer,” a type of artificial intelligence model designed to understand and generate human-like text. It is pre-trained on vast amounts of data and fine-tuned for specific tasks, enabling it to respond coherently to a wide range of language inputs. The potential applications of ChatGPT in various fields, including journalism, marketing, education, and professional writing, are considerable. However, its role in medicine, particularly in specialized areas like retinal care, remains uncertain. In medicine, potential uses include medical education, clinical decision support, patient counseling, academic writing, charting, and billing, among others, as the technology continues to evolve [2]. Epic (Epic Systems, Verona, WI, USA), a widely used electronic charting system, has already integrated AI tools to assist with various tasks such as generating visit summaries, coding, drafting messages to patients, and charting. These AI-driven features aim to streamline administrative tasks, reduce the burden on healthcare providers, and improve the overall efficiency of patient care [3].
One of the most explored applications of ChatGPT in ophthalmology is medical education. Studies have shown that ChatGPT often outperforms other LLMs in medical knowledge assessments, including board-style questions, official board examinations, and challenging clinical cases, sometimes even surpassing the performance of the average physician [4,5,6,7,8,9,10,11,12,13]. Although caution is needed due to the potential for inaccuracies, ChatGPT shows significant promise as a supportive tool throughout various stages of medical training.
In patient management, the performance of ChatGPT has yielded mixed results. It is possible to obtain diagnostic and management recommendations by inputting patient data and clinical presentations, but concerns about hallucinations—where the AI generates plausible-sounding but factually incorrect or nonsensical information—and factual inaccuracies limit its reliability. For example, ChatGPT may fabricate references that do not exist, presenting them as legitimate sources of information [14]. Studies have evaluated ChatGPT’s ability to provide management advice, with results varying based on case complexity, presentation style, and the benchmarks used for comparison [14,15,16,17,18,19,20,21,22,23,24,25]. The use of ChatGPT for triage and direct patient counseling in ophthalmology and retinal care has also been investigated. Patients could interact with the LLM to describe their symptoms and receive guidance on the appropriate level of care. However, challenges such as potential patient miscommunication and the LLM’s ability to accurately interpret symptoms remain [26,27,28].
Another area where ChatGPT could be beneficial is in automating charting and billing processes. In versions that accept audio inputs, ChatGPT can process clinical consultation data to generate organized reports and notes, potentially reducing paperwork and freeing resources for other aspects of medical care [29,30,31,32,33]. Nevertheless, concerns about inaccuracies, privacy, and security are significant, particularly regarding the sensitivity of patient information and evolving regulations on data privacy for AI tools [33].
There has also been significant discussion about ChatGPT’s role in academic writing. For non-native English speakers, ChatGPT offers considerable benefits as a tool for drafting and reviewing academic content [34,35,36]. However, its tendency to produce hallucinations and factual errors raises concerns about academic integrity. Issues related to authorship and responsibility have been highlighted, with a consensus that ChatGPT should not be credited as an author in most scientific journals [37,38,39].
To better understand ChatGPT’s applications in retinal care, we conducted a scoping review of the literature. Our review highlights the primary uses of this LLM in the retina field, along with the associated risks and challenges it presents.
Methods
This scoping review aimed to comprehensively explore the current literature on the applications of ChatGPT in the diagnosis, management, and research of retinal diseases. Prior to initiating the review, we conducted a preparatory literature search to ensure no similar reviews had previously been conducted. To maintain high standards of scientific rigor, we published a prospectus of our review on the Open Science Framework (OSF) public repository. The original prospectus can be found in the supplementary materials. The review protocol adhered to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for scoping reviews [40].
We executed a comprehensive search across multiple databases, including PubMed, Scopus, MEDLINE, and Embase, to identify relevant articles. Our search strategy comprised specific query strings designed to capture articles discussing the use of ChatGPT in retinal healthcare. For the search to be comprehensive yet focused, we stipulated that each article must contain at least one keyword from both of our designated groups. Group 1 keywords centered on the technology in question: ChatGPT, GPT, LLM, or “large language”. Group 2 keywords centered on the field of retinal diseases: ophthalmolog*, retin*, vitre*, uvea*, uvei*, chor*, and macul*. Initially, “eye” was part of Group 2, and “generative” was included in Group 1. However, upon review, “eye” was casting too wide a net, capturing an excess of articles not specific to ophthalmology. Similarly, “generative” was drawing in a significant number of irrelevant results related to generative adversarial networks rather than our intended AI models. Therefore, we refined our terms to those most pertinent and specific to our research. The detailed search strings used in each database are presented in Table 1; an illustrative sketch of how the two keyword groups combine is shown below. In addition to database searches, we performed hand-searching of reference lists and grey literature using web-based search engines such as Google Scholar and repositories like ResearchGate. Preprints available in Scopus were also included. Given that the first version of ChatGPT was launched in November 2022, the search period was restricted to publications from that year onward. The searches were conducted on April 21, 2024. To ensure the inclusion of recent literature, a simplified search was conducted at the end of the review process, two months later, focusing on the terms “ChatGPT” and “retina,” and was limited to PubMed.
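To make the boolean structure of this strategy concrete, the sketch below combines the two keyword groups into a single query string. It is illustrative only; the exact database-specific syntax used in the review is given in Table 1.

```python
# Illustrative sketch of how the two keyword groups described above combine
# into a single boolean query; the exact database-specific strings used in
# the review are those reported in Table 1.

GROUP_1 = ['ChatGPT', 'GPT', 'LLM', '"large language"']            # technology terms
GROUP_2 = ['ophthalmolog*', 'retin*', 'vitre*', 'uvea*',           # retina-field terms
           'uvei*', 'chor*', 'macul*']

def build_query(group_1, group_2):
    """Require at least one keyword from each group (AND of two OR-blocks)."""
    return f"({' OR '.join(group_1)}) AND ({' OR '.join(group_2)})"

print(build_query(GROUP_1, GROUP_2))
# (ChatGPT OR GPT OR LLM OR "large language") AND (ophthalmolog* OR retin* OR ...)
```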
The inclusion criteria for this review encompassed studies discussing the application of ChatGPT across all aspects of retinal healthcare and knowledge production, involving researchers, healthcare professionals, patients, and their families within the care setting for retinal diseases. We concentrated on studies evaluating the feasibility, task facilitation, accuracy, and limitations of using ChatGPT. Articles were excluded if they did not pertain to retinal diseases, assessed other forms of AI without direct comparison to ChatGPT, were purely theoretical without empirical evaluation or application, or were not available in English.
Two independent, blinded reviewers (VCFB and MLS) screened the titles and abstracts of identified studies for eligibility using the Rayyan platform (Rayyan Inc, USA). Full texts of potentially eligible studies were retrieved and independently assessed. Any disagreements were resolved by consensus, ensuring a rigorous inclusion of relevant articles.
Data extraction was conducted by the two independent reviewers, who documented study characteristics, application areas, and outcomes from each eligible study. Studies were categorized based on their practical application areas, with the possibility of a single article falling into multiple categories. These categories included: academics and research, charting, coding and billing, diagnosis, disease management, literature consulting, medical education, and patient counseling. Outcomes of interest included the feasibility of using ChatGPT, its effectiveness in facilitating healthcare-related tasks, the accuracy of the information provided, and the limitations or challenges encountered in its application.
The extracted data were synthesized narratively to map the scope of ChatGPT’s applications in retinal healthcare. We identified patterns, themes, potentials, and limitations across different application areas.
As this study was a scoping review, it did not involve direct interaction with human subjects, and therefore, ethical approval was not required. Nevertheless, the review was conducted following rigorous scientific and ethical standards to ensure the reliability and validity of the findings.
Results
The initial search conducted on April 21, 2024, yielded a total of 518 articles. The distribution of these articles across the different databases was as follows: PubMed (197), Scopus (160), MEDLINE (136), and Embase (25). After removing 256 duplicates, 262 unique articles remained for screening.
During the independent screening process, the reviewers initially disagreed on the inclusion of 10 articles. These conflicts were resolved through open discussion, resulting in 98 articles being included for full-text reading. Following detailed examination, 32 articles were excluded for not meeting the inclusion criteria, leaving a total of 66 articles for inclusion in the final analysis.
Out of the 164 articles excluded during the screening phase, 108 were not specific to the retina, 27 did not pertain to ChatGPT, and 29 were classified as purely theoretical, including review articles, comments (without empirical data), and replies. From the 98 articles assessed for full-text eligibility, an additional 32 were excluded: 14 were not specific to retina, 2 were not about ChatGPT, and an additional 16 were purely theoretical. It is important to note that articles not specific to retina but related to ophthalmology in general were included if they compared subspecialties and/or specifically mentioned retina. A PRISMA flowchart is presented in Fig. 1.
To ensure the inclusion of the most recent and relevant studies, an update query was conducted on July 7, 2024, using PubMed with the search terms “ChatGPT” and “retina,” filtered for publications from April 2024 onwards. This search resulted in 10 additional articles, of which 6 were identified as duplicates. After reviewing the full text of the remaining 4 articles, 2 were deemed relevant and included in the analysis. The other 2 were excluded due to being purely theoretical and not related to retinal care, respectively.
In total, 68 articles were included in this scoping review. Although purely theoretical articles, such as commentaries and reviews, were excluded from the primary analysis, they were taken into consideration when deemed relevant (i.e., if they introduced new and pertinent information). This approach ensured a comprehensive overview of the current state of knowledge and opinions regarding the use of ChatGPT in retinal healthcare.
The articles were categorized as follows (Fig. 2): 8 focused on academics and research, 5 on charting, 1 on coding and billing, 44 on diagnosis, and 49 on disease management. Additionally, 2 articles were about literature consulting, and 23 addressed medical education, specifically analyzing ChatGPT’s performance on exams. Furthermore, 33 articles involved patient counseling, either in preparing patient materials or evaluating ChatGPT’s reliability as an information source. Lastly, 3 articles dealt with symptom triaging, a category not initially defined in the protocol. During screening and extraction, this category was temporarily labeled as “Other” but was later redefined as “Triage and Pre-Hospital Management” for better organization. This category includes the processes involved in the initial assessment and prioritization of patients’ symptoms, as well as decisions regarding the need for immediate medical intervention before hospital admission.
Discussion
Academics and research
The integration of ChatGPT and other AI tools into academic and research practices within retinal care showcases both promising potential and significant limitations. Various studies have explored the utility of these technologies in different capacities, providing insights into their performance and highlighting areas for improvement.
Among the studies focused on the academic uses of ChatGPT, we highlight the one published by Valentín-Bravo et al., who examined ChatGPT for its ability to generate scientific content on topics such as the complications associated with silicone oil in vitreoretinal surgery. While ChatGPT was capable of producing coherent summaries, it often lacked the necessary scientific rigor and accuracy, exhibiting issues such as inappropriate scientific discourse and incorrect bibliographic references. This underscores the necessity of human oversight to ensure reliability and address ethical concerns [41]. Comparatively, the performance of ChatGPT-3.5 and Bing Chat on a multiple-choice ophthalmology exam revealed that Bing Chat, enhanced by its integration with online data retrieval, outperformed ChatGPT-3.5. However, Bing Chat’s reliance on non-peer-reviewed sources necessitated a careful appraisal of its responses for educational purposes, highlighting the importance of verifying the quality and reliability of AI-generated content [42].
Further illustrating AI’s potential, the development of a bilingual Chinese-English Indocyanine Green Angiography (ICGA) report generation and QA system using AI demonstrated high accuracy and substantial agreement among ophthalmologists. Nevertheless, the system struggled with rare cases and occasionally provided vague or inaccurate answers. This suggests that while AI can streamline research processes, particularly in managing large datasets and facilitating communication across languages, there remains a need for ongoing optimization [43]. However, there may be a gap in ChatGPT’s performance across different languages. In a study on retinal vascular disease classification, ChatGPT’s diagnostic performance was superior with English prompts compared to Chinese prompts, pointing to the limitations of current large language models in non-English settings and the need for further development [44].
In more complex clinical scenarios, GPT-4’s performance in answering questions about intricate ophthalmology cases was evaluated. An article by Milad et al. demonstrated that improved prompting strategies enhanced GPT-4’s performance, yet it still lagged behind ophthalmology trainees. This article, however, introduced valuable concepts on how proper prompt engineering can yield better results. It compares Traditional Zero-Shot Prompting, which involves directly asking GPT-4 to solve a task without providing any prior examples or structured guidance, with Zero-Shot Plan-and-Solve+ (PS+). The PS+ strategy involves asking GPT-4 to create a plan by breaking down the main task into smaller subtasks, followed by executing these subtasks with detailed instructions. This method improved the logical reasoning and accuracy of the model by providing a structured approach to problem-solving [11].
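As an illustration of the difference between these two strategies, the following sketch contrasts a direct zero-shot prompt with a Plan-and-Solve+ style prompt. It assumes the OpenAI Python client, and both the clinical vignette and the prompt wording are hypothetical rather than taken from Milad et al.

```python
# Minimal sketch contrasting traditional zero-shot prompting with a
# Plan-and-Solve+ style prompt. The vignette and prompt wording are
# hypothetical illustrations, not the prompts used in the cited study.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

CASE = (
    "65-year-old with type 2 diabetes presenting with gradual central vision "
    "loss in the right eye and retinal thickening on OCT."
)

def ask(prompt: str) -> str:
    """Send a single-turn prompt to the model and return its reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Traditional zero-shot prompting: ask for the answer directly.
zero_shot = ask(
    f"Case: {CASE}\nWhat is the most likely diagnosis and the next step in management?"
)

# Plan-and-Solve+ style prompting: have the model plan subtasks, then execute them.
plan_and_solve = ask(
    f"Case: {CASE}\n"
    "First, devise a plan by breaking this problem into smaller subtasks "
    "(summarize the key findings, list differential diagnoses, weigh each against the findings). "
    "Then carry out each subtask step by step, and conclude with the most likely "
    "diagnosis and the recommended next step in management."
)

print(zero_shot)
print(plan_and_solve)
```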
To ensure more reliable responses, a group of researchers investigated the integration of GPT-4 with verified textbook knowledge. For this purpose, they created Aeyeconsult, an AI chatbot that leverages verified textbook knowledge alongside GPT-4. Aeyeconsult demonstrated superior accuracy and consistency compared to ChatGPT-4 alone. The use of verified sources and citation provision significantly enhanced the reliability of Aeyeconsult, underscoring the importance of source verification in AI applications for academic purposes and suggesting a pathway to improve AI reliability in medical education and research [45].
The evaluation of self-awareness capabilities in ChatGPT and its competitor Google Bard, now Google Gemini (Google LLC, Mountain View, CA, USA), found that while these chatbots showed some ability to self-correct, their self-checking capabilities were limited. This highlights the need for continuous improvement in AI’s ability to autonomously verify and correct its outputs, a vital feature for ensuring the reliability of AI-generated academic content [22].
Mihalache et al. evaluated the performance of ChatGPT-3.5 and ChatGPT-4 in providing information on age-related macular degeneration (AMD) and diabetic retinopathy (DR), comparing the chatbot responses with the American Academy of Ophthalmology’s Preferred Practice Pattern guidelines. Both models performed similarly, generally providing accurate and readable responses, though caution is advised as occasional inaccuracies and omissions were observed, emphasizing the need for professional oversight [46].
Lastly, in another investigation, Taloni et al. explored the potential misuse of ChatGPT-4’s Advanced Data Analysis (ADA) capabilities to create a fake medical dataset. Researchers provided the model with detailed prompts to fabricate data for 300 eyes from 250 keratoconus patients who underwent deep anterior lamellar keratoplasty (DALK) or penetrating keratoplasty (PK). The results showed that ADA successfully created a seemingly authentic dataset with statistically significant differences in preoperative and postoperative outcomes favoring DALK over PK. While not directly related to retinal care, this study raises significant concerns about the potential for ADA to generate convincing but false scientific evidence. This underscores the necessity for robust strategies to detect AI-generated data fabrication and safeguard the integrity of scientific research, whether in retinal ophthalmology or other fields of science [47].
In summary, while ChatGPT and similar AI tools hold significant potential for advancing academic and research activities in retinal care, their current limitations necessitate a cautious and critical approach. Ensuring human oversight, source verification, and addressing ethical concerns are imperative steps in harnessing the full potential of AI in this field.
Charting
The application of ChatGPT and other LLMs in the field of retina care extends significantly into the domain of charting, which includes tasks such as describing patient encounters, producing surgical notes, and creating discharge summaries. Various studies have highlighted both the potential benefits and limitations of these AI tools in automating and enhancing documentation processes [29,30,31,32, 43].
In one study, researchers developed a system called ICGA-GPT to assist with the interpretation of ICGA images by automating the generation of bilingual reports and enabling interactive question-answering. The ICGA-GPT model demonstrated satisfactory performance in generating detailed reports, with substantial agreement among ophthalmologists on their completeness and accuracy. This capability can significantly reduce the time and effort required for ophthalmologists to write ancillary examination reports and explain them to patients, thereby improving workflow efficiency and patient care [43].
The use of GPT-4 in documenting surgical notes for ophthalmic procedures has also shown promise. Studies have explored its application in generating detailed and contextually accurate operative notes for cataract surgeries, including complex cases with complications. This capability suggests potential applications in documenting retinal procedures as well. Furthermore, the ability of GPT-4 to generate templates without inputting confidential medical information ensures patient privacy while maintaining efficiency [29, 30].
Further extending the capabilities of ChatGPT, another study examined its performance in generating discharge summaries and operative notes across various ophthalmic subspecialties. ChatGPT produced valid and detailed documents rapidly, with the ability to incorporate specific medications, follow-up instructions, and other essential details based on the quality of input prompts. However, the presence of generic text and occasional factual inaccuracies necessitated human verification. The ability of ChatGPT to admit mistakes and correct itself upon prompting highlights its potential for continuous learning and improvement [31].
In summary, while ChatGPT and similar LLMs demonstrate significant potential in automating and enhancing charting tasks in retina care, their current limitations necessitate cautious integration into clinical practice. Ensuring human oversight, continuous updates, and validation by healthcare professionals are crucial to harness the full benefits of these AI tools. Future developments should focus on improving accuracy, incorporating up-to-date medical guidelines, training to minimize generic text and progressively replace it with customized, patient-specific information, and enhancing the models’ ability to autonomously verify and correct their outputs. With these improvements, AI can play a transformative role in reducing the administrative burden on healthcare providers and improving the overall efficiency and quality of medical documentation.
Coding and billing
The application of ChatGPT and other LLMs in coding and billing within retina care presents a promising opportunity to reduce administrative burdens. Ong et al. evaluated ChatGPT’s ability to generate ICD codes for retina clinic encounters, a crucial yet time-consuming task for physicians. Retina specialists created mock clinic scenarios, which were inputted into ChatGPT to generate the appropriate ICD codes. ChatGPT correctly produced ICD codes for 70% of the encounters, demonstrating a notable true positive rate. However, the study also identified significant weaknesses, such as the generation of fabricated or incorrect codes and occasional “hallucinations” of plausible-sounding but incorrect responses [32]. These findings highlight the need for improvements, including feedback fine-tuning and updated ICD code guidelines, to enhance accuracy and reliability in clinical practice.
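A minimal sketch of the kind of workflow described by Ong et al. is shown below: a mock retina clinic note is submitted to the model with a request for ICD-10 codes, and the output is compared against codes assigned by a specialist. The note, the prompt wording, and the reference code set are hypothetical, and the sketch assumes the OpenAI Python client.

```python
# Minimal sketch of the workflow described by Ong et al.: feed a mock retina
# clinic note to the model, ask for ICD-10 codes, and compare them with codes
# assigned by a retina specialist. The note, prompt, and reference codes are
# hypothetical; real use would require validation and up-to-date code sets.
from openai import OpenAI

client = OpenAI()

mock_note = (
    "68-year-old with type 2 diabetes, moderate nonproliferative diabetic "
    "retinopathy with macular edema in the right eye; left eye without retinopathy."
)
specialist_codes = {"E11.3311"}  # hypothetical ground-truth code set for this note

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"List only the ICD-10 codes (comma-separated) for this encounter:\n{mock_note}",
    }],
)
# The model may not always return a clean comma-separated list; this parsing is illustrative.
predicted = {code.strip() for code in response.choices[0].message.content.split(",")}

true_positives = predicted & specialist_codes
print(f"Predicted codes: {predicted}")
print(f"Agreement with specialist: {len(true_positives)}/{len(specialist_codes)}")
```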
Diagnosis
The integration of LLMs like ChatGPT into ophthalmology has demonstrated both significant promise and notable challenges, as highlighted by various studies exploring their application in the diagnostic process of ophthalmic conditions. These studies underscore advancements in AI models, particularly in improving diagnostic accuracy, while also identifying areas that require further refinement.
In terms of performance on diagnostic examinations, several studies have shown that newer iterations of LLMs, such as GPT-4, exhibit enhanced capabilities compared to earlier versions like GPT-3.5. For instance, Lin JC and colleagues reported that GPT-4 outperformed both GPT-3.5 and human participants on a practice ophthalmology written examination, scoring 76.9%, compared to 63.1% for GPT-3.5 and 72.6% for humans [48]. This trend of improved performance was echoed by Teebagy S. et al., who found that GPT-4 achieved a higher score on the Ophthalmology Knowledge Assessment Program (OKAP) examination, correctly answering 81% of the questions versus 57% for GPT-3.5 [49]. These findings suggest that the architectural advancements and enhanced training data in GPT-4 significantly contribute to its superior diagnostic performance.
Shemer et al. evaluated the diagnostic accuracy of ChatGPT in ophthalmology by comparing it to residents and attendings. ChatGPT showed lower accuracy rates (54–68%) compared to residents (75–94%) and attendings (71–86%), but it was significantly faster in providing diagnoses, completing cases in a fraction of the time required by humans [50].
Further research by Alexander AC et al. highlighted GPT-4’s superior handling of complex medical knowledge, achieving a perfect score on multiple-choice questions from India’s Foreign Medical Graduate Examination (FMGE), showcasing its potential in diagnostic applications [12]. Similarly, Antaki F et al. observed that GPT-4’s performance on ophthalmology board-style questions was on par with human respondents [5].
Beyond performance on exams, studies have also assessed ChatGPT’s diagnostic accuracy and its utility in clinical decision-making. For example, Hu X et al. investigated GPT-4’s ability to diagnose rare ophthalmic diseases and found that it was most accurate in scenarios where detailed clinical information was available, suggesting its potential as a consultation tool [51]. On the other hand, Haddad F et al. noted that while ChatGPT’s diagnostic accuracy did not surpass that of human experts in answering ophthalmology-related questions, it still holds strong potential as an educational resource [10].
However, limitations remain. Raghu et al. evaluated ChatGPT’s potential as a tool for assessing diabetic retinopathy (DR) risk by analyzing clinical and biochemical data from 111 diabetic patients. While ChatGPT demonstrated good reliability in its responses, its sensitivity and specificity in predicting DR were moderate, indicating that further optimization is needed before it can be reliably used in clinical settings [52]. Cai LZ et al. pointed out that while GPT-4 performed comparably to the average human respondent on ophthalmology board-style questions, it still requires improvements in clinical decision-making and diagnostic recommendations [7]. Similarly, Jiao C et al. demonstrated that GPT-4 outperformed GPT-3.5 in addressing multiple-choice ophthalmic case challenges, with improved accuracy even in image-related questions, but acknowledged the risk of generating fictional information and weaknesses in more complex inference-based cases [8].
Milad D et al. assessed GPT-4’s ability to diagnose and recommend treatment for complex ophthalmology cases, noting that enhanced prompting strategies improved its performance, though it still did not surpass human expertise [11]. So-called “prompt engineering” enhances GPT’s diagnostic capabilities by optimizing the way information is requested and processed. By carefully crafting prompts to include specific clinical details or guide the AI’s reasoning process, clinicians can improve the accuracy and relevance of GPT’s responses, making it a more effective tool in diagnostic applications [11]. In a comparative study of AI chatbots, Sensoy E et al. found no statistically significant difference in accuracy between Bard, Bing, and ChatGPT in diagnosing ophthalmic pathologies and intraocular tumors [53].
The application of AI in ophthalmic imaging has also been explored. Chen X et al. developed a model called “ICGA-GPT”, which showed high specificity and accuracy in generating reports from ICGA images. The model not only generated reports based on the images but also suggested diagnoses and provided interactive question-answering capabilities, offering explanations and clarifications regarding the generated findings. The authors highlight the need for more diverse datasets and better handling of rare conditions, despite the promising initial results [43]. Furthermore, Mihalache A et al. evaluated GPT-4’s ability to interpret ophthalmic images and noted a 70% accuracy rate overall, with the highest accuracy in the retina category. The study emphasizes the challenges AI faces with complex visual data, suggesting that multimodal AI tools could significantly enhance diagnostic capabilities if integrated carefully [54].
Comparative studies between different AI models further illustrate the advancements and limitations of current technologies. Masalkhi M et al. compared Meta’s Llama 2 (Meta Inc, Menlo Park, CA, USA) and ChatGPT, finding that while both models provided appropriate medical advice, they had unique strengths and weaknesses in image analysis and medical insights [55]. Another study by the same group compared Google’s Gemini AI with ChatGPT, highlighting Gemini AI’s notable improvements in image analysis but also its limitations compared to GPT-4 [56].
In specialized applications, Singer MB et al. discussed the development of Aeyeconsult, an AI-powered ophthalmology chatbot that outperformed ChatGPT-4 in accuracy by leveraging verified textbook knowledge. The authors compared responses to ophthalmology questions by Aeyeconsult and ChatGPT-4, finding that Aeyeconsult’s integration of verified sources significantly enhanced its accuracy and reliability [45].
Rojas-Carabali W et al. evaluated ChatGPT’s diagnostic accuracy for uveitis, revealing that although the model’s diagnostic suggestions were generally accurate, they did not surpass human accuracy, underscoring the need for human oversight. The authors recommend using specialized uveitis databases to train the model and enhance its performance in domain-specific diagnoses. Interestingly, when prompted for the most likely diagnosis along with the two most probable differential diagnoses, both GPT-3.5 and GPT-4 models showed an 8% increase in diagnostic accuracy [57, 58].
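The reformulation behind this gain can be illustrated with a short, hypothetical prompt sketch; the vignette and wording below are invented for illustration and are not taken from the study.

```python
# Hypothetical illustration of the prompt reformulation described above:
# asking for the most likely diagnosis plus the two most probable differentials,
# rather than a single definitive answer. The vignette and wording are invented.
vignette = (
    "32-year-old with bilateral granulomatous panuveitis, hilar lymphadenopathy, "
    "and an elevated serum ACE level."
)

single_answer_prompt = f"Case: {vignette}\nWhat is the diagnosis?"

differential_prompt = (
    f"Case: {vignette}\n"
    "State the most likely diagnosis, followed by the two most probable "
    "differential diagnoses, each with a one-sentence justification."
)

print(differential_prompt)
```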
Thirunavukarasu AJ et al. examined the performance of ChatGPT in a general practice setting, demonstrating that GPT-4 showed significant improvements over GPT-3.5, suggesting its potential for integration into clinical practice [59]. In a comparative study, Waisberg E et al. evaluated Google’s AI chatbot Bard against ChatGPT, highlighting Bard’s strength in delivering accurate and specific medical advice, largely due to its internet connectivity. However, they emphasized the necessity for ophthalmology-specific training data to further enhance the model’s effectiveness [60].
Meanwhile, Tao BK et al. compared the performance of Bing Chat, which also has internet access, with GPT-3.5, which lacks this feature, on a multiple-choice ophthalmology exam. Their findings indicated that Bing Chat’s ability to search the web enhanced accuracy rates, while also providing better readability and more reliable citation features [42].
Disease management
The application of LLMs like ChatGPT in disease management within ophthalmology has demonstrated substantial potential, yet it also highlights several areas requiring improvement. The following discussion integrates findings from various studies, emphasizing how ChatGPT can support physicians in decision-making, triage, and evidence-based practice.
Studies by Lin JC et al. and Alexander AC et al. demonstrated GPT-4’s superior performance on practice ophthalmology examinations compared to GPT-3.5 and human users, suggesting its utility in aiding physicians’ understanding of complex medical topics [12, 48]. However, Antaki F et al. highlighted that while GPT-4 showed significant improvement over GPT-3.5 in answering ophthalmology board-style questions, it did not outperform human experts [6]. Cai LZ et al. echoed these findings, noting that GPT-4 performed comparably to the average human respondent but still required further refinement in clinical decision-making and diagnostic recommendations [7]. Generally, ChatGPT performed better on written questions and worse on image-based questions [48].
The application of ChatGPT in clinical decision support has been explored through various studies. Haddad F et al. evaluated ChatGPT’s responses to ophthalmology-related questions and found that while its diagnostic accuracy was commendable, it was lower than that of human experts [10]. Inayat H et al. similarly noted that ChatGPT’s accuracy in determining the degree of urgency of given clinical pictures was lower than that of ophthalmology residents and consultants, emphasizing the need for human involvement in clinical decisions [61]. Carlà MM et al. found that, in comparison with ChatGPT-3.5 and Google Gemini, ChatGPT-4 showed the highest accuracy and agreement with expert surgeons, surpassing both models in offering detailed and coherent surgical plans, particularly in complex cases like combined cataract and retinal surgery [62].
ChatGPT’s role in supporting evidence-based practice has been explored through its ability to provide accurate and reliable medical information. Cappellani F et al. evaluated the reliability and accuracy of ChatGPT’s responses on ophthalmic diseases compared to AAO guidelines, finding that while most answers were correct and useful, a significant portion was incomplete or incorrect. This study concluded that ChatGPT should be used with medical supervision to avoid misinformation [14]. As mentioned in the previous section, Singer MB et al. developed Aeyeconsult, an AI-powered ophthalmology chatbot leveraging verified textbook knowledge and GPT-4, which outperformed ChatGPT-4 in accuracy. This study demonstrated the feasibility and enhanced reliability of using AI chatbots that cite verified sources, suggesting significant potential for such tools in clinical practice [45].
While the studies reviewed show that ChatGPT and other LLMs hold significant promise in supporting disease management, several limitations remain. For instance, ChatGPT often provides general rather than specific, contextually relevant information, as noted by Choudhary A et al. [64]. Their study compared the treatment recommendations provided by AI platforms, such as ChatGPT 3.5, ChatGPT 4.0, and Bing AI, with those given by retina specialists. The findings revealed moderate to substantial agreement between AI and clinician recommendations, particularly for the management of diabetic macular edema (DME), though there were some discrepancies in cases involving co-existing ocular and systemic co-morbidities. The study concludes that AI has potential as a complementary tool in DME management but emphasizes the need for further refinement to align AI recommendations more closely with clinical guidelines [64]. Additionally, Subramanian B et al. found that while ChatGPT-4 provided highly appropriate and mostly complete answers to queries related to diabetic retinopathy, certain complex medical queries were less comprehensively addressed, indicating the need for ongoing refinement [65].
Milad D et al. assessed GPT-4’s ability to diagnose and recommend treatment for complex ophthalmology cases, noting that enhanced prompting strategies improved its performance. However, GPT-4 was still outperformed by senior ophthalmology residents, suggesting its current limitations in providing expert-level guidance [11]. Maywood et al. assessed the performance of ChatGPT in clinical vitreoretinal scenarios, concluding that while the AI tool can provide accurate responses for straightforward cases, further development is needed to improve its capabilities in more intricate clinical environments [66].
Literature consulting
The use of ChatGPT and other LLMs in literature consulting shows significant promise for advancing academic research and clinical decision-making in retina care. LLMs can greatly facilitate literature consulting in several ways. Firstly, they can streamline the literature review process by quickly summarizing large volumes of academic papers, identifying key findings, and highlighting relevant studies [67]. This can save researchers considerable time, allowing them to focus on deeper analysis and interpretation. Additionally, LLMs can identify gaps in existing literature and suggest areas for future research by analyzing patterns and trends across numerous publications [68].
GPT models can also facilitate real-time consultation on clinical cases by providing summaries of relevant studies and guidelines [67]. This is particularly beneficial in multidisciplinary team meetings or when seeking second opinions, where quick access to the latest evidence-based information is crucial for informed decision-making. For instance, a clinician could use GPT to retrieve and summarize the latest research on challenging retinal disease cases, treatment options, and outcomes [60].
The continuous learning and improvement capabilities of AI models further enhance their utility in literature consulting. As these models are exposed to more data and refined through feedback, their accuracy and relevance in responding to queries improve, leading to highly reliable and personalized literature consulting tools. As previously mentioned, a study by Singer et al. explored the capabilities of a specialized AI chatbot, Aeyeconsult, which integrates GPT-4 with verified ophthalmology textbooks to answer ophthalmology-related questions. Aeyeconsult demonstrated superior performance compared to standard ChatGPT-4, achieving an 83.4% accuracy rate in answering ophthalmology questions, compared to 69.2% for ChatGPT-4 [45].
However, the limitations of GPT and LLMs, such as generating plausible but incorrect information (“hallucinations”) and occasionally providing outdated or irrelevant citations, must be acknowledged [37,38,39]. Implementing human oversight and validation is crucial to ensure the accuracy and reliability of AI-generated responses, as the responsibility for each case will remain with the assigned ophthalmologist regardless of any content provided by ChatGPT. Recently, a team of researchers from Harvard and MIT developed OpenEvidence (Xyla Inc, Wilmington, DE, USA), an LLM-based chatbot app similar to ChatGPT, but for medical literature consulting. Unlike ChatGPT, OpenEvidence retrieves information exclusively from trusted, peer-reviewed sources. This approach may reduce the risk of hallucinations and inaccurate information often associated with other LLMs, making it more suitable for clinical applications [69].
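The design shared by tools such as Aeyeconsult and OpenEvidence, answering only from retrieved, verified sources and citing them, can be sketched roughly as follows. The corpus, the retrieval heuristic, and the prompt wording are hypothetical placeholders, the sketch assumes the OpenAI Python client, and a production system would rely on proper indexing and embedding-based retrieval rather than keyword overlap.

```python
# Rough sketch of a retrieval-grounded question-answering loop in the spirit of
# Aeyeconsult or OpenEvidence: retrieve passages from a verified corpus, then
# instruct the model to answer only from those passages and cite them.
# The corpus, retrieval heuristic, and prompt wording are hypothetical.
from openai import OpenAI

client = OpenAI()

# Toy stand-in for an indexed, verified corpus (e.g., peer-reviewed texts).
VERIFIED_PASSAGES = [
    ("Textbook A, ch. 12",
     "Rhegmatogenous retinal detachment typically presents with photopsia, floaters, "
     "and a curtain-like visual field defect; urgent surgical repair is indicated."),
    ("Textbook B, ch. 3",
     "Intravitreal anti-VEGF therapy is first-line treatment for neovascular "
     "age-related macular degeneration."),
]

def retrieve(question: str, k: int = 2):
    """Naive keyword-overlap retrieval; real systems would use embeddings and ranking."""
    words = set(question.lower().split())
    ranked = sorted(
        VERIFIED_PASSAGES,
        key=lambda passage: len(words & set(passage[1].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def answer_with_citations(question: str) -> str:
    context = "\n".join(f"[{source}] {text}" for source, text in retrieve(question))
    prompt = (
        "Answer the question using ONLY the passages below, citing the bracketed source "
        "after each claim. If the passages are insufficient, say so explicitly.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer_with_citations("What is the first-line treatment for neovascular AMD?"))
```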
Medical education
The “Medical Education” category includes studies that evaluate ChatGPT’s performance in standardized medical exams and educational tools. This section explores ChatGPT’s outcomes in ophthalmology-specific examinations, emphasizing its strengths, areas for improvement, and the ways these models have been integrated into the educational framework for medical practitioners. A total of 23 articles examined the performance of different ChatGPT versions, including GPT-3.5 and GPT-4, on standardized examinations. While ChatGPT performed consistently across studies and frequently attained passing grades in most exams — even surpassing human examinees in some studies — it still has room for improvement, particularly in subspecialty topics such as retinal care. This trend underscores the challenges faced by AI in mastering niche medical knowledge.
Comparative studies highlighted significant improvements in the capabilities of newer models, such as ChatGPT 4.0, over their predecessors, like ChatGPT 3.5 [6, 13]. ChatGPT was tested in multiple board-style and official ophthalmologic knowledge evaluation exams, and the most commonly used tests for performance evaluation were the OKAP and the American Academy of Ophthalmology’s (AAO) Basic and Clinical Science Course Self-Assessment Program (BCSC-SAP). Mihalache A et al. tested ChatGPT’s accuracy in January and February of 2023 on practice questions for the OKAP. The authors noted that even though results improved over time, with 46% of questions answered correctly in January vs. 58% in February, the performance remained insufficient for board certification preparation [13]. The authors also found that ChatGPT’s performance was much better in broader topics, such as those characterized as “General Medicine”, where 79% of answers were correct, against no correct answers in the “Retina and Vitreous” section [13]. These results are in line with the study by Antaki et al., who demonstrated significant improvements in ChatGPT-4.0 over 3.5, especially in clinical reasoning and multi-step practice questions for the OKAP, with performance varying by question complexity and better results in simpler questions [6]. Teebagy S et al. noted ChatGPT-4’s superior performance on the OKAP examination compared to ChatGPT-3.5, highlighting its potential in ophthalmologic education and clinical decision support systems [49]. A similar trend was observed by Antaki et al. in OKAP, BCSC-SAP, and OphthoQuestions questions, with ChatGPT 3.5 outperforming the legacy version [5, 6]. Haddad et al. compared ChatGPT’s performance on ophthalmology questions from the OKAP and the United States Medical Licensing Examination (USMLE), finding that GPT-4.0 significantly outperformed GPT-3.5 and that both performed more poorly as they progressed through the levels of these exams [10].
Cai et al. compared ChatGPT-3.5, ChatGPT-4.0, and Bing Chat on ophthalmology board-style questions from the BCSC-SAP, finding ChatGPT-4.0 superior in single-step reasoning but struggling with image interpretation and calculations [7]. Sensoy E et al. also evaluated ChatGPT-3.5, Bing, and Bard on BCSC-SAP questions, focusing on ophthalmic pathology and intraocular tumors. No statistically significant performance differences were found; however, Google Bard answered 69.4% of the questions correctly, against only 58.6% for ChatGPT-3.5 and 63.9% for Bing [53]. Taloni et al. likewise compared ChatGPT-3.5, ChatGPT-4.0, and average human performance (as provided by the AAO) on BCSC-SAP questions, finding that ChatGPT-4.0 had the best performance, answering 82.4% of questions correctly, followed by humans (75%) and ChatGPT-3.5 (65.9%) [47]. Tao BK et al. noted that Bing Chat answered 73.6% of 913 BCSC-SAP questions correctly in August 2023, compared with 59.69% for ChatGPT-3.5 [42]. Lin et al. also noted significant improvements from ChatGPT-3.5 to ChatGPT-4 on 260 questions from the BCSC-SAP, where ChatGPT-4 attained a higher score than human users, one that would be classified as passing [48].
Fowler et al. found that ChatGPT-4.0 significantly outperformed Google Bard on Fellowship of the Royal College of Ophthalmologists (FRCOphth) Part 1 exam questions, obtaining better results than the historical human pass marks [4]. Raimondi et al. compared multiple LLMs on FRCOphth exam questions, with ChatGPT-4.0 and Bing Chat (which is also powered by GPT-4) showing significantly higher accuracy than ChatGPT-3.5 and Google Bard [70]. Thirunavukarasu AJ et al. compared GPT-3.5, GPT-4, PaLM 2, LLaMA, expert ophthalmologists, and doctors in training on a mock examination based on questions used in preparation for the FRCOphth exams. In their study, ChatGPT-4.0’s performance was superior to that of all other LLMs and unspecialized junior doctors, and comparable to that of expert ophthalmologists and ophthalmology residents [59].
Alexander et al. evaluated ChatGPT’s performance on some of the Indian Foreign Medical Graduate Examination multiple-choice and short-answer questions in clinical ophthalmology, with both versions achieving high scores (80% or more) in all areas tested, although they presented some incorrect recommendations for conjunctivitis management and cataract diagnosis, two fairly common ophthalmologic conditions. The prompts were given in English [12].
Gobira et al. evaluated ChatGPT-3.5 on the Brazilian Council of Ophthalmology Board Examination, where only 41.6% of the questions were answered correctly, noting particularly low accuracy in mathematical and clinical questions. ChatGPT-3.5 was not able to attain a passing grade on this test [9]. Sakai D et al. assessed ChatGPT-3.5 and 4.0 on the Japanese Ophthalmology Society board examinations, using prompts written in Japanese. ChatGPT-3.5 answered around 22% of questions correctly, and ChatGPT-4.0 was correct in 46% of questions. Both models performed significantly worse than average human examinees and fell short of the marks the models achieved on examinations in other languages [63]. These findings might be due to differences in the complexity and evaluation of questions from the Japanese Ophthalmology Society and Brazilian Council of Ophthalmology board examinations, but most likely reflect disparities in the accuracy of the models across different languages.
Moshirfar M et al. studied ChatGPT-4.0, ChatGPT-3.5, and average human responses on StatPearls questions, reporting that ChatGPT-4.0 presented significantly higher accuracy than both ChatGPT-3.5 and the average human respondent. ChatGPT’s performance was also tested specifically on questions applied to clinical cases, which require higher-level thinking and the interpretation of multiple inputs of clinical information [71]. Huang et al. found that ChatGPT-4.0 outperformed glaucoma and retina specialists in diagnostic accuracy and completeness in responses to clinical cases and frequently asked questions from the AAO [72]. Jiao C et al. highlighted the advancements of ChatGPT-4.0 over 3.5 in ophthalmic knowledge by comparing their performance in answering questions from clinical cases available in the AAO’s “Diagnose This” question bank. The most significant improvements were noted in neuro-ophthalmology and image-related questions [8].
Inayat H et al. compared ChatGPT’s performance with ophthalmology residents and staff consultants in determining diagnosis and urgency in custom questions based on real cases commonly presented in an ophthalmologic emergency setting, noting a high concordance with human practitioners for diagnosis, but not for definition of urgency [61].
In October 2023, Mihalache et al. published a new study evaluating ChatGPT’s capability of interpreting multimodal and imaging input using OCTCases, a medical education platform based out of the Department of Ophthalmology and Vision Sciences at the University of Toronto. ChatGPT answered 70% of the questions correctly, having a better performance in text-based questions (82% of answers were correct) and in retinal disease cases (77% of answers were correct) when compared to other specialties [54].
Milad D et al. discussed GPT-4’s capability to diagnose and manage complex ophthalmologic cases in both open-ended and multiple-choice questions based on Journal of the American Medical Association Ophthalmology’s Clinical Challenges, in which a limited accuracy was observed. The model was capable of correctly diagnosing 48% of the cases, and suggested adequate first steps of management in 63% of the cases [11].
In summary, while GPT-4 and other LLMs show significant potential in medical education, their performance varies with the complexity and specificity of the information requested. Variance was also observed between simple fact recalling and complex data interpretation and decision making, and also across different languages. A clear trend of continuous improvement has been seen, both through new capabilities such as real-time internet access or multimedia input availability, and through comparisons of ChatGPT-4.0 and 3.5 against multiple benchmarks. Nonetheless, further improvements are necessary to enhance accuracy and reliability, particularly in specialized fields such as retinal care. While ChatGPT can be applied as a useful and flexible learning assistance tool, all information obtained should be fact-checked and compared with updated medical literature.
Patient counseling
Patient counseling with ChatGPT was portrayed in the studies in two main forms: patients independently seeking information from LLMs and healthcare providers using LLMs to create educational materials for their patients. Each method comes with its own set of benefits and challenges.
A study evaluating ChatGPT’s responses to common vitreoretinal disease questions found that only 15.4% of the chatbot’s answers were completely accurate. Moreover, the responses were inconsistent, with 50% showing material changes when the same questions were asked two weeks later, and in some cases, the accuracy worsened. For example, ChatGPT incorrectly suggested injection therapy and laser treatment for an epiretinal membrane, which could mislead patients and potentially cause harm [17].
Similarly, when ChatGPT was used to answer questions about “floaters,” a common patient concern, the chatbot provided general information but failed to emphasize the urgency of consulting an ophthalmologist, which is critical given that floaters can be a sign of retinal detachment—a condition that requires immediate medical attention. The study also highlighted that ChatGPT’s language complexity is above the average reading level of many patients, potentially limiting its accessibility [23].
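Readability concerns of this kind are commonly quantified with standard indices such as the Flesch-Kincaid grade level; the sketch below shows one such check, assuming the third-party textstat package and using a hypothetical chatbot reply as input.

```python
# Minimal sketch of the kind of readability check used to argue that chatbot
# responses exceed typical patient reading levels. Assumes the third-party
# `textstat` package (pip install textstat); the sample text is hypothetical.
import textstat

chatbot_reply = (
    "Posterior vitreous detachment may produce entoptic phenomena such as "
    "floaters; however, photopsia or a sudden shower of floaters warrants "
    "urgent ophthalmologic evaluation to exclude a retinal tear or detachment."
)

grade = textstat.flesch_kincaid_grade(chatbot_reply)   # approximate U.S. school grade level
ease = textstat.flesch_reading_ease(chatbot_reply)     # higher values = easier to read

print(f"Flesch-Kincaid grade level: {grade:.1f}")
print(f"Flesch reading ease: {ease:.1f}")
if grade > 8:  # patient-education guidance commonly targets roughly a grade 6-8 reading level
    print("Above the commonly recommended reading level for patient materials.")
```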
In addition to specific condition-related queries, other studies focused on the quality, empathy, and safety of ChatGPT’s responses to common retina patient questions. Expert-edited LLM responses were found to potentially improve the quality and empathetic tone of patient communications, but the need for human oversight to ensure accuracy and safety was emphasized. Similarly, ChatGPT-4’s recommendations for ophthalmology-related questions were mostly appropriate, though there was significant variation across subspecialties, necessitating further optimization before clinical use [25, 73].
Momenaei et al. compared ChatGPT-3.5 and ChatGPT-4 in generating responses about retinal surgeries, finding ChatGPT-4 more accurate, with appropriateness rates of 84.6% for retinal detachments, 92% for macular holes, and 91.7% for epiretinal membranes. However, its responses were considered difficult to understand, requiring a college-level education, highlighting the need for improvements to make AI-generated medical advice both accurate and accessible to patients [74, 75]. In another study, Wu et al. evaluated the ability of ChatGPT to educate diabetic patients about diabetic retinopathy using a keyword-based scoring system. ChatGPT provided basic but incomplete answers, scoring poorly on key terms like macular edema [76].
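A keyword-based scoring scheme of the kind described by Wu et al. can be sketched very simply: score a response by the fraction of expert-defined key terms it mentions. The key-term list and the sample response below are hypothetical.

```python
# Simple sketch of a keyword-based scoring scheme like the one Wu et al.
# describe: score a response by the expert-defined key terms it mentions.
# The key-term list and the sample response are hypothetical illustrations.

KEY_TERMS = ["macular edema", "laser", "anti-VEGF", "blood sugar", "annual eye exam"]

def keyword_score(response: str, key_terms=KEY_TERMS) -> float:
    """Return the fraction of key terms mentioned in the response."""
    text = response.lower()
    hits = [term for term in key_terms if term.lower() in text]
    return len(hits) / len(key_terms)

sample_response = (
    "Diabetic retinopathy is managed by controlling blood sugar and attending "
    "an annual eye exam; advanced disease may need laser treatment."
)
print(f"Key-term coverage: {keyword_score(sample_response):.0%}")  # 3 of 5 terms -> 60%
```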
Moreover, one study evaluated ChatGPT’s performance in providing information about retinal diseases and uveitis. While generally accurate, the study emphasized the need for rigorous evaluation to ensure high standards of accuracy and reliability [77]. A comparison of ChatGPT 3.5, Bing AI, and Google Bard in addressing common questions from patients with age-related macular degeneration (AMD) showed that ChatGPT 3.5 consistently outperformed the other models, particularly excelling in technical queries [19]. ChatGPT’s ability to diagnose and provide prehospital management recommendations for urgent eye conditions was also assessed, showing high triage accuracy but highlighting the potential for harmful advice, thus underscoring the need for continuous improvement [78].
Cheong et al. compared the performance of generative (ChatGPT-4, ChatGPT-3.5, and Google Bard) and retrieval-based (OcularBERT) chatbots in answering patient questions regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR). ChatGPT-4 and ChatGPT-3.5 outperformed the other models in both accuracy and quality of responses, demonstrating their potential to answer domain-specific medical questions. The study underscored that these generative chatbots are capable of accurately addressing domain-specific questions outside their initial training [79].
LLMs can assist physicians in generating educational materials, enhancing the quality and efficiency of patient education. One study investigated ChatGPT’s ability to simplify healthcare information from the American Academy of Ophthalmology (AAO), finding significant improvements in the readability of articles in Spanish but limited impact on articles in English [80]. Another study evaluated ChatGPT, Bing AI, and Docs-GPT in providing responses to common patient queries about vitreoretinal surgery. The study found that while all three LLMs generally provided accurate and sufficient information, Bing AI performed the best overall [15]. Cappellani F et al. highlighted ChatGPT’s ability to provide valuable patient education, although it still provided incorrect or incomplete information that might be harmful for patients [14].
The effectiveness of ChatGPT and Bard in generating patient-targeted health information about uveitis was also evaluated, with ChatGPT producing more readable content, highlighting its potential to improve health literacy [81]. Furthermore, ChatGPT’s capability to swiftly generate discharge summaries and operative notes across various subspecialties has been demonstrated. While human oversight is required to customize and enhance these outputs, this approach substantially reduces the time spent on documentation and improves overall patient care [31].
A study comparing ChatGPT, Bing Chat, and WebMD Symptom Checker found that ChatGPT-4 provided highly accurate and detailed information, comparable to that of ophthalmology trainees, suggesting its potential as a valuable tool for patient self-triage [28]. Another study compared Bard and ChatGPT’s responses to ophthalmology-related prompts, finding both to have significant potential but also limitations that need addressing [60].
Nanji et al. evaluated the quality of postoperative ophthalmology instructions provided by ChatGPT, Google Search, and institutional handouts from Canadian and U.K. sources. The study found that while ChatGPT’s instructions contained procedure-specific information comparable to the other sources, they were generally less understandable, particularly when compared to the U.K. institution’s materials. This study, however, did not specifically request an accessible response [18].
Bernstein et al. explored the capability of ChatGPT to provide ophthalmology advice in comparison to human ophthalmologists. The study revealed that the chatbot’s responses were comparable to those written by ophthalmologists regarding the inclusion of incorrect information, potential harm, and adherence to medical consensus. However, the chatbot was more likely to be identified as AI-generated by expert reviewers, though with only moderate accuracy. The study highlights the potential for LLMs to assist in patient education but also underscores the risks, such as “hallucinations” or generating plausible yet inaccurate information [16].
Triage and Pre-hospital management
Recent studies have explored the potential of ChatGPT in triaging ophthalmic symptoms and offering pre-hospital management advice. Knebel et al. demonstrated that ChatGPT achieved a high triage accuracy of 93.6% when evaluating acute ophthalmological symptoms using fictional case vignettes. Despite its general effectiveness, the study highlighted that 32% of the responses carried the potential to cause harm, underscoring the need for continuous improvement and oversight in using AI for medical triage [78].
Gopalakrishnan et al. also reported that ChatGPT provided highly accurate triage recommendations but often delivered general and vague treatment advice, pointing to the necessity for greater specificity in its responses [82]. Similarly, Lyons et al. compared ChatGPT’s performance with that of ophthalmology trainees, finding that ChatGPT accurately listed the correct diagnosis among the top three in 93% of cases and provided appropriate triage recommendations in 98% of cases. The study concluded that ChatGPT outperformed Bing Chat and WebMD Symptom Checker, suggesting its value in patient self-triage and initial assessments [28].
The advancements of GPT-4 over its predecessor GPT-3.5 were further illustrated by Waisberg et al., who noted that GPT-4 addressed previous shortcomings in specific scenarios, such as macular degeneration, where it now recommends immediate medical attention for severe or sudden visual changes. This improvement underscores GPT-4’s enhanced problem-solving abilities and its broader knowledge base [26].
Despite these advancements, a study by Inayat et al. compared the performance of ChatGPT to ophthalmology residents and staff consultants using a training tool based on real on-call pages. While residents and staff performed better in overall accuracy, ChatGPT exhibited strong diagnostic capabilities but showed inconsistency in triage, often favoring more urgent assessments compared to human experts. The findings suggest that, while ChatGPT has potential as an educational resource, human decision-making remains crucial in medical triage due to the nuanced nature of clinical management [61].
The incorporation of LLMs like GPT-4 in triaging ophthalmic symptoms marks a significant step forward in healthcare, offering the potential to improve the accuracy and efficiency of initial patient assessments. These models can match or even exceed the diagnostic accuracy and triage appropriateness of trained ophthalmology professionals in certain scenarios. However, the risk of incorrect or harmful advice emphasizes the need for ongoing refinement and rigorous validation of these AI systems before they can be reliably used in real-world clinical settings. Future developments should focus on enhancing model reliability and integrating advanced features like image analysis to further improve the utility of LLMs in medical triage.
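Because the triage studies described above typically submit case vignettes to the chatbot and then grade the urgency it assigns, a short programmatic illustration may help clarify how such queries are issued. The following is a minimal sketch only, assuming the OpenAI Python SDK (openai>=1.0) with an OPENAI_API_KEY in the environment; the vignette text, model name, and urgency categories are illustrative assumptions and do not reproduce any cited study’s protocol.

```python
# Minimal sketch: submitting a fictional case vignette for triage classification.
# Assumes the OpenAI Python SDK (openai>=1.0) and OPENAI_API_KEY in the environment.
# The vignette, model name, and urgency categories are illustrative only.
from openai import OpenAI

client = OpenAI()

vignette = (
    "A 68-year-old patient reports a sudden shower of floaters and flashes "
    "in the right eye over the past 24 hours, with a shadow in the peripheral "
    "vision. No eye pain."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    temperature=0,   # deterministic output makes grading more reproducible
    messages=[
        {
            "role": "system",
            "content": (
                "You are assisting with ophthalmic triage research. Classify the "
                "urgency of the case as 'emergency (same day)', 'urgent (within "
                "one week)', or 'routine', and state whether ophthalmology "
                "referral is needed. Do not give treatment advice."
            ),
        },
        {"role": "user", "content": vignette},
    ],
)

print(response.choices[0].message.content)
```

In study designs like those cited above, the free-text output would then be compared against the urgency level assigned by ophthalmologists for the same vignette.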
Conclusion
ChatGPT and other LLMs offer considerable potential in transforming various aspects of retinal healthcare, from screening and diagnosis to disease management and patient education. Their ability to assist in decision-making, triage, and generating educational materials presents a promising opportunity to enhance clinical workflows and patient care. Notably, there has been a clear improvement from version 3.5 to version 4, reflecting ongoing advancements in AI technology. However, these tools are not without limitations. Their current performance, particularly in scenarios involving figure-based inputs, and the need for constant supervision highlight the importance of cautious integration into clinical practice.
To maximize the effectiveness of AI in ophthalmology, it is crucial to ensure that the information generated by these models is accurate, relevant, and comprehensible. Whether outputs are understandable is a particularly nuanced issue, often depending on how prompts are crafted and specified: studies reporting less understandable outputs may not have explicitly requested accessible responses, underscoring the importance of prompt engineering. The consistent need for human supervision also cannot be overstated; while AI can provide valuable support, final decision-making must remain in the hands of healthcare professionals. One effective strategy to improve accuracy is to prompt the AI to generate a list of diagnoses or options rather than a single definitive answer. This approach enhances the reliability of the information provided, allowing clinicians to make the final informed decisions.
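As a purely illustrative example of these two prompt-engineering strategies, the sketch below asks the model for a ranked list of candidate diagnoses rather than a single answer, and for a plain-language summary suitable for patients. It assumes the OpenAI Python SDK; the prompt wording, reading-level target, and model name are assumptions for illustration, not a validated clinical prompt.

```python
# Illustrative sketch of the prompt-engineering strategies discussed above:
# (1) request a ranked differential instead of a single definitive answer,
# (2) explicitly ask for an accessible, plain-language explanation.
# Assumes the OpenAI Python SDK (openai>=1.0); wording is an assumption only.
from openai import OpenAI

client = OpenAI()

clinical_summary = (
    "Gradual central vision loss in both eyes over two years in a "
    "75-year-old patient; drusen noted on a prior exam."
)

prompt = (
    "List the three most likely diagnoses in order of probability, with a "
    "one-sentence rationale for each, and finish with a short explanation "
    "written at roughly a 6th-grade reading level for the patient. "
    "Do not present any single diagnosis as definitive.\n\n"
    f"Case: {clinical_summary}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```

Presenting several ranked options keeps the clinician in the role of final arbiter, while the explicit readability instruction addresses the understandability gap noted in the patient-education studies reviewed above.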
Looking forward, advancements in AI should aim to overcome these challenges, focusing on refining models to better meet the needs of both patients and healthcare providers. By addressing these issues, we can fully harness the potential of AI in retinal healthcare, ultimately improving diagnostic accuracy, disease management, and patient education.
Data availability
The datasets generated and/or analysed during the current study, as well as the review registration and prospectus, are available in the Open Science Framework (OSF) repository, https://osf.io/nrjt7. The prospectus is also included in this article’s supplementary material.
Abbreviations
- AAO: American Academy of Ophthalmology
- ADA: Advanced Data Analysis
- AMD: Age-related Macular Degeneration
- AI: Artificial Intelligence
- BCSC-SAP: Basic and Clinical Sciences Course Self-Assessment Program
- CA: California (USA)
- DALK: Deep Anterior Lamellar Keratoplasty
- DME: Diabetic Macular Edema
- DR: Diabetic Retinopathy
- FMGE: Foreign Medical Graduate Examination
- FRCOphth: Fellowship of the Royal College of Ophthalmologists
- GPT: Generative Pre-trained Transformer
- ICD: International Classification of Diseases
- ICGA: Indocyanine Green Angiography
- IOL: Intraocular Lens
- LLM: Large Language Model
- MD: Macular Degeneration
- ML: Machine Learning
- OKAP: Ophthalmology Knowledge Assessment Program
- OCT: Optical Coherence Tomography
- OSF: Open Science Framework
- PK: Penetrating Keratoplasty
- PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
- USA: United States of America
- WI: Wisconsin (USA)
References
Kasneci E, Seßler K, Küchemann S, Bannert M, Dementieva D, Fischer F et al. ChatGPT for Good? On Opportunities and Challenges of Large Language Models for Education [Internet]. OSF; 2023 [cited 2024 Sep 7]. https://osf.io/5er8f
Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595.
Healthcare IT News [Internet]. 2023 [cited 2024 Sep 7]. How Epic is using AI to change the way EHRs work. https://www.healthcareitnews.com/news/how-epic-using-ai-change-way-ehrs-work
Fowler T, Pullen S, Birkett L. Performance of ChatGPT and Bard on the official part 1 FRCOphth practice questions. Br J Ophthalmol. 2024;108(10):1379–83.
Antaki F, Milad D, Chia MA, Giguère CÉ, Touma S, El-Khoury J et al. Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards human-level medical question answering. Br J Ophthalmol [Internet]. 2023 Nov 3 [cited 2024 Sep 6]; https://bjo.bmj.com/content/early/2023/11/02/bjo-2023-324438
Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the performance of ChatGPT in Ophthalmology: an analysis of its successes and shortcomings. Ophthalmol Sci. 2023;3(4):100324.
Cai LZ, Shaheen A, Jin A, Fukui R, Yi JS, Yannuzzi N, et al. Performance of Generative large Language models on Ophthalmology Board-Style questions. Am J Ophthalmol. 2023;254:141–9.
Jiao C, Edupuganti NR, Patel PA, Bui T, Sheth V. Evaluating the Artificial Intelligence Performance Growth in Ophthalmic Knowledge. Cureus. 2023;15(9):e45700.
Gobira MC, Moreira RC, Nakayama LF, Regatieri CVS, Andrade E, Rubens B. Performance of chatGPT-3.5 answering questions from the Brazilian Council of Ophthalmology Board Examination. Pan-Am J Ophthalmol. 2023;5(1):17.
Haddad F, Saade JS. Performance of ChatGPT on Ophthalmology-related questions across various examination levels: Observational Study. JMIR Med Educ. 2024;10(1):e50842.
Milad D, Antaki F, Milad J, Farah A, Khairy T, Mikhail D et al. Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases. Br J Ophthalmol. 2024;108(10):1398–405.
Alexander AC, Somineni Raghupathy S, Surapaneni KM. An assessment of the capability of ChatGPT in solving clinical cases of ophthalmology using multiple choice and short answer questions. Adv Ophthalmol Pract Res. 2024;4(2):95–7.
Mihalache A, Popovic MM, Muni RH. Performance of an Artificial Intelligence Chatbot in Ophthalmic Knowledge Assessment. JAMA Ophthalmol. 2023;141(6):589–97.
Cappellani F, Card KR, Shields CL, Pulido JS, Haller JA. Reliability and accuracy of artificial intelligence ChatGPT in providing information on ophthalmic diseases and management to patients. Eye Lond Engl. 2024;38(7):1368–73.
Anguita R, Makuloluwa A, Hind J, Wickham L. Large language models in vitreoretinal surgery. Eye Lond Engl. 2024;38(4):809–10.
Bernstein IA, Zhang YV, Govil D, Majid I, Chang RT, Sun Y, et al. Comparison of ophthalmologist and large language model chatbot responses to online patient eye care questions. JAMA Netw Open. 2023;6(8):e2330320.
Caranfa JT, Bommakanti NK, Young BK, Zhao PY. Accuracy of Vitreoretinal Disease Information from an Artificial Intelligence Chatbot. JAMA Ophthalmol. 2023;141(9):906–7.
Nanji K, Yu CW, Wong TY, Sivaprasad S, Steel DH, Wykoff CC, et al. Evaluation of postoperative ophthalmology patient instructions from ChatGPT and Google search. Can J Ophthalmol J Can Ophtalmol. 2024;59(1):e69–71.
Ferro Desideri L, Roth J, Zinkernagel M, Anguita R. Application and accuracy of artificial intelligence-derived large language models in patients with age related macular degeneration. Int J Retina Vitr. 2023;9(1):71.
Patil NS, Huang R, Mihalache A, Kisilevsky E, Kwok J, Popovic MM, et al. The ability of artificial intelligence chatbots ChatGPT and Google Bard to accurately convey preoperative information for patients undergoing ophthalmic surgeries. Retina Phila Pa. 2024;44(6):950–3.
Potapenko I, Boberg-Ans LC, Stormly Hansen M, Klefter ON, van Dijk EHC, Subhi Y. Artificial intelligence-based chatbot patient information on common retinal diseases using ChatGPT. Acta Ophthalmol (Copenh). 2023;101(7):829–31.
Pushpanathan K, Lim ZW, Er Yew SM, Chen DZ, Hui’En Lin HA, Lin Goh JH, et al. Popular large language model chatbots’ accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries. iScience. 2023;26(11):108163.
Wu G, Zhao W, Wong A, Lee DA. Patients with floaters: answers from virtual assistants and large language models. Digit Health. 2024;10:20552076241229933.
Solli EM, Tsui E, Mehta N. Analysis of ChatGPT responses to patient-oriented questions on common ophthalmic procedures. Clin Exp Ophthalmol. 2024;52(4):487–91.
Tailor PD, Xu TT, Fortes BH, Iezzi R, Olsen TW, Starr MR, et al. Appropriateness of Ophthalmology recommendations from an online chat-based Artificial Intelligence Model. Mayo Clin Proc Digit Health. 2024;2(1):119–28.
Waisberg E, Ong J, Zaman N, Kamran SA, Sarker P, Tavakkoli A, et al. GPT-4 for triaging ophthalmic symptoms. Eye. 2023;37(18):3874–5.
Zandi R, Fahey JD, Drakopoulos M, Bryan JM, Dong S, Bryar PJ, et al. Exploring Diagnostic Precision and Triage proficiency: a comparative study of GPT-4 and Bard in addressing Common Ophthalmic complaints. Bioeng Basel Switz. 2024;11(2):120.
Lyons RJ, Arepalli SR, Fromal O, Choi JD, Jain N. Artificial intelligence chatbot performance in triage of ophthalmic conditions. Can J Ophthalmol J Can Ophtalmol. 2024;59(4):e301–8.
Waisberg E, Ong J, Masalkhi M, Kamran SA, Zaman N, Sarker P, et al. GPT-4 and Ophthalmology Operative Notes. Ann Biomed Eng. 2023;51(11):2353–5.
Waisberg E, Ong J, Masalkhi M, Zaman N, Sarker P, Lee AG, et al. GPT-4 to document ophthalmic post-operative complications. Eye Lond Engl. 2024;38(3):414–5.
Singh S, Djalilian A, Ali MJ. ChatGPT and Ophthalmology: exploring its potential with discharge summaries and operative notes. Semin Ophthalmol. 2023;38(5):503–7.
Ong J, Kedia N, Harihar S, Vupparaboina SC, Singh SR, Venkatesh R et al. Applying large language model artificial intelligence for retina International Classification of Diseases (ICD) coding. J Med Artif Intell [Internet]. 2023 Oct 30 [cited 2024 Sep 6];6(0). https://jmai.amegroups.org/article/view/8198
Kernberg A, Gold JA, Mohan V. Using ChatGPT-4 to create Structured Medical Notes from Audio recordings of Physician-Patient encounters: comparative study. J Med Internet Res. 2024;26:e54419.
Gruda D. Three ways ChatGPT helps me in my academic writing. Nature. 2024 Apr 8. Available from: https://www.nature.com/articles/d41586-024-01042-3
Mondal H, Mondal S. ChatGPT in academic writing: maximizing its benefits and minimizing the risks. Indian J Ophthalmol. 2023;71(12):3600–6.
Bhatia P. ChatGPT for academic writing: a game changer or a disruptive tool? J Anaesthesiol Clin Pharmacol. 2023;39(1):1–2.
Alkaissi H, McFarlane SI. Artificial Hallucinations in ChatGPT: implications in Scientific writing. Cureus. 2023;15(2):e35179.
Dergaa I, Chamari K, Zmijewski P, Ben Saad H. From human writing to artificial intelligence generated text: examining the prospects and potential threats of ChatGPT in academic writing. Biol Sport. 2023;40(2):615–22.
Seth I, Bulloch G, Lee CHA. Redefining Academic Integrity, Authorship, and Innovation: the impact of ChatGPT on Surgical Research. Ann Surg Oncol. 2023;30(8):5284–5.
Tricco AC, Lillie E, Zarin W, O’Brien KK, Colquhoun H, Levac D, et al. PRISMA Extension for scoping reviews (PRISMA-ScR): Checklist and Explanation. Ann Intern Med. 2018;169(7):467–73.
Valentín-Bravo FJ, Mateos-Álvarez E, Usategui-Martín R, Andrés-Iglesias C, Pastor-Jimeno JC, Pastor-Idoate S. Artificial Intelligence and new language models in Ophthalmology: complications of the use of silicone oil in vitreoretinal surgery. Arch Soc Esp Oftalmol Engl Ed. 2023;98(5):298–303.
Tao BKL, Hua N, Milkovich J, Micieli JA. ChatGPT-3.5 and Bing Chat in ophthalmology: an updated evaluation of performance, readability, and informative sources. Eye Lond Engl. 2024;38(10):1897–902.
Chen X, Zhang W, Zhao Z, Xu P, Zheng Y, Shi D et al. ICGA-GPT: report generation and question answering for indocyanine green angiography images. Br J Ophthalmol. 2024;108(10):1450–6.
Liu X, Wu J, Shao A, Shen W, Ye P, Wang Y, et al. Uncovering Language disparity of ChatGPT on Retinal Vascular Disease Classification: cross-sectional study. J Med Internet Res. 2024;26(1):e51926.
Singer MB, Fu JJ, Chow J, Teng CC. Development and evaluation of Aeyeconsult: a Novel Ophthalmology Chatbot leveraging verified Textbook Knowledge and GPT-4. J Surg Educ. 2024;81(3):438–43.
Mihalache A, Huang RS, Patil NS, Popovic MM, Lee WW, Yan P, et al. Chatbot and Academy Preferred Practice Pattern guidelines on Retinal diseases. Ophthalmol Retina. 2024;8(7):723–5.
Taloni A, Borselli M, Scarsi V, Rossi C, Coco G, Scorcia V, et al. Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology. Sci Rep. 2023;13(1):18562.
Lin JC, Younessi DN, Kurapati SS, Tang OY, Scott IU. Comparison of GPT-3.5, GPT-4, and human user performance on a practice ophthalmology written examination. Eye Lond Engl. 2023;37(17):3694–5.
Teebagy S, Colwell L, Wood E, Yaghy A, Faustina M. Improved performance of ChatGPT-4 on the OKAP examination: a comparative study with ChatGPT-3.5. J Acad Ophthalmol. 2023;15(2):e184–7.
Shemer A, Cohen M, Altarescu A, Atar-Vardi M, Hecht I, Dubinsky-Pertzov B, et al. Diagnostic capabilities of ChatGPT in ophthalmology. Graefes Arch Clin Exp Ophthalmol Albrecht Von Graefes Arch Klin Exp Ophthalmol. 2024;262(7):2345–52.
Hu X, Ran AR, Nguyen TX, Szeto S, Yam JC, Chan CKM, et al. What can GPT-4 do for diagnosing Rare Eye diseases? A pilot study. Ophthalmol Ther. 2023;12(6):3395–402.
Raghu K, Devishamani STS, Rajalakshmi CMS, Raman R. The utility of ChatGPT in Diabetic Retinopathy Risk Assessment: a comparative study with clinical diagnosis. Clin Ophthalmol Auckl NZ. 2023;17:4021–31.
Sensoy E, Citirik M. A comparative study on the knowledge levels of artificial intelligence programs in diagnosing ophthalmic pathologies and intraocular tumors evaluated their superiority and potential utility. Int Ophthalmol. 2023;43(12):4905–9.
Mihalache A, Huang RS, Popovic MM, Patil NS, Pandya BU, Shor R, et al. Accuracy of an Artificial Intelligence Chatbot’s interpretation of Clinical Ophthalmic images. JAMA Ophthalmol. 2024;142(4):321–6.
Masalkhi M, Ong J, Waisberg E, Zaman N, Sarker P, Lee AG, et al. A side-by-side evaluation of Llama 2 by meta with ChatGPT and its application in ophthalmology. Eye Lond Engl. 2024;38(10):1789–92.
Masalkhi M, Ong J, Waisberg E, Lee AG. Google DeepMind’s gemini AI versus ChatGPT: a comparative analysis in ophthalmology. Eye Lond Engl. 2024;38(8):1412–7.
Rojas-Carabali W, Cifuentes-González C, Wei X, Putera I, Sen A, Thng ZX et al. Evaluating the diagnostic accuracy and management recommendations of ChatGPT in Uveitis. Ocul Immunol Inflamm. 2024;32(8):1526–31.
Rojas-Carabali W, Sen A, Agarwal A, Tan G, Cheung CY, Rousselot A et al. Chatbots Vs. Human experts: evaluating diagnostic performance of Chatbots in Uveitis and the perspectives on AI adoption in Ophthalmology. Ocul Immunol Inflamm. 2024;32(8):1591–8.
Thirunavukarasu AJ, Mahmood S, Malem A, Foster WP, Sanghera R, Hassan R, et al. Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: a head-to-head cross-sectional study. PLOS Digit Health. 2024;3(4):e0000341.
Waisberg E, Ong J, Masalkhi M, Zaman N, Sarker P, Lee AG, et al. Google’s AI chatbot Bard: a side-by-side comparison with ChatGPT and its utilization in ophthalmology. Eye Lond Engl. 2024;38(4):642–5.
Inayat H, McDonald HM, Bursztyn LLCD. Comparison of ChatGPT to ophthalmology resident and staff consultants on an ophthalmological training tool. Can J Ophthalmol. 2024;59(1):e72–4.
Carlà MM, Gambini G, Baldascino A, Giannuzzi F, Boselli F, Crincoli E et al. Exploring AI-chatbots’ capability to suggest surgical planning in ophthalmology: ChatGPT versus Google Gemini analysis of retinal detachment cases. Br J Ophthalmol. 2024;108(10):1457–69.
Sakai D, Maeda T, Ozaki A, Kanda GN, Kurimoto Y, Takahashi M. Performance of ChatGPT in Board examinations for specialists in the Japanese Ophthalmology Society. Cureus. 2023;15(12):e49903.
Choudhary A, Gopalakrishnan N, Joshi A, Balakrishnan D, Chhablani J, Yadav NK, et al. Recommendations for diabetic macular edema management by retina specialists and large language model-based artificial intelligence platforms. Int J Retina Vitr. 2024;10(1):22.
Subramanian B, Rajalakshmi R, Sivaprasad S, Rao C, Raman R. Assessing the appropriateness and completeness of ChatGPT-4’s AI-generated responses for queries related to diabetic retinopathy. Indian J Ophthalmol. 2024;72(Suppl 4):S684–7.
Maywood MJ, Parikh R, Deobhakta A, Begaj T. Performance assessment of an artificial intelligence chatbot in clinical vitreoretinal scenarios. Retina Phila Pa. 2024;44(6):954–64.
Fatani B. ChatGPT for Future Medical and Dental Research. Cureus. 2023;15(4):e37285.
Rahman M, Terano HJ, Rahman N, Salamzadeh A, Rahaman S. ChatGPT and Academic Research: A Review and Recommendations Based on Practical Examples. SSRN [Internet]. [cited 2024 Sep 8]. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4407462
OpenEvidence [Internet]. [cited 2024 Sep 8]. OpenEvidence - About. https://www.openevidence.com
Raimondi R, Tzoumas N, Salisbury T, Di Simplicio S, Romano MR. Comparative analysis of large language models in the Royal College of ophthalmologists fellowship exams. Eye. 2023;37(17):3530–3.
Moshirfar M, Altaf AW, Stoakes IM, Tuttle JJ, Hoopes PC. Artificial Intelligence in Ophthalmology: a comparative analysis of GPT-3.5, GPT-4, and human expertise in answering StatPearls questions. Cureus. 2023;15(6):e40822.
Huang AS, Hirabayashi K, Barna L, Parikh D, Pasquale LR. Assessment of a large Language Model’s responses to questions and cases about Glaucoma and retina management. JAMA Ophthalmol. 2024;142(4):371–5.
Tailor PD, Dalvin LA, Chen JJ, Iezzi R, Olsen TW, Scruggs BA, et al. A comparative study of responses to retina questions from either experts, Expert-Edited large Language models, or Expert-Edited large Language models alone. Ophthalmol Sci. 2024;4(4):100485.
Momenaei B, Wakabayashi T, Shahlaee A, Durrani AF, Pandit SA, Wang K, et al. Assessing ChatGPT-3.5 Versus ChatGPT-4 performance in Surgical Treatment of Retinal diseases: a comparative study. Ophthalmic Surg Lasers Imaging Retina. 2024;55(8):481–2.
Momenaei B, Wakabayashi T, Shahlaee A, Durrani AF, Pandit SA, Wang K, et al. Appropriateness and readability of ChatGPT-4-Generated responses for Surgical Treatment of Retinal diseases. Ophthalmol Retina. 2023;7(10):862–8.
SAT118 The Intelligent Chatbot: Can It Answer Questions About Health And Health Disparities? [Internet]. [cited 2024 Sep 9]. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10553909/
Marshall RF, Mallem K, Xu H, Thorne J, Burkholder B, Chaon B et al. Investigating the accuracy and completeness of an artificial intelligence large language model about uveitis: an evaluation of ChatGPT. Ocul Immunol Inflamm. 2024; 1–4.
Knebel D, Priglinger S, Scherer N, Klaas J, Siedlecki J, Schworm B. Assessment of ChatGPT in the Prehospital Management of Ophthalmological Emergencies - an analysis of 10 fictional case vignettes. Klin Monatsbl Augenheilkd. 2024;241(5):675–81.
Cheong KX, Zhang C, Tan TE, Fenner BJ, Wong WM, Teo KY et al. Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy. Br J Ophthalmol. 2024;108(10):1443-9.
Sudharshan R, Shen A, Gupta S, Zhang-Nunes S. Assessing the utility of ChatGPT in simplifying text complexity of patient Educational materials. Cureus. 2024;16(3):e55304.
Kianian R, Sun D, Crowell EL, Tsui E. The Use of large Language models to generate Education materials about Uveitis. Ophthalmol Retina. 2024;8(2):195–201.
Gopalakrishnan N, Joshi A, Chhablani J, Yadav NK, Reddy NG, Rani PK, et al. Recommendations for initial diabetic retinopathy screening of diabetic patients using large language model-based artificial intelligence in real-life case scenarios. Int J Retina Vitr. 2024;10(1):11.
Acknowledgements
The authors sincerely thank Dr. Thiago Furtado for his invaluable assistance in providing access to resources that significantly enhanced the scope and depth of our research. Publication fees were covered by an educational grant from F. Hoffmann-La Roche Ltd. The company had no influence on the research, the writing, or the selection of the manuscript’s subject, which were at the discretion of the authors.
Funding
This study was self-funded, and no external direct financial support was received. The authors contributed their time voluntarily and paid for their own equipment, physical space, supplies and general services needed to conduct all activities related to accessing study materials, compiling and analyzing data, and writing and reviewing this manuscript. External support was indirectly received from the University of São Paulo, a public university in Brazil with no conflicts of interest in the area, in the form of free access to multiple journals and scientific archives. Publication fees were covered by an educational grant from F. Hoffmann-La Roche Ltd.
Author information
Authors and Affiliations
Contributions
VCFB conceptualized the study, developed the methodology, contributed to the software, performed formal analysis and investigation, curated the data, and was responsible for writing the original draft, reviewing and editing the manuscript, creating visualizations, and managing the project. MLS contributed to the conceptualization, methodology, software, formal analysis, investigation, data curation, and original draft writing, as well as visualizations. DAF was involved in formal analysis and reviewing and editing the manuscript. GBM contributed to the study’s conceptualization, methodology, validation, investigation, and resources, as well as data curation, manuscript review and editing, and supervision. RJ also contributed to the study’s conceptualization, methodology, validation, investigation, resources, data curation, manuscript review and editing, and provided supervision. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Clinical trial number
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Bellanda, V.C.F., Santos, M.L.d., Ferraz, D.A. et al. Applications of ChatGPT in the diagnosis, management, education, and research of retinal diseases: a scoping review. Int J Retin Vitr 10, 79 (2024). https://doi.org/10.1186/s40942-024-00595-9
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s40942-024-00595-9