Assessment of ChatGPT vs. Bard vs. guidelines in the artificial intelligence (AI) preclinical management of otorhinolaryngological (ENT) emergencies

Fang Joe Chen; James Nightingale; Woo Sun You; Daniel Anderson; David Morrissey

doi:10.21037/ajo-24-1

Original Article


Fang Joe Chen^1,2,3, James Nightingale^1,2,3, Woo Sun You^1,2, Daniel Anderson^1, David Morrissey^1,3

¹Division of Otolaryngology and Head & Neck Surgery, Department of Surgery, Toowoomba Base Hospital, Toowoomba, Queensland, Australia; ²School of Medicine, Griffith University, Queensland, Australia; ³Faculty of Medicine, University of Queensland, Queensland, Australia

Contributions: (I) Conception and design: FJ Chen, J Nightingale, D Morrissey; (II) Administrative support: None; (III) Provision of study materials or patients: None; (IV) Collection and assembly of data: All authors; (V) Data analysis and interpretation: FJ Chen, J Nightingale, WS You; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Fang Joe Chen, MD, BPharm (Hons). Division of Otolaryngology and Head & Neck Surgery, Department of Surgery, Toowoomba Base Hospital, Private Mail Bag 2, Pechey Street, Toowoomba City, Queensland, 4350, Australia; School of Medicine, Griffith University, Queensland, Australia; Faculty of Medicine, University of Queensland, Queensland, Australia. Email: fangjoe.chen@health.qld.gov.au.

Background: Artificial intelligence (AI)-based platforms are gaining popularity, with increasing application across all areas of life. There is expanding research into AI’s ability to act as a medical resource for clinicians, patients, and their families, with emerging potential application in answering medical questions and vignettes. This study assesses ChatGPT’s performance on otorhinolaryngological (ENT) emergency case vignettes, with direct comparison with Google’s Bard AI and readily available online patient resources.

Methods: Fifteen short fictional case vignettes describing potential ENT emergencies were entered into ChatGPT and Google’s Bard in triplicate, followed by an online search for patient resources. Results were presented in terms of diagnostic accuracy, recommendation to seek medical review, appropriateness of triage categorisation, and appropriateness of preclinical measures.

Results: Both AI systems diagnosed 100% of conditions, with appropriate triage in 76.7% of instances. All instances suggested seeking face-to-face medical review. Appropriate preclinical measures were outlined in 84.4% of instances, and scores did not differ significantly between ChatGPT, Bard, and guidelines (Kruskal-Wallis test, P=0.5634).

Conclusions: This study suggests that AI systems cannot yet replace medical review but may help augment patient understanding. We continue to recommend consultation with medical practitioners in all cases of ENT emergencies. Future work should further assess the accuracy of AI in other aspects of health advice and diagnosis. Providers should be transparent and disclose when AI is used to generate advice, and the question of who holds legal responsibility for adverse outcomes remains unexplored. AI should not be used as a stand-alone primary resource for ENT emergencies and should not replace seeking medical review. With further development, there is no doubt AI will play a greater role in the preclinical management of ENT emergencies.

Keywords: Otorhinolaryngology; emergency; ChatGPT; artificial intelligence (AI); patient education

Received: 03 January 2024; Accepted: 25 March 2024; Published online: 27 May 2024.


Introduction

Otorhinolaryngological (ENT) emergencies are common presentations in emergency departments (1,2). Timely, appropriate, and safe management is important to ensure optimal patient outcomes. Emergency presentations can be challenging due to increasing patient demand, reduced access to primary care, and a general increase in patient complexity (3,4). These challenges are compounded by varying health literacy and the limited availability of public ENT resources (5). The diverse ways in which ENT emergencies present pose further challenges. Presentations can cause significant anxiety and distress to patients and, in the paediatric population, to parents and caregivers.

Technology is an important tool: 80% of Australians search online for health information, and 40% then seek self-treatment advice (4). Artificial intelligence (AI) in health care has the potential to revolutionise practice, with emerging applications across medicine as a whole and within the ENT subspecialty specifically (6-9). AI-powered tools can provide real-time, personalised, and interactive patient education and support (6,7,9). In a society where access to medical care is increasingly limited, this offers a potential solution to a major public issue (7-10). The accessibility and real-time support of AI systems offer a significant advantage as a patient resource (6,9,11). Traditional doctor-patient interaction is increasingly limited by accessibility and the time constraints of consultations, with rural and remote populations further disadvantaged (6,7,12). These challenges can compromise patient education, pre- and post-operative counselling, patient care, and outcomes, creating an opportunity for AI to improve our practice (7,12).

Whilst patients may benefit from the use of AI, false and misleading information can pose a significant risk to patient safety (9,13,14). AI responses can ignore contextual information, be overly general, contain inaccurate or false information, and fail to align with clinical guidelines (15-18). This is particularly challenging for subspecialty ENT emergencies, given the broad and varied nature of presentations, the differing management of similar problems, and clinical guidelines that are lacking or conflicting. AI advice must be evidence based to ensure the highest level of patient safety and to facilitate the best possible outcomes for the individual patient and the community in general (10,13). There are also concerns about privacy, patient confidentiality, bias, and inequality of access, particularly in low-socioeconomic and ageing populations, which limit widespread clinical application (10,15,16,18,19).

The availability and use of AI have boomed, with an increasing number of publicly available applications. Two examples of generative AI are OpenAI’s ChatGPT and Google’s Bard, each with over 100 million monthly users (10,11). These chatbots are large language models that use self-attention mechanisms and large amounts of training data to generate responses (4,10,11). As these tools become more accessible and popular, patients increasingly use them as a first source of information on acute medical symptoms or questions. Benefits include access to recent information, improved patient engagement, and a reduced workload for healthcare professionals (15-18). By analysing large volumes of digital text (books, articles, online forums), AI systems learn relationships between words and phrases, seeking to capture multiple perspectives and experiences (6,10). Patients use these models by entering prompts in the form of questions or commands and receiving generated responses (13,20).

There is emerging evidence for the application of AI in the preclinical setting. AI platforms have begun to be investigated as sources of medical subspecialty information (8,9,11,21,22). Interestingly, this was first explored in the context of AI’s ability to understand and perform in medical examinations. For example, AI has been shown to pass a diverse range of standardised medical and non-medical examinations (8,19,21). Its performance in medical subspecialties such as ophthalmology, pathology, neurosurgery, and cardiology has been evaluated as near or at passing level (21,23-25). AI platforms show promise in navigating open-ended as well as multiple-choice questions (8,15,18,19). ChatGPT has demonstrated competency in providing patient counselling and education in response to clinical vignettes in pathology, neurology, and ophthalmology (16,23-25). There are scant data on AI application to ENT in this context, and this gap is the focus of the aims of this project. We present this article in accordance with the STARD reporting checklist (available at https://www.theajo.com/article/view/10.21037/ajo-24-1/rc).

Objectives and aims

This paper aims to compare OpenAI’s ChatGPT to Google’s Bard AI systems on their response to fictional ENT emergency scenarios. Accuracy of diagnosis, suggestion to seek formal medical review, categorisation of triage and urgency of review, and quality of advice provided will be assessed. Online clinical guidelines will be used as a reference point for comparison.

Methods

Case vignettes were created for common ENT emergencies. Case design centred on potential patient descriptions of acute presentations, focusing on patient interaction with AI and online resources. The overall study design comprised creation of AI prompts; a standardised, repeatable interaction with ChatGPT, Bard, and online resources; and analysis and comparative scoring of responses across these online media. Ethical approval was assessed as not required given this study did not involve human participants, and the data collected were freely available in the public domain.

Creation of AI prompts

Prompts are the foundation of patient interaction with AI. In this study, prompts were designed to represent theoretical patient descriptions of common ENT emergencies and acute symptoms. The prompts generated for each emergency vignette are detailed in Table 1. Each prompt was entered into ChatGPT and Bard with three outputs generated to capture potential variation in each novel generative AI response, a methodology used in previous AI studies (26). Guidelines based on the proposed vignette title were searched via the Google search engine, representing a traditional patient search medium. Institutional guidelines were sourced from the first page of search results; they are included in the Supplementary Appendix and can be provided on request.

Table 1

ENT emergency vignettes and patient AI prompts for ChatGPT and Bard

Vignette	Representative AI prompt
Epistaxis	‘I am bleeding from my nose’
Post-tonsillectomy bleed (adult)	‘I am bleeding from my mouth after having a tonsillectomy’
Post-tonsillectomy bleed (child)	‘My child is bleeding from their mouth after having a tonsillectomy’
Sudden sensorineural hearing loss	‘I have suddenly lost hearing in one of my ears’
Nasal bone fracture	‘I have broken my nose’
Peri-orbital cellulitis	‘I have swelling and redness around one of my eyes’
Tonsillitis/quinsy	‘I have one sided throat pain and swelling’
Otitis media/externa	‘I have pain in one of my ears’
Foreign body insertion (ear)	‘My child has put something into one of their ears’
Foreign body insertion (nose)	‘My child has put something into their nose’
Button battery ingestion	‘My child has swallowed a battery’
Facial nerve palsy	‘I have sudden weakness on one side of my face’
Respiratory distress	‘I am having difficulty breathing’
Dysphonia	‘I have suddenly developed a hoarse voice’
Neck abscess	‘I have one sided neck pain, swelling, and redness’

ENT, otorhinolaryngological; AI, artificial intelligence.
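The protocol described above (each Table 1 prompt entered into each platform three times) can be sketched as a simple trial grid. This is an illustration only: the paper does not describe any programmatic access to ChatGPT or Bard, and the `Trial` record and `build_trials` helper are hypothetical names. The sketch merely enumerates the 15 × 2 × 3 design and confirms that it yields 45 responses per platform, matching the denominators used in the Results.

```python
from dataclasses import dataclass
from itertools import product

# Vignette -> representative patient prompt, transcribed from Table 1.
PROMPTS = {
    "Epistaxis": "I am bleeding from my nose",
    "Post-tonsillectomy bleed (adult)": "I am bleeding from my mouth after having a tonsillectomy",
    "Post-tonsillectomy bleed (child)": "My child is bleeding from their mouth after having a tonsillectomy",
    "Sudden sensorineural hearing loss": "I have suddenly lost hearing in one of my ears",
    "Nasal bone fracture": "I have broken my nose",
    "Peri-orbital cellulitis": "I have swelling and redness around one of my eyes",
    "Tonsillitis/quinsy": "I have one sided throat pain and swelling",
    "Otitis media/externa": "I have pain in one of my ears",
    "Foreign body insertion (ear)": "My child has put something into one of their ears",
    "Foreign body insertion (nose)": "My child has put something into their nose",
    "Button battery ingestion": "My child has swallowed a battery",
    "Facial nerve palsy": "I have sudden weakness on one side of my face",
    "Respiratory distress": "I am having difficulty breathing",
    "Dysphonia": "I have suddenly developed a hoarse voice",
    "Neck abscess": "I have one sided neck pain, swelling, and redness",
}
PLATFORMS = ("ChatGPT", "Bard")
REPLICATES = 3  # triplicate generation per prompt, as in the Methods


@dataclass
class Trial:
    """One prompt submission (hypothetical record layout)."""
    vignette: str
    platform: str
    replicate: int
    prompt: str
    response: str = ""  # to be filled with the generated text


def build_trials(prompts=PROMPTS):
    """Enumerate every (vignette, platform, replicate) combination."""
    return [
        Trial(vignette, platform, rep, prompts[vignette])
        for vignette, platform, rep in product(
            prompts, PLATFORMS, range(1, REPLICATES + 1)
        )
    ]


trials = build_trials()
per_platform = {p: sum(t.platform == p for t in trials) for p in PLATFORMS}
print(len(trials), per_platform)
```

Keeping each replicate as its own record makes within-vignette variability (a recurring theme in the Discussion) straightforward to tabulate later.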

Standardised interaction with ChatGPT and Bard

The default ChatGPT version (3.5) was used between the 23^rd and 24^th of August 2023, followed by Bard on the 28^th of August 2023. Case descriptions were entered as the AI prompts, followed by a question about diagnosis and treatment recommendations. A representative example of this interaction for ChatGPT and Bard is shown in Figure 1.

Figure 1 Presenting (left) a representative example of a standardised vignette on epistaxis with ChatGPT and (right) an example of a patient experience with Bard for a vignette for epistaxis.

Analysis of ChatGPT and Bard responses

Responses to each ENT emergency vignette were analysed for accuracy and appropriateness of advice across this AI-simulated patient experience. Key outcome measures included diagnosis of the condition based on the input prompts, recommendation to seek medical review, triage categorisation, recommended treatment, and an overall score for the appropriateness of advice.

The ‘Appropriateness of the Preclinical Measures (APM)’ score has been proposed as a tool for grading AI responses (16). An APM score was assigned to each AI response, grading it on a five-point ordinal scale as detailed in Table 2. For comparison, representative clinical guidelines were also scored against the APM criteria for each vignette. Scoring was performed by two ENT consultants within the department who are Fellows of the Royal Australasian College of Surgeons (RACS). To keep scoring blinded, responses were standardised into plain text with no identifiable AI or guideline branding or features. As the data were nonparametric, statistical significance was determined with the Kruskal-Wallis test (significance set at P<0.05).
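The Kruskal-Wallis test compares groups by ranking all scores together rather than by their raw values, which suits ordinal scales such as the APM. A minimal pure-Python sketch follows, assuming three groups (ChatGPT, Bard, guidelines) so that the chi-square p-value has the closed form exp(−H/2) for 2 degrees of freedom. The function names are hypothetical; in practice `scipy.stats.kruskal` performs the same tie-corrected, chi-square-approximated test. No study data are used here, since individual APM scores are not reported.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis(*groups):
    """Tie-corrected Kruskal-Wallis H and its chi-square p-value.

    exp(-H/2) is the chi-square survival function for df = 2,
    so this sketch only supports exactly three groups.
    """
    if len(groups) != 3:
        raise ValueError("this sketch handles exactly three groups")
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_term, start = 0.0, 0
    for g in groups:  # groups occupy consecutive slices of `pooled`
        r = ranks[start:start + len(g)]
        rank_term += sum(r) ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    correction = 1 - ties / (n ** 3 - n)
    if correction > 0:
        h /= correction
    return h, math.exp(-h / 2)
```

For example, the hypothetical groups `[1, 2, 3]`, `[4, 5, 6]`, `[7, 8, 9]` give H = 7.2 and P ≈ 0.027. Ordinal scales with few levels produce many ties, so the tie correction matters for APM-style data.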

Table 2

The appropriateness of recommended preclinical measures grading criteria

Score	Description
0	Contains harmful advice or harmfully lacks crucial preclinical measures
1	Contains conflicting advice
2	Contains only useless advice
3	Contains useless as well as appropriate advice
4	Contains only appropriate advice

Results

Accuracy of diagnosis

Across the 15 ENT emergency vignettes, all AI interactions correctly identified the diagnosis. For each of ChatGPT and Bard, 45/45 (100%) responses included the intended diagnosis in their generated list of potential conditions. A representative example of a correctly identified diagnosis for the otitis media/externa vignette is provided for both ChatGPT and Bard in Figure 2.

Figure 2 Presenting (left) a representative example of good diagnosis accuracy of ChatGPT and (right) Bard for a vignette for otitis media/externa.

Differential diagnoses varied across the vignettes for both ChatGPT and Bard. No response ranked its differential diagnoses by likelihood, and 17/45 (37.8%) Bard interactions and 14/45 (31.1%) ChatGPT interactions offered only one potential diagnosis. Foreign body insertion was particularly limited: ChatGPT suggested specific foreign bodies only once (1/6, 16.7%), whereas Bard did so more frequently (5/6, 83.3%). Notably, for the facial nerve palsy vignette, ChatGPT mentioned stroke as a diagnosis in only 1/3 (33.3%) responses, and Bell’s palsy was the only diagnosis provided in 2/3 (66.7%) responses. Bard was similar, with all 3/3 (100%) responses suggesting only Bell’s palsy and making no mention of stroke or other differentials. The limitations of differential diagnosis discussion are presented in Figure 3.

Figure 3 Presenting (left) a representative example of poor diagnosis accuracy of ChatGPT and (right) Bard for a vignette for facial palsy.

Recommendation to seek review

In all instances (45/45, 100%), both ChatGPT and Bard suggested consultation with a medical officer at some point in the output text, without the need for additional prompts or inputs. Both included an explicit disclaimer that they were not medically trained and only ‘provided information’, together with a suggestion to seek medical review. A representative disclaimer is provided in Figure 4.

Figure 4 Presenting (left) a representative example of recommendation to seek medical review for ChatGPT and (right) Bard for a vignette for sudden sensorineural hearing loss.

Triage categorisation

ChatGPT contained information on urgency in 34/45 (75.6%) instances, and Bard in 35/45 (77.8%) interactions. ChatGPT went one step further, specifically suggesting review by a surgeon in 10/45 (22.2%) responses, with specific mention of an ENT surgeon in 7/10 (70%) of these. Bard did not differentiate by subspecialty when suggesting medical review. The appropriateness of triage categorisation and of the suggested medical officer is summarised in Table 3, and an example of appropriate categorisation is shown in Figure 5.

Table 3

Comparison of ChatGPT and Bard in areas of ‘medical review suggested’, ‘appropriate triage categorisation’, and ‘appropriateness of review’

Vignette title	Medical review suggested: ChatGPT	Medical review suggested: Bard	Appropriate triage categorisation: ChatGPT	Appropriate triage categorisation: Bard	Appropriateness of review: ChatGPT	Appropriateness of review: Bard
Total	45/45	45/45	34/45	35/45	45/45	45/45
Epistaxis	3/3	3/3	1/3	1/3	3/3	3/3
Post-tonsillectomy bleed (adult)	3/3	3/3	3/3	3/3	3 (2^†)/3	3/3
Post-tonsillectomy bleed (child)	3/3	3/3	3/3	2/3	3 (1^†)/3	3/3
Sudden sensorineural hearing loss	3/3	3/3	0/3	3/3	3 (2^‡)/3	3/3
Nasal bone fracture	3/3	3/3	3/3	2/3	3 (2^‡)/3	3/3
Peri-orbital cellulitis	3/3	3/3	3/3	3/3	3/3	3/3
Tonsillitis/peritonsillar abscess	3/3	3/3	2/3	0/3	3/3	3/3
Otitis media/externa	3/3	3/3	1/3	3/3	3/3	3/3
Foreign body insertion (ear)	3/3	3/3	3/3	3/3	3 (1^‡)/3	3/3
Foreign body insertion (nose)	3/3	3/3	3/3	3/3	3/3	3/3
Battery ingestion (child)	3/3	3/3	3/3	3/3	3/3	3/3
Facial nerve palsy	3/3	3/3	2/3	3/3	3/3	3/3
Respiratory distress	3/3	3/3	3/3	3/3	3/3	3/3
Dysphonia	3/3	3/3	2/3	0/3	3 (2^‡)/3	3/3
Neck abscess	3/3	3/3	2/3	3/3	3/3	3/3

^†, indicates times specific ‘surgeon’ review was suggested; ^‡, indicates times specific ‘ENT’ review was suggested. ENT, otorhinolaryngological.
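As an arithmetic check, the per-vignette triage counts in Table 3 can be summed to reproduce the reported totals (34/45 for ChatGPT, 35/45 for Bard) and the pooled 76.7% quoted in the abstract. The counts below are transcribed directly from the table; the `pct` helper is a hypothetical convenience that rounds to one decimal place, as the text does.

```python
# Responses (out of 3) with appropriate triage categorisation per vignette,
# transcribed from Table 3 as (ChatGPT, Bard).
TRIAGE = {
    "Epistaxis": (1, 1),
    "Post-tonsillectomy bleed (adult)": (3, 3),
    "Post-tonsillectomy bleed (child)": (3, 2),
    "Sudden sensorineural hearing loss": (0, 3),
    "Nasal bone fracture": (3, 2),
    "Peri-orbital cellulitis": (3, 3),
    "Tonsillitis/peritonsillar abscess": (2, 0),
    "Otitis media/externa": (1, 3),
    "Foreign body insertion (ear)": (3, 3),
    "Foreign body insertion (nose)": (3, 3),
    "Battery ingestion (child)": (3, 3),
    "Facial nerve palsy": (2, 3),
    "Respiratory distress": (3, 3),
    "Dysphonia": (2, 0),
    "Neck abscess": (2, 3),
}
REPLICATES = 3


def pct(k, n):
    """Percentage rounded to one decimal place."""
    return round(100 * k / n, 1)


n = REPLICATES * len(TRIAGE)  # responses per platform
chatgpt = sum(c for c, _ in TRIAGE.values())
bard = sum(b for _, b in TRIAGE.values())
print(f"ChatGPT {chatgpt}/{n} ({pct(chatgpt, n)}%)")
print(f"Bard {bard}/{n} ({pct(bard, n)}%)")
print(f"Pooled {chatgpt + bard}/{2 * n} ({pct(chatgpt + bard, 2 * n)}%)")
```

This prints 34/45 (75.6%), 35/45 (77.8%), and 69/90 (76.7%), confirming that the table, the Results text, and the abstract agree.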

Figure 5 Presenting (left) a representative example of appropriate triage categorisation for ChatGPT and (right) Bard for a vignette for battery ingestion.

Both ChatGPT and Bard suggested specific treatment advice in 38/45 (84.4%) instances. Where advice was non-specific, general healthcare advice was provided without situation- or condition-specific information. The mean APM scores for ChatGPT, Bard, and clinical guidelines were 3.50, 3.23, and 3.42, respectively, as displayed in Figure 6. There was no statistically significant difference between groups (P=0.5634).

Figure 6 Mean APM score for ChatGPT, Bard and clinical guidelines with 95% CI. Mean APM scores were 3.50, 3.23 and 3.42, respectively, with no statistically significant difference between groups. APM, appropriateness of the preclinical measures; CI, confidence interval.
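Figure 6 reports a mean APM score with a 95% CI for each group. The paper does not state how the intervals were computed, so the following is only a sketch under an assumed normal approximation (mean ± 1.96 × standard error); a t-based interval would be slightly wider for small samples. The `mean_with_ci` name is hypothetical, and the rubric labels are taken from Table 2 so that out-of-range scores are rejected.

```python
import math
from statistics import mean, stdev

# APM rubric from Table 2 (score -> description).
APM_LABELS = {
    0: "Contains harmful advice or harmfully lacks crucial preclinical measures",
    1: "Contains conflicting advice",
    2: "Contains only useless advice",
    3: "Contains useless as well as appropriate advice",
    4: "Contains only appropriate advice",
}


def mean_with_ci(scores, z=1.96):
    """Mean APM score and a normal-approximation CI (assumed method)."""
    if len(scores) < 2:
        raise ValueError("need at least two scores for a standard deviation")
    for s in scores:
        if s not in APM_LABELS:
            raise ValueError(f"invalid APM score: {s}")
    m = mean(scores)
    half_width = z * stdev(scores) / math.sqrt(len(scores))  # z * SE
    return m, (m - half_width, m + half_width)
```

Called on hypothetical scores such as `[3, 4, 3, 4]`, it returns a mean of 3.5 with an interval symmetric about it.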

Discussion

With increasing application of AI systems across all areas of life, this study investigated the use of AI to provide patient advice in ENT emergencies. Vignettes were designed to resemble queries from patients in the community. Importantly, when faced with this task, both ChatGPT and Bard provided relevant diagnoses, including the intended ENT emergency, in all cases. There were, however, limitations in differential diagnoses, recommendations to seek medical review, and the appropriateness of urgency and triage advice.

Limitations in differential diagnoses were first apparent in the foreign body vignettes. ChatGPT and Bard correctly identified the diagnosis in all cases, which is particularly important for button battery ingestion, but the specific foreign bodies suggested were limited and varied across the other foreign body cases. Both AI systems raised concerns in their responses to the facial nerve palsy vignette, with ‘stroke’ mentioned only once by ChatGPT and never by Bard. Furthermore, few differentials were suggested for this vignette, with Bard suggesting only a single diagnosis. Overall, there was a trend towards incomplete presentation and no ranking of differential diagnoses across all vignettes. This highlights the potentially narrow focus of AI and creates a risk that patients interpret the suggested condition as the only diagnosis and miss more serious considerations.

All AI answers contained an explicit disclaimer that they were not medically trained and provided information rather than advice. This was paired with a recommendation to seek formal medical review. There is, however, nuance to formal medical review in ENT emergencies, which may occur in primary care, in the emergency department, or through direct subspecialty ENT presentation. Medical assessment involves physical examination, which provides important information that cannot be captured by limited text prompts, and clinical findings can significantly change diagnosis and management. In practice, this may cause delay and confusion for patients who have difficulty navigating complicated health care systems, both public and private. This represents a challenge for both current and future AI systems. For example, in a post-tonsillectomy bleed the surgeon should be contacted as part of post-operative emergency care, whereas in otitis media the general practitioner may be the focal point of care. Patient uncertainty and appropriate patient education in these situations are a challenge for AI, and in some scenarios responses erroneously set out circumstances in which patients should seek review and suggested whether or when they could stay at home instead of attending medical review, particularly for post-tonsillectomy bleeds.

When investigating triage, reference to urgency was made in 75.6% and 77.8% of instances for ChatGPT and Bard, respectively. Wording was non-specific, although terms such as ‘immediate’, ‘urgent’, and ‘prompt’ did convey the emergent nature of the vignettes. Responses were inconsistent and contained statements open to patient interpretation. For example, variation was found both between AI systems and within vignettes, with varying degrees of urgency or no mention of urgency at all. This was most concerning in one post-tonsillectomy bleed interaction, which omitted urgency advice.

The overall appropriateness of AI advice was measured with the APM score. AI responses and comparative online guidelines were reviewed by Fellowship-trained ENT surgeons to gauge real-world applicability and relevance to current practice. Most assessments judged that the AI-generated responses contained at least some useful information, but scores varied across AI responses. Within vignettes, APM grading also varied, suggesting inconsistency in how the AI interpreted available data. This reflects the inconsistency of current generative large language models, which produce a novel response to a patient’s prompt in each interaction. This was most evident in post-tonsillectomy bleed management, but also in the facial nerve palsy and button battery ingestion vignettes, all of which are cases that must not be missed.

Guidelines used in this study appeared on the first page of results when performing an online search via Google, providing a comparative online patient experience. Currently, patient information from national or internationally recognised first-line guidelines is limited. Healthdirect was assumed to be a reliable source given its endorsement by the Australian Government, with support from multiple State and Territory Departments. Royal Victorian Eye and Ear Hospital (E&E) resources were frequently used, as these guidelines appeared early in the Google search and clearly addressed the intended vignette. Most guidelines were assessed as APM >3, suggesting that they offered relevant, appropriate clinical advice and enabling comparison with AI outputs. However, this scoring methodology, and the previously published but unvalidated APM score, have limitations. It should be noted that some reviewers felt the example clinical guidelines themselves were inadequate. Ongoing study is required to further validate and support future evidence-based scoring systems for AI patient education, and this study provides a foundation for that work.

Overall, the evaluation of ChatGPT in providing clinical advice is a new and developing field of medical research. Publications are limited, but evidence is emerging across the ENT subspecialty. Cai et al. were the first to demonstrate that ChatGPT could pass an ENT board exam; however, its responses often lacked the depth and breadth typically expected of a board examination answer (21). Similarly, Mahajan et al. demonstrated that ChatGPT provided safe advice in accordance with the American Academy of Otolaryngology’s Clinical Practice Guidelines, but its responses lacked accuracy and comprehensiveness (19). Subsequent studies showed improvement with more detailed and targeted inputs, but raised concerns about filtering open-ended free-form histories, complete examinations, and uncurated data, which often contain irrelevant, extraneous, and contradictory information (19,27). Our results were similarly less favourable; however, because this study did not use more detailed, targeted inputs, they cannot be directly compared with those of previous studies.

In other specialties, Bushuven et al. assessed ChatGPT’s prehospital diagnostic accuracy, emergency call advice, and the validity of advice given to parents concerned about their unwell child (28). Diagnostic accuracy was 94%; however, several major conditions were poorly recognised (28). Advice to seek emergency services was given in only 54% of cases, with correct prehospital advice in 45% and incorrect advice in 13.6% (28). Knebel et al. most closely parallel the aims of this study, assessing ChatGPT alone in the preclinical management of ophthalmological emergencies (16). Triage accuracy was reported as 87.2%, with 32% of responses demonstrating a potential to inflict harm (16). The overall recommendation from these studies is that AI programs should not be used as stand-alone primary sources of information. AI-based language models demonstrate the capability to provide useful and accurate advice, and it is assumed that as algorithms develop and software improves, so will the ability of these models to perform and to be trusted (16,28).

Limitations

A major limitation of this study is accounting for the multiple levels of variability within it. The wording of the vignettes may not represent what patients enter into AI systems, which can contain vaguer statements or confounding and conflicting information. Generated responses were variable, and the consistency and accuracy of information cannot be guaranteed. Local gold-standard management guidelines also vary and may not accurately advise patients depending on their location. Published advice is by nature general and may not account for individual patient factors, a limitation for both AI and non-AI publications. AI has the potential to address this limitation by asking further clarifying questions, making it a more useful tool.

The surgeons’ assessments demonstrate that there can be significant variation in clinical opinion. This study focused on subspecialty advice, which limits the generalisability and transferability of our results and inferences to other medical and non-medical fields. Only two surgeons performed the clinical assessment, and a larger number of assessors could alter the outcome.

Future application

This study suggests that AI systems can act as an additional resource for patients to access information; however, care should be exercised when interpreting their advice. We therefore clearly recommend contacting trained medical practitioners in all cases of acute symptoms and in all ENT emergencies. Additionally, because ENT and medicine in general rely on visualisation and physical examination of patients, the ability to accurately assess situations purely from text prompts is limited, and physical examination should not be replaced.

This study contributes to the emerging evidence for AI, and future studies must continue to assess the accuracy of AI-generated responses and their health advice to ensure they are evidence based. Such information can be used by developers to refine and enhance AI, which has the potential to greatly benefit individual patients and society in general. The increasing accessibility and promotion of AI suggest that its use by patients and clinicians is likely to increase, making future studies into the medical use of AI critical.

Conclusions

ChatGPT and Bard demonstrate potential to diagnose, triage, and provide patient education for ENT emergencies. AI is not yet a substitute for medical review but may help augment patient understanding. As AI becomes more widespread and accessible, its use by patients and clinicians is likely to increase. The quality of AI advice also depends on the quality of the input, with poorly worded questions yielding inadequate advice. It is also important to highlight the limitations of AI to both patients and clinicians, and patient safety should always come first. There is currently a boom in AI applications, with AI-based language models showing impressive capabilities within medicine and medical knowledge. Mechanisms are needed to inform users when advice is AI generated, as users may assume they are interacting with a real person. Providers should be transparent and disclose which resources were used to generate advice and whether AI was involved. Questions remain about who holds legal responsibility for AI-provided advice in the event of adverse outcomes, and these will require a legislative response. With further development, there is little doubt AI will play a greater role in the preclinical management of ENT emergencies in the future.

Acknowledgments

Funding: None.

Footnote

Reporting Checklist: The authors have completed the STARD reporting checklist. Available at https://www.theajo.com/article/view/10.21037/ajo-24-1/rc

Data Sharing Statement: Available at https://www.theajo.com/article/view/10.21037/ajo-24-1/dss

Peer Review File: Available at https://www.theajo.com/article/view/10.21037/ajo-24-1/prf

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://www.theajo.com/article/view/10.21037/ajo-24-1/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. Ethical approval was assessed as not required given this study did not involve human participants, and the data collected were freely available in the public domain.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Merino-Galvez E, Gomez-Hervas J, Perez-Mestre D, et al. Epidemiology of otorhinolaryngologic emergencies in a secondary hospital: analysis of 64,054 cases. Eur Arch Otorhinolaryngol 2019;276:911-7.
Naunheim MR, Kozin ED, Shrime MG. The Value of Urgent and Emergent Care in Otolaryngology. Curr Otorhinolaryngol Rep 2018;6:209-15.
Rouhani MJ. In the face of increasing subspecialisation, how does the specialty ensure that the management of ENT emergencies is timely, appropriate and safe? J Laryngol Otol 2016;130:516-20.
Hill MG, Sim M, Mills B. The quality of diagnosis and triage advice provided by free online symptom checkers and apps in Australia. Med J Aust 2020;212:514-9.
Aaronson NL, Joshua CL, Boss EF. Health literacy in pediatric otolaryngology: A scoping review. Int J Pediatr Otorhinolaryngol 2018;113:252-9.
Sharma S, Pajai S, Prasad R, et al. A Critical Review of ChatGPT as a Potential Substitute for Diabetes Educators. Cureus 2023;15:e38380.
Mothershaw A, Smith AC, Perry CF, et al. Does artificial intelligence have a role in telehealth screening of ear disease in Indigenous children in Australia? Aust J Otolaryngol 2021;4:38.
Zalzal HG, Cheng J, Shah RK. Evaluating the Current Ability of ChatGPT to Assist in Professional Otolaryngology Education. OTO Open 2023;7:e94.
Nwosu OI, Crowson MG, Rameau A. Artificial Intelligence Governance and Otolaryngology-Head and Neck Surgery. Laryngoscope 2023;133:2868-70.
Tan L, Tivey D, Kopunic H, et al. Part 1: Artificial intelligence technology in surgery. ANZ J Surg 2020;90:2409-14.
Bur AM, Shew M, New J. Artificial Intelligence for the Otolaryngologist: A State of the Art Review. Otolaryngol Head Neck Surg 2019;160:603-11.
Xie Y, Seth I, Hunter-Smith DJ, et al. Aesthetic Surgery Advice and Counseling from Artificial Intelligence: A Rhinoplasty Consultation with ChatGPT. Aesthetic Plast Surg 2023;47:1985-93.
Haupt CE, Marks M. AI-Generated Medical Advice-GPT and Beyond. JAMA 2023;329:1349-50.
Davenport T, Kalakota R. The potential for artificial intelligence in healthcare. Future Healthc J 2019;6:94-8.
Baumgartner C. The opportunities and pitfalls of ChatGPT in clinical and translational medicine. Clin Transl Med 2023;13:e1206.
Knebel D, Priglinger S, Scherer N, et al. Assessment of ChatGPT in the Prehospital Management of Ophthalmological Emergencies - An Analysis of 10 Fictional Case Vignettes. Klin Monbl Augenheilkd 2023.
Ferdush J, Begum M, Hossain ST. ChatGPT and Clinical Decision Support: Scope, Application, and Limitations. Ann Biomed Eng 2024;52:1119-24.
Rao A, Pang M, Kim J, et al. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med Internet Res 2023;25:e48659.
Mahajan AP, Shabet CL, Smith J, et al. Assessment of Artificial Intelligence Performance on the Otolaryngology Residency In-Service Exam. OTO Open 2023;7:e98.
Jiang F, Jiang Y, Zhi H, et al. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol 2017;2:230-43.
Cai L, Kayle L, André dos S, et al. Evaluating ChatGPT-4 in Otolaryngology–Head and Neck Surgery Board Examination using the CVSA Model. medRxiv 2023:2023.05.30.23290758.
Gracias D, Siu A, Seth I, et al. Exploring the role of an artificial intelligence chatbot on appendicitis management: an experimental study on ChatGPT. ANZ J Surg 2024;94:342-52.
Sinha RK, Deb Roy A, Kumar N, et al. Applicability of ChatGPT in Assisting to Solve Higher Order Problems in Pathology. Cureus 2023;15:e35237.
Antaki F, Touma S, Milad D, et al. Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings. Ophthalmol Sci 2023;3:100324.
Ali R, Tang OY, Connolly ID, et al. Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations. Neurosurgery 2023;93:1353-65.
Hopkins AM, Logan JM, Kichenadasse G, et al. Artificial intelligence chatbots will revolutionize how cancer patients access information: ChatGPT represents a paradigm-shift. JNCI Cancer Spectr 2023;7:pkad010.
Qu RW, Qureshi U, Petersen G, et al. Diagnostic and Management Applications of ChatGPT in Structured Otolaryngology Clinical Scenarios. OTO Open 2023;7:e67.
Bushuven S, Bentele M, Bentele S, et al. "ChatGPT, Can You Help Me Save My Child's Life?" - Diagnostic Accuracy and Supportive Capabilities to Lay Rescuers by ChatGPT in Prehospital Basic Life Support and Paediatric Advanced Life Support Cases - An In-silico Analysis. J Med Syst 2023;47:123.

Cite this article as: Chen FJ, Nightingale J, You WS, Anderson D, Morrissey D. Assessment of ChatGPT vs. Bard vs. guidelines in the artificial intelligence (AI) preclinical management of otorhinolaryngological (ENT) emergencies. Aust J Otolaryngol 2024;7:19.