Harnessing ChatGPT and GPT-4 for Evaluating the Rheumatology Questions of the Spanish Access Exam to Specialized Medical Training

04/01/2024

In this study, we evaluated the accuracy and clinical reasoning of two LLMs in answering rheumatology questions from official Spanish medical exams. To our knowledge, this is the first study to evaluate the usefulness of LLMs in the training of medical students with a special focus on rheumatology.

The ability of GPT-4 to answer questions with high accuracy and sound clinical reasoning is remarkable. This could make such models valuable learning tools for medical students. However, ChatGPT and GPT-4 are only the first models to have reached the public in the rapidly expanding field of LLM chatbots. At present, a myriad of additional models are under development. Some of these nascent models are not only pre-trained on biomedical texts³⁴,³⁵, but are also specifically designed for a broad range of tasks (e.g., text summarization, question-answering and so on).

Studies with a similar objective to this one have been conducted. For example, a Spanish study³⁶ evaluated ChatGPT’s ability to answer questions from the 2022 MIR exam. In this cross-sectional and descriptive analysis, 210 questions from the exam were entered into the model. ChatGPT correctly answered 51.4% of the questions. This corresponded to position 7688 in the ranking, slightly below the median of the population tested but above the passing score.

In another study³⁷, the proficiency of ChatGPT in answering higher-order thinking questions related to medical biochemistry was examined. The questions covered 11 competencies, such as basic biochemistry, enzymes, chemistry and metabolism of carbohydrates, lipids and proteins, oncogenesis, and immunity. Two hundred questions were randomly chosen from an institution’s question bank and classified according to the Competency-Based Medical Education curriculum. The answers were evaluated by two expert biochemistry academicians on a scale of zero to five. ChatGPT obtained a median score of 4 out of 5, with the oncogenesis and immunity competency scoring lowest and the basic biochemistry competency scoring highest.

Research of a similar nature was conducted in Ref.³⁸. In this study, the authors appraised the capability of ChatGPT in answering first- and second-order questions on microbiology (e.g., general microbiology and immunity, musculoskeletal system, skin and soft tissue infections, respiratory tract infections and so on) from the Competency-Based Medical Education curriculum. A total of 96 essay questions were reviewed for content validity by an expert microbiologist. Subsequently, ChatGPT's responses were evaluated by three microbiologists on a scale of 1 to 5, with 5 being the highest score. A median score of 4.04 was achieved.

ChatGPT was also tested on the Plastic Surgery In-Service examinations from 2018 to 2022, and its performance was compared to the national average performance of plastic surgery residents³⁹. Out of 1129 questions, ChatGPT answered 630 (55.8%) correctly. When compared with the performance of plastic surgery residents in 2022, ChatGPT ranked in the 49th percentile for first-year residents, but its relative standing fell sharply against residents in higher years of training, dropping to the 0th percentile for 5th- and 6th-year residents.

Another study was conducted by researchers in Ref.⁴⁰, who aimed to assess whether ChatGPT could score equivalently to human candidates in a virtual objective structured clinical examination in obstetrics and gynecology. Seven structured examination questions were selected, and the responses of ChatGPT were compared to the responses of two human candidates and evaluated by fourteen qualified examiners. ChatGPT received an average score of 77.2%, while the average historical human score was 73.7%.

Moreover, the authors in Ref.⁴¹ instructed ChatGPT to answer each item of the 24-item diabetes knowledge questionnaire with a clear “Yes” or “No”, followed by a concise two-sentence rationale. The authors found that ChatGPT successfully answered all the questions.

In Ref.⁴², the researchers were interested in evaluating the performance of ChatGPT on open-ended clinical reasoning questions. To that end, fourteen multi-part cases were selected from clinical reasoning exams administered to first- and second-year medical students and provided to ChatGPT. Each case comprised 2–7 open-ended questions and was shown to ChatGPT twice. ChatGPT achieved or surpassed the pre-established passing score of 70% in 43% of the runs (12 out of 28), registering an average score of 69%.

Some studies showed remarkable performance. For instance, one study evaluated the performance of ChatGPT in the medical physiology university examination of phase I MBBS⁴³. In this investigation, ChatGPT correctly answered 17 out of 20 multiple-choice questions, while providing a comprehensive explanation for each one. For their part, the researchers in Ref.⁴⁴ proposed a four-grade system to classify the answers of ChatGPT, namely: comprehensive, correct but inadequate, mixed with correct and incorrect/outdated data, and completely incorrect. ChatGPT showed 79% and 74% accuracy when answering questions related to cirrhosis and hepatocellular carcinoma, respectively. However, only 47% and 41% of the answers were classified as comprehensive.

Conversely, in another study⁴⁵ in which ChatGPT was exposed to the family medicine course’s multiple-choice exam of Antwerp University, only 2/125 students performed worse than ChatGPT. Since the questions were prompted in Dutch, the proportion of Dutch text used in ChatGPT’s training may be a factor worth considering in explaining this discordant result.

Another study, Ref.⁴⁶, evaluated ChatGPT’s performance on standardized admission tests in the United Kingdom, including the BioMedical Admissions Test (BMAT), Test of Mathematics for University Admission (TMUA), Law National Aptitude Test (LNAT), and Thinking Skills Assessment (TSA). A dataset of 509 multiple-choice questions from these exams, ranging from 2019 to 2022, was used. The results varied among tests: the percentage of correct answers ranged from 5 to 66% for BMAT, from 11 to 22% for TMUA, from 36 to 53% for LNAT, and from 42 to 60% for TSA. The authors concluded that while ChatGPT demonstrated potential as a supplemental tool for areas assessing aptitude, problem-solving, critical thinking, and reading comprehension, it showed limitations in scientific and mathematical knowledge and applications.

The results shown by most of these studies are in line with ours: the average score of ChatGPT is between 4 and 5 (on a five-point scale) when answering medical-related questions. However, GPT-4 performance was not evaluated in these studies. Based on our results, we can postulate that GPT-4 would achieve higher accuracy than that obtained by ChatGPT. In addition, to address some of the limitations identified by our evaluators, such as the models' use of language that can lack technical precision, LLMs could perform better if trained or fine-tuned with biomedical texts.

A large part of the concerns and doubts that arise from using these models are due to regulatory and ethical issues. Some of the ethical dilemmas have been highlighted in Ref.⁴⁷. For instance, the authors pointed out that LLMs reflect any false, outdated, and biased data on which the model was trained, and that they may not reflect the latest guidelines. Some authors have pointed out the risks of perpetuating biases if the model has been trained on biased data⁴⁸,⁴⁹. Other authors go further and declare that these types of models should not be used in clinical practice⁵⁰. The motivation behind this statement lies in the presence of biases such as clinician bias, which may exacerbate racial-ethnic disparities, or “hallucinations”, meaning that ChatGPT expresses high levels of confidence in its output even when the information in the prompt is insufficient or masked. According to the authors, this phenomenon could lead users to place unwavering trust in the output of chatbots, even if it contains unreliable information. This was also pointed out by our evaluators, as shown in the Supplementary Material ‘Questionnaire’. The firmness with which these models justify erroneous reasoning may limit their potential usefulness. Finally, the authors also claimed that these models are susceptible to “Falsehood Mimicry”, that is, the model will attempt to generate an output that aligns with the user’s assumption rather than asking clarifying questions. Falsehood mimicry and hallucinations may limit the potential use of these models as diagnostic decision support systems (DDSS). For instance, a clinical trial⁵¹ compared the diagnostic accuracy of medical students for typical rheumatic diseases, with and without the use of a DDSS, and concluded that no significant advantage was observed from the use of the DDSS. Moreover, the researchers reported that students accepted false DDSS diagnostic suggestions in a substantial number of situations. This phenomenon could be exacerbated when using LLMs, and therefore should be studied with caution.

In our study, due to the nature of the questions, we were unable to assess racial or ethnic disparities. However, we did not find any gender bias when considering the clinical case questions. Finally, a relevant study that looks in depth at the biases that can arise when using LLMs can be found in Ref.⁵².

Regarding regulation, the arrival of these models has led to greater efforts being made to regulate the use of AI. The EU AI Act⁵³ is a good example of this. According to our results, in these early stages of LLMs, the corpus used for training, as well as the content generated by the models, should be carefully analyzed.

During the writing of this manuscript, new articles have emerged. For instance, in Ref.⁵⁴ the authors studied the performance of ChatGPT when prompted with the most searched keywords related to seven rheumatic and musculoskeletal diseases. The usefulness of each ChatGPT answer for patients was rated on a scale from 1 to 7 by two raters.

Finally, further analysis is needed to explore these observations and understand their implications for the development and use of AI in medical education and practice.

Limitations

Two chatbots were primarily used in this study, ChatGPT and GPT-4, both owned by OpenAI. However, other LLMs, such as BARD and Med-PaLM2 by Google, Claude 2 by Anthropic, and LLaMA and LIMA by Meta, are in development, and some of them are publicly available. To provide a broader overview of other LLMs, the accuracy of BARD (60.84%) and Claude 2 (79.72%) was calculated and compared against ChatGPT/GPT-4 in the Supplementary Material File Section ‘LLM comparison’ and Supplementary Fig. 5.
Krippendorff’s alpha coefficient and Kendall’s coefficient of agreement oscillate between 0.225 and 0.452 for the clinical reasoning of GPT-4, although the most frequent score for five of the six evaluators is 5 (the percentage of scores of four and five oscillates between 87.41% and 93.70%), see Fig. 1. This phenomenon is known as the “Kappa paradoxes” (i.e., ‘high agreement, but low reliability’)⁵⁵,⁵⁶, and tends to appear in skewed distributions such as the one presented in this study. More details can be found in Ref.³¹. In this study, since the ChatGPT clinical reasoning score distribution is less skewed, the reliability coefficient values are higher, between 0.624 and 0.783, than the ones obtained with GPT-4. However, when considering Gwet’s AC2 coefficient, the trend is reversed, 0.924 vs. 0.759, with higher inter-rater agreement in GPT-4 compared to ChatGPT. These large differences between inter-rater reliability indices have been observed in simulation studies with skewed distributions⁵⁷.
To ensure reproducibility and facilitate the comparison of results, each question could have been submitted twice to ChatGPT/GPT-4, a strategy supported by previous research⁴⁴ and a default feature of BARD. However, in the tests we conducted, the responses were consistent across iterations, and submitting every question twice would have doubled the workload of the evaluators, so we chose to include more questions from a single run rather than fewer questions run multiple times. In addition, according to Ref.⁴⁴, 90.48% of “regenerated questions” produced two similar responses with similar grading.
The format of each question could have been transformed from multiple choice to open-ended. With this approach, it would have been possible to delve deeper into ChatGPT’s clinical reasoning. As explained in the previous point, this would have doubled the workload. Additionally, there are no open-ended questions in the MIR exams.
When conducting such studies, it is crucial to consider the evolution of knowledge over time. Evaluating older questions with models trained on recent data may reveal disparities compared to previously accepted and conventional beliefs. Therefore, accounting for this temporal aspect is essential to ensure the accuracy and relevance of the study findings. Moreover, although not extensively explored in this research, one of the key concepts when using LLM chatbots is what is known as the prompt or input text that is entered into the model. Depending on how well-defined the prompt is, the results can vary significantly. In this study, we tried to adhere closely to the official question while minimizing any unnecessary additional text.
Another identified limitation of the study is the absence of an evaluation of the models’ clinical reasoning by medical students. This would have allowed us to determine whether students can recognize the main issues discussed above (e.g., bias, falsehood mimicry, hallucinations), and to analyze to what extent these limitations may affect the models' usefulness.
One of the main criticisms of the evaluators in assessing the LLM responses was the use of non-technical language. This could have been remedied, in part, by using prompt engineering, that is, by modifying the initial input of the model and asking for a more technical response.
We have explored the performance of GPT-4/ChatGPT on Spanish-language questions. However, different authors have suggested that such performance is language-dependent⁵⁸. Hence, special caution should be taken when extrapolating the results to other languages.
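
The ‘high agreement, but low reliability’ pattern described above for the inter-rater coefficients can be reproduced on toy data. The following Python sketch is illustrative only and is not the analysis performed in this study: it assumes two hypothetical raters and a binary high/low score instead of six evaluators and an ordinal scale, and it substitutes Cohen's kappa and the unweighted Gwet's AC1 for the Krippendorff's alpha, Kendall's coefficient and Gwet's AC2 reported here. The function names and the rating counts are invented for the example.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: chance agreement from the product of each rater's marginals."""
    n = len(a)
    po = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (po - pe) / (1 - pe)


def gwet_ac1(a, b):
    """Gwet's AC1 (unweighted); assumes at least two categories are observed."""
    n = len(a)
    po = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    cats = set(a) | set(b)
    # Average marginal proportion of each category across the two raters.
    pi = [(ca[k] + cb[k]) / (2 * n) for k in cats]
    pe = sum(p * (1 - p) for p in pi) / (len(cats) - 1)
    return (po - pe) / (1 - pe)


def make_ratings(both_high, both_low, a_only_high, b_only_high):
    """Build two hypothetical rating lists from the cells of a 2x2 table."""
    a = (["high"] * both_high + ["low"] * both_low
         + ["high"] * a_only_high + ["low"] * b_only_high)
    b = (["high"] * both_high + ["low"] * both_low
         + ["low"] * a_only_high + ["high"] * b_only_high)
    return a, b


# Hypothetical scenarios: both have 90% raw agreement, only the skew differs.
scenarios = {
    "skewed (almost all 'high')": make_ratings(90, 0, 5, 5),
    "balanced": make_ratings(45, 45, 5, 5),
}

for name, (a, b) in scenarios.items():
    print(name,
          "agreement:", round(percent_agreement(a, b), 3),
          "kappa:", round(cohen_kappa(a, b), 3),
          "AC1:", round(gwet_ac1(a, b), 3))
```

In both toy scenarios the raters agree on 90% of the items. With the balanced scores, kappa and AC1 both come out at about 0.8. With the skewed scores, kappa falls to slightly below zero while AC1 stays near 0.89. This is the same direction of divergence between chance-corrected indices that the limitation above describes for skewed score distributions.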
