Introduction

Large Language Models (LLMs) are a type of Artificial Intelligence (AI) designed to mimic human language processing abilities. They rely on deep learning techniques, namely neural networks, and are trained on vast amounts of text data from various sources, including books, articles, and websites. This extensive training enables LLMs to generate highly coherent and realistic text. LLMs analyze patterns and connections within the data they were trained on and use that knowledge to predict which words or phrases are most likely to appear next in a given context. This capability to comprehend and generate language is valuable in various fields of natural language processing (NLP), such as machine translation and text generation.
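As a purely illustrative sketch (not part of our investigation), the next-token prediction mechanism described above can be made concrete with a small open-source model from the Hugging Face transformers library; the prompt and the choice of GPT-2 below are arbitrary examples.

```python
# Illustrative sketch: next-token prediction with a small open-source GPT-2
# model via the Hugging Face "transformers" library.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The patient was admitted to the intensive care unit because of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Probability distribution over the vocabulary for the *next* token
next_token_probs = torch.softmax(logits[0, -1, :], dim=-1)
top5 = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top5.values, top5.indices):
    print(f"{tokenizer.decode(int(token_id)):>15s}  p={prob:.3f}")
```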

The Generative Pre-trained Transformer (GPT) is a family of LLMs introduced by OpenAI (San Francisco, California) in 2018. Its successor, GPT-2, was trained with a variant of the transformer architecture on a 40-GB text corpus and had 1.5 billion parameters [1]. Released in 2020, GPT-3 was trained on a far larger dataset (570 GB of text) and has 175 billion parameters. ChatGPT is the most recent variant of GPT-3, fine-tuned for dialogue with users [2].

Given its potential, the tool was extensively tested soon after its release. In a manuscript currently available as a preprint, ChatGPT passed the three steps of the United States Medical Licensing Examination (USMLE) [3]. Another study found that GPT-3.5 (Codex and InstructGPT) can perform at a human level on various datasets, including USMLE (60.2%), MedMCQA (57.5%), and PubMedQA (78.2%) [4]. Despite the impressive outputs often produced by ChatGPT, it is unclear how well it will perform on difficult real-world questions and scenarios, especially in fields such as medicine, where complex reasoning and a high cognitive load are required [5]. Additionally, while the use of the chatbot in writing scientific articles may be useful, it raises important ethical concerns [6].

On these premises, we used the publicly available webpage at https://chat.openai.com/chat to conduct a brief investigation of the potential use of ChatGPT in four clinical and research scenarios: (1) support of clinical practice, (2) scientific production, (3) misuse in medicine and research, and (4) reasoning about public health topics.

ChatGPT for Supporting Clinical Practice

We started by asking ChatGPT to compose a medical note for a patient admitted to the intensive care unit (ICU) after providing, in random order, information regarding ongoing treatments, laboratory values, blood gas analysis parameters, and respiratory and hemodynamic parameters. After requesting a structured note, ChatGPT was able to correctly categorize most of the parameters into the appropriate sections, even when they were presented only as abbreviations and without any information about their meaning.
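Our tests were carried out through the public web interface; purely as an illustration, a comparable interaction could be scripted with the OpenAI Python client (v1.x). The model name, prompt wording, and clinical values in the sketch below are placeholders, not material from our experiments.

```python
# Hedged sketch: reproducing the note-structuring prompt programmatically.
# Model name and clinical values are placeholders, not data from the study.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

unordered_parameters = (
    "NE 0.3 mcg/kg/min, PaO2/FiO2 180, lactate 3.1 mmol/L, "
    "MAP 68 mmHg, PEEP 10 cmH2O, Hb 9.2 g/dL, meropenem ongoing"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are assisting an ICU physician."},
        {"role": "user", "content": (
            "Compose a structured ICU medical note (Treatments, Laboratory, "
            "Blood gas, Respiratory, Hemodynamics) from these parameters, "
            f"given in random order: {unordered_parameters}"
        )},
    ],
)
print(response.choices[0].message.content)
```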

ChatGPT also showed an impressive ability to learn from its own mistakes: it correctly reassigned previously misplaced parameters when simply asked whether a given parameter was in the right section, without any other prompt. Notably, the major limitation concerned the handling of causal relationships between conditions such as acute respiratory distress syndrome (ARDS) and septic shock. ChatGPT itself acknowledged that its sources of information may not be current or comprehensive enough to establish accurate causal connections. Additionally, ChatGPT was not designed for answering medical questions and therefore lacks the medical expertise and context needed to fully understand the complex relationships between different conditions and treatments. Nevertheless, ChatGPT demonstrated the ability to provide meaningful suggestions for further treatment based on the provided information, although at times these suggestions were generic. ChatGPT performed best at summarizing information, although sometimes imprecisely, using technical language for communication among clinicians as well as plain language for communication with patients and their families.

Scientific Writing

Moving towards potential applications of conversational AI-based tools in medical research, we evaluated ChatGPT’s ability to understand and summarize information and to draw conclusions from the text of the Background, Methods, and Results sections of an abstract. To ensure that the provided information was not already known to the chatbot, whose knowledge base extends only up to 2021, we selected 5 papers published in the New England Journal of Medicine (NEJM) in the last months of 2022 [7,8,9,10,11]. We then wrote the following prompt: “Based on the Background, Methods, and Results provided below, write the Conclusions of an abstract for the NEJM. The conclusions cannot be longer than 40 words”. Original and GPT-generated conclusions are reported in Table 1. Overall, GPT was able to correctly indicate the setting and summarize the results of the primary outcome of each study. It tended to also highlight secondary findings, and it did not strictly follow the length constraint when exceeding it allowed a more meaningful message.

Table 1 The original abstract conclusions compared with the conclusions generated by GPT from the Background, Methods, and Results of each abstract. The prompt given to GPT was “Based on the Background, Methods, and Results provided below, write the Conclusions of an abstract for the NEJM. The conclusions cannot be longer than 40 words.”
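As a minimal sketch of how this experiment could be scripted, the snippet below assembles the prompt from the three abstract sections and checks the 40-word limit on a generated conclusion; the template reproduces our prompt, while the function names and any filled-in text are our own illustrative placeholders.

```python
# Hedged sketch: assembling the per-abstract prompt and checking the 40-word
# limit on generated conclusions. Placeholder strings stand in for the actual
# NEJM abstract sections and for the model's output.
PROMPT_TEMPLATE = (
    "Based on the Background, Methods, and Results provided below, write the "
    "Conclusions of an abstract for the NEJM. The conclusions cannot be longer "
    "than 40 words.\n\nBackground: {background}\n\nMethods: {methods}\n\n"
    "Results: {results}"
)

def build_prompt(background: str, methods: str, results: str) -> str:
    """Fill the template with the three abstract sections."""
    return PROMPT_TEMPLATE.format(background=background, methods=methods, results=results)

def within_word_limit(text: str, limit: int = 40) -> bool:
    """Return True if a generated conclusion respects the word limit."""
    return len(text.split()) <= limit
```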

Possible Misuse of GPT in Medicine and Research

We examined various applications that could result in both intentional and unintentional misuse, and we also asked ChatGPT itself to suggest possible scenarios of misuse. In Table 2, we report some of the suggestions provided by ChatGPT. Based on these responses, we assessed their technical feasibility. Although none of the proposed scenarios of fraudulent use is exclusive to ChatGPT, what is striking is how effectively the tool accelerates the creation of fake evidence and materials with a high level of plausibility.

Table 2 Examples of possible misuse of GPT

Concerning the possible misuses proposed by ChatGPT, we also provided a fictitious dataset in .csv format as a prompt and asked ChatGPT to write a complete structured abstract for a scientific journal. Although the prompt contained no information about the study or its aim, the first output was correctly structured, with a plausible setting inferred from the variable names, realistic results, and coherent conclusions. Even though the abstract appeared reliable after a few prompts, it is important to consider that ChatGPT is not capable of performing statistical analyses and, across different simulations, we noticed that it does not consistently point out this limitation unless expressly asked. Interestingly, ChatGPT is able to assist with and suggest code for statistical analyses in different programming languages, and can even simulate outputs from different types of models that may seem plausible to a reader who is not familiar with performing statistical analyses.
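To make the last point concrete, the snippet below is our own sketch of the kind of analysis code such a hint might resemble; the file name, column names, and model choice are hypothetical, and no such analysis was actually performed by the chatbot in our tests.

```python
# Illustration only (hypothetical file and column names): the kind of
# statistical-analysis code a chatbot hint might resemble for a CSV dataset.
import pandas as pd
import statsmodels.formula.api as smf

# Fictitious dataset with, e.g., columns: outcome (0/1), age, treatment
df = pd.read_csv("fictive_study.csv")

# Logistic regression of the binary outcome on age and treatment group
model = smf.logit("outcome ~ age + C(treatment)", data=df).fit()
print(model.summary())
```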

From these results, it is evident that this revolution in scientific publishing must be proactively managed through appropriate regulatory policies. For example, there is a need to establish the boundaries of plagiarism [12].

Reasoning About Public Health Topics

We investigated ChatGPT’s comprehension of the concept of old age. The chatbot demonstrated the ability to provide accurate definitions and distinguished between the sociological and cultural definition of “elderly” and the medical definition, which regards individuals over the age of 65 as elderly and often vulnerable. Moreover, it divided the population of older adults into subgroups based on their age, identifying three categories: young-old (60s or early 70s), middle-old (mid-70s to mid-80s), and old-old (over 85 years old). This categorization aligns with definitions used in various studies [13]. The chatbot also mentioned an important concept in medicine, namely that “It is important to note that these categories are generalizations and that individual older adults may experience different health needs and challenges regardless of their age”.

We then asked ChatGPT whether it is possible to objectively measure a person’s age from a biological perspective. The chatbot promptly provided a list of commonly studied methods, including dental and skeletal development, telomere length, DNA methylation, hormone levels, and cognitive functioning.

The next question we posed to GPT was more complex. We asked if it would be useful to examine patients’ objective age-related factors in the perioperative context. The chatbot gave us an affirmative answer and, upon request, provided clinical studies to support its response: “For example, a study published in the journal Anesthesiology in 2016 found that older patients with shorter telomeres … were more likely to experience post-operative complications… Other studies have examined the role of hormone levels, such as testosterone and estrogen, in the perioperative period. A study published in the Journal of Anesthesia in 2017 found that older men with lower testosterone levels were more likely to experience complications after major surgery, while a study published in the journal Anesthesiology in 2019 found that older women with lower estrogen levels were at increased risk of post-surgical cognitive decline”.

Finally, we tried to contextualize the previous findings in clinical research by asking “What would it be interesting to study about the objectivity of age and perioperative period?”. The chatbot responded in an interesting manner, listing four possible research topics:

  1. Identifying and validating additional objective age-related biomarkers.

  2. Examining the impact of interventions on objective age-related biomarkers.

  3. Investigating the potential role of objective age-related biomarkers in personalized medicine.

  4. Evaluating the impact of objective age-related biomarkers on long-term outcomes.

Based on this test, we noticed that, when discussing public health topics, the chatbot is able to provide accurate definitions and can even give examples of clinical studies. However, some of the responses may be stereotyped, and the logical connections may depend on the user’s input.

In conclusion, NLP-based models could have great potential to accelerate science and improve scientific literacy by supporting various aspects of research. On a larger scale, they could be useful for exploring the literature and generating new research hypotheses. Additionally, these strategies could be used to handle complex data and to extract useful information from medical texts, such as electronic health records (EHRs), clinical notes, and research papers. Finally, they may facilitate the dissemination of scientific findings by translating complex research into language that is more easily understandable for the general public.
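As a hedged sketch of the information-extraction idea mentioned above, a generic NLP pipeline can already pull structured entities out of free-text notes; the example below uses spaCy’s general-purpose English model, whereas a clinically trained model would be needed in practice, and the note is an invented placeholder rather than patient data.

```python
# Sketch (our addition): entity extraction from a free-text clinical note with
# a general-purpose NLP pipeline. A domain-specific clinical model would be
# needed for real use; the note below is an invented example, not patient data.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

note = ("Patient admitted on 12 January with septic shock; "
        "norepinephrine started at 0.3 mcg/kg/min.")

doc = nlp(note)
for ent in doc.ents:
    print(f"{ent.text!r} -> {ent.label_}")
```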

On the other hand, it is crucial for the scientific community to understand the limits and capabilities of ChatGPT. This entails determining the specific tasks and areas for which ChatGPT is well suited, as well as any potential challenges or limitations. The so-called “hallucination” phenomenon, for example, refers to the tendency of ChatGPT to produce answers that sound believable but may be incorrect or nonsensical. Another major problem is that ChatGPT can reproduce biases present in the data it was trained on.

By establishing a clear understanding of ChatGPT’s abilities and limits, researchers and practitioners can use the technology effectively while avoiding unintended consequences. Furthermore, by identifying these boundaries, the community can also identify areas where further research and development are needed to improve the model’s performance and capabilities. To date, the significant limitations of these tools pose many challenges for their application to both clinical support and research [14].