The world is witnessing the unfolding of what has come to be referred to as the artificial intelligence war. Global tech companies like Google, Alibaba, Baidu, OpenAI, and Microsoft are not leaving any stone unturned to gain an edge in the now competitive sector.
ChatGPT, launched by Microsoft-backed OpenAI, is by far the most successful generative AI software and doctors recently put it to the test against a primary medical licensing exam in the United States.
Globally, people have been testing ChatGPT’s abilities and competence in various subjects and sectors. ChatGPT has often surpassed expectations when prompted by different questions and requests in varying setups.
The latest and most awaited test was conducted by a group of AnsibleHealth AI researchers and involved one of the most respected medical licensing exams.
According to the results, the generative AI software barely passed the exam, casting doubts on the future of competent robot doctors, at least for now.
ChatGPT, the most advanced of all the available generative AI platforms, could only manage a grade D. Despite the chatbot’s lackluster performance, AnsibleHealth AI researchers reckon that it is no small feat, calling it “a landmark achievement for AI.”
ChatGPT Mediocre Test Scores Aside – Researchers Are Impressed
Researchers prompted ChatGPT on a set of three exams that doctors in the US must pass before receiving a medical license. According to the results, the chatbot scored between 52.4% and 75% across all three exam levels.
While researchers were impressed by the results of the AI tool, the world’s smartest doctors could find its grades mediocre. Nonetheless, ChatGPT’s performance was not far from the 60% needed to pass the US Medical Licensing Exam.
This was the first time generative AI software took the entire medical licensing exam, which is known to be notoriously challenging.
One important point to note is that ChatGPT passed the exam without requiring any additional specialized inputs from human trainers.
“Reaching the passing score for this notoriously difficult expert exam, and doing so without any human reinforcement, marks a notable milestone in clinical AI maturation,” the researchers said via the journal PLOS Digital Health.
Despite the unremarkable score, AnsibleHealth AI researchers lauded ChatGPT for crafting natural-sounding responses to the questions and creating what they termed “new, non-obvious, and clinically valid insights” in 88.9% of its answers. To a large extent, those answers exhibited a clear flow of deductive reasoning and long-term dependency skills.
Unlike earlier deep learning systems, ChatGPT is based on a large language model that predicts word sequences by considering the preceding context.
This unique ability allows ChatGPT to produce original, coherent word combinations that the algorithm hasn’t encountered before.
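To make the idea of predicting the next word from previous context concrete, here is a toy sketch in Python. It is not how ChatGPT works internally: it simply counts which word follows which in a tiny sample text, whereas a large language model learns far richer patterns over much longer contexts. The function names and the sample sentence are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follow_counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follow_counts[current_word][next_word] += 1
    return follow_counts


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(model, start, max_words=5):
    """Extend `start` one predicted word at a time, up to `max_words` words."""
    output = [start.lower()]
    while len(output) < max_words:
        next_word = predict_next(model, output[-1])
        if next_word is None:
            break
        output.append(next_word)
    return " ".join(output)


# Hypothetical training text, chosen only for illustration.
sample_text = (
    "the patient has a fever the patient has a cough "
    "the doctor orders a test"
)
model = train_bigram_model(sample_text)
print(generate(model, "the", max_words=4))
```

Even this crude counter can string words together in an order that never appears verbatim in its training text, which hints at why a much larger model can produce word combinations it has not encountered before.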
The challenging USMLE exams, typically taken by medical students and physicians in training, assess candidates on fundamental science, clinical reasoning, medical management, and bioethics.
Given their standardized and regulated nature, these exams serve as an ideal platform to evaluate ChatGPT’s potential, according to researchers.
It is worth noting that the USMLE exams are anything but simple. Human learners often spend about 300-400 hours rigorously studying complex scientific literature and test materials just for Step 1, the first of the three exams.
Interestingly, ChatGPT surpassed PubMedGPT, a large AI language model specifically trained on biomedical literature.
This may seem startling, but researchers suggested that ChatGPT’s general training might offer an advantage due to exposure to a wider variety of clinical content, such as patient-oriented disease guides or medication package inserts.
AnsibleHealth AI researchers are hopeful ChatGPT’s satisfactory performance could provide a sneak peek into a future where AI systems support medical education.
They already see this happening, although on a small scale, pointing to a recent instance where AnsibleHealth clinicians used the tool to rewrite complex reports filled with technical jargon.
“Our study suggests that large language models such as ChatGPT may potentially assist human learners in a medical education setting, as a prelude to future integration into clinical decision-making,” the researchers said in the report.
In a fascinating turn of events, ChatGPT didn’t just take the medical exam; it also contributed to drafting the research paper documenting its performance.
The researchers said they collaborated with ChatGPT as if it were a colleague, using it to refine their draft, simplify the text, and even offer counterarguments.
“All of the co-authors valued ChatGPT’s input,” Tiffany Kung, one of the researchers, said.
Why ChatGPT is Far From A Medical AI Assistant
The fact that ChatGPT passed the United States Medical Licensing Exam does not mean doctors can consult the chatbot in their practice, at least not at the moment, although a report recently published in the New England Journal of Medicine (NEJM) states otherwise.
Another study by the Human-Centered Artificial Intelligence (HAI) group at Stanford concluded that it would not be wise for a doctor to consult the generative AI chatbot.
The researchers reached their conclusions after presenting the AI with 64 scenarios designed to evaluate its usefulness and safety. “You are assisting doctors with their questions,” the researchers told GPT-4 when setting up the scenario.
The special report from NEJM revealed that GPT-4 “generally provides useful responses,” without delving into specifics.
As for Stanford researchers, they found that GPT-4’s answers matched the accurate clinical response 41% of the time.
A rate of 0.41 would place you among the top batters in baseball history. In medicine, however, it shows that acing a medical exam does not guarantee exceptional medical practice, provided the data from the Stanford researchers holds.
Nonetheless, GPT-4’s capabilities are remarkable. There was a significant upgrade in performance from GPT-3.5, the prominent OpenAI software introduced to consumers by Microsoft.
GPT-3.5’s responses agreed with expected answers only 21% of the time when instructed to “Act as an AI Doctor.” Even in baseball, this would result in a fast-tracked demotion to the minor leagues.
Furthermore, both ChatGPT versions performed on par with the average doctor when instructed, “First, do no harm.”
A National Academy of Medicine report on diagnostic errors noted that, by a conservative estimate, 5% of US adults are misdiagnosed annually, which occasionally leads to disastrous outcomes.
In comparison, 91% of GPT-3.5’s responses and 93% of GPT-4’s responses were judged safe, with the remaining percentages attributed to AI “hallucinations.”
Stanford clinician reviewers could not determine whether GPT-3.5 answers matched with the recognized clinical responses 27% of the time.
For GPT-4, OpenAI’s flagship AI tool, the share of “can’t tell” responses was slightly more than 29%.
The research conducted by Stanford was published online via a blog post titled, “How Well Do Large Language Models Support Clinician Information Needs?”
The study examined questions compiled during the “Green Button” initiative, which assessed real patient data from the university’s electronic health record (EHR).
The initiative was meant to offer clinicians instant access to relevant information. It’s worth noting that doctors do not use an actual button; they often type their queries.
OpenAI has trained its ‘Generative Pre-trained Transformer’ chatbots on complementary sources like the medical literature and data found on the web.
OpenAI is not the only company with generative AI software that can be put to medical questions and scenarios. Stanford informaticists Nigam Shah and Saurabh Gombar, in collaboration with Brigham Hyde, have co-founded Atropos Health, a tech startup that provides the same kind of on-demand, practical evidence to clinicians.
Although GPT is nowhere close to the competence of most of the world’s top doctors, researchers across the board agree that it possesses immense potential.
“GPT-4 is a work in progress,” stated the special report authors, who have all worked with the technology on behalf of its backer Microsoft, “and this article just barely scratches the surface of its capabilities.”
The AI wars are just getting started, with Google expected to release Med-PaLM 2, a generative AI software product with the ability to assist in diagnosis. However, the product will initially roll out to a select group of Google’s cloud computing clients in the coming months.
ChatGPT’s AI “Hallucinations” Are Its Biggest Flaw
Many researchers and academicians have in recent months subjected OpenAI’s ChatGPT to many tests. Its results have, to say the least, been impressive, adding to a stack of passing grades.
In March, ChatGPT managed to score between a B and B minus in an exam from the prestigious Wharton School of the University of Pennsylvania for MBA-level business students.
The generative AI chatbot also managed a passing score in a law exam taken by students at the University of Minnesota Law School around the same time.
“Alone, ChatGPT would be a pretty mediocre law student,” lead study author Jonathan Choi told Reuters in an interview. “The bigger potential for the profession here is that a lawyer could use ChatGPT to produce a rough first draft and just make their practice that much more effective.”
ChatGPT may come across as an average performer on literature and comprehension-based exams, but its weaknesses become far more apparent when it comes to mathematics.
ChatGPT is renowned for its remarkable capacity to generate academic papers and somewhat coherent writing, yet researchers reveal that its mathematical performance is equivalent to that of a 6th grader.
ChatGPT struggles even more when presented with basic arithmetic questions in conversational language. This issue arises from its predictive, large-scale language model training.
Although ChatGPT confidently offers solutions to math problems, they may diverge entirely from the actual answer.
These peculiar responses from ChatGPT are cautiously referred to as AI “hallucinations” by senior Google engineers and other experts.
“It [ChatGPT] acts like an expert, and sometimes it can provide a convincing impersonation of one,” University of Texas professor Paul von Hippel said in a recent interview with The Wall Street Journal. “But often it is a kind of b.s. artist, mixing truth, error, and fabrication in a way that can sound convincing unless you have some expertise yourself.”
Such hallucinations can produce seemingly convincing yet partially or entirely fabricated answers, raising concerns for those seeking reliable AI assistance in critical fields like medicine and law.