ChatGPT Passed a Major Medical Exam, but Just Barely


Anyone anxiously holding their breath for a competent robot doctor may need to wait a bit longer. A group of AnsibleHealth AI researchers recently put OpenAI’s ChatGPT to the test against a major medical licensing exam and the results are in. The AI chatbot technically passed, but by the skin of its teeth. When it comes to medical exams, even the most impressive new AI still performs at a D level. The researchers say that lackluster showing is nonetheless a landmark achievement for AI.

The researchers tested ChatGPT on the United States Medical Licensing Exam (USMLE), a standardized series of three exams required for U.S. doctors vying for a medical license. ChatGPT managed to score between 52.4% and 75% across all three levels of the exam. That might not sound great to all of the overachievers out there, but it’s about on par with the 60% passing threshold for the exam. Researchers involved in the study claim this marks the first time AI was able to perform at or near the passing threshold for the notoriously difficult exam. Crucially, ChatGPT was able to pass without any extra specialized inputs from human trainers.

“Reaching the passing score for this notoriously difficult expert exam, and doing so without any human reinforcement, marks a notable milestone in clinical AI maturation,” the authors wrote in the journal PLOS Digital Health.

Mediocre test scores aside, the researchers praised ChatGPT for its ability to craft authentic-sounding, original answers. ChatGPT managed to create “new, non-obvious, and clinically valid insights” for 88.9% of its responses and appeared to show evidence of deductive reasoning, chain of thought, and long-term dependency skills. Those findings appear somewhat unique to ChatGPT and its particular style of AI learning. Unlike previous generations of systems that use deep learning models, ChatGPT relies on a large language model trained to predict a sequence of words based on the context of the words that came before. That means, unlike other AIs, ChatGPT can actually generate sequences of words that weren’t previously seen by the algorithm and that could make some coherent sense.
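To make the idea of predicting the next word from context concrete, here is a deliberately tiny sketch in Python. It is not how ChatGPT works internally: it swaps in a simple bigram counter (each word predicted only from the one before it), and the corpus, function names, and parameters are all hypothetical, chosen for illustration. Even so, it shows how word-by-word prediction can produce a sentence that never appeared in the training text.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def generate(counts, start, max_words=4):
    """Build a sequence by repeatedly predicting the next word."""
    out = [start.lower()]
    for _ in range(max_words - 1):
        nxt = predict_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)

# Hypothetical toy corpus, invented for this illustration.
corpus = ("the patient has a fever . the patient has a fever and a cough . "
          "the doctor has a plan")
model = train_bigram(corpus)

print(predict_next(model, "patient"))  # has
print(generate(model, "doctor"))       # doctor has a fever
```

The generated phrase “doctor has a fever” does not occur anywhere in the toy corpus; the model stitched it together from word-to-word statistics. That is also a small-scale hint at why such systems can produce fluent text that is not grounded in anything they were shown.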

The tricky USMLE exams test participants on basic science, clinical reasoning, medical management, and bioethics. They’re most often taken by medical students and physicians in training. These exams are also standardized and regulated, which makes them particularly well suited to test out ChatGPT’s capabilities, the researchers said. One thing the exams definitely aren’t is easy. Human students typically spend around 300-400 hours stressfully poring over dense scientific literature and testing material in preparation just for the Step 1 exam, the first of the three.

Surprisingly, ChatGPT managed to outperform PubMedGPT, another large language model AI trained exclusively on biomedical literature. That may seem counterintuitive at first, but the researchers say ChatGPT’s more generalized training may actually give it a leg up because it’s potentially exposed to a broader range of clinical content like patient-facing disease primers or drug package inserts. The researchers optimistically believe ChatGPT’s passable grade could hint at a future where AI systems play an assisting role in medical education. That’s already happening on a small scale, they write, citing a recent example of AnsibleHealth clinicians using the tool to rewrite dense, jargon-filled reports.

“Our study suggests that large language models such as ChatGPT may potentially assist human learners in a medical education setting, as a prelude to future integration into clinical decision-making,” the researchers said.

In a rather meta twist, ChatGPT wasn’t just tasked with taking the medical exam. The system was also involved in drafting the eventual research paper documenting its performance. Researchers say they interacted with ChatGPT “much like a colleague” and leaned on it to synthesize and simplify their draft and even provide counterpoints.

“All of the co-authors valued ChatGPT’s input,” Tiffany Kung, one of the researchers, wrote.

ChatGPT: Mediocre at writing, abysmal at math

ChatGPT has added an impressive number of passing grades to its educational trophy wall in recent months. Last month, ChatGPT managed to score between a B and B minus on an MBA-level exam given to business students at the prestigious Wharton School of the University of Pennsylvania. Right around the same time, the AI achieved a passing score on a law exam given to students at the University of Minnesota Law School. In the law exam case, ChatGPT scraped by with a C+.

“Alone, ChatGPT would be [a] pretty mediocre law student,” lead study author Jonathan Choi said in an interview with Reuters. “The bigger potential for the profession here is that a lawyer could use ChatGPT to produce a rough first draft and just make their practice that much more effective.”

ChatGPT might be able to eke out passable scores in exams focused on writing and reading comprehension, but mathematics is another beast entirely. Despite its impressive ability to bust out academic papers and semi-convincing prose, researchers say the AI only performs at roughly a sixth-grade level when it comes to math. ChatGPT fares even worse when it’s asked basic arithmetic problems in natural language format. That stumbling stems from its predictive large language model training. ChatGPT will, of course, confidently provide you an answer to your math problem, but it could be completely divorced from reality.

ChatGPT’s at times wacko answers are what senior Google engineers and others in the field have referred to, cautiously, as AI “hallucinations.” These AI hallucinations create answers that seem convincing but are partially or completely made up, which isn’t exactly a great sign for anyone looking to rely on authoritative AIs in high-stakes fields like medicine and law.

“It [ChatGPT] acts like an expert, and sometimes it can provide a convincing impersonation of one,” University of Texas professor Paul von Hippel said in a recent interview with The Wall Street Journal. “But often it is a kind of b.s. artist, mixing truth, error and fabrication in a way that can sound convincing unless you have some expertise yourself.”
