A serious look at the future of AI medical advice
[Paper: A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?]
Summary by Oreolorun Olu-Ipinlaye
When ChatGPT came out a couple of years ago, some people said AI was going to take everyone's jobs. Fast forward two years and, well, that hasn’t happened, and it probably won't anytime soon — if ever. Don’t get me wrong, these models are super useful, but they still hallucinate, making them unreliable for performance-critical fields like medicine. We’ve seen a couple of iterations of OpenAI models (and other LLMs) since then, each bringing some improvements, but how close are we to a future where these models are actually used for performance-critical tasks? How soon might we be able to trust them to make accurate clinical diagnoses? Today’s paper aims to investigate this by looking at OpenAI’s latest preview release, o1.
As you might know, o1 is a generative model developed by OpenAI and released separately from the GPT series. What makes o1 stand out is its ability to reason via chain of thought (CoT). (If you’re curious about how CoT works, Adrian’s summary from a few weeks ago gives a good rundown.) As you’d expect from a new model, o1 outperforms GPT-4 across various general language tasks, but what’s really interesting is that, according to OpenAI, its CoT reasoning makes it vastly better at technical problems in subjects such as science and math. They claim it even beats human experts on PhD-level science questions! But how does it perform in a specialized field such as medicine?
The authors of today’s paper put o1’s clinical skills to the test using 35 medical datasets commonly used for evaluating LLMs, plus two custom datasets built from questions curated from The Lancet, The New England Journal of Medicine, and Medbullets. These custom datasets contain questions that are more technical than the 35 standard ones, and the authors’ goal in using them was to push o1’s CoT reasoning capabilities to the limit. They divided the datasets into three categories to evaluate the model on the following aspects (a toy example of each follows the list):
Understanding: Can the model pull out relevant info from a clinical query and give a clear summary?
Reasoning: Can it think logically and reach accurate conclusions based on the info provided?
Multilinguality: Can it handle tasks where the prompt and response languages are different?
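To give a rough feel for what each category covers, here's a toy sketch of the kind of prompt that might fall under each aspect. The example questions below are my own inventions for illustration, not items from the paper's 37 datasets.

```python
# Toy illustration only: these prompts are invented to show roughly what each
# evaluation aspect covers; the paper draws its questions from 37 real datasets.
example_tasks = {
    "understanding": (
        "Extract the patient's key symptoms and relevant history from the "
        "following clinical note, and summarize them in two sentences: <note text>"
    ),
    "reasoning": (
        "A 45-year-old woman presents with fatigue, weight gain, and cold "
        "intolerance. Which diagnosis is most likely? "
        "(A) Hyperthyroidism (B) Hypothyroidism (C) Anemia (D) Depression"
    ),
    "multilinguality": (
        "Answer the following question in Spanish: what are common first-line "
        "treatments for type 2 diabetes?"
    ),
}

for aspect, prompt in example_tasks.items():
    print(f"[{aspect}] {prompt}\n")
```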
The researchers compared o1 to four other LLMs — GPT-4, GPT-3.5, MEDITRON, and Llama 3 — and the results are quite interesting. When it comes to reasoning, o1 is superior to the other LLMs in all datasets except one. But when it comes to understanding, o1’s performance in many datasets wasn't as dominant, and in one particular dataset it was resoundingly outmatched by Llama 3 by close to 20 percentage points. In multilinguality, o1 was the clear winner with superior performances in all languages surveyed.
The authors also compared chain-of-thought prompting with direct prompting and found that CoT prompting marginally improved o1’s performance, which isn’t surprising: if the model already reasons via chain of thought, CoT prompting should play to its strengths. And what about hallucinating? Surely the fact that o1 can reason about a problem before providing a response means hallucination is a thing of the past, right? Well, not quite. In fact, in many datasets o1 hallucinated more than GPT-4!
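To make that distinction concrete, here's a minimal sketch of direct prompting versus chain-of-thought prompting on a medical multiple-choice question. The question and the `query_model` helper are my own placeholders, not the prompts or API used in the paper.

```python
# Minimal sketch only: `query_model` is a stand-in for whichever LLM API you use,
# and the question below is invented, not taken from the paper's datasets.

def query_model(prompt: str) -> str:
    """Placeholder for a call to an LLM; swap in your API client of choice."""
    raise NotImplementedError

question = (
    "A 62-year-old man presents with crushing chest pain radiating to the left arm. "
    "Which initial test is most appropriate? "
    "(A) Chest X-ray (B) ECG (C) D-dimer (D) Echocardiogram"
)

# Direct prompting: ask for the answer straight away.
direct_prompt = question + "\nAnswer with the letter of the best option only."

# Chain-of-thought prompting: ask the model to reason step by step before answering.
cot_prompt = (
    question
    + "\nThink through the differential step by step, then give the letter of the "
    "best option on the final line."
)

direct_answer = query_model(direct_prompt)
cot_answer = query_model(cot_prompt)
```

The difference is only in the prompt; CoT prompting simply invites the model to lay out intermediate reasoning before committing to an answer, which is why it feels like a natural fit for a model that already reasons this way.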
I find it quite interesting that the authors of this paper didn’t include any of Anthropic’s models (the various versions of Claude) in their evaluations, even though Claude 3.5 Sonnet has performed brilliantly on a lot of LLM evaluations. It's also clear that there isn't a definitive framework for evaluating LLMs on clinical topics; in this paper, the authors compared four models to o1 in some evaluations but only two in others (such as the hallucination evaluation).
So are we closer to an AI doctor? For now I'll say not quite. The fact that o1 hallucinates is a sign that GenAI still isn’t mature enough to be used in performance-critical tasks. That being said, an LLM with CoT reasoning is definitely a step in the right direction, and with further refinement such models could become reliable tools in the hands of doctors. Then maybe, just maybe, they’ll take over the field of medicine.