ChatGPT recently passed the U.S. Medical Licensing Exam, but using it for real-world medical diagnosis could quickly turn deadly. ER physicians are anonymizing patient cases and feeding them to ChatGPT after hours to test its reliability and accuracy. Physicians, while curious and optimistic, remain cautious and will not trust ChatGPT, or any other AI or LLM, without vetting the applications.
Josh Tamayo-Sarver, MD, Ph.D.
As an advocate of leveraging artificial intelligence to improve the quality and efficiency of healthcare, I wanted to see how the current version of ChatGPT might serve as a tool in my own practice.
So after my regular clinical shifts in the emergency department the other week, I anonymized my History of Present Illness notes for 35 to 40 patients — basically, my detailed narrative of each person’s medical history and the symptoms that brought them to the emergency department — and fed them into ChatGPT.
The specific prompt I used was, “What are the differential diagnoses for this patient presenting to the emergency department [insert patient HPI notes here]?”
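For readers curious how an experiment like this might be reproduced programmatically, here is a minimal sketch using OpenAI's Chat Completions API in Python. The model choice, the `differential_diagnoses` helper, and the client setup are illustrative assumptions rather than the author's actual workflow (he used the ChatGPT interface directly), and any real patient text would have to be thoroughly de-identified first.

```python
# Minimal sketch (assumes the OpenAI Python SDK v1.x and an API key in OPENAI_API_KEY).
# The author used the ChatGPT web interface; this only illustrates the same prompt via the API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def differential_diagnoses(hpi_note: str) -> str:
    """Send an anonymized HPI note to the model and return its suggested differential."""
    prompt = (
        "What are the differential diagnoses for this patient "
        f"presenting to the emergency department? {hpi_note}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Example with a fictional, fully de-identified note:
# print(differential_diagnoses(
#     "21-year-old female with acute right lower quadrant abdominal pain..."))
```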
The results were fascinating, but also fairly disturbing.
OpenAI’s chatbot did a decent job of bringing up common diagnoses I wouldn’t want to miss — as long as everything I told it was precise and highly detailed. Correctly diagnosing a patient as having a nursemaid’s elbow, for instance, required about 200 words; identifying another patient’s orbital wall blowout fracture took the entire 600 words of my HPI on them.
For roughly half of my patients, ChatGPT suggested six possible diagnoses, and the “right” diagnosis — or at least the diagnosis that I believed to be right after complete evaluation and testing — was among the six that ChatGPT suggested.
Not bad. Then again, a 50% success rate in the context of an emergency room is also not good.
ChatGPT’s worst performance happened with a 21-year-old female patient who came into the ER with right lower quadrant abdominal pain. I fed her HPI into ChatGPT, which instantly came back with a differential diagnosis of appendicitis or an ovarian cyst, among other possibilities.
But ChatGPT missed a somewhat important diagnosis with this woman.
She had an ectopic pregnancy, in which a fertilized egg implants and begins to develop in a woman’s fallopian tube rather than her uterus. Diagnosed too late, it can cause fatal internal bleeding. Fortunately for my patient, we were able to rush her into the operating room for immediate treatment.
Notably, when she saw me in the emergency room, this patient did not even know she was pregnant. This is not an atypical scenario, and the possibility often emerges only after some gentle inquiring:
“Any chance you’re pregnant?”
The point is that a great deal of information must be fed into the chatbot for any degree of accuracy. In a real clinical setting, speech-to-text and/or text extraction from the EHR would have to be used. The human brain extracts, analyzes, synthesizes, and reports a diagnosis in seconds. AI and LLMs will not approach this capability for some time.
The cerebral cortex is much like an LLM. It contains decades of data learned from reading, lectures, patient encounters, and real-world experience. A board-certified physician assimilates a patient encounter as he listens to the history, performs a physical exam, and reviews the results of laboratory evaluation and imaging tests.
This is why it takes over 12 years of education and training to produce a competent and hopefully ethical doctor.
In short, ChatGPT worked pretty well as a diagnostic tool when I fed it perfect information and the patient had a classic presentation.
This is likely why ChatGPT “passed” the case vignettes in the Medical Licensing Exam. Not because it’s “smart,” but because the classic cases in the exam have a deterministic answer that already exists in its database. ChatGPT rapidly presents answers in a natural language format (that’s the genuinely impressive part), but underneath that is a knowledge retrieval process similar to Google Search. And most actual patient cases are not classic.
In the meantime, we urgently need a much more realistic view from Silicon Valley and the public at large of what AI can do now — and of its many, often dangerous, limitations. We must be very careful to avoid inflated expectations with programs like ChatGPT because, in the context of human health, they can literally be life-threatening.
Physicians are not waiting for the FDA