
A 39-year-old woman presented to the emergency department at Beth Israel Deaconess Medical Center in Boston. Her left knee had been hurting for days. The day before, she had a fever of 102 degrees. The fever was gone now, but she still had chills. And her knee was red and swollen.
What is the diagnosis?
On a recent sweltering Friday, Dr. Megan Langdon, a resident, told this true story to a room full of medical students and residents who had come together to learn a tricky skill: how to think like a doctor.
“Doctors are not good at teaching other doctors how we think,” said Dr. Adam Rodman, an internist, medical historian and organizer of the Beth Israel Deaconess event.
But this time, they could turn to an expert for help with a diagnosis — GPT-4, the latest version of a chatbot released by the company OpenAI.
Artificial intelligence is changing many aspects of medical practice, and some medical professionals are using these tools to help them make diagnoses. Physicians at Beth Israel Deaconess, a teaching hospital affiliated with Harvard Medical School, decided to explore how chatbots can be used and abused to train future physicians.
Mentors like Dr. Rodman hope medical students will use GPT-4 and other chatbots to conduct what doctors call curbside consultations — when they pull colleagues aside and ask for opinions on difficult cases. The idea is to use chatbots much as doctors seek advice and insights from one another.
For more than a century, doctors have been portrayed as detectives who collect clues and use them to find the culprit. But experienced doctors actually use a different approach — pattern recognition — to figure out what’s wrong. In medicine, this is known as a disease script: Doctors put together signs, symptoms and test results to tell a coherent story based on similar cases they know or have seen firsthand.
If disease scripts don’t help, doctors turn to other strategies, such as assigning probabilities to various diagnoses that might be appropriate, Dr. Rodman said.
For more than half a century, researchers have tried, without real success, to design computer programs to make medical diagnoses.
Doctors say GPT-4 is different. “It will create something very similar to a disease script,” Dr. Rodman said. In doing so, he added, “it’s fundamentally different from a search engine.”
Dr. Rodman and other doctors at Beth Israel Deaconess have asked GPT-4 for possible diagnoses in difficult cases. In a study published last month in the medical journal JAMA, they found that it did better than most doctors on weekly diagnostic challenges published in The New England Journal of Medicine.
However, they learned that using the program is an art and there are pitfalls.
Medical students and residents are “definitely using it,” said Dr. Christopher Smith, director of the medical center’s internal medicine residency program. But, he added, “it’s an open question whether they are learning anything.”
The worry is that they may rely on AI for diagnosis as much as they rely on a calculator on their phone for math. That’s dangerous, Dr. Smith said.
Learning, he said, involves trying to solve problems: “That’s how we retain things. Part of learning is the struggle. If you outsource learning to GPT, that struggle goes away.”
At the meeting, students and residents broke into small groups to try to figure out what was wrong with a patient with a swollen knee. Then they turned to GPT-4.
These groups tried different approaches.
One group used GPT-4 to do an internet search, similar to the way one would use Google. The chatbot came up with a list of possible diagnoses, including trauma. But when the group members asked it to explain its reasoning, the bot’s answer was disappointing: “Trauma is a common cause of knee injury.”
Another group came up with possible hypotheses and asked GPT-4 to check them. The chatbot’s list matched the group’s: infections, including Lyme disease; arthritis, including gout, a type of arthritis involving crystals in the joints; and trauma.
GPT-4 listed rheumatoid arthritis as the most likely possibility, although it was not high on the group’s list. The instructors later told everyone that gout was unlikely for this patient because she was young and female. And rheumatoid arthritis could probably be ruled out because only one joint was inflamed, and for only a couple of days.
As a curbside consultation, GPT-4 seemed to pass the test, or at least agree with the students and residents. But in this exercise, it offered no insights and no disease script.
One reason may be that the students and residents were using the bot more like a search engine than like a curbside consultation.
To use the bot properly, the instructors said, they would first need to tell GPT-4 something like: “You are a doctor seeing a 39-year-old woman with knee pain.” Then they would need to list her symptoms before asking for a diagnosis, and follow up with questions about the bot’s reasoning, just as they would with a medical colleague.
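For readers curious what that framing might look like when GPT-4 is reached through its API rather than a chat window, here is a minimal sketch using the OpenAI Python SDK. It is an illustration under stated assumptions, not the instructors’ actual prompts: the model name, the prompt wording and the follow-up question are all made up for the example.

```python
# Minimal sketch: frame GPT-4 as a curbside consult, not a search engine.
# Assumes the OpenAI Python SDK (>= 1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Set the role first, then present the case, as the instructors suggested.
messages = [
    {"role": "system",
     "content": "You are a doctor seeing a 39-year-old woman with knee pain."},
    {"role": "user",
     "content": ("Her left knee has hurt for days and is now red and swollen. "
                 "She had a fever of 102 yesterday; it is gone, but she still "
                 "has chills. What is your differential diagnosis, and why?")},
]

reply = client.chat.completions.create(model="gpt-4", messages=messages)
print(reply.choices[0].message.content)

# Follow up on the reasoning, as one would with a medical colleague.
messages.append({"role": "assistant",
                 "content": reply.choices[0].message.content})
messages.append({"role": "user",
                 "content": "Why would an infectious cause rank above gout here?"})
follow_up = client.chat.completions.create(model="gpt-4", messages=messages)
print(follow_up.choices[0].message.content)
```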
This, the instructors said, is one way to harness the power of GPT-4. But it is also important to recognize that chatbots can make mistakes and “hallucinate” — provide answers that have no basis in fact. Using them requires knowing when an answer is wrong.
“There’s nothing wrong with using these tools,” said Dr. Byron Crowe, an internist at the hospital. “You just have to use them in the right way.”
He offered the group an analogy.
“Pilots use GPS,” Dr. Crowe said. But, he added, airlines “have very high standards for reliability.” In medicine, he said, the use of chatbots is “very tempting,” but the same high standards should apply.
“It’s a great thought partner, but it’s not a substitute for deep mental expertise,” he said.
After the session, the instructors revealed the actual cause of the patient’s swollen knee.
It turned out to be a possibility that every group had considered, and one that GPT-4 had proposed as well.
She had Lyme disease.
Olivia Allison contributed reporting.