‘Humble’ AI detects when diagnosis is uncertain


Written by Erin Yeh

Artificial intelligence (AI) models have assisted doctors with many clinical tasks and hold great promise for helping with patient diagnosis and personalized treatment options. However, a team of researchers led by MIT warns in a press release that AI systems in their current form may push doctors down the wrong path because of overconfidence.

Large language models (LLMs) tend to show inappropriate overconfidence in clinical reasoning tasks, inflexibility in their thinking, and a tendency to hallucinate when faced with situations that differ from their training patterns (BMJ Health and Care Informatics, DOI: 10.1136/BMJHSI-2025-101877). They also display ingratiating behavior, such as offering praise or flattery.

According to the researchers, what is needed is a more “humble” artificial intelligence. They designed a framework called Balanced, Open, Diagnostic, Humble, and Curious (BODHI) that makes the model more transparent about uncertainty and prompts it to gather additional information when it is not confident about a diagnosis.

Six integrated steps and chain-of-thought scaffolding

The BODHI framework works through six complementary steps. First, a clinical complexity assessment evaluates the query for diagnostic ambiguity, urgency, and completeness of data. Second, a prior confidence evaluation estimates the model’s cognitive state based on training and query accuracy. Third, the Curiosity module identifies information gaps and proposes clarifying questions, while the Humility module assesses confidence limits and triggers for deferral. The team previously presented curiosity and humility as key cognitive virtues for AI in healthcare, the study said: curiosity is meant to reduce uncertainty through targeted inquiry, and humility acknowledges limitations and defers to human expertise.

Fourth, the virtue activation matrix maps those outputs to one of four cognitive positions (follow and monitor, monitor and substitute, clarify and review, and escalate and reformulate). Fifth, adaptive responses are generated to fit the specific situation. Finally, the framework uses clinician feedback to refine its thresholds and improve performance over time.
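For readers who want a concrete picture of how those six steps might fit together, here is a minimal sketch in Python. Everything in it, from function names to thresholds, is an illustrative assumption rather than the framework’s actual implementation or the published package API.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    ambiguity: float     # diagnostic ambiguity (0 = clear, 1 = highly ambiguous)
    urgency: float       # clinical urgency
    completeness: float  # completeness of the available data

def assess_complexity(query: str) -> Assessment:
    """Step 1: a rough keyword heuristic standing in for the complexity assessment."""
    q = query.lower()
    ambiguous = any(w in q for w in ("vague", "nonspecific", "unclear"))
    urgent = any(w in q for w in ("chest pain", "sepsis", "stroke"))
    return Assessment(ambiguity=0.8 if ambiguous else 0.3,
                      urgency=0.9 if urgent else 0.2,
                      completeness=0.4 if len(query) < 80 else 0.8)

def prior_confidence(a: Assessment) -> float:
    """Step 2: confidence falls as ambiguity rises and data completeness drops."""
    return max(0.0, min(1.0, 0.9 - 0.5 * a.ambiguity - 0.3 * (1 - a.completeness)))

def curiosity(a: Assessment) -> list[str]:
    """Step 3a: propose clarifying questions when information gaps are likely."""
    questions = []
    if a.completeness < 0.6:
        questions.append("What is the symptom timeline and relevant history?")
    if a.ambiguity > 0.5:
        questions.append("Are there findings that would narrow the differential?")
    return questions

def humility(confidence: float, a: Assessment) -> bool:
    """Step 3b: trigger deferral to a clinician when confidence is low or urgency is high."""
    return confidence < 0.5 or a.urgency > 0.8

def activation_matrix(confidence: float, defer: bool, a: Assessment) -> str:
    """Step 4: map confidence and deferral flags to one of the four cognitive positions."""
    if defer and a.urgency > 0.8:
        return "escalate and reformulate"
    if defer:
        return "clarify and review"
    if confidence < 0.7:
        return "monitor and substitute"
    return "follow and monitor"

def respond(query: str, position: str, questions: list[str]) -> str:
    """Step 5: adapt the response to the chosen position. Step 6 (clinician feedback
    adjusting the thresholds above) is left out of this sketch."""
    if position in ("clarify and review", "escalate and reformulate"):
        return f"[{position}] Before advising, I need more information: " + " ".join(questions)
    return f"[{position}] Preliminary assessment for: {query!r}"

if __name__ == "__main__":
    q = "Elderly patient with vague abdominal pain, labs not yet available."
    a = assess_complexity(q)
    conf = prior_confidence(a)
    pos = activation_matrix(conf, humility(conf, a), a)
    print(respond(q, pos, curiosity(a)))
```

With this toy example, low completeness and high ambiguity push confidence below the deferral threshold, so the system asks for more information instead of committing to a diagnosis, which is the behavior the framework is designed to produce.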

BODHI also uses a dual-pass chain-of-thought protocol that separates internal reasoning from external communication. Pass 1 analyzes the request across seven domains: task type classification (emergency, technical, hybrid, or conversation), identification of the audience (patient, health professional, or unclear), an initial hypothesis with reasoning, key uncertainties affecting confidence, clarifying questions (one or two required in non-emergency situations), red flags that trigger escalation, and safe recommendations appropriate to the level of uncertainty.
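To make the seven domains easier to picture, the Pass 1 output can be thought of as a structured record like the sketch below. The field names and example values are assumptions made for illustration, not the framework’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class PassOneAnalysis:
    # The seven internal-reasoning domains described above (field names assumed).
    task_type: str                   # "emergency", "technical", "hybrid", or "conversation"
    audience: str                    # "patient", "health professional", or "unclear"
    initial_hypothesis: str          # working hypothesis with brief reasoning
    key_uncertainties: list[str]     # uncertainties that limit confidence
    clarifying_questions: list[str]  # one or two required for non-emergency queries
    red_flags: list[str]             # findings that would trigger escalation
    safe_recommendation: str         # advice appropriate to the level of uncertainty

example = PassOneAnalysis(
    task_type="conversation",
    audience="patient",
    initial_hypothesis="Possible viral upper respiratory infection given low-grade fever and congestion.",
    key_uncertainties=["Symptom duration unknown", "No exam findings available"],
    clarifying_questions=["How long have the symptoms lasted?", "Any shortness of breath?"],
    red_flags=["Dyspnea", "Fever above 39.5 C for more than three days"],
    safe_recommendation="Supportive care; seek in-person evaluation if any red flag appears.",
)
```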

Pass 2 then generates the final, clinician-facing response using the Pass 1 analysis and applying cognitive boundaries. The system adjusts its behavior based on context: conversational mode (the default) applies full cognitive constraints to patient interactions, emergency mode prioritizes safety over completeness, technical mode reduces hedging for administrative tasks, and mixed mode balances clinical reasoning with technical precision. Overarching constraints dictate basic practices: using specific numbers and time frames when possible, turning conditional statements into direct questions that gather more information, and offering alternative possibilities when confidence is low.
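As a rough sketch of how Pass 2 might pick a mode and apply those overarching constraints, consider the following; the mode mapping, thresholds, and wording are illustrative assumptions, not the published implementation.

```python
def select_mode(task_type: str) -> str:
    """Map Pass 1's task classification to one of the four behavior modes."""
    return {
        "emergency": "emergency",   # safety prioritized over completeness
        "technical": "technical",   # reduced hedging for administrative tasks
        "hybrid": "mixed",          # balance clinical reasoning with technical precision
    }.get(task_type, "conversational")  # default: full cognitive constraints

def apply_constraints(draft: str, mode: str, confidence: float,
                      clarifying_questions: list[str]) -> str:
    """Apply the overarching constraints: ask direct questions instead of hedging vaguely,
    and surface alternative possibilities when confidence is low."""
    if mode == "emergency":
        return "URGENT: " + draft
    out = draft
    if mode == "conversational" and confidence < 0.6 and clarifying_questions:
        out += " To be more specific, could you tell me: " + " ".join(clarifying_questions)
    if confidence < 0.4:
        out += " Given the uncertainty, other possibilities should also be considered."
    return out

print(apply_constraints(
    draft="The symptoms are most consistent with a viral illness.",
    mode=select_mode("conversation"),
    confidence=0.45,
    clarifying_questions=["How many days have you had the fever?"],
))
```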

“It’s like having a co-pilot telling you that you need to look for fresh eyes so you can better understand this complex patient,” said Leo Anthony Celi, a senior research scientist at the Massachusetts Institute of Technology’s Institute for Medical Engineering and Science, a physician at Beth Israel Deaconess Medical Center, and an associate professor at Harvard Medical School, in a press release.

Significant improvements in behavior

The team evaluated BODHI on HealthBench Hard, a benchmark of 200 challenging clinical scenarios that require diagnostic reasoning, treatment planning, and triage decisions. Two language models, GPT-4.1-mini and GPT-4o-mini, were tested.

The results showed significant improvements in both models. For GPT-4.1-mini, the overall score improved from 2.5% to 19.1%, with the context-seeking (curiosity) rate rising from 7.8% to 97.3% and hedging behavior increasing from 1.7% to 21.9%. GPT-4o-mini improved from 0% to 2.2%, with context seeking rising from 0% to 73.5%. Overall, BODHI achieved significant improvements in clinical curiosity and quality, and these gains came from the prompting scaffold alone, without fine-tuning the models or making architectural changes.

GPT-4.1-mini showed a greater overall improvement, suggesting that model capability affects how much benefit the cognitive constraints provide. GPT-4o-mini reached similar rates of context seeking but lower total scores, which may reflect differences in underlying reasoning or in how reliably it follows instructions. Both models, however, improved strongly on the core cognitive measures, suggesting that the dual-pass protocol is effective across model variants.

What does humility mean in the clinic?

Traditional methods, such as uncertainty quantification, can estimate confidence, but they do not influence behavior or communication. Sample consistency or token-level probabilities can distinguish correct from incorrect outputs but are often poorly calibrated and overconfident. Fine-tuning approaches require changing the model itself and may not generalize well across clinical contexts. Conceptual frameworks for epistemic humility highlight the issue without offering practical solutions. In contrast, BODHI works at the prompt level, requires no parameter changes, and has demonstrated behavioral shifts with improvements in both curiosity and humility.
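The sample-consistency idea mentioned above can be pictured as follows: sample the model several times and treat agreement on the top answer as a crude confidence score. The `ask_model` function is a placeholder assumption standing in for any LLM call.

```python
import random
from collections import Counter

def ask_model(prompt: str) -> str:
    """Placeholder for a sampled LLM call (temperature > 0); returns a short diagnostic label."""
    return random.choice(["viral URI", "viral URI", "bacterial sinusitis"])

def sample_consistency(prompt: str, n: int = 10) -> tuple[str, float]:
    """Crude confidence estimate: the fraction of n sampled answers that agree on the top answer."""
    answers = [ask_model(prompt) for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n

answer, confidence = sample_consistency("Fever and congestion for three days. Most likely diagnosis?")
print(f"{answer} (agreement {confidence:.0%})")
# As noted above, such scores are often poorly calibrated and, on their own,
# do not change how the model communicates or behaves.
```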

However, the researchers advise that declines in communication quality scores should be interpreted carefully. In high-risk clinical settings, appropriately humble, question-based answers are safer than confident but potentially incorrect statements, so low communication quality scores may reflect limitations of the metric rather than a true reduction in clinical effectiveness. Future evaluation frameworks should reward appropriate expressions of uncertainty and penalize overconfidence, to match the qualities that clinical AI should possess.
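One way to picture an evaluation rubric that rewards calibrated uncertainty is a scorer that credits justified confidence and appropriate question-asking while penalizing confident errors; the weights below are purely illustrative assumptions, not a rubric from the study.

```python
def score_response(correct: bool, expressed_confidence: float,
                   asked_clarifying_question: bool) -> float:
    """Illustrative rubric: reward confidence only when the answer is correct,
    penalize confident errors heavily, and credit curiosity under uncertainty."""
    score = expressed_confidence if correct else -2.0 * expressed_confidence
    if asked_clarifying_question and expressed_confidence < 0.7:
        score += 0.5
    return score

# A confident wrong answer scores worse than a humble, question-asking one.
print(score_response(correct=False, expressed_confidence=0.9, asked_clarifying_question=False))  # -1.8
print(score_response(correct=False, expressed_confidence=0.3, asked_clarifying_question=True))   # -0.1
```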

Limitations of the study include reliance on a single benchmark, evaluation of only two models from a single provider, and the lack of physician-in-the-loop validation. The dual-pass protocol also increases computational cost and latency, which could limit real-time applications, and the framework’s effectiveness may vary across clinical specialties, patient populations, and institutional settings. While the chain-of-thought protocol improves transparency, it may not fully reflect the model’s actual computation, a known limitation of post-hoc rationalization approaches. The team recommends that future studies test BODHI in real clinical settings with diverse patient populations and evaluate its impact on outcomes such as diagnostic accuracy and patient safety.

The improvements demonstrate that BODHI can reliably constrain LLMs to operate within their cognitive limits. With it, AI systems can be deployed more safely and can act as collaborative partners that know when to ask questions and when to defer, rather than masking uncertainty with overconfidence. BODHI is currently available as an open-source Python package.


