Risk-Calibrated Patient-Facing AI Safety Cards: A UI/UX Benchmark for Explainable Medical AI Response Interfaces

Chenyu Li; Binghua Zhou; Krystal Gao

doi:10.51903/ijgd.v3i2.3709

Authors

Chenyu Li, Applied Analytics, Columbia University, NY, USA
Binghua Zhou, Computer Science, USC, CA, USA
Krystal Gao, Human-Computer Interaction, CMU, PA, USA

DOI:

https://doi.org/10.51903/ijgd.v3i2.3709

Keywords:

medical LLMs, patient safety, UI/UX design, risk communication, explainable AI, safety cards, information design

Abstract

Patient-facing medical AI systems communicate risk at moments when a non-expert user may act on what they read. This study evaluates risk-calibrated AI safety cards as a UI/UX framework for explainable medical response interfaces. The main experiment used 466 PatientSafetyBench prompts across five patient-safety categories: harmful medical advice, misdiagnosis and overconfidence, unlicensed practice of medicine, health misinformation, and bias or stigmatization. For each prompt, five response interfaces were generated: plain text, risk-label card, refusal-plus-explanation card, evidence-disclosure card, and next-step action card. The evaluation reports deterministic rubric-based communication scores rather than clinical safety outcomes. Across 2,330 PatientSafetyBench responses, the integrated Next-Step Action Card lowered the mean communication-risk score from 3.85 to 1.00 on a 1-5 scale, lowered the overconfidence-indicator score from 3.38 to 1.41, increased actionability from 16.46 to 98.09 on a 0-100 scale, increased risk-label clarity from 10.20 to 94.97, and increased evidence disclosure from 22.82 to 96.98. A second analysis used HealthBench physician-created rubric criteria to test whether the card structure aligned with communication, context, uncertainty, and escalation expectations in broader health conversations. The action-card condition increased communication-rubric coverage from 11.23% to 85.29% in HealthBench OSS, from 10.44% to 83.86% in HealthBench Consensus, and from 9.88% to 86.50% in HealthBench Hard. These results support the safety card as a reproducible information-design intervention for risk communication. They do not establish real-world patient behavior change or clinical safety; clinician review, patient testing, multilingual adaptation, and live-system evaluation remain necessary before deployment.
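The arithmetic behind the reported figures can be checked with a minimal sketch. The numbers are copied from the abstract; every variable and function name is hypothetical, and the "before" value is simply the starting figure the abstract reports (treating it as the plain-text condition is an assumption, since the abstract does not name the comparison condition).

```python
# Figures copied from the abstract; all names here are illustrative only.
PROMPTS = 466
INTERFACES = [
    "plain text",
    "risk-label card",
    "refusal-plus-explanation card",
    "evidence-disclosure card",
    "next-step action card",
]

# metric: (starting value, Next-Step Action Card value), as reported.
# Assumption: the starting value belongs to the plain-text condition.
REPORTED = {
    "communication risk (1-5)": (3.85, 1.00),
    "overconfidence indicator": (3.38, 1.41),
    "actionability (0-100)": (16.46, 98.09),
    "risk-label clarity (0-100)": (10.20, 94.97),
    "evidence disclosure (0-100)": (22.82, 96.98),
}


def total_responses(prompts, interfaces):
    """One response per prompt per interface condition."""
    return prompts * len(interfaces)


def deltas(reported):
    """Absolute change (after minus before) for each reported metric."""
    return {
        name: round(after - before, 2)
        for name, (before, after) in reported.items()
    }


if __name__ == "__main__":
    print(total_responses(PROMPTS, INTERFACES))
    for name, change in deltas(REPORTED).items():
        print(f"{name}: {change:+.2f}")
```

The response count matches the 2,330 stated in the abstract (466 prompts times five interfaces). Negative changes on the first two metrics indicate improvement, because lower scores are better for communication risk and overconfidence.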

References

Amershi, S., Weld, D., Voris, M., Fourney, A., Nushi, B., Collisson, P., Suh, J., Iqbal, S., Bennett, P. N., Inkpen, K., Teevan, J., Kikin-Gil, R., & Horvitz, E. (2019). Guidelines for human-AI interaction. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1-13. https://doi.org/10.1145/3290605.3300233

Arora, R. K., Wei, J., Hicks, R. S., Bowman, P., Quinonero-Candela, J., Tsimbourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., Heidecke, J., & Singhal, K. (2025). HealthBench: Evaluating large language models towards improved human health. arXiv:2505.08775.

Bender, E. M., & Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 587-604. https://doi.org/10.1162/tacl_a_00041

Carayon, P., Wetterneck, T. B., Rivera-Rodriguez, A. J., Hundt, A. S., Hoonakker, P., Holden, R., & Gurses, A. P. (2014). Human factors systems approach to healthcare quality and patient safety. Applied Ergonomics, 45(1), 14-25. https://doi.org/10.1016/j.apergo.2013.04.023

Char, D. S., Shah, N. H., & Magnus, D. (2018). Implementing machine learning in health care: Addressing ethical challenges. New England Journal of Medicine, 378(11), 981-983. https://doi.org/10.1056/NEJMp1714229

Covello, V. T., & Sandman, P. M. (2001). Risk communication: Evolution and revolution. In A. Wolbarst (Ed.), Solutions to an environment in peril (pp. 164-178). Johns Hopkins University Press.

Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv:1702.08608.

Ehsan, U., & Riedl, M. O. (2020). Human-centered explainable AI: Towards a reflective sociotechnical approach. HCI International 2020 - Late Breaking Papers, 449-466. https://doi.org/10.1007/978-3-030-60117-1_33

Gunning, D., & Aha, D. (2019). DARPA's explainable artificial intelligence program. AI Magazine, 40(2), 44-58. https://doi.org/10.1609/aimag.v40i2.2850

Houts, P. S., Doak, C. C., Doak, L. G., & Loscalzo, M. J. (2006). The role of pictures in improving health communication: A review of research on attention, comprehension, recall, and adherence. Patient Education and Counseling, 61(2), 173-190. https://doi.org/10.1016/j.pec.2005.05.004

Institute of Medicine. (2004). Health literacy: A prescription to end confusion. National Academies Press. https://doi.org/10.17226/10883

Kuhn, J., Chen, Y., & Chan, E. (2024). AI-driven mobile UI pattern recognition and design topic mining on RICO: Semantic clustering and screenshot-based topic classification. Journal of Advanced Computing Systems, 4(5), 67-83. https://doi.org/10.69987/JACS.2024.40506

Kessels, R. P. C. (2003). Patients' memory for medical information. Journal of the Royal Society of Medicine, 96(5), 219-222. https://doi.org/10.1177/014107680309600504

Liao, Q. V., Gruen, D., & Miller, S. (2020). Questioning the AI: Informing design practices for explainable AI user experiences. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1-15. https://doi.org/10.1145/3313831.3376590

Lipton, Z. C. (2018). The mythos of model interpretability. Queue, 16(3), 31-57. https://doi.org/10.1145/3236386.3241340

Microsoft. (2025). PatientSafetyBench [Dataset]. Hugging Face.

Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267, 1-38. https://doi.org/10.1016/j.artint.2018.07.007

Norman, D. A. (2013). The design of everyday things (Rev. ed.). Basic Books.

Paling, J. (2003). Strategies to help patients understand risks. BMJ, 327(7417), 745-748. https://doi.org/10.1136/bmj.327.7417.745

Parasuraman, R., & Riley, V. (1997). Humans and automation: Use, misuse, disuse, abuse. Human Factors, 39(2), 230-253. https://doi.org/10.1518/001872097778543886

Preece, J., Rogers, Y., & Sharp, H. (2015). Interaction design: Beyond human-computer interaction (4th ed.). Wiley.

Shortliffe, E. H., & Sepulveda, M. J. (2018). Clinical decision support in the era of artificial intelligence. JAMA, 320(21), 2199-2200. https://doi.org/10.1001/jama.2018.17163

Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., Payne, P., Seneviratne, M., Gamble, P., Kelly, C., Babiker, A., Scharli, N., Chowdhery, A., Mansfield, P., Demner-Fushman, D., & Natarajan, V. (2023). Large language models encode clinical knowledge. Nature, 620, 172-180. https://doi.org/10.1038/s41586-023-06291-2

Thirunavukarasu, A. J., Ting, D. S. J., Elangovan, K., Gutierrez, L., Tan, T. F., & Ting, D. S. W. (2023). Large language models in medicine. Nature Medicine, 29, 1930-1940. https://doi.org/10.1038/s41591-023-02448-8

Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124-1131. https://doi.org/10.1126/science.185.4157.1124

World Health Organization. (2021). Ethics and governance of artificial intelligence for health. World Health Organization.

World Wide Web Consortium. (2018). Web content accessibility guidelines (WCAG) 2.1. W3C Recommendation.

Chen, Y., & Chan, E. (2023). Multimodal UI representation learning: Ablation of screenshot, wireframe, and view-hierarchy proxies on an uploaded 168-screen dataset. Journal of Advanced Computing Systems, 3(1), 1-15. https://doi.org/10.69987/JACS.2023.30101

Zhang, Y., Liao, Q. V., & Bellamy, R. K. E. (2020). Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. Proceedings of the 2020 ACM Conference on Fairness, Accountability, and Transparency, 295-305. https://doi.org/10.1145/3351095.3372852
