Risk-Calibrated Patient-Facing AI Safety Cards: A UI/UX Benchmark for Explainable Medical AI Response Interfaces
DOI:
https://doi.org/10.51903/ijgd.v3i2.3709
Keywords:
medical LLMs, patient safety, UI/UX design, risk communication, explainable AI, safety cards, information design
Abstract
Patient-facing medical AI systems communicate risk at moments when a non-expert user may act on what they read. This study evaluates risk-calibrated AI safety cards as a UI/UX framework for explainable medical response interfaces. The main experiment used 466 PatientSafetyBench prompts across five patient-safety categories: harmful medical advice, misdiagnosis and overconfidence, unlicensed practice of medicine, health misinformation, and bias or stigmatization. For each prompt, five response interfaces were generated: plain text, risk-label card, refusal-plus-explanation card, evidence-disclosure card, and next-step action card. The evaluation reports deterministic rubric-based communication scores rather than clinical safety outcomes. Across 2,330 PatientSafetyBench responses, the integrated Next-Step Action Card lowered the mean communication-risk score from 3.85 to 1.00 on a 1-5 scale, lowered the overconfidence-indicator score from 3.38 to 1.41, increased actionability from 16.46 to 98.09 on a 0-100 scale, increased risk-label clarity from 10.20 to 94.97, and increased evidence disclosure from 22.82 to 96.98. A second analysis used HealthBench physician-created rubric criteria to test whether the card structure aligned with communication, context, uncertainty, and escalation expectations in broader health conversations. The action-card condition increased communication-rubric coverage from 11.23% to 85.29% in HealthBench OSS, from 10.44% to 83.86% in HealthBench Consensus, and from 9.88% to 86.50% in HealthBench Hard. These results support the safety card as a reproducible information-design intervention for risk communication. They do not establish real-world patient behavior change or clinical safety; clinician review, patient testing, multilingual adaptation, and live-system evaluation remain necessary before deployment.
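The figures reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it copies the numbers stated above, confirms that 466 prompts across five interfaces give 2,330 responses, and computes the absolute change for each reported metric. The variable and function names are assumptions made for this example and are not taken from the study's code; the abstract does not name the comparison condition, so the two values are labeled simply as "before" and "after".

```python
# Illustrative cross-check of the numbers stated in the abstract.
# All names here are hypothetical; only the numeric values come from the text.

PROMPTS = 466
INTERFACES = [
    "plain text",
    "risk-label card",
    "refusal-plus-explanation card",
    "evidence-disclosure card",
    "next-step action card",
]

# One response per prompt per interface.
total_responses = PROMPTS * len(INTERFACES)

# (before, after) pairs as reported for the action-card condition.
reported = {
    "communication risk (1-5, lower is better)": (3.85, 1.00),
    "overconfidence indicator (lower is better)": (3.38, 1.41),
    "actionability (0-100)": (16.46, 98.09),
    "risk-label clarity": (10.20, 94.97),
    "evidence disclosure": (22.82, 96.98),
    "HealthBench OSS rubric coverage (%)": (11.23, 85.29),
    "HealthBench Consensus rubric coverage (%)": (10.44, 83.86),
    "HealthBench Hard rubric coverage (%)": (9.88, 86.50),
}


def absolute_change(before: float, after: float) -> float:
    """Return after minus before, rounded to two decimals."""
    return round(after - before, 2)


changes = {name: absolute_change(b, a) for name, (b, a) in reported.items()}

if __name__ == "__main__":
    print(f"total responses: {total_responses}")
    for name, delta in changes.items():
        print(f"{name}: {delta:+.2f}")
```

These differences are changes in rubric scores and coverage percentages, not clinical effect sizes, which is consistent with the limitation stated in the abstract.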
References
Amershi, S., Weld, D., Voris, M., Fourney, A., Nushi, B., Collisson, P., Suh, J., Iqbal, S., Bennett, P. N., Inkpen, K., Teevan, J., Kikin-Gil, R., & Horvitz, E. (2019). Guidelines for human-AI interaction. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1-13. https://doi.org/10.1145/3290605.3300233
Arora, R. K., Wei, J., Hicks, R. S., Bowman, P., Quinonero-Candela, J., Tsimbourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., Heidecke, J., & Singhal, K. (2025). HealthBench: Evaluating large language models towards improved human health. arXiv:2505.08775.
Bender, E. M., & Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 587-604. https://doi.org/10.1162/tacl_a_00041
Carayon, P., Wetterneck, T. B., Rivera-Rodriguez, A. J., Hundt, A. S., Hoonakker, P., Holden, R., & Gurses, A. P. (2014). Human factors systems approach to healthcare quality and patient safety. Applied Ergonomics, 45(1), 14-25. https://doi.org/10.1016/j.apergo.2013.04.023
Char, D. S., Shah, N. H., & Magnus, D. (2018). Implementing machine learning in health care: Addressing ethical challenges. New England Journal of Medicine, 378(11), 981-983. https://doi.org/10.1056/NEJMp1714229
Covello, V. T., & Sandman, P. M. (2001). Risk communication: Evolution and revolution. In A. Wolbarst (Ed.), Solutions to an environment in peril (pp. 164-178). Johns Hopkins University Press.
Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv:1702.08608.
Ehsan, U., & Riedl, M. O. (2020). Human-centered explainable AI: Towards a reflective sociotechnical approach. HCI International 2020 - Late Breaking Papers, 449-466. https://doi.org/10.1007/978-3-030-60117-1_33
Gunning, D., & Aha, D. (2019). DARPA's explainable artificial intelligence program. AI Magazine, 40(2), 44-58. https://doi.org/10.1609/aimag.v40i2.2850
Houts, P. S., Doak, C. C., Doak, L. G., & Loscalzo, M. J. (2006). The role of pictures in improving health communication: A review of research on attention, comprehension, recall, and adherence. Patient Education and Counseling, 61(2), 173-190. https://doi.org/10.1016/j.pec.2005.05.004
Institute of Medicine. (2004). Health literacy: A prescription to end confusion. National Academies Press. https://doi.org/10.17226/10883
Kuhn, J., Chen, Y., & Chan, E. (2024). AI-driven mobile UI pattern recognition and design topic mining on RICO: Semantic clustering and screenshot-based topic classification. Journal of Advanced Computing Systems, 4(5), 67-83. https://doi.org/10.69987/JACS.2024.40506
Kessels, R. P. C. (2003). Patients' memory for medical information. Journal of the Royal Society of Medicine, 96(5), 219-222. https://doi.org/10.1177/014107680309600504
Liao, Q. V., Gruen, D., & Miller, S. (2020). Questioning the AI: Informing design practices for explainable AI user experiences. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1-15. https://doi.org/10.1145/3313831.3376590
Lipton, Z. C. (2018). The mythos of model interpretability. Queue, 16(3), 31-57. https://doi.org/10.1145/3236386.3241340
Microsoft. (2025). PatientSafetyBench [Dataset]. Hugging Face.
Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267, 1-38. https://doi.org/10.1016/j.artint.2018.07.007
Norman, D. A. (2013). The design of everyday things (Rev. ed.). Basic Books.
Paling, J. (2003). Strategies to help patients understand risks. BMJ, 327(7417), 745-748. https://doi.org/10.1136/bmj.327.7417.745
Parasuraman, R., & Riley, V. (1997). Humans and automation: Use, misuse, disuse, abuse. Human Factors, 39(2), 230-253. https://doi.org/10.1518/001872097778543886
Preece, J., Rogers, Y., & Sharp, H. (2015). Interaction design: Beyond human-computer interaction (4th ed.). Wiley.
Shortliffe, E. H., & Sepulveda, M. J. (2018). Clinical decision support in the era of artificial intelligence. JAMA, 320(21), 2199-2200. https://doi.org/10.1001/jama.2018.17163
Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., Payne, P., Seneviratne, M., Gamble, P., Kelly, C., Babiker, A., Scharli, N., Chowdhery, A., Mansfield, P., Demner-Fushman, D., & Natarajan, V. (2023). Large language models encode clinical knowledge. Nature, 620, 172-180. https://doi.org/10.1038/s41586-023-06291-2
Thirunavukarasu, A. J., Ting, D. S. J., Elangovan, K., Gutierrez, L., Tan, T. F., & Ting, D. S. W. (2023). Large language models in medicine. Nature Medicine, 29, 1930-1940. https://doi.org/10.1038/s41591-023-02448-8
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124-1131. https://doi.org/10.1126/science.185.4157.1124
World Health Organization. (2021). Ethics and governance of artificial intelligence for health. World Health Organization.
World Wide Web Consortium. (2018). Web content accessibility guidelines (WCAG) 2.1. W3C Recommendation.
Chen, Y., & Chan, E. (2023). Multimodal UI representation learning: Ablation of screenshot, wireframe, and view-hierarchy proxies on an uploaded 168-screen dataset. Journal of Advanced Computing Systems, 3(1), 1-15. https://doi.org/10.69987/JACS.2023.30101
Zhang, Y., Liao, Q. V., & Bellamy, R. K. E. (2020). Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. Proceedings of the 2020 ACM Conference on Fairness, Accountability, and Transparency, 295-305. https://doi.org/10.1145/3351095.3372852
License
Copyright (c) 2025 Chenyu Li, Binghua Zhou, Krystal Gao

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.









