Uncertainty-Aware Medical Image Explanation Cards: LLM-Generated Visual Explanations for AI-Assisted Radiology Interfaces

Ziliang Samuel Zhong; Qiyou Wu; Gaotian Mi

doi:10.51903/ijgd.v3i2.3616

Authors

Ziliang Samuel Zhong, New York University, NY, USA
Qiyou Wu, Artificial Intelligence, Northeastern University, MA, USA
Gaotian Mi, Biomedical Engineering, Johns Hopkins University, MD, USA

Keywords:

radiology interface design, explainable artificial intelligence, Grad-CAM, uncertainty visualization, LLM microcopy, PneumoniaMNIST, visual hierarchy, UI/UX, medical image communication

Abstract

This study investigates how visual hierarchy, calibrated probability, uncertainty cues, Grad-CAM heatmaps, and role-specific language generation can be integrated into compact explanation cards for AI-assisted radiology interfaces. The empirical task was a reproducible, PneumoniaMNIST-compatible chest X-ray classification problem (normal versus pneumonia) that preserves the MedMNIST label schema, split sizes, and NPZ data structure. All reported performance values were computed by the included scripts on the packaged dataset, and every result table contains measured values from saved experimental artifacts. Six model variants were evaluated on accuracy, AUC, F1, sensitivity, specificity, negative log-likelihood, Brier score, and expected calibration error (ECE). The selected Spatial-CNN with temperature scaling achieved AUC = 0.868, accuracy = 0.763, F1 = 0.778, specificity = 0.923, Brier score = 0.155, and ECE = 0.021 on the 624-image test split. A warning rule based on confidence, entropy, and Monte Carlo dropout (MC-dropout) variance flagged 310 test cases and captured 113 of the 148 model errors (about 76%). Grad-CAM stability was audited on a 200-case stratified subset, and role-specific microcopy was generated for clinician-facing, patient-facing, and uncertainty-warning cards. Patient-facing text achieved a mean Flesch Reading Ease of 74.6 and a mean Flesch-Kincaid grade level of 5.4, while clinician-facing text preserved concise technical language. The contribution is a visual communication system for AI diagnostic cards that connects empirical model behavior with user-centered explanation design rather than treating explainability as an isolated algorithmic overlay.
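
To make the calibration metrics and the warning rule named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It shows temperature scaling of a binary logit, the Brier score, an equal-width-bin ECE, and a warning flag built from confidence, entropy, and MC-dropout variance. The function names, the 10-bin setting, the three thresholds, and the choice to combine the three signals with a logical OR are all illustrative assumptions; the abstract does not state them.

```python
import math


def temperature_scale(logit, temperature):
    """Sigmoid of a binary logit divided by a temperature (T > 1 softens the output)."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def brier_score(probs, labels):
    """Mean squared difference between predicted P(pneumonia) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE on predicted-class confidence with equal-width bins (n_bins is an assumption)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            avg_acc = sum(a for _, a in b) / len(b)
            ece += len(b) / total * abs(avg_acc - avg_conf)
    return ece


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def should_warn(mc_probs, conf_min=0.75, entropy_max=0.8, var_max=0.01):
    """Flag a case from repeated MC-dropout passes.

    Thresholds and the OR combination are hypothetical, chosen for illustration only.
    """
    n = len(mc_probs)
    mean_p = sum(mc_probs) / n
    variance = sum((p - mean_p) ** 2 for p in mc_probs) / n
    confidence = max(mean_p, 1.0 - mean_p)
    return (
        confidence < conf_min
        or binary_entropy(mean_p) > entropy_max
        or variance > var_max
    )
```

In this sketch, a case whose stochastic passes agree at high probability is not flagged, while a case that is near 0.5 on average, or whose passes disagree strongly, is flagged. A real deployment would tune the thresholds on a validation split rather than fix them by hand.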

