LLM-as-Design-Critic: Aligning AI-Generated UI Feedback with Human Graphic Design Judgment

Authors

  • Yunhe Li, Computer and Information Technology, University of Pennsylvania, PA, USA
  • Shenghan Lu, Information Technology, Fordham University, NY, USA
  • Lily Zhao, UX Design, Boston University, MA, USA

DOI:

https://doi.org/10.51903/ijgd.v3i1.3661

Keywords:

Large Language Models, UI Critique, Graphic Design Judgment, Visual Communication, Mobile Interface Evaluation

Abstract

This paper evaluates whether AI-authored mobile user-interface critiques align with human graphic design judgment. The study uses the public UICrit CSV derived from RICO mobile screens, containing 2,981 annotator rows, 1,000 distinct UI screens, 11,344 source-indexed design critiques, normalized critique bounding boxes, and ratings for aesthetics, learnability, efficiency, usability, and overall design quality. We conducted a full, reproducible empirical evaluation rather than reporting illustrative results. Seven models were compared on a split that is group-disjoint by RICO screen ID: a mean baseline, task-text TF-IDF Ridge, human-critique TF-IDF Ridge, LLM-critique TF-IDF Ridge, all-critique TF-IDF Ridge, topic-and-region Ridge, and a fused text-topic-region Ridge model. We also measured human-LLM critique alignment using TF-IDF cosine, character n-gram cosine, ROUGE-L F1, unigram F1, topic Jaccard, and best-match bounding-box IoU. The fused model achieved the strongest overall design-quality prediction on the held-out test set (MAE = 0.613, RMSE = 0.805, Spearman = 0.575), improving over the mean baseline MAE of 0.779. Human critiques alone were highly predictive (design-quality Spearman = 0.556), whereas LLM-inclusive critiques alone were much weaker (Spearman = 0.194). Human-LLM semantic alignment was low for exclusive human versus exclusive LLM comments (mean TF-IDF cosine = 0.046) and substantially higher when comments tagged as both were included (mean TF-IDF cosine = 0.390). The results show that design critiques encode measurable aesthetic and usability judgment, but that LLM critiques still differ from human critique priorities unless shared comments and region evidence are incorporated.
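To make three of the alignment measures named in the abstract concrete, the following is a minimal Python sketch of unigram F1, topic Jaccard, and best-match bounding-box IoU. It is an illustration under stated assumptions, not the authors' implementation: the whitespace tokenization, the (x_min, y_min, x_max, y_max) box layout, the averaging over human boxes, and all function names are hypothetical choices that the abstract does not specify.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Token-level F1 between two critiques.

    Assumption: lowercased whitespace tokens; the paper's tokenizer is unknown.
    """
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two topic-label sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def box_iou(a, b) -> float:
    """IoU of two boxes given as (x_min, y_min, x_max, y_max).

    Assumption: normalized coordinates in [0, 1] with this corner layout.
    """
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_match_iou(human_boxes, llm_boxes) -> float:
    """Mean, over human boxes, of the highest IoU with any LLM box.

    Assumption: this aggregation is one plausible reading of "best-match".
    """
    if not human_boxes or not llm_boxes:
        return 0.0
    best = [max(box_iou(h, l) for l in llm_boxes) for h in human_boxes]
    return sum(best) / len(best)
```

For instance, under these assumptions `unigram_f1("the button is small", "the button is tiny")` shares three of four tokens on each side, and two half-width boxes offset by a quarter of the screen overlap in one seventh of their union.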

References

Bangor, A., Kortum, P. T., & Miller, J. T. (2008). An Empirical Evaluation of the System Usability Scale. International Journal of Human–Computer Interaction, 24(6), 574–594. https://doi.org/10.1080/10447310802205776

Bansal, G., Nushi, B., Kamar, E., Lasecki, W. S., Weld, D. S., & Horvitz, E. (2019). Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 7, 2–11. https://doi.org/10.1609/hcomp.v7i1.5280

Zhou, B., Zhao, S., & Chao, D. (2023). LLM-Guided Energy-Aware A/B Testing for Consolidation and DVFS Policies via Power-Sensitivity Clustering. Journal of Advanced Computing Systems, 3(4), 12–30. https://doi.org/10.69987/jacs.2023.30402

Brooke, J. (1996). SUS: A Quick and Dirty Usability Scale. In P. W. Jordan, B. Thomas, B. A. Weerdmeester, & I. L. McClelland (Eds.), Usability Evaluation in Industry, 189–194. https://userinterfaces.aalto.fi/sus/sus.pdf

Card, S. K., Moran, T. P., & Newell, A. (1983). The Psychology of Human-Computer Interaction. https://doi.org/10.1201/9780203736166

Zheng, D., Li, C., & Davidson, H. (2023). Continual Red-Teaming for In-the-Wild Jailbreaks via Online Guardrail Updates and Guardrail Distillation. Journal of Advanced Computing Systems, 3(2), 35–49. https://doi.org/10.69987/jacs.2023.30203

Deka, B., Huang, Z., Franzen, C., Hibschman, J., Afergan, D., Li, Y., Nichols, J., & Kumar, R. (2017). RICO: A Mobile App Dataset for Building Data-Driven Design Applications. Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, 845–854. https://doi.org/10.1145/3126594.3126651

Duan, P., Chen, C.-Y., Li, G., Hartmann, B., & Li, Y. (2024). UICrit: Enhancing Automated Design Evaluation with a UI Critique Dataset. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 46, 1–17. https://doi.org/10.1145/3654777.3676381

Heer, J., & Bostock, M. (2010). Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 203–212. https://doi.org/10.1145/1753326.1753357

Kuhn, J., Chen, Y., & Chan, E. (2024). AI-Driven Mobile UI Pattern Recognition and Design Topic Mining on RICO: Semantic Clustering and Screenshot-Based Topic Classification. Journal of Advanced Computing Systems, 4(5), 67–83. https://doi.org/10.69987/jacs.2024.40506

Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology (4th ed.). https://doi.org/10.4135/9781071878781

Lewis, J. R. (2018). Measuring Perceived Usability: The CSUQ, SUS, and UMUX. International Journal of Human–Computer Interaction, 34(12), 1148–1156. https://doi.org/10.1080/10447318.2017.1418805

Lewis, J. R., Utesch, B. S., & Maher, D. E. (2015). Measuring Perceived Usability: The SUS, UMUX-LITE, and AltUsability. International Journal of Human–Computer Interaction, 31(8), 496–505. https://doi.org/10.1080/10447318.2015.1064654

Li, G., Baechler, G., Tragut, M., & Li, Y. (2022). Learning to Denoise Raw Mobile UI Layouts for Improving Datasets at Scale. Proceedings of the CHI Conference on Human Factors in Computing Systems, 67, 1–13. https://doi.org/10.1145/3491102.3502042

Liu, T. F., Craft, M., Situ, J., Yumer, E., Mech, R., & Kumar, R. (2018). Learning Design Semantics for Mobile Apps. Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, 569–579. https://doi.org/10.1145/3242587.3242650

Lund, A. M. (2001). Measuring Usability with the USE Questionnaire. Usability Interface, 8(2), 3–6. https://search.worldcat.org/title/818903534

Moshagen, M., & Thielsch, M. T. (2010). Facets of Visual Aesthetics. International Journal of Human–Computer Studies, 68(10), 689–709. https://doi.org/10.1016/j.ijhcs.2010.05.006

Ngo, D. C. L., Teo, L. S., & Byrne, J. G. (2003). Modelling Interface Aesthetics. Information Sciences, 152, 25–37. https://doi.org/10.1016/s0020-0255(02)00404-8

Nielsen, J. (1994). Enhancing the Explanatory Power of Usability Heuristics. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 152–158. https://doi.org/10.1145/191666.191729

Nielsen, J., & Molich, R. (1990). Heuristic Evaluation of User Interfaces. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 249–256. https://doi.org/10.1145/97243.97281

Norman, D. A. (2013). The Design of Everyday Things (Rev. and expanded ed.). https://jnd.org/the-design-of-everyday-things-revised-and-expanded-edition/

O’Donovan, P., Agarwala, A., & Hertzmann, A. (2015). DesignScape: Design with Interactive Layout Suggestions. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1221–1224. https://doi.org/10.1145/2702123.2702149

Oulasvirta, A., Kristensson, P. O., Bi, X., & Howes, A. (Eds.). (2018). Computational Interaction. https://doi.org/10.1093/oso/9780198799658.001.0001

Petrova, S., & Watanabe, K. (2025). User-Centered Mobile Navigation: Evaluating Local Usability for Improved UX. Journal of Technology Informatics and Engineering, 4(3), 478–492. https://doi.org/10.51903/jtie.v4i3.457

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models from Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning, 8748–8763. https://proceedings.mlr.press/v139/radford21a.html

Reinecke, K., & Gajos, K. Z. (2014). Quantifying Visual Preferences Around the World. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 11–20. https://doi.org/10.1145/2556288.2557052

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?” Explaining the Predictions of Any Classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144. https://doi.org/10.1145/2939672.2939778

Saranya, K. N., Bhandari, M., Borad, N., Reddy, P. V. P., & Kumar, S. (2025). Surveying the Impact of Rarely Investigated Design Components on User Engagement. International Journal of Graphic Design, 3(1), 39–52. https://doi.org/10.51903/ijgd.v3i1.2752

Sauro, J., & Lewis, J. R. (2016). Quantifying the User Experience: Practical Statistics for User Research (2nd ed.). https://doi.org/10.1016/c2010-0-65191-4

Shneiderman, B., Plaisant, C., Cohen, M., Jacobs, S., Elmqvist, N., & Diakopoulos, N. (2016). Designing the User Interface: Strategies for Effective Human-Computer Interaction (6th ed.). https://www.pearson.com/en-us/subject-catalog/p/designing-the-user-interface-strategies-for-effective-human-computer-interaction/p200000003255

Swearngin, A., & Li, Y. (2019). Modeling Mobile Interface Tappability Using Crowdsourcing and Deep Learning. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 78, 1–11. https://doi.org/10.1145/3290605.3300305

Wang, B., Li, G., Zhou, X., Chen, Z., Grossman, T., & Li, Y. (2021). Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning. Proceedings of the 34th Annual ACM Symposium on User Interface Software and Technology, 498–510. https://doi.org/10.1145/3472749.3474765

Wu, J., Peng, Y.-H., Li, A. X. Y., Swearngin, A., Bigham, J. P., & Nichols, J. (2024). UIClip: A Data-Driven Model for Assessing User Interface Design. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 45, 1–16. https://doi.org/10.1145/3654777.3676408

Wu, J., Zhang, X., Nichols, J., & Bigham, J. P. (2021). Screen Parsing: Towards Reverse Engineering of UI Models from Screenshots. Proceedings of the 34th Annual ACM Symposium on User Interface Software and Technology, 470–483. https://doi.org/10.1145/3472749.3474763

Yunianto, I., & Wahyudi, W. (2024). Designing User Experience for a Mobile Application for Agricultural Product Marketing Using the Human-Centered Design Method. International Journal of Graphic Design, 2(2), 207–221. https://doi.org/10.51903/ijgd.v2i2.2123

Chen, Y., & Chan, E. (2023). Multimodal UI Representation Learning: Ablation of Screenshot, Wireframe, and View-Hierarchy Proxies on an Uploaded 168-Screen Dataset. Journal of Advanced Computing Systems, 3(1), 1–15. https://doi.org/10.69987/jacs.2023.30101

Zhang, X., Ross, A. S., & Fogarty, J. (2018). Robust Annotation of Mobile Application Interfaces in Methods for Accessibility Repair and Enhancement. Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, 609–621. https://doi.org/10.1145/3242587.3242616

Published

2025-05-30

How to Cite

LLM-as-Design-Critic: Aligning AI-Generated UI Feedback with Human Graphic Design Judgment. (2025). International Journal of Graphic Design, 3(1), 196–215. https://doi.org/10.51903/ijgd.v3i1.3661