Trust-Calibrated Multilingual RAG for Humanitarian Information Platforms: Empirical Evaluation on OMoS-QA for Migration Information Access

Authors

  • Yushan Chen, Service Design, Savannah College of Art and Design, GA, USA
  • Haosen Xu, Electrical Engineering and Computer Science, University of California, Berkeley, CA, USA

DOI:

https://doi.org/10.51903/ijgd.v4i1.3552

Keywords:

Humanitarian Information Platforms, Multilingual Retrieval-Augmented Generation, Answerability Detection, Human-Centered AI, Explainable Interfaces

Abstract

Humanitarian information platforms increasingly serve migrants, refugees, and crisis-affected users who need correct answers about housing, schooling, legal procedures, benefits, health, and emergency services. In this setting, a wrong answer is more harmful than a missing one, so a multilingual question-answering system must not only retrieve and summarize relevant content but also decide when to answer, when to abstain, and how to communicate uncertainty to the user. This paper develops a trust-calibrated multilingual retrieval-augmented generation (RAG) design for humanitarian information platforms and evaluates it on the public OMoS-QA benchmark for migration information access. The study combines two empirical layers. First, we run a direct page-retrieval evaluation over the full public corpus and compare BM25, word-level TF-IDF, character-level TF-IDF, and a hybrid lexical-character retriever. Second, we reanalyze the officially scored benchmark outputs released with OMoS-QA for sentence-level answer extraction, question-level no-answer detection, multilingual transfer, and cross-language transfer. All reported numbers are measured empirically; none are illustrative placeholders. The hybrid retriever reaches 69.4% recall at rank 1, 82.6% at rank 3, and 86.1% at rank 5, outperforming the sparse baselines. On same-language answer extraction, DeBERTa achieves the strongest balanced F1 (62.5 for German, 64.9 for English), while Llama-3-70B and GPT-3.5-Turbo obtain the strongest no-answer detection results. Explicit answerability prompting raises Llama-3-70B recall on unanswerable questions to 83.6% in German and 78.2% in English. Multilingual experiments show moderate degradation for French and larger losses for Arabic and Ukrainian, while cross-language transfer remains surprisingly robust.
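As a rough illustration of the retrieval layer described above, the following sketch blends word-level and character-level TF-IDF cosine scores and computes recall at rank k. It is not the authors' implementation: the blend weight `alpha`, the trigram size, the IDF smoothing, and the three toy pages and two queries are all assumptions introduced here for illustration.

```python
import math
import re
from collections import Counter


def word_tokens(text):
    """Lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def char_ngrams(text, n=3):
    """Character n-grams over the normalised text (n=3 is an assumption)."""
    s = " " + " ".join(word_tokens(text)) + " "
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def weigh(tokens, idf):
    """L2-normalised TF-IDF vector; tokens unseen in the corpus are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def tfidf_vectors(token_lists):
    """Fit a smoothed IDF on the corpus and return (document vectors, idf)."""
    n_docs = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [weigh(toks, idf) for toks in token_lists], idf


def cosine(a, b):
    """Dot product of two already-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


class HybridRetriever:
    """Weighted sum of word-level and character-level TF-IDF similarity."""

    def __init__(self, pages, alpha=0.5):  # alpha is a hypothetical blend weight
        self.alpha = alpha
        self.word_vecs, self.word_idf = tfidf_vectors([word_tokens(p) for p in pages])
        self.char_vecs, self.char_idf = tfidf_vectors([char_ngrams(p) for p in pages])

    def rank(self, query):
        """Return page indices sorted from most to least similar."""
        qw = weigh(word_tokens(query), self.word_idf)
        qc = weigh(char_ngrams(query), self.char_idf)
        scores = [
            self.alpha * cosine(qw, wv) + (1 - self.alpha) * cosine(qc, cv)
            for wv, cv in zip(self.word_vecs, self.char_vecs)
        ]
        return sorted(range(len(scores)), key=lambda i: -scores[i])


def recall_at_k(rankings, gold, k):
    """Share of queries whose gold page appears in the top k results."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


# Toy data, invented for this sketch (not taken from OMoS-QA).
pages = [
    "Kindergarten registration and school enrolment for children",
    "Applying for housing benefit and finding an apartment",
    "Health insurance and visiting a doctor",
]
queries = ["How do I enrol my child in school?", "Where can I find a doctor?"]
gold = [0, 2]

retriever = HybridRetriever(pages)
rankings = [retriever.rank(q) for q in queries]
for k in (1, 3):
    print(f"recall@{k} = {recall_at_k(rankings, gold, k):.3f}")
```

The character-level channel is what lets "enrol" and "child" partially match "enrolment" and "children" even though the word-level channel sees no overlap, which is one plausible reason a hybrid could help on inflected or misspelled queries.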
Based on these findings, the paper proposes a contribution to graphic and interaction design: a trust-calibrated evidence-card pattern that combines evidence highlighting, citation links, uncertainty cues, and escalation to human support. The result is a benchmark-grounded interface logic for safer public-interest LLM applications, not a user-validated final interface.
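One way to read the evidence-card pattern is as a small decision rule over an answerability estimate and a set of scored evidence sentences. The sketch below is an assumption-laden illustration, not the paper's specification: the `EvidenceCard` fields, the `build_card` helper, the three action labels, the cue wording, and every threshold are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceCard:
    """Hypothetical card model: what the interface would render for one question."""
    action: str                                       # "answer", "answer_with_caution", "escalate"
    highlighted: list = field(default_factory=list)   # evidence sentences to highlight
    citation_url: str = ""                            # link back to the source page
    uncertainty_cue: str = ""                         # short message shown to the user


def build_card(evidence, source_url, p_answerable,
               answer_at=0.8, caution_at=0.5, min_sentence_score=0.5):
    """Map model outputs to a card.

    evidence: list of (sentence, score) pairs from an extractive model.
    p_answerable: estimated probability that the page answers the question.
    All thresholds are illustrative and would need calibration on held-out data.
    """
    kept = [sentence for sentence, score in evidence if score >= min_sentence_score]

    # Abstain and hand over to a human when there is no usable evidence
    # or the question looks unanswerable from the retrieved page.
    if not kept or p_answerable < caution_at:
        return EvidenceCard(
            action="escalate",
            uncertainty_cue="No reliable answer was found. Please contact a counsellor.",
        )

    # Show the evidence, but flag it, when answerability is only moderate.
    if p_answerable < answer_at:
        return EvidenceCard(
            action="answer_with_caution",
            highlighted=kept,
            citation_url=source_url,
            uncertainty_cue="This answer may be incomplete. Please check the source.",
        )

    return EvidenceCard(action="answer", highlighted=kept, citation_url=source_url)


# Example call with made-up values.
card = build_card(
    evidence=[("Registration opens in March.", 0.91), ("See also our events.", 0.12)],
    source_url="https://example.org/page",
    p_answerable=0.65,
)
print(card.action, card.highlighted)
```

Keeping abstention as the default branch reflects the premise that a wrong answer costs more than a missing one; where the thresholds should sit is an empirical question this sketch does not settle.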

Published

2026-04-15

How to Cite

Chen, Y., & Xu, H. (2026). Trust-Calibrated Multilingual RAG for Humanitarian Information Platforms: Empirical Evaluation on OMoS-QA for Migration Information Access. International Journal of Graphic Design, 4(1), 141–164. https://doi.org/10.51903/ijgd.v4i1.3552