Trust-Calibrated Multilingual RAG for Humanitarian Information Platforms: Empirical Evaluation on OMoS-QA for Migration Information Access
DOI:
https://doi.org/10.51903/ijgd.v4i1.3552
Keywords:
Humanitarian Information Platforms, Multilingual Retrieval-Augmented Generation, Answerability Detection, Human-Centered AI, Explainable Interfaces
Abstract
Humanitarian information platforms increasingly serve migrants, refugees, and crisis-affected users who need correct answers about housing, schooling, legal procedures, benefits, health, and emergency services. In this setting, a wrong answer is more harmful than a missing answer, so multilingual question-answering systems must not only retrieve and summarize relevant content but also calibrate when to answer, when to abstain, and how to communicate uncertainty to the user. This paper develops a trust-calibrated multilingual retrieval-augmented generation (RAG) design for humanitarian information platforms and evaluates it on the public OMoS-QA benchmark for migration information access. The study combines two empirical layers. First, we run a direct page-retrieval evaluation over the full public corpus and compare BM25, word-level TF-IDF, character-level TF-IDF, and a lexical-character hybrid retriever. Second, we reanalyze the officially scored benchmark outputs released with OMoS-QA for sentence-level answer extraction, question-level no-answer detection, multilingual transfer, and cross-language transfer. All numerical results are empirically measured; no illustrative placeholders are used. The hybrid retriever reaches 69.4% recall at rank 1, 82.6% at rank 3, and 86.1% at rank 5, outperforming the sparse baselines. On same-language answer extraction, DeBERTa achieves the strongest balanced F1 (62.5 German, 64.9 English), while Llama-3-70B and GPT-3.5-Turbo obtain the strongest no-answer detection results. Explicit answerability prompting raises Llama-3-70B recall on unanswerable questions to 83.6% in German and 78.2% in English. Multilingual experiments show moderate degradation for French and larger losses for Arabic and Ukrainian, while cross-language transfer remains surprisingly robust. 
Based on these findings, the paper formulates a design contribution for graphic and interaction design: a trust-calibrated evidence-card pattern that combines evidence highlighting, citation links, uncertainty cues, and escalation to human support. The result is a benchmark-grounded interface logic for safer public-interest LLM applications rather than a user-validated final interface.
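The retrieval figures above are reported as recall at rank k. As a minimal sketch of how such a metric can be computed, assuming each question has exactly one gold page and each retriever returns a ranked list of page identifiers, the function below counts the share of questions whose gold page appears in the top k. The function name `recall_at_k` and the toy question and page identifiers are hypothetical and are not taken from OMoS-QA or from the paper's evaluation code.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Return the fraction of questions whose gold page is among the top-k results.

    Assumption: one gold page per question (a simplification for illustration).
    """
    if not rankings:
        return 0.0
    hits = sum(1 for qid, ranked in rankings.items() if gold[qid] in ranked[:k])
    return hits / len(rankings)


# Toy data, invented for illustration only.
rankings = {
    "q1": ["p3", "p1", "p7"],
    "q2": ["p2", "p5", "p9"],
    "q3": ["p8", "p4", "p6"],
    "q4": ["p1", "p2", "p3"],
}
gold = {"q1": "p1", "q2": "p2", "q3": "p6", "q4": "p9"}

for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(rankings, gold, k):.2f}")
```

On the toy data the gold page is found for one of four questions at rank 1, two at rank 2, and three at rank 3, so recall never decreases as k grows. This matches the pattern in the reported figures, which rise from rank 1 to rank 5.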
License
Copyright (c) 2026 Yushan Chen, Haosen Xu

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.









