
Advancements and Challenges of Multimodal Models in Medical Applications
1 College of Engineering, Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, Guangdong, China
* Author to whom correspondence should be addressed.
Abstract
Multimodal models have demonstrated significant potential in the medical field, integrating information from modalities such as images and text to improve understanding and reasoning. This paper provides a comprehensive review of their applications, focusing on medical visual question answering (VQA), medical report generation, and surgical assistance systems. In VQA, multimodal models such as MedFuseNet and XrayGPT enhance patient-doctor communication and assist in disease diagnosis. For medical report generation, models such as Medical-VLBERT and RadFM automate report writing, alleviating the workload of healthcare professionals while improving accuracy. In surgical assistance, models such as Surgical-LVLM and PitVQA-Net support surgical localization, pathological analysis, and procedural annotation. Despite these advancements, challenges persist, including data scarcity, limited model interpretability, and difficulty adapting to dynamic medical scenarios. The lack of diverse, well-annotated datasets, particularly for rare diseases, hinders the models' generalization capabilities. Furthermore, ensuring patient privacy and compliance with regulatory frameworks is critical for broader adoption. This review synthesizes recent developments, highlights open challenges, and provides insights into the future of multimodal AI in healthcare. By advancing intelligent healthcare systems, multimodal models have the potential to transform clinical practice, improve diagnostic accuracy, and enhance patient outcomes.
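To make the medical VQA setting concrete, the following is a minimal, illustrative sketch of how a general-purpose vision-language model can be queried about a medical image. It uses the Hugging Face transformers library with a generic BLIP VQA checkpoint rather than any of the specialised models reviewed here; the image path, question, and checkpoint name are placeholders, and a domain-adapted model would be required for clinical use.

# Minimal medical VQA sketch (illustrative only, not one of the reviewed models).
# Assumes a generic BLIP VQA checkpoint; paths and the question are placeholders.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("chest_xray.png").convert("RGB")      # placeholder image file
question = "Is there evidence of pleural effusion?"       # placeholder question

# The processor fuses the visual and textual inputs into one tensor batch,
# and the model generates a short free-text answer conditioned on both.
inputs = processor(images=image, text=question, return_tensors="pt")
output_ids = model.generate(**inputs)
print(processor.decode(output_ids[0], skip_special_tokens=True))

The same image-plus-text interface underlies the report-generation and surgical-assistance systems discussed in the paper, with the question replaced by a report prompt or a scene-specific query.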
Keywords
Multimodal models, Vision-language models, Visual question answering, Surgical assistance
Cite this article
He, Z. (2025). Advancements and Challenges of Multimodal Models in Medical Applications. Applied and Computational Engineering, 135, 167-174.
Data availability
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.
Disclaimer/Publisher's Note
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
About volume
Volume title: Proceedings of the 3rd International Conference on Mechatronics and Smart Systems
© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see Open access policy for details).