A survey on pre-training and transfer learning for multimodal Vision-Language Models

Research Article
Open access

Zhongren Liang 1*
  • 1 Beijing University of Posts and Telecommunications    
  • *corresponding author 2936721262@qq.com
Published on 10 June 2025 | https://doi.org/10.54254/2977-3903/2025.23982
AEI Vol.16 Issue 6
ISSN (Print): 2977-3911
ISSN (Online): 2977-3903

Abstract

In recent years, Vision-Language Models (VLMs) have emerged as a significant breakthrough in multimodal learning, demonstrating remarkable progress in tasks such as image-text alignment, image generation, and semantic reasoning. This paper systematically reviews current VLM pre-training methodologies, covering both contrastive and generative paradigms, and analyzes in depth efficient transfer learning strategies such as prompt tuning, LoRA, and adapter modules. Using representative models such as CLIP, BLIP, and GIT, we examine practical applications of VLMs in visual grounding, image-text retrieval, visual question answering, affective computing, and embodied AI. Furthermore, we identify persistent challenges in fine-grained semantic modeling, cross-modal reasoning, and cross-lingual transfer. Finally, we outline future trends in unified architectures, multimodal reinforcement learning, and domain adaptation, aiming to provide a systematic reference and technical insights for subsequent research.
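
As a rough illustration of the contrastive pre-training paradigm named in the abstract, the sketch below computes a symmetric image-text contrastive loss of the kind popularized by CLIP-style models. It is a minimal sketch under stated assumptions, not the formulation of any surveyed paper: the function names, the toy 2-dimensional embeddings, and the temperature value of 0.07 are all illustrative choices, and embeddings are assumed to be non-zero vectors.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length (assumption: vec is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair i (image_embs[i], text_embs[i]) is treated as the positive;
    every other combination in the batch acts as a negative.
    The temperature is an illustrative value, not a tuned setting.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diag(logits)
    text_to_image = cross_entropy_diag(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: two hypothetical image embeddings and two caption embeddings.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = contrastive_loss(aligned, aligned)
loss_mismatched = contrastive_loss(aligned, shuffled)
print(loss_matched < loss_mismatched)
```

In this toy batch the loss is close to zero when each image embedding coincides with its own caption embedding and grows large when the captions are swapped, which is the behavior the contrastive objective is meant to reward.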

Keywords:

Vision-Language Models, multimodal learning, pre-training, transfer learning, contrastive learning

Liang, Z. (2025). A survey on pre-training and transfer learning for multimodal Vision-Language Models. Advances in Engineering Innovation, 16(6), 135-139.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
