Research Article
Open access
Published on 19 December 2024
Li, X. (2024). Strategies of Building AI Agents for Multimodal Productivity with Contemporary Large Language Models. Applied and Computational Engineering, 116, 108-113.

Strategies of Building AI Agents for Multimodal Productivity with Contemporary Large Language Models

Xiaotian Li 1,*
  • 1 Faculty of Natural Sciences, Norwegian University of Science and Technology, Trondheim, Norway

* Author to whom correspondence should be addressed.

https://doi.org/10.54254/2755-2721/2025.20428

Abstract

The past few years have witnessed significant advances in generative artificial intelligence (AI) led by large language models (LLMs), whose applications demonstrate capabilities in tasks that were traditionally unattainable. Numerous efforts now explore an even more exciting prospect: employing LLMs not merely as language processors, but as the foundation of AI agents that can adapt to diverse tasks and complex scenarios. This paper surveys state-of-the-art strategies for deploying such models to generate both text-based domain-specific content and multimodal outputs, embodied in interactions with web applications, industrial software, and ultimately the physical world. Two approaches to implementing multimodality are delineated: direct embedding of multimodal data, and conversion of multimodal data to text. Both have seen extensive use in active research areas such as image processing, embodied action, and software automation. Representative cases from these categories are reviewed with a focus on their input/output modalities, their methods of processing multimodal data, and their output quality.
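The two multimodality strategies named above can be contrasted with a minimal sketch. All dimensions, function names, and the stub captioner below are illustrative assumptions for exposition, not the interface of any particular model discussed in the survey.

```python
# Strategy 1: direct embedding -- project vision-encoder patch vectors
# into the LLM's token-embedding space so they can be interleaved with
# ordinary text tokens (dimensions here are toy-sized).
def project_patches(patches, weight):
    """Linear projection: (n_patches x d_img) @ (d_img x d_llm)."""
    d_llm = len(weight[0])
    return [[sum(p[i] * weight[i][j] for i in range(len(p)))
             for j in range(d_llm)] for p in patches]

# Strategy 2: conversion to text -- describe the image with a separate
# captioning model (stubbed here) and hand the resulting string to the
# LLM as plain text inside the prompt.
def caption(image_id):
    stub_captions = {"img_001": "a robot arm stacking colored blocks"}
    return stub_captions.get(image_id, "an unrecognized image")

def build_prompt(instruction, image_id):
    return f"{instruction}\nImage description: {caption(image_id)}"

patches = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # 2 patches, d_img = 3
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # d_img = 3, d_llm = 2
embedded = project_patches(patches, weight)    # 2 vectors in LLM space
prompt = build_prompt("Plan the next action.", "img_001")
```

The trade-off mirrors the survey's framing: direct embedding preserves information from the raw modality but requires a model trained to accept the projected vectors, while conversion to text works with any off-the-shelf LLM at the cost of whatever the captioner discards.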

Keywords

Artificial Intelligence, Computation and Language, Large Language Models, AI Agents


Data availability

The datasets used and/or analyzed during the current study are available from the authors upon reasonable request.

Disclaimer/Publisher's Note

The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

About volume

Volume title: Proceedings of the 5th International Conference on Signal Processing and Machine Learning

Conference website: https://2025.confspml.org/
ISBN: 978-1-83558-791-1 (Print) / 978-1-83558-792-8 (Online)
Conference date: 12 January 2025
Editor: Stavros Shiaeles
Series: Applied and Computational Engineering
Volume number: Vol. 116
ISSN: 2755-2721 (Print) / 2755-273X (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish with this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open access policy for details).