
Evaluating Large Language Models for Code Generation: A Comparative Study on Python, Java, and Swift
1 The University of Melbourne, Parkville, VIC 3010, Australia
* Author to whom correspondence should be addressed.
Abstract
With the development of artificial intelligence (AI), particularly in natural language processing and machine learning, AI applications in code generation, error correction, and programming assistance have become increasingly common. However, differences in code generation capability among models affect their practical applicability to programming tasks. To investigate this issue, this study evaluates the performance of five state-of-the-art large language models (LLMs), namely GPT-4o, OpenAI o1, OpenAI o1 Pro, Claude 3.5, and Gemini 2.0, through a systematic comparative analysis across three programming languages: Python, Java, and Swift. The evaluation framework considers multiple aspects, including overall accuracy, code efficiency, time complexity, space complexity, and multi-solution generation capability. The experimental results reveal substantial variation among models: OpenAI o1 Pro and Gemini 2.0 achieve the highest accuracy, GPT-4o generates the most concise code, and Claude 3.5 produces the greatest number of alternative solutions. However, all models perform worse in Swift than in Python and Java, likely because of the limited availability of Swift training data. An in-depth error analysis identifies differences in model adaptability across programming languages and highlights key limitations of AI-assisted programming. These findings offer insights for developers and users of AI-assisted programming tools, supporting more informed decisions when selecting and applying these technologies in different programming contexts.
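As a rough illustration of the kind of functional-correctness check that underlies this sort of evaluation framework, the minimal Python sketch below runs a candidate solution against input/output test cases. The `solve` function name and the test-case format are assumptions made for illustration; this is not the paper's actual harness.

```python
# Minimal sketch of a pass/fail correctness check for LLM-generated code.
# Assumption: each problem supplies (args, expected) test cases and the
# generated snippet defines a function named `solve`. Illustrative only.

def run_candidate(source_code: str, test_cases) -> bool:
    """Execute a generated snippet and check it against all test cases."""
    namespace = {}
    try:
        exec(source_code, namespace)           # load the candidate definition
        solve = namespace["solve"]
        return all(solve(*args) == expected    # every test must pass
                   for args, expected in test_cases)
    except Exception:                          # runtime errors count as failure
        return False


if __name__ == "__main__":
    candidate = "def solve(a, b):\n    return a + b\n"
    tests = [((1, 2), 3), ((-1, 1), 0)]
    print(run_candidate(candidate, tests))     # True: all tests pass
```

A per-language harness of this shape (with Java and Swift candidates compiled and run as subprocesses rather than exec'd in-process) would suffice to compute the accuracy figures the abstract describes.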
Keywords
AI-Assisted Programming, Code Generation, Large Language Models
Cite this article
Wan, B. (2025). Evaluating Large Language Models for Code Generation: A Comparative Study on Python, Java, and Swift. Applied and Computational Engineering, 146, 109-126.
Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.
Disclaimer/Publisher's Note
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
About volume
Volume title: Proceedings of SEML 2025 Symposium: Machine Learning Theory and Applications
© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their websites) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see Open access policy for details).