
Policy Gradient Methods in Deep Reinforcement Learning
1 School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, China
* Author to whom correspondence should be addressed.
Abstract
Policy gradient (PG) methods are a fundamental component of deep reinforcement learning (DRL), particularly effective in continuous and high-dimensional control tasks. This paper presents a structured review of PG algorithms, tracing their development from basic Monte Carlo methods like REINFORCE to advanced techniques such as asynchronous advantage actor-critic (A3C), trust region policy optimization (TRPO), proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), and soft actor-critic (SAC). These methods differ in terms of policy structure, optimization stability, and sample efficiency, addressing core challenges in policy learning through gradient-based updates. In addition, this review explores the application of PG methods in real-world domains, including autonomous driving, financial portfolio management, and smart grid energy systems. These applications demonstrate PG methods’ capacity to operate under uncertainty and adapt to complex dynamic environments. However, limitations such as high variance, low sample efficiency, and instability in multi-agent and offline settings remain significant obstacles. The review concludes by outlining emerging research directions, including entropy-based exploration, model-based policy optimization, meta-learning, and Transformer-based sequence modeling. This work aims to offer theoretical insights and practical guidance to support the continued advancement and application of policy gradient methods in reinforcement learning.
Keywords
Policy Gradient, Deep Reinforcement Learning, Actor-Critic Algorithms, Sample Efficiency, Multi-Agent Systems
Cite this article
Gao, Y. (2025). Policy Gradient Methods in Deep Reinforcement Learning. Applied and Computational Engineering, 158, 27-34.
Data availability
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.
Disclaimer/Publisher's Note
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
About volume
Volume title: Proceedings of CONF-SEML 2025 Symposium: Machine Learning Theory and Applications
© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see Open access policy for details).