Research Article
Open access
Published on 28 May 2025
Wei, T. (2025). A human action recognition method based on spatiotemporal information interaction. Advances in Engineering Innovation, 16(5), 138-151.

A human action recognition method based on spatiotemporal information interaction

Tian Wei *,1
  • 1 Shangluo University

* Author to whom correspondence should be addressed.

https://doi.org/10.54254/2977-3903/2025.23586

Abstract

Current deep-learning approaches to human action recognition often treat temporal, spatial, and background information equally, which limits recognition accuracy. To address this issue, this paper proposes a human action recognition algorithm based on spatiotemporal information interaction. First, a dual-pathway network learns spatial and temporal information at different frame rates: a sparse pathway operating at a low frame rate captures spatial semantic information, and a parallel dense pathway operating at a high frame rate captures temporal motion information. Second, to extract more discriminative features from videos, a cross-dual attention interaction model focuses on key regions of video segments and explicitly exchanges spatiotemporal information between the two pathways. Experimental results show that the proposed algorithm achieves recognition accuracies of 97.6% on the UCF101 dataset and 78.4% on the HMDB51 dataset, outperforming the SlowFast algorithm by 1.8% and 1.4%, respectively. Combined with a nighttime image enhancement algorithm based on MDIFE-Net curve estimation, the method achieves an accuracy of 83.2% on the ARID nighttime dataset, an improvement of 22.9% over the accuracy obtained without image enhancement. This demonstrates the method's strong potential for real-world nighttime action recognition.
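
The two ideas in the abstract, sampling one clip at two frame rates and letting the pathways exchange information through attention, can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: the function names `sample_pathways`, `cross_attend`, and `interact` are hypothetical, the stride `tau=16` and speed ratio `alpha=8` are illustrative values rather than settings reported here, and plain scaled dot-product cross-attention with a residual sum stands in for the cross-dual attention interaction model, whose exact form the abstract does not specify.

```python
import math


def sample_pathways(num_frames, tau=16, alpha=8):
    """Pick frame indices for a sparse (slow) and a dense (fast) pathway.

    tau and alpha are assumed, illustrative values: the slow pathway keeps
    one frame every `tau` frames, the fast pathway one every `tau // alpha`.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if tau % alpha != 0:
        raise ValueError("tau must be divisible by alpha")
    fast_stride = tau // alpha
    slow_idx = list(range(0, num_frames, tau))
    fast_idx = list(range(0, num_frames, fast_stride))
    return slow_idx, fast_idx


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention of one pathway over the other.

    Each query vector receives a weighted average of the other pathway's
    vectors, which serve as both keys and values in this simplified form.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * k[j] for w, k in zip(weights, keys_values))
             for j in range(dim)]
        )
    return attended


def interact(slow_feats, fast_feats):
    """Exchange information in both directions with a residual sum."""
    slow_ctx = cross_attend(slow_feats, fast_feats)
    fast_ctx = cross_attend(fast_feats, slow_feats)
    slow_out = [[x + c for x, c in zip(f, ctx)]
                for f, ctx in zip(slow_feats, slow_ctx)]
    fast_out = [[x + c for x, c in zip(f, ctx)]
                for f, ctx in zip(fast_feats, fast_ctx)]
    return slow_out, fast_out


if __name__ == "__main__":
    slow_idx, fast_idx = sample_pathways(64)
    # Toy per-frame feature vectors; a real model would use CNN features.
    slow_feats = [[0.1 * i, 0.2] for i in slow_idx]
    fast_feats = [[0.05 * i, -0.1] for i in fast_idx]
    slow_out, fast_out = interact(slow_feats, fast_feats)
    print(len(slow_idx), len(fast_idx), len(slow_out), len(fast_out))
```

In this sketch each pathway keeps its own temporal resolution after the exchange, so the slow pathway can draw on motion cues from the densely sampled frames while the fast pathway can draw on the semantic context of the sparsely sampled ones.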

Keywords

image enhancement, action recognition, illumination curve estimation, spatiotemporal information interaction, attention mechanism

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

Disclaimer/Publisher's Note

The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

About volume

Journal: Advances in Engineering Innovation

Volume number: Vol. 16
ISSN: 2977-3903 (Print) / 2977-3911 (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish in this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open access policy for details).