Performance and accuracy of Python distributed neural network training frameworks: Spark vs. TensorFlow in big data modeling and prediction
1 Sun Yat-sen University
* Author to whom correspondence should be addressed.
Abstract
The rapid proliferation of big data across various industries, such as healthcare, finance, and social media, has created an urgent need for robust and efficient distributed neural network training frameworks. These frameworks must handle vast volumes of data while ensuring high performance and accuracy in machine learning tasks. As organizations increasingly rely on machine learning models to extract actionable insights from big data, selecting the right framework becomes critical. This paper provides a comprehensive comparison of two leading distributed neural network training frameworks: Apache Spark and TensorFlow. These frameworks are widely adopted due to their scalability and flexibility in handling complex data-driven tasks. The study evaluates their performance and accuracy in big data modeling and prediction, focusing on key metrics such as training speed, model accuracy, resource utilization, and scalability. Utilizing datasets that mirror real-world applications, the study includes thorough data preprocessing, model construction, and distributed training experiments across both frameworks. The findings reveal that while TensorFlow achieves superior model accuracy, Apache Spark demonstrates better scalability and resource efficiency, particularly in large-scale data environments. These insights offer valuable guidance for researchers and industry practitioners in selecting the most appropriate framework for their specific big data applications, ensuring optimal performance and resource management.
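To make the comparison concrete, the minimal sketch below illustrates how a small binary classifier might be trained in each framework: TensorFlow with a synchronous data-parallel tf.distribute.MirroredStrategy, and Spark with MLlib's MultilayerPerceptronClassifier. This is an illustrative sketch only, not the paper's benchmark code; the layer sizes, column names, file path, and hyperparameters are placeholder assumptions.

```python
# Illustrative sketch only: the layer sizes, dataset path, column names, and
# hyperparameters below are placeholder assumptions, not the paper's setup.

# --- TensorFlow: synchronous data-parallel training with MirroredStrategy ---
import tensorflow as tf

def train_tf(dataset: tf.data.Dataset) -> tf.keras.Model:
    strategy = tf.distribute.MirroredStrategy()       # replicate the model across local GPUs
    with strategy.scope():                            # variables are created under the strategy
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam",
                      loss="binary_crossentropy",
                      metrics=["accuracy"])
    model.fit(dataset.batch(256), epochs=5)           # gradients are averaged across replicas
    return model

# --- Spark MLlib: distributed multilayer perceptron on a DataFrame ---
from pyspark.sql import SparkSession
from pyspark.ml.classification import MultilayerPerceptronClassifier

def train_spark(train_path: str):
    spark = SparkSession.builder.appName("mlp-benchmark").getOrCreate()
    df = spark.read.parquet(train_path)               # expects a "features" vector column and a "label" column
    mlp = MultilayerPerceptronClassifier(layers=[20, 128, 2], maxIter=100)
    return mlp.fit(df)                                # training work is distributed across executors
```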
Keywords
Distributed neural networks, big data, Apache Spark, TensorFlow, performance analysis
Cite this article
He, M. (2024). Performance and accuracy of Python distributed neural network training frameworks: Spark vs. TensorFlow in big data modeling and prediction. Applied and Computational Engineering, 92, 52-58.
Data availability
The datasets used and/or analyzed during the current study are available from the authors upon reasonable request.
Disclaimer/Publisher's Note
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
About volume
Volume title: Proceedings of the 6th International Conference on Computing and Data Science
© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish with this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., posting it to an institutional repository or publishing it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their websites) prior to and during the submission process, as this can lead to productive exchanges, as well as earlier and greater citation of published work (see the open access policy for details).