Research Article
Open access

High-performance computing: Transitioning from Instruction-Level Parallelism to heterogeneous hybrid architectures

Mingtao Zhang 1*
  • 1 Harbin Institute of Technology
  • *Corresponding author: nullptr@stu.hit.edu.cn
Published on 23 February 2024 | https://doi.org/10.54254/2755-2721/42/20230775
ACE Vol.42
ISSN (Print): 2755-2721
ISSN (Online): 2755-273X
ISBN (Print): 978-1-83558-309-8
ISBN (Online): 978-1-83558-310-4

Abstract

This paper examines the shift from Instruction-Level Parallelism (ILP) to heterogeneous hybrid parallel computing in the quest for higher-performance processing. It sheds light on the constraints of ILP, emphasizing how these shortcomings have catalyzed a move toward the more adaptable and capable framework of heterogeneous hybrid computing. The advantages of this transition are explored across diverse applications, notably deep learning, cloud computing, data centers, and mobile SoCs. The study also surveys emerging architectures and innovations of this era, including many-core processors, FPGA-driven accelerators, and a growing assortment of software tools and libraries. While heterogeneous hybrid computing offers a promising horizon, it is not without challenges: this paper brings to the fore issues such as restricted adaptability, steep development costs, software compatibility hurdles, the absence of a standardized programming model, and vendor lock-in. Through this in-depth exploration, we aim to present a holistic snapshot of the present state and potential future of high-performance processing.
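
To make the contrast concrete, the sketch below (our own illustration, not code from the paper; all names are invented for the example) shows the division of labor that heterogeneous hybrid computing implies: a latency-optimized host CPU performs the sequential setup, while a throughput-optimized GPU accelerator executes the data-parallel bulk of the work, the pattern underlying the deep learning and data center cases the paper surveys.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Device kernel: each GPU thread handles one element (data parallelism).
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Host (CPU) side: sequential setup stays on the latency-optimized core.
        float* hx = (float*)malloc(bytes);
        float* hy = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

        // Device (GPU) side: the accelerator handles the bulk parallel loop.
        // Error checking is omitted for brevity.
        float *dx, *dy;
        cudaMalloc(&dx, bytes);
        cudaMalloc(&dy, bytes);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements.
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);
        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

        printf("y[0] = %f (expected 4.0)\n", hy[0]);
        cudaFree(dx); cudaFree(dy); free(hx); free(hy);
        return 0;
    }

The explicit host/device memory copies and kernel launch also hint at the challenges the paper raises: each accelerator type brings its own programming model, and this code is tied to one vendor's toolchain.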

Keywords:

High-Performance Processing, Instruction-Level Parallelism, Heterogeneous Hybrid Parallel Computing


References

[1]. Aiken, A., Banerjee, U., Kejariwal, A., Nicolau, A. (2016). Introduction. In: Instruction Level Parallelism. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7797-7_1.

[2]. Fatehi, E., & Gratz, P. (2014, August). ILP and TLP in shared memory applications: A limit study. In Proceedings of the 23rd international conference on Parallel architectures and compilation (pp. 113-126).

[3]. Kiriansky, V., Xu, H., Rinard, M., & Amarasinghe, S. (2018, November). Cimple: Instruction and memory level parallelism: A DSL for uncovering ILP and MLP. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (pp. 1-16).

[4]. Zaidi, A. M., Iordanou, K., Luján, M., & Gabrielli, G. (2021, March). Loopapalooza: Investigating Limits of Loop-Level Parallelism with a Compiler-Driven Approach. In 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (pp. 128-138). IEEE.

[5]. Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., & Burger, D. (2011, June). Dark silicon and the end of multicore scaling. In Proceedings of the 38th annual international symposium on Computer architecture (pp. 365-376).

[6]. Dally, W. J., Turakhia, Y., & Han, S. (2020). Domain-specific hardware accelerators. Communications of the ACM, 63(7), 48-57.

[7]. Chamberlain, R. D. (2020). Architecturally truly diverse systems: A review. Future Generation Computer Systems, 110, 33-44.

[8]. Hegde, V., & Usmani, S. (2016). Parallel and distributed deep learning. May, 31, 1-8.

[9]. Madiajagan, M., & Raj, S. S. (2019). Parallel computing, graphics processing unit (GPU) and new hardware for deep learning in computational intelligence research. In Deep learning and parallel computing environment for bioengineering systems (pp. 1-15). Academic Press.

[10]. Choi, W., Duraisamy, K., Kim, R. G., Doppa, J. R., Pande, P. P., Marculescu, R., & Marculescu, D. (2016, October). Hybrid network-on-chip architectures for accelerating deep learning kernels on heterogeneous manycore platforms. In Proceedings of the international conference on compilers, architectures and synthesis for embedded systems (pp. 1-10).

[11]. Kim, Y. W., Choi, S. H., & Han, T. H. (2021). Rapid topology generation and core mapping of optical network-on-chip for heterogeneous computing platform. IEEE Access, 9, 110359-110370.

[12]. Nurvitadhi, E., Venkatesh, G., Sim, J., Marr, D., Huang, R., Ong Gee Hock, J., ... & Boudoukh, G. (2017, February). Can FPGAs beat GPUs in accelerating next-generation deep neural networks? In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (pp. 5-14).

[13]. Putnam, A., Caulfield, A. M., Chung, E. S., Chiou, D., Constantinides, K., Demme, J., ... & Burger, D. (2014). A reconfigurable fabric for accelerating large-scale datacenter services. ACM SIGARCH Computer Architecture News, 42(3), 13-24.

[14]. Damiani, A., Fiscaletti, G., Bacis, M., Brondolin, R., & Santambrogio, M. D. (2022). BlastFunction: A full-stack framework bringing FPGA hardware acceleration to cloud-native applications. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 15(2), 1-27.

[15]. Bobda, C., Mbongue, J. M., Chow, P., Ewais, M., Tarafdar, N., Vega, J. C., ... & Tessier, R. (2022). The future of FPGA acceleration in datacenters and the cloud. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 15(3), 1-42.

[16]. Shan, Y., Lin, W., Guo, Z., & Zhang, Y. (2022, August). Towards a fully disaggregated and programmable data center. In Proceedings of the 13th ACM SIGOPS Asia-Pacific Workshop on Systems (pp. 18-28).

[17]. Gouk, D., Lee, S., Kwon, M., & Jung, M. (2022). Direct access, high-performance memory disaggregation with DirectCXL. In 2022 USENIX Annual Technical Conference (USENIX ATC 22) (pp. 287-294).

[18]. Vuppalapati, M., Miron, J., Agarwal, R., Truong, D., Motivala, A., & Cruanes, T. (2020). Building an elastic query engine on disaggregated storage. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20) (pp. 449-462).

[19]. Halpern, M., Zhu, Y., & Reddi, V. J. (2016, March). Mobile CPU's rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 64-76). IEEE.

[20]. Zhu, Y., Mattina, M., & Whatmough, P. (2018). Mobile machine learning hardware at Arm: A systems-on-chip (SoC) perspective. arXiv preprint arXiv:1801.06274.

[21]. Zhu, Y., Samajdar, A., Mattina, M., & Whatmough, P. (2018). Euphrates: Algorithm-SoC co-design for low-power mobile continuous vision. arXiv preprint arXiv:1803.11232.

[22]. Hassan, M. (2017). Heterogeneous MPSoCs for mixed criticality systems: Challenges and opportunities. arXiv preprint arXiv:1706.07429.

[23]. Majo, Z., & Gross, T. R. (2011, May). Memory system performance in a NUMA multicore multiprocessor. In Proceedings of the 4th Annual International Conference on Systems and Storage (pp. 1-10).

[24]. Mittal, S. (2020). A survey of FPGA-based accelerators for convolutional neural networks. Neural Computing and Applications, 32(4), 1109-1139.

[25]. Agrawal, R., de Castro, L., Yang, G., Juvekar, C., Yazicigil, R., Chandrakasan, A., ... & Joshi, A. (2023, February). FAB: An FPGA-based accelerator for bootstrappable fully homomorphic encryption. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (pp. 882-895). IEEE.

[26]. Haghi, A., Marco-Sola, S., Alvarez, L., Diamantopoulos, D., Hagleitner, C., & Moreto, M. (2021, August). An FPGA accelerator of the wavefront algorithm for genomics pairwise alignment. In 2021 31st International Conference on Field-Programmable Logic and Applications (FPL) (pp. 151-159). IEEE.

[27]. Yang, W., & Wang, H. Survey of heterogeneous hybrid parallel computing. Computer Science, Chongqing, China, 47, 5-10.

[28]. Nowatzki, T., Gangadhan, V., Sankaralingam, K., & Wright, G. (2016, March). Pushing the limits of accelerator efficiency while retaining programmability. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 27-39). IEEE.

[29]. Hill, M. D., & Reddi, V. J. (2021). Accelerator-level parallelism. Communications of the ACM, 64(12), 36-38.

[30]. Fuchs, A., & Wentzlaff, D. (2019, February). The accelerator wall: Limits of chip specialization. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 1-14). IEEE.


Cite this article

Zhang,M. (2024). High-performance computing: Transitioning from Instruction-Level Parallelism to heterogeneous hybrid architectures. Applied and Computational Engineering,42,178-185.

Data availability

The datasets used and/or analyzed during the current study are available from the authors upon reasonable request.

Disclaimer/Publisher's Note

The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

About volume

Volume title: Proceedings of the 2023 International Conference on Machine Learning and Automation

ISBN: 978-1-83558-309-8 (Print) / 978-1-83558-310-4 (Online)
Editor: Mustafa İSTANBULLU
Conference website: https://2023.confmla.org/
Conference date: 18 October 2023
Series: Applied and Computational Engineering
Volume number: Vol.42
ISSN: 2755-2721 (Print) / 2755-273X (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish in this series agree to the following terms:
1. Authors retain copyright and grant the series the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open access policy for details).
