References
[1]. Aiken, A., Banerjee, U., Kejariwal, A., & Nicolau, A. (2016). Introduction. In: Instruction Level Parallelism. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7797-7_1
[2]. Fatehi, E., & Gratz, P. (2014, August). ILP and TLP in shared memory applications: A limit study. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (pp. 113-126).
[3]. Kiriansky, V., Xu, H., Rinard, M., & Amarasinghe, S. (2018, November). Cimple: Instruction and memory level parallelism: A DSL for uncovering ILP and MLP. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (pp. 1-16).
[4]. Zaidi, A. M., Iordanou, K., Luján, M., & Gabrielli, G. (2021, March). Loopapalooza: Investigating Limits of Loop-Level Parallelism with a Compiler-Driven Approach. In 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (pp. 128-138). IEEE.
[5]. Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., & Burger, D. (2011, June). Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (pp. 365-376).
[6]. Dally, W. J., Turakhia, Y., & Han, S. (2020). Domain-specific hardware accelerators. Communications of the ACM, 63(7), 48-57.
[7]. Chamberlain, R. D. (2020). Architecturally truly diverse systems: A review. Future Generation Computer Systems, 110, 33-44.
[8]. Hegde, V., & Usmani, S. (2016, May 31). Parallel and distributed deep learning, 1-8.
[9]. Madiajagan, M., & Raj, S. S. (2019). Parallel computing, graphics processing unit (GPU) and new hardware for deep learning in computational intelligence research. In Deep learning and parallel computing environment for bioengineering systems (pp. 1-15). Academic Press.
[10]. Choi, W., Duraisamy, K., Kim, R. G., Doppa, J. R., Pande, P. P., Marculescu, R., & Marculescu, D. (2016, October). Hybrid network-on-chip architectures for accelerating deep learning kernels on heterogeneous manycore platforms. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (pp. 1-10).
[11]. Kim, Y. W., Choi, S. H., & Han, T. H. (2021). Rapid topology generation and core mapping of optical network-on-chip for heterogeneous computing platform. IEEE Access, 9, 110359-110370.
[12]. Nurvitadhi, E., Venkatesh, G., Sim, J., Marr, D., Huang, R., Ong Gee Hock, J., ... & Boudoukh, G. (2017, February). Can FPGAs beat GPUs in accelerating next-generation deep neural networks? In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (pp. 5-14).
[13]. Putnam, A., Caulfield, A. M., Chung, E. S., Chiou, D., Constantinides, K., Demme, J., ... & Burger, D. (2014). A reconfigurable fabric for accelerating large-scale datacenter services. ACM SIGARCH Computer Architecture News, 42(3), 13-24.
[14]. Damiani, A., Fiscaletti, G., Bacis, M., Brondolin, R., & Santambrogio, M. D. (2022). BlastFunction: A full-stack framework bringing FPGA hardware acceleration to cloud-native applications. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 15(2), 1-27.
[15]. Bobda, C., Mbongue, J. M., Chow, P., Ewais, M., Tarafdar, N., Vega, J. C., ... & Tessier, R. (2022). The future of FPGA acceleration in datacenters and the cloud. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 15(3), 1-42.
[16]. Shan, Y., Lin, W., Guo, Z., & Zhang, Y. (2022, August). Towards a fully disaggregated and programmable data center. In Proceedings of the 13th ACM SIGOPS Asia-Pacific Workshop on Systems (pp. 18-28).
[17]. Gouk, D., Lee, S., Kwon, M., & Jung, M. (2022). Direct access, high-performance memory disaggregation with DirectCXL. In 2022 USENIX Annual Technical Conference (USENIX ATC 22) (pp. 287-294).
[18]. Vuppalapati, M., Miron, J., Agarwal, R., Truong, D., Motivala, A., & Cruanes, T. (2020). Building an elastic query engine on disaggregated storage. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20) (pp. 449-462).
[19]. Halpern, M., Zhu, Y., & Reddi, V. J. (2016, March). Mobile CPU's rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 64-76). IEEE.
[20]. Zhu, Y., Mattina, M., & Whatmough, P. (2018). Mobile machine learning hardware at ARM: A systems-on-chip (SoC) perspective. arXiv preprint arXiv:1801.06274.
[21]. Zhu, Y., Samajdar, A., Mattina, M., & Whatmough, P. (2018). Euphrates: Algorithm-SoC co-design for low-power mobile continuous vision. arXiv preprint arXiv:1803.11232.
[22]. Hassan, M. (2017). Heterogeneous MPSoCs for mixed criticality systems: Challenges and opportunities. arXiv preprint arXiv:1706.07429.
[23]. Majo, Z., & Gross, T. R. (2011, May). Memory system performance in a NUMA multicore multiprocessor. In Proceedings of the 4th Annual International Conference on Systems and Storage (pp. 1-10).
[24]. Mittal, S. (2020). A survey of FPGA-based accelerators for convolutional neural networks. Neural Computing and Applications, 32(4), 1109-1139.
[25]. Agrawal, R., de Castro, L., Yang, G., Juvekar, C., Yazicigil, R., Chandrakasan, A., ... & Joshi, A. (2023, February). FAB: An FPGA-based accelerator for bootstrappable fully homomorphic encryption. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (pp. 882-895). IEEE.
[26]. Haghi, A., Marco-Sola, S., Alvarez, L., Diamantopoulos, D., Hagleitner, C., & Moreto, M. (2021, August). An FPGA accelerator of the wavefront algorithm for genomics pairwise alignment. In 2021 31st International Conference on Field-Programmable Logic and Applications (FPL) (pp. 151-159). IEEE.
[27]. Yang, W. D., & Wang, H. T. Survey of heterogeneous hybrid parallel computing. Computer Science, 47, 5-10.
[28]. Nowatzki, T., Gangadhar, V., Sankaralingam, K., & Wright, G. (2016, March). Pushing the limits of accelerator efficiency while retaining programmability. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 27-39). IEEE.
[29]. Hill, M. D., & Reddi, V. J. (2021). Accelerator-level parallelism. Communications of the ACM, 64(12), 36-38.
[30]. Fuchs, A., & Wentzlaff, D. (2019, February). The accelerator wall: Limits of chip specialization. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 1-14). IEEE.
Data availability
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.