Research Article
Open access
Published on 26 February 2024

Transformers: Statistical interpretation, architectures and applications

Fanfei Meng 1,*, Yuxin Wang 2
  • 1 Northwestern University
  • 2 Northwestern University

* Author to whom correspondence should be addressed.

https://doi.org/10.54254/2755-2721/43/20230832

Abstract

Transformers are widely recognized as powerful tools for a range of tasks, such as Natural Language Processing (NLP), Computer Vision (CV) and Speech Recognition (SR), owing to their state-of-the-art multi-head attention mechanism. Motivated by their rich architectural designs and strong capacity for analyzing input data, we begin with the various architectures, proceed to an investigation of their statistical mechanisms and inference, and then introduce their applications to dominant tasks. The underlying statistical mechanisms are of particular interest, and this survey focuses on the mathematical foundations of Transformers and uses those principles to analyze the reasons for their excellent performance in many recognition scenarios.
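
The multi-head attention mentioned above builds on scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, applied in parallel over several learned projections of the input. The NumPy sketch below illustrates a single attention head only; the function names, shapes and toy data are our own illustration and are not taken from the paper.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of queries, keys and values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise query-key similarities
    weights = softmax(scores, axis=-1)  # each row is a probability distribution
    return weights @ V                  # weighted average of the value vectors

# Toy usage: 4 tokens with 8-dimensional queries, keys and values.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)

Multi-head attention runs this operation several times in parallel with different learned projection matrices and concatenates the results before a final linear projection.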

Keywords

Transformer, Natural Language Processing (NLP), Computer Vision (CV), Speech Recognition (SR), Deep Learning

Cite this article

Meng, F.; Wang, Y. (2024). Transformers: Statistical interpretation, architectures and applications. Applied and Computational Engineering, 43, 193-210.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

Disclaimer/Publisher's Note

The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

About volume

Volume title: Proceedings of the 2023 International Conference on Machine Learning and Automation

Conference website: https://2023.confmla.org/
ISBN: 978-1-83558-311-1 (Print) / 978-1-83558-312-8 (Online)
Conference date: 18 October 2023
Editor: Mustafa İSTANBULLU
Series: Applied and Computational Engineering
Volume number: Vol. 43
ISSN: 2755-2721 (Print) / 2755-273X (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish in this series agree to the following terms:
1. Authors retain copyright and grant the series the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open access policy for details).