1. Introduction
Transformers have gained immense popularity in deep learning over the past few years. They have achieved state-of-the-art results and practical applications in tasks such as machine translation, text generation, and image processing. The key to the success of this architecture is the self-attention mechanism, which encodes input data in parallel, improving efficiency and capturing complex correlations across different types of large datasets [1-4].
Building on this understanding of the transformer mechanism and the reasons for its success in practical applications, we review [5], a paper that delves into the mathematical intricacies of the transformer's attention mechanism from a multi-particle dynamical system perspective. This viewpoint enables a systematic study of attention's convergence towards clusters and of its relation to the Kuramoto oscillator model in the simplified two-dimensional case of particles on the unit circle. We end the paper by introducing the Sumformer, a practical sequence-to-sequence architecture, and reviewing how a Transformer can approximate it.
2. Background
2.1. Multi-Particle Dynamical System
The subject of multi-particle dynamical systems concerns the evolution in time of systems of many interacting particles. A dynamical system on a state space $X$ is described by a flow map $\varphi : \mathbb{R} \times X \to X$; the partial map fixing the time variable, $\varphi_t := \varphi(t, \cdot) : X \to X$, sends an initial state to the state reached after time $t$.
Smooth dynamical systems can be modeled using ODEs. If we have a system with state $x(t) \in \mathbb{R}^d$, its evolution is governed by $\dot{x}(t) = f(x(t), t)$ together with an initial condition $x(0) = x_0$, where $f$ is the vector field driving the dynamics.
In the case of multi-particle dynamical systems, the vector field acting on each particle depends on the state of all the particles, so the system takes the coupled form $\dot{x}_i(t) = f_i(x_1(t), \dots, x_n(t), t)$ for $i = 1, \dots, n$.
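As an illustration (our own example, not taken from [5]), a simple first-order interacting-particle system is
\[ \dot{x}_i(t) = \frac{1}{n}\sum_{j=1}^{n} K\big(x_j(t) - x_i(t)\big), \qquad x_i(0) = x_i^0, \qquad i = 1, \dots, n, \]
where $K : \mathbb{R}^d \to \mathbb{R}^d$ is an interaction kernel. For the choice $K(z) = z$, every particle is attracted to the instantaneous mean of the system, a toy version of the clustering behaviour discussed in Section 3.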
2.2. Neural Networks as dynamical systems
Deep neural networks can be thought of as discretizations of continuous dynamical systems [3]. This interpretation has been widely used in the literature in recent years, since it allows tools from numerical analysis to be used to understand and better design neural networks.
ResNet, a simple example. The ResNet architecture is a simple example that illustrates well the suitability of the dynamical system interpretation. We start from a simple initial value problem for a first-order ODE, $\dot{x}(t) = f(x(t), t)$ with $x(0) = x_0$, where $f$ is the vector field of the dynamics. Discretizing it with the forward Euler scheme and step size $\Delta t$ gives $x_{k+1} = x_k + \Delta t\, f(x_k, t_k)$, where $x_k \approx x(t_k)$. This is exactly the update rule $x_{k+1} = x_k + F(x_k, \theta_k)$ of a residual block, where $F(\cdot, \theta_k)$ is the residual mapping of layer $k$ with parameters $\theta_k$ (absorbing the step size): the layers of a ResNet are the time steps of a discretized dynamical system.
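A minimal sketch of this correspondence (our own illustration; the residual map F below is a hypothetical stand-in for a trained sub-network): applying a stack of residual blocks to a state is the same computation as integrating the associated ODE with the forward Euler method and unit step size.

import numpy as np

def F(x, k):
    # Hypothetical residual map of layer k (stands in for a trained sub-network).
    W = np.eye(len(x)) * 0.1
    return np.tanh(W @ x)

def resnet_forward(x0, num_layers):
    # ResNet update: x_{k+1} = x_k + F(x_k, theta_k).
    x = x0
    for k in range(num_layers):
        x = x + F(x, k)
    return x

def euler_integrate(x0, num_steps, dt=1.0):
    # Forward Euler step x_{k+1} = x_k + dt * F(x_k, k); with dt = 1 this matches the ResNet update.
    x = x0
    for k in range(num_steps):
        x = x + dt * F(x, k)
    return x

x0 = np.array([1.0, -0.5, 0.3])
print(np.allclose(resnet_forward(x0, 10), euler_integrate(x0, 10, dt=1.0)))   # True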
3. Transformers as Multi-Particle Dynamical Systems
3.1. The transformer architecture
In the same way, we can identify a transformer network with some initial value problem and find the corresponding dynamical system representing its layers. Recall the transformer consists of an attention layer (3) and a feed-forward layer (4):
The attention layer outputs, for each particle, a weighted combination of the particles in the system, with weights depending on the query, key, and value matrices $Q$, $K$, and $V$; the feed-forward layer is then applied to each particle independently.
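The following minimal sketch (our own illustration, with arbitrary small dimensions; residual connections and other components of a full transformer layer are omitted) spells out this description: each output row is a weighted combination of value-transformed particles, with weights obtained from a softmax over query-key inner products.

import numpy as np

def self_attention(X, Q, K, V, beta=1.0):
    """Single-head self-attention over n particles (rows of X).

    Each output row i is a convex combination of the rows of X @ V.T,
    weighted by softmax_j( beta * <Q x_i, K x_j> ).
    """
    scores = beta * (X @ Q.T) @ (X @ K.T).T            # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # row-wise softmax
    return weights @ (X @ V.T)                         # weighted combination of particles

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
Q = K = V = np.eye(d)                                  # identity parameters, as in Section 3.3
print(self_attention(X, Q, K, V).shape)                # (5, 4)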
3.2. Dynamical system formulation
The introduction of dynamical system notation into the transformer problem was first done in [7], modeling the multi-headed self-attention layer as the diffusion term and the feed-forward network as the convection term. This MPDS can be approximated using the Lie-Trotter splitting scheme, by iteratively solving the diffusion and the convection ODEs. Nevertheless, this formulation is still too complex to analyze analytically.
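A hedged sketch of the splitting idea (our own illustration, not the exact construction of [7]; the two toy vector fields below stand in for the attention and feed-forward terms): with step size tau, each iteration advances first the diffusion field and then the convection field.

import numpy as np

def lie_trotter(X0, diffusion, convection, tau, num_steps):
    """Lie-Trotter splitting: alternately advance the diffusion and convection ODEs.

    diffusion(X) and convection(X) return the two vector fields evaluated at X;
    each sub-step is integrated here with a single forward-Euler step of size tau.
    """
    X = X0.copy()
    for _ in range(num_steps):
        X = X + tau * diffusion(X)     # attention / interaction sub-step
        X = X + tau * convection(X)    # feed-forward / particle-wise sub-step
    return X

# Toy vector fields standing in for attention and feed-forward (our assumption).
diffusion = lambda X: X.mean(axis=0, keepdims=True) - X    # pull each particle towards the mean
convection = lambda X: np.tanh(X)                          # particle-wise non-linearity

X0 = np.random.default_rng(1).normal(size=(6, 3))
print(lie_trotter(X0, diffusion, convection, tau=0.1, num_steps=50).shape)     # (6, 3)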
3.3. Simplification of the problem
To perform a deeper mathematical analysis, [5] relaxes the typical experimental formulation of the transformer and focuses solely on a simplified version of the attention mechanism (3). The simplified problem is the following:
- Each particle lies on the unit sphere $\mathbb{S}^{d-1} = \{x \in \mathbb{R}^d : \|x\| = 1\}$. Therefore, after the attention mechanism, the particle is normalized back onto the sphere. To study the evolution of a particle's position over non-linear manifolds such as the sphere, where we do not have a notion of "sum" or "difference," we rely on the concept of the tangent bundle. The change of position is then measured using infinitesimal displacements on the point's tangent hyperplane. This is attained in practice by projecting the attention output onto the tangent hyperplane at the particle's position, using the orthogonal projection $\mathbf{P}_x = I_d - xx^\top$.
- The attention parameters, query $Q$, key $K$, and value $V$, are considered constant across time (i.e., across layers) and equal to the identity unless stated otherwise. Therefore, the problem dynamics follow the self-attention system (5), written out after this list.
- To avoid the asymmetry introduced by the softmax denominator in (5), they also study a variant of the attention mechanism normalized by a factor of $1/n$, which can equivalently be viewed as replacing the time-varying normalization in (5) by the constant $n$. This other formulation is the system (6) below.
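In the notation of [5], with $\mathbf{P}_x = I_d - xx^\top$ the orthogonal projection onto the tangent space of $\mathbb{S}^{d-1}$ at $x$ and $\beta \geq 0$ an inverse-temperature parameter, the two systems read (up to normalization conventions):
\[ \dot{x}_i(t) = \mathbf{P}_{x_i(t)}\left( \sum_{j=1}^{n} \frac{e^{\beta\langle Q x_i(t),\, K x_j(t)\rangle}}{\sum_{k=1}^{n} e^{\beta\langle Q x_i(t),\, K x_k(t)\rangle}}\, V x_j(t) \right) \qquad (5) \]
\[ \dot{x}_i(t) = \mathbf{P}_{x_i(t)}\left( \frac{1}{n} \sum_{j=1}^{n} e^{\beta\langle Q x_i(t),\, K x_j(t)\rangle}\, V x_j(t) \right) \qquad (6) \]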
With these relaxations, the main focus of the paper is to study the evolution of the systems (5) and (6). This study gives us insight into the following questions: what is the attention mechanism doing, mathematically? Is attention guaranteed to converge? If so, is the limiting behaviour deterministic?
3.4. Formal result
Formally, the main result of the section holds for the simplified setting above, under suitable conditions on the dimension $d$ and the inverse temperature $\beta$ stated precisely in [5]: for almost every initial configuration, the particles converge as $t \to \infty$ to a single cluster located at some point $x^* \in \mathbb{S}^{d-1}$.
This theorem follows as a direct corollary of a result they call cone collapse. In this lemma, they show that any solution to the Cauchy problems (5) and (6) whose particles are initially contained in a suitable cone (for instance, an open hemisphere) converges, at an exponential rate, towards a single particle position $x^*$.
When the inverse temperature $\beta$ varies, the qualitative behaviour of the system changes, and intermediate clusters may persist for very long times before merging. The most surprising result in this section is that the metastability and the phase transition between clustering and non-clustering regimes can also be described in terms of this single parameter $\beta$ of the system dynamics.
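The clustering behaviour is easy to observe numerically. The following sketch (our own illustrative experiment, not from [5]) integrates the simplified dynamics (6) with $Q = K = V = I_d$ by forward Euler, re-normalizing onto the sphere after each step; the particles are initialized inside an open half-space so that the cone-collapse regime applies, and the reported maximum pairwise distance shrinks as they merge into a single cluster.

import numpy as np

def usa_step(X, beta, dt):
    # One forward-Euler step of the simplified dynamics (6) with Q = K = V = I.
    A = np.exp(beta * X @ X.T)                               # interaction weights e^{beta <x_i, x_j>}
    drift = (A @ X) / X.shape[0]                             # (1/n) sum_j e^{beta <x_i,x_j>} x_j
    drift -= np.sum(drift * X, axis=1, keepdims=True) * X    # project onto the tangent space at x_i
    X = X + dt * drift
    return X / np.linalg.norm(X, axis=1, keepdims=True)      # re-normalize onto the sphere

rng = np.random.default_rng(0)
n, d, beta = 32, 3, 1.0
X = rng.normal(size=(n, d))
X[:, 0] = np.abs(X[:, 0])                  # start inside an open half-space (cone-collapse regime)
X /= np.linalg.norm(X, axis=1, keepdims=True)

for step in range(2001):
    X = usa_step(X, beta, dt=0.05)
    if step % 500 == 0:
        diam = np.max(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))
        print(f"step {step:4d}  max pairwise distance = {diam:.4f}")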
4. Angular Dynamics Equation and the Kuramoto Model
We now review some further equations and models discussed in [5], Sections 7.1 and 7.2, concerning the angular dynamics equation and the Kuramoto model.
4.1. Derivation of the Angular Dynamics Equation
In this section, based on Section 7.1 of [5], we focus on reviewing the dynamics of particles constrained to the unit circle $\mathbb{S}^1 \subset \mathbb{R}^2$. In this case each particle can be parameterized by an angle, $x_i(t) = (\cos\theta_i(t), \sin\theta_i(t))$, where $\theta_i(t) \in [0, 2\pi)$ denotes the angular coordinate of particle $i$ at time $t$.
To derive the dynamics of the angles $\theta_i(t)$, we proceed in three steps.
Firstly, we restrict attention to the dynamic equation (6) (USA). To express this equation in terms of the angles, we substitute $x_i = (\cos\theta_i, \sin\theta_i)$ into (6).
Secondly, noting that $\langle x_i, x_j \rangle = \cos(\theta_j - \theta_i)$ and that the projection of $x_j$ onto the tangent line at $x_i$ has signed magnitude $\sin(\theta_j - \theta_i)$, the interaction between two particles depends only on their angular difference.
Thirdly, by using some elementary trigonometric identities, this finally leads us to the following expression of the equation:
Definition 1. The expression of the angular dynamics under equation (6) is
\[ \dot{\theta}_i(t) = \frac{1}{n}\sum_{j=1}^{n} e^{\beta\cos(\theta_j(t)-\theta_i(t))}\,\sin\!\big(\theta_j(t)-\theta_i(t)\big), \qquad i = 1, \dots, n, \]
which governs the evolution of the angular coordinates $\theta_i(t)$ on the circle.
Remark 1. When $\beta = 0$, all the exponential weights equal one and the dynamics of Definition 1 reduce to $\dot{\theta}_i = \frac{1}{n}\sum_{j} \sin(\theta_j - \theta_i)$, which is a special case of the Kuramoto model discussed in Section 4.2.
Remark 2. Considering the dynamics in Definition 1 for $\beta > 0$, one can check that they form a gradient ascent flow for an interaction energy (made explicit after this remark), which reaches its maximum when all the angles $\theta_i$ coincide, i.e., when the particles form a single cluster.
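Concretely, following the interaction-energy viewpoint of [5] (the normalization constants below are our convention), the dynamics of Definition 1 are the gradient ascent flow $\dot{\theta}_i = \partial E_\beta / \partial \theta_i$ of the energy
\[ E_\beta(\theta_1, \dots, \theta_n) = \frac{1}{2\beta n} \sum_{i=1}^{n} \sum_{j=1}^{n} e^{\beta \cos(\theta_j - \theta_i)}, \qquad \beta > 0, \]
since $\partial E_\beta / \partial \theta_i = \frac{1}{n} \sum_{j=1}^{n} e^{\beta \cos(\theta_j - \theta_i)} \sin(\theta_j - \theta_i)$, which is exactly the right-hand side of Definition 1. The energy $E_\beta$ is maximized precisely when all the angles coincide.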
4.2. The Kuramoto Model and Its Generalizations
In this section, we review some Kuramoto models and their generalizations, following Section 7.2 of [5]. As mentioned in the last section, when $\beta = 0$ the angular dynamics coincide with a special case of the classical Kuramoto model of coupled oscillators.
Definition 2. The Kuramoto model for oscillator $i$ is
\[ \dot{\theta}_i(t) = \omega_i + \frac{K}{n}\sum_{j=1}^{n} \sin\!\big(\theta_j(t) - \theta_i(t)\big), \qquad i = 1, \dots, n, \]
where $\theta_i(t)$ is the phase of oscillator $i$, $\omega_i$ is its intrinsic (natural) frequency, and $K \geq 0$ is the coupling strength.
Remark 3. In the model of the previous definition, for small coupling $K$ the oscillators rotate almost independently at their intrinsic frequencies, while for sufficiently large $K$ a macroscopic fraction of the oscillators locks onto a common phase and the system synchronizes.
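A short illustration of this transition (our own sketch, not taken from [5]): simulate the model of Definition 2 for a weak and a strong coupling $K$ and compare the order parameter $r = \big|\frac{1}{n}\sum_{j} e^{\mathrm{i}\theta_j}\big|$, which is small for incoherent phases and close to 1 for synchronized ones.

import numpy as np

def kuramoto(theta0, omega, K, dt=0.01, num_steps=5000):
    # Forward-Euler integration of d(theta_i)/dt = omega_i + (K/n) sum_j sin(theta_j - theta_i).
    theta = theta0.copy()
    n = len(theta)
    for _ in range(num_steps):
        coupling = np.sin(theta[None, :] - theta[:, None]).sum(axis=1)
        theta = theta + dt * (omega + (K / n) * coupling)
    return theta

def order_parameter(theta):
    # r = |(1/n) sum_j exp(i * theta_j)|: near 0 = incoherent, near 1 = synchronized.
    return np.abs(np.exp(1j * theta).mean())

rng = np.random.default_rng(0)
n = 100
theta0 = rng.uniform(0, 2 * np.pi, size=n)
omega = rng.normal(0.0, 0.5, size=n)          # intrinsic frequencies

for K in (0.1, 2.0):
    r = order_parameter(kuramoto(theta0, omega, K))
    print(f"K = {K:.1f}  ->  order parameter r = {r:.2f}")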
Observation 1.
If all intrinsic frequencies are equal, $\omega_i \equiv \omega$, then, after passing to the rotating frame $\theta_i \mapsto \theta_i - \omega t$, the Kuramoto dynamics can be written as the gradient flow $\dot{\theta}_i = -\partial E / \partial \theta_i$,
where the energy $E$ is given by $E(\theta_1, \dots, \theta_n) = -\frac{K}{2n}\sum_{i,j} \cos(\theta_j - \theta_i)$.
Remark 4. This energy $E$ decreases along trajectories and is minimized exactly when all the phases coincide, so identical oscillators are driven towards synchronization.
Definition 3.
The Kuramoto model can also be generalized to include more general non-linear interaction functions. In particular, an extension of the form
\[ \dot{\theta}_i(t) = \omega_i + \frac{K}{n}\sum_{j=1}^{n} h\big(\theta_j(t) - \theta_i(t)\big) \]
can be written, where $h : \mathbb{R} \to \mathbb{R}$ is a $2\pi$-periodic interaction (coupling) function; the classical model of Definition 2 corresponds to $h = \sin$.
Example 1. One example of such a generalization is the choice $h(\theta) = e^{\beta\cos\theta}\sin\theta$, together with $\omega_i \equiv 0$ and $K = 1$, where $\beta \geq 0$ is the inverse-temperature parameter of the attention dynamics. With this choice, the generalized Kuramoto model recovers exactly the angular dynamics of Definition 1.
5. A practical transformer: the Sumformer
5.1. Definition of Sumformer
A Sumformer is a sequence-to-sequence function $S$ that maps a sequence of tokens $(x_1, \dots, x_N)$ to the sequence whose $i$-th output is
\[ S(x_1, \dots, x_N)_i = \psi\!\Big(x_i, \sum_{j=1}^{N} \phi(x_j)\Big), \]
where $\phi$ and $\psi$ are continuous functions: $\phi$ encodes each token individually, the encodings are aggregated by a plain sum, and $\psi$ combines each token with the aggregated summary. By construction, a Sumformer is permutation-equivariant: permuting the input tokens permutes the outputs accordingly.
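A minimal sketch of this structure (our own illustration; the particular $\phi$ and $\psi$ below are hypothetical choices): every token is encoded by $\phi$, the encodings are summed into one aggregate, and $\psi$ combines each token with that aggregate.

import numpy as np

N, d = 4, 3

def sumformer(X, phi, psi):
    """Sumformer: output_i = psi(x_i, sum_j phi(x_j)) for every token x_i (a row of X)."""
    aggregate = np.sum([phi(x) for x in X], axis=0)    # permutation-invariant summary
    return np.stack([psi(x, aggregate) for x in X])    # token-wise combination with the summary

# Hypothetical choices of phi and psi: the resulting Sumformer subtracts the sequence mean.
phi = lambda x: x
psi = lambda x, s: x - s / N

X = np.random.default_rng(0).normal(size=(N, d))
out = sumformer(X, phi, psi)
print(np.allclose(out, X - X.mean(axis=0)))            # True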
5.2. Approximation theorem of Sumformer
Let $f$ be a continuous sequence-to-sequence function defined on a compact set of input sequences. Then, for every $\varepsilon > 0$, there exist a Sumformer $S$ and a Transformer $T$ such that $\sup_X \|f(X) - S(X)\| \leq \varepsilon/2$ and $\sup_X \|f(X) - T(X)\| \leq \varepsilon$; in other words, a Transformer can approximate $f$ to arbitrary accuracy by first approximating it with a Sumformer.
5.3. Proof
We aim to prove that a Transformer can approximate the Sumformer $S$, and through it the target function $f$, to within any prescribed accuracy $\varepsilon > 0$.
Step 1: Sumformer Approximation of $f$
1. Goal: Construct a Sumformer $S$ such that $\sup_X \|f(X) - S(X)\| \leq \varepsilon/2$.
2. Constructing $\phi$: choose a continuous encoding $\phi$,
where the role of $\phi$ is to summarize each token so that the sum $\sum_{j=1}^{N} \phi(x_j)$ retains enough information about the whole input sequence.
3. Defining $\psi$: define a continuous map $\psi$,
where $\psi$ takes a single token $x_i$ together with the aggregated sum and reproduces the $i$-th output of $f$ up to a small error.
4. Approximation Error of $S$: with these choices, the Sumformer satisfies $\sup_X \|f(X) - S(X)\| \leq \varepsilon/2$.
Step 2: Transformer Approximation of Sumformer
1. Input Encoding: For each input token $x_i$, use a feed-forward layer to compute (an approximation of) the encoding $\phi(x_i)$.
2. Using Attention to Approximate the Sum: choose the query and key maps
so that each attention score is constant, allowing us to approximate the sum $\sum_{j=1}^{N}\phi(x_j)$: uniform attention weights output the average of the encodings, which only needs to be rescaled by $N$ (see the sketch after this list).
3. Output Generation Using Feed-Forward Layers: Use two feed-forward layers to approximate $\psi$, applied position-wise, so that the resulting Transformer $T$ satisfies $\sup_X \|S(X) - T(X)\| \leq \varepsilon/2$.
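The key point in item 2 can be checked directly: if the queries (or keys) are chosen so that all pre-softmax scores are equal, the softmax weights are uniform, every position's attention output is the average of the value vectors, and the sum is recovered by multiplying by $N$. A minimal sketch (our own, not the exact construction used in the Sumformer paper):

import numpy as np

def constant_score_attention(V_inputs):
    """Attention with all scores equal: softmax weights are 1/N, output = mean of values."""
    N = V_inputs.shape[0]
    scores = np.zeros((N, N))                          # e.g. obtained by setting the query map to zero
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # uniform weights 1/N
    return weights @ V_inputs                          # every row equals the mean value vector

rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 4))                          # stand-in for the encodings phi(x_j)
out = constant_score_attention(Phi)
print(np.allclose(out[0] * Phi.shape[0], Phi.sum(axis=0)))    # recover the sum as mean * N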
Step 3: Combined Approximation. By combining these two steps with the triangle inequality, we have, for every admissible input $X$,
\[ \|f(X) - T(X)\| \leq \|f(X) - S(X)\| + \|S(X) - T(X)\| \leq \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon. \]
This completes the proof.
5.4. Conclusion
In summary, the multi-particle dynamical systems perspective provides a powerful lens to understand transformers. By modeling self-attention as interacting particles, one can rigorously study convergence, clustering, and stability, while also revealing parallels with classical systems such as Kuramoto oscillators. This framework not only deepens theoretical insight but also opens pathways for principled analysis and potential improvements in transformer architectures.
References
[1]. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-1423
[2]. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., . . . Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv, abs/2010.11929. Retrieved from https://api.semanticscholar.org/CorpusID:225039882
[3]. E, W. (2017). A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5, 1–11. Retrieved from https://api.semanticscholar.org/CorpusID:64849498
[4]. Geshkovski, B., Koubbi, H., Polyanskiy, Y., & Rigollet, P. (2024). Dynamic metastability in the self-attention model. Retrieved from https://arxiv.org/abs/2410.06833
[5]. Geshkovski, B., Letrouit, C., Polyanskiy, Y., & Rigollet, P. (2024). A mathematical perspective on transformers. Retrieved from https://arxiv.org/abs/2312.10794
[6]. Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., & Shah, M. (2022, September). Transformers in vision: A survey. ACM Computing Surveys, 54(10s). Retrieved from https://doi.org/10.1145/3505244
[7]. Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., . . . Liu, T.-Y. (2019). Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv, abs/1906.02762. Retrieved from https://api.semanticscholar.org/CorpusID:174801126
[8]. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., . . . Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67. Retrieved from http://jmlr.org/papers/v21/20-074.html
Data availability
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.