1. Introduction
Transformers have gained immense popularity in deep learning over the past few years. They have achieved state-of-the-art results and practical applications in tasks such as machine translation, text generation, and image processing. The key to the success of this architecture is the self-attention mechanism, which encodes input data in parallel, improving efficiency and capturing complex correlations across different types of large datasets [1-4].
Building on this understanding of the transformer mechanism and the reasons for its success in practical applications, we review [5], a paper that delves into the mathematical intricacies of the transformer's attention mechanism from a multi-particle dynamical system perspective. This viewpoint enables a systematic study of attention's convergence towards clusters and of its relation to the Kuramoto oscillator model in the simplified two-dimensional case of particles on the unit circle. We end the paper by introducing the Sumformer, a practical sequence-to-sequence architecture, and reviewing how a Transformer can approximate it.
2. Background
2.1. Multi-Particle Dynamical System
The subject of multi-particle dynamical systems concerns the evolution in time of systems of many interacting particles. A dynamical system on a state space $X$ is described by a flow map $\varphi : \mathbb{R} \times X \to X$; the partial map fixing the time variable, $\varphi_t := \varphi(t, \cdot) : X \to X$, sends an initial state to the state reached after time $t$.
Smooth dynamical systems can be modeled using ODEs. If we have a system with state $x(t) \in \mathbb{R}^d$, its evolution is governed by $\dot{x}(t) = f(x(t), t)$ together with an initial condition $x(0) = x_0$, where $f$ is the vector field driving the dynamics.
In the case of multi-particle dynamical systems, the vector field acting on each particle depends on the state of all the particles, so the system takes the coupled form $\dot{x}_i(t) = f_i(x_1(t), \dots, x_n(t), t)$ for $i = 1, \dots, n$.
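As an illustration (our own example, not taken from [5]), a simple first-order interacting-particle system is
\[ \dot{x}_i(t) = \frac{1}{n}\sum_{j=1}^{n} K\big(x_j(t) - x_i(t)\big), \qquad x_i(0) = x_i^0, \qquad i = 1, \dots, n, \]
where $K : \mathbb{R}^d \to \mathbb{R}^d$ is an interaction kernel. For the choice $K(z) = z$, every particle is attracted to the instantaneous mean of the system, a toy version of the clustering behaviour discussed in Section 3.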
2.2. Neural Networks as dynamical systems
Deep neural networks can be thought of as discretizations of continuous dynamical systems [3]. This interpretation has been widely used in the literature in recent years, since it allows tools from numerical analysis to be used to understand and better design neural networks.
ResNet, a simple example. The ResNet architecture is a simple example that illustrates well the suitability of the dynamical system interpretation. We start from a simple initial value problem for a first-order ODE, $\dot{x}(t) = f(x(t), t)$ with $x(0) = x_0$, where $f$ is the vector field of the dynamics. Discretizing it with the forward Euler scheme and step size $\Delta t$ gives $x_{k+1} = x_k + \Delta t\, f(x_k, t_k)$, where $x_k \approx x(t_k)$. This is exactly the update rule $x_{k+1} = x_k + F(x_k, \theta_k)$ of a residual block, where $F(\cdot, \theta_k)$ is the residual mapping of layer $k$ with parameters $\theta_k$ (absorbing the step size): the layers of a ResNet are the time steps of a discretized dynamical system.
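A minimal sketch of this correspondence (our own illustration; the residual map F below is a hypothetical stand-in for a trained sub-network): applying a stack of residual blocks to a state is the same computation as integrating the associated ODE with the forward Euler method and unit step size.

import numpy as np

def F(x, k):
    # Hypothetical residual map of layer k (stands in for a trained sub-network).
    W = np.eye(len(x)) * 0.1
    return np.tanh(W @ x)

def resnet_forward(x0, num_layers):
    # ResNet update: x_{k+1} = x_k + F(x_k, theta_k).
    x = x0
    for k in range(num_layers):
        x = x + F(x, k)
    return x

def euler_integrate(x0, num_steps, dt=1.0):
    # Forward Euler step x_{k+1} = x_k + dt * F(x_k, k); with dt = 1 this matches the ResNet update.
    x = x0
    for k in range(num_steps):
        x = x + dt * F(x, k)
    return x

x0 = np.array([1.0, -0.5, 0.3])
print(np.allclose(resnet_forward(x0, 10), euler_integrate(x0, 10, dt=1.0)))   # True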
3. Transformers as Multi-Particle Dynamical Systems
3.1. The transformer architecture
In the same way, we can identify a transformer network with some initial value problem and find the corresponding dynamical system representing its layers. Recall the transformer consists of an attention layer (3) and a feed-forward layer (4):
The attention layer outputs, for each particle, a weighted combination of the particles in the system, with weights depending on the query, key, and value matrices $Q$, $K$, and $V$; the feed-forward layer is then applied to each particle independently.
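The following minimal sketch (our own illustration, with arbitrary small dimensions; residual connections and other components of a full transformer layer are omitted) spells out this description: each output row is a weighted combination of value-transformed particles, with weights obtained from a softmax over query-key inner products.

import numpy as np

def self_attention(X, Q, K, V, beta=1.0):
    """Single-head self-attention over n particles (rows of X).

    Each output row i is a convex combination of the rows of X @ V.T,
    weighted by softmax_j( beta * <Q x_i, K x_j> ).
    """
    scores = beta * (X @ Q.T) @ (X @ K.T).T            # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # row-wise softmax
    return weights @ (X @ V.T)                         # weighted combination of particles

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
Q = K = V = np.eye(d)                                  # identity parameters, as in Section 3.3
print(self_attention(X, Q, K, V).shape)                # (5, 4)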
3.2. Dynamical system formulation
The introduction of dynamical system notation into the transformer problem was first done in [7], modeling the multi-headed self-attention layer as the diffusion term and the feed-forward network as the convection term. This MPDS can be approximated using the Lie-Trotter splitting scheme, by iteratively solving the diffusion and the convection ODEs. Nevertheless, this formulation is still too complex to analyze analytically.
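A hedged sketch of the splitting idea (our own illustration, not the exact construction of [7]; the two toy vector fields below stand in for the attention and feed-forward terms): with step size tau, each iteration advances first the diffusion field and then the convection field.

import numpy as np

def lie_trotter(X0, diffusion, convection, tau, num_steps):
    """Lie-Trotter splitting: alternately advance the diffusion and convection ODEs.

    diffusion(X) and convection(X) return the two vector fields evaluated at X;
    each sub-step is integrated here with a single forward-Euler step of size tau.
    """
    X = X0.copy()
    for _ in range(num_steps):
        X = X + tau * diffusion(X)     # attention / interaction sub-step
        X = X + tau * convection(X)    # feed-forward / particle-wise sub-step
    return X

# Toy vector fields standing in for attention and feed-forward (our assumption).
diffusion = lambda X: X.mean(axis=0, keepdims=True) - X    # pull each particle towards the mean
convection = lambda X: np.tanh(X)                          # particle-wise non-linearity

X0 = np.random.default_rng(1).normal(size=(6, 3))
print(lie_trotter(X0, diffusion, convection, tau=0.1, num_steps=50).shape)     # (6, 3)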
3.3. Simplification of the problem
To perform a deeper mathematical analysis, [5] relaxes the typical experimental formulation of the transformer and focuses solely on a simplified version of the attention mechanism (3). The simplified problem is the following:
- Each particle lies on the unit sphere $\mathbb{S}^{d-1} = \{x \in \mathbb{R}^d : \|x\| = 1\}$. Therefore, after the attention mechanism, the particle is normalized back onto the sphere. To study the evolution of a particle's position over non-linear manifolds such as the sphere, where we do not have a notion of "sum" or "difference," we rely on the concept of the tangent bundle. The change of position is then measured using infinitesimal displacements on the point's tangent hyperplane. This is attained in practice by projecting the attention output onto the tangent hyperplane at the particle's position, using the orthogonal projection $\mathbf{P}_x = I_d - xx^\top$.
- The attention parameters, query $Q$, key $K$, and value $V$, are considered constant across time (i.e., across layers) and equal to the identity unless stated otherwise. Therefore, the problem dynamics follow the self-attention system (5), written out after this list.
- To avoid the asymmetry introduced by the softmax denominator in (5), they also study a variant of the attention mechanism normalized by a factor of $1/n$, which can equivalently be viewed as replacing the time-varying normalization in (5) by the constant $n$. This other formulation is the system (6) below.
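In the notation of [5], with $\mathbf{P}_x = I_d - xx^\top$ the orthogonal projection onto the tangent space of $\mathbb{S}^{d-1}$ at $x$ and $\beta \geq 0$ an inverse-temperature parameter, the two systems read (up to normalization conventions):
\[ \dot{x}_i(t) = \mathbf{P}_{x_i(t)}\left( \sum_{j=1}^{n} \frac{e^{\beta\langle Q x_i(t),\, K x_j(t)\rangle}}{\sum_{k=1}^{n} e^{\beta\langle Q x_i(t),\, K x_k(t)\rangle}}\, V x_j(t) \right) \qquad (5) \]
\[ \dot{x}_i(t) = \mathbf{P}_{x_i(t)}\left( \frac{1}{n} \sum_{j=1}^{n} e^{\beta\langle Q x_i(t),\, K x_j(t)\rangle}\, V x_j(t) \right) \qquad (6) \]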
With these relaxations, the main focus of the paper is to study the evolution of the systems (5) and (6). This study gives us insight into the following questions: what is the attention mechanism doing, mathematically? Is attention guaranteed to converge? If so, is the limiting behaviour deterministic?
3.4. Formal result
Formally, the main result of the section holds for the simplified setting above, under suitable conditions on the dimension $d$ and the inverse temperature $\beta$ stated precisely in [5]: for almost every initial configuration, the particles converge as $t \to \infty$ to a single cluster located at some point $x^* \in \mathbb{S}^{d-1}$.
This theorem follows as a direct corollary of a result they call cone collapse. In this lemma, they show that any solution to the Cauchy problems (5) and (6) whose particles are initially contained in a suitable cone (for instance, an open hemisphere) converges, at an exponential rate, towards a single particle position $x^*$.
When the inverse temperature $\beta$ varies, the qualitative behaviour of the system changes, and intermediate clusters may persist for very long times before merging. The most surprising result in this section is that the metastability and the phase transition between clustering and non-clustering regimes can also be described in terms of this single parameter $\beta$ of the system dynamics.
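The clustering behaviour is easy to observe numerically. The following sketch (our own illustrative experiment, not from [5]) integrates the simplified dynamics (6) with $Q = K = V = I_d$ by forward Euler, re-normalizing onto the sphere after each step; the particles are initialized inside an open half-space so that the cone-collapse regime applies, and the reported maximum pairwise distance shrinks as they merge into a single cluster.

import numpy as np

def usa_step(X, beta, dt):
    # One forward-Euler step of the simplified dynamics (6) with Q = K = V = I.
    A = np.exp(beta * X @ X.T)                               # interaction weights e^{beta <x_i, x_j>}
    drift = (A @ X) / X.shape[0]                             # (1/n) sum_j e^{beta <x_i,x_j>} x_j
    drift -= np.sum(drift * X, axis=1, keepdims=True) * X    # project onto the tangent space at x_i
    X = X + dt * drift
    return X / np.linalg.norm(X, axis=1, keepdims=True)      # re-normalize onto the sphere

rng = np.random.default_rng(0)
n, d, beta = 32, 3, 1.0
X = rng.normal(size=(n, d))
X[:, 0] = np.abs(X[:, 0])                  # start inside an open half-space (cone-collapse regime)
X /= np.linalg.norm(X, axis=1, keepdims=True)

for step in range(2001):
    X = usa_step(X, beta, dt=0.05)
    if step % 500 == 0:
        diam = np.max(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))
        print(f"step {step:4d}  max pairwise distance = {diam:.4f}")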
4. Angular Dynamics Equation and the Kuramoto Model
We now review some further equations and models discussed in [5], Sections 7.1 and 7.2, concerning the angular dynamics equation and the Kuramoto model.
4.1. Derivation of the Angular Dynamics Equation
In this section, based on Section 7.1 of [5], we focus on reviewing the dynamics of particles constrained to the unit circle $\mathbb{S}^1 \subset \mathbb{R}^2$. In this case each particle can be parameterized by an angle, $x_i(t) = (\cos\theta_i(t), \sin\theta_i(t))$, where $\theta_i(t) \in [0, 2\pi)$ denotes the angular coordinate of particle $i$ at time $t$.
To derive the dynamics of the angles $\theta_i(t)$, we proceed in three steps.
Firstly, we restrict attention to the dynamic equation (6) (USA). To express this equation in terms of the angles, we substitute $x_i = (\cos\theta_i, \sin\theta_i)$ into (6).
Secondly, noting that $\langle x_i, x_j \rangle = \cos(\theta_j - \theta_i)$ and that the projection of $x_j$ onto the tangent line at $x_i$ has signed magnitude $\sin(\theta_j - \theta_i)$, the interaction between two particles depends only on their angular difference.
Thirdly, by using some elementary trigonometric identities, this finally leads us to the following expression of the equation:
Definition 1. The expression of the angular dynamics under equation (6) is
\[ \dot{\theta}_i(t) = \frac{1}{n}\sum_{j=1}^{n} e^{\beta\cos(\theta_j(t)-\theta_i(t))}\,\sin\!\big(\theta_j(t)-\theta_i(t)\big), \qquad i = 1, \dots, n, \]
which governs the evolution of the angular coordinates $\theta_i(t)$ on the circle.
Remark 1. When $\beta = 0$, all the exponential weights equal one and the dynamics of Definition 1 reduce to $\dot{\theta}_i = \frac{1}{n}\sum_{j} \sin(\theta_j - \theta_i)$, which is a special case of the Kuramoto model discussed in Section 4.2.
Remark 2. Considering the dynamics in Definition 1 for $\beta > 0$, one can check that they form a gradient ascent flow for an interaction energy (made explicit after this remark), which reaches its maximum when all the angles $\theta_i$ coincide, i.e., when the particles form a single cluster.
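Concretely, following the interaction-energy viewpoint of [5] (the normalization constants below are our convention), the dynamics of Definition 1 are the gradient ascent flow $\dot{\theta}_i = \partial E_\beta / \partial \theta_i$ of the energy
\[ E_\beta(\theta_1, \dots, \theta_n) = \frac{1}{2\beta n} \sum_{i=1}^{n} \sum_{j=1}^{n} e^{\beta \cos(\theta_j - \theta_i)}, \qquad \beta > 0, \]
since $\partial E_\beta / \partial \theta_i = \frac{1}{n} \sum_{j=1}^{n} e^{\beta \cos(\theta_j - \theta_i)} \sin(\theta_j - \theta_i)$, which is exactly the right-hand side of Definition 1. The energy $E_\beta$ is maximized precisely when all the angles coincide.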
4.2. The Kuramoto Model and Its Generalizations
In this section, we review some Kuramoto models and their generalizations, following Section 7.2 of [5]. As mentioned in the last section, when $\beta = 0$ the angular dynamics coincide with a special case of the classical Kuramoto model of coupled oscillators.
Definition 2. The Kuramoto model for oscillator $i$ is
\[ \dot{\theta}_i(t) = \omega_i + \frac{K}{n}\sum_{j=1}^{n} \sin\!\big(\theta_j(t) - \theta_i(t)\big), \qquad i = 1, \dots, n, \]
where $\theta_i(t)$ is the phase of oscillator $i$, $\omega_i$ is its intrinsic (natural) frequency, and $K \geq 0$ is the coupling strength.
Remark 3. In the model of the previous definition, for small coupling $K$ the oscillators rotate almost independently at their intrinsic frequencies, while for sufficiently large $K$ a macroscopic fraction of the oscillators locks onto a common phase and the system synchronizes.
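A short illustration of this transition (our own sketch, not taken from [5]): simulate the model of Definition 2 for a weak and a strong coupling $K$ and compare the order parameter $r = \big|\frac{1}{n}\sum_{j} e^{\mathrm{i}\theta_j}\big|$, which is small for incoherent phases and close to 1 for synchronized ones.

import numpy as np

def kuramoto(theta0, omega, K, dt=0.01, num_steps=5000):
    # Forward-Euler integration of d(theta_i)/dt = omega_i + (K/n) sum_j sin(theta_j - theta_i).
    theta = theta0.copy()
    n = len(theta)
    for _ in range(num_steps):
        coupling = np.sin(theta[None, :] - theta[:, None]).sum(axis=1)
        theta = theta + dt * (omega + (K / n) * coupling)
    return theta

def order_parameter(theta):
    # r = |(1/n) sum_j exp(i * theta_j)|: near 0 = incoherent, near 1 = synchronized.
    return np.abs(np.exp(1j * theta).mean())

rng = np.random.default_rng(0)
n = 100
theta0 = rng.uniform(0, 2 * np.pi, size=n)
omega = rng.normal(0.0, 0.5, size=n)          # intrinsic frequencies

for K in (0.1, 2.0):
    r = order_parameter(kuramoto(theta0, omega, K))
    print(f"K = {K:.1f}  ->  order parameter r = {r:.2f}")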
Observation 1.
If all intrinsic frequencies are equal, $\omega_i \equiv \omega$, then, after passing to the rotating frame $\theta_i \mapsto \theta_i - \omega t$, the Kuramoto dynamics can be written as the gradient flow $\dot{\theta}_i = -\partial E / \partial \theta_i$,
where the energy $E$ is given by $E(\theta_1, \dots, \theta_n) = -\frac{K}{2n}\sum_{i,j} \cos(\theta_j - \theta_i)$.
Remark 4. This energy $E$ decreases along trajectories and is minimized exactly when all the phases coincide, so identical oscillators are driven towards synchronization.
Definition 3.
The Kuramoto model can also be generalized to include more general non-linear interaction functions. In particular, an extension of the form
\[ \dot{\theta}_i(t) = \omega_i + \frac{K}{n}\sum_{j=1}^{n} h\big(\theta_j(t) - \theta_i(t)\big) \]
can be written, where $h : \mathbb{R} \to \mathbb{R}$ is a $2\pi$-periodic interaction (coupling) function; the classical model of Definition 2 corresponds to $h = \sin$.
Example 1. One example of such a generalization is the choice $h(\theta) = e^{\beta\cos\theta}\sin\theta$, together with $\omega_i \equiv 0$ and $K = 1$, where $\beta \geq 0$ is the inverse-temperature parameter of the attention dynamics. With this choice, the generalized Kuramoto model recovers exactly the angular dynamics of Definition 1.
5. A practical transformer: the Sumformer
5.1. Definition of Sumformer
A Sumformer is a sequence-to-sequence function $S$ that maps a sequence of tokens $(x_1, \dots, x_N)$ to the sequence whose $i$-th output is
\[ S(x_1, \dots, x_N)_i = \psi\!\Big(x_i, \sum_{j=1}^{N} \phi(x_j)\Big), \]
where $\phi$ and $\psi$ are continuous functions: $\phi$ encodes each token individually, the encodings are aggregated by a plain sum, and $\psi$ combines each token with the aggregated summary. By construction, a Sumformer is permutation-equivariant: permuting the input tokens permutes the outputs accordingly.
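A minimal sketch of this structure (our own illustration; the particular $\phi$ and $\psi$ below are hypothetical choices): every token is encoded by $\phi$, the encodings are summed into one aggregate, and $\psi$ combines each token with that aggregate.

import numpy as np

N, d = 4, 3

def sumformer(X, phi, psi):
    """Sumformer: output_i = psi(x_i, sum_j phi(x_j)) for every token x_i (a row of X)."""
    aggregate = np.sum([phi(x) for x in X], axis=0)    # permutation-invariant summary
    return np.stack([psi(x, aggregate) for x in X])    # token-wise combination with the summary

# Hypothetical choices of phi and psi: the resulting Sumformer subtracts the sequence mean.
phi = lambda x: x
psi = lambda x, s: x - s / N

X = np.random.default_rng(0).normal(size=(N, d))
out = sumformer(X, phi, psi)
print(np.allclose(out, X - X.mean(axis=0)))            # True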
5.2. Approximation theorem of Sumformer
Let $f$ be a continuous sequence-to-sequence function defined on a compact set of input sequences. Then, for every $\varepsilon > 0$, there exist a Sumformer $S$ and a Transformer $T$ such that $\sup_X \|f(X) - S(X)\| \leq \varepsilon/2$ and $\sup_X \|f(X) - T(X)\| \leq \varepsilon$; in other words, a Transformer can approximate $f$ to arbitrary accuracy by first approximating it with a Sumformer.
5.3. Proof
We aim to prove that a Transformer can approximate the Sumformer $S$, and through it the target function $f$, to within any prescribed accuracy $\varepsilon > 0$.
Step 1: Sumformer Approximation of $f$
1. Goal: Construct a Sumformer $S$ such that $\sup_X \|f(X) - S(X)\| \leq \varepsilon/2$.
2. Constructing $\phi$: choose a continuous encoding $\phi$,
where the role of $\phi$ is to summarize each token so that the sum $\sum_{j=1}^{N} \phi(x_j)$ retains enough information about the whole input sequence.
3. Defining $\psi$: define a continuous map $\psi$,
where $\psi$ takes a single token $x_i$ together with the aggregated sum and reproduces the $i$-th output of $f$ up to a small error.
4. Approximation Error of $S$: with these choices, the Sumformer satisfies $\sup_X \|f(X) - S(X)\| \leq \varepsilon/2$.
Step 2: Transformer Approximation of Sumformer
1. Input Encoding: For each input token $x_i$, use a feed-forward layer to compute (an approximation of) the encoding $\phi(x_i)$.
2. Using Attention to Approximate the Sum: choose the query and key maps
so that each attention score is constant, allowing us to approximate the sum $\sum_{j=1}^{N}\phi(x_j)$: uniform attention weights output the average of the encodings, which only needs to be rescaled by $N$ (see the sketch after this list).
3. Output Generation Using Feed-Forward Layers: Use two feed-forward layers to approximate $\psi$, applied position-wise, so that the resulting Transformer $T$ satisfies $\sup_X \|S(X) - T(X)\| \leq \varepsilon/2$.
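The key point in item 2 can be checked directly: if the queries (or keys) are chosen so that all pre-softmax scores are equal, the softmax weights are uniform, every position's attention output is the average of the value vectors, and the sum is recovered by multiplying by $N$. A minimal sketch (our own, not the exact construction used in the Sumformer paper):

import numpy as np

def constant_score_attention(V_inputs):
    """Attention with all scores equal: softmax weights are 1/N, output = mean of values."""
    N = V_inputs.shape[0]
    scores = np.zeros((N, N))                          # e.g. obtained by setting the query map to zero
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # uniform weights 1/N
    return weights @ V_inputs                          # every row equals the mean value vector

rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 4))                          # stand-in for the encodings phi(x_j)
out = constant_score_attention(Phi)
print(np.allclose(out[0] * Phi.shape[0], Phi.sum(axis=0)))    # recover the sum as mean * N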
Step 3: Combined Approximation. By combining these two steps with the triangle inequality, we have, for every admissible input $X$,
\[ \|f(X) - T(X)\| \leq \|f(X) - S(X)\| + \|S(X) - T(X)\| \leq \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon. \]
This completes the proof.
5.4. Conclusion
In summary, the multi-particle dynamical systems perspective provides a powerful lens to understand transformers. By modeling self-attention as interacting particles, one can rigorously study convergence, clustering, and stability, while also revealing parallels with classical systems such as Kuramoto oscillators. This framework not only deepens theoretical insight but also opens pathways for principled analysis and potential improvements in transformer architectures.
References
[1]. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-1423
[2]. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., . . . Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv, abs/2010.11929. Retrieved from https://api.semanticscholar.org/CorpusID:225039882
[3]. E, W. (2017). A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5, 1–11. Retrieved from https://api.semanticscholar.org/CorpusID:64849498
[4]. Geshkovski, B., Koubbi, H., Polyanskiy, Y., & Rigollet, P. (2024). Dynamic metastability in the self-attention model. Retrieved from https://arxiv.org/abs/2410.06833
[5]. Geshkovski, B., Letrouit, C., Polyanskiy, Y., & Rigollet, P. (2024). A mathematical perspective on transformers. Retrieved from https://arxiv.org/abs/2312.10794
[6]. Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., & Shah, M. (2022, September). Transformers in vision: A survey. ACM Computing Surveys, 54(10s). Retrieved from https://doi.org/10.1145/3505244
[7]. Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., . . . Liu, T.-Y. (2019). Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv, abs/1906.02762. Retrieved from https://api.semanticscholar.org/CorpusID:174801126
[8]. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., . . . Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67. Retrieved from http://jmlr.org/papers/v21/20-074.html
Data availability
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.