Multi-Particle Dynamical Systems Modeling Transformers

Research Article
Open access


Yuxuan Zhang 1*
  • 1 Beanstalk International Bilingual School    
  • *corresponding author yuxuanZhang25hda@bibs.com.cn
Published on 2 October 2025 | https://doi.org/10.54254/2753-8818/2025.DL27324
TNS Vol.132
ISSN (Print): 2753-8818
ISSN (Online): 2753-8826
ISBN (Print): 978-1-80590-305-5
ISBN (Online): 978-1-80590-306-2

Abstract

Deep neural networks can be understood as discretizations of continuous dynamical systems. This literature review analyzes how the multi-particle dynamical system formulation models the self-attention mechanism in transformers. We discuss how this formulation enables a systematic study of the system's convergence towards clusters and its relation to the Kuramoto oscillator.

Keywords:

Transformer, Dynamical System, Kuramoto Model, Sumformer


1. Introduction

Transformers have gained immense popularity in deep learning in the past few years. They have achieved state-of-the-art results and practical applications in deep learning tasks such as machine translation, text generation, and image processing. The key to this architecture's success is the self-attention mechanism, which encodes input data in parallel, improving efficiency and capturing complex correlations across different types of large datasets [1-4].

Building on our understanding of the transformer mechanism and the reasons for its success in practical applications, we review [5], a paper that delves into the mathematical intricacies of the transformer's attention mechanism from a multi-particle dynamical system perspective. This perspective enables a systematic study of attention's convergence towards clusters and its relation to the Kuramoto oscillator in the simplified case of dimension $d = 2$. We end the paper by reviewing the Sumformer, an intermediate construction used to show that Transformers can approximate continuous permutation-equivariant sequence-to-sequence functions.

2. Background

2.1. Multi-Particle Dynamical System

The subject of multi-particle dynamical systems concerns the evolution in time of systems of $n$ particles, where a particle is an element of a set $X$, e.g. $X = \mathbb{R}^d$. More precisely, a homogeneous, continuous-time dynamical system $\phi$ can be defined as a continuous map $\phi: \mathbb{R} \times X^n \to X^n$ that satisfies the following two relations for any $x = (x_1, \ldots, x_n) \in X^n$ and $s, t \in \mathbb{R}$:

  1. $\phi(s + t, x) = \phi(t, \phi(s, x))$
  2. $\phi(0, x) = x$

The partial map fixing the time variable, $\Phi_t := \phi(t, \cdot)$, is called the system's flow, and the partial map fixing the particles, $\xi_x := \phi(\cdot, x)$, is called the system's trajectory.

Smooth dynamical systems can be modeled using ODEs. If we have a system $\phi$ with flows $\{\Phi_t\}$, then its trajectories $x(t) := \xi_{x_0}(t)$ satisfy the initial value problem $\dot{x}(t) = f(x(t))$ with $x(0) = x_0$, in which the vector field is $f(x) = \frac{d}{dt}\big|_{t=0} \Phi_t(x)$.

In the case of multi-particle dynamical systems, the vector field $f_i$ corresponding to each particle $x_i$, $i = 1, \ldots, n$, is the sum of two terms: convection and diffusion. The convection term concerns the particle's movement regardless of the other particles, e.g., movement caused by an external force like gravity. The diffusion term concerns the particle's movement that results from interacting with the other particles.
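To make this decomposition concrete, the following is a minimal numpy sketch (not taken from [5]; the names `particle_vector_field`, `drift`, and `attract` are purely illustrative) of a per-particle vector field assembled from a convection term and pairwise diffusion terms.

```python
import numpy as np

def particle_vector_field(x, convection, diffusion):
    """Illustrative decomposition f_i(x) = convection(x_i) + sum_{j != i} diffusion(x_i, x_j)."""
    f = np.zeros_like(x)
    for i in range(len(x)):
        f[i] = convection(x[i])                    # movement independent of the other particles
        for j in range(len(x)):
            if j != i:
                f[i] += diffusion(x[i], x[j])      # pairwise interaction term
    return f

# Toy example: a gravity-like drift plus a weak pairwise attraction.
drift = lambda xi: np.array([0.0, -9.8])
attract = lambda xi, xj: 0.1 * (xj - xi)
x = np.random.default_rng(0).normal(size=(5, 2))        # 5 particles in R^2
print(particle_vector_field(x, drift, attract).shape)   # (5, 2)
```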

2.2. Neural Networks as dynamical systems

Deep neural networks can be thought of as discretizations of continuous dynamical systems [3]. This interpretation has been widely used in the literature in recent years, since it allows tools from numerical analysis to be used to understand and better design neural networks.

ResNet, a simple example. The ResNet architecture illustrates well the suitability of the dynamical system interpretation. We start from a simple initial value problem for a first-order ODE,

$$\begin{cases} \dfrac{dx}{dt} = f(t, x), & t > t_0 \\ x(t_0) = w \end{cases} \qquad (1)$$

where $x: [t_0, \infty) \to X^n$ and $w \in X^n$ is the value of the system at $t_0$. As simple as it is, it is not always possible to solve (1) analytically. Nevertheless, numerical methods can find an approximate solution at a given time $T$. For instance, the Euler method can find an approximate solution to this problem in $L$ steps by discretizing the time variable with step size $\gamma = (T - t_0)/L$ and using the first-order approximation of the derivative $\frac{x(t_{l+1}) - x(t_l)}{t_{l+1} - t_l} \approx f(t_l, x(t_l))$. By doing so, we can estimate $x(T)$ from $x_0 = x(t_0)$ by sequentially estimating $x_{l+1} = x(t_{l+1})$ with the iterative rule

$$x_{l+1} = x_l + \gamma f(t_l, x_l) \qquad (2)$$

where $l \in \{0, \ldots, L-1\}$ and $t_l = t_0 + \gamma l$. This update rule is equivalent to the formulation of a ResNet layer. Therefore, the function $\gamma f(t_l, x_l)$ can be considered a neural network block, where the time variable $t_l$ indicates the $l$-th layer and $x_l$ corresponds to the skip connection present in this architecture.
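As an illustration, here is a minimal numpy sketch (a toy with hypothetical residual branches, not an actual ResNet implementation) that applies update rule (2) directly: each layer adds $\gamma f(t_l, x_l)$ to the running state, which is exactly the skip-connection structure of a residual block.

```python
import numpy as np

def euler_resnet_forward(x0, blocks, gamma):
    """Apply update rule (2): x_{l+1} = x_l + gamma * f_l(x_l).

    Each element of `blocks` plays the role of f(t_l, .), i.e. one residual
    branch, and the identity term x_l is the skip connection.
    """
    x = x0
    for f_l in blocks:                 # l = 0, ..., L-1
        x = x + gamma * f_l(x)         # forward Euler step == residual block
    return x

# Toy residual branches: fixed random linear maps followed by tanh.
rng = np.random.default_rng(0)
L, d = 10, 4
weights = [rng.normal(scale=0.5, size=(d, d)) for _ in range(L)]
blocks = [(lambda x, W=W: np.tanh(x @ W)) for W in weights]

x_T = euler_resnet_forward(rng.normal(size=d), blocks, gamma=1.0 / L)
print(x_T)
```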

3. Transformers as Multi-Particle Dynamical Systems

3.1. The transformer architecture

In the same way, we can identify a transformer network with an initial value problem and find the corresponding dynamical system representing its layers. Recall that the transformer consists of an attention layer (3) and a feed-forward layer (4):

$$\mathrm{Att}_l(x_{l,i}) = \sum_{j=1}^{n} \mathrm{Softmax}_j\!\left(\beta \langle Q_l x_{l,i}, K_l x_{l,j} \rangle\right) V_l x_{l,j} \qquad (3)$$

$$\mathrm{FFN}_l(x_{l,i}) = W_l^2\, \sigma\!\left(W_l^1 x_{l,i} + b_l^1\right) + b_l^2 \qquad (4)$$

The attention layer outputs a linear combination of the system's particles, depending on the query $Q$, key $K$, and value $V$ matrices, their scalar products, and the temperature value $\beta$. The feed-forward layer outputs a non-linear transformation of the particle $x_i$ according to the matrices $W^2, W^1$, the vectors $b^2, b^1$, and the non-linear function $\sigma$.
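For concreteness, the following single-head numpy sketch (shapes, names, and parameter choices are illustrative assumptions, not the paper's code) implements (3) and (4); it also instantiates the identity-parameter setting that Section 3.3 will adopt.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Q, K, V, beta):
    """Single-head self-attention as in (3): row i of the output is
    sum_j softmax_j(beta <Q x_i, K x_j>) V x_j."""
    scores = beta * (X @ Q.T) @ (X @ K.T).T        # (n, n) matrix of beta <Q x_i, K x_j>
    return softmax(scores) @ (X @ V.T)

def ffn(X, W1, b1, W2, b2, sigma=np.tanh):
    """Position-wise feed-forward layer as in (4), applied to each particle."""
    return sigma(X @ W1.T + b1) @ W2.T + b2

rng = np.random.default_rng(0)
n, d, h = 6, 8, 16
X = rng.normal(size=(n, d))
Q = K = V = np.eye(d)                              # identity parameters, as in Section 3.3
W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(size=(d, h)), np.zeros(d)
print(ffn(attention(X, Q, K, V, beta=1.0), W1, b1, W2, b2).shape)   # (6, 8)
```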

3.2. Dynamical system formulation

The dynamical system notation was first introduced into the transformer setting in [7], modeling the multi-headed self-attention layer as the diffusion term and the feed-forward network as the convection term. The resulting multi-particle dynamical system (MPDS) can be approximated using the Lie-Trotter splitting scheme, by iteratively solving the diffusion and convection ODEs. Nevertheless, this formulation is still too complex to analyze analytically.
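The splitting idea can be sketched in a few lines (a hedged toy: the two vector fields below are stand-ins for trained attention and feed-forward layers, not the architecture of [7]): each "layer" takes one Euler step along the diffusion (attention) field and then one along the convection (feed-forward) field.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)              # numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def lie_trotter_step(X, attention_field, ffn_field, gamma):
    """One Lie-Trotter split step: an Euler update for the diffusion ODE
    (self-attention), followed by one for the convection ODE (feed-forward)."""
    X = X + gamma * attention_field(X)                # diffusion: particles interact
    X = X + gamma * ffn_field(X)                      # convection: each particle moves on its own
    return X

# Toy stand-ins for the two vector fields.
rng = np.random.default_rng(0)
n, d = 6, 8
W = rng.normal(scale=0.1, size=(d, d))
attention_field = lambda X: softmax_rows(X @ X.T) @ X
ffn_field = lambda X: np.tanh(X @ W)

X = rng.normal(size=(n, d))
for _ in range(4):                                    # four "layers"
    X = lie_trotter_step(X, attention_field, ffn_field, gamma=0.5)
print(X.shape)                                        # (6, 8)
```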

3.3. Simplification of the problem

To perform a deeper mathematical analysis, [5] relaxes the typical experimental formulation of the transformer and focuses solely on a simplified version of the attention mechanism (3). The simplified problem is the following:

  • Each particle lies in the unit sphere $X = S^{d-1}$. Therefore, after the attention mechanism, the particle is normalized back onto the sphere. To study the evolution of a particle's position over a non-linear manifold such as the sphere, where we do not have a notion of "sum" or "difference," we rely on the concept of the tangent bundle. The change of position is then measured using infinitesimal displacements on the tangent hyperplane at the point. This is attained by projecting the attention output onto the tangent hyperplane at the particle's position, using $P_{x_i}: \mathbb{R}^d \to \mathrm{T}_{x_i} S^{d-1}$, $y \mapsto y - \langle y, x_i \rangle x_i$.
  • The attention parameters, query $Q$, key $K$, and value $V$, are considered constant across time (i.e., across layers) and equal to the identity unless stated otherwise. Therefore, the problem dynamics follow:

    $$\dot{x}_i(t) = P_{x_i(t)}\!\left(\frac{\sum_{j=1}^{n} e^{\beta \langle x_i(t), x_j(t) \rangle}\, x_j(t)}{\sum_{j=1}^{n} e^{\beta \langle x_i(t), x_j(t) \rangle}}\right) \qquad (5)$$

  • To avoid the asymmetry introduced by the denominator in (5), they also study a variant of the attention mechanism normalized by a factor of $n$, which is also equivalent to studying the regime $\beta \ll 1$, where the denominator is approximately $n$. This other formulation is:

    $$\dot{x}_i(t) = P_{x_i(t)}\!\left(\frac{1}{n} \sum_{j=1}^{n} e^{\beta \langle x_i(t), x_j(t) \rangle}\, x_j(t)\right) \qquad (6)$$

With these relaxations, the main focus of the paper is to study the evolution of the systems (5) and (6). This study gives us insight into the following questions: what, mathematically, is the attention mechanism doing? Is attention guaranteed to converge? If so, is this convergence deterministic?
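As a quick numerical illustration (a forward-Euler sketch under the stated simplifications, not code from [5]), the following simulates (5) or (6) on the sphere and checks that a random configuration with $d \geq n$ collapses towards a single cluster.

```python
import numpy as np

def step_attention_dynamics(X, beta, dt, normalized=True):
    """One forward-Euler step of (5) (softmax-normalized) or (6) (divide by n).

    X holds n unit vectors as rows; after each step the rows are renormalized
    onto S^{d-1}, matching the sphere constraint of Section 3.3.
    """
    G = np.exp(beta * X @ X.T)                              # e^{beta <x_i, x_j>}
    W = G / G.sum(axis=1, keepdims=True) if normalized else G / X.shape[0]
    Y = W @ X                                               # weighted combination of particles
    V = Y - np.sum(Y * X, axis=1, keepdims=True) * X        # tangential projection P_{x_i}
    X = X + dt * V
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# With d >= n and a random start, all particles should collapse to one point.
rng = np.random.default_rng(0)
n, d = 8, 16
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
for _ in range(5000):
    X = step_attention_dynamics(X, beta=1.0, dt=0.05)
print(np.min(X @ X[0]))        # pairwise cosines close to 1 indicate a single cluster
```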

3.4. Formal result

Formally, the main result of the section holds for $d \geq n$, when the initial configuration is sampled uniformly over $(S^{d-1})^n$. The paper states that under these conditions, the unique solution to the Cauchy problem for (5) and (6) converges almost surely, and at an exponential rate, towards a single particle $x^* \in S^{d-1}$. That is, for any particle $i \in \{1, \ldots, n\}$,

$$\|x_i(t) - x^*\| \leq C e^{-\lambda t} \qquad (7)$$

for some $C, \lambda > 0$. Not only this, but the same results hold for more general formulations of the problem, where the key $K$ and query $Q$ matrices are arbitrary $d \times d$ matrices.

This theorem follows as a direct corollary of a result they call the cone collapse. In that lemma, they show that any solution to the Cauchy problems (5) and (6) converges at an exponential rate towards a single particle $x^* \in S^{d-1}$ if the initial configuration lies in an open hemisphere. This indeed happens with probability one when $d \geq n$.

When $n$ is fixed and $d \to \infty$, i.e., in high-dimensional spaces, we can model the entire evolution of the dynamics with high probability. This has an intuitive explanation: when $d \gg n$, any two particles will likely be almost orthogonal. By concentration of measure, the evolution of this system is comparable to the evolution of an orthonormal system, in which a single parameter describes the dynamics. In this simplified model, where all distinct initial particles are orthogonal, the unique solution to (5) and (6) preserves an equal angle between all pairs of distinct particles, whose value depends only on the time $t$ and the temperature $\beta$ of the attention mechanism. Equivalently:

$$\angle\big(x_i(t), x_j(t)\big) = \theta_\beta(t) \qquad \forall\, i \neq j \qquad (8)$$

The most surprising result in this section is that the metastability and the phase transition between clustering and non-clustering regimes can also be modeled by this single parameter of the system dynamics, $\gamma_\beta(t) = \cos(\theta_\beta(t))$, when $d \gg n$. Further work on this topic is done in [4].
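The equal-angle picture in (8) can be checked numerically (an illustration under arbitrary parameter choices, not a result reproduced from [5]): with $d \gg n$, the pairwise cosines $\langle x_i, x_j \rangle$ stay nearly identical along the flow (6), so a single scalar $\gamma_\beta(t)$ effectively tracks the dynamics.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, beta, dt = 4, 512, 4.0, 0.01
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
for t in range(2001):
    G = np.exp(beta * X @ X.T) / n                                # weights of dynamics (6)
    Y = G @ X
    V = Y - np.sum(Y * X, axis=1, keepdims=True) * X              # tangential projection
    X = X + dt * V
    X /= np.linalg.norm(X, axis=1, keepdims=True)                 # back onto the sphere
    if t % 500 == 0:
        off = (X @ X.T)[~np.eye(n, dtype=bool)]                   # off-diagonal cosines
        print(f"t={t:4d}  mean cos={off.mean():.4f}  spread={off.max() - off.min():.2e}")
```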

4. Angular Dynamics Equation and the Kuramoto Model

We review some further equations and models discussed in Sections 7.1 and 7.2 of [5], on the angular dynamics equation and the Kuramoto model.

4.1. Derivation of the Angular Dynamics Equation

In this section, based on Section 7.1 of [5], we focus on reviewing the dynamics of particles constrained to the unit circle $S^1 \subset \mathbb{R}^2$, i.e., the case $d = 2$, specifically under the dynamics of equation (6) (USA). This model can be parametrized by angles and is related to the celebrated Kuramoto model. Each particle $x_i(t) \in S^1$ can be represented by an angle $\theta_i(t) \in \mathbb{T} = [0, 2\pi)$ as follows [5]:

$$x_i(t) = \cos(\theta_i(t))\, e_1 + \sin(\theta_i(t))\, e_2,$$

where $e_1 = (1, 0)$ and $e_2 = (0, 1)$ are the standard basis vectors in $\mathbb{R}^2$.

To derive the dynamics of $\theta_i(t)$ under equation (6) (USA), we proceed with the following steps, following Section 7.1 of [5]:

Firstly, we restrict our discussion to the dynamics of equation (6) (USA). To express this equation in terms of $\theta_i(t)$, we start from the relation $\cos(\theta_i(t)) = \langle x_i(t), e_1 \rangle$. Taking the time derivative of both sides and substituting (6) yields the following equation:

$$\dot{\theta}_i(t) = -\frac{1}{n \sin(\theta_i(t))} \sum_{j=1}^{n} e^{\beta \langle x_i(t), x_j(t) \rangle} \Big( \langle x_j(t), e_1 \rangle - \langle x_i(t), x_j(t) \rangle \langle x_i(t), e_1 \rangle \Big).$$

Secondly, noting that $\langle x_i(t), x_j(t) \rangle = \cos(\theta_i(t) - \theta_j(t))$, we can substitute this relation and rewrite the expression as the following equation:

$$\dot{\theta}_i(t) = -\frac{1}{n \sin(\theta_i(t))} \sum_{j=1}^{n} e^{\beta \cos(\theta_i(t) - \theta_j(t))} \Big[ \cos(\theta_j(t)) - \cos(\theta_i(t) - \theta_j(t)) \cos(\theta_i(t)) \Big].$$

Thirdly, using the elementary trigonometric identity $\cos(\theta_j) - \cos(\theta_i - \theta_j)\cos(\theta_i) = \sin(\theta_i)\sin(\theta_i - \theta_j)$, the factor $\sin(\theta_i(t))$ cancels, which finally leads us to the following expression:

Definition 1. The angular dynamics under equation (6) are given by:

$$\dot{\theta}_i(t) = -\frac{1}{n} \sum_{j=1}^{n} e^{\beta \cos(\theta_i(t) - \theta_j(t))} \sin(\theta_i(t) - \theta_j(t)),$$

which governs the evolution of  θi(t)  in the presence of interactions weighted by  β  and the relative angle between particles.

Remark 1. When $\beta = 0$, the dynamics reduce to a particular case of the well-known Kuramoto model, which describes synchronization phenomena in coupled oscillators:

$$\dot{\theta}_i(t) = -\frac{1}{n} \sum_{j=1}^{n} \sin(\theta_i(t) - \theta_j(t)).$$

Remark 2. Considering the dynamics in Definition 1 for $\beta > 0$, we observe that they can also be written as a gradient flow, $\dot{\theta}(t) = n \nabla E_\beta(\theta(t))$, with interaction energy $E_\beta: \mathbb{T}^n \to \mathbb{R}_{\geq 0}$:

$$E_\beta(\theta) = \frac{1}{2 \beta n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} e^{\beta \cos(\theta_i - \theta_j)},$$

which reaches its maximum when all the $\theta_i$ align at a single fixed value in $\mathbb{T} = [0, 2\pi)$.
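A small forward-Euler sketch (an illustrative toy under the same simplifications, not code from [5]) of the angular dynamics in Definition 1: along the trajectory, the energy $E_\beta$ of Remark 2 should be non-decreasing and the angles should synchronize.

```python
import numpy as np

def angular_rhs(theta, beta):
    """Right-hand side of the angular dynamics in Definition 1."""
    diff = theta[:, None] - theta[None, :]                        # theta_i - theta_j
    return -np.mean(np.exp(beta * np.cos(diff)) * np.sin(diff), axis=1)

def energy(theta, beta):
    """Interaction energy E_beta from Remark 2."""
    diff = theta[:, None] - theta[None, :]
    return np.exp(beta * np.cos(diff)).sum() / (2 * beta * len(theta) ** 2)

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=10)
beta, dt = 2.0, 0.01
for t in range(3001):
    theta = (theta + dt * angular_rhs(theta, beta)) % (2 * np.pi)
    if t % 1000 == 0:
        r = abs(np.exp(1j * theta).mean())          # order parameter, ~1 when synchronized
        print(f"t={t:4d}  E_beta={energy(theta, beta):.5f}  r={r:.3f}")
```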

4.2. The Kuramoto Model and Its Generalizations

In this section, we review the Kuramoto model and some of its generalizations, following Section 7.2 of [5]. As mentioned in the previous section, when $\beta = 0$, the angular dynamics simplify to a particular case of the Kuramoto model. In general, the Kuramoto model can be described by the following definition:

Definition 2. The Kuramoto model for oscillator $i$ is given by:

$$\dot{\theta}_i(t) = \omega_i + \frac{K}{n} \sum_{j=1}^{n} \sin(\theta_j(t) - \theta_i(t)),$$

where $K > 0$ is a coupling constant and $\omega_i \in \mathbb{R}$ is the intrinsic frequency of oscillator $i$.

Remark 3. In the model of the previous definition, for small $K$, the oscillators do not synchronize in the long run. As $K$ exceeds a critical threshold, some oscillators begin to synchronize. For very large $K$, all oscillators eventually synchronize in the long term.
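This threshold behavior can be seen in a small simulation (a hedged sketch: the Gaussian frequency distribution and the parameter values are arbitrary choices, not taken from [5]): the order parameter $r = \big|\frac{1}{n}\sum_j e^{i\theta_j}\big|$ stays small for weak coupling and approaches 1 as $K$ grows.

```python
import numpy as np

def kuramoto_order(K, omega, steps=4000, dt=0.01, seed=0):
    """Simulate Definition 2 and return the time-averaged order parameter r."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0, 2 * np.pi, size=len(omega))
    r_tail = []
    for t in range(steps):
        diff = theta[None, :] - theta[:, None]                  # theta_j - theta_i
        theta = theta + dt * (omega + K * np.mean(np.sin(diff), axis=1))
        if t > steps // 2:                                      # average r over the second half
            r_tail.append(abs(np.exp(1j * theta).mean()))
    return np.mean(r_tail)

omega = np.random.default_rng(1).normal(size=100)               # heterogeneous intrinsic frequencies
for K in [0.5, 1.0, 2.0, 4.0]:
    print(f"K = {K:4.1f}   r = {kuramoto_order(K, omega):.3f}")
```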

Observation 1.

If all intrinsic frequencies $\omega_i$ in the previous definition are equal to a common value $\omega$, we can shift variables by setting $\theta_i(t) \mapsto \theta_i(t) - \omega t$. This transforms the dynamics in Definition 2 into the following gradient flow form:

$$\dot{\theta}(t) = n \nabla F(\theta(t)),$$

where the energy $F: \mathbb{T}^n \to \mathbb{R}_{\geq 0}$ is defined by:

$$F(\theta) = \frac{K}{2 n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \cos(\theta_i - \theta_j).$$

Remark 4. This energy $F$ is maximized exactly when all oscillators synchronize (i.e., $\theta_i = \theta$ for some fixed $\theta \in \mathbb{T}$ and for all $i \in \{1, 2, \ldots, n\}$), with equilibrium states occurring at the critical points of $F$.

Definition 3. The Kuramoto model can also be generalized to include more general non-linear interaction functions. In particular, an extension can be written as follows:

$$\dot{\theta}_i(t) = \omega_i + \frac{K}{n} \sum_{j=1}^{n} h(\theta_j(t) - \theta_i(t)),$$

where $h: \mathbb{T} \to \mathbb{R}$ is a general non-linear function. This form captures both the classic Kuramoto model of Definition 2 (when $h(\theta) = \sin(\theta)$) and the model in Definition 1 as special cases.

Example 1. One example of such a generalization is $h(\theta) = e^{\beta \cos(\theta)}$, leading to the interaction function:

$$h_\beta(\theta) = e^{\beta \cos(\theta)} = \sum_{k \in \mathbb{Z}} I_k(\beta)\, e^{i k \theta},$$

where $I_k(\beta)$ denotes the modified Bessel function of the first kind.
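As a quick sanity check of this Fourier expansion (an illustrative numerical verification, not part of [5]; scipy's `iv` implements $I_k$), one can truncate the series and compare it to $e^{\beta \cos\theta}$ directly.

```python
import numpy as np
from scipy.special import iv          # modified Bessel function of the first kind I_k

beta = 1.5
theta = np.linspace(0, 2 * np.pi, 7)
# Truncate the expansion e^{beta cos(theta)} = sum_k I_k(beta) e^{i k theta} at |k| <= 20.
truncated = sum(iv(k, beta) * np.exp(1j * k * theta) for k in range(-20, 21))
print(np.max(np.abs(np.exp(beta * np.cos(theta)) - truncated.real)))   # ~ machine precision
```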

5. A practical transformer - Sumformer

5.1. Definition of Sumformer

A Sumformer is a sequence-to-sequence function $S: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$, defined for an input sequence $X = [x_1, \ldots, x_n]$ as:

$$\Sigma = \sum_{i=1}^{n} \phi(x_i),$$

$$S(X) = [\psi(x_1, \Sigma), \ldots, \psi(x_n, \Sigma)],$$

where $\phi: \mathbb{R}^d \to \mathbb{R}^{d_1}$ and $\psi: \mathbb{R}^d \times \mathbb{R}^{d_1} \to \mathbb{R}^d$ are learnable functions.
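A direct numpy rendering of this definition (with toy choices for $\phi$ and $\psi$; in practice they would be learned networks), including a check that the construction is permutation-equivariant by design:

```python
import numpy as np

def sumformer(X, phi, psi):
    """Sumformer: Sigma = sum_i phi(x_i), then each output row is psi(x_i, Sigma)."""
    Sigma = np.sum([phi(x) for x in X], axis=0)          # aggregate, permutation-invariant
    return np.stack([psi(x, Sigma) for x in X])          # (n, d) output

# Toy example: phi and psi are small fixed maps standing in for learnable functions.
rng = np.random.default_rng(0)
d, d1, n = 3, 5, 4
A, B = rng.normal(size=(d1, d)), rng.normal(size=(d, d + d1))
phi = lambda x: np.tanh(A @ x)
psi = lambda x, s: B @ np.concatenate([x, s])

X = rng.normal(size=(n, d))
out = sumformer(X, phi, psi)
perm = rng.permutation(n)
# Permuting the input rows permutes the output rows in the same way.
print(np.allclose(sumformer(X[perm], phi, psi), out[perm]))   # True
```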

5.2. Approximation theorem of Sumformer

Let $f: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ be a continuous, permutation-equivariant sequence-to-sequence function on compact sets. Then, for any $\epsilon > 0$, there exists a Transformer $T$ such that:

$$\sup_{X \in \mathbb{R}^{n \times d}} \| f(X) - T(X) \| < \epsilon.$$

5.3. Proof

We aim to prove that a Transformer can approximate the Sumformer $S$ and hence approximate any permutation-equivariant function $f$. The proof proceeds in three steps:

Step 1: Sumformer Approximation of $f$

1. Goal: Construct a Sumformer $S$ that approximates $f$.

2. Constructing $\Sigma$ using $\phi$: For each input sequence $X = [x_1, \ldots, x_n]$, define

$$\Sigma = \sum_{i=1}^{n} \phi(x_i),$$

where $\phi(x_i)$ encodes information from each input $x_i$ in a way that captures the permutation-equivariant structure of $f$, for example using multisymmetric polynomials to approximate $f$'s behavior.

3. Defining $S(X)$ using $\psi$: Using $\Sigma$, define

$$S(X) = [\psi(x_1, \Sigma), \ldots, \psi(x_n, \Sigma)],$$

where $\psi(x_i, \Sigma)$ combines each $x_i$ with the aggregate information $\Sigma$ to approximate the $i$-th output row of $f(X)$.

4. Approximation Error of $S$ for $f$: Given a continuous $f$, we can choose $\phi$ and $\psi$ such that

$$\sup_{X \in \mathbb{R}^{n \times d}} \| f(X) - S(X) \| < \frac{\epsilon}{2}.$$

Step 2: Transformer Approximation of the Sumformer

1. Input Encoding: For each input $x_i$, construct the sequence

$$X = \big[\, [x_1, \phi(x_1)],\ \ldots,\ [x_n, \phi(x_n)] \,\big] \in \mathbb{R}^{n \times (d + d_1)}.$$

2. Using Attention to Approximate $\Sigma$: Set the Transformer attention query and key matrices as

$$W_Q = W_K = \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}^{\top} \in \mathbb{R}^{(d + d_1) \times 1},$$

so that each attention score is constant, allowing us to approximate the sum $\Sigma$ by aggregating the $\phi(x_i)$ terms across the sequence (a numerical sketch of this averaging idea follows this list).

3. Output Generation Using Feed-Forward Layers: Use two feed-forward layers to approximate $\psi(x_i, \Sigma)$ for each $x_i$, ensuring that

$$\sup_{X \in \mathbb{R}^{n \times d}} \| S(X) - T(X) \| < \frac{\epsilon}{2}.$$
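The following tiny numpy sketch illustrates the idea behind item 2 above (why constant attention scores recover the sum); it is an illustration of the averaging argument, not the exact construction used in the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1 = 5, 4
Phi = rng.normal(size=(n, d1))                 # stands in for [phi(x_1), ..., phi(x_n)]
scores = np.ones((n, n))                       # constant attention scores
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # uniform weights 1/n
approx_sigma = n * (weights @ Phi)             # every row equals sum_i phi(x_i)
print(np.allclose(approx_sigma, Phi.sum(axis=0)))   # True
```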

Step 3: Combined Approximation. By combining these two steps via the triangle inequality, we have:

$$\sup_{X \in \mathbb{R}^{n \times d}} \| f(X) - T(X) \| \leq \sup_{X \in \mathbb{R}^{n \times d}} \| f(X) - S(X) \| + \sup_{X \in \mathbb{R}^{n \times d}} \| S(X) - T(X) \| < \epsilon.$$

This completes the proof.

5.4. Conclusion

In summary, the multi-particle dynamical systems perspective provides a powerful lens to understand transformers. By modeling self-attention as interacting particles, one can rigorously study convergence, clustering, and stability, while also revealing parallels with classical systems such as Kuramoto oscillators. This framework not only deepens theoretical insight but also opens pathways for principled analysis and potential improvements in transformer architectures.


References

[1]. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-1423

[2]. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv, abs/2010.11929. Retrieved from https://api.semanticscholar.org/CorpusID:225039882

[3]. E, W. (2017). A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5, 1–11. Retrieved from https://api.semanticscholar.org/CorpusID:64849498

[4]. Geshkovski, B., Koubbi, H., Polyanskiy, Y., & Rigollet, P. (2024). Dynamic metastability in the self-attention model. Retrieved from https://arxiv.org/abs/2410.06833

[5]. Geshkovski, B., Letrouit, C., Polyanskiy, Y., & Rigollet, P. (2024). A mathematical perspective on transformers. Retrieved from https://arxiv.org/abs/2312.10794

[6]. Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., & Shah, M. (2022, September). Transformers in vision: A survey. ACM Computing Surveys, 54(10s). Retrieved from https://doi.org/10.1145/3505244. DOI: 10.1145/3505244

[7]. Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., ... Liu, T.-Y. (2019). Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv, abs/1906.02762. Retrieved from https://api.semanticscholar.org/CorpusID:174801126

[8]. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67. Retrieved from http://jmlr.org/papers/v21/20-074.html


Cite this article

Zhang,Y. (2025). Multi-Particle Dynamical Systems Modeling Transformers. Theoretical and Natural Science,132,114-121.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

Disclaimer/Publisher's Note

The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

About volume

Volume title: Proceedings of CONF-APMM 2025 Symposium: Simulation and Theory of Differential-Integral Equation in Applied Physics

ISBN:978-1-80590-305-5(Print) / 978-1-80590-306-2(Online)
Editor:Marwan Omar, Shuxia Zhao
Conference date: 27 September 2025
Series: Theoretical and Natural Science
Volume number: Vol.132
ISSN:2753-8818(Print) / 2753-8826(Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open access policy for details).
