## Overview

This post introduces a long-term memory mechanism for the Transformer.

## Main content

Assume \(X \in \mathbb{R}^{L \times d}\), i.e., each row \(x_i\) is the feature vector of one token.

Standard attention proceeds as follows:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V, \qquad Q = XW_Q,\quad K = XW_K,\quad V = XW_V.
\]

For simplicity of notation we ignore the multi-head case; the ideas below carry over to it directly.

We know that any continuous function can be approximated by a linear combination of radial basis functions:

\[
f(t) \approx \sum_{k=0}^{N-1} b_k \psi_k(t).
\]

Now let \(t_i = \frac{i}{L}\), i.e., the \(L\) tokens are ordered in time. Each column of \(X\) can then be viewed as the values of some function \(f_j(t)\) at the points \(t_i,\ i=0,1,\cdots, L-1\).

Given \(N\) basis functions \(\psi_k(t),\ k=0,1,\cdots, N-1\), we want to solve for coefficients \(\bm{b}_j = [b_{j0}, b_{j1},\cdots, b_{j,N-1}]^T\) such that \(\sum_k b_{jk}\psi_k(t)\) approximates \(f_j\) (the \(j\)-th column of \(X\)).

Define \(\Psi \in \mathbb{R}^{N \times L}\) with \(\Psi_{ki}=\psi_{k}(t_i)\), and \(B \in \mathbb{R}^{d \times N}\) with \(B_{jk} = b_{jk}\).

The author solves for the coefficients \(B\) by ridge regression:

\[
B = \mathop{\arg\min}_{B}\; \|X - \Psi^T B^T\|_F^2 + \lambda \|B\|_F^2,
\]

which has the closed-form solution:

\[
B = X^T \Psi^T \left(\Psi \Psi^T + \lambda I\right)^{-1}.
\]

We now use \(\tilde{X} := \Psi^T B^T\) in place of \(X\).

Note that \(Q\) is not replaced: the fitted signal serves only as a long-term record, while \(Q\) is recomputed each time.
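As a sanity check, here is a minimal numpy sketch of this fitting step (my own illustration, not the authors' code; the Gaussian basis, the sizes, and \(\lambda\) are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, N, lam = 64, 8, 16, 1e-3

X = rng.normal(size=(L, d))          # token features, one row per token
t = np.arange(L) / L                 # timestamps t_i = i / L

# Gaussian RBF basis (illustrative choice): psi_k(t) = exp(-(t - mu_k)^2 / (2 sigma^2))
mu = np.linspace(0, 1, N)
sigma = 0.1
Psi = np.exp(-(t[None, :] - mu[:, None]) ** 2 / (2 * sigma**2))  # (N, L)

# Ridge regression closed form: B = X^T Psi^T (Psi Psi^T + lam I)^{-1}
B = X.T @ Psi.T @ np.linalg.inv(Psi @ Psi.T + lam * np.eye(N))   # (d, N)

X_tilde = Psi.T @ B.T                # (L, d): the compressed stand-in for X
```

With \(N < L\) this is a lossy compression: the \(L\) token vectors are summarized by \(N\) coefficient vectors.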

For each \(q_i\), we build a density function \(p_i(t)\) over \(t\), which the paper assumes to be Gaussian:

\[
p_i(t) = \mathcal{N}(t;\, \mu_i, \sigma_i^2).
\]

\(\mu_i\) and \(\sigma_i^2\) are predicted by small learned affine maps, with \(\mu_i\) squashed into \((0,1)\) (e.g. by a sigmoid) and \(\sigma_i^2\) kept positive (e.g. by a softplus).

Note that in the last step, \(w^T\Psi^T\) can be absorbed into a single new vector (still written \(w^T\)), since \(\Psi\) is determined in advance.

Recall that in ordinary attention,

\[
p_i(j) = \mathrm{softmax}_j\!\left(\frac{q_i^T k_j}{\sqrt{d}}\right), \qquad z_i = \sum_j p_i(j)\, v_j.
\]

What this computes is effectively a discretized \(p_i(t)\): \(p_i(j)\) measures how well \(q_i\) matches \(k_j\), and the output \(z_i\) is in fact an expectation of the values under \(p_i\).
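This "expectation" reading can be seen in a few lines of numpy (a toy illustration; all sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 5, 4
q = rng.normal(size=d)               # one query q_i
K = rng.normal(size=(L, d))          # keys k_j, one per row
V = rng.normal(size=(L, d))          # values v_j, one per row

scores = K @ q / np.sqrt(d)
p = np.exp(scores - scores.max())
p /= p.sum()                         # p_i(j): a proper distribution over positions j

z = p @ V                            # z_i = sum_j p_i(j) v_j, i.e. the mean of V under p
```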

Now that we have a continuous \(p_i(t)\), the output \(z_i\) can be obtained in the same way, as an expectation:

\[
z_i = \mathbb{E}_{p_i}\!\left[v(t)\right] = \int_0^1 p_i(t)\, v(t)\, \mathrm{d}t,
\]

where \(v(t) = W_V^T B\psi(t)\) is the continuous value signal.

When the \(\psi_k\) are Gaussian radial basis functions, the integral above has a closed form, because the product of two Gaussian densities integrates to another Gaussian density (taking the integral over the whole real line, a good approximation when the densities concentrate inside \([0,1]\)):

\[
\int_{-\infty}^{\infty} \mathcal{N}(t;\, \mu_i, \sigma_i^2)\, \mathcal{N}(t;\, \mu_k, \sigma_k^2)\, \mathrm{d}t = \mathcal{N}\!\left(\mu_i;\, \mu_k,\, \sigma_i^2 + \sigma_k^2\right).
\]
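A quick numerical check of that identity (my own, with arbitrary parameters):

```python
import numpy as np

def gauss(t, mu, var):
    """Gaussian density N(t; mu, var)."""
    return np.exp(-(t - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu_k, var_k = 0.3, 0.05 ** 2     # one Gaussian RBF psi_k
mu_i, var_i = 0.4, 0.02 ** 2     # the query density p_i

# Closed form: product of two Gaussians integrates to a Gaussian density
closed = gauss(mu_i, mu_k, var_i + var_k)

# Riemann-sum check on a fine grid (both densities vanish outside [-1, 2])
t = np.linspace(-1.0, 2.0, 300001)
numeric = np.sum(gauss(t, mu_i, var_i) * gauss(t, mu_k, var_k)) * (t[1] - t[0])
```

So once \(\mu_i, \sigma_i^2\) are known, \(z_i\) needs no numerical integration at all.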

So where do the savings come from? Originally \(K\) was an \(L \times d\) matrix; now we only need to compute \(B^T W\), which is \(N \times d\). We can therefore allow a very large \(L\) while choosing a much smaller \(N\), avoiding the extra complexity.

### How to extend the memory?

Recomputing \(B\) over the full history every time would be impractical; and if the past were simply discarded, this would not be a long-term memory at all.

The author takes a clever approach. Note that \(B\psi(t)\) can be viewed as a \(d\)-dimensional vector-valued function of \(t\).

We first contract it into \([0, \tau],\ \tau \in (0, 1)\):

\[
t \;\mapsto\; B\psi(t/\tau), \qquad t \in [0, \tau].
\]

This way, the energy of the entire function is squeezed into \([0, \tau]\), and the remaining interval \((\tau, 1]\) is freed up to hold the new \(X\).

We first sample \(M\) points \(t_0, \cdots, t_{M-1}\) in \([0, \tau]\) and evaluate the contracted signal there:

\[
X_{past} = \left[B\psi(t_0/\tau), \cdots, B\psi(t_{M-1}/\tau)\right]^T \in \mathbb{R}^{M \times d}.
\]

Appending the new \(X_{new}\), we have

\[
X' = \begin{bmatrix} X_{past} \\ X_{new} \end{bmatrix},
\]

with \(X_{past}\) occupying timestamps in \([0, \tau]\) and \(X_{new}\) occupying \((\tau, 1]\).

Applying the earlier fitting procedure to this \(X'\) yields a new \(B\), and the memory is thereby updated.
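Putting one update step together, a minimal sketch (uniform sampling of the \(M\) points; the basis, sizes, and \(\lambda\) are my assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, M, lam, tau = 8, 16, 32, 1e-3, 0.5
mu = np.linspace(0, 1, N)
sigma = 0.1

def make_Psi(t):
    """Psi[k, i] = psi_k(t_i) for Gaussian RBFs."""
    return np.exp(-(t[None, :] - mu[:, None]) ** 2 / (2 * sigma**2))

def fit_B(X, t):
    """Ridge-regression fit: B = X^T Psi^T (Psi Psi^T + lam I)^{-1}."""
    Psi = make_Psi(t)
    return X.T @ Psi.T @ np.linalg.inv(Psi @ Psi.T + lam * np.eye(N))

# Current memory: coefficients B fitted on some old signal
B = fit_B(rng.normal(size=(64, d)), np.arange(64) / 64)

# 1) contract the old signal into [0, tau] and sample M points from it
t_past = np.linspace(0, tau, M, endpoint=False)
X_past = (B @ make_Psi(t_past / tau)).T          # rows: (B psi(t/tau))^T

# 2) place the new tokens in (tau, 1]
L_new = 48
X_new = rng.normal(size=(L_new, d))
t_new = tau + (1 - tau) * (np.arange(L_new) + 1) / L_new

# 3) refit B on the concatenated signal -> updated memory
X_all = np.vstack([X_past, X_new])
t_all = np.concatenate([t_past, t_new])
B_updated = fit_B(X_all, t_all)
```

The cost per update depends on \(M + L_{new}\) and \(N\), not on how much history the memory already holds.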

As for how to sample these \(M\) points, the author proposes the Sticky Memories method: rather than sampling uniformly, sample according to a density built from how much attention each region of the memory received, so that frequently attended regions are preserved.

### Experimental details

A natural question when reading the paper: what exactly are the radial basis functions?

Take the author's language-model experiments as an example:

150 Gaussian radial basis functions \(\mathcal{N}(t;\mu, \sigma^2)\) are used, with \(\mu\) sampled from \([0, 1]\) and \(\sigma \in \{0.01, 0.05\}\).
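As a concrete sketch of that setup (the uniform draw of \(\mu\) and the even split between the two \(\sigma\) values are my assumptions about what "sampled from \([0,1]\)" means):

```python
import numpy as np

L = 512                                   # number of token positions
t = np.arange(L) / L

# 150 Gaussian RBFs: mu drawn uniformly over [0, 1], sigma in {0.01, 0.05}
rng = np.random.default_rng(3)
mus = rng.uniform(0, 1, size=150)
sigmas = np.where(np.arange(150) % 2 == 0, 0.01, 0.05)

# Psi[k, i] = N(t_i; mu_k, sigma_k^2)
Psi = np.exp(-(t[None, :] - mus[:, None]) ** 2 / (2 * sigmas[:, None] ** 2))
Psi /= np.sqrt(2 * np.pi) * sigmas[:, None]
```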

A KL-divergence term is also used as a regularizer. To me, the interesting points of the paper are the compression step and the treatment of \(\Psi\): since \(\Psi\) is fixed in advance, terms involving it can be precomputed.