\(\infty\)-former: Infinite Memory Transformer

Martins P., Marinho Z. and Martins A. \(\infty\)-former: Infinite Memory Transformer. arXiv preprint arXiv:2109.00301, 2021.


This paper introduces a long-term memory mechanism into the Transformer.

Main content

Assume \(X \in \mathbb{R}^{L \times d}\), i.e., each row \(x_i\) represents the features of one token.
Attention then proceeds as follows:

\[Q = XW^Q, K = X W^K, V = XW^V, \\Z = \mathrm{softmax}(\frac{QK^T}{\sqrt{d}})V.\]
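As a baseline for what follows, a minimal numpy sketch of this single-head attention (toy shapes and random weights, purely illustrative):

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention: Z = softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (L, L) pairwise affinities
    P = np.exp(scores - scores.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)                  # row-wise softmax
    return P @ V

rng = np.random.default_rng(0)
L, d = 8, 4
X = rng.normal(size=(L, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Z = attention(X, Wq, Wk, Wv)
print(Z.shape)  # (8, 4)
```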

For simplicity of notation we do not consider the multi-head case; the ideas below carry over directly.

We know that any continuous function can be approximated by a linear combination of radial basis functions:

\[\sum_{k} b_k \psi_k (t) \rightarrow f(t).\]

Now let \(t_i = \frac{i}{L}\), i.e., the \(L\) tokens are arranged along a time axis; each column of \(X\) can then be seen as the values of some function \(f_j(t)\) at the points \(t_i, i=0,1,\cdots, L-1\).
Given \(N\) basis functions \(\psi_k (t), k=0,1,\cdots, N-1\), we want to solve for coefficients \(\bm{b}_j = [b_{j0}, b_{j1},\cdots b_{j,N-1}]^T\) that approximate \(f_j\) (the \(j\)-th column of \(X\)).
Let \(\Psi \in \mathbb{R}^{N \times L}, \Psi_{ki}=\psi_{k}(t_i)\), and \(B \in \mathbb{R}^{d \times N}, B_{jk} = b_{jk}\).
The author solves for the coefficients \(B\) via ridge regression:

\[B = \arg \min_{B} \|B \Psi - X^T\|_F^2 + \lambda \|B\|_F^2,\]

which has the closed-form solution:

\[B = X^T\Psi^T(\Psi\Psi^T + \lambda I)^{-1}.\]

\[X^T \approx B\Psi \rightarrow x_i \approx B \psi (t_i).\]
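A numpy sketch of this ridge fit (toy dimensions; I use smooth sinusoidal columns so that a small Gaussian RBF basis can represent the signal well — the basis and data here are my own illustrative choices):

```python
import numpy as np

# Toy Gaussian RBF basis: psi_k(t) = exp(-(t - mu_k)^2 / (2 sigma^2)).
def psi_matrix(t, mus, sigma):
    return np.exp(-(t[None, :] - mus[:, None]) ** 2 / (2 * sigma**2))  # (N, len(t))

L, d, N, lam = 64, 4, 30, 1e-4
t = np.arange(L) / L                                   # t_i = i / L
# Smooth columns f_j(t), sampled at the t_i.
X = np.stack([np.sin(2 * np.pi * (j + 1) * t) for j in range(d)], axis=1)  # (L, d)

Psi = psi_matrix(t, np.linspace(0, 1, N), sigma=0.05)  # (N, L), Psi[k, i] = psi_k(t_i)

# Ridge solution B = X^T Psi^T (Psi Psi^T + lam I)^{-1}
B = X.T @ Psi.T @ np.linalg.inv(Psi @ Psi.T + lam * np.eye(N))  # (d, N)

X_hat = (B @ Psi).T                                    # reconstruction, x_i ≈ B psi(t_i)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(B.shape, rel_err)
```

Note that \(B\) has \(d \times N\) entries regardless of \(L\): this is the compression that makes the long-term memory cheap.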

Now we use \(\tilde{X} := \Psi^T B^T\) in place of \(X\):

\[K = \tilde{X} W^K = \Psi^TB^TW^K, \tilde{V} = \tilde{X}W^V = \Psi^TB^TW^V. \]

Note that we do not replace \(Q\): the memory only serves as a long-term record, and \(Q\) is recomputed each time.
For each \(q_i\) we build a density function \(p_i(t)\) over \(t\), which is assumed to be Gaussian:

\[\mathcal{N}(t; \mu_i; \sigma_i^2).\]

\(\mu_i, \sigma_i^2\) are estimated as follows:

\[\mu_i = \mathrm{sigmoid} (w_{\mu}^T K q_i)=\mathrm{sigmoid} (w_{\mu}^T B^TW^K q_i), \\\sigma^2_i = \mathrm{softplus} (w_{\sigma}^T K q_i)=\mathrm{softplus} (w_{\sigma}^T B^TW^K q_i). \\\]

Note that in the last equalities we absorbed \(\Psi\) into the weights, \(w^T\Psi^T \rightarrow w^T\), which is possible since \(\Psi\) is fixed in advance.
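A minimal numpy sketch of this prediction step (in practice \(w_\mu, w_\sigma, W^K\) are learned and \(B\) comes from the ridge fit; here all of them are random placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(0)
d, N = 8, 16
B = rng.normal(size=(d, N))            # memory coefficients (placeholder)
Wk = rng.normal(size=(d, d))           # W^K (placeholder)
w_mu, w_sigma = rng.normal(size=N), rng.normal(size=N)

q = rng.normal(size=d)                 # one query q_i
k_feat = B.T @ Wk @ q                  # B^T W^K q_i, N-dimensional
mu = sigmoid(w_mu @ k_feat)            # location in (0, 1)
var = softplus(w_sigma @ k_feat)       # positive variance
print(mu, var)
```

Sigmoid keeps \(\mu_i\) inside the unit interval that the memory lives on, and softplus guarantees a positive variance.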
We know that standard attention computes the weights

\[p_{ij} = \mathrm{softmax}_j \Big(\frac{q_i^T k_j}{\sqrt{d}}\Big),\]

which is in fact a discretization of \(p_i(t)\), measuring the affinity between \(q_i\) and \(k_j\), and

\[z_i = \sum_j p_{ij} v_j = \mathbb{E}_{j \sim p_i}[v_j]\]

is in fact an expectation.


Now that we have a continuous \(p_i(t)\), we can obtain the final \(z_i\) in the same way:

\[z_i = \mathbb{E}_{t \sim p_i}[v(t)] = \int_0^1 p_i(t)\, v(t)\, dt = (B^T W^V)^T\, \mathbb{E}_{t \sim p_i}[\psi(t)],\]

where \(v(t)^T := \psi(t)^T B^T W^V\).
When \(\psi\) are taken to be Gaussian radial basis functions, the expression above has a closed-form solution.
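Concretely: \(\psi_k(t) = \exp(-(t-\mu_k)^2/(2\sigma_k^2)) = \sqrt{2\pi\sigma_k^2}\,\mathcal{N}(t;\mu_k,\sigma_k^2)\), and a product of two Gaussian densities integrates in closed form, giving \(\mathbb{E}[\psi_k] = \frac{\sigma_k}{\sqrt{\sigma_i^2+\sigma_k^2}} \exp\big(-\frac{(\mu_i-\mu_k)^2}{2(\sigma_i^2+\sigma_k^2)}\big)\) when integrating over all of \(\mathbb{R}\) (this sketch ignores the truncation to \([0,1]\); the toy numbers are my own):

```python
import numpy as np

# Closed form for E_{t ~ N(mu, s2)}[psi_k(t)], psi_k(t) = exp(-(t - mu_k)^2 / (2 sk2)),
# integrating over the whole real line:
#   E[psi_k] = sqrt(sk2 / (s2 + sk2)) * exp(-(mu - mu_k)^2 / (2 (s2 + sk2)))
def rbf_expectation(mu, s2, mu_k, sk2):
    return np.sqrt(sk2 / (s2 + sk2)) * np.exp(-((mu - mu_k) ** 2) / (2 * (s2 + sk2)))

mu, s2 = 0.4, 0.05**2          # the query's density p_i(t) = N(t; mu, s2)
mu_k, sk2 = 0.5, 0.1**2        # one Gaussian RBF psi_k

# Brute-force check with a Riemann sum on a wide grid (tails are negligible there).
t = np.linspace(-2.0, 3.0, 200001)
dt = t[1] - t[0]
p = np.exp(-((t - mu) ** 2) / (2 * s2)) / np.sqrt(2 * np.pi * s2)
psi_k = np.exp(-((t - mu_k) ** 2) / (2 * sk2))
numeric = (p * psi_k).sum() * dt
closed = rbf_expectation(mu, s2, mu_k, sk2)
print(closed, numeric)  # the two values should agree closely
```

Stacking this expectation over all \(N\) basis functions gives \(\mathbb{E}_{p_i}[\psi(t)]\), and \(z_i\) follows by the matrix product above.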

Now, where are the savings?
Originally \(K\) is \(L\times d\); now we only need to compute \(B^TW^K\), which is \(N \times d\). We can therefore choose a very large \(L\) but a smaller \(N\) to avoid high complexity.

How is the memory extended?

Recomputing \(B\) over all past tokens every time would be prohibitive; and if we really did that, this would no longer be a long-term memory.
The author takes a clever approach: note that \(B\psi(t)\) can be seen as a \(d\)-dimensional vector-valued function.
We first compress it into \([0, \tau], \tau \in (0, 1)\):

\[B\psi(t /\tau),\]

so that the energy of the entire function now lies in \([0, \tau]\), and the remaining \((\tau, 1]\) is used to place the new \(X\).
We first sample \(M\) points \(t_0, \cdots, t_{M-1}\) in \([0, \tau]\) and get:

\[X_{past} = [x_0, \cdots, x_{M-1}]^T \in \mathbb{R}^{M \times d}, \quad x_m=B\psi(t_m/\tau).\]

Adding the new \(X_{new}\), we have

\[X = [X_{past}^T, X_{new}^T]^T \in \mathbb{R}^{(M + L) \times d},\]

Feeding this \(X\) through the ridge-regression procedure above again yields a new \(B\), i.e., the memory is updated.
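The compress-and-refit update can be sketched in numpy (toy dimensions; the evenly spaced sampling points and the random "new" tokens are my own illustrative choices):

```python
import numpy as np

def psi_matrix(t, mus, sigma=0.05):
    """Gaussian RBF basis evaluated on a grid: (N, len(t))."""
    return np.exp(-(t[None, :] - mus[:, None]) ** 2 / (2 * sigma**2))

def fit_B(X, Psi, lam=1e-3):
    """Ridge fit B = X^T Psi^T (Psi Psi^T + lam I)^{-1} from samples X (L, d)."""
    N = Psi.shape[0]
    return X.T @ Psi.T @ np.linalg.inv(Psi @ Psi.T + lam * np.eye(N))

rng = np.random.default_rng(0)
d, N, L, M, tau = 4, 32, 64, 48, 0.75
mus = np.linspace(0, 1, N)

# Existing memory: coefficients B of the old signal on [0, 1] (placeholder values).
B = rng.normal(size=(d, N))

# 1) Contract the old signal into [0, tau] and sample M points there:
#    x_m = B psi(t_m / tau), t_m in [0, tau).
t_past = np.linspace(0, tau, M, endpoint=False)
X_past = (B @ psi_matrix(t_past / tau, mus)).T         # (M, d)

# 2) Place the new tokens on (tau, 1] and concatenate.
X_new = rng.normal(size=(L, d))
t_new = tau + (1 - tau) * np.arange(1, L + 1) / L
X_cat = np.vstack([X_past, X_new])                     # (M + L, d)
t_cat = np.concatenate([t_past, t_new])

# 3) Refit the coefficients on the concatenated signal: this is the memory update.
B_next = fit_B(X_cat, psi_matrix(t_cat, mus))
print(B_next.shape)  # (4, 32)
```

The cost per update depends on \(M + L\) and \(N\), not on the total length of the history, which is the point of the scheme.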

As for how to sample these \(M\) points, the author proposes a Sticky Memories method, which samples them according to the attention density so that frequently attended regions are preserved in more detail.

Experimental details

While reading the paper, one question is what these radial basis functions actually look like.
Taking the author's language modeling setup as an example:
150 Gaussian radial basis functions \(\mathcal{N}(t;\mu, \sigma^2)\) are used,
with \(\mu\) sampled from \([0, 1]\) and \(\sigma \in \{0.01, 0.05\}\).

A KL divergence term is also used as a regularizer. I feel the interesting points of this paper are the compression step and the treatment of \(\Psi\).