# $\infty$-former: Infinite Memory Transformer

Martins P., Marinho Z. and Martins A. $$\infty$$-former: Infinite Memory Transformer. arXiv preprint arXiv:2109.00301, 2021.

## Overview

This paper introduces a long-term memory mechanism into the Transformer.

Assume $$X \in \mathbb{R}^{L \times d}$$, i.e., each row $$x_i$$ represents the feature vector of one token.
Attention then proceeds as follows:

$Q = XW^Q, \quad K = X W^K, \quad V = XW^V, \\ Z = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V.$

For notational simplicity, we do not consider the multi-head case; the ideas below carry over directly.
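As a quick reference, the attention above can be sketched in NumPy. The sizes `L`, `d` and the random inputs are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Single-head attention, matching the formulas above.
# L (sequence length), d (model dim) and all matrices are illustrative.
rng = np.random.default_rng(0)
L, d = 8, 4
X = rng.normal(size=(L, d))                  # one row per token
WQ, WK, WV = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = X @ WQ, X @ WK, X @ WV
A = softmax(Q @ K.T / np.sqrt(d))            # (L, L) attention weights
Z = A @ V                                    # (L, d) outputs
```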

We know that a continuous function can be approximated by a linear combination of radial basis functions:

$\sum_{k} b_k \psi_k (t) \rightarrow f(t).$

Now let $$t_i = \frac{i}{L}$$, i.e., the $$L$$ tokens are arranged along a time axis, so each column of $$X$$ can be viewed as a function $$f_j(t)$$ sampled at the points $$t_i, i=0,1,\cdots, L-1$$.
Given $$N$$ basis functions $$\psi_k (t), k=0,1,\cdots, N-1$$, we want to solve for coefficients $$\bm{b}_j = [b_{j0}, b_{j1},\cdots, b_{j,N-1}]^T$$ that approximate $$f_j$$ (the $$j$$-th column of $$X$$).
Define $$\Psi \in \mathbb{R}^{N \times L}, \Psi_{ki}=\psi_{k}(t_i)$$ and $$B \in \mathbb{R}^{d \times N}, B_{jk} = b_{jk}$$.
The author solves for the coefficients $$B$$ via ridge regression:

$B = \arg \min_{B} \|B \Psi - X^T\|_F^2 + \lambda \|B\|_F^2,$

which admits the closed-form solution:

$B = X^T\Psi^T(\Psi\Psi^T + \lambda I)^{-1}.$

$X^T \approx B\Psi \rightarrow x_i \approx B \psi (t_i).$
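The ridge fit above is a one-liner in NumPy. The sketch below uses Gaussian RBFs; the sizes, the test signal, and $\lambda$ are illustrative assumptions:

```python
import numpy as np

# Ridge-regression fit B = X^T Ψ^T (Ψ Ψ^T + λI)^(-1) with Gaussian RBFs.
# L, d, N, λ, σ and the smooth test signal are illustrative assumptions.
rng = np.random.default_rng(1)
L, d, N, lam = 64, 4, 32, 1e-3
t = np.arange(L) / L                        # t_i = i / L
mu = np.linspace(0.0, 1.0, N)               # RBF centers
sigma = 0.05
Psi = np.exp(-(t[None, :] - mu[:, None])**2 / (2 * sigma**2))  # Ψ_ki = ψ_k(t_i), (N, L)
X = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t),
              t, t**2], axis=1)             # smooth (L, d) "token features"

B = X.T @ Psi.T @ np.linalg.inv(Psi @ Psi.T + lam * np.eye(N))  # (d, N)
X_rec = (B @ Psi).T                         # reconstruction: X^T ≈ B Ψ
err = np.abs(X - X_rec).max()
```

For smooth columns and a dense enough basis the reconstruction error is small, which is exactly the sense in which $x_i \approx B\psi(t_i)$.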

Now we use $$\tilde{X} := \Psi^T B^T$$ in place of $$X$$:

$K = \tilde{X} W^K = \Psi^TB^TW^K, \quad V = \tilde{X}W^V = \Psi^TB^TW^V.$

Note that we do not replace $$Q$$: the memory only serves as a long-term record, and $$Q$$ is recomputed each time.
For each $$q_i$$ we build a density function $$p_i(t)$$ over $$t$$, which the paper assumes to be Gaussian:

$\mathcal{N}(t; \mu_i, \sigma_i^2).$

Here $$\mu_i, \sigma_i^2$$ are estimated as follows:

$\mu_i = \mathrm{sigmoid} (w_{\mu}^T K q_i)=\mathrm{sigmoid} (w_{\mu}^T B^TW^K q_i), \\\sigma^2_i = \mathrm{softplus} (w_{\sigma}^T K q_i)=\mathrm{softplus} (w_{\sigma}^T B^TW^K q_i). \\$

Note that in the last step we absorbed $$w^T\Psi^T$$ into a new $$w^T$$, which is possible since $$\Psi$$ is fixed in advance.
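A minimal sketch of these two predictions, assuming illustrative shapes and random weights (the folded-in $w \in \mathbb{R}^N$ corresponds to the absorption of $w^T\Psi^T$ described above):

```python
import numpy as np

# Predicting μ_i and σ_i² from one query q_i.
# d, N and all weights are illustrative assumptions; with Ψ fixed,
# w^T Ψ^T is folded into a single learned vector w ∈ R^N, so both
# statistics are scalar functions of h = B^T W^K q_i ∈ R^N.
rng = np.random.default_rng(3)
d, N = 4, 32
B = rng.normal(size=(d, N))          # memory coefficients
WK = rng.normal(size=(d, d))
w_mu, w_sigma = rng.normal(size=N), rng.normal(size=N)
q_i = rng.normal(size=d)

h = B.T @ WK @ q_i                               # (N,)
mu_i = 1.0 / (1.0 + np.exp(-(w_mu @ h)))         # sigmoid → μ_i ∈ (0, 1)
sigma2_i = np.logaddexp(0.0, w_sigma @ h)        # stable softplus → σ_i² > 0
```

The sigmoid keeps the mean inside the memory's time span $[0, 1]$, and the softplus guarantees a positive variance.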
We know

$\mathrm{softmax}(\frac{Kq_i}{\sqrt{d}})$

actually computes a discretized $$p_i(t)$$ (measuring how well $$q_i$$ matches each $$k_j$$), and

$\mathrm{softmax}(\frac{Kq_i}{\sqrt{d}})^TV$

in fact computes the expectation

$\mathbb{E}_{p_i}[v(t)].$

Now that we have a continuous $$p_i(t)$$, we can obtain the final $$z_i$$ in the same way:

$\mathbb{E}_{p_i}[v(t)]=\mathbb{E}_{p_i}[\psi^T(t)B^TW^V]=\mathbb{E}_{p_i}[\psi^T(t)]B^TW^V.$

When we take $$\psi$$ to be Gaussian radial basis functions, the expectation above has a closed-form expression.
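Why the Gaussian choice gives a closed form: the expectation of a Gaussian density under another Gaussian is again a Gaussian density, evaluated at the difference of means with the variances added. A quick numerical check (all parameter values are illustrative assumptions):

```python
import numpy as np

# Closed form: if p = N(μ_p, σ_p²) and the basis ψ_k(t) = N(t; μ_k, σ_k²),
# then E_p[ψ_k(t)] = N(μ_p; μ_k, σ_p² + σ_k²). All values are illustrative.
def gauss(t, mu, var):
    return np.exp(-(t - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu_p, var_p = 0.6, 0.05**2          # query-side density p_i(t)
mu_k, var_k = 0.5, 0.1**2           # one Gaussian basis function ψ_k

closed_form = gauss(mu_p, mu_k, var_p + var_k)

# brute-force check by numerical integration over a wide grid
t = np.linspace(-2.0, 3.0, 200001)
numeric = (gauss(t, mu_p, var_p) * gauss(t, mu_k, var_k)).sum() * (t[1] - t[0])
```

Applying this to every $\psi_k$ gives $\mathbb{E}_{p_i}[\psi(t)] \in \mathbb{R}^N$ in closed form, and hence $z_i$ without any discretization.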

Now, where do the savings come from?
The original $$K$$ is $$L\times d$$; here we only need to compute $$B^TW^K$$, which is $$N \times d$$. We can therefore choose a very large $$L$$ but a small $$N$$ to keep the complexity low.

### How to extend the memory?

Recomputing $$B$$ over the whole history each time would be costly, and if we had to, this would hardly count as long-term memory.
The author takes a clever approach: the current $$B\psi(t)$$ can be viewed as a $$d$$-dimensional vector-valued function.
We first compress it into $$[0, \tau], \tau \in (0, 1)$$:

$B\psi(t /\tau),$

so that the energy of the entire function lies in $$[0, \tau]$$, leaving $$(\tau, 1]$$ to hold the new $$X$$.
We first sample $$M$$ points $$t_0, \cdots, t_{M-1}$$ in $$[0, \tau]$$ and evaluate:

$X_{past} = [x_0, \cdots, x_{M-1}]^T \in \mathbb{R}^{M \times d}, x_m=\psi^T(t_m/\tau)B^T.$

Stacking the new $$X_{new}$$ below, we have

$X = [X_{past}^T, X_{new}^T]^T \in \mathbb{R}^{(M + L) \times d},$

Applying the earlier fitting procedure to this $$X$$ to solve for $$B$$ again updates the memory.
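The update step can be sketched as follows. All sizes, $\tau$, $\lambda$, and the uniform sampling of the $M$ points are illustrative assumptions (the paper samples them non-uniformly, as discussed next):

```python
import numpy as np

# Memory update sketch: evaluate the compressed function B ψ(t/τ) at M points
# in [0, τ], stack the new tokens behind them, and refit B by ridge regression.
# Sizes, τ, λ, and the uniform sampling are illustrative assumptions.
rng = np.random.default_rng(2)
d, N, L, M, tau, lam = 4, 32, 64, 32, 0.5, 1e-3
mu = np.linspace(0.0, 1.0, N)
sigma = 0.05

def psi(t):                                   # (N, len(t)) RBF design matrix
    return np.exp(-(np.atleast_1d(t)[None, :] - mu[:, None])**2 / (2 * sigma**2))

def fit_B(X, t):                              # B = X^T Ψ^T (Ψ Ψ^T + λI)^(-1)
    Psi = psi(t)
    return X.T @ Psi.T @ np.linalg.inv(Psi @ Psi.T + lam * np.eye(N))

B = fit_B(rng.normal(size=(L, d)), np.arange(L) / L)   # current memory

t_past = np.linspace(0.0, tau, M, endpoint=False)      # sample [0, τ]
X_past = (B @ psi(t_past / tau)).T                     # (M, d), x_m = B ψ(t_m/τ)
X_new = rng.normal(size=(L, d))                        # incoming tokens
X_all = np.vstack([X_past, X_new])                     # (M + L, d)
t_all = np.arange(M + L) / (M + L)
B_next = fit_B(X_all, t_all)                           # updated memory
```

The cost per update depends only on $M + L$ and $N$, not on how much history has already been absorbed into $B$, which is what makes the memory effectively unbounded.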

As for how to sample these $$M$$ points, the author proposes a Sticky Memories method, which samples them according to a density built from past attention, so that frequently attended regions are better preserved.

### Experimental details

While reading the paper, I kept wondering: what are these radial basis functions concretely?
Taking the author's language-modeling setup as an example:
150 Gaussian radial basis functions $$\mathcal{N}(t;\mu, \sigma^2)$$ are selected,
with $$\mu$$ sampled from $$[0, 1]$$ and $$\sigma \in \{0.01, 0.05\}$$.

A KL divergence term is also used for regularization. To me, the interesting points of this paper are the compression trick and the treatment of $$\Psi$$.
