Data Science Notes

Sequence Modeling

  • LSTM is awesome, but not good enough
  • Transformers: how and why
  • $f(X) \approx y$, where $f$ is the model, $X$ is the input, and $y$ is the outcome/prediction
  • Sequence modeling is the problem here: the input $X$ is a whole sequence (e.g. a document), not a fixed-size vector
  • $f:\mathbb{R}^d \to \mathbb{R}$, where $\mathbb{R}^d$ contains vectors of a fixed size $d$
  • We cannot represent a document as a fixed-size vector
  • Documents have varying lengths
  • No standard linear algebra works on vectors of variable dimensionality
  • Classic way: bag of words
  • Order matters: “work to live” vs “live to work” get the same bag-of-words score (see the sketch below)
  • A fix is N-grams, at the cost of dimensionality $V^N$ for a vocabulary of size $V$
  • Bi-grams (Every pair of possible words)
  • Tri-grams (Every combination of three words)
  • This way we can distinguish between the two
  • English tri-grams: $\approx 10^{15}$ dimensions
  • All sorts of problems arise at that dimensionality.
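    A minimal Python sketch of the order problem (the phrases are from the notes; the helper functions are illustrative, not from any particular library): bag-of-words gives both phrases identical counts, while bi-grams tell them apart.

      from collections import Counter

      def bag_of_words(text):
          # Unordered word counts: all order information is discarded.
          return Counter(text.split())

      def bigrams(text):
          # Counts of adjacent word pairs: order is partially kept,
          # at the cost of a feature space of size V^2.
          words = text.split()
          return Counter(zip(words, words[1:]))

      a, b = "work to live", "live to work"
      print(bag_of_words(a) == bag_of_words(b))  # True  -> same bag-of-words representation
      print(bigrams(a) == bigrams(b))            # False -> bi-grams distinguish the phrases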
  • A natural way to solve this problem is the RNN (Recurrent Neural Network)
  • How to calculate $f(x_1,x_2,x_3,\cdots,x_n)$
  • A for-loop in math: $H_{i+1} = A(H_i, x_i)$ (a small code sketch follows)
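    The same for-loop as a minimal NumPy sketch (the linear form of $A$ and the names $W$, $Z$ follow the notes; the shapes and random initialisation are assumptions, and a real RNN would wrap the update in a nonlinearity such as tanh):

      import numpy as np

      rng = np.random.default_rng(0)
      d_hidden, d_input, n_steps = 8, 4, 10

      W = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # hidden-to-hidden weights
      Z = rng.normal(scale=0.1, size=(d_hidden, d_input))   # input-to-hidden weights
      xs = rng.normal(size=(n_steps, d_input))               # a length-10 input sequence

      H = np.zeros(d_hidden)   # H_0
      for x in xs:             # the "for-loop in math": H_{i+1} = A(H_i, x_i)
          H = W @ H + Z @ x    # linear A(H, x) := WH + Zx

      print(H.shape)  # (8,) -- one fixed-size vector for a variable-length sequence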
  • The problem with RNNs is vanishing and exploding gradients:
    $H_3 = A(A(A(H_0, x_0), x_1), x_2)$
    $A(H, x) := WH + Zx$
    $H_N = W^N H_0 + W^{N-1} Z x_0 + \cdots + Z x_{N-1}$
  • For 100 words the term $W^{100}$ appears; as scalars, $0.9^{100} \approx 2.7 \times 10^{-5}$ while $1.1^{100} \approx 1.4 \times 10^{4}$
  • In linear algebra the story is about the same, except we need to think about the eigenvalues of the matrix
  • Eigenvalues say how much the matrix grows or shrinks vectors when the transformation is applied
  • If the eigenvalues are less than one, the gradients go to zero; if $\lambda > 1$, the gradients explode (numerical check below)
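    A quick numerical check of this argument (a toy sketch: a scaled identity matrix is just the simplest way to fix the eigenvalues): applying $W$ a hundred times scales a vector by roughly $\lambda^{100}$.

      import numpy as np

      print(0.9 ** 100)   # ~2.7e-05 : shrinks toward zero (vanishing)
      print(1.1 ** 100)   # ~1.4e+04 : blows up (exploding)

      for lam in (0.9, 1.1):
          W = lam * np.eye(4)            # a matrix whose eigenvalues are all lam
          v = np.ones(4)
          for _ in range(100):           # 100 recurrent steps
              v = W @ v
          print(lam, np.linalg.norm(v))  # norm is lam**100 times the original norm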
  • This made training RNNs extremely difficult, so LSTMs to the rescue
  • The LSTM is a kind of learned gating; the transformation is not applied recursively to the main hidden (cell) vector, so it is not like simply stacking layers in a CNN/ResNet (minimal cell sketch below)
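    A minimal LSTM-cell sketch (shapes, parameter names, and initialisation are my assumptions, not from the notes): the cell state c is updated additively through learned gates, so no single weight matrix is applied to it over and over, which mitigates the vanishing-gradient problem.

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def lstm_step(x, h, c, params):
          Wf, Wi, Wo, Wg, bf, bi, bo, bg = params
          z = np.concatenate([h, x])   # act on the concatenated [h; x]
          f = sigmoid(Wf @ z + bf)     # forget gate
          i = sigmoid(Wi @ z + bi)     # input gate
          o = sigmoid(Wo @ z + bo)     # output gate
          g = np.tanh(Wg @ z + bg)     # candidate cell update
          c = f * c + i * g            # additive, gated update of the cell state
          h = o * np.tanh(c)           # new hidden state
          return h, c

      rng = np.random.default_rng(0)
      d_h, d_x = 8, 4
      params = [rng.normal(scale=0.1, size=(d_h, d_h + d_x)) for _ in range(4)]
      params += [np.zeros(d_h) for _ in range(4)]
      h, c = np.zeros(d_h), np.zeros(d_h)
      for x in rng.normal(size=(10, d_x)):   # run the cell over a short sequence
          h, c = lstm_step(x, h, c, params)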