Sequence Modeling
- LSTM is awesome, but not good enough
- Transformers: how and why
- $f(X) \approx y$, where $f$ is the model, $X$ is the input, and $y$ is the outcome/prediction
- Sequence modeling is a problem for this standard setup
- $f:\mathbb{R}^d \to \mathbb{R}$, where $\mathbb{R}^d$ contains fixed-size vectors of dimension $d$
- We cannot represent documents as fixed-size vectors
- Documents are of various lengths
- There is no standard linear algebra that works on vectors of variable dimensionality
- Classic way: bag of words
- Order matters: “work to live” vs. “live to work”; both get the same bag-of-words score.
- A solution would be N-grams, at the cost of dimensionality $V^N$ (where $V$ is the vocabulary size)
- Bi-grams (every possible pair of words)
- Tri-grams (every possible combination of three words)
- This way we can distinguish between the two phrases (see the sketch below)
- English tri-grams: roughly $10^{15}$ dimensions
- All sorts of problems come with that many dimensions (sparsity, memory, generalization)
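A minimal sketch in plain Python (the helper names are mine, not from the notes) of why bag-of-words cannot tell the two phrases apart while bi-grams can:

```python
from collections import Counter

def bag_of_words(text):
    # Unordered word counts: all order information is lost.
    return Counter(text.split())

def bigrams(text):
    # Counts of adjacent word pairs: local order is preserved.
    words = text.split()
    return Counter(zip(words, words[1:]))

a, b = "work to live", "live to work"
print(bag_of_words(a) == bag_of_words(b))  # True  -> identical bag-of-words vectors
print(bigrams(a) == bigrams(b))            # False -> bi-grams distinguish them
```

The price is the feature space: every possible pair (or triple) of words gets its own dimension, which is where the $V^N$ blow-up comes from.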
- A natural way to solve this problem is the RNN (Recurrent Neural Network)
- How to calculate $f(x_1,x_2,x_3,\cdots,x_n)$
- A for-loop in math: $H_{i+1} = A(H_i, x_i)$ (sketched in code below)
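A rough NumPy sketch of that for-loop, assuming a simple tanh RNN cell (the names `W`, `U`, `b` and the sizes are illustrative, not from the notes):

```python
import numpy as np

def rnn_forward(xs, W, U, b, h0):
    """Apply H_{i+1} = A(H_i, x_i) = tanh(W @ H_i + U @ x_i + b) over a sequence."""
    h = h0
    for x in xs:                 # literally a for-loop over the sequence
        h = np.tanh(W @ h + U @ x + b)
    return h                     # one fixed-size vector, whatever the length n

d_in, d_h = 8, 16
rng = np.random.default_rng(0)
xs = [rng.normal(size=d_in) for _ in range(5)]   # a length-5 "document"
W = rng.normal(size=(d_h, d_h)) * 0.1
U = rng.normal(size=(d_h, d_in)) * 0.1
print(rnn_forward(xs, W, U, np.zeros(d_h), np.zeros(d_h)).shape)  # (16,)
```

The same code runs unchanged on a 5-word or a 5,000-word input, which is exactly what the fixed-size $f:\mathbb{R}^d \to \mathbb{R}$ setup could not do.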
- The problem with RNNs is vanishing and exploding gradients
$H_3 = A(A(A(H_0, x_0), x_1), x_2)$
$A(H, x) := WH + Zx$
$H_N = W^N H_0 + W^{N-1} Z x_0 + W^{N-2} Z x_1 + \cdots + Z x_{N-1}$
- For 100 words, $W^{100}$ shows up; with scalars, $0.9^{100} \approx 2.7 \times 10^{-5}$ while $1.1^{100} \approx 1.4 \times 10^{4}$
- In linear algebra the story is about the same, except we need to think about the eigenvalues of the matrix
- Eigenvalues tell us how much the matrix grows or shrinks vectors when the transformation is applied
- If the eigenvalues are less than one, the gradients go to zero; if $\lambda > 1$, the gradients explode (a quick numerical check follows below)
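A quick numerical sketch of both claims, using arbitrary illustrative matrices: the scalar case directly, and the matrix case by rescaling the spectral radius (the largest $|\lambda|$) to 0.9 or 1.1.

```python
import numpy as np

# Scalar case: repeated multiplication over 100 steps.
print(0.9 ** 100)   # ~2.7e-05 -> vanishes
print(1.1 ** 100)   # ~1.4e+04 -> explodes

# Matrix case: what matters is the spectral radius, the largest |eigenvalue|.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
rho = np.max(np.abs(np.linalg.eigvals(W)))
W_shrink = W * (0.9 / rho)   # rescale so the spectral radius is 0.9
W_grow   = W * (1.1 / rho)   # rescale so the spectral radius is 1.1

v = rng.normal(size=16)
print(np.linalg.norm(np.linalg.matrix_power(W_shrink, 100) @ v))  # small: decays like 0.9^100
print(np.linalg.norm(np.linalg.matrix_power(W_grow, 100) @ v))    # large: grows like 1.1^100
```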
- This made RNNs extremely difficult to train, so LSTM to the rescue
- LSTM is a kind of gated RNN; the recurrent weight matrix is not applied over and over to the main cell vector, and it is not simply a skip connection like a CNN ResNet; the cell is updated additively through gates (see the sketch below)
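A minimal LSTM cell sketch in NumPy (standard LSTM equations; the parameter names and sizes are illustrative) showing that the cell state `c` is updated by a gated addition rather than by repeatedly multiplying it with a recurrent weight matrix:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, P):
    """One LSTM step; P holds the weight matrices and biases for the four gates."""
    z = np.concatenate([h, x])
    f = sigmoid(P["Wf"] @ z + P["bf"])   # forget gate
    i = sigmoid(P["Wi"] @ z + P["bi"])   # input gate
    o = sigmoid(P["Wo"] @ z + P["bo"])   # output gate
    g = np.tanh(P["Wg"] @ z + P["bg"])   # candidate update
    c = f * c + i * g                    # gated, additive update: no matrix power on c
    h = o * np.tanh(c)
    return h, c

d_in, d_h = 8, 16
rng = np.random.default_rng(0)
P = {f"W{k}": rng.normal(size=(d_h, d_h + d_in)) * 0.1 for k in "fiog"}
P.update({f"b{k}": np.zeros(d_h) for k in "fiog"})
h, c = np.zeros(d_h), np.zeros(d_h)
for x in [rng.normal(size=d_in) for _ in range(5)]:
    h, c = lstm_step(x, h, c, P)
print(h.shape, c.shape)  # (16,) (16,)
```

Because the cell update is elementwise ($c_t = f_t \odot c_{t-1} + i_t \odot g_t$), gradients along the cell path are not forced through repeated multiplications by the same matrix, which is what made the plain RNN vanish or explode.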