Data Science Notes

Sequence Modeling

  • LSTM is awesome, but not good enough
  • Transformers: how and why
  • $f(X) \approx y$, where $f$ is the model, $X$ is the input, and $y$ is the outcome/prediction
  • Sequence modeling is the problem here: the input $X$ is a whole sequence (e.g. a document), not a fixed-size vector
  • $f:\mathbb{R}^d \to \mathbb{R}$, where $\mathbb{R}^d$ contains vectors of a fixed size $d$
  • We cannot represent a document as a fixed-size vector
  • Documents have varying lengths
  • No standard linear algebra works on vectors of variable dimensionality
  • Classic way: bag of words
  • Order matters: “work to live” vs “live to work” get the same bag-of-words score (see the sketch below)
  • A fix is N-grams, at the cost of dimensionality $V^N$ for a vocabulary of size $V$
  • Bi-grams (Every pair of possible words)
  • Tri-grams (Every combination of three words)
  • This way we can distinguish between the two
  • English tri-grams: $\approx 10^{15}$ dimensions
  • All sorts of problems arise at that dimensionality.
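    A minimal Python sketch of the order problem (the phrases are from the notes; the helper functions are illustrative, not from any particular library): bag-of-words gives both phrases identical counts, while bi-grams tell them apart.

      from collections import Counter

      def bag_of_words(text):
          # Unordered word counts: all order information is discarded.
          return Counter(text.split())

      def bigrams(text):
          # Counts of adjacent word pairs: order is partially kept,
          # at the cost of a feature space of size V^2.
          words = text.split()
          return Counter(zip(words, words[1:]))

      a, b = "work to live", "live to work"
      print(bag_of_words(a) == bag_of_words(b))  # True  -> same bag-of-words representation
      print(bigrams(a) == bigrams(b))            # False -> bi-grams distinguish the phrases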
  • A natural way to solve this problem is the RNN (Recurrent Neural Network)
  • How to calculate $f(x_1,x_2,x_3,\cdots,x_n)$
  • A for-loop in math: $H_{i+1} = A(H_i, x_i)$ (a small code sketch follows)
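    The same for-loop as a minimal NumPy sketch (the linear form of $A$ and the names $W$, $Z$ follow the notes; the shapes and random initialisation are assumptions, and a real RNN would wrap the update in a nonlinearity such as tanh):

      import numpy as np

      rng = np.random.default_rng(0)
      d_hidden, d_input, n_steps = 8, 4, 10

      W = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # hidden-to-hidden weights
      Z = rng.normal(scale=0.1, size=(d_hidden, d_input))   # input-to-hidden weights
      xs = rng.normal(size=(n_steps, d_input))               # a length-10 input sequence

      H = np.zeros(d_hidden)   # H_0
      for x in xs:             # the "for-loop in math": H_{i+1} = A(H_i, x_i)
          H = W @ H + Z @ x    # linear A(H, x) := WH + Zx

      print(H.shape)  # (8,) -- one fixed-size vector for a variable-length sequence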
  • The problem with RNNs is vanishing and exploding gradients:
    $H_3 = A(A(A(H_0, x_0), x_1), x_2)$
    $A(H, x) := WH + Zx$
    $H_N = W^N H_0 + W^{N-1} Z x_0 + \cdots + Z x_{N-1}$
  • For 100 words the term $W^{100}$ appears; as scalars, $0.9^{100} \approx 2.7 \times 10^{-5}$ while $1.1^{100} \approx 1.4 \times 10^{4}$
  • In linear algebra the story is about the same, except we need to think about the eigenvalues of the matrix
  • Eigenvalues say how much the matrix grows or shrinks vectors when the transformation is applied
  • If the eigenvalues are less than one, the gradients go to zero; if $\lambda > 1$, the gradients explode (numerical check below)
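    A quick numerical check of this argument (a toy sketch: a scaled identity matrix is just the simplest way to fix the eigenvalues): applying $W$ a hundred times scales a vector by roughly $\lambda^{100}$.

      import numpy as np

      print(0.9 ** 100)   # ~2.7e-05 : shrinks toward zero (vanishing)
      print(1.1 ** 100)   # ~1.4e+04 : blows up (exploding)

      for lam in (0.9, 1.1):
          W = lam * np.eye(4)            # a matrix whose eigenvalues are all lam
          v = np.ones(4)
          for _ in range(100):           # 100 recurrent steps
              v = W @ v
          print(lam, np.linalg.norm(v))  # norm is lam**100 times the original norm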
  • This made training RNNs extremely difficult, so LSTMs to the rescue
  • The LSTM is a kind of learned gating; the transformation is not applied recursively to the main hidden (cell) vector, so it is not like simply stacking layers in a CNN/ResNet (minimal cell sketch below)
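    A minimal LSTM-cell sketch (shapes, parameter names, and initialisation are my assumptions, not from the notes): the cell state c is updated additively through learned gates, so no single weight matrix is applied to it over and over, which mitigates the vanishing-gradient problem.

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def lstm_step(x, h, c, params):
          Wf, Wi, Wo, Wg, bf, bi, bo, bg = params
          z = np.concatenate([h, x])   # act on the concatenated [h; x]
          f = sigmoid(Wf @ z + bf)     # forget gate
          i = sigmoid(Wi @ z + bi)     # input gate
          o = sigmoid(Wo @ z + bo)     # output gate
          g = np.tanh(Wg @ z + bg)     # candidate cell update
          c = f * c + i * g            # additive, gated update of the cell state
          h = o * np.tanh(c)           # new hidden state
          return h, c

      rng = np.random.default_rng(0)
      d_h, d_x = 8, 4
      params = [rng.normal(scale=0.1, size=(d_h, d_h + d_x)) for _ in range(4)]
      params += [np.zeros(d_h) for _ in range(4)]
      h, c = np.zeros(d_h), np.zeros(d_h)
      for x in rng.normal(size=(10, d_x)):   # run the cell over a short sequence
          h, c = lstm_step(x, h, c, params)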