Seattle Applied Deep Learning
LSTM is dead. Long Live Transformers!
Leo Dirac (@leopd) talks about how LSTM models for Natural Language Processing (NLP) have been practically replaced by transformer-based models. He gives basic background on NLP and a brief history of supervised learning techniques on documents, from bag-of-words through vanilla RNNs and LSTMs, then a technical deep dive into how Transformers work with multi-headed self-attention and positional encoding. Includes sample code for applying these ideas to real-world projects.
Views: 524,636

Videos

Geometric Intuition for Training Neural Networks
18K views · 4 years ago
Leo Dirac (@leopd) gives a geometric intuition for what happens when you train a deep learning neural network, starting with a physics analogy for how SGD works and describing the shape of neural network loss surfaces. This talk was recorded live on 12 Nov 2019 as part of the Seattle Applied Deep Learning (sea-adl.org) series. References from the talk: Loss Surfaces of Multilayer networks arxi...

COMMENTS

  • @jeffg4686
    @jeffg4686 2 months ago

    Relevance is just how often a word appears in the input? Never mind - I looked it up. The answer is similarity of tokens in the embedding space: tokens with higher similarity get more relevance.
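
    A minimal sketch of that idea (illustrative toy code, not from the talk): relevance comes from dot-product similarity between token embeddings, normalized into weights with a softmax.

        import numpy as np

        def attention_weights(query, keys):
            # Dot-product similarity between one query embedding and all key
            # embeddings, scaled by sqrt(d) and softmax-normalized into relevance weights.
            d = query.shape[-1]
            scores = keys @ query / np.sqrt(d)
            weights = np.exp(scores - scores.max())
            return weights / weights.sum()

        # Tokens whose embeddings are more similar to the query get more relevance.
        emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # toy token embeddings
        print(attention_weights(emb[0], emb))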

  • @maciej2320
    @maciej2320 3 months ago

    Four years ago! Shocking.

  • @DeltonMyalil
    @DeltonMyalil 3 months ago

    This aged like fine wine.

  • @joneskiller8
    @joneskiller8 4 months ago

    I need that belt.

  • @axe863
    @axe863 5 months ago

    My greatest successes have come from blending traditional time-series modeling with Transformer-style models, e.g. wavelet-denoised ARTFIMA + TFT.

  • @SideOfHustle
    @SideOfHustle 6 months ago

    Are there really half a million of you out there who understand this?

  • @rohitdhankar360
    @rohitdhankar360 8 months ago

    @10:30 - Attention is all you need -- Multi Head Attention Mechanism --

  • @chrisfiegel9455
    @chrisfiegel9455 9 months ago

    This was amazing.

  • @felipevaldes7679
    @felipevaldes7679 10 months ago

    Leo Dirac: "Can't pretrain on a large corpus." Sam Altman: "Hold my beer..."

    • @LeoDirac
      @LeoDirac 3 months ago

      While I appreciate the association, what did I say to imply you can't pretrain on a large corpus? In the summary "Key Advantages of Transformers" I wrote "Can be trained on unsupervised text; all the world's text data is now valid training data."

  • @DoctorMGL
    @DoctorMGL 10 months ago

    I came here for the "Transformers movie" and ended up watching something I didn't understand s*t of; the whole video was like an alien language to me.

  • @carefir
    @carefir 11 months ago

    I can't believe all the zombie praise piled on this lecture. The concepts are not explained clearly, and there are potential errors in several of his explanations (e.g. 8 times? for all transformers? why? magic number?). I would recommend Michael Phi's, and then Sebastian Raschka's videos (or better yet, Niels Rogge's), over this one, for anyone who really wants to know what's going on. Those videos, at the very least, get very few things wrong, if any.

    • @seattleapplieddeeplearning
      @seattleapplieddeeplearning 11 months ago

      The reference to doing the attention process "8 times" is from the original attention paper arxiv.org/pdf/1706.03762.pdf where it says "h = 8 parallel attention layers" at the top of page 5. h=8 is of course a hyper-parameter that varies based on the transformer architecture - BERT and GPT2 used 12, others use hundreds. aclanthology.org/2020.acl-main.311.pdf has an interesting analysis of the role of these heads if you'd like to dive in. (Sorry for the "of course" - as an explanatory video I shouldn't assume too much about my audience, but I guess I was assuming that at this point everybody understands that various structural choices like number of layers or embedding dimension are fairly arbitrary.)
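
      To make "h parallel attention layers" concrete, here is a minimal toy sketch (not the paper's code) where the number of heads is just a configuration value:

          import numpy as np

          def multi_head_self_attention(x, num_heads=8):
              # x: (seq_len, d_model). Split the model dimension across num_heads heads,
              # run scaled dot-product self-attention in each head, then concatenate.
              seq_len, d_model = x.shape
              assert d_model % num_heads == 0
              d_head = d_model // num_heads
              heads = []
              for h in range(num_heads):
                  # Per-head slice; a real transformer uses learned Q/K/V projections here.
                  q = k = v = x[:, h * d_head:(h + 1) * d_head]
                  scores = q @ k.T / np.sqrt(d_head)
                  weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
                  weights /= weights.sum(axis=-1, keepdims=True)
                  heads.append(weights @ v)
              return np.concatenate(heads, axis=-1)  # back to (seq_len, d_model)

          # h=8 here, but BERT/GPT-2 style models simply set num_heads=12 instead.
          out = multi_head_self_attention(np.random.randn(10, 64), num_heads=8)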

    • @seattleapplieddeeplearning
      @seattleapplieddeeplearning 11 months ago

      If you see any factual mistakes, please point them out. YouTube doesn't seem to make it easy to fix them. The only one I know of is that the equation at 4:20 should be H[i+1] = A(H[i]; x[i+1]) - which is a miss, but I don't think it's likely to confuse too many people.
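
      As a quick sketch of what that corrected recurrence means (with A standing in for whatever recurrent cell is used):

          def run_rnn(A, xs, h0):
              # Corrected recurrence: H[i+1] = A(H[i]; x[i+1]) - each new hidden state
              # combines the previous hidden state with the *next* input token.
              h = h0
              for x in xs:      # x plays the role of x[i+1]
                  h = A(h, x)
              return h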

    • @carefir
      @carefir 11 months ago

      @@seattleapplieddeeplearning I don't know. Correctness and accuracy are important to me. If I saw even one error in my own work, I would apologize to everyone who had seen it, take the thing down instantly, and put up an errata ASAP. Correctness does not seem as important to the new, "postmodern", generation -- it seems "truth" no longer has much utility to them. Besides typos, there are several points I find lacking in your video. For instance, why introduce ReLU in a lecture about transformers? Explaining multi-headed attention as learning grammar, vocabulary, etc. is misleading, since the architecture provides no built-in mechanism for learning those features -- they may end up being learned, but there is no such guarantee, and you would need to prove that it indeed does so. But my position is that if there exists even one error in our publication, it is our responsibility to have it fixed lest the error propagate through the industry (and it will! especially when there are so many "bootleggers"). Of course, YMMV. That's all the ranting from me. Thanks for responding.

  • @zeeshanashraf4502
    @zeeshanashraf4502 11 months ago

    Great presentation.

  • @tastyw0rm
    @tastyw0rm 11 months ago

    This was more than meets the eye

  • @driziiD
    @driziiD a year ago

    very impressive presentation. thank you.

  • @_RMSG_
    @_RMSG_ a year ago

    I love this presentation. It doesn't assume the audience knows far more than is necessary, goes through explanations of the relevant parts of Transformers, notes shortcomings, etc. Best slideshow I've seen this year, and it's from over 3 years ago.

  • @JohnNy-ni9np
    @JohnNy-ni9np a year ago

    I'm glad to find your clip, which sheds some light on the internal workings of ChatGPT 3.5. Now I have a question: recently there has been a potential defamation lawsuit against OpenAI because ChatGPT wrongly stated that Brian Hood was guilty in the Securency bribery scandal, while in fact he was just a whistleblower. For OpenAI to remedy this response from ChatGPT, do they have to completely retrain the model from scratch (costly), or do they have a (cheap) mechanism to alter ChatGPT's response?

    • @seattleapplieddeeplearning
      @seattleapplieddeeplearning a year ago

      IANAL, and this is certainly a legal issue. I will say that it's standard practice in many real-world ML applications to have a list of hand-coded filters and rules which run after ML predictions to deal with special cases like this.

  • @ziruiliu3998
    @ziruiliu3998 a year ago

    Supposing I am using a net to approximate a real-world physics ODE with time-series data, is the Transformer still the best choice in this case?

    • @seattleapplieddeeplearning
      @seattleapplieddeeplearning a year ago

      I'm not sure. I have barely read any papers on this kind of modeling. I will say that a wonderful property of transformers is that they can learn to analyze arbitrary dimensional inputs - it's easy to create positional encodings for 1D inputs (sequence), or 2D (image), or 3D, 4D, 5D, etc. Some physics modeling scenarios will want this kind of input. If your inputs are purely 1D, you could use older NN architectures, but in 2023 there are very few situations where I'd choose an LSTM over a transformer. (e.g. if you need an extremely long time horizon.) -Leo
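
      As a rough illustration of how simple the 1D case is, here is a toy sketch of the standard sinusoidal positional encoding (illustrative only, not tied to any particular library):

          import numpy as np

          def sinusoidal_positions(seq_len, d_model):
              # Each position gets a unique pattern of sines and cosines at
              # geometrically spaced frequencies (d_model assumed even).
              pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
              i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
              angles = pos / (10000 ** (2 * i / d_model))
              pe = np.zeros((seq_len, d_model))
              pe[:, 0::2] = np.sin(angles)
              pe[:, 1::2] = np.cos(angles)
              return pe

          # For 2D inputs (images) the same trick can be applied per axis and concatenated.
          pe = sinusoidal_positions(seq_len=128, d_model=64)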

    • @ziruiliu3998
      @ziruiliu3998 a year ago

      @@seattleapplieddeeplearning Thanks for your reply, this really helps me.

  • @ChrisHalden007
    @ChrisHalden007 a year ago

    Great video. Thanks

  • @kampkrieger
    @kampkrieger a year ago

    Typical example of a bad lecture. He only shows stuff without introducing or explaining what he is showing (what is that graph about? what do the axes or the arrows mean?), talks about it, and goes on to the next slide.

  • @GoogleUser-ee8ro
    @GoogleUser-ee8ro a year ago

    This beautiful talk is from before OpenAI's GPT; the world badly needs an update.

    • @JohnNy-ni9np
      @JohnNy-ni9np a year ago

      Unfortunately OpenAI is closed source by now; people cannot openly talk about its internal structure anymore.

  • @aiglv
    @aiglv a year ago

    cool

  • @musicphilebd9862
    @musicphilebd9862 a year ago

    Schmidhuba comin to get ya !

  • @TJVideoChannelUTube
    @TJVideoChannelUTube a year ago

    At 18:25, Leo Dirac mentioned that the Transformer model doesn't need activation functions. Is this correct?

    • @seattleapplieddeeplearning
      @seattleapplieddeeplearning a year ago

      No, that's not correct. I understand the confusion, though. The advantage is that Transformers don't require those specific "complex" activations of sigmoid or tanh which LSTM relies on. Transformers can use ReLU activations, which are computationally much simpler. With GPUs the actual amount of computation isn't really the issue, but rather the precision at which they need to run. LSTMs typically need to run in full 32-bit precision, whereas modern datacenter GPUs like A100, H100 or TPUs are way faster at 16-bit computation. That's because tanh and sigmoid squash inputs down to numbers that are very close to 1, 0 or -1, and so small differences, say between 0.991 and 0.992, become very meaningful, requiring lots of digits of precision, and thus lots of silicon to keep track of them. But simpler activations like ReLU tend to work much better on 16-bit silicon. One more clarification: Transformers typically require softmax computations for the attention mechanism, which are technically very similar to sigmoid. Softmax still squashes inputs close to 0 and 1. But for reasons ... the small differences in softmax activations don't matter much.

    • @seattleapplieddeeplearning
      @seattleapplieddeeplearning a year ago

      Okay, reasons. The difference is subtle, but essentially it's because a 0.999 coming out of softmax means "pay full attention here" and the other 0.001 doesn't matter for anything. But in an LSTM a 0.999 coming out of a sigmoid is effectively saying "0.001 of the backprop signal goes back in time to the previous token". And then you might need another 0.001 of that to go back in time to the token before that. So that's why the higher precision is critical for LSTM. Vanishing gradients.
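
      A quick way to see the precision point (a toy numpy check, using float16 as a stand-in for 16-bit hardware):

          import numpy as np

          # Sigmoid outputs that are distinct in float32 collapse together in float16,
          # because float16 can only resolve a few decimal digits near 1.0.
          x = np.array([5.0, 8.0, 12.0], dtype=np.float32)
          sig32 = 1.0 / (1.0 + np.exp(-x))
          sig16 = sig32.astype(np.float16)
          print(sig32)        # three distinct values just below 1
          print(sig16)        # the largest rounds to exactly 1.0
          print(1.0 - sig16)  # the "signal that goes back in time" loses most of its precision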

    • @TJVideoChannelUTube
      @TJVideoChannelUTube a year ago

      @@seattleapplieddeeplearning Here are comments by ChatGPT: Activation functions are not essential for the Transformer model as they are not used within the model. The self-attention mechanism within the Transformer is able to introduce non-linearity into the model, which is why the Transformer does not require activation functions like other neural networks. Instead, the self-attention mechanism in the Transformer uses matrix multiplication and softmax functions to compute the attention scores, and these scores are used to weight the input vectors. The use of the softmax function in the attention mechanism can be considered a form of activation function, but it is not the same as the commonly used activation functions in other neural network models. However, activation functions can be used in other parts of the Transformer model, such as in the feed-forward neural networks in the encoder and decoder layers. In transformer models, the self-attention mechanism replaces the need for activation functions in the traditional sense, as it allows for the model to selectively weight the input features without explicitly passing them through an activation function.

    • @seattleapplieddeeplearning
      @seattleapplieddeeplearning a year ago

      ChatGPT's answer is a bit misleading (especially the first sentence), but not necessarily wrong. Any NN is effectively useless without some kind of nonlinearity, and "activation" is pretty synonymous with "nonlinearity", just a bit more vague. It's true that because there's nonlinearity in the attention mechanism you don't need it in the other areas, which are typically MLPs, so I'll call them MLPs. But I'm not aware of any real transformers in use which skip the nonlinearity in the "MLP" parts of the model. The point I was making on that slide is why Transformers are better than LSTM and similar old-school NNs, and that's because LSTM intrinsically _needs_ tanh and sigmoid to operate, and these have a bunch of problems. Transformers can use any activation function. (Or arguably "none", but I wouldn't say that because I think it's misleading, and it probably would hurt quality a lot too.)

    • @TJVideoChannelUTube
      @TJVideoChannelUTube a year ago

      @@seattleapplieddeeplearning I think ChatGPT implies that the self-attention mechanism in the Transformer model does not use activation functions. Is this statement correct: the self-attention mechanism is not involved in the deep learning process, because no activation functions are needed in this layer?

  • @Djellowman
    @Djellowman a year ago

    What's with the belt?

    • @seattleapplieddeeplearning
      @seattleapplieddeeplearning a year ago

      I love that belt! This video doesn't show it well, but I think that belt really brings out the brown in my eyes.

    • @Djellowman
      @Djellowman a year ago

      @@seattleapplieddeeplearning Could be, but I think it doesn't fit your fit.

  • @TruthOfZ0
    @TruthOfZ0 a year ago

    Well, the problem is that you are using AI to make it learn, wasting time and resources, rather than using machine learning as an optimizer, which is a better use of neural networks! What I mean is that most don't get when you are supposed to use a neural network as an AI that learns from data and when to use a neural network as an optimizer!! You need an engineer for that, not a PhD IT professor xD. Stop wasting your time and hire more engineers!!!

  • @Kevin_Kennelly
    @Kevin_Kennelly a year ago

    When using acronyms, it is not good to LRTD. And do not ever GLERD. People won't understand the SMARG. It does help if you ETFM (Explain the Fu*king Meaning) as you write.

  • @stevelam5898
    @stevelam5898 a year ago

    I had a tutorial a few hours ago on how to build an LSTM network using TF only; it left me feeling completely stupid. Thank you for showing there is a better way.

  • @terjeoseberg990
    @terjeoseberg990 a year ago

    Did anyone try scaling the matrices so that the eigenvalue is exactly 1?

    • @seattleapplieddeeplearning
      @seattleapplieddeeplearning a year ago

      (Leo here - sorry if you see this twice, but YT is blocking comments from my account for some reason.) Yes! My favorite paper on this topic is from Bengio's group, which uses unitary weight matrices: complex-valued, but constrained to have eigenvalues with magnitude exactly 1. arxiv.org/abs/1511.06464 A simpler approach is to just initialize the weight matrices with real-valued orthonormal matrices; there's a good summary at smerity.com/articles/2016/orthogonal_init.html But overall I think the key thing is that not long after these ideas were being explored, Transformers came along, which are simpler, more robust, and have plenty of other advantages. Critically, IMHO, the training depth doesn't scale with the sequence length, which makes convergence much simpler.
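
      A minimal sketch of that simpler approach (orthonormal initialization via QR; illustrative only):

          import numpy as np

          def orthonormal_init(n, seed=0):
              # QR decomposition of a random Gaussian matrix gives an orthonormal Q,
              # whose eigenvalues all have magnitude exactly 1.
              rng = np.random.default_rng(seed)
              a = rng.standard_normal((n, n))
              q, r = np.linalg.qr(a)
              q *= np.sign(np.diag(r))   # sign fix so the result is well distributed
              return q

          W = orthonormal_init(256)                 # e.g. a recurrent weight matrix
          print(np.allclose(W @ W.T, np.eye(256)))  # True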

    • @terjeoseberg990
      @terjeoseberg990 a year ago

      @@seattleapplieddeeplearning, Thanks.

  • @lmao4982
    @lmao4982 a year ago

    This is like 90% of what I remember from my NLP course with all the uncertainty cleared up, thanks!

  • @JeffCaplan313
    @JeffCaplan313 a year ago

    Transformers seem overly prone to recency bias.

  • @beire1569
    @beire1569 a year ago

    ooooh I so want to see a documentary about this ==> @25:20

  • @danielschoch9604
    @danielschoch9604 a year ago

    Linear algebra of variable dimensions? Fock Spaces. Known for 90 years. en.wikipedia.org/wiki/Fock_space

  • @23232323rdurian
    @23232323rdurian a year ago

    The Eng:French matrix/diagram at 11:35 shows attention between an English and a French vector. But that would involve both the ENCODing and DECODing - how they interact - whereas the speaker is discussing *only* the internals of the ATTENTION mechanism in the Encoder at this point. I'd really like to see a similar matrix/diagram illustrating the use of attention WITHIN the ENCODing pass... it wouldn't involve French at all at this point, because the ENCODER hasn't even got to the shared representation yet - the machine version of the <meaning> of the input that comes AFTER the ENCODE, but BEFORE the DECODE. And you're not alone; I see this same vagueness elsewhere in other <explanations> of Transformer processing... but then, most likely I just misunderstand.

  • @cafeinomano_
    @cafeinomano_ a year ago

    Best Transformer explanation ever.

  • @morgengabe1
    @morgengabe1 a year ago

    If only he'd discovered this before thinking WeWork up.

  • @argc
    @argc a year ago

    zlatan?

  • @FrancescoCapuano-ll1md
    @FrancescoCapuano-ll1md a year ago

    This is outstanding!

  • @Stopinvadingmyhardware
    @Stopinvadingmyhardware a year ago

    No, don’t care about them.

  • @cliffrosen5180
    @cliffrosen5180 a year ago

    Wonderfully clear and precise presentation. One thing that tripped me up, though, is this formula at 4 minutes in: H[i+1] = A(H[i], x[i]). It seems this should rather be H[i+1] = A(H[i], x[i+1]), which might be more intuitively written as H[i] = A(H[i-1], x[i]).

  • @johnnyBrwn
    @johnnyBrwn a year ago

    This is such a rich talk. He should definitely change the title. I've searched far and wide for a lucid explanation of LSTM - this is the best one online, but it doesn't seem that way because of the odd title.

  • @BartoszBielecki
    @BartoszBielecki a year ago

    The world deserves more lectures like this one. I don't need examples of how to tune a U-Net, but an overview of this huge research space and the ideas underneath each group.

  • @user-iw7ku6ml7j
    @user-iw7ku6ml7j a year ago

    Awesome!

  • @randomcandy1000
    @randomcandy1000 a year ago

    this is awesome!!! thank you

  • @matthewhuang7857
    @matthewhuang7857 a year ago

    Thanks for the speech, Leo! I'm now a couple of months into ML and this level of articulation really helped a lot. I know this is probably a rookie mistake in this context, but often when it's hard for my model to converge, I assume it's because it has reached a 'local minimum'. My practice is often to significantly bump up the learning rate to hopefully let the model leap over it and get to a point where it can re-converge. According to what you said, there is evidence conclusively showing there are no local minima in these loss functions. I'm wondering which specific papers you were talking about. Regards, Matt

  • @jackholloway7516
    @jackholloway7516 a year ago

    💓💓☪️

  • @sainissunil
    @sainissunil a year ago

    This talk is awesome!

  • @Jirayu.Kaewprateep
    @Jirayu.Kaewprateep a year ago

    📺💬 Yui krub, I give you ice cream 🍦 When we plot a sine wave, in the word sentiment we still see some relationship that can be converted into word sequences in the sentence. 🐑💬 It is possible, and what do you do with the time domain when the input is bunches of frequencies with a time-related relationship? 🥺💬 I hope they can be mixed together with embedding or shuffling but keep the information within the same set of inputs. 🐑💬 You plot the Sigmoid function, Tanh and ReLU, and yes you can do a direct comparison of the estimated values within the same time domain. 📺💬 Now give me some, see what I dress like ⁉️ 👧💬 There are many points; one significant one is a low-precision network machine that executes with less precision but high accuracy. 📺💬 Word CNNs can do some tasks better; for di-gram and tri-gram tasks it works as a CNN layer. 🐑💬 That means we can add labels or additional data into it ⁉️ 👧💬 Do you mean the scores, good, bad, or some properties you get from other networks or from training with concatenated layers ⁉️ 🧸💬 You cannot copy and separate each part when they are working.

  • @juliawang3131
    @juliawang3131 a year ago

    impressive!

  • @miguelduqueb7065
    @miguelduqueb7065 a year ago

    Such insights so easily explained denote a deep understanding of the topic and great teaching skills. I am eager to see more lectures or talks by this author. Thanks.

  • @johnoboyle3097
    @johnoboyle3097 2 years ago

    Any chance this guy is related to Paul Dirac?