OK, so let's start off Chapter 12.1 by looking at the types of optimizers we have available in Keras, and at some adaptive learning rate methods. So let's dive in.

So, optimizers. What exactly are optimizers? You may remember from the neural network explanation that an optimizer is the algorithm we use to minimize our loss. Some examples that should be familiar to you by now are gradient descent, stochastic gradient descent, and mini-batch gradient descent.

Keras actually comes with a lot more optimizers than those: we have standard stochastic gradient descent (SGD), RMSprop, Adagrad, Adadelta, Adam, Adamax, and Nadam.

A quick aside about learning rates: constant learning rates are generally bad, especially if you start off too big. Imagine that after 50 long epochs we think we're close to convergence, but the problem is that the weights just keep bouncing around the loss surface and our training accuracy basically stops increasing. That's a bad situation when you're training. So you always want a learning rate that's as small as possible without being too small, because too small a learning rate will simply take forever to train.

So there are so many choices; what's the difference? The main difference between these algorithms is how they manipulate learning rates to allow faster convergence and better validation accuracy. Some, like gradient descent, require setting a few manual parameters or even adjusting the learning rate yourself (which we'll come to shortly), and some of them use a heuristic approach to provide adaptive learning rates, which is quite cool. We'll actually see some comparisons shortly.

So let's talk a bit about stochastic gradient descent and the parameters Keras allows us to control. By default, Keras uses a constant learning rate in its SGD optimizer; however, we can set the momentum and decay parameters, and also turn on or off something called Nesterov momentum.

So let's talk a bit about momentum. Momentum is a technique that accelerates SGD by pushing the gradient steps along the relevant direction while damping the jumps and oscillations away from that direction. Basically, it encourages gradient descent to keep heading in the direction that is reducing the loss, rather than wandering away from it.
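To make that a bit more concrete, here is a minimal sketch of the classical momentum update in plain NumPy. The variable names and the toy gradient are purely illustrative, not taken from the lecture: the point is just that the "velocity" accumulates past steps, so movement in a consistent downhill direction builds up speed while back-and-forth oscillations partly cancel out.

```python
import numpy as np

lr = 0.01                 # learning rate
mu = 0.9                  # momentum coefficient
w = np.zeros(10)          # parameters
v = np.zeros_like(w)      # "velocity": running combination of past steps

def grad(w):
    # Stand-in for the real gradient of the loss with respect to w;
    # here the toy loss is simply sum((w - 1)**2).
    return 2.0 * (w - 1.0)

for step in range(200):
    g = grad(w)
    v = mu * v - lr * g   # plain SGD would just do: w = w - lr * g
    w = w + v             # take the accumulated (momentum) step
```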
The decay, by the way, is something Keras applies to the learning rate every batch, not every epoch, so be careful how you set your batch size; this is one of the few times batch size becomes relevant in this way during training. A good rule of thumb for setting the decay is: decay equals the learning rate divided by the number of epochs. OK, so that's how we choose the decay value.

Nesterov is actually the name of the person who developed a method that solves the problem of oscillating around a minimum. Oscillating around a minimum basically means our steps are too big for us to actually converge at the minimum point, and this happens when momentum is high and unable to slow down. Nesterov momentum makes a big jump first, then a small correction, and only after that is the gradient calculated. That's how Nesterov momentum works, so I'd encourage you to use it if you actually want to converge quickly.

This is a good illustration of Nesterov momentum; it was taken from the source cited here (it looks like it comes from a Stanford course). It basically shows all the other algorithms combined, and it shows you how momentum behaves as well, so take a look, it's actually quite interesting. This is SGD with momentum, with Nesterov enabled, by the way; you can see it's taking a while to get here, but it will eventually get there, although the others got there much more quickly.

Actually, I was wrong about one thing; I just noticed after looking at it on my second screen: the "Momentum" trace actually had Nesterov enabled here, and "SGD" was just plain vanilla SGD. So you can see all of these advanced optimizers got there eventually, except for SGD, which took forever.

So let's talk a bit more about those other algorithms, the ones that are available in Keras. We just saw that we can set parameters to control the learning rate schedule. Learning rate schedules are basically how our learning rate adapts over the training process, based on the number of epochs that have been completed or on other parameters. That's why it's called an adaptive learning rate: it treats each epoch differently.
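Before we go through the individual optimizers, here is a small sketch of how the SGD knobs discussed above (momentum, the decay rule of thumb, and Nesterov momentum) might be wired up. This assumes the older standalone Keras API, where SGD takes lr, momentum, decay and nesterov arguments; the numbers are illustrative, not recommendations from the lecture.

```python
from keras.optimizers import SGD

epochs = 50
base_lr = 0.01

# Rule of thumb from above: decay = learning rate / number of epochs.
# Keras applies this decay every batch, not every epoch, so the effective
# schedule also depends on your batch size.
sgd = SGD(lr=base_lr,
          momentum=0.9,
          decay=base_lr / epochs,
          nesterov=True)   # Nesterov: big jump, small correction, then gradient

# Assuming `model` is a Keras model defined elsewhere:
# model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
```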
Let's quickly go through them, starting with Adagrad. Adagrad performs larger updates for the more sparse parameters and smaller updates for the less sparse parameters (if that even sounds like proper English). It is thus well suited for sparse data; however, because the learning rate is always decreasing monotonically, after many epochs learning slows down to a crawl.

Adadelta actually solves this monotonically decreasing learning rate problem that occurs in Adagrad. RMSprop is actually similar to Adadelta; sorry, I couldn't find much information to explain this, but just remember that these two are similar. They were probably developed separately but have similar methods of action. And Adam, which isn't really somebody's name by the way, is similar to Adadelta but additionally keeps a momentum term and a learning rate for each parameter, handling each parameter separately. I'll correct the little mistakes here and there before these slides are distributed to you.

So what does a good learning rate look like? This is a plot of loss against epochs. This is how it would look if we had a very high learning rate: basically, we would never find a convergence zone, because we'd be bouncing around everywhere and the loss would just get worse over time. A low learning rate will eventually get there; however, it'll take a while. A high learning rate will eventually get there too (actually, these two curves could be interchangeable), but basically it will get there as well. Good learning rates have nice, gradual, smooth, decreasing steps and converge at a lower loss over time.

And finally, if you go to keras.io/optimizers, it brings up a list of all the optimizers, how to use them in code, and what settings are available for each one. You can see, as it does here, we have some of these parameters: nesterov and decay, which I mentioned to you before, RMSprop, all of these are available here, and explanations of what they do and what you can tweak are all on the Keras site. So take a look.

Just so you know, as a best practice I find Adam to be one of the best optimizers. I don't even really have to move away from the default values; I actually just set the learning rate to be slightly smaller, but all of the default values are usually fine. Also quite good to use is SGD with a very low learning rate; you can set the momentum and decay according to the rule of thumb
I mentioned before, and always set nesterov to True.
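As a rough sketch of those two "safe default" choices, again assuming the older standalone Keras API, and with purely illustrative numbers rather than values from the lecture:

```python
from keras.optimizers import Adam, SGD

# Option 1: Adam with (near-)default settings. Keras's defaults are roughly
# lr=0.001, beta_1=0.9, beta_2=0.999; nudging the learning rate slightly
# smaller is usually the only tweak needed.
adam = Adam(lr=0.0005)

# Option 2: SGD with a very low learning rate, momentum and decay set via the
# rule of thumb (decay = lr / epochs), and Nesterov momentum turned on.
epochs = 50
low_lr = 0.001
sgd = SGD(lr=low_lr, momentum=0.9, decay=low_lr / epochs, nesterov=True)

# Assuming `model` is a Keras model defined elsewhere:
# model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
```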