AI_DL_Assignment / 12. Optimizers, Learning Rates & Callbacks with Fruit Classification /2. Types Optimizers and Adaptive Learning Rate Methods.srt
| 1 | |
| 00:00:00,530 --> 00:00:06,240 | |
| OK, so let's start Chapter 12.1 by looking at the types of optimizers we | |
| 2 | |
| 00:00:06,240 --> 00:00:10,520 | |
| have available in Keras, and look at some adaptive learning rate methods. | |
| 3 | |
| 00:00:10,530 --> 00:00:12,540 | |
| So let's dive in. | |
| 4 | |
| 00:00:12,600 --> 00:00:15,560 | |
| So, optimizers. What exactly are optimizers? | |
| 5 | |
| 00:00:15,570 --> 00:00:21,540 | |
| Now, you may remember from the neural net explanation that optimizers are the algorithms we use to minimize | |
| 6 | |
| 00:00:21,550 --> 00:00:22,400 | |
| our loss. | |
| 7 | |
| 00:00:22,560 --> 00:00:27,120 | |
| And some examples of these, which should be familiar to you, would be gradient descent, stochastic | |
| 8 | |
| 00:00:27,120 --> 00:00:28,940 | |
| gradient descent, and mini-batch gradient descent. | |
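Since the rest of the chapter builds on this, here is a minimal pure-Python sketch (illustrative only, not Keras code; all names are made up for this example) showing that batch, stochastic, and mini-batch gradient descent differ only in how many samples feed each gradient estimate.

```python
import random

def grad_step(w, samples, lr=0.1):
    """One gradient-descent step for the loss mean((w - x)^2) over `samples`.
    The derivative of that loss with respect to w is 2 * mean(w - x)."""
    g = 2.0 * sum(w - x for x in samples) / len(samples)
    return w - lr * g

data = [1.8, 2.1, 1.9, 2.2, 2.0, 1.95]  # toy data centred near 2.0
random.seed(0)

# Batch gradient descent: every step uses the full dataset.
w_batch = 0.0
for _ in range(50):
    w_batch = grad_step(w_batch, data)

# Stochastic gradient descent: each step uses a single random sample.
w_sgd = 0.0
for _ in range(200):
    w_sgd = grad_step(w_sgd, [random.choice(data)])

# Mini-batch gradient descent: each step uses a small random subset.
w_mb = 0.0
for _ in range(100):
    w_mb = grad_step(w_mb, random.sample(data, 2))
```

All three drive the parameter toward the data mean (about 2.0); they trade gradient accuracy per step against the number of updates per pass over the data.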
| 9 | |
| 00:00:30,380 --> 00:00:35,930 | |
| So Keras actually comes with a lot more optimizers than those. We have the standard stochastic gradient | |
| 10 | |
| 00:00:35,930 --> 00:00:39,560 | |
| descent, RMSprop, Adagrad, Adadelta, | |
| 11 | |
| 00:00:39,610 --> 00:00:41,300 | |
| Adam, Adamax, and Nadam. | |
| 12 | |
| 00:00:44,800 --> 00:00:52,320 | |
| So, a quick aside: constant learning rates are generally bad, especially if you start off too big. | |
| 13 | |
| 00:00:52,400 --> 00:00:58,440 | |
| So imagine, after 50 long epochs, when we're thinking we're close to convergence, the problem | |
| 14 | |
| 00:00:58,550 --> 00:01:05,970 | |
| is our learning rate basically bounces around the loss, and our, sorry, our training and test accuracies | |
| 15 | |
| 00:01:06,180 --> 00:01:07,890 | |
| basically stop increasing. | |
| 16 | |
| 00:01:07,890 --> 00:01:09,170 | |
| That's a bad situation. | |
| 17 | |
| 00:01:09,300 --> 00:01:14,760 | |
| when you're training. So you always want to use a learning rate that's as small as possible without | |
| 18 | |
| 00:01:14,760 --> 00:01:17,630 | |
| being too small, because too small a learning | |
| 19 | |
| 00:01:17,640 --> 00:01:21,730 | |
| rate would simply take forever to train. | |
| 20 | |
| 00:01:22,210 --> 00:01:23,580 | |
| So there are so many choices. | |
| 21 | |
| 00:01:23,630 --> 00:01:24,790 | |
| What's the difference? | |
| 22 | |
| 00:01:24,800 --> 00:01:29,510 | |
| The main difference in these algorithms is how they manipulate learning rates to allow faster convergence | |
| 23 | |
| 00:01:29,540 --> 00:01:31,700 | |
| and better validation accuracy. | |
| 24 | |
| 00:01:31,700 --> 00:01:38,150 | |
| Some, like gradient descent, require setting some manual parameters, or even adjusting the learning rate, | |
| 25 | |
| 00:01:38,160 --> 00:01:43,460 | |
| which we will come to shortly, and then some of them use a heuristic approach to provide adaptive learning | |
| 26 | |
| 00:01:43,460 --> 00:01:45,120 | |
| rates, which is quite cool. | |
| 27 | |
| 00:01:45,170 --> 00:01:47,690 | |
| We'll actually see some of the comparisons shortly. | |
| 28 | |
| 00:01:49,760 --> 00:01:54,910 | |
| So let's talk a bit about stochastic gradient descent and the parameters Keras allows us to control. | |
| 29 | |
| 00:01:55,130 --> 00:02:00,920 | |
| So by default, Keras uses a constant learning rate in its SGD optimizer, that's stochastic gradient descent. | |
| 30 | |
| 00:02:01,590 --> 00:02:08,360 | |
| However, we can set the parameters of momentum and decay, and also turn off or on something called | |
| 31 | |
| 00:02:08,450 --> 00:02:10,340 | |
| Nesterov momentum. | |
| 32 | |
| 00:02:10,340 --> 00:02:12,980 | |
| So let's talk a bit about momentum. | |
| 33 | |
| 00:02:12,980 --> 00:02:19,220 | |
| Momentum is a technique that accelerates SGD by pushing the gradient steps along the relevant direction | |
| 34 | |
| 00:02:19,670 --> 00:02:23,680 | |
| while reducing the jumpy oscillations away from the relevant direction. | |
| 35 | |
| 00:02:23,690 --> 00:02:29,360 | |
| So it basically encourages our gradient descent to head in the direction that is | |
| 36 | |
| 00:02:29,360 --> 00:02:30,380 | |
| reducing loss. | |
| 37 | |
| 00:02:30,440 --> 00:02:33,340 | |
| so it doesn't actually wander away from that path. | |
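The update rule can be sketched in a few lines of pure Python (illustrative only, not Keras internals; names are made up): a running velocity accumulates past gradients, so components that consistently point the same way build up speed, while components that flip sign partially cancel.

```python
def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """Classical momentum: the velocity accumulates a decaying sum of past
    gradients, and the parameter moves by the velocity."""
    v = mu * v - lr * grad
    return w + v, v

# Minimise f(w) = w^2 (gradient 2w), starting far from the minimum at 0.
w, v = 5.0, 0.0
for _ in range(100):
    w, v = momentum_step(w, v, 2.0 * w)
```

With `mu = 0` this reduces to plain SGD; values around 0.9 are the common default.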
| 38 | |
| 00:02:34,480 --> 00:02:42,200 | |
| And decay, by the way, is something that decays our learning rate every batch, not | |
| 39 | |
| 00:02:42,260 --> 00:02:42,700 | |
| every epoch. | |
| 40 | |
| 00:02:42,710 --> 00:02:44,300 | |
| So be careful how you set your batch size. | |
| 41 | |
| 00:02:44,300 --> 00:02:44,930 | |
| By the way. | |
| 42 | |
| 00:02:45,050 --> 00:02:49,040 | |
| This isn't the only time batch size becomes relevant in training. | |
| 43 | |
| 00:02:49,460 --> 00:02:54,230 | |
| A good rule of thumb, though, for setting the decay is to make it equal to the learning rate divided by the number of | |
| 44 | |
| 00:02:54,230 --> 00:02:54,890 | |
| epochs. | |
| 45 | |
| 00:02:54,950 --> 00:02:55,920 | |
| OK. | |
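That rule of thumb can be sketched directly. The function below is a pure-Python illustration of a time-based decay schedule (the form classic Keras SGD is commonly described as using, applied per update rather than per epoch); it is not the actual Keras implementation.

```python
def time_based_lr(initial_lr, decay, iteration):
    """Time-based decay: lr / (1 + decay * t), applied per batch/iteration."""
    return initial_lr / (1.0 + decay * iteration)

initial_lr = 0.1
epochs = 100
decay = initial_lr / epochs  # rule of thumb: decay = learning rate / epochs

# The schedule starts at the initial rate and shrinks smoothly over training.
lrs = [time_based_lr(initial_lr, decay, t) for t in range(1000)]
```

With this decay value the learning rate has halved after 1000 updates, so the schedule stays gentle over a typical training run.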
| 46 | |
| 00:02:56,450 --> 00:03:02,090 | |
| So that's how we choose the decay value. Nesterov, basically, is the guy who actually developed a method | |
| 47 | |
| 00:03:02,450 --> 00:03:08,480 | |
| that solves the problem of oscillating around minima. Oscillating around minima basically | |
| 48 | |
| 00:03:08,480 --> 00:03:13,410 | |
| means our steps are too big for us to actually converge at a minimum point. | |
| 49 | |
| 00:03:13,910 --> 00:03:18,140 | |
| And this happens when our momentum is high and unable to slow down. | |
| 50 | |
| 00:03:18,530 --> 00:03:23,370 | |
| So it makes a big jump first, the gradient is calculated at that look-ahead point, and then a small correction follows. | |
| 51 | |
| 00:03:23,570 --> 00:03:25,530 | |
| That's how Nesterov momentum works. | |
| 52 | |
| 00:03:25,550 --> 00:03:33,160 | |
| So I'd encourage you to use Nesterov momentum if you actually want to converge faster. | |
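The "big jump, then correct" idea can be sketched as follows (pure Python, illustrative only; names are made up): Nesterov momentum evaluates the gradient at the look-ahead point `w + mu * v` instead of at `w`, which lets the velocity slow down before it overshoots.

```python
def nesterov_step(w, v, grad_fn, lr=0.1, mu=0.9):
    """Nesterov momentum: gradient is taken at the look-ahead point."""
    lookahead = w + mu * v                 # the big jump along current velocity
    v = mu * v - lr * grad_fn(lookahead)   # correction uses the gradient there
    return w + v, v

grad = lambda w: 2.0 * w  # gradient of f(w) = w^2

# Minimise f(w) = w^2 from w = 5.
w, v = 5.0, 0.0
for _ in range(100):
    w, v = nesterov_step(w, v, grad)
```

Compared with classical momentum, the only change is where the gradient is evaluated, yet this small look-ahead noticeably damps the oscillation around the minimum.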
| 53 | |
| 00:03:33,170 --> 00:03:38,440 | |
| So this is a good illustration here of Nesterov momentum. | |
| 54 | |
| 00:03:38,510 --> 00:03:42,120 | |
| It was taken from this source; it seems to be from Stanford's course. | |
| 55 | |
| 00:03:43,980 --> 00:03:48,930 | |
| So basically this shows all of the other algorithms compared here, | |
| 56 | |
| 00:03:49,530 --> 00:03:53,350 | |
| and basically it shows you how momentum actually looks as well. | |
| 57 | |
| 00:03:53,370 --> 00:03:54,230 | |
| So take a look. | |
| 58 | |
| 00:03:54,510 --> 00:03:56,250 | |
| It's actually quite interesting. | |
| 59 | |
| 00:03:56,250 --> 00:04:04,070 | |
| This is SGD with momentum, with Nesterov enabled, by the way. So you can see it's taking a while to get here, | |
| 60 | |
| 00:04:04,350 --> 00:04:08,710 | |
| but it will eventually get there, although the others have gotten there quite quickly. | |
| 61 | |
| 00:04:11,890 --> 00:04:14,360 | |
| Actually, I was wrong on one thing. | |
| 62 | |
| 00:04:14,360 --> 00:04:16,360 | |
| I just noticed it looking at my second screen. | |
| 63 | |
| 00:04:16,580 --> 00:04:20,090 | |
| Actually, the momentum curve had Nesterov enabled here. | |
| 64 | |
| 00:04:20,280 --> 00:04:22,260 | |
| SGD was just plain vanilla SGD. | |
| 65 | |
| 00:04:23,990 --> 00:04:29,680 | |
| So you can see all of these advanced optimizers got there eventually, except for SGD, which took forever. | |
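That race can be reproduced in miniature with pure Python (an illustrative sketch, not the animation's actual code): on an ill-conditioned bowl, plain gradient descent must use a tiny step and creeps along the shallow direction, while momentum reaches the minimum far sooner with the same step size.

```python
def sgd(steps, lr=0.015):
    """Plain gradient descent on f(x, y) = x^2 + 50*y^2 from (5, 5)."""
    x, y = 5.0, 5.0
    for _ in range(steps):
        x, y = x - lr * 2.0 * x, y - lr * 100.0 * y  # gradients (2x, 100y)
    return x, y

def sgd_momentum(steps, lr=0.015, mu=0.9):
    """Same problem with classical momentum added."""
    x, y, vx, vy = 5.0, 5.0, 0.0, 0.0
    for _ in range(steps):
        vx, vy = mu * vx - lr * 2.0 * x, mu * vy - lr * 100.0 * y
        x, y = x + vx, y + vy
    return x, y

loss = lambda x, y: x * x + 50.0 * y * y
```

After the same number of steps, the momentum run sits at a much lower loss than plain gradient descent, mirroring the animation.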
| 66 | |
| 00:04:31,660 --> 00:04:34,600 | |
| So let's talk a bit more about those other algorithms here. | |
| 67 | |
| 00:04:34,660 --> 00:04:35,720 | |
| Some of these here. | |
| 68 | |
| 00:04:36,100 --> 00:04:39,120 | |
| Let's start talking about the ones that are available in Keras. | |
| 69 | |
| 00:04:39,520 --> 00:04:44,870 | |
| So we just saw that we can set parameters to control the learning rate schedule. Learning rate schedules are | |
| 70 | |
| 00:04:44,890 --> 00:04:46,050 | |
| basically rules for | |
| 71 | |
| 00:04:46,150 --> 00:04:52,900 | |
| how our learning rates adapt over the training process, be it based on the number of epochs that have | |
| 72 | |
| 00:04:52,900 --> 00:04:55,290 | |
| been completed, or other parameters. | |
| 73 | |
| 00:04:55,320 --> 00:04:56,070 | |
| OK. | |
| 74 | |
| 00:04:56,830 --> 00:05:00,430 | |
| That's why it's called an adaptive learning rate: it treats each epoch differently. | |
| 75 | |
| 00:05:00,430 --> 00:05:03,130 | |
| So let's talk quickly about Adagrad. | |
| 76 | |
| 00:05:03,360 --> 00:05:09,490 | |
| Adagrad performs larger updates for more sparse parameters and smaller updates for less sparse parameters. | |
| 77 | |
| 00:05:09,530 --> 00:05:12,240 | |
| Sorry if that isn't exactly good English. | |
| 78 | |
| 00:05:12,340 --> 00:05:19,480 | |
| It is thus well-suited for sparse data. However, because the learning rate is always | |
| 79 | |
| 00:05:19,480 --> 00:05:24,060 | |
| decreasing monotonically, after many epochs learning slows down to a crawl. | |
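Adagrad's behaviour can be sketched in a few lines of pure Python (illustrative only, not the Keras implementation; names are made up): each parameter divides the base learning rate by the root of its own accumulated squared gradients, so a frequently-updated (dense) parameter slows down quickly while a rarely-updated (sparse) one keeps a larger effective step. Because the accumulator only grows, every effective learning rate shrinks monotonically.

```python
import math

def adagrad_step(w, acc, grad, lr=0.5, eps=1e-8):
    """Adagrad: each parameter's accumulated squared gradient scales its step."""
    acc = [a + g * g for a, g in zip(acc, grad)]
    w = [wi - lr * g / (math.sqrt(a) + eps)
         for wi, g, a in zip(w, grad, acc)]
    return w, acc

# Two parameters of f(w) = w0^2 + w1^2: w[0] sees a gradient every step
# (a "dense" parameter), w[1] only every tenth step (a "sparse" one).
w, acc = [5.0, 5.0], [0.0, 0.0]
for t in range(100):
    g = [2.0 * w[0], 2.0 * w[1] if t % 10 == 0 else 0.0]
    w, acc = adagrad_step(w, acc, g)
```

After training, the dense parameter's accumulator is much larger, so its effective learning rate `lr / sqrt(acc)` has shrunk far more than the sparse parameter's.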
| 80 | |
| 00:05:24,580 --> 00:05:31,640 | |
| So Adadelta actually solves this monotonically decreasing learning rate problem that occurs | |
| 81 | |
| 00:05:31,660 --> 00:05:35,730 | |
| in Adagrad. RMSprop is actually similar to Adadelta. | |
| 82 | |
| 00:05:35,760 --> 00:05:41,410 | |
| Sorry, I couldn't find much information to explain this, but just remember these are similar, probably | |
| 83 | |
| 00:05:41,440 --> 00:05:48,430 | |
| discovered separately but with similar methods of action. And Adam, which is not really | |
| 84 | |
| 00:05:48,520 --> 00:05:55,360 | |
| a name, but Adam, it's similar to Adadelta but still has momentum for the learning rates of each | |
| 85 | |
| 00:05:55,360 --> 00:05:56,070 | |
| parameter. | |
| 86 | |
| 00:05:56,400 --> 00:05:57,910 | |
| Each of the parameters separately. | |
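A minimal pure-Python sketch of the Adam update for a single parameter (simplified and illustrative, not the Keras implementation): it keeps an exponential moving average of the gradient, which acts like momentum, and of the squared gradient, which gives each parameter its own scaling, with a bias correction for the early steps.

```python
import math

def adam_step(w, m, v, grad, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # momentum-like first moment
    v = beta2 * v + (1 - beta2) * grad * grad   # per-parameter second moment
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimise f(w) = w^2 (gradient 2w), starting from w = 5.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 401):
    w, m, v = adam_step(w, m, v, 2.0 * w, t)
```

For a vector of parameters, each component carries its own `m` and `v`, which is what gives Adam its per-parameter adaptive learning rates.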
| 87 | |
| 00:05:58,240 --> 00:06:05,240 | |
| I'll correct these little mistakes here and there before this is distributed to you guys. | |
| 88 | |
| 00:06:05,300 --> 00:06:07,640 | |
| So what does a good learning rate look like? | |
| 89 | |
| 00:06:07,640 --> 00:06:10,580 | |
| So this is our loss graph here, plotting loss | |
| 90 | |
| 00:06:10,770 --> 00:06:16,010 | |
| against epochs. This is how it would look if we had a very high learning rate, a large rate. | |
| 91 | |
| 00:06:16,130 --> 00:06:20,660 | |
| Basically, we would never find a convergence zone, because we'd be bouncing around everywhere, and our loss | |
| 92 | |
| 00:06:20,660 --> 00:06:22,940 | |
| would probably just get worse over time. | |
| 93 | |
| 00:06:23,270 --> 00:06:27,560 | |
| A low learning rate will eventually get there; however, it'll take a while. | |
| 94 | |
| 00:06:27,640 --> 00:06:29,930 | |
| A high learning rate will eventually get there too. | |
| 95 | |
| 00:06:30,240 --> 00:06:35,780 | |
| These two could actually be interchangeable, but basically it will get there too. Good learning rates | |
| 96 | |
| 00:06:35,780 --> 00:06:41,560 | |
| have nice, gradual, smooth, decreasing steps, and basically converge at lower loss points over time. | |
| 97 | |
| 00:06:43,080 --> 00:06:49,320 | |
| And finally, if you go to keras.io/optimizers, it brings up a list of all the optimizers, and actually | |
| 98 | |
| 00:06:49,320 --> 00:06:56,220 | |
| how to use the code here, and what settings are available for the optimizers. So you can see, as | |
| 99 | |
| 00:06:56,220 --> 00:06:59,570 | |
| it does here, basically we have some of these parameters here. | |
| 100 | |
| 00:07:00,400 --> 00:07:03,080 | |
| Nesterov and decay, which I mentioned to you before. | |
| 101 | |
| 00:07:04,070 --> 00:07:10,130 | |
| RMSprop, all of these are available here, and some explanations of what they do and what you can | |
| 102 | |
| 00:07:10,130 --> 00:07:13,480 | |
| tweak are all available on the Keras site. | |
| 103 | |
| 00:07:13,480 --> 00:07:15,350 | |
| So take a look. | |
| 104 | |
| 00:07:15,680 --> 00:07:22,630 | |
| Just so you know, in best practice, I find Adam to be one of the best optimizers. | |
| 105 | |
| 00:07:22,650 --> 00:07:27,270 | |
| I don't actually even have to stray far from the default values. | |
| 106 | |
| 00:07:27,280 --> 00:07:29,520 | |
| I actually just set the learning rate to be slightly smaller. | |
| 107 | |
| 00:07:29,780 --> 00:07:38,000 | |
| But all these default values are usually fine. Also, what's quite good to use as well is SGD with a very low | |
| 108 | |
| 00:07:38,000 --> 00:07:38,890 | |
| learning rate. | |
| 109 | |
| 00:07:39,020 --> 00:07:41,830 | |
| You can set the momentum and decay according to the guidelines | |
| 110 | |
| 00:07:42,000 --> 00:07:44,900 | |
| I mentioned before, and always set Nesterov to true. | |