AI_DL_Assignment / 12. Optimizers, Learning Rates & Callbacks with Fruit Classification /2. Types Optimizers and Adaptive Learning Rate Methods.srt
1
00:00:00,530 --> 00:00:06,240
OK, so let's start off Chapter 11 point, sorry, 12.1, by looking at the types of optimizers we
2
00:00:06,240 --> 00:00:10,520
have available in Keras, and look at some adaptive learning rate methods.
3
00:00:10,530 --> 00:00:12,540
So let's dive in.
4
00:00:12,600 --> 00:00:15,560
So optimizers: what exactly are optimizers?
5
00:00:15,570 --> 00:00:21,540
Now, you may remember from our neural network explanation that optimizers are the algorithms we use to minimize
6
00:00:21,550 --> 00:00:22,400
our loss.
7
00:00:22,560 --> 00:00:27,120
And some examples of this, which should be familiar to you by now, would be gradient descent, stochastic
8
00:00:27,120 --> 00:00:28,940
gradient descent and mini-batch gradient descent.
9
00:00:30,380 --> 00:00:35,930
So Keras actually comes with a lot more optimizers than those. We have the standard stochastic gradient
10
00:00:35,930 --> 00:00:39,560
descent, RMSprop, Adagrad, Adadelta,
11
00:00:39,610 --> 00:00:41,300
Adam, Adamax and Nadam.
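For reference, here is a minimal sketch (not from the lecture) of how each of those optimizers can be created, assuming the tf.keras API; any of them can then be passed to model.compile().

from tensorflow import keras

# Each optimizer mentioned above, created with its Keras default settings.
sgd      = keras.optimizers.SGD()
rmsprop  = keras.optimizers.RMSprop()
adagrad  = keras.optimizers.Adagrad()
adadelta = keras.optimizers.Adadelta()
adam     = keras.optimizers.Adam()
adamax   = keras.optimizers.Adamax()
nadam    = keras.optimizers.Nadam()

# An optimizer can be passed to compile() as an object or by its string name:
# model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])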
12
00:00:44,800 --> 00:00:52,320
So, a quick aside you should know: constant learning rates are generally bad, especially if you start off too big.
13
00:00:52,400 --> 00:00:58,440
So imagine after 50 long epochs, when we're thinking we're close to convergence, the problem
14
00:00:58,550 --> 00:01:05,970
is our learning rate basically makes the loss bounce around, and our, sorry, our training and test accuracies
15
00:01:06,180 --> 00:01:07,890
basically stop increasing.
16
00:01:07,890 --> 00:01:09,170
That's a bad situation
17
00:01:09,300 --> 00:01:14,760
when you're training. So you always want to use a learning rate that's as small as possible without
18
00:01:14,760 --> 00:01:17,630
being too small, because if it's too small, too small of a learning
19
00:01:17,640 --> 00:01:21,730
rate will just simply take forever to train.
20
00:01:22,210 --> 00:01:23,580
So there's so many choices.
21
00:01:23,630 --> 00:01:24,790
What's the difference.
22
00:01:24,800 --> 00:01:29,510
The main difference in these algorithms is how they manipulate learning rates to allow faster convergence
23
00:01:29,540 --> 00:01:31,700
and better validation accuracy.
24
00:01:31,700 --> 00:01:38,150
Some require, like gradient descent, setting some manual parameters, or even adjusting our learning
25
00:01:38,160 --> 00:01:43,460
rate, which we will come to shortly, and then some of them use a heuristic approach to provide adaptive learning
26
00:01:43,460 --> 00:01:45,120
rates which are quite cool.
27
00:01:45,170 --> 00:01:47,690
We'll actually see some of the comparisons shortly.
28
00:01:49,760 --> 00:01:54,910
So let's talk a bit about stochastic gradient descent and the parameters Keras allows us to control.
29
00:01:55,130 --> 00:02:00,920
So by default, Keras uses a constant learning rate in its SGD optimizer, that's stochastic gradient descent;
30
00:02:01,590 --> 00:02:08,360
however, we can set the parameters of momentum, decay, and also turn off or on something called
31
00:02:08,450 --> 00:02:10,340
Nesterov momentum.
32
00:02:10,340 --> 00:02:12,980
So let's talk a bit about momentum.
33
00:02:12,980 --> 00:02:19,220
Momentum is a technique that accelerates SGD by pushing the gradient steps along the relevant direction
34
00:02:19,670 --> 00:02:23,680
while reducing the jumps and oscillations away from relevant directions.
35
00:02:23,690 --> 00:02:29,360
So it basically encourages our gradient descent to head in the direction that is
36
00:02:29,360 --> 00:02:30,380
reducing loss.
37
00:02:30,440 --> 00:02:33,340
So it doesn't actually stray away from that direction.
38
00:02:34,480 --> 00:02:42,200
Next, decay: decay is something that decays the learning rate every batch, by the way, not every
39
00:02:42,260 --> 00:02:42,700
epoch.
40
00:02:42,710 --> 00:02:44,300
So be careful how you set your batch size.
41
00:02:44,300 --> 00:02:44,930
By the way.
42
00:02:45,050 --> 00:02:49,040
This is one of the only times batch size becomes relevant in training.
43
00:02:49,460 --> 00:02:54,230
A good rule of thumb, though, is to set the decay equal to the learning rate divided by the number of
44
00:02:54,230 --> 00:02:54,890
epochs.
45
00:02:54,950 --> 00:02:55,920
OK.
46
00:02:56,450 --> 00:03:02,090
So that's how we choose the decay value. Nesterov, basically, is a guy who actually developed a method
47
00:03:02,450 --> 00:03:08,480
that solves the problem of oscillating around our minima. Oscillating around a minimum basically
48
00:03:08,480 --> 00:03:13,410
means our steps are too big for us to actually converge at the minimum point.
49
00:03:13,910 --> 00:03:18,140
And this happens when momentum is high and unable to slow down.
50
00:03:18,530 --> 00:03:23,370
So it makes a big jump first, then the gradient is calculated and a small correction is made.
51
00:03:23,570 --> 00:03:25,530
That's how Nesterov works.
52
00:03:25,550 --> 00:03:33,160
So I would encourage you to use Nesterov momentum if you actually want to get there faster.
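As a rough sketch of the SGD settings just described (assuming the tf.keras SGD API; the per-batch decay argument exists in older Keras releases, while newer ones use learning rate schedules instead), momentum, decay and Nesterov momentum would be set like this, with illustrative values that are not from the lecture:

from tensorflow import keras

learning_rate = 0.01   # illustrative value
epochs = 50

# Rule of thumb from above: decay = learning rate / number of epochs.
decay = learning_rate / epochs

sgd = keras.optimizers.SGD(
    learning_rate=learning_rate,
    momentum=0.9,     # accelerate steps along the relevant direction
    decay=decay,      # shrinks the learning rate a little every batch (older Keras API)
    nesterov=True,    # big jump first, then a correction after the gradient is calculated
)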
53
00:03:33,170 --> 00:03:38,440
So this is a good illustration here of Nesterov momentum.
54
00:03:38,510 --> 00:03:42,120
It was taken from this source; it seems to be Stanford's CS231n course.
55
00:03:43,980 --> 00:03:48,930
So basically this actually shows all of the other algorithms compared here.
56
00:03:49,530 --> 00:03:53,350
And basically it shows you how momentum actually looks as well.
57
00:03:53,370 --> 00:03:54,230
So take a look.
58
00:03:54,510 --> 00:03:56,250
It's actually quite interesting.
59
00:03:56,250 --> 00:04:04,070
This is SGD with momentum, with Nesterov enabled by the way, so you can see it's taking a while to get here.
60
00:04:04,350 --> 00:04:08,710
But it will eventually get here, although the others have gotten there more quickly.
61
00:04:11,890 --> 00:04:14,360
I actually was wrong in one thing.
62
00:04:14,360 --> 00:04:16,360
I just remembered after checking it on my second screen.
63
00:04:16,580 --> 00:04:20,090
Actually, no, momentum actually had Nesterov enabled here;
64
00:04:20,280 --> 00:04:22,260
SGD was just plain vanilla SGD.
65
00:04:23,990 --> 00:04:29,680
So you can see all of these advanced optimizers got there eventually, except for SGD, which took forever.
66
00:04:31,660 --> 00:04:34,600
So let's talk a bit more about those other algorithms here.
67
00:04:34,660 --> 00:04:35,720
Some of these here.
68
00:04:36,100 --> 00:04:39,120
Let's start talking about the ones that are available in Keras.
69
00:04:39,520 --> 00:04:44,870
So we just saw how we set up parameters to control the learning rate schedule, and learning rate schedules are
70
00:04:44,890 --> 00:04:46,050
basically
71
00:04:46,150 --> 00:04:52,900
how our learning rates adapt over the training process, be it based on the number of epochs that
72
00:04:52,900 --> 00:04:55,290
have been completed or on other parameters.
73
00:04:55,320 --> 00:04:56,070
OK.
74
00:04:56,830 --> 00:05:00,430
That's why it's called an adaptive learning rate: it treats each epoch differently.
75
00:05:00,430 --> 00:05:03,130
So let's quickly talk about Adagrad.
76
00:05:03,360 --> 00:05:09,490
This performs larger updates for more sparse parameters and smaller updates for less sparse parameters.
77
00:05:09,530 --> 00:05:12,240
Is that actually proper English?
78
00:05:12,340 --> 00:05:19,480
This means it is well-suited for sparse data; however, because the learning rate is always
79
00:05:19,480 --> 00:05:24,060
decreasing monotonically, after many epochs learning slows down to a crawl.
80
00:05:24,580 --> 00:05:31,640
So Adadelta actually solves this monotonically decreasing learning rate problem that basically occurs
81
00:05:31,660 --> 00:05:35,730
in Adagrad. RMSprop is actually quite similar to Adadelta.
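To make that difference concrete, here is a toy sketch (not from the lecture) of why Adagrad's effective learning rate only shrinks while RMSprop's can recover: Adagrad divides by the sum of all squared gradients seen so far, whereas RMSprop divides by an exponential moving average of them.

import numpy as np

lr, eps, rho = 0.1, 1e-8, 0.9
adagrad_accum = 0.0   # running sum of squared gradients: only grows
rmsprop_accum = 0.0   # leaky moving average: can shrink again

for step, grad in enumerate([1.0, 1.0, 0.1, 0.1, 0.1], start=1):
    adagrad_accum += grad ** 2
    rmsprop_accum = rho * rmsprop_accum + (1 - rho) * grad ** 2

    # effective step size for this single toy parameter
    adagrad_step = lr * grad / (np.sqrt(adagrad_accum) + eps)
    rmsprop_step = lr * grad / (np.sqrt(rmsprop_accum) + eps)
    print(step, round(adagrad_step, 4), round(rmsprop_step, 4))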
82
00:05:35,760 --> 00:05:41,410
Sorry, I couldn't find much information to explain this, but just remember these are similar; they were probably
83
00:05:41,440 --> 00:05:48,430
discovered separately but have similar methods of action. And then there's Adam, whose name is not really
84
00:05:48,520 --> 00:05:55,360
an acronym; Adam is similar to Adadelta but it also has momentum for the learning rates for each
85
00:05:55,360 --> 00:05:56,070
parameter.
86
00:05:56,400 --> 00:05:57,910
Each of the parameters is treated separately.
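A minimal sketch of what that looks like, assuming the tf.keras Adam API (the values shown are the Keras defaults): beta_1 controls the momentum-style moving average of the gradients, and beta_2 the moving average of the squared gradients used for the per-parameter adaptive learning rate.

from tensorflow import keras

adam = keras.optimizers.Adam(
    learning_rate=0.001,  # Keras default
    beta_1=0.9,           # decay of the moving average of gradients (the momentum part)
    beta_2=0.999,         # decay of the moving average of squared gradients (the adaptive part)
)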
87
00:05:58,240 --> 00:06:05,240
I'll correct these little mistakes here and there before this is distributed to you guys.
88
00:06:05,300 --> 00:06:07,640
So what does a good learning rate look like?
89
00:06:07,640 --> 00:06:10,580
So this is shown on a loss graph here, loss over
90
00:06:10,770 --> 00:06:16,010
epochs. This is how it would look if we had a very high learning rate, a very large rate.
91
00:06:16,130 --> 00:06:20,660
Basically, we would never find a convergence zone because we'd be bouncing around everywhere, and our loss
92
00:06:20,660 --> 00:06:22,940
would probably just get worse over time.
93
00:06:23,270 --> 00:06:27,560
A low learning rate will eventually get there; however, it'll take a while.
94
00:06:27,640 --> 00:06:29,930
A high learning rate will eventually get here too.
95
00:06:30,240 --> 00:06:35,780
But these two could actually be interchangeable; basically, it will get here too. Good learning rates
96
00:06:35,780 --> 00:06:41,560
have nice gradual smooth and decreasing steps and basically converge at lower points over time.
97
00:06:43,080 --> 00:06:49,320
And finally, if you go to keras.io/optimizers, it brings up a list of all the optimizers and actually
98
00:06:49,320 --> 00:06:56,220
how to use the code here and what settings are available for the optimizers. So you can see, as
99
00:06:56,220 --> 00:06:59,570
it shows here, basically we have some of these parameters here.
100
00:07:00,400 --> 00:07:03,080
Nesterov and decay I mentioned to you before.
101
00:07:04,070 --> 00:07:10,130
RMSprop, all of these are available here, with some explanations of what they do and what you can
102
00:07:10,130 --> 00:07:13,480
tweak; it's all available on the Keras site.
103
00:07:13,480 --> 00:07:15,350
So take a look.
104
00:07:15,680 --> 00:07:22,630
Just so you know in best practice I find Adam to be one of the best optimizers.
105
00:07:22,650 --> 00:07:27,270
I don't even actually have to move away from these default values.
106
00:07:27,280 --> 00:07:29,520
I actually just set it to be slightly smaller.
107
00:07:29,780 --> 00:07:38,000
But all these values are usually fine. Also, what's quite good to use as well is SGD with a very low
108
00:07:38,000 --> 00:07:38,890
learning rate.
109
00:07:39,020 --> 00:07:41,830
You can set the momentum and decay according to the parameters
110
00:07:42,000 --> 00:07:44,900
I mentioned before, and always use Nesterov as True.
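Putting those best-practice suggestions together, a minimal sketch might look like this (the particular numbers are illustrative assumptions, not values given in the lecture):

from tensorflow import keras

# Option 1: Adam with the defaults, or a slightly smaller learning rate.
adam = keras.optimizers.Adam(learning_rate=0.0005)

# Option 2: SGD with a very low learning rate, momentum, decay and Nesterov on.
learning_rate = 0.001
epochs = 50
sgd = keras.optimizers.SGD(
    learning_rate=learning_rate,
    momentum=0.9,
    decay=learning_rate / epochs,   # rule of thumb; the `decay` argument is older-Keras only
    nesterov=True,
)

# Pass whichever you choose to compile:
# model.compile(optimizer=adam, loss="categorical_crossentropy", metrics=["accuracy"])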