1
00:00:00,530 --> 00:00:06,240
OK, so let's start off Chapter 12.1 by looking at the types of optimizers we
2
00:00:06,240 --> 00:00:10,520
have available in Keras, and look at some adaptive learning rate methods.
3
00:00:10,530 --> 00:00:12,540
So let's dive in.
4
00:00:12,600 --> 00:00:15,560
So, optimizers: what exactly are optimizers?
5
00:00:15,570 --> 00:00:21,540
Now, you may remember from our neural network explanation that optimizers are the algorithms we use to minimize
6
00:00:21,550 --> 00:00:22,400
our loss.
7
00:00:22,560 --> 00:00:27,120
And some examples of these, which should be familiar to you by now, would be gradient descent, stochastic
8
00:00:27,120 --> 00:00:28,940
gradient descent, and mini-batch gradient descent.
9
00:00:30,380 --> 00:00:35,930
So Keras actually comes with a lot more optimizers than those: we have the standard stochastic gradient
10
00:00:35,930 --> 00:00:39,560
descent, RMSprop, Adagrad, Adadelta,
11
00:00:39,610 --> 00:00:41,300
Adam, Adamax, and Nadam.
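For reference, here is a sketch of importing those optimizers in Keras (classic Keras 2 module paths; newer TensorFlow-bundled versions use tensorflow.keras.optimizers instead):

    from keras.optimizers import (SGD, RMSprop, Adagrad, Adadelta,
                                  Adam, Adamax, Nadam)

    # Each one can be instantiated with sensible defaults, e.g.:
    optimizer = RMSprop()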
12
00:00:44,800 --> 00:00:52,320
So, a quick aside: constant learning rates are generally bad, especially if you start off too big.
13
00:00:52,400 --> 00:00:58,440
So imagine, after 50 long epochs, when we're thinking we're close to convergence, the problem
14
00:00:58,550 --> 00:01:05,970
is our learning rate basically bounces us around our loss, and our training and test accuracies
15
00:01:06,180 --> 00:01:07,890
basically stop increasing.
16
00:01:07,890 --> 00:01:09,170
That's a bad situation.
17
00:01:09,300 --> 00:01:14,760
When you're training, you always want to use a learning rate that's as small as possible without
18
00:01:14,760 --> 00:01:17,630
being too small, because too small a learning rate
19
00:01:17,640 --> 00:01:21,730
would simply take forever to train.
20
00:01:22,210 --> 00:01:23,580
So there's so many choices.
21
00:01:23,630 --> 00:01:24,790
What's the difference?
22
00:01:24,800 --> 00:01:29,510
The main difference in these algorithms is how they manipulate learning rates to allow faster convergence
23
00:01:29,540 --> 00:01:31,700
and better validation accuracy.
24
00:01:31,700 --> 00:01:38,150
Some, like gradient descent, require setting some manual parameters, or even adjusting our learning
25
00:01:38,160 --> 00:01:43,460
rate, which we will come to shortly, and then some of them use a heuristic approach to provide adaptive learning
26
00:01:43,460 --> 00:01:45,120
rates which are quite cool.
27
00:01:45,170 --> 00:01:47,690
We'll actually see some of the comparisons shortly.
28
00:01:49,760 --> 00:01:54,910
So let's talk a bit about stochastic gradient descent and the parameters Keras allows us to control.
29
00:01:55,130 --> 00:02:00,920
So by default, Keras uses a constant learning rate in its SGD optimizer, or stochastic gradient descent;
30
00:02:01,590 --> 00:02:08,360
however, we can set the parameters of momentum and decay, and also turn on or off something called
31
00:02:08,450 --> 00:02:10,340
Nesterov momentum.
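A minimal sketch of those SGD knobs (using the classic Keras 2 argument names lr and decay, and assuming model is an already-built Keras model):

    from keras.optimizers import SGD

    sgd = SGD(lr=0.01,        # base learning rate
              momentum=0.9,   # classical momentum term
              decay=1e-6,     # per-update learning rate decay
              nesterov=True)  # switch Nesterov momentum on or off
    model.compile(optimizer=sgd, loss='categorical_crossentropy',
                  metrics=['accuracy'])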
32
00:02:10,340 --> 00:02:12,980
So let's talk a bit about momentum.
33
00:02:12,980 --> 00:02:19,220
Momentum is a technique that accelerates SGD by pushing gradient steps along the relevant direction,
34
00:02:19,670 --> 00:02:23,680
while reducing jumps and oscillations away from relevant directions.
35
00:02:23,690 --> 00:02:29,360
So it basically encourages our gradient descent to head in the direction that is
36
00:02:29,360 --> 00:02:30,380
reducing loss.
37
00:02:30,440 --> 00:02:33,340
So it doesn't stray away from that path.
38
00:02:34,480 --> 00:02:42,200
And as for decay, this is something that decays the learning rate every batch, by the way, not
39
00:02:42,260 --> 00:02:42,700
per epoch.
40
00:02:42,710 --> 00:02:44,300
So be careful how you set your batch size.
41
00:02:44,300 --> 00:02:44,930
By the way.
42
00:02:45,050 --> 00:02:49,040
This isn't the only time batch size becomes relevant in training.
43
00:02:49,460 --> 00:02:54,230
A good rule of thumb, though, for setting the decay is to make it equal to the learning rate divided by the number of
44
00:02:54,230 --> 00:02:54,890
epochs.
45
00:02:54,950 --> 00:02:55,920
OK.
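That rule of thumb translates directly into code; a sketch, where the epoch count and base learning rate are just illustrative choices:

    from keras.optimizers import SGD

    epochs = 50
    lr = 0.01
    decay = lr / epochs  # rule of thumb: decay = learning rate / number of epochs
    sgd = SGD(lr=lr, decay=decay, momentum=0.9, nesterov=True)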
46
00:02:56,450 --> 00:03:02,090
So that's how we choose the decay value. Nesterov, basically, is the guy who actually developed a method
47
00:03:02,450 --> 00:03:08,480
that solves the problem of oscillating around a minimum. Oscillating around a minimum basically
48
00:03:08,480 --> 00:03:13,410
means our steps are too big for us to actually converge at the minimum point.
49
00:03:13,910 --> 00:03:18,140
And this happens when momentum is high and unable to slow down.
50
00:03:18,530 --> 00:03:23,370
So this makes a big jump, then calculates the gradient and makes a small correction.
51
00:03:23,570 --> 00:03:25,530
That's how Nesterov works.
52
00:03:25,550 --> 00:03:33,160
So overall, I'd encourage you to use Nesterov momentum if you actually want to converge fast.
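To make the "big jump, then a small correction" idea concrete, here is a tiny sketch of my own (plain Python, not Keras code) of Nesterov momentum minimizing f(w) = w ** 2; the only change from plain momentum is that the gradient is evaluated at the look-ahead point w + mu * v:

    lr, mu = 0.1, 0.9
    grad = lambda w: 2.0 * w  # gradient of f(w) = w ** 2, minimum at w = 0

    w, v = 5.0, 0.0
    for _ in range(100):
        v = mu * v - lr * grad(w + mu * v)  # look-ahead gradient; plain momentum uses grad(w)
        w += v                              # the big jump plus its built-in correction
    print(w)  # ends up very close to 0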
53
00:03:33,170 --> 00:03:38,440
So this is a good illustration here of Nesterov momentum.
54
00:03:38,510 --> 00:03:42,120
It was taken from this source, CS231n, Stanford's course.
55
00:03:43,980 --> 00:03:48,930
So basically this actually has all the other algorithms compared here,
56
00:03:49,530 --> 00:03:53,350
and basically it shows you how momentum actually looks as well.
57
00:03:53,370 --> 00:03:54,230
So take a look.
58
00:03:54,510 --> 00:03:56,250
It's actually quite interesting.
59
00:03:56,250 --> 00:04:04,070
This is SGD with momentum, with Nesterov enabled by the way, so you can see it's taking a while to get there,
60
00:04:04,350 --> 00:04:08,710
but it will eventually get there, although the others have gotten there more quickly.
61
00:04:11,890 --> 00:04:14,360
Actually, I was wrong about one thing.
62
00:04:14,360 --> 00:04:16,360
I just noticed it when taking a look at my second screen.
63
00:04:16,580 --> 00:04:20,090
Actually, the momentum one had Nesterov enabled here.
64
00:04:20,280 --> 00:04:22,260
SGD was just plain vanilla SGD.
65
00:04:23,990 --> 00:04:29,680
So you can see all of these advanced optimizers got there eventually, except for SGD, which took forever.
66
00:04:31,660 --> 00:04:34,600
So let's talk a bit more about those other algorithms here.
67
00:04:34,660 --> 00:04:35,720
Some of these here.
68
00:04:36,100 --> 00:04:39,120
Let's start talking about the ones that are available in Keras.
69
00:04:39,520 --> 00:04:44,870
So we just saw we can set up parameters to control the learning rate schedule, and learning rate schedules are
70
00:04:44,890 --> 00:04:46,050
basically
71
00:04:46,150 --> 00:04:52,900
how our learning rates adapt over the training process, be it based on the number of epochs that have
72
00:04:52,900 --> 00:04:55,290
been completed or other parameters.
73
00:04:55,320 --> 00:04:56,070
OK.
74
00:04:56,830 --> 00:05:00,430
That's why it's called adaptive learning; the rate changes each epoch differently.
75
00:05:00,430 --> 00:05:03,130
So let's talk quickly about Adagrad.
76
00:05:03,360 --> 00:05:09,490
This performs larger updates for more sparse parameters and smaller updates for less sparse parameters.
77
00:05:09,530 --> 00:05:12,240
I'm not sure if that's actually proper English.
78
00:05:12,340 --> 00:05:19,480
It is thus well-suited for sparse data. However, because the learning rate is always
79
00:05:19,480 --> 00:05:24,060
decreasing monotonically, after many epochs learning slows down to a crawl.
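Here is a small sketch of my own (not the Keras implementation) showing why Adagrad's learning rate only shrinks: the squared-gradient accumulator never decreases, so the effective step lr / sqrt(cache) decays monotonically:

    import math

    w, lr, cache = 5.0, 1.0, 0.0
    for step in range(1000):
        g = 2.0 * w                               # gradient of f(w) = w ** 2
        cache += g ** 2                           # the accumulator only ever grows...
        w -= lr * g / (math.sqrt(cache) + 1e-8)   # ...so the effective step only shrinks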
80
00:05:24,580 --> 00:05:31,640
So Adadelta actually solves this monotonically decreasing learning rate problem that basically occurs
81
00:05:31,660 --> 00:05:35,730
in Adagrad. RMSprop is actually similar to Adadelta.
82
00:05:35,760 --> 00:05:41,410
Sorry, I couldn't find much information to explain this, but just remember these are similar; they were probably
83
00:05:41,440 --> 00:05:48,430
discovered separately but have similar methods of action. And Adam, which is not really
84
00:05:48,520 --> 00:05:55,360
an acronym despite its name, is similar to Adadelta but also keeps momentum for the learning rates of each
85
00:05:55,360 --> 00:05:56,070
parameter.
86
00:05:56,400 --> 00:05:57,910
That is, it adapts each of the parameters separately.
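For reference, a sketch of instantiating those adaptive optimizers with the classic Keras 2 defaults (check your version's docs for the exact values):

    from keras.optimizers import Adagrad, Adadelta, RMSprop, Adam

    adagrad  = Adagrad(lr=0.01)
    adadelta = Adadelta(lr=1.0, rho=0.95)
    rmsprop  = RMSprop(lr=0.001, rho=0.9)
    adam     = Adam(lr=0.001, beta_1=0.9, beta_2=0.999)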
87
00:05:58,240 --> 00:06:05,240
I'll correct these little mistakes here and there before this is delivered to you guys.
88
00:06:05,300 --> 00:06:07,640
So what does a good learning rate look like?
89
00:06:07,640 --> 00:06:10,580
So this is shown on a loss graph here,
90
00:06:10,770 --> 00:06:16,010
loss versus epochs. This is how it would look if we had a very high learning rate, a large rate.
91
00:06:16,130 --> 00:06:20,660
Basically we would never find a convergence zone, because we'd be bouncing around everywhere, and our loss
92
00:06:20,660 --> 00:06:22,940
would just get worse over time.
93
00:06:23,270 --> 00:06:27,560
A low learning rate will eventually get there; however, it'll take a while.
94
00:06:27,640 --> 00:06:29,930
A high learning rate will eventually get there too.
95
00:06:30,240 --> 00:06:35,780
These two could actually be interchangeable, but basically it will get there too. Good learning rates
96
00:06:35,780 --> 00:06:41,560
have nice, gradual, smooth, decreasing steps and basically converge at lower points over time.
97
00:06:43,080 --> 00:06:49,320
And finally, if you go to keras.io/optimizers, it brings up a list of all the optimizers, and actually
98
00:06:49,320 --> 00:06:56,220
how to use the code, and what settings are available for each optimizer. So you can see, as
99
00:06:56,220 --> 00:06:59,570
it shows here, basically we have some of these parameters here.
100
00:07:00,400 --> 00:07:03,080
Nesterov and decay, which I mentioned to you before.
101
00:07:04,070 --> 00:07:10,130
RMSprop, all of these are available here, with some explanations of what they do and what you can
102
00:07:10,130 --> 00:07:13,480
tweak; it's all available on Keras's site.
103
00:07:13,480 --> 00:07:15,350
So take a look.
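As the docs page shows, an optimizer can be passed to model.compile either as a configured instance or as a string name, in which case the defaults are used. A sketch, assuming model is an already-built Keras model:

    from keras.optimizers import RMSprop

    # By instance, so you can tweak the parameters:
    model.compile(optimizer=RMSprop(lr=0.001), loss='mse')

    # Or by name, using all the default settings:
    model.compile(optimizer='rmsprop', loss='mse')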
104
00:07:15,680 --> 00:07:22,630
Just so you know, in practice I find Adam to be one of the best optimizers.
105
00:07:22,650 --> 00:07:27,270
I don't even actually have to go far from its default values.
106
00:07:27,280 --> 00:07:29,520
I actually just set the learning rate to be slightly smaller.
107
00:07:29,780 --> 00:07:38,000
But all these default values are usually fine. Also, what's quite good to use as well is SGD with a very low
108
00:07:38,000 --> 00:07:38,890
learning rate.
109
00:07:39,020 --> 00:07:41,830
You can set the momentum and decay according to the rule of thumb
110
00:07:42,000 --> 00:07:44,900
I mentioned before, and always set Nesterov to true.
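Putting that advice into a sketch (the slightly smaller Adam rate of 5e-4 and the SGD settings below are just my illustrative choices):

    from keras.optimizers import Adam, SGD

    # Option 1: Adam, nudged slightly below its 1e-3 default learning rate
    adam = Adam(lr=5e-4)

    # Option 2: SGD with a very low learning rate, momentum, the decay
    # rule of thumb, and Nesterov momentum enabled
    epochs = 50
    sgd = SGD(lr=1e-3, momentum=0.9, decay=1e-3 / epochs, nesterov=True)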
|