1
00:00:00,530 --> 00:00:06,240
OK, so let's start Chapter 11 point, sorry, 12.1, by looking at the types of optimizers we

2
00:00:06,240 --> 00:00:10,520
have available in Keras, and look at some adaptive learning methods.

3
00:00:10,530 --> 00:00:12,540
So let's dive in.

4
00:00:12,600 --> 00:00:15,560
So, optimizers. What exactly are optimizers?

5
00:00:15,570 --> 00:00:21,540
Now, as you may remember from the neural net explanation, optimizers are the algorithms we use to minimize

6
00:00:21,550 --> 00:00:22,400
our loss.

7
00:00:22,560 --> 00:00:27,120
And some examples of this, which should be familiar to you now, would be gradient descent, stochastic

8
00:00:27,120 --> 00:00:28,940
gradient descent and mini-batch gradient descent.

9
00:00:30,380 --> 00:00:35,930
So Keras actually comes with a lot more optimizers than those. We have the standard stochastic gradient

10
00:00:35,930 --> 00:00:39,560
descent, RMSprop, Adagrad and Adadelta,

11
00:00:39,610 --> 00:00:41,300
Adam, Adamax and Nadam.

12
00:00:44,800 --> 00:00:52,320
So, in a quick aside: constant learning rates are generally bad, especially if you start off too big.

13
00:00:52,400 --> 00:00:58,440
So imagine, after 50 long epochs, when we're thinking we're close to convergence, but then the problem

14
00:00:58,550 --> 00:01:05,970
is our loss basically bounces around, and our, sorry, our training and test accuracies

15
00:01:06,180 --> 00:01:07,890
basically stop increasing.

16
00:01:07,890 --> 00:01:09,170
That's a bad situation.

17
00:01:09,300 --> 00:01:14,760
when you're training. So you always want to use a learning rate that's as small as possible, without

18
00:01:14,760 --> 00:01:17,630
being too small, because with too small a learning rate,

19
00:01:17,640 --> 00:01:21,730
It would just simply take forever to train.

20
00:01:22,210 --> 00:01:23,580
So there's so many choices.

21
00:01:23,630 --> 00:01:24,790
What's the difference?

22
00:01:24,800 --> 00:01:29,510
The main difference in these algorithms is how they manipulate learning rates to allow faster convergence

23
00:01:29,540 --> 00:01:31,700
and better validation accuracy.

24
00:01:31,700 --> 00:01:38,150
Some require, like gradient descent, setting some manual parameters, or even adjusting our learning

25
00:01:38,160 --> 00:01:43,460
rate, which we will come to shortly. And then some of them use a heuristic approach to provide adaptive learning

26
00:01:43,460 --> 00:01:45,120
rates which are quite cool.

27
00:01:45,170 --> 00:01:47,690
We'll actually see some of the comparisons shortly.

28
00:01:49,760 --> 00:01:54,910
So let's talk a bit about stochastic gradient descent and the parameters Keras allows us to control.

29
00:01:55,130 --> 00:02:00,920
So by default, Keras uses a constant learning rate in the SGD optimizer, that's stochastic gradient descent.

30
00:02:01,590 --> 00:02:08,360
However, we can set the parameters of momentum and decay, and also turn off or on something called

31
00:02:08,450 --> 00:02:10,340
Nesterov momentum.

32
00:02:10,340 --> 00:02:12,980
So let's talk a bit about momentum.

33
00:02:12,980 --> 00:02:19,220
Momentum is a technique that accelerates SGD by pushing the gradient steps along the relevant direction

34
00:02:19,670 --> 00:02:23,680
while reducing the jumps and oscillations away from relevant directions.

35
00:02:23,690 --> 00:02:29,360
So it basically encourages gradient descent to head in the direction that is

36
00:02:29,360 --> 00:02:30,380
reducing loss.

37
00:02:30,440 --> 00:02:33,340
So it doesn't stray away from that path.
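That momentum idea can be sketched in plain Python on a toy problem. This is a minimal illustration of the classic momentum update rule, not Keras's internal code; the quadratic loss and the hyperparameter values are made up for the example.

```python
# Minimal sketch of SGD with momentum on a 1-D quadratic loss L(w) = w^2.
# The loss, learning rate and momentum values here are illustrative only.
def grad(w):
    return 2.0 * w  # dL/dw for L(w) = w^2

w, velocity = 5.0, 0.0
lr, momentum = 0.1, 0.9

for _ in range(300):
    # Accumulate speed along the direction the gradients keep agreeing on,
    # damping oscillations in directions where the gradient flips sign.
    velocity = momentum * velocity - lr * grad(w)
    w += velocity

print(f"final w: {w:.6f}")  # ends up very close to the minimum at w = 0
```

The velocity term is what "pushes the gradient steps along the relevant direction": consistent gradients reinforce it, oscillating ones cancel out.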

38
00:02:34,480 --> 00:02:42,200
And as for decay: decay is something that decays the learning rate every batch, note, not

39
00:02:42,260 --> 00:02:42,700
per epoch.

40
00:02:42,710 --> 00:02:44,300
So be careful how you set batch size.

41
00:02:44,300 --> 00:02:44,930
By the way.

42
00:02:45,050 --> 00:02:49,040
This is another time batch size becomes relevant in training.

43
00:02:49,460 --> 00:02:54,230
A good rule of thumb, though, is setting the decay equal to the learning rate divided by the number of

44
00:02:54,230 --> 00:02:54,890
epochs.
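To make that rule of thumb concrete, here is a small plain-Python sketch. The time-based schedule lr / (1 + decay * iterations) mirrors how the classic Keras SGD optimizer applied its decay argument per batch update; the initial learning rate and epoch count are made-up illustration values.

```python
# Sketch of the decay rule of thumb and a time-based decay schedule.
# The classic Keras SGD applied lr / (1 + decay * iterations) per batch
# update; the initial learning rate and epoch count are illustrative.
initial_lr = 0.01
epochs = 50
decay = initial_lr / epochs  # rule of thumb: learning rate / number of epochs

def decayed_lr(iteration):
    # Effective learning rate after `iteration` batch updates (not epochs).
    return initial_lr / (1.0 + decay * iteration)

print(decayed_lr(0))            # first update uses the full learning rate
print(decayed_lr(10_000))       # much later updates take far smaller steps
```

Because the counter advances per batch, a smaller batch size means more updates per epoch and therefore faster decay, which is why batch size matters here.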

45
00:02:54,950 --> 00:02:55,920
OK.

46
00:02:56,450 --> 00:03:02,090
So that's how we choose the decay value. Nesterov, basically, is a guy who actually developed a method

47
00:03:02,450 --> 00:03:08,480
that solves the problem of oscillating around our minima. Oscillating around a minimum basically

48
00:03:08,480 --> 00:03:13,410
means our steps are too big for us to actually converge at the minimum point.

49
00:03:13,910 --> 00:03:18,140
And this happens when momentum is high and unable to slow down.

50
00:03:18,530 --> 00:03:23,370
So this makes a big jump, then the gradient is calculated and a small correction is made after that.

51
00:03:23,570 --> 00:03:25,530
That's how Nesterov works.
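That "big jump, then correct" behaviour can be sketched as the Nesterov lookahead update. Again this is a plain-Python illustration on a toy quadratic, not Keras's implementation, and the hyperparameter values are made up.

```python
# Sketch of Nesterov momentum: take the big momentum jump first, evaluate
# the gradient at that lookahead point, then apply the small correction.
# Toy 1-D quadratic loss L(w) = w^2; values are illustrative.
def grad(w):
    return 2.0 * w

w, velocity = 5.0, 0.0
lr, momentum = 0.1, 0.9

for _ in range(300):
    lookahead = w + momentum * velocity                     # the big jump
    velocity = momentum * velocity - lr * grad(lookahead)   # correction uses the lookahead gradient
    w += velocity

print(f"final w: {w:.6f}")
```

Because the gradient is measured where momentum is about to carry us, the update can brake before overshooting, which is what damps the oscillation around the minimum.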

52
00:03:25,550 --> 00:03:33,160
So all of this encourages you to use Nesterov momentum if you actually want to get there faster.

53
00:03:33,170 --> 00:03:38,440
So this is a good illustration here of Nesterov momentum.

54
00:03:38,510 --> 00:03:42,120
It was taken from this source; it seems to be from a Stanford course.

55
00:03:43,980 --> 00:03:48,930
So basically this is actually all the other algorithms combined here.

56
00:03:49,530 --> 00:03:53,350
And basically it shows you how momentum actually looks as well.

57
00:03:53,370 --> 00:03:54,230
So take a look.

58
00:03:54,510 --> 00:03:56,250
It's actually quite interesting.

59
00:03:56,250 --> 00:04:04,070
This is SGD with momentum, with Nesterov enabled by the way, so you can see it's taking a while to get here.

60
00:04:04,350 --> 00:04:08,710
But it will eventually get here, although the others have gotten there more quickly.

61
00:04:11,890 --> 00:04:14,360
I actually was wrong in one thing.

62
00:04:14,360 --> 00:04:16,360
I just remembered when taking a look at my second screen.

63
00:04:16,580 --> 00:04:20,090
Actually, the momentum one actually had Nesterov enabled here.

64
00:04:20,280 --> 00:04:22,260
SGD was just plain vanilla SGD.

65
00:04:23,990 --> 00:04:29,680
So you can see all of these advanced optimizers got there eventually, except for SGD, which took forever.

66
00:04:31,660 --> 00:04:34,600
So let's talk a bit more about those other algorithms here.

67
00:04:34,660 --> 00:04:35,720
Some of these here.

68
00:04:36,100 --> 00:04:39,120
Let's start talking about the ones that are available in Keras.

69
00:04:39,520 --> 00:04:44,870
So we just saw we set up parameters to control the learning rate decay. And learning rate schedules are

70
00:04:44,890 --> 00:04:46,050
basically

71
00:04:46,150 --> 00:04:52,900
how our learning rates adapt over the training process, be it based on the number of epochs that have

72
00:04:52,900 --> 00:04:55,290
been completed, or other parameters.

73
00:04:55,320 --> 00:04:56,070
OK.

74
00:04:56,830 --> 00:05:00,430
That's why an adaptive learning rate handles each epoch differently.

75
00:05:00,430 --> 00:05:03,130
So let's talk quickly about Adagrad.

76
00:05:03,360 --> 00:05:09,490
This performs larger updates for the more sparse parameters and smaller updates for the less sparse parameters.

77
00:05:09,530 --> 00:05:12,240
If that actually sounds like English.

78
00:05:12,340 --> 00:05:19,480
It is thus, it is thus well-suited for sparse data. However, because the learning rate is always

79
00:05:19,480 --> 00:05:24,060
decreasing monotonically, after many epochs learning slows down to a crawl.
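The Adagrad behaviour described here, per-parameter steps that shrink as squared gradients accumulate, can be sketched in a few lines of plain Python. This is an illustration of the idea, not Keras's code, and the hyperparameter values are made up.

```python
import math

# Sketch of Adagrad: each parameter's effective step shrinks as its squared
# gradients accumulate, so rarely-updated (sparse) parameters keep larger
# steps, while frequently-updated ones slow down. Toy 1-D quadratic loss.
def grad(w):
    return 2.0 * w

w, accum = 5.0, 0.0
lr, eps = 0.5, 1e-8

effective_steps = []
for _ in range(100):
    g = grad(w)
    accum += g * g                           # accumulator only ever grows
    step = lr / (math.sqrt(accum) + eps)     # so the effective rate only shrinks
    effective_steps.append(step)
    w -= step * g

print(f"first step: {effective_steps[0]:.4f}, last step: {effective_steps[-1]:.4f}")
```

Because the accumulator is monotonically increasing, the effective learning rate decreases monotonically, which is exactly the "slows down to a crawl" problem that Adadelta and RMSprop set out to fix.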

80
00:05:24,580 --> 00:05:31,640
So Adadelta actually solves this monotonically decreasing learning rate, basically, that occurs

81
00:05:31,660 --> 00:05:35,730
in Adagrad. RMSprop is actually similar to Adagrad and Adadelta.

82
00:05:35,760 --> 00:05:41,410
Sorry, I couldn't find much information to explain this, but just remember: these are similar, probably

83
00:05:41,440 --> 00:05:48,430
discovered separately, but with similar methods of action. And Adam, which is not really an

84
00:05:48,520 --> 00:05:55,360
acronym, it's just its name, Adam is similar to Adadelta but still has momentum for the learning rates for each

85
00:05:55,360 --> 00:05:56,070
parameter.

86
00:05:56,400 --> 00:05:57,910
Each of the parameters separately.
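A rough plain-Python sketch of that Adam idea, a momentum-like first-moment estimate plus an RMSprop-like second-moment estimate, each bias-corrected, per parameter, follows the standard published update rule rather than Keras's source; the toy loss and the learning rate are illustrative (the commonly cited default is much smaller).

```python
import math

# Sketch of Adam on a toy 1-D quadratic loss L(w) = w^2.
# beta1/beta2/eps are the commonly cited defaults; lr is set larger than
# the usual default so this toy example moves quickly.
def grad(w):
    return 2.0 * w

w, m, v = 5.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g          # first moment: momentum-like average
    v = beta2 * v + (1 - beta2) * g * g      # second moment: adaptive per-parameter scale
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero-initialized
    v_hat = v / (1 - beta2 ** t)             # moment estimates
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(f"final w: {w:.4f}")  # well on its way toward the minimum at 0
```

With several parameters, each one would carry its own m and v, which is the "learning rates for each of the parameters separately" point from the transcript.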

87
00:05:58,240 --> 00:06:05,240
I'll correct these little mistakes here and there before this is distributed to you guys.

88
00:06:05,300 --> 00:06:07,640
So what does a good learning rate look like?

89
00:06:07,640 --> 00:06:10,580
So this is a loss graph here,

90
00:06:10,770 --> 00:06:16,010
over epochs. This is how it would look if we had a very high rate, a large learning rate.

91
00:06:16,130 --> 00:06:20,660
Basically we would never find a convergence zone, because we'd be bouncing around everywhere, and our loss

92
00:06:20,660 --> 00:06:22,940
would probably just get worse over time.

93
00:06:23,270 --> 00:06:27,560
A low learning rate will eventually get there, however it'll take a while.

94
00:06:27,640 --> 00:06:29,930
A high learning rate will eventually get here too.

95
00:06:30,240 --> 00:06:35,780
But these two could actually be interchangeable. Basically it will get here too. Good learning rates

96
00:06:35,780 --> 00:06:41,560
have nice, gradual, smooth, decreasing steps, and basically converge at lower points over time.

97
00:06:43,080 --> 00:06:49,320
And finally, if you go to keras.io/optimizers, it brings up a list of all optimizers, and actually

98
00:06:49,320 --> 00:06:56,220
how to use the code here, and what settings are available for the optimizers. So you can see, as

99
00:06:56,220 --> 00:06:59,570
it shows here, basically we have some of these parameters here.

100
00:07:00,400 --> 00:07:03,080
Nesterov and decay I mentioned to you before.

101
00:07:04,070 --> 00:07:10,130
RMSprop, all of these are available here, and some explanations of what they do and what you can

102
00:07:10,130 --> 00:07:13,480
tweak, it's all available on the Keras site.

103
00:07:13,480 --> 00:07:15,350
So take a look.

104
00:07:15,680 --> 00:07:22,630
Just so you know, as best practice, I find Adam to be one of the best optimizers.

105
00:07:22,650 --> 00:07:27,270
I don't even actually have to stray far from these default values.

106
00:07:27,280 --> 00:07:29,520
I actually just set the learning rate to be slightly smaller.

107
00:07:29,780 --> 00:07:38,000
But all these values are usually fine. Also, what's quite good to use as well is SGD with a very low

108
00:07:38,000 --> 00:07:38,890
learning rate.

109
00:07:39,020 --> 00:07:41,830
You can set the momentum and decay according to the parameters

110
00:07:42,000 --> 00:07:44,900
I mentioned before, and always use Nesterov as true.