1
00:00:00,530 --> 00:00:06,240
OK, so let's start off Chapter 12.1 by looking at the types of optimizers we
2
00:00:06,240 --> 00:00:10,520
have available in Keras, and look at some adaptive learning rate methods.
3
00:00:10,530 --> 00:00:12,540
So let's dive in.
4
00:00:12,600 --> 00:00:15,560
So, optimizers: what exactly are optimizers?
5
00:00:15,570 --> 00:00:21,540
Now, you may remember from our neural network explanation that optimizers are the algorithms we use to minimize
6
00:00:21,550 --> 00:00:22,400
our loss.
7
00:00:22,560 --> 00:00:27,120
And some examples of these, which should be familiar to you by now, would be gradient descent, stochastic
8
00:00:27,120 --> 00:00:28,940
gradient descent, and mini-batch gradient descent.
9
00:00:30,380 --> 00:00:35,930
So Keras actually comes with a lot more optimizers than those: we have the standard stochastic gradient
10
00:00:35,930 --> 00:00:39,560
descent, RMSprop, Adagrad, Adadelta,
11
00:00:39,610 --> 00:00:41,300
Adam, Adamax, and Nadam.
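For reference, here is a sketch of importing those optimizers in Keras (classic Keras 2 module paths; newer TensorFlow-bundled versions use tensorflow.keras.optimizers instead):

    from keras.optimizers import (SGD, RMSprop, Adagrad, Adadelta,
                                  Adam, Adamax, Nadam)

    # Each one can be instantiated with sensible defaults, e.g.:
    optimizer = RMSprop()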
12
00:00:44,800 --> 00:00:52,320
So, a quick aside: constant learning rates are generally bad, especially if you start off too big.
13
00:00:52,400 --> 00:00:58,440
So imagine, after 50 long epochs, when we're thinking we're close to convergence, the problem
14
00:00:58,550 --> 00:01:05,970
is our learning rate basically bounces us around our loss, and our training and test accuracies
15
00:01:06,180 --> 00:01:07,890
basically stop increasing.
16
00:01:07,890 --> 00:01:09,170
That's a bad situation.
17
00:01:09,300 --> 00:01:14,760
When you're training, you always want to use a learning rate that's as small as possible without
18
00:01:14,760 --> 00:01:17,630
being too small, because too small a learning rate
19
00:01:17,640 --> 00:01:21,730
would simply take forever to train.
20
00:01:22,210 --> 00:01:23,580
So there's so many choices.
21
00:01:23,630 --> 00:01:24,790
What's the difference?
22
00:01:24,800 --> 00:01:29,510
The main difference in these algorithms is how they manipulate learning rates to allow faster convergence
23
00:01:29,540 --> 00:01:31,700
and better validation accuracy.
24
00:01:31,700 --> 00:01:38,150
Some, like gradient descent, require setting some manual parameters, or even adjusting our learning
25
00:01:38,160 --> 00:01:43,460
rate, which we will come to shortly, and then some of them use a heuristic approach to provide adaptive learning
26
00:01:43,460 --> 00:01:45,120
rates which are quite cool.
27
00:01:45,170 --> 00:01:47,690
We'll actually see some of the comparisons shortly.
28
00:01:49,760 --> 00:01:54,910
So let's talk a bit about stochastic gradient descent and the parameters Keras allows us to control.
29
00:01:55,130 --> 00:02:00,920
So by default, Keras uses a constant learning rate in its SGD optimizer, or stochastic gradient descent;
30
00:02:01,590 --> 00:02:08,360
however, we can set the parameters of momentum and decay, and also turn on or off something called
31
00:02:08,450 --> 00:02:10,340
Nesterov momentum.
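A minimal sketch of those SGD knobs (using the classic Keras 2 argument names lr and decay, and assuming model is an already-built Keras model):

    from keras.optimizers import SGD

    sgd = SGD(lr=0.01,        # base learning rate
              momentum=0.9,   # classical momentum term
              decay=1e-6,     # per-update learning rate decay
              nesterov=True)  # switch Nesterov momentum on or off
    model.compile(optimizer=sgd, loss='categorical_crossentropy',
                  metrics=['accuracy'])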
32
00:02:10,340 --> 00:02:12,980
So let's talk a bit about momentum.
33
00:02:12,980 --> 00:02:19,220
Momentum is a technique that accelerates SGD by pushing gradient steps along the relevant direction,
34
00:02:19,670 --> 00:02:23,680
while reducing jumps and oscillations away from relevant directions.
35
00:02:23,690 --> 00:02:29,360
So it basically encourages our gradient descent to head in the direction that is
36
00:02:29,360 --> 00:02:30,380
reducing loss.
37
00:02:30,440 --> 00:02:33,340
So it doesn't stray away from that path.
38
00:02:34,480 --> 00:02:42,200
And as for decay, this is something that decays the learning rate every batch, by the way, not
39
00:02:42,260 --> 00:02:42,700
per epoch.
40
00:02:42,710 --> 00:02:44,300
So be careful how you set your batch size.
41
00:02:44,300 --> 00:02:44,930
By the way.
42
00:02:45,050 --> 00:02:49,040
This isn't the only time batch size becomes relevant in training.
43
00:02:49,460 --> 00:02:54,230
A good rule of thumb, though, for setting the decay is to make it equal to the learning rate divided by the number of
44
00:02:54,230 --> 00:02:54,890
epochs.
45
00:02:54,950 --> 00:02:55,920
OK.
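That rule of thumb translates directly into code; a sketch, where the epoch count and base learning rate are just illustrative choices:

    from keras.optimizers import SGD

    epochs = 50
    lr = 0.01
    decay = lr / epochs  # rule of thumb: decay = learning rate / number of epochs
    sgd = SGD(lr=lr, decay=decay, momentum=0.9, nesterov=True)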
46
00:02:56,450 --> 00:03:02,090
So that's how we choose the decay value. Nesterov, basically, is the guy who actually developed a method
47
00:03:02,450 --> 00:03:08,480
that solves the problem of oscillating around a minimum. Oscillating around a minimum basically
48
00:03:08,480 --> 00:03:13,410
means our steps are too big for us to actually converge at the minimum point.
49
00:03:13,910 --> 00:03:18,140
And this happens when momentum is high and unable to slow down.
50
00:03:18,530 --> 00:03:23,370
So this makes a big jump, then calculates the gradient and makes a small correction.
51
00:03:23,570 --> 00:03:25,530
That's how Nesterov works.
52
00:03:25,550 --> 00:03:33,160
So overall, I'd encourage you to use Nesterov momentum if you actually want to converge fast.
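To make the "big jump, then a small correction" idea concrete, here is a tiny sketch of my own (plain Python, not Keras code) of Nesterov momentum minimizing f(w) = w ** 2; the only change from plain momentum is that the gradient is evaluated at the look-ahead point w + mu * v:

    lr, mu = 0.1, 0.9
    grad = lambda w: 2.0 * w  # gradient of f(w) = w ** 2, minimum at w = 0

    w, v = 5.0, 0.0
    for _ in range(100):
        v = mu * v - lr * grad(w + mu * v)  # look-ahead gradient; plain momentum uses grad(w)
        w += v                              # the big jump plus its built-in correction
    print(w)  # ends up very close to 0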
53
00:03:33,170 --> 00:03:38,440
So this is a good illustration here of Nesterov momentum.
54
00:03:38,510 --> 00:03:42,120
It was taken from this source, CS231n, Stanford's course.
55
00:03:43,980 --> 00:03:48,930
So basically this actually has all the other algorithms compared here,
56
00:03:49,530 --> 00:03:53,350
and basically it shows you how momentum actually looks as well.
57
00:03:53,370 --> 00:03:54,230
So take a look.
58
00:03:54,510 --> 00:03:56,250
It's actually quite interesting.
59
00:03:56,250 --> 00:04:04,070
This is SGD with momentum, with Nesterov enabled by the way, so you can see it's taking a while to get there,
60
00:04:04,350 --> 00:04:08,710
but it will eventually get there, although the others have gotten there more quickly.
61
00:04:11,890 --> 00:04:14,360
Actually, I was wrong about one thing.
62
00:04:14,360 --> 00:04:16,360
I just noticed it when taking a look at my second screen.
63
00:04:16,580 --> 00:04:20,090
Actually, the momentum one had Nesterov enabled here.
64
00:04:20,280 --> 00:04:22,260
SGD was just plain vanilla SGD.
65
00:04:23,990 --> 00:04:29,680
So you can see all of these advanced optimizers got there eventually, except for SGD, which took forever.
66
00:04:31,660 --> 00:04:34,600
So let's talk a bit more about those other algorithms here.
67
00:04:34,660 --> 00:04:35,720
Some of these here.
68
00:04:36,100 --> 00:04:39,120
Let's start talking about the ones that are available in Keras.
69
00:04:39,520 --> 00:04:44,870
So we just saw we can set up parameters to control the learning rate schedule, and learning rate schedules are
70
00:04:44,890 --> 00:04:46,050
basically
71
00:04:46,150 --> 00:04:52,900
how our learning rates adapt over the training process, be it based on the number of epochs that have
72
00:04:52,900 --> 00:04:55,290
been completed or other parameters.
73
00:04:55,320 --> 00:04:56,070
OK.
74
00:04:56,830 --> 00:05:00,430
That's why it's called adaptive learning; the rate changes each epoch differently.
75
00:05:00,430 --> 00:05:03,130
So let's talk quickly about Adagrad.
76
00:05:03,360 --> 00:05:09,490
This performs larger updates for more sparse parameters and smaller updates for less sparse parameters.
77
00:05:09,530 --> 00:05:12,240
I'm not sure if that's actually proper English.
78
00:05:12,340 --> 00:05:19,480
It is thus well-suited for sparse data. However, because the learning rate is always
79
00:05:19,480 --> 00:05:24,060
decreasing monotonically, after many epochs learning slows down to a crawl.
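Here is a small sketch of my own (not the Keras implementation) showing why Adagrad's learning rate only shrinks: the squared-gradient accumulator never decreases, so the effective step lr / sqrt(cache) decays monotonically:

    import math

    w, lr, cache = 5.0, 1.0, 0.0
    for step in range(1000):
        g = 2.0 * w                               # gradient of f(w) = w ** 2
        cache += g ** 2                           # the accumulator only ever grows...
        w -= lr * g / (math.sqrt(cache) + 1e-8)   # ...so the effective step only shrinks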
80
00:05:24,580 --> 00:05:31,640
So Adadelta actually solves this monotonically decreasing learning rate problem that basically occurs
81
00:05:31,660 --> 00:05:35,730
in Adagrad. RMSprop is actually similar to Adadelta.
82
00:05:35,760 --> 00:05:41,410
Sorry, I couldn't find much information to explain this, but just remember these are similar; they were probably
83
00:05:41,440 --> 00:05:48,430
discovered separately but have similar methods of action. And Adam, which is not really
84
00:05:48,520 --> 00:05:55,360
an acronym despite its name, is similar to Adadelta but also keeps momentum for the learning rates of each
85
00:05:55,360 --> 00:05:56,070
parameter.
86
00:05:56,400 --> 00:05:57,910
That is, it adapts each of the parameters separately.
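For reference, a sketch of instantiating those adaptive optimizers with the classic Keras 2 defaults (check your version's docs for the exact values):

    from keras.optimizers import Adagrad, Adadelta, RMSprop, Adam

    adagrad  = Adagrad(lr=0.01)
    adadelta = Adadelta(lr=1.0, rho=0.95)
    rmsprop  = RMSprop(lr=0.001, rho=0.9)
    adam     = Adam(lr=0.001, beta_1=0.9, beta_2=0.999)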
87
00:05:58,240 --> 00:06:05,240
I'll correct these little mistakes here and there before this is delivered to you guys.
88
00:06:05,300 --> 00:06:07,640
So what does a good learning rate look like?
89
00:06:07,640 --> 00:06:10,580
So this is shown on a loss graph here,
90
00:06:10,770 --> 00:06:16,010
loss versus epochs. This is how it would look if we had a very high learning rate, a large rate.
91
00:06:16,130 --> 00:06:20,660
Basically we would never find a convergence zone, because we'd be bouncing around everywhere, and our loss
92
00:06:20,660 --> 00:06:22,940
would just get worse over time.
93
00:06:23,270 --> 00:06:27,560
A low learning rate will eventually get there; however, it'll take a while.
94
00:06:27,640 --> 00:06:29,930
A high learning rate will eventually get there too.
95
00:06:30,240 --> 00:06:35,780
These two could actually be interchangeable, but basically it will get there too. Good learning rates
96
00:06:35,780 --> 00:06:41,560
have nice, gradual, smooth, decreasing steps and basically converge at lower points over time.
97
00:06:43,080 --> 00:06:49,320
And finally, if you go to keras.io/optimizers, it brings up a list of all the optimizers, and actually
98
00:06:49,320 --> 00:06:56,220
how to use the code, and what settings are available for each optimizer. So you can see, as
99
00:06:56,220 --> 00:06:59,570
it shows here, basically we have some of these parameters here.
100
00:07:00,400 --> 00:07:03,080
Nesterov and decay, which I mentioned to you before.
101
00:07:04,070 --> 00:07:10,130
RMSprop, all of these are available here, with some explanations of what they do and what you can
102
00:07:10,130 --> 00:07:13,480
tweak; it's all available on Keras's site.
103
00:07:13,480 --> 00:07:15,350
So take a look.
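As the docs page shows, an optimizer can be passed to model.compile either as a configured instance or as a string name, in which case the defaults are used. A sketch, assuming model is an already-built Keras model:

    from keras.optimizers import RMSprop

    # By instance, so you can tweak the parameters:
    model.compile(optimizer=RMSprop(lr=0.001), loss='mse')

    # Or by name, using all the default settings:
    model.compile(optimizer='rmsprop', loss='mse')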
104
00:07:15,680 --> 00:07:22,630
Just so you know, in practice I find Adam to be one of the best optimizers.
105
00:07:22,650 --> 00:07:27,270
I don't even actually have to go far from its default values.
106
00:07:27,280 --> 00:07:29,520
I actually just set the learning rate to be slightly smaller.
107
00:07:29,780 --> 00:07:38,000
But all these default values are usually fine. Also, what's quite good to use as well is SGD with a very low
108
00:07:38,000 --> 00:07:38,890
learning rate.
109
00:07:39,020 --> 00:07:41,830
You can set the momentum and decay according to the rule of thumb
110
00:07:42,000 --> 00:07:44,900
I mentioned before, and always set Nesterov to true.
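Putting that advice into a sketch (the slightly smaller Adam rate of 5e-4 and the SGD settings below are just my illustrative choices):

    from keras.optimizers import Adam, SGD

    # Option 1: Adam, nudged slightly below its 1e-3 default learning rate
    adam = Adam(lr=5e-4)

    # Option 2: SGD with a very low learning rate, momentum, the decay
    # rule of thumb, and Nesterov momentum enabled
    epochs = 50
    sgd = SGD(lr=1e-3, momentum=0.9, decay=1e-3 / epochs, nesterov=True)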
|