AI_DL_Assignment / 12. Optimizers, Learning Rates & Callbacks with Fruit Classification /2. Types Optimizers and Adaptive Learning Rate Methods.srt
1
00:00:00,530 --> 00:00:06,240
OK, so let's start off Chapter 11 point, sorry, 12.1, by looking at the types of optimizers we
2
00:00:06,240 --> 00:00:10,520
have available in Keras, and look at some adaptive learning rate methods.
3
00:00:10,530 --> 00:00:12,540
So let's dive in.
4
00:00:12,600 --> 00:00:15,560
So optimizers: what exactly are optimizers?
5
00:00:15,570 --> 00:00:21,540
Now, you may remember from our neural network explanation that optimizers are the algorithms we use to minimize
6
00:00:21,550 --> 00:00:22,400
our loss.
7
00:00:22,560 --> 00:00:27,120
And some examples of this, which should be familiar to you by now, would be gradient descent, stochastic
8
00:00:27,120 --> 00:00:28,940
gradient descent and mini-batch gradient descent.
9
00:00:30,380 --> 00:00:35,930
So Keras actually comes with a lot more optimizers than those. We have the standard stochastic gradient
10
00:00:35,930 --> 00:00:39,560
descent, RMSprop, Adagrad, Adadelta,
11
00:00:39,610 --> 00:00:41,300
Adam, Adamax and Nadam.
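For reference, here is a minimal sketch (not from the lecture) of how each of those optimizers can be created, assuming the tf.keras API; any of them can then be passed to model.compile().

from tensorflow import keras

# Each optimizer mentioned above, created with its Keras default settings.
sgd      = keras.optimizers.SGD()
rmsprop  = keras.optimizers.RMSprop()
adagrad  = keras.optimizers.Adagrad()
adadelta = keras.optimizers.Adadelta()
adam     = keras.optimizers.Adam()
adamax   = keras.optimizers.Adamax()
nadam    = keras.optimizers.Nadam()

# An optimizer can be passed to compile() as an object or by its string name:
# model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])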
12
00:00:44,800 --> 00:00:52,320
So, a quick aside you should know: constant learning rates are generally bad, especially if you start off too big.
13
00:00:52,400 --> 00:00:58,440
So imagine after 50 long epochs, when we're thinking we're close to convergence, the problem
14
00:00:58,550 --> 00:01:05,970
is our learning rate basically makes the loss bounce around, and our, sorry, our training and test accuracies
15
00:01:06,180 --> 00:01:07,890
basically stop increasing.
16
00:01:07,890 --> 00:01:09,170
That's a bad situation
17
00:01:09,300 --> 00:01:14,760
when you're training. So you always want to use a learning rate that's as small as possible without
18
00:01:14,760 --> 00:01:17,630
being too small, because if it's too small, too small of a learning
19
00:01:17,640 --> 00:01:21,730
rate will just simply take forever to train.
20
00:01:22,210 --> 00:01:23,580
So there's so many choices.
21
00:01:23,630 --> 00:01:24,790
What's the difference.
22
00:01:24,800 --> 00:01:29,510
The main difference in these algorithms is how they manipulate learning rates to allow faster convergence
23
00:01:29,540 --> 00:01:31,700
and better validation accuracy.
24
00:01:31,700 --> 00:01:38,150
Some require, like gradient descent, setting some manual parameters, or even adjusting our learning
25
00:01:38,160 --> 00:01:43,460
rate, which we will come to shortly, and then some of them use a heuristic approach to provide adaptive learning
26
00:01:43,460 --> 00:01:45,120
rates which are quite cool.
27
00:01:45,170 --> 00:01:47,690
We'll actually see some of the comparisons shortly.
28
00:01:49,760 --> 00:01:54,910
So let's talk a bit about stochastic gradient descent and the parameters Keras allows us to control.
29
00:01:55,130 --> 00:02:00,920
So by default, Keras uses a constant learning rate in its SGD optimizer, that's stochastic gradient descent;
30
00:02:01,590 --> 00:02:08,360
however, we can set the parameters of momentum, decay, and also turn off or on something called
31
00:02:08,450 --> 00:02:10,340
Nesterov momentum.
32
00:02:10,340 --> 00:02:12,980
So let's talk a bit about momentum.
33
00:02:12,980 --> 00:02:19,220
Momentum is a technique that accelerates SGD by pushing the gradient steps along the relevant direction
34
00:02:19,670 --> 00:02:23,680
while reducing the jumps and oscillations away from relevant directions.
35
00:02:23,690 --> 00:02:29,360
So it basically encourages our gradient descent to head in the direction that is
36
00:02:29,360 --> 00:02:30,380
reducing loss.
37
00:02:30,440 --> 00:02:33,340
So it doesn't actually stray away from that direction.
38
00:02:34,480 --> 00:02:42,200
Next, decay: decay is something that decays the learning rate every batch, by the way, not every
39
00:02:42,260 --> 00:02:42,700
epoch.
40
00:02:42,710 --> 00:02:44,300
So be careful how you set your batch size.
41
00:02:44,300 --> 00:02:44,930
By the way.
42
00:02:45,050 --> 00:02:49,040
This is one of the only times batch size becomes relevant in training.
43
00:02:49,460 --> 00:02:54,230
A good rule of thumb, though, is to set the decay equal to the learning rate divided by the number of
44
00:02:54,230 --> 00:02:54,890
epochs.
45
00:02:54,950 --> 00:02:55,920
OK.
46
00:02:56,450 --> 00:03:02,090
So that's how we choose the decay value. Nesterov, basically, is a guy who actually developed a method
47
00:03:02,450 --> 00:03:08,480
that solves the problem of oscillating around our minima. Oscillating around a minimum basically
48
00:03:08,480 --> 00:03:13,410
means our steps are too big for us to actually converge at the minimum point.
49
00:03:13,910 --> 00:03:18,140
And this happens when momentum is high and unable to slow down.
50
00:03:18,530 --> 00:03:23,370
So it makes a big jump first, then the gradient is calculated and a small correction is made.
51
00:03:23,570 --> 00:03:25,530
That's how Nesterov works.
52
00:03:25,550 --> 00:03:33,160
So I would encourage you to use Nesterov momentum if you actually want to get there faster.
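As a rough sketch of the SGD settings just described (assuming the tf.keras SGD API; the per-batch decay argument exists in older Keras releases, while newer ones use learning rate schedules instead), momentum, decay and Nesterov momentum would be set like this, with illustrative values that are not from the lecture:

from tensorflow import keras

learning_rate = 0.01   # illustrative value
epochs = 50

# Rule of thumb from above: decay = learning rate / number of epochs.
decay = learning_rate / epochs

sgd = keras.optimizers.SGD(
    learning_rate=learning_rate,
    momentum=0.9,     # accelerate steps along the relevant direction
    decay=decay,      # shrinks the learning rate a little every batch (older Keras API)
    nesterov=True,    # big jump first, then a correction after the gradient is calculated
)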
53
00:03:33,170 --> 00:03:38,440
So this is a good illustration here of Nesterov momentum.
54
00:03:38,510 --> 00:03:42,120
It was taken from this source; it seems to be Stanford's CS231n course.
55
00:03:43,980 --> 00:03:48,930
So basically this actually shows all of the other algorithms compared here.
56
00:03:49,530 --> 00:03:53,350
And basically it shows you how momentum actually looks as well.
57
00:03:53,370 --> 00:03:54,230
So take a look.
58
00:03:54,510 --> 00:03:56,250
It's actually quite interesting.
59
00:03:56,250 --> 00:04:04,070
This is SGD with momentum, with Nesterov enabled by the way, so you can see it's taking a while to get here.
60
00:04:04,350 --> 00:04:08,710
But it will eventually get here, although the others have gotten there more quickly.
61
00:04:11,890 --> 00:04:14,360
I actually was wrong in one thing.
62
00:04:14,360 --> 00:04:16,360
I just remembered after checking it on my second screen.
63
00:04:16,580 --> 00:04:20,090
Actually, no, momentum actually had Nesterov enabled here;
64
00:04:20,280 --> 00:04:22,260
SGD was just plain vanilla SGD.
65
00:04:23,990 --> 00:04:29,680
So you can see all of these advanced optimizers got there eventually, except for SGD, which took forever.
66
00:04:31,660 --> 00:04:34,600
So let's talk a bit more about those other algorithms here.
67
00:04:34,660 --> 00:04:35,720
Some of these here.
68
00:04:36,100 --> 00:04:39,120
Let's start talking about the ones that are available in Keras.
69
00:04:39,520 --> 00:04:44,870
So we just saw how we set up parameters to control the learning rate schedule, and learning rate schedules are
70
00:04:44,890 --> 00:04:46,050
basically
71
00:04:46,150 --> 00:04:52,900
how our learning rates adapt over the training process, be it based on the number of epochs that
72
00:04:52,900 --> 00:04:55,290
have been completed or on other parameters.
73
00:04:55,320 --> 00:04:56,070
OK.
74
00:04:56,830 --> 00:05:00,430
That's why it's called an adaptive learning rate: it treats each epoch differently.
75
00:05:00,430 --> 00:05:03,130
So let's quickly talk about Adagrad.
76
00:05:03,360 --> 00:05:09,490
This performs larger updates for more sparse parameters and smaller updates for less sparse parameters.
77
00:05:09,530 --> 00:05:12,240
Is that actually proper English?
78
00:05:12,340 --> 00:05:19,480
This means it is well-suited for sparse data; however, because the learning rate is always
79
00:05:19,480 --> 00:05:24,060
decreasing monotonically, after many epochs learning slows down to a crawl.
80
00:05:24,580 --> 00:05:31,640
So Adadelta actually solves this monotonically decreasing learning rate problem that basically occurs
81
00:05:31,660 --> 00:05:35,730
in Adagrad. RMSprop is actually quite similar to Adadelta.
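To make that difference concrete, here is a toy sketch (not from the lecture) of why Adagrad's effective learning rate only shrinks while RMSprop's can recover: Adagrad divides by the sum of all squared gradients seen so far, whereas RMSprop divides by an exponential moving average of them.

import numpy as np

lr, eps, rho = 0.1, 1e-8, 0.9
adagrad_accum = 0.0   # running sum of squared gradients: only grows
rmsprop_accum = 0.0   # leaky moving average: can shrink again

for step, grad in enumerate([1.0, 1.0, 0.1, 0.1, 0.1], start=1):
    adagrad_accum += grad ** 2
    rmsprop_accum = rho * rmsprop_accum + (1 - rho) * grad ** 2

    # effective step size for this single toy parameter
    adagrad_step = lr * grad / (np.sqrt(adagrad_accum) + eps)
    rmsprop_step = lr * grad / (np.sqrt(rmsprop_accum) + eps)
    print(step, round(adagrad_step, 4), round(rmsprop_step, 4))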
82
00:05:35,760 --> 00:05:41,410
Sorry, I couldn't find much information to explain this, but just remember these are similar; they were probably
83
00:05:41,440 --> 00:05:48,430
discovered separately but have similar methods of action. And then there's Adam, whose name is not really
84
00:05:48,520 --> 00:05:55,360
an acronym; Adam is similar to Adadelta but it also has momentum for the learning rates for each
85
00:05:55,360 --> 00:05:56,070
parameter.
86
00:05:56,400 --> 00:05:57,910
Each of the parameters is treated separately.
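A minimal sketch of what that looks like, assuming the tf.keras Adam API (the values shown are the Keras defaults): beta_1 controls the momentum-style moving average of the gradients, and beta_2 the moving average of the squared gradients used for the per-parameter adaptive learning rate.

from tensorflow import keras

adam = keras.optimizers.Adam(
    learning_rate=0.001,  # Keras default
    beta_1=0.9,           # decay of the moving average of gradients (the momentum part)
    beta_2=0.999,         # decay of the moving average of squared gradients (the adaptive part)
)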
87
00:05:58,240 --> 00:06:05,240
I'll correct these little mistakes here and there before this is distributed to you guys.
88
00:06:05,300 --> 00:06:07,640
So what does a good learning rate look like?
89
00:06:07,640 --> 00:06:10,580
So this is shown on a loss graph here, loss over
90
00:06:10,770 --> 00:06:16,010
epochs. This is how it would look if we had a very high learning rate, a very large rate.
91
00:06:16,130 --> 00:06:20,660
Basically, we would never find a convergence zone because we'd be bouncing around everywhere, and our loss
92
00:06:20,660 --> 00:06:22,940
would probably just get worse over time.
93
00:06:23,270 --> 00:06:27,560
A low learning rate will eventually get there; however, it'll take a while.
94
00:06:27,640 --> 00:06:29,930
A high learning rate will eventually get here too.
95
00:06:30,240 --> 00:06:35,780
But these two could actually be interchangeable; basically, it will get here too. Good learning rates
96
00:06:35,780 --> 00:06:41,560
have nice gradual smooth and decreasing steps and basically converge at lower points over time.
97
00:06:43,080 --> 00:06:49,320
And finally, if you go to keras.io/optimizers, it brings up a list of all the optimizers and actually
98
00:06:49,320 --> 00:06:56,220
how to use the code here and what settings are available for the optimizers. So you can see, as
99
00:06:56,220 --> 00:06:59,570
it shows here, basically we have some of these parameters here.
100
00:07:00,400 --> 00:07:03,080
Nesterov and decay I mentioned to you before.
101
00:07:04,070 --> 00:07:10,130
RMSprop, all of these are available here, with some explanations of what they do and what you can
102
00:07:10,130 --> 00:07:13,480
tweak; it's all available on the Keras site.
103
00:07:13,480 --> 00:07:15,350
So take a look.
104
00:07:15,680 --> 00:07:22,630
Just so you know in best practice I find Adam to be one of the best optimizers.
105
00:07:22,650 --> 00:07:27,270
I don't even actually have to move away from these default values.
106
00:07:27,280 --> 00:07:29,520
I actually just set it to be slightly smaller.
107
00:07:29,780 --> 00:07:38,000
But all these values are usually fine. Also, what's quite good to use as well is SGD with a very low
108
00:07:38,000 --> 00:07:38,890
learning rate.
109
00:07:39,020 --> 00:07:41,830
You can set the momentum and decay according to the parameters
110
00:07:42,000 --> 00:07:44,900
I mentioned before, and always use Nesterov as True.
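Putting those best-practice suggestions together, a minimal sketch might look like this (the particular numbers are illustrative assumptions, not values given in the lecture):

from tensorflow import keras

# Option 1: Adam with the defaults, or a slightly smaller learning rate.
adam = keras.optimizers.Adam(learning_rate=0.0005)

# Option 2: SGD with a very low learning rate, momentum, decay and Nesterov on.
learning_rate = 0.001
epochs = 50
sgd = keras.optimizers.SGD(
    learning_rate=learning_rate,
    momentum=0.9,
    decay=learning_rate / epochs,   # rule of thumb; the `decay` argument is older-Keras only
    nesterov=True,
)

# Pass whichever you choose to compile:
# model.compile(optimizer=adam, loss="categorical_crossentropy", metrics=["accuracy"])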