AI_DL_Assignment / 6. Neural Networks Explained /9. Regularization, Overfitting, Generalization and Test Datasets.srt

	1
	00:00:00,390 --> 00:00:00,710
	OK.

	2
	00:00:00,750 --> 00:00:02,420
	So welcome to Section Six point.

	3
	00:00:02,550 --> 00:00:05,970
	Actually, I just corrected it from the last section; I had a 6.7 here.

	4
	00:00:06,060 --> 00:00:06,910
	My bad.

	5
	00:00:07,320 --> 00:00:11,500
	So this section deals with regularisation, which is a very important concept.

	6
	00:00:11,520 --> 00:00:16,740
	Also, you're going to understand what overfitting is, why it's bad, and why we need a model to

	7
	00:00:16,920 --> 00:00:21,090
	generalize well, and you'll understand basically what a test dataset is.

	8
	00:00:21,090 --> 00:00:25,110
	I think I mentioned it previously but I'll go into it a bit more here.

	9
	00:00:25,500 --> 00:00:30,920
	Effectively, we want to know when and how our trained model becomes good.

	10
	00:00:32,310 --> 00:00:33,920
	So what makes a good model.

	11
	00:00:33,930 --> 00:00:40,270
	Now, this is a very, very basic explanation of what makes a good model: a good model is accurate, generalizes

	12
	00:00:40,350 --> 00:00:44,050
	well, and does not overfit. The last two kind of mean the same thing.

	13
	00:00:44,080 --> 00:00:46,130
	You'll understand that shortly.

	14
	00:00:46,290 --> 00:00:52,150
	And I deliberately made the slide vague, because accuracy all depends on the domain

	15
	00:00:52,160 --> 00:00:55,660
	you're looking at. Sometimes you want 99.99 percent accuracy.

	16
	00:00:55,830 --> 00:00:57,050
	Sometimes you can.

	17
	00:00:57,180 --> 00:00:59,060
	You can be happy with 80 percent accuracy.

	18
	00:00:59,070 --> 00:01:00,490
	It all depends on the application.

	19
	00:01:03,020 --> 00:01:05,780
	So let's look at the models here.

	20
	00:01:05,900 --> 00:01:08,990
	Let's look at these two classes, one in green and one in blue.

	21
	00:01:09,410 --> 00:01:17,360
	And this is model A, model B, and model C. The red line here is basically the decision boundary for each

	22
	00:01:17,540 --> 00:01:19,010
	dataset here.

	23
	00:01:19,330 --> 00:01:21,540
	Each model, I should say.

	24
	00:01:23,280 --> 00:01:29,430
	So how do you know which model is best? Intuitively, what would you say is the best model

	25
	00:01:29,430 --> 00:01:29,980
	here?

	26
	00:01:30,030 --> 00:01:32,630
	Now let's look at model A closely.

	27
	00:01:32,890 --> 00:01:36,350
	Model A actually separates all the data accurately.

	28
	00:01:36,510 --> 00:01:43,890
	It sees a blue ball over here and actually adjusts its decision boundary to encapsulate it. Model B

	29
	00:01:43,890 --> 00:01:49,530
	doesn't do that. Model B basically draws a nice smooth curve here; it doesn't push itself

	30
	00:01:49,560 --> 00:01:54,480
	all the way out here to capture this blue ball, and it forms a nice clean decision boundary

	31
	00:01:54,480 --> 00:01:55,150
	here.

	32
	00:01:55,260 --> 00:02:01,680
	Model C takes a much more simplistic approach, giving you a straight line separating these classes

	33
	00:02:01,680 --> 00:02:02,050
	here.

	34
	00:02:02,230 --> 00:02:05,790
	Now, what would you say is the best model here?

	35
	00:02:06,150 --> 00:02:13,080
	I would say B, and I'll tell you why. Even though B doesn't capture the blue ball here, as you can

	36
	00:02:13,080 --> 00:02:17,580
	see from the nature of this data, the blue ball is technically in the green zone.

	37
	00:02:18,050 --> 00:02:18,320
	Yeah.

	38
	00:02:18,390 --> 00:02:20,310
	So this ball here is an anomaly.

	39
	00:02:20,310 --> 00:02:21,920
	He's basically an outlier.

	40
	00:02:22,260 --> 00:02:25,790
	It could be that he ended up here from being mislabeled.

	41
	00:02:25,800 --> 00:02:29,470
	Maybe he was supposed to be a green ball, or maybe he's just highly unusual.

	42
	00:02:29,670 --> 00:02:35,430
	And generally we don't want our models to basically cover this blue ball here.

	43
	00:02:35,640 --> 00:02:43,960
	This is called overfitting, and it's bad because, in the most likely scenario,

	44
	00:02:44,080 --> 00:02:45,470
	green balls are going to be right here.

	45
	00:02:45,540 --> 00:02:48,090
	So what happens when a green ball is here, like an unseen

	46
	00:02:48,110 --> 00:02:54,030
	green ball in the future, where we feed this green ball's x-y coordinates, right here,

	47
	00:02:54,530 --> 00:02:55,860
	into this model?

	48
	00:02:55,860 --> 00:02:59,630
	This overfitting model is going to label it as blue.

	49
	00:02:59,730 --> 00:03:06,600
	Unfortunately, when in reality, if you see here this nice clean model B boundary, it is supposed to

	50
	00:03:06,600 --> 00:03:09,330
	be green, the green class.

	51
	00:03:09,570 --> 00:03:16,080
	So that's an example of a model that is overfit, probably too complicated for its own good, as opposed

	52
	00:03:16,080 --> 00:03:19,610
	to model B, which generalizes well. Model C,

	53
	00:03:19,650 --> 00:03:23,380
	on the other hand, is way too general, and it's not going to be a good model.

	54
	00:03:27,480 --> 00:03:32,190
	As I explained before, this model overfits, this model is the ideal balance,

	55
	00:03:32,190 --> 00:03:33,990
	and this one is what's called underfitting.

	56
	00:03:34,110 --> 00:03:35,250
	He doesn't fit the data

	57
	00:03:35,430 --> 00:03:38,930
	well; he just gives you a generic boundary and tells you,

	58
	00:03:39,210 --> 00:03:43,020
	"Yeah, I tried my best," but it's underfitting.

	59
	00:03:43,100 --> 00:03:47,550
	So let's go into overfitting. Overfitting is what leads to poor models.

	60
	00:03:47,570 --> 00:03:54,650
	And it's one of the most common problems faced in machine learning, and it's a problem I face continuously

	61
	00:03:54,650 --> 00:03:57,500
	when training my convolutional neural nets.

	62
	00:03:57,890 --> 00:04:01,570
	It always happens; the model always tends to overfit on the training data.

	63
	00:04:01,940 --> 00:04:04,960
	And we will experience that later on in this course.

	64
	00:04:05,390 --> 00:04:12,450
	But what it means, basically, is that overfitting is when our model fits the training data perfectly, as in

	65
	00:04:12,710 --> 00:04:19,370
	he has very high accuracy on the data he was trained on, maybe even high 90s, 99.9

	66
	00:04:19,370 --> 00:04:20,820
	something.

	67
	00:04:20,820 --> 00:04:27,740
	However, on the test dataset, which is the unseen data, he is going to perform poorly, because he has

	68
	00:04:27,740 --> 00:04:33,710
	basically modeled the training data but can't generalize well to data he hasn't seen.

	69
	00:04:34,190 --> 00:04:39,100
	So what happens when you try to pass a new point at this position here?

	70
	00:04:39,290 --> 00:04:40,910
	Exactly what I mentioned before.

	71
	00:04:41,450 --> 00:04:45,290
	Its true color is supposed to be green, but it will be classified as blue.

	72
	00:04:45,590 --> 00:04:48,020
	Models don't necessarily need to be too complex to be good.

	73
	00:04:48,030 --> 00:04:54,330
	They need to generalize well. So how do you know if you've overfit?

	74
	00:04:54,820 --> 00:04:59,130
	Well, that's why I mentioned test data previously, at the beginning of the slides.

	75
	00:04:59,350 --> 00:05:05,420
	We need to test a model on test data, and in all machine learning algorithms and such,

	76
	00:05:05,560 --> 00:05:08,040
	we always use a test dataset.

	77
	00:05:08,560 --> 00:05:15,230
	And basically, if we have an entire dataset of, say, 1,000 images, we take 700 and we train on

	78
	00:05:15,250 --> 00:05:20,310
	those 700 labeled images, and we reserve 300 as test data.

	79
	00:05:20,440 --> 00:05:27,930
	That test data is critical because it tells us how well our algorithm or model performs on data

	80
	00:05:28,030 --> 00:05:29,680
	the model has never seen before.

	81
	00:05:31,970 --> 00:05:36,120
	This is a very common case of overfitting: 95 percent plus accuracy on the training set,

	82
	00:05:36,260 --> 00:05:38,690
	but on test data you get like 70 percent.

	83
	00:05:38,690 --> 00:05:41,500
	It's a perfect example of overfitting.
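
The train-versus-test gap described above can be reproduced on toy data. This is only a minimal sketch, not the course's own code: the data, the 15 percent label noise, and the 1-nearest-neighbour classifier (chosen because it memorizes its training set, outliers included) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 1,000 two-dimensional points in two classes,
# with about 15% of the labels flipped to play the role of outliers.
X = rng.normal(size=(1000, 2))
y_clean = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.random(1000) < 0.15
y = np.where(flip, 1 - y_clean, y_clean)

# Reserve 300 points as test data and train on the other 700.
order = rng.permutation(1000)
train_idx, test_idx = order[:700], order[700:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

def predict_1nn(X_query):
    """Label each query point with the label of its nearest training point."""
    # Squared distances between every query point and every training point.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float((y_true == y_pred).mean())

train_acc = accuracy(y_train, predict_1nn(X_train))
test_acc = accuracy(y_test, predict_1nn(X_test))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Because every training point is its own nearest neighbour, training accuracy is perfect, while the held-out points expose how much of that was memorized noise.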

	84
	00:05:42,280 --> 00:05:44,260
	So, overfitting illustrated graphically.

	85
	00:05:44,330 --> 00:05:45,810
	Here's what happens.

	86
	00:05:45,890 --> 00:05:48,390
	I've mentioned epochs before,

	87
	00:05:48,920 --> 00:05:54,040
	so let me discuss exactly what epochs are.

	88
	00:05:54,320 --> 00:06:00,750
	Basically, every time we send the full training dataset into our training algorithm,

	89
	00:06:00,860 --> 00:06:06,680
	we have completed one epoch, and sometimes we need to train for maybe hundreds of epochs to get a good model.

	90
	00:06:06,770 --> 00:06:07,790
	Usually that's not the case.

	91
	00:06:07,790 --> 00:06:13,490
	Usually you can get away with training for 20 or so epochs, but generally that's what we have to do to get the

	92
	00:06:13,490 --> 00:06:15,960
	best models.

	93
	00:06:15,980 --> 00:06:22,120
	So this is an illustration of what overfitting looks like.

	94
	00:06:22,130 --> 00:06:29,360
	So look at training loss here in red, and accuracy. Loss is going down quite well; accuracy is going up

	95
	00:06:29,360 --> 00:06:30,620
	close to 100 percent.

	96
	00:06:30,640 --> 00:06:37,370
	The scale of accuracy is on this side, loss is on this side, and epochs are on the x-axis. Now look at the test:

	97
	00:06:37,370 --> 00:06:43,750
	the test loss fluctuates between 1 and 1.5 and actually goes up in the end.

	98
	00:06:43,820 --> 00:06:44,970
	It's not good at all.

	99
	00:06:45,380 --> 00:06:51,700
	And look at our test accuracy: he's hovering at abysmal rates, like below 50 percent,

	100
	00:06:52,190 --> 00:06:57,800
	while his training accuracy is at 100 percent. This is actually extreme overfitting, to

	101
	00:06:57,800 --> 00:07:00,240
	be fair. It doesn't actually have to get this bad;

	102
	00:07:00,320 --> 00:07:02,370
	at least, I've never gotten it to be this bad.

	103
	00:07:02,510 --> 00:07:05,850
	But that is a good example of what we actually see happening in the real world.

	104
	00:07:06,290 --> 00:07:13,100
	Good training accuracy, poor test accuracy, and that's overfitting.

	105
	00:07:13,120 --> 00:07:15,090
	So how do we avoid overfitting?

	106
	00:07:15,360 --> 00:07:16,630
	There are many techniques to avoid it,

	107
	00:07:16,630 --> 00:07:19,030
	and I'll discuss them shortly

	108
	00:07:19,060 --> 00:07:26,200
	in the slides in this section. Now, overfitting is a consequence of our weights being tuned to

	109
	00:07:26,200 --> 00:07:32,080
	fit our training data too closely, in a way that they don't perform well on test data.

	110
	00:07:32,740 --> 00:07:39,310
	And we know we had a decent model; it's just too sensitive to the training data.

	111
	00:07:39,340 --> 00:07:40,990
	So is there a way to fix this?

	112
	00:07:41,040 --> 00:07:42,420
	I mean way too sensitive.

	113
	00:07:42,430 --> 00:07:46,750
	I mean, it's just optimized exclusively for the training data.

	114
	00:07:49,660 --> 00:07:55,320
	So we can avoid overfitting by using a smaller, less deep model.

	115
	00:07:55,820 --> 00:08:01,720
	Deeper models can sometimes find features or interpret noise to be important. An example of noise was this

	116
	00:08:01,720 --> 00:08:02,210
	here.

	117
	00:08:03,000 --> 00:08:08,790
	This clearly wasn't that important, and yet a deep model will actually try to figure out and model this

	118
	00:08:10,810 --> 00:08:11,740
	in the data.

	119
	00:08:11,830 --> 00:08:17,920
	That's because deeper networks have the ability to memorize more complicated features,

	120
	00:08:18,850 --> 00:08:21,580
	and that's called the memorization capacity.

	121
	00:08:24,960 --> 00:08:28,230
	But there's another way and that's called regularisation.

	122
	00:08:28,470 --> 00:08:35,550
	I don't often recommend using less deep models to get better results, because

	123
	00:08:35,550 --> 00:08:40,440
	there are other ways around that, and you always want to have a deep enough model to

	124
	00:08:40,440 --> 00:08:42,890
	represent complicated patterns in your data.

	125
	00:08:43,260 --> 00:08:47,750
	So let's see what techniques we can use to regularize.

	126
	00:08:47,820 --> 00:08:49,400
	So what is regularisation?

	127
	00:08:49,400 --> 00:08:53,750
	It's a method of making our model more general to a dataset.

	128
	00:08:53,760 --> 00:08:59,880
	So basically regularisation will take a model that produces a decision boundary like this and sort of

	129
	00:08:59,880 --> 00:09:02,870
	tweak it so that it becomes like this.

	130
	00:09:02,970 --> 00:09:05,150
	Now, it's not actually doing exactly that;

	131
	00:09:05,150 --> 00:09:07,990
	that's just the actual aim of regularisation.

	132
	00:09:08,010 --> 00:09:11,240
	We basically want to get a model like this and not like this.

	133
	00:09:11,280 --> 00:09:13,000
	And let's find out how we do that.

	134
	00:09:14,750 --> 00:09:18,090
	So there are a few types of regularisation. There are actually more than this,

	135
	00:09:18,110 --> 00:09:19,800
	but these are the basic types.

	136
	00:09:19,800 --> 00:09:28,010
	There's L1 and L2 regularisation, cross-validation, early stopping, dropout, and data augmentation.

	137
	00:09:28,040 --> 00:09:31,030
	So let's look at L1 and L2 regularisation.

	138
	00:09:31,070 --> 00:09:37,640
	These are techniques we use to penalize large weights. Large weights, or gradients, manifest themselves

	139
	00:09:37,640 --> 00:09:43,660
	as abrupt changes in our model's decision boundary, and by penalizing them we're effectively making them smaller.

	140
	00:09:44,240 --> 00:09:45,850
	L2 is known as ridge regression.

	141
	00:09:45,870 --> 00:09:47,900
	L1 is lasso regression.

	142
	00:09:48,240 --> 00:09:50,570
	Now, there is a lot more theory behind these things here.

	143
	00:09:50,720 --> 00:09:54,110
	I'm just basically showing you the formulas of what they actually are.

	144
	00:09:54,590 --> 00:10:02,060
	So as you can see, this is basically what L2 is here.

	145
	00:10:02,200 --> 00:10:03,760
	This is MSE here.

	146
	00:10:03,950 --> 00:10:06,690
	But we're actually doing something with it here.

	147
	00:10:06,830 --> 00:10:08,340
	What are we doing.

	148
	00:10:08,360 --> 00:10:14,630
	We're applying a constant here to the sum of the weights squared, and in L1 it's not squared;

	149
	00:10:14,690 --> 00:10:16,940
	it's just the sum of the absolute values of the weights.

	150
	00:10:17,390 --> 00:10:26,180
	So this parameter here controls the penalty we apply. Via backpropagation, the penalty on the weights is applied

	151
	00:10:26,340 --> 00:10:28,320
	to the weight updates.
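
The two penalties just described can be written out in a few lines. This is a rough sketch under assumptions: the base loss is taken to be mean squared error, the constant is called `lam` here (a hypothetical name), and the exact scaling of the constant on the slide is not recoverable from the audio, so a plain multiplier is used.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

def l2_loss(y_true, y_pred, weights, lam):
    """MSE plus lam times the sum of squared weights (ridge-style penalty)."""
    return mse(y_true, y_pred) + lam * float(np.sum(weights ** 2))

def l1_loss(y_true, y_pred, weights, lam):
    """MSE plus lam times the sum of absolute weights (lasso-style penalty)."""
    return mse(y_true, y_pred) + lam * float(np.sum(np.abs(weights)))

# Hypothetical values: same predictions, two different sets of weights.
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.5, 0.5, 1.0])
small_w = np.array([0.1, -0.2, 0.3])
large_w = np.array([1.0, -2.0, 3.0])

for name, w in [("small weights", small_w), ("large weights", large_w)]:
    print(name,
          "L1:", round(l1_loss(y_true, y_pred, w, lam=0.1), 4),
          "L2:", round(l2_loss(y_true, y_pred, w, lam=0.1), 4))
```

With identical predictions, the model with larger weights gets the larger loss, which is what pushes training toward smaller weights.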

	152
	00:10:28,350 --> 00:10:34,150
	So the difference between them, basically, is that L1 brings the weights of unimportant features to zero,

	153
	00:10:34,920 --> 00:10:37,220
	thus acting as a feature selection algorithm.

	154
	00:10:37,290 --> 00:10:42,720
	You may not know what feature selection is, but if you want to know what feature selection is, it's basically

	155
	00:10:43,110 --> 00:10:43,960
	trying to find out, if

	156
	00:10:44,010 --> 00:10:48,230
	we have like 20 inputs, which input is most important to our

	157
	00:10:48,240 --> 00:10:52,160
	output. It's also useful for sparse models as well.

	158
	00:10:52,560 --> 00:10:56,550
	Whereas L2 penalises large weights even more but does not bring them down to zero.

	159
	00:10:56,710 --> 00:10:57,630
	OK.

	160
	00:10:58,260 --> 00:11:04,950
	So effectively, what we're doing here, just remember, is that L1 and L2 prevent the weights from being too large, so that we don't

	161
	00:11:04,950 --> 00:11:11,830
	have abrupt changes in our model. Abrupt changes basically mean things like, if I go back to here,

	162
	00:11:13,360 --> 00:11:18,080
	we're basically trying to make this, instead of having these abrupt gradient changes here.

	163
	00:11:21,290 --> 00:11:23,610
	So let's go into cross-validation now.

	164
	00:11:23,630 --> 00:11:29,720
	Cross-validation is something I rarely ever use. I used to use

	165
	00:11:29,720 --> 00:11:30,140
	it a lot

	166
	00:11:30,140 --> 00:11:35,000
	doing previous machine learning stuff, but in deep learning I don't use it often. But if you want to know

	167
	00:11:35,000 --> 00:11:43,160
	what it is, it's quite simple, actually. Basically, cross-validation, and this is k-fold cross-validation,

	168
	00:11:43,730 --> 00:11:49,650
	is the way we split our training dataset into different folds, and we train

	169
	00:11:49,730 --> 00:11:54,000
	on those folds and we basically test on the other fold afterward.

	170
	00:11:54,500 --> 00:12:00,600
	So let's look at it. Say this here is our test fold, or validation set, and these are our training set.

	171
	00:12:00,740 --> 00:12:08,650
	So what happens is that we train on these four folds here and we test on this one, then we train on these four folds and

	172
	00:12:08,750 --> 00:12:09,070
	test.

	173
	00:12:09,080 --> 00:12:15,190
	And what this does is that we don't actually have any unseen data in this model.

	174
	00:12:15,260 --> 00:12:21,410
	What we're doing is continuously testing on segments of the data, then training on segments

	175
	00:12:21,410 --> 00:12:25,320
	of the data and testing on a different segment.

	176
	00:12:25,370 --> 00:12:30,630
	It reduces overfitting, yes, but it also slows down the training process.
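
The fold rotation described above, training on four folds and testing on the remaining one, can be sketched as follows. The helper name `k_fold_indices` and the sample counts are hypothetical, chosen only for illustration.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs, one pair per fold."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)   # shuffle once up front
    folds = np.array_split(order, k)     # k roughly equal folds
    for i in range(k):
        test_idx = folds[i]
        # Every fold except the i-th one is used for training.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx.tolist()), "test:", sorted(test_idx.tolist()))
```

Each sample lands in the test fold exactly once, and a model would be trained k times, which is where the extra training cost comes from.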

	177
	00:12:32,210 --> 00:12:36,030
	Now, early stopping is something we can actually do automatically in Keras.

	178
	00:12:36,080 --> 00:12:43,070
	Basically, we can set something in Keras that says: if our loss stops decreasing, stop training.

	179
	00:12:43,290 --> 00:12:50,600
	So if we set our model for 100 epochs, but say at epoch number twenty-something the loss stops decreasing,

	180
	00:12:51,080 --> 00:12:55,030
	it's going to stop training somewhere around here when it realizes that, and it's going to give you this

	181
	00:12:55,040 --> 00:12:55,670
	model here,

	182
	00:12:55,670 --> 00:12:58,670
	the best one, with the best loss.

	183
	00:12:58,700 --> 00:13:05,840
	The reason we do early stopping is that sometimes, when we keep training over and over for a number of

	184
	00:13:05,900 --> 00:13:08,430
	epochs on our training data,

	185
	00:13:08,510 --> 00:13:11,840
	the model tends to basically overfit on that data.

	186
	00:13:11,840 --> 00:13:14,540
	So we need to actually stop it early sometimes,

	187
	00:13:14,760 --> 00:13:18,610
	so that overfitting does not occur.
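
The stopping rule itself is simple enough to sketch without any framework. The function name, the `patience` setting, and the loss values below are all made up for illustration; in Keras the same idea is available as the `EarlyStopping` callback, which is not what this sketch uses.

```python
def train_with_early_stopping(losses, patience=3):
    """Return (best_epoch, best_loss, stopped_epoch) for a sequence of losses.

    `losses` stands in for the loss measured after each epoch. Training stops
    once the loss has failed to improve for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    stopped_epoch = len(losses) - 1
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # New best: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stopped_epoch = epoch
                break
    return best_epoch, best_loss, stopped_epoch

# Hypothetical loss curve: improves at first, then starts creeping back up.
losses = [1.0, 0.8, 0.6, 0.55, 0.57, 0.60, 0.62, 0.70, 0.90]
best_epoch, best_loss, stopped_epoch = train_with_early_stopping(losses, patience=3)
print(best_epoch, best_loss, stopped_epoch)
```

The model kept is the one from the best epoch, not the one from the epoch where training stopped.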

	188
	00:13:20,040 --> 00:13:21,670
	And let's talk about dropout.

	189
	00:13:21,710 --> 00:13:26,000
	It is actually very easy to implement and extremely useful.

	190
	00:13:26,520 --> 00:13:28,610
	So dropout refers to dropping nodes,

	191
	00:13:28,700 --> 00:13:33,290
	hidden and visible, in a neural network with the aim of reducing overfitting.

	192
	00:13:33,530 --> 00:13:35,240
	What do I mean by dropping nodes?

	193
	00:13:35,460 --> 00:13:40,830
	What it means is that in the training process certain parts of the network are ignored during a forward

	194
	00:13:40,830 --> 00:13:42,680
	and back propagations.

	195
	00:13:42,720 --> 00:13:48,660
	This is a way of actually making the network have some redundancy, in a way, but what it also does is that

	196
	00:13:48,660 --> 00:13:53,100
	it actually adds regularisation to our networks.

	197
	00:13:53,160 --> 00:13:59,030
	It helps reduce the interdependency between neurons as well.

	198
	00:13:59,450 --> 00:14:04,840
	So it actually leads to more robust and meaningful features. In dropout,

	199
	00:14:04,910 --> 00:14:10,580
	there's one parameter we use; it's called p, and p is the probability that nodes are kept or dropped out

	200
	00:14:11,270 --> 00:14:12,930
	in the training process.

	201
	00:14:12,980 --> 00:14:18,620
	So one of the consequences of using dropout is that it almost doubles the time to converge

	202
	00:14:18,620 --> 00:14:21,090
	during training.

	203
	00:14:21,110 --> 00:14:23,960
	So this is a good illustration of dropout.

	204
	00:14:24,020 --> 00:14:26,680
	This is a standard neural network when we train it.

	205
	00:14:26,990 --> 00:14:30,670
	And after applying dropout, let's say with a fairly high value,

	206
	00:14:31,100 --> 00:14:33,060
	these nodes here are ignored,

	207
	00:14:33,650 --> 00:14:38,040
	and these nodes are used in training.
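
Ignoring nodes during training can be sketched with a random mask. Two things here are assumptions rather than anything stated in the lecture: `keep_prob` is read as the probability a node is kept, and the rescaling by `1 / keep_prob` (often called inverted dropout) is one common convention.

```python
import numpy as np

def dropout(activations, keep_prob, rng, training=True):
    """Zero out nodes at random during training and rescale the survivors."""
    if not training:
        # At test time every node is used and nothing is dropped.
        return activations
    mask = rng.random(activations.shape) < keep_prob  # True where a node is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((4, 8))  # hypothetical hidden-layer activations
dropped = dropout(hidden, keep_prob=0.5, rng=rng)
print(dropped)
```

Dropped nodes output zero for that pass, so they contribute nothing forward and receive no gradient backward; a fresh mask is drawn on every call.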

	208
	00:14:38,140 --> 00:14:44,170
	And lastly, well, not the last method of regularisation, but the last one I'll teach in this course, because

	209
	00:14:44,170 --> 00:14:48,690
	the others are actually quite exotic and not commonly used.

	210
	00:14:48,730 --> 00:14:51,890
	So this one is called data augmentation.

	211
	00:14:51,890 --> 00:14:55,230
	Remember I said you need lots of data to train a network?

	212
	00:14:55,450 --> 00:15:02,800
	What if you don't have it? Well, computer vision especially lends itself naturally to data augmentation.

	213
	00:15:02,800 --> 00:15:06,570
	What we do is we take a dataset; say we have one picture of a dog.

	214
	00:15:06,910 --> 00:15:09,910
	How about we just make some manipulations to this image.

	215
	00:15:09,910 --> 00:15:13,050
	We can rotate it completely.

	216
	00:15:13,390 --> 00:15:19,450
	I think this one is just mirrored here, and this one is actually zoomed in a bit as well.

	217
	00:15:19,960 --> 00:15:24,400
	So that is a way we can actually expand the dataset.
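
The manipulations mentioned above (mirroring, rotating, zooming) can be tried on a plain array. This is only a sketch: the `augment` helper is hypothetical, the "zoom" is a crude centre crop with repeated pixels, and it assumes an image whose height and width are divisible by four.

```python
import numpy as np

def augment(image):
    """Return a few simple variants of an H x W x C image array."""
    h, w = image.shape[:2]
    mirrored = image[:, ::-1]        # horizontal flip
    rotated = np.rot90(image, k=1)   # 90-degree rotation of the first two axes
    # Crude "zoom": crop the central half and repeat pixels back to full size.
    crop = image[h // 4 : h // 4 + h // 2, w // 4 : w // 4 + w // 2]
    zoomed = crop.repeat(2, axis=0).repeat(2, axis=1)
    return [mirrored, rotated, zoomed]

# Hypothetical 8 x 8 RGB "image" filled with distinct values.
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
variants = augment(image)
print([v.shape for v in variants])
```

One labeled picture becomes four training examples with the same label, which is how a small dataset gets stretched.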