AI_DL_Assignment / 6. Neural Networks Explained /9. Regularization, Overfitting, Generalization and Test Datasets.srt
1
00:00:00,390 --> 00:00:00,710
OK.
2
00:00:00,750 --> 00:00:02,420
So welcome to Section Six point.
3
00:00:02,550 --> 00:00:05,970
Actually, I just corrected it; from the last section I had six point seven here.
4
00:00:06,060 --> 00:00:06,910
My bad.
5
00:00:07,320 --> 00:00:11,500
So this section deals with regularisation, which is a very important concept.
6
00:00:11,520 --> 00:00:16,740
Also, you're going to understand what overfitting is, why it's bad, and why we need to have a model
7
00:00:16,920 --> 00:00:21,090
generalize well, and you'll understand basically what a test dataset is.
8
00:00:21,090 --> 00:00:25,110
I think I mentioned it previously but I'll go into it a bit more here.
9
00:00:25,500 --> 00:00:30,920
Effectively, we want to know when and how our trained model becomes good.
10
00:00:32,310 --> 00:00:33,920
So what makes a good model?
11
00:00:33,930 --> 00:00:40,270
Now this is a very, very basic explanation of what makes a good model. A good model is accurate, generalizes
12
00:00:40,350 --> 00:00:44,050
well and does not overfit, though those last two kind of mean the same thing.
13
00:00:44,080 --> 00:00:46,130
You'll understand that shortly.
14
00:00:46,290 --> 00:00:52,150
And I deliberately made it slightly vague, because accuracy all depends on the domain
15
00:00:52,160 --> 00:00:55,660
you're looking at. Sometimes you want 99.99 percent accuracy.
16
00:00:55,830 --> 00:00:57,050
Sometimes you can.
17
00:00:57,180 --> 00:00:59,060
You can be happy with 80 percent accuracy.
18
00:00:59,070 --> 00:01:00,490
It all depends on the application.
19
00:01:03,020 --> 00:01:05,780
So let's look at the models here.
20
00:01:05,900 --> 00:01:08,990
Let's look at these two classes, one in green and one in blue.
21
00:01:09,410 --> 00:01:17,360
And this is Model A, Model B and Model C. The red line here is basically the decision boundary for each
22
00:01:17,540 --> 00:01:19,010
dataset here,
23
00:01:19,330 --> 00:01:21,540
or for each model, I should say.
24
00:01:23,280 --> 00:01:29,430
So how do you know which model is best? Intuitively, what would you say is the best model
25
00:01:29,430 --> 00:01:29,980
here?
26
00:01:30,030 --> 00:01:32,630
Now let's look at Model A closely.
27
00:01:32,890 --> 00:01:36,350
Model A actually separates all the data accurately.
28
00:01:36,510 --> 00:01:43,890
It sees a blue ball over here and actually adjusts its decision boundary to encapsulate it. Model B
29
00:01:43,890 --> 00:01:49,530
basically doesn't do that. Model B does a nice smooth curve here; it doesn't push itself
30
00:01:49,560 --> 00:01:54,480
all the way out here to capture this blue ball, and basically it forms a nice clean decision boundary
31
00:01:54,480 --> 00:01:55,150
here.
32
00:01:55,260 --> 00:02:01,680
Model C takes a much more simplistic approach, giving you a straight line separating these classes
33
00:02:01,680 --> 00:02:02,050
here.
34
00:02:02,230 --> 00:02:05,790
Now what would you say is the best model here?
35
00:02:06,150 --> 00:02:13,080
Now I would say B, and I'll tell you why. Even though B doesn't capture the blue ball here, as you can
36
00:02:13,080 --> 00:02:17,580
see from the nature of this data, the blue ball technically is in the green zone.
37
00:02:18,050 --> 00:02:18,320
Yeah.
38
00:02:18,390 --> 00:02:20,310
So this ball here is an anomaly.
39
00:02:20,310 --> 00:02:21,920
He's basically an outlier.
40
00:02:22,260 --> 00:02:25,790
He may have ended up here from being mislabeled.
41
00:02:25,800 --> 00:02:29,470
Maybe he was supposed to be a green ball, or maybe he's just highly unusual.
42
00:02:29,670 --> 00:02:35,430
And generally we don't want our models to basically cover this blue ball here.
43
00:02:35,640 --> 00:02:43,960
This is called overfitting, and it's bad because, in the most likely scenario,
44
00:02:44,080 --> 00:02:45,470
green balls are going to be right here.
45
00:02:45,540 --> 00:02:48,090
So what happens when a green ball, an unseen
46
00:02:48,110 --> 00:02:54,030
green ball, shows up in the future? When we feed this green ball's x, y coordinates, which are right here,
47
00:02:54,530 --> 00:02:55,860
into this model,
48
00:02:55,860 --> 00:02:59,630
this overfitting model is going to label it as blue.
49
00:02:59,730 --> 00:03:06,600
Unfortunately so, when in reality, if you see it here with this nice clean Model B boundary, it is supposed to
50
00:03:06,600 --> 00:03:09,330
be green, the green class.
51
00:03:09,570 --> 00:03:16,080
So that's an example of a model that is overfit, probably too complicated for its own good, as opposed
52
00:03:16,080 --> 00:03:19,610
to Model B, which generalizes well. Model C,
53
00:03:19,650 --> 00:03:23,380
on the other hand, is way too general and it's not going to be a good model.
54
00:03:27,480 --> 00:03:32,190
As I explained before, this model overfits, this model is the ideal balance,
55
00:03:32,190 --> 00:03:33,990
and this is what's called underfitting.
56
00:03:34,110 --> 00:03:35,250
He doesn't fit the data
57
00:03:35,430 --> 00:03:38,930
well; he just gives you a generic boundary and tells you,
58
00:03:39,210 --> 00:03:43,020
"Yeah, I tried my best," but it's underfitting.
59
00:03:43,100 --> 00:03:47,550
So let's go into overfitting. Overfitting is what leads to poor models.
60
00:03:47,570 --> 00:03:54,650
And that's one of the most common problems faced in machine learning and that's a problem I face continuously
61
00:03:54,650 --> 00:03:57,500
when training my convolutional neural nets.
62
00:03:57,890 --> 00:04:01,570
It always happens; it always tends to overfit on the training data.
63
00:04:01,940 --> 00:04:04,960
And we will experience that later on in this course.
64
00:04:05,390 --> 00:04:12,450
But what it means basically is that overfitting is when our model fits the training data perfectly, as in
65
00:04:12,710 --> 00:04:19,370
he has very high accuracy on the data he was trained on, maybe even high 90s, ninety-nine point
66
00:04:19,370 --> 00:04:20,820
something.
67
00:04:20,820 --> 00:04:27,740
However, on the test dataset, which is the unseen data, he is going to perform poorly, because he has
68
00:04:27,740 --> 00:04:33,710
basically modeled the training data but can't generalize well to data he hasn't seen.
69
00:04:34,190 --> 00:04:39,100
So what happens when you try to pass in a new point at this position here?
70
00:04:39,290 --> 00:04:40,910
Exactly what I mentioned before.
71
00:04:41,450 --> 00:04:45,290
Its true color is supposed to be green, but it will be classified as blue.
72
00:04:45,590 --> 00:04:48,020
Models don't necessarily need to be too complex to be good.
73
00:04:48,030 --> 00:04:54,330
They need to generalize well. So how do you know if you've overfit?
74
00:04:54,820 --> 00:04:59,130
Well, that's why I mentioned test data previously, at the beginning of the slides.
75
00:04:59,350 --> 00:05:05,420
We need to test a model on test data, and in all machine learning algorithms and stuff
76
00:05:05,560 --> 00:05:08,040
we always use a test dataset.
77
00:05:08,560 --> 00:05:15,230
And basically, if we have an entire dataset of, say, a thousand images, we take seven hundred and we train on
78
00:05:15,250 --> 00:05:20,310
those 700 labeled images, and we reserve three hundred as test data.
79
00:05:20,440 --> 00:05:27,930
That 300 is critical because it tells us how well our algorithm or model performs on data
80
00:05:28,030 --> 00:05:29,680
the model has never seen before.
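As a rough sketch of that 700/300 split, here is one way it could look in plain Python. The function name `train_test_split`, the 1000 stand-in image ids, the `test_fraction` value and the fixed seed are all assumptions made for illustration, not something shown on the slides.

```python
import random

def train_test_split(samples, test_fraction=0.3, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test) lists."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test items are held out; the model never trains on them.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset: 1000 integer ids standing in for labeled images.
image_ids = list(range(1000))
train_ids, test_ids = train_test_split(image_ids)
print(len(train_ids), len(test_ids))
```

The important part is that the held-out ids are never touched during training, so accuracy measured on them reflects data the model has not seen.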
81
00:05:31,970 --> 00:05:36,120
This is a very common case of overfitting: 95 percent plus accuracy on the training data.
82
00:05:36,260 --> 00:05:38,690
But on test data you get like 70 percent.
83
00:05:38,690 --> 00:05:41,500
It's a perfect example of overfitting.
84
00:05:42,280 --> 00:05:44,260
So, overfitting illustrated graphically.
85
00:05:44,330 --> 00:05:45,810
Here's what happens.
86
00:05:45,890 --> 00:05:48,390
I'm going to mention epochs now.
87
00:05:48,920 --> 00:05:54,040
In this chapter I discuss exactly what epochs are.
88
00:05:54,320 --> 00:06:00,750
It's basically that every time we send the full training dataset through our training algorithm,
89
00:06:00,860 --> 00:06:06,680
we have completed one epoch, and we need to train for maybe hundreds of epochs sometimes to get a good model.
90
00:06:06,770 --> 00:06:07,790
Usually that's not the case.
91
00:06:07,790 --> 00:06:13,490
Usually you can get away with training 22:00 epochs, but generally that's what we have to do to get the
92
00:06:13,490 --> 00:06:15,960
best models.
93
00:06:15,980 --> 00:06:22,120
So this is an illustration of what overfitting looks like.
94
00:06:22,130 --> 00:06:29,360
So look at training loss here in red, and accuracy. Loss is going down quite well; accuracy is going up
95
00:06:29,360 --> 00:06:30,620
close to 100 percent.
96
00:06:30,640 --> 00:06:37,370
The scale of accuracy is on this side, loss is on this side, and epochs are on X. And now look at the test: the
97
00:06:37,370 --> 00:06:43,750
test loss fluctuates between 1 and 1.5 and actually goes up in the end.
98
00:06:43,820 --> 00:06:44,970
It's not good at all.
99
00:06:45,380 --> 00:06:51,700
And look at our test accuracy: he's hovering at abysmal rates of, like, below 50 percent,
100
00:06:52,190 --> 00:06:57,800
while his training accuracy is at 100 percent. This is actually extreme overfitting; to
101
00:06:57,800 --> 00:07:00,240
be fair, it doesn't actually have to get this bad.
102
00:07:00,320 --> 00:07:02,370
At least I've never gotten it to be this bad.
103
00:07:02,510 --> 00:07:05,850
That is a good example of what we actually see happening in the real world.
104
00:07:06,290 --> 00:07:13,100
Good training accuracy, poor test accuracy, and that's overfitting.
105
00:07:13,120 --> 00:07:15,090
So how do we avoid overfitting?
106
00:07:15,360 --> 00:07:16,630
There are many techniques to avoid it,
107
00:07:16,630 --> 00:07:19,030
and I'll discuss them soon
108
00:07:19,060 --> 00:07:26,200
in the slides in this section. Now, overfitting is a consequence of our weights having been tuned to
109
00:07:26,200 --> 00:07:32,080
fit our training data too closely, in a way that they don't perform well on test data.
110
00:07:32,740 --> 00:07:39,310
And we know we had a decent model; it's just too sensitive to the training data.
111
00:07:39,340 --> 00:07:40,990
So is there a way to fix this?
112
00:07:41,040 --> 00:07:42,420
I mean, it's way too sensitive.
113
00:07:42,430 --> 00:07:46,750
I mean, it's just optimized exclusively for the training data.
114
00:07:49,660 --> 00:07:55,320
So we can avoid overfitting by using a smaller, less deep model.
115
00:07:55,820 --> 00:08:01,720
Deeper models can sometimes find features or interpret noise to be important. Noise was, for example, this
116
00:08:01,720 --> 00:08:02,210
here.
117
00:08:03,000 --> 00:08:08,790
This clearly wasn't that important, but yet a deep model will actually try to figure out and model this
118
00:08:10,810 --> 00:08:11,740
in the data.
119
00:08:11,830 --> 00:08:17,920
That's because deeper networks have the ability to memorize more complicated features,
120
00:08:18,850 --> 00:08:21,580
and that's called the memorization capacity.
121
00:08:24,960 --> 00:08:28,230
But there's another way and that's called regularisation.
122
00:08:28,470 --> 00:08:35,550
I don't often recommend using less deep models to get better results, because
123
00:08:35,550 --> 00:08:40,440
there are other ways around that, and you want to actually always have a deep enough model to
124
00:08:40,440 --> 00:08:42,890
represent complicated patterns in your data.
125
00:08:43,260 --> 00:08:47,750
So let's see what techniques we can use to regularize.
126
00:08:47,820 --> 00:08:49,400
So what is regularisation?
127
00:08:49,400 --> 00:08:53,750
It's a method of making our model more general to a dataset.
128
00:08:53,760 --> 00:08:59,880
So basically regularisation will take a model that produces a decision boundary like this and sort of
129
00:08:59,880 --> 00:09:02,870
tweak it so that it becomes like this.
130
00:09:02,970 --> 00:09:05,150
Now, it's not literally doing that step by step,
131
00:09:05,150 --> 00:09:07,990
but that's the actual aim of regularisation.
132
00:09:08,010 --> 00:09:11,240
We basically want to get a model like this and not like this.
133
00:09:11,280 --> 00:09:13,000
And let's find out how we do that.
134
00:09:14,750 --> 00:09:18,090
So there are a few types of regularisation; there are actually more than this,
135
00:09:18,110 --> 00:09:19,800
but these are the basic types.
136
00:09:19,800 --> 00:09:28,010
There's L1 and L2 regularisation, cross-validation, early stopping, dropout and data augmentation.
137
00:09:28,040 --> 00:09:31,030
So let's look at L1 and L2 regularisation.
138
00:09:31,070 --> 00:09:37,640
These are techniques we use to penalize large weights. Large weights or gradients manifest themselves
139
00:09:37,640 --> 00:09:43,660
as abrupt changes in our model's decision boundary, and by penalizing them we're effectively making them smaller.
140
00:09:44,240 --> 00:09:45,850
L2 is known as ridge regression.
141
00:09:45,870 --> 00:09:47,900
L1 is lasso regression.
142
00:09:48,240 --> 00:09:50,570
Now, there is a lot more theory behind these things here.
143
00:09:50,720 --> 00:09:54,110
I'm just basically showing you the formulas of what they actually are.
144
00:09:54,590 --> 00:10:02,060
So as you can see, this is basically what L2 is here.
145
00:10:02,200 --> 00:10:03,760
This is MSE here.
146
00:10:03,950 --> 00:10:06,690
But we're actually doing something with it here.
147
00:10:06,830 --> 00:10:08,340
What are we doing?
148
00:10:08,360 --> 00:10:14,630
We're applying a constant here to the sum of the weights squared, and L1 is not squared;
149
00:10:14,690 --> 00:10:16,940
it's just the sum of the absolute values of the weights.
150
00:10:17,390 --> 00:10:26,180
So this parameter here controls the penalty we apply; via backpropagation, the penalty on the weights is applied
151
00:10:26,340 --> 00:10:28,320
to the weight updates.
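To make the two penalties concrete, here is a minimal sketch in plain Python of MSE plus an L1 or L2 term. The name `lam` for the penalty constant, the helper names, and the toy weights are assumptions for illustration; the exact constant on the slide (for example, whether it is halved) may differ.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length lists."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def l1_penalty(weights, lam):
    """L1 (lasso) term: constant times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 (ridge) term: constant times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def regularized_loss(y_true, y_pred, weights, lam, kind="l2"):
    """MSE plus the chosen penalty; `lam` controls how strong the penalty is."""
    if kind == "l2":
        penalty = l2_penalty(weights, lam)
    else:
        penalty = l1_penalty(weights, lam)
    return mse(y_true, y_pred) + penalty

# Hypothetical weights: one large weight dominates the L2 term.
weights = [0.5, -2.0, 0.0]
print(l1_penalty(weights, 0.1), l2_penalty(weights, 0.1))
```

Because the L2 term squares each weight, the single large weight contributes far more to it than to the L1 term, which is the sense in which large weights get pushed down.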
152
00:10:28,350 --> 00:10:34,150
So the difference between them basically is that L1 brings the weights of unimportant features to zero,
153
00:10:34,920 --> 00:10:37,220
thus acting as a feature selection algorithm.
154
00:10:37,290 --> 00:10:42,720
You may not know what feature selection is, but if you want to know what feature selection is, it's basically
155
00:10:43,110 --> 00:10:43,960
trying to find out, when
156
00:10:44,010 --> 00:10:48,230
we have like 20 inputs, which input is most important to our output.
157
00:10:48,240 --> 00:10:52,160
It's also used on sparse models as well.
158
00:10:52,560 --> 00:10:56,550
Whereas L2 penalises large weights even more, but doesn't bring them down to zero.
159
00:10:56,710 --> 00:10:57,630
OK.
160
00:10:58,260 --> 00:11:04,950
So effectively what we're doing here, just remember: L1 and L2 prevent the weights from being too large, so that we don't
161
00:11:04,950 --> 00:11:11,830
have abrupt changes in our model. Abrupt changes basically mean things like, if we go back to here,
162
00:11:13,360 --> 00:11:18,080
you basically try to get this, instead of having these abrupt gradient changes here.
163
00:11:21,290 --> 00:11:23,610
So let's go into cross-validation now.
164
00:11:23,630 --> 00:11:29,720
Cross-validation is something I rarely ever use now. I used to use
165
00:11:29,720 --> 00:11:30,140
it a lot
166
00:11:30,140 --> 00:11:35,000
doing previous machine learning stuff, but in deep learning I don't use it often. But if you want to know
167
00:11:35,000 --> 00:11:43,160
what it is, it's quite simple actually. Basically cross-validation, and that is k-fold cross-validation,
168
00:11:43,730 --> 00:11:49,650
is the way we split our training dataset into different folds, and we train on
169
00:11:49,730 --> 00:11:54,000
those folds and we basically test on the other folds afterward.
170
00:11:54,500 --> 00:12:00,600
So let's say this here is our test fold and these are our training set.
171
00:12:00,740 --> 00:12:08,650
So what happens is that we train on these four folds here and we test on this one, then we train on these four and
172
00:12:08,750 --> 00:12:09,070
test on this one.
173
00:12:09,080 --> 00:12:15,190
And what this does is that we don't actually have any unseen data in this model.
174
00:12:15,260 --> 00:12:21,410
What we're doing is just continuously testing on one segment of the data, then training on other segments
175
00:12:21,410 --> 00:12:25,320
of the data and testing on a different segment.
176
00:12:25,370 --> 00:12:30,630
It reduces overfitting, yes, but it also slows down the training process.
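The fold rotation described here can be sketched in a few lines of plain Python. The function name `k_fold_indices` and the choice of contiguous, unshuffled folds are assumptions for illustration; real pipelines usually shuffle first.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) once per fold, using contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        # Everything outside the current test fold is used for training.
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Hypothetical dataset of 10 samples split into 5 folds.
for train_idx, test_idx in k_fold_indices(10, k=5):
    print("test fold:", test_idx, "training samples:", len(train_idx))
```

Each sample lands in the test fold exactly once, and the model is trained k times, which is where the extra training cost comes from.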
177
00:12:32,210 --> 00:12:36,030
Now, early stopping is something we can actually do automatically in Keras.
178
00:12:36,080 --> 00:12:43,070
Basically we can set something in Keras that says: if our loss stops decreasing, stop training.
179
00:12:43,290 --> 00:12:50,600
So if we set our model for 100 epochs, but say at epoch number twenty-something the loss stops decreasing,
180
00:12:51,080 --> 00:12:55,030
it's going to stop training somewhere around here when it realizes that, and it's going to give you this
181
00:12:55,040 --> 00:12:55,670
model here,
182
00:12:55,670 --> 00:12:58,670
the best one with the best loss.
183
00:12:58,700 --> 00:13:05,840
The reason we do early stopping is that when we keep continually training, over and over, for a number of
184
00:13:05,900 --> 00:13:08,430
epochs on our training data,
185
00:13:08,510 --> 00:13:11,840
it tends to basically overfit on that data.
186
00:13:11,840 --> 00:13:14,540
So we need to actually stop it early sometimes,
187
00:13:14,760 --> 00:13:18,610
so that overfitting does not occur.
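The stopping rule itself is simple enough to sketch by hand. This is not Keras's own implementation, just a plain-Python illustration; the function name `early_stopping_epoch`, the `patience` parameter and the toy loss values are assumptions.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) as 0-based indices into `val_losses`.

    Training stops once the loss has failed to improve for `patience`
    epochs in a row; the model from `best_epoch` is the one to keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # The loss never stalled for long enough, so we ran every epoch.
    return best_epoch, len(val_losses) - 1

# Hypothetical per-epoch losses: improvement stops after the third epoch.
losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 1.1]
best, stopped = early_stopping_epoch(losses, patience=3)
print("keep model from epoch", best, "stopped at epoch", stopped)
```

Keeping the model from the best epoch, rather than the last one trained, matches the idea of getting back "the best one with the best loss".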
188
00:13:20,040 --> 00:13:21,670
And let's talk about dropout.
189
00:13:21,710 --> 00:13:26,000
It is actually very easy to implement and extremely useful.
190
00:13:26,520 --> 00:13:28,610
So dropout refers to dropping nodes,
191
00:13:28,700 --> 00:13:33,290
hidden and visible, in a neural network with the aim of reducing overfitting.
192
00:13:33,530 --> 00:13:35,240
What do we mean by dropping nodes?
193
00:13:35,460 --> 00:13:40,830
What it means is that in the training process certain parts of the network are ignored during a forward
194
00:13:40,830 --> 00:13:42,680
and back propagations.
195
00:13:42,720 --> 00:13:48,660
This is a way of actually making the network have some redundancy, in a way, but what it also does is that
196
00:13:48,660 --> 00:13:53,100
it actually adds regularisation to our networks.
197
00:13:53,160 --> 00:13:59,030
It helps reduce the interdependency between neurons as well.
198
00:13:59,450 --> 00:14:04,840
So it actually leads to more robust and meaningful features. In dropout
199
00:14:04,910 --> 00:14:10,580
there's one parameter we use; it's called p, and p is the probability that the nodes are kept or dropped out
200
00:14:11,270 --> 00:14:12,930
in the training process.
201
00:14:12,980 --> 00:14:18,620
So one of the consequences of using dropout is that it almost doubles the time needed to converge
202
00:14:18,620 --> 00:14:21,090
during training.
203
00:14:21,110 --> 00:14:23,960
So this is a good illustration of dropout.
204
00:14:24,020 --> 00:14:26,680
This is a standard neural network when we train it.
205
00:14:26,990 --> 00:14:30,670
And after applying dropout, let's say with a fairly high value of p,
206
00:14:31,100 --> 00:14:33,060
these nodes here are ignored,
207
00:14:33,650 --> 00:14:38,040
and these nodes are used in training.
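Here is a small sketch of what dropping nodes can look like on one layer's outputs, in plain Python. It uses the common "inverted dropout" variant, where surviving activations are rescaled during training; that variant, the name `p_drop` (the probability a node is dropped), and the toy activations are assumptions, since the lecture only says p is the keep or drop probability.

```python
import random

def dropout(activations, p_drop=0.5, training=True, rng=None):
    """Inverted dropout on a list of activations (requires 0 <= p_drop < 1).

    During training each value is zeroed with probability `p_drop`, and the
    survivors are scaled by 1 / (1 - p_drop) so the expected sum is unchanged.
    At test time the activations pass through untouched.
    """
    if not training or p_drop == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# Hypothetical outputs of one hidden layer with four nodes.
layer_output = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(layer_output, p_drop=0.5, rng=random.Random(0))
print(dropped)
print(dropout(layer_output, training=False))
```

A different random subset of nodes is ignored on every call during training, so no single node can be relied on too heavily.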
208
00:14:38,140 --> 00:14:44,170
And lastly, well, not the last method of regularisation, but the last one I'll teach in this course, because
209
00:14:44,170 --> 00:14:48,690
the others are actually quite exotic and not commonly used.
210
00:14:48,730 --> 00:14:51,890
So this one is called data augmentation.
211
00:14:51,890 --> 00:14:55,230
Remember I said you need lots of data to train a network.
212
00:14:55,450 --> 00:15:02,800
What if you don't have it? Well, computer vision especially lends itself naturally to data augmentation.
213
00:15:02,800 --> 00:15:06,570
What we do is we take a dataset; say we have one picture of a dog.
214
00:15:06,910 --> 00:15:09,910
How about we just make some manipulations to this image.
215
00:15:09,910 --> 00:15:13,050
We can rotate it completely.
216
00:15:13,390 --> 00:15:19,450
I think this one is just mirrored here, and this one is actually zoomed in a bit as well.
217
00:15:19,960 --> 00:15:24,400
So that is how we can actually expand the dataset.
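As a toy illustration of those manipulations, here is a sketch that treats an image as a nested list of pixel values. The helper names and the 3x3 "image" are assumptions, and the centre crop only stands in for zooming, since a real zoom would also resize the crop back to the original size.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def center_crop(image, margin=1):
    """Cut `margin` pixels off every side, a crude stand-in for zooming in."""
    return [row[margin:len(row) - margin] for row in image[margin:len(image) - margin]]

# Hypothetical 3x3 grayscale "image"; a real one would be much larger.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# One original picture becomes several training examples.
augmented = [image, flip_horizontal(image), rotate_90_clockwise(image), center_crop(image)]
print(len(augmented))
```

Each transformed copy keeps the original label, so one dog picture yields several dog examples for training.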