1
00:00:00,390 --> 00:00:00,710
OK.

2
00:00:00,750 --> 00:00:02,420
So welcome to Section Six point.

3
00:00:02,550 --> 00:00:05,970
Actually, I just corrected it; in the last section I had a 6.7 here.

4
00:00:06,060 --> 00:00:06,910
My bad.

5
00:00:07,320 --> 00:00:11,500
So this section deals with regularisation, which is a very important concept.

6
00:00:11,520 --> 00:00:16,740
Also, you're going to understand what overfitting is, why it's bad, and why we need to have a model

7
00:00:16,920 --> 00:00:21,090
generalize well, and you'll understand basically what a test dataset is.

8
00:00:21,090 --> 00:00:25,110
I think I mentioned it previously but I'll go into it a bit more here.

9
00:00:25,500 --> 00:00:30,920
Effectively, we want to know when and how our trained model becomes good.

10
00:00:32,310 --> 00:00:33,920
So what makes a good model?

11
00:00:33,930 --> 00:00:40,270
Now, this is a very, very basic explanation of what makes a good model: a good model is accurate, generalizes

12
00:00:40,350 --> 00:00:44,050
well, and does not overfit. The last two kind of mean the same thing.

13
00:00:44,080 --> 00:00:46,130
You'll understand that shortly.

14
00:00:46,290 --> 00:00:52,150
And I deliberately made the slide vague, because accuracy all depends on the domain

15
00:00:52,160 --> 00:00:55,660
you're looking at. Sometimes you want 99.99 percent accuracy.

16
00:00:55,830 --> 00:00:57,050
Sometimes you can.

17
00:00:57,180 --> 00:00:59,060
You can be happy with 80 percent accuracy.

18
00:00:59,070 --> 00:01:00,490
It all depends on the application.

19
00:01:03,020 --> 00:01:05,780
So let's look at the models here.

20
00:01:05,900 --> 00:01:08,990
Let's look at these two classes, one in green, one in blue.

21
00:01:09,410 --> 00:01:17,360
And this is model A, model B, and model C. The red line here is basically the decision boundary for each

22
00:01:17,540 --> 00:01:19,010
dataset here.

23
00:01:19,330 --> 00:01:21,540
Each model, I should say.

24
00:01:23,280 --> 00:01:29,430
So how do you know which model is best? Intuitively, what would you say is the best model

25
00:01:29,430 --> 00:01:29,980
here?

26
00:01:30,030 --> 00:01:32,630
Now let's look at model A closely.

27
00:01:32,890 --> 00:01:36,350
Model A actually separates all the data accurately.

28
00:01:36,510 --> 00:01:43,890
It sees a blue ball over here and actually adjusts its decision boundary to encapsulate it. Model B

29
00:01:43,890 --> 00:01:49,530
basically doesn't do that. Model B basically draws a nice smooth curve here; it doesn't push itself

30
00:01:49,560 --> 00:01:54,480
all the way out here to capture this blue ball, and basically it forms a nice, clean decision boundary

31
00:01:54,480 --> 00:01:55,150
here.

32
00:01:55,260 --> 00:02:01,680
Model C takes a much more simplistic approach, giving you a straight line separating these classes

33
00:02:01,680 --> 00:02:02,050
here.

34
00:02:02,230 --> 00:02:05,790
Now, what would you say is the best model here?

35
00:02:06,150 --> 00:02:13,080
Now, I would say B, and I'll tell you why. Even though B doesn't capture the blue ball here, as you can

36
00:02:13,080 --> 00:02:17,580
see from the nature of this data, the blue ball technically is in the green zone.

37
00:02:18,050 --> 00:02:18,320
Yeah.

38
00:02:18,390 --> 00:02:20,310
So this ball here is an anomaly.

39
00:02:20,310 --> 00:02:21,920
He's basically an outlier.

40
00:02:22,260 --> 00:02:25,790
Chances are he may have ended up here from being mislabeled.

41
00:02:25,800 --> 00:02:29,470
Maybe he was supposed to be a green ball, or maybe he's just highly unusual.

42
00:02:29,670 --> 00:02:35,430
And generally we don't want our models to basically cover this blue ball here.

43
00:02:35,640 --> 00:02:43,960
This is called overfitting, and it's bad because of the most likely scenario:

44
00:02:44,080 --> 00:02:45,470
Green balls are going to be right here.

45
00:02:45,540 --> 00:02:48,090
So what happens when a green ball is here, an unseen

46
00:02:48,110 --> 00:02:54,030
green ball in the future, where we feed this green ball's x-y coordinates, which are right here,

47
00:02:54,530 --> 00:02:55,860
into this model?

48
00:02:55,860 --> 00:02:59,630
This overfitting model is going to label it as blue.

49
00:02:59,730 --> 00:03:06,600
Unfortunately, in reality, if you look at this nice, clean model B boundary here, it is supposed to

50
00:03:06,600 --> 00:03:09,330
be green, the green class.

51
00:03:09,570 --> 00:03:16,080
So that's an example of a model that is overfit, probably too complicated for its own good, as opposed

52
00:03:16,080 --> 00:03:19,610
to model B, which generalizes well. Model C,

53
00:03:19,650 --> 00:03:23,380
on the other hand, is way too general, and it's not going to be a good model.

54
00:03:27,480 --> 00:03:32,190
As I explained before, this model overfits, and this model is the ideal balance.

55
00:03:32,190 --> 00:03:33,990
And this is what's called underfitting.

56
00:03:34,110 --> 00:03:35,250
He doesn't fit the data.

57
00:03:35,430 --> 00:03:38,930
Well, he just gives you a generic boundary and tells you,

58
00:03:39,210 --> 00:03:43,020
"Yeah, I tried my best," but it's underfitting.

59
00:03:43,100 --> 00:03:47,550
So let's go into overfitting. Overfitting is what leads to poor models.

60
00:03:47,570 --> 00:03:54,650
And that's one of the most common problems faced in machine learning, and that's a problem I face continuously

61
00:03:54,650 --> 00:03:57,500
when training my convolutional neural nets.

62
00:03:57,890 --> 00:04:01,570
It always happens; it always tends to overfit on the training data.

63
00:04:01,940 --> 00:04:04,960
And we will experience that later on in this course.

64
00:04:05,390 --> 00:04:12,450
But what it means, basically, is that overfitting is when our model fits the training data perfectly, as in

65
00:04:12,710 --> 00:04:19,370
he has very high accuracy on the data he was trained on, maybe even in the high 90s, 99.9

66
00:04:19,370 --> 00:04:20,820
something.

67
00:04:20,820 --> 00:04:27,740
However, on the test dataset, which is the unseen data, he is going to perform poorly because he has

68
00:04:27,740 --> 00:04:33,710
now basically modeled himself on the training data but can't generalize well to data he hasn't seen.

69
00:04:34,190 --> 00:04:39,100
So what happens when you try to pass in a new point at this position here?

70
00:04:39,290 --> 00:04:40,910
Exactly what I mentioned before.

71
00:04:41,450 --> 00:04:45,290
Its true color is supposed to be green, but it will be classified as blue.

72
00:04:45,590 --> 00:04:48,020
Models don't necessarily need to be too complex to be good.

73
00:04:48,030 --> 00:04:54,330
They need to generalize well. So how do you know if you've overfit?

74
00:04:54,820 --> 00:04:59,130
Well, that's why I mentioned test data previously, at the beginning of the slides.

75
00:04:59,350 --> 00:05:05,420
We need to test a model on test data, and that goes for all machine learning algorithms and stuff.

76
00:05:05,560 --> 00:05:08,040
We always use a test dataset.

77
00:05:08,560 --> 00:05:15,230
And basically, if we have an entire dataset of, say, 1,000 images, we take 700 and we train on

78
00:05:15,250 --> 00:05:20,310
those 700 labeled images, and we reserve 300 as test data.

79
00:05:20,440 --> 00:05:27,930
Test data is critical because it tells us how well our algorithm or model performs on data

80
00:05:28,030 --> 00:05:29,680
The model has never seen before.
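
The 700/300 split described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `train_test_split`, the `test_fraction` and `seed` parameters, and the use of integer image IDs are all made up here and are not taken from the course code.

```python
import random

def train_test_split(samples, test_fraction=0.3, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test) lists."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled items are held out; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset: 1,000 image IDs standing in for labeled images.
image_ids = list(range(1000))
train_ids, test_ids = train_test_split(image_ids)
print(len(train_ids), len(test_ids))
```

Shuffling before splitting matters when the files are stored class by class; otherwise the held-out portion could contain only one class.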

81
00:05:31,970 --> 00:05:36,120
This is a very common case of overfitting: 95 percent plus accuracy on the training data.

82
00:05:36,260 --> 00:05:38,690
But on test data you get like 70 percent.

83
00:05:38,690 --> 00:05:41,500
It's a perfect example of overfitting.

84
00:05:42,280 --> 00:05:44,260
So, overfitting illustrated graphically.

85
00:05:44,330 --> 00:05:45,810
Here's what happens.

86
00:05:45,890 --> 00:05:48,390
I just mentioned epochs. Now,

87
00:05:48,920 --> 00:05:54,040
I don't think I've discussed in this chapter exactly what epochs are.

88
00:05:54,320 --> 00:06:00,750
Basically, every time we send the full training dataset through our training algorithm,

89
00:06:00,860 --> 00:06:06,680
we have completed one epoch, and we need to train for maybe hundreds of epochs sometimes to get a good model.

90
00:06:06,770 --> 00:06:07,790
Usually that's not the case.

91
00:06:07,790 --> 00:06:13,490
Usually you can get away with training for 20 or so epochs, but generally that's what we have to do to get the

92
00:06:13,490 --> 00:06:15,960
best models.

93
00:06:15,980 --> 00:06:22,120
So this is an illustration of what overfitting looks like.

94
00:06:22,130 --> 00:06:29,360
So look at the training loss here in red, and the accuracy. Loss is going down quite well; accuracy is going up

95
00:06:29,360 --> 00:06:30,620
close to 100 percent.

96
00:06:30,640 --> 00:06:37,370
The scale of accuracy is on this side, loss is on this side, and epochs are on the x-axis. And now look at the

97
00:06:37,370 --> 00:06:43,750
test loss: it fluctuates between 1 and 1.5 and actually goes up in the end.

98
00:06:43,820 --> 00:06:44,970
It's not good at all.

99
00:06:45,380 --> 00:06:51,700
And look at our test accuracy: he's hovering at abysmal rates of, like, below 50 percent.

100
00:06:52,190 --> 00:06:57,800
And meanwhile his training accuracy is at 100 percent. This is actually extreme overfitting, to

101
00:06:57,800 --> 00:07:00,240
be fair; it doesn't actually have to get this bad.

102
00:07:00,320 --> 00:07:02,370
At least I've never gotten it to be this bad.

103
00:07:02,510 --> 00:07:05,850
That is a good example of what we actually see happening in the real world.

104
00:07:06,290 --> 00:07:13,100
Good training accuracy, poor test accuracy, and that's overfitting.
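
That rule of thumb, good training accuracy but poor test accuracy, can be written as a tiny check. The sketch below is only illustrative: the helper names and the 10-point `max_gap` threshold are assumptions chosen for the example, not a standard cutoff.

```python
def accuracy_gap(train_accuracy, test_accuracy):
    """Return how far training accuracy sits above test accuracy."""
    return train_accuracy - test_accuracy

def looks_overfit(train_accuracy, test_accuracy, max_gap=0.10):
    """Flag a model whose train/test gap exceeds max_gap (an assumed threshold)."""
    return accuracy_gap(train_accuracy, test_accuracy) > max_gap

# The numbers mentioned earlier: 95 percent on training data, about 70 on test data.
print(looks_overfit(0.95, 0.70))
```

In practice the acceptable gap depends on the application, just like the target accuracy does.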

105
00:07:13,120 --> 00:07:15,090
So how do we avoid overfitting?

106
00:07:15,360 --> 00:07:16,630
There are many techniques to avoid it.

107
00:07:16,630 --> 00:07:19,030
And I'll discuss them shortly

108
00:07:19,060 --> 00:07:26,200
in the slides in this section. Now, overfitting is a consequence of our weights being tuned to

109
00:07:26,200 --> 00:07:32,080
fit our training data too closely, in a way that they don't perform well on test data.

110
00:07:32,740 --> 00:07:39,310
And we know we have a decent model; it's just too sensitive to the training data.

111
00:07:39,340 --> 00:07:40,990
So is there a way to fix this?

112
00:07:41,040 --> 00:07:42,420
I mean, way too sensitive.

113
00:07:42,430 --> 00:07:46,750
I mean, it's just optimized exclusively for the training data.

114
00:07:49,660 --> 00:07:55,320
So we can avoid overfitting by using a smaller, less deep model.

115
00:07:55,820 --> 00:08:01,720
Deeper models can sometimes find features in noise, or interpret noise to be important. An example of noise was this

116
00:08:01,720 --> 00:08:02,210
here.

117
00:08:03,000 --> 00:08:08,790
This clearly wasn't that important, and yet a deep model will actually try to figure out and model this

118
00:08:10,810 --> 00:08:11,740
in the data.

119
00:08:11,830 --> 00:08:17,920
That's because deeper networks have the ability to memorize more complicated features,

120
00:08:18,850 --> 00:08:21,580
and that's called the memorization capacity.

121
00:08:24,960 --> 00:08:28,230
But there's another way and that's called regularisation.

122
00:08:28,470 --> 00:08:35,550
I don't often recommend using less deep models to get better results, because

123
00:08:35,550 --> 00:08:40,440
there are other ways around that, and you actually always want to have a deep enough model to

124
00:08:40,440 --> 00:08:42,890
represent complicated patterns in your data.

125
00:08:43,260 --> 00:08:47,750
So let's see what techniques we can use to regularize.

126
00:08:47,820 --> 00:08:49,400
So what is regularisation?

127
00:08:49,400 --> 00:08:53,750
It's a method of making our model more general to a dataset.

128
00:08:53,760 --> 00:08:59,880
So basically, regularisation will take a model that produces a decision boundary like this and sort of

129
00:08:59,880 --> 00:09:02,870
tweak it so that it becomes like this.

130
00:09:02,970 --> 00:09:05,150
Now, it's not literally doing exactly that.

131
00:09:05,150 --> 00:09:07,990
But that's the actual aim of regularisation.

132
00:09:08,010 --> 00:09:11,240
We basically want to get a model like this and not like this.

133
00:09:11,280 --> 00:09:13,000
And let's find out how we do that.

134
00:09:14,750 --> 00:09:18,090
So there are a few types of regularisation. There are actually more than this.

135
00:09:18,110 --> 00:09:19,800
But these are the basic types.

136
00:09:19,800 --> 00:09:28,010
There's L1 and L2 regularisation, cross-validation, early stopping, dropout, and data augmentation.

137
00:09:28,040 --> 00:09:31,030
So let's look at L1 and L2 regularisation.

138
00:09:31,070 --> 00:09:37,640
These are techniques we use to penalize large weights. Large weights or gradients manifest themselves

139
00:09:37,640 --> 00:09:43,660
as abrupt changes in our model's decision boundary, and by penalizing them we're effectively making them smaller.

140
00:09:44,240 --> 00:09:45,850
L2 is known as ridge regression.

141
00:09:45,870 --> 00:09:47,900
L1 is lasso regression.

142
00:09:48,240 --> 00:09:50,570
Now, there is a lot more theory behind these things here.

143
00:09:50,720 --> 00:09:54,110
I'm just basically showing you the formulas of what they actually are.

144
00:09:54,590 --> 00:10:02,060
So as you can see, this is basically what L2 is here.

145
00:10:02,200 --> 00:10:03,760
This is MSE here.

146
00:10:03,950 --> 00:10:06,690
But we're actually doing something with it here.

147
00:10:06,830 --> 00:10:08,340
What are we doing?

148
00:10:08,360 --> 00:10:14,630
We're applying a constant here to the sum of the weights squared, and L1 is not squared.

149
00:10:14,690 --> 00:10:16,940
It's just the sum of the absolute values of the weights.

150
00:10:17,390 --> 00:10:26,180
So this parameter here controls the penalty we apply; via backpropagation, the penalty is applied

151
00:10:26,340 --> 00:10:28,320
to the weight updates.
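
The slide formulas are not captured in the transcript, so the following is a reconstruction under assumptions rather than a copy of the slide: $\lambda$ is used here as the name of the penalty constant, $w_i$ for the individual weights, and $\mathrm{MSE}$ for the unregularised loss. Some texts write the L2 constant as $\lambda/2$, which only rescales $\lambda$.

```latex
% L2 (ridge): add the sum of squared weights to the loss
\mathcal{L}_{\mathrm{L2}} = \mathrm{MSE} + \lambda \sum_{i} w_i^{2}

% L1 (lasso): add the sum of absolute weights to the loss
\mathcal{L}_{\mathrm{L1}} = \mathrm{MSE} + \lambda \sum_{i} \lvert w_i \rvert

% Extra gradient term each penalty contributes during backpropagation
\frac{\partial}{\partial w_i}\,\lambda \sum_{j} w_j^{2} = 2\lambda w_i
\qquad
\frac{\partial}{\partial w_i}\,\lambda \sum_{j} \lvert w_j \rvert = \lambda\,\operatorname{sign}(w_i) \quad (w_i \neq 0)
```

The gradient terms show the difference in behaviour: the L2 term shrinks a weight in proportion to its size, while the L1 term pushes every non-zero weight toward zero by the same fixed amount, which is why small weights can end up exactly at zero.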

152
00:10:28,350 --> 00:10:34,150
So the difference between them, basically, is that L1 brings the weights of unimportant features to zero,

153
00:10:34,920 --> 00:10:37,220
thus acting as a feature selection algorithm.

154
00:10:37,290 --> 00:10:42,720
You may not know what feature selection is, but if you want to know what feature selection is, it's basically

155
00:10:43,110 --> 00:10:43,960
trying to find out, if

156
00:10:44,010 --> 00:10:48,230
we have, like, 20 inputs, which input is most important to our

157
00:10:48,240 --> 00:10:52,160
output. It also produces sparse models as well.

158
00:10:52,560 --> 00:10:56,550
Whereas L2 penalises large weights even more, but doesn't bring them down to zero.

159
00:10:56,710 --> 00:10:57,630
OK.

160
00:10:58,260 --> 00:11:04,950
So effectively, what we're doing here, just remember, is that L1 and L2 prevent the weights from being too large, so that we don't

161
00:11:04,950 --> 00:11:11,830
have abrupt changes in our model. Abrupt changes basically mean things like this; let me go back to here.

162
00:11:13,360 --> 00:11:18,080
We're basically trying to make this, instead of having these abrupt gradient changes here.

163
00:11:21,290 --> 00:11:23,610
So let's go into cross-validation now.

164
00:11:23,630 --> 00:11:29,720
Cross-validation is something I rarely ever use, not because I'm not used to using it; I used to use

165
00:11:29,720 --> 00:11:30,140
it a lot.

166
00:11:30,140 --> 00:11:35,000
Doing previous machine learning stuff, but in deep learning I don't use it often. But if you want to know

167
00:11:35,000 --> 00:11:43,160
what it is, it's quite simple, actually. Basically, cross-validation, and that is k-fold cross-validation,

168
00:11:43,730 --> 00:11:49,650
is the way we split our dataset, our training dataset, into different folds, and we train

169
00:11:49,730 --> 00:11:54,000
on those folds, and we basically test on the other folds afterward.

170
00:11:54,500 --> 00:12:00,600
So let's look at this. This here is our test fold, or validation set, and this is our training set.

171
00:12:00,740 --> 00:12:08,650
So what happens is that we train on these four folds here and we test on this one; then we train on these four and

172
00:12:08,750 --> 00:12:09,070
test on the remaining one.

173
00:12:09,080 --> 00:12:15,190
And what this does is that we don't actually have any unseen data in this model.

174
00:12:15,260 --> 00:12:21,410
What we're doing is just continuously testing on segments of the data, and then training on segments

175
00:12:21,410 --> 00:12:25,320
of the data and testing on a different segment.

176
00:12:25,370 --> 00:12:30,630
It reduces overfitting, yes, but it also slows down the training process.
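
The fold rotation described above (train on four folds, test on the fifth, then rotate) can be sketched as an index generator. This is a minimal plain-Python illustration; the function name `k_fold_indices` and the contiguous, unshuffled folds are assumptions for the example, and real pipelines usually shuffle first.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Ten samples, five folds: each fold holds out two samples for testing.
for fold, (train, test) in enumerate(k_fold_indices(10, k=5)):
    print(fold, test, train)
```

Because the model is trained k times, once per fold, the total training cost grows roughly k-fold, which is the slowdown mentioned above.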

177
00:12:32,210 --> 00:12:36,030
Now, early stopping is something we can actually do automatically in Keras.

178
00:12:36,080 --> 00:12:43,070
Basically, we can set something in Keras that says: if our loss stops decreasing, stop training.

179
00:12:43,290 --> 00:12:50,600
So if we set our model for 100 epochs, but say at epoch number twenty-something the loss stops decreasing,

180
00:12:51,080 --> 00:12:55,030
it's going to stop training somewhere around here when he realizes that, and it's going to give you this

181
00:12:55,040 --> 00:12:55,670
model here.

182
00:12:55,670 --> 00:12:58,670
The best one with the best loss.

183
00:12:58,700 --> 00:13:05,840
The reason we do early stopping is that sometimes we keep continually training, over and over, for a number of

184
00:13:05,900 --> 00:13:08,430
epochs on our training data.

185
00:13:08,510 --> 00:13:11,840
It tends to basically overfit on that data.

186
00:13:11,840 --> 00:13:14,540
So we need to actually stop it early sometimes.

187
00:13:14,760 --> 00:13:18,610
That way, at least, overfitting does not occur.
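
The stopping rule can be sketched without any framework. Keras provides an `EarlyStopping` callback for this purpose; the code below is not that callback, just a plain-Python illustration of the same idea. The function name, the `patience` value, and the loss numbers are all made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of per-epoch losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the model from best_epoch is the one to keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # The loss never stalled for `patience` epochs, so training ran to the end.
    return len(val_losses) - 1, best_epoch

# Hypothetical losses: improvement stalls after the third epoch (index 2).
losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 1.1]
print(early_stopping_epoch(losses, patience=3))  # (5, 2)
```

Monitoring a held-out loss rather than the training loss is the usual choice, since the training loss tends to keep falling even while the model overfits.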

188
00:13:20,040 --> 00:13:21,670
And let's talk about dropout.

189
00:13:21,710 --> 00:13:26,000
It is actually very easy to implement and extremely useful.

190
00:13:26,520 --> 00:13:28,610
So dropout refers to dropping nodes.

191
00:13:28,700 --> 00:13:33,290
Hidden and visible, in a neural network, with the aim of reducing overfitting.

192
00:13:33,530 --> 00:13:35,240
What do I mean by dropping nodes?

193
00:13:35,460 --> 00:13:40,830
What it means is that, in the training process, certain parts of the network are ignored during forward

194
00:13:40,830 --> 00:13:42,680
and back propagation.

195
00:13:42,720 --> 00:13:48,660
This is a way of actually making the network have some redundancy, in a way, but what it also does is that

196
00:13:48,660 --> 00:13:53,100
it actually adds regularisation to our networks.

197
00:13:53,160 --> 00:13:59,030
It helps reduce the interdependency between neurons as well.

198
00:13:59,450 --> 00:14:04,840
So it actually leads to more robust and meaningful features. In dropout,

199
00:14:04,910 --> 00:14:10,580
there's one parameter we use. It's called p, and p is the probability that the nodes are kept or dropped out

200
00:14:11,270 --> 00:14:12,930
in the training process.

201
00:14:12,980 --> 00:14:18,620
So one of the consequences of using dropout is that it almost doubles the time to converge

202
00:14:18,620 --> 00:14:21,090
during training.

203
00:14:21,110 --> 00:14:23,960
So this is a good illustration of dropout.

204
00:14:24,020 --> 00:14:26,680
This is a standard neural network when we train it.

205
00:14:26,990 --> 00:14:30,670
And this is after applying dropout, let's say with a fairly high value.

206
00:14:31,100 --> 00:14:33,060
These nodes here are ignored.

207
00:14:33,650 --> 00:14:38,040
And these nodes are used in training.
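
To make "ignoring nodes" concrete, here is a small sketch of dropout applied to one layer's activations during training. It uses the common "inverted dropout" variant, which rescales the kept activations; that choice, the function name, and treating the parameter as a drop probability (`drop_prob`) rather than a keep probability are assumptions for this example, since conventions for p differ between libraries and papers.

```python
import random

def dropout(activations, drop_prob=0.5, rng=None):
    """Zero each activation with probability drop_prob and rescale the rest.

    Rescaling by 1 / keep_prob keeps the expected activation unchanged,
    so nothing needs to be adjusted when dropout is switched off at test time.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

# Hypothetical layer output: eight activations, half dropped on average.
layer_output = [0.2, 1.5, 0.7, 0.0, 2.1, 0.9, 1.1, 0.4]
print(dropout(layer_output, drop_prob=0.5, rng=random.Random(0)))
```

A fresh random mask is drawn on every training step, so a different subset of nodes is ignored each time.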

208
00:14:38,140 --> 00:14:44,170
And lastly, well, not the last method of regularisation, but the last one I'll teach in this course, because

209
00:14:44,170 --> 00:14:48,690
the others are actually quite exotic and not commonly used.

210
00:14:48,730 --> 00:14:51,890
So this one is called data augmentation.

211
00:14:51,890 --> 00:14:55,230
Remember I said you need lots of data to train a network?

212
00:14:55,450 --> 00:15:02,800
What if you don't have it? Well, computer vision especially lends itself naturally to data augmentation.

213
00:15:02,800 --> 00:15:06,570
What we do is we take a dataset; say we have one picture of a dog.

214
00:15:06,910 --> 00:15:09,910
How about we just make some manipulations to this image?

215
00:15:09,910 --> 00:15:13,050
We can rotate it completely.

216
00:15:13,390 --> 00:15:19,450
I think this one is just mirrored here, and this one is actually zoomed in a bit as well.

217
00:15:19,960 --> 00:15:24,400
So that is a way we can actually expand the dataset.
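
The manipulations mentioned above (mirroring, rotating, zooming) can be sketched on a tiny image stored as a nested list of pixel values. This is an illustration only: the helper names are made up, a centre crop stands in for zooming, and a real pipeline would use an image library and also resize the crop back to the original size.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def center_crop(image, crop_height, crop_width):
    """Keep only the central region, a rough stand-in for zooming in."""
    top = (len(image) - crop_height) // 2
    left = (len(image[0]) - crop_width) // 2
    return [row[left:left + crop_width] for row in image[top:top + crop_height]]

# Hypothetical 4x4 grayscale "image" with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
augmented = [flip_horizontal(image), rotate_90_clockwise(image), center_crop(image, 2, 2)]
print(len(augmented), "extra training images from one original")
```

Each transformed copy keeps the original label, so one labeled picture of a dog becomes several labeled training examples.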