1 00:00:00,390 --> 00:00:00,710 OK. 2 00:00:00,750 --> 00:00:02,420 So welcome to Section Six point. 3 00:00:02,550 --> 00:00:05,970 Actually, I just corrected it from the last section; I had a six point seven here. 4 00:00:06,060 --> 00:00:06,910 My bad. 5 00:00:07,320 --> 00:00:11,500 So this section deals with regularisation, which is a very important concept. 6 00:00:11,520 --> 00:00:16,740 Also you're going to understand what overfitting is and why it's bad, and why we need to have a model 7 00:00:16,920 --> 00:00:21,090 generalize well, and you'll understand basically what a test dataset is. 8 00:00:21,090 --> 00:00:25,110 I think I mentioned it previously but I'll go into it a bit more here. 9 00:00:25,500 --> 00:00:30,920 Effectively we want to know when and how our trained model becomes good. 10 00:00:32,310 --> 00:00:33,920 So what makes a good model? 11 00:00:33,930 --> 00:00:40,270 Now this is a very, very basic explanation of what makes a good model: a good model is accurate, generalizes 12 00:00:40,350 --> 00:00:44,050 well and does not overfit. A couple of these kind of mean the same thing. 13 00:00:44,080 --> 00:00:46,130 You'll understand that shortly. 14 00:00:46,290 --> 00:00:52,150 And I deliberately made it slightly vague because accuracy all depends on the domain 15 00:00:52,160 --> 00:00:55,660 you're looking at. Sometimes you want 99.99 percent accuracy. 16 00:00:55,830 --> 00:00:57,050 Sometimes you can, 17 00:00:57,180 --> 00:00:59,060 you can be happy with 80 percent accuracy. 18 00:00:59,070 --> 00:01:00,490 It all depends on the application. 19 00:01:03,020 --> 00:01:05,780 So let's look at the models here. 20 00:01:05,900 --> 00:01:08,990 Let's look at these two classes, one in green, one in blue. 21 00:01:09,410 --> 00:01:17,360 And this is Model A, Model B, Model C. The red line here is basically the decision boundary for each 22 00:01:17,540 --> 00:01:19,010 set of data here.
23 00:01:19,330 --> 00:01:21,540 For each model, I should say. 24 00:01:23,280 --> 00:01:29,430 So how do you know which model, intuitively, what would you say is the best model 25 00:01:29,430 --> 00:01:29,980 here? 26 00:01:30,030 --> 00:01:32,630 Now let's look at Model A closely. 27 00:01:32,890 --> 00:01:36,350 Model A actually separates all the data accurately. 28 00:01:36,510 --> 00:01:43,890 It sees a blue ball over here and actually adjusts its decision boundary to encapsulate it. Model B, 29 00:01:43,890 --> 00:01:49,530 basically, doesn't do that. Model B basically draws a nice smooth curve here; it doesn't push itself 30 00:01:49,560 --> 00:01:54,480 all the way out here to capture this blue ball, and basically it forms a nice clean decision boundary 31 00:01:54,480 --> 00:01:55,150 here. 32 00:01:55,260 --> 00:02:01,680 Model C takes a much more simplistic approach, giving you a straight line separating these classes 33 00:02:01,680 --> 00:02:02,050 here. 34 00:02:02,230 --> 00:02:05,790 Now what would you say is the best model here? 35 00:02:06,150 --> 00:02:13,080 Now I would say B, and I'll tell you why. Even though B doesn't capture the blue ball here, as you can 36 00:02:13,080 --> 00:02:17,580 see from the nature of this data, the blue ball technically is in the green zone. 37 00:02:18,050 --> 00:02:18,320 Yeah. 38 00:02:18,390 --> 00:02:20,310 So this ball here is an anomaly. 39 00:02:20,310 --> 00:02:21,920 He's basically an outlier. 40 00:02:22,260 --> 00:02:25,790 Chances are that he may have ended up here from being mislabeled. 41 00:02:25,800 --> 00:02:29,470 Maybe he was supposed to be a green ball, or maybe he's just highly unusual. 42 00:02:29,670 --> 00:02:35,430 And generally we don't want our models to basically cover this blue ball here. 43 00:02:35,640 --> 00:02:43,960 This is called overfitting, and it's bad because, in the most likely case, the most likely scenario,
44 00:02:44,080 --> 00:02:45,470 green balls are going to be right here. 45 00:02:45,540 --> 00:02:48,090 So what happens when a green ball is here, an unseen 46 00:02:48,110 --> 00:02:54,030 green ball in the future, where we feed this green ball's x, y coordinates, which are right here, 47 00:02:54,530 --> 00:02:55,860 into this model? 48 00:02:55,860 --> 00:02:59,630 This overfitting model is going to label it as blue, 49 00:02:59,730 --> 00:03:06,600 unfortunately, when in reality, if you see it here with this nice clean Model B boundary, it is supposed to 50 00:03:06,600 --> 00:03:09,330 be green, the green class. 51 00:03:09,570 --> 00:03:16,080 So that's an example of a model that is overfit, probably too complicated for its own good, as opposed 52 00:03:16,080 --> 00:03:19,610 to Model B, which generalizes well. Model C, 53 00:03:19,650 --> 00:03:23,380 on the other hand, is way too general and it's not going to be a good model. 54 00:03:27,480 --> 00:03:32,190 As I explained before, this model overfits, this model is the ideal balance, 55 00:03:32,190 --> 00:03:33,990 and this is what's called underfitting. 56 00:03:34,110 --> 00:03:35,250 He doesn't fit the data 57 00:03:35,430 --> 00:03:38,930 well. He just gives you a generic boundary and tells you, 58 00:03:39,210 --> 00:03:43,020 yeah, I tried my best. But it's underfitting. 59 00:03:43,100 --> 00:03:47,550 So let's go into overfitting. Overfitting is what leads to poor models. 60 00:03:47,570 --> 00:03:54,650 And that's one of the most common problems faced in machine learning, and that's a problem I face continuously 61 00:03:54,650 --> 00:03:57,500 when training my convolutional neural nets. 62 00:03:57,890 --> 00:04:01,570 It always happens; it always tends to overfit on the training data. 63 00:04:01,940 --> 00:04:04,960 And we will experience that later on in this course.
64 00:04:05,390 --> 00:04:12,450 But what it means basically is: overfitting is when our model fits perfectly to the training data, as in 65 00:04:12,710 --> 00:04:19,370 he has very high accuracy on the data he was trained on, maybe even high 90s, ninety-nine point nine 66 00:04:19,370 --> 00:04:20,820 something. 67 00:04:20,820 --> 00:04:27,740 However, on the test dataset, which is the unseen data, he is going to perform poorly because he has 68 00:04:27,740 --> 00:04:33,710 now basically modeled the training data exactly but can't generalize well to data he hasn't seen. 69 00:04:34,190 --> 00:04:39,100 So what happens when you try to pass in a new point at this position here? 70 00:04:39,290 --> 00:04:40,910 Exactly what I mentioned before. 71 00:04:41,450 --> 00:04:45,290 Its true color is supposed to be green, but it will be classified as blue. 72 00:04:45,590 --> 00:04:48,020 Models don't necessarily need to be too complex to be good. 73 00:04:48,030 --> 00:04:54,330 They need to generalize well. So how do you know if you've overfit? 74 00:04:54,820 --> 00:04:59,130 Well, that's why I mentioned test data previously, at the beginning of the slides. 75 00:04:59,350 --> 00:05:05,420 We need to test a model on test data, and in all machine learning algorithms and stuff, 76 00:05:05,560 --> 00:05:08,040 we always use a test dataset. 77 00:05:08,560 --> 00:05:15,230 And basically, if we have an entire dataset of, say, a thousand images, we take seven hundred and we train on 78 00:05:15,250 --> 00:05:20,310 those 700 labeled images, and we reserve three hundred as test data. 79 00:05:20,440 --> 00:05:27,930 Test data is critical because it tells us how well our algorithm or model performs on data 80 00:05:28,030 --> 00:05:29,680 the model has never seen before. 81 00:05:31,970 --> 00:05:36,120 This is a very common case of overfitting: 95 percent plus accuracy on the training data. 82 00:05:36,260 --> 00:05:38,690 But on test data you get like 70 percent.
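The 700/300 split described here can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `train_test_split` and `accuracy_gap` are hypothetical helper names, not functions from the course or from any library, the integers stand in for labeled images, and the 0.95 and 0.70 figures are simply the ones quoted in the talk.

```python
import random


def train_test_split(samples, test_fraction=0.3, seed=0):
    """Shuffle a copy of `samples` and return (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy_gap(train_accuracy, test_accuracy):
    """A large positive gap is the usual symptom of overfitting."""
    return train_accuracy - test_accuracy


# Stand-in dataset: 1000 "images", represented here by their indices.
images = list(range(1000))
train_images, test_images = train_test_split(images, test_fraction=0.3)

print(len(train_images), len(test_images))
print(round(accuracy_gap(0.95, 0.70), 2))
```

The important property is that the two lists never overlap, so the test accuracy really is measured on data the model has not seen.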
83 00:05:38,690 --> 00:05:41,500 It's a perfect example of overfitting. 84 00:05:42,280 --> 00:05:44,260 So, overfitting illustrated graphically. 85 00:05:44,330 --> 00:05:45,810 Here's what happens. 86 00:05:45,890 --> 00:05:48,390 Now, I've mentioned epochs before. 87 00:05:48,920 --> 00:05:54,040 In this chapter I discuss exactly what epochs are. 88 00:05:54,320 --> 00:06:00,750 It's basically: every time we send the full training dataset into our training algorithm, 89 00:06:00,860 --> 00:06:06,680 we have completed one epoch. And we need to train for maybe hundreds of epochs sometimes to get a good model. 90 00:06:06,770 --> 00:06:07,790 Usually that's not the case. 91 00:06:07,790 --> 00:06:13,490 Usually you can get away with training 20 to 100 epochs, but generally that's what we have to do to get the 92 00:06:13,490 --> 00:06:15,960 best models. 93 00:06:15,980 --> 00:06:22,120 So this is an illustration of what overfitting looks like. 94 00:06:22,130 --> 00:06:29,360 So look at training loss here in red, and accuracy. Loss is going down quite well; accuracy is going up 95 00:06:29,360 --> 00:06:30,620 close to 100 percent. 96 00:06:30,640 --> 00:06:37,370 The scale of accuracy is on this side and loss is on this side, and epochs are on X. And now look at the test, the 97 00:06:37,370 --> 00:06:43,750 test loss. It fluctuates between 1 and 1.5 and actually goes up in the end. 98 00:06:43,820 --> 00:06:44,970 It's not good at all. 99 00:06:45,380 --> 00:06:51,700 And look at our test accuracy: he's hovering at abysmal rates of, like, below 50 percent, 100 00:06:52,190 --> 00:06:57,800 while his training accuracy is at 100 percent. This is actually extreme overfitting; to 101 00:06:57,800 --> 00:07:00,240 be fair, it doesn't actually have to get this bad. 102 00:07:00,320 --> 00:07:02,370 At least I've never gotten it to be this bad. 103 00:07:02,510 --> 00:07:05,850 That is a good example of what we actually see happening in the real world.
104 00:07:06,290 --> 00:07:13,100 Good training accuracy, poor test accuracy, and that's overfitting. 105 00:07:13,120 --> 00:07:15,090 So how do we avoid overfitting? 106 00:07:15,360 --> 00:07:16,630 There are many techniques to avoid it, 107 00:07:16,630 --> 00:07:19,030 and I'll discuss them soon 108 00:07:19,060 --> 00:07:26,200 in the slides in this section. Now, overfitting is a consequence of our weights having been tuned to 109 00:07:26,200 --> 00:07:32,080 fit our training data, but tuned in a way so that they don't perform well on test data. 110 00:07:32,740 --> 00:07:39,310 And we know we had a decent model, just too sensitive to the training data. 111 00:07:39,340 --> 00:07:40,990 So is there a way to fix this? 112 00:07:41,040 --> 00:07:42,420 By way too sensitive, 113 00:07:42,430 --> 00:07:46,750 I mean it's just optimized exclusively for training data. 114 00:07:49,660 --> 00:07:55,320 So we can avoid overfitting by using a smaller, less deep model. 115 00:07:55,820 --> 00:08:01,720 Deeper models can sometimes find features or interpret noise to be important. Noise was, for example, this 116 00:08:01,720 --> 00:08:02,210 here. 117 00:08:03,000 --> 00:08:08,790 This clearly wasn't that important, but yet a deep model will actually try to figure out and model this 118 00:08:10,810 --> 00:08:11,740 in the data. 119 00:08:11,830 --> 00:08:17,920 That's because deeper networks have the ability to memorize more complicated features, 120 00:08:18,850 --> 00:08:21,580 and that's called the memorization capacity. 121 00:08:24,960 --> 00:08:28,230 But there's another way, and that's called regularisation.
122 00:08:28,470 --> 00:08:35,550 I don't often recommend, I shouldn't say recommend, using less deep models to get better results, because 123 00:08:35,550 --> 00:08:40,440 there are other ways around that, and you actually always want to have a deep enough model to 124 00:08:40,440 --> 00:08:42,890 represent complicated patterns in your data. 125 00:08:43,260 --> 00:08:47,750 So let's see what techniques we can use to regularize. 126 00:08:47,820 --> 00:08:49,400 So what is regularisation? 127 00:08:49,400 --> 00:08:53,750 It's a method of making our model more general to a dataset. 128 00:08:53,760 --> 00:08:59,880 So basically regularisation will take a model that produces a decision boundary like this and sort of 129 00:08:59,880 --> 00:09:02,870 tweak it so that it becomes like this. 130 00:09:02,970 --> 00:09:05,150 And, let's be clear, it's not literally doing it like this, but 131 00:09:05,150 --> 00:09:07,990 that's the actual aim of regularisation. 132 00:09:08,010 --> 00:09:11,240 We basically want to get a model like this and not like this. 133 00:09:11,280 --> 00:09:13,000 And let's find out how we do that. 134 00:09:14,750 --> 00:09:18,090 So there are a few types of regularisation; there are actually more than this, 135 00:09:18,110 --> 00:09:19,800 but these are the basic types. 136 00:09:19,800 --> 00:09:28,010 There's L1 and L2 regularisation, cross-validation, early stopping, dropout and data augmentation. 137 00:09:28,040 --> 00:09:31,030 So let's look at L1 and L2 regularisation. 138 00:09:31,070 --> 00:09:37,640 These are techniques we use to penalize large weights. Large weights or gradients manifest themselves 139 00:09:37,640 --> 00:09:43,660 as abrupt changes in our model's decision boundary, and by penalizing them we're effectively making them smaller. 140 00:09:44,240 --> 00:09:45,850 L2 is known as ridge regression. 141 00:09:45,870 --> 00:09:47,900 L1 is lasso regression. 142 00:09:48,240 --> 00:09:50,570 Now there is a lot more theory behind these things here;
143 00:09:50,720 --> 00:09:54,110 I'm just basically showing you the formulas of what they actually are. 144 00:09:54,590 --> 00:10:02,060 So as you can see, this is basically what L2 is here. 145 00:10:02,200 --> 00:10:03,760 This is MSE here, 146 00:10:03,950 --> 00:10:06,690 but we're actually adding something to it here. 147 00:10:06,830 --> 00:10:08,340 What are we doing? 148 00:10:08,360 --> 00:10:14,630 We're applying a constant here, over two, times the sum of the weights squared. And L1 is not squared; 149 00:10:14,690 --> 00:10:16,940 it's just the sum of the absolute values of the weights. 150 00:10:17,390 --> 00:10:26,180 So this parameter here controls the penalty we apply via backpropagation, the degree to which the penalty is applied 151 00:10:26,340 --> 00:10:28,320 to the weights. 152 00:10:28,350 --> 00:10:34,150 So the difference between them basically is that L1 brings the weights of unimportant features to zero, 153 00:10:34,920 --> 00:10:37,220 thus acting as a feature selection algorithm. 154 00:10:37,290 --> 00:10:42,720 You may not know what feature selection is, but if you want to know what feature selection is, it's basically 155 00:10:43,110 --> 00:10:43,960 trying to find out, if 156 00:10:44,010 --> 00:10:48,230 we have like 20 inputs, which input is most important to our 157 00:10:48,240 --> 00:10:52,160 output. But it's also used on past models as well. 158 00:10:52,560 --> 00:10:56,550 Whereas L2 penalises large weights even more but doesn't bring them down to zero. 159 00:10:56,710 --> 00:10:57,630 OK. 160 00:10:58,260 --> 00:11:04,950 So effectively what we're doing here, just remember, is that L1 and L2 prevent the weights from being too large so that we don't 161 00:11:04,950 --> 00:11:11,830 have abrupt changes in our model. Abrupt changes basically mean things like, if I go back to here, 162 00:11:13,360 --> 00:11:18,080 you basically try to get this, instead of having these abrupt gradient changes here. 163 00:11:21,290 --> 00:11:23,610 So let's go into cross-validation now.
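Before moving on, the L1 and L2 penalties just described can be written out as a small plain-Python sketch. The function names and the symbol `lam` for the penalty-strength constant are assumptions made for illustration, and the division by two in the L2 term follows the "constant over two" form mentioned in the talk; other texts and libraries scale the penalty differently.

```python
def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def l1_penalty(weights, lam):
    """L1 (lasso): lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 (ridge): lam / 2 times the sum of squared weights."""
    return (lam / 2.0) * sum(w * w for w in weights)


def regularised_loss(predictions, targets, weights, lam, kind="l2"):
    """MSE plus the chosen weight penalty."""
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return mse(predictions, targets) + penalty


weights = [0.5, -2.0, 0.0]
predictions, targets = [1.0, 2.0], [1.0, 4.0]

print(regularised_loss(predictions, targets, weights, lam=0.1, kind="l1"))
print(regularised_loss(predictions, targets, weights, lam=0.1, kind="l2"))
```

Note how the weight of -2.0 dominates the L2 term because it is squared, which is why L2 leans hardest on large weights, while the L1 term grows only linearly with each weight.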
164 00:11:23,630 --> 00:11:29,720 Cross-validation is something I rarely ever use, because I'm not used to using it. I used to use 165 00:11:29,720 --> 00:11:30,140 it a lot 166 00:11:30,140 --> 00:11:35,000 doing previous machine learning stuff, but in deep learning I don't use it often. But if you want to know 167 00:11:35,000 --> 00:11:43,160 what it is, it's quite simple actually. Basically cross-validation, and that is k-fold cross-validation, 168 00:11:43,730 --> 00:11:49,650 is the way we split our dataset, our training dataset, into different folds, and we train 169 00:11:49,730 --> 00:11:54,000 on those folds and we basically test on the other folds afterward. 170 00:11:54,500 --> 00:12:00,600 So let's see: this here is our test fold, or validation fold, and that's our training set. 171 00:12:00,740 --> 00:12:08,650 So what happens is that we train on these four folds here and we test on this one, then we train on these four folds and 172 00:12:08,750 --> 00:12:09,070 test on this. 173 00:12:09,080 --> 00:12:15,190 And what this does is that we don't actually have any unseen data in this model. 174 00:12:15,260 --> 00:12:21,410 What we're doing is just continuously testing on segments of the data, and then training on segments 175 00:12:21,410 --> 00:12:25,320 of the data and testing on a different segment. 176 00:12:25,370 --> 00:12:30,630 It does reduce overfitting, yes, but it also slows down the training process. 177 00:12:32,210 --> 00:12:36,030 Now early stopping is something we can actually automatically do in Keras. 178 00:12:36,080 --> 00:12:43,070 Basically we can set something in Keras that tells it: if our loss stops decreasing, stop training. 179 00:12:43,290 --> 00:12:50,600 So if we set our model for 100 epochs, but say at epoch number twenty-something the loss stops decreasing, 180 00:12:51,080 --> 00:12:55,030 it's going to stop training somewhere around here when he realizes that, and is going to give you this 181 00:12:55,040 --> 00:12:55,670 model here,
182 00:12:55,670 --> 00:12:58,670 the best one, with the best loss. 183 00:12:58,700 --> 00:13:05,840 The reason we do this early stopping is that sometimes, when we keep continually training over and over for a number of 184 00:13:05,900 --> 00:13:08,430 epochs on our training data, 185 00:13:08,510 --> 00:13:11,840 it tends to basically overfit on that data. 186 00:13:11,840 --> 00:13:14,540 So we need to actually stop it early sometimes, 187 00:13:14,760 --> 00:13:18,610 so that overfitting does not occur. 188 00:13:20,040 --> 00:13:21,670 And let's talk about dropout. 189 00:13:21,710 --> 00:13:26,000 It is actually very easy to implement and extremely useful. 190 00:13:26,520 --> 00:13:28,610 So dropout refers to dropping nodes, 191 00:13:28,700 --> 00:13:33,290 hidden and visible, in a neural network with the aim of reducing overfitting. 192 00:13:33,530 --> 00:13:35,240 What do I mean by dropping nodes? 193 00:13:35,460 --> 00:13:40,830 What it means is that in the training process certain parts of the network are ignored during forward 194 00:13:40,830 --> 00:13:42,680 and backpropagation. 195 00:13:42,720 --> 00:13:48,660 This is a way of actually making the network have some redundancy, in a way, but what it also does is that 196 00:13:48,660 --> 00:13:53,100 it actually adds regularisation to our networks. 197 00:13:53,160 --> 00:13:59,030 It helps reduce the interdependency between neurons as well. 198 00:13:59,450 --> 00:14:04,840 So it actually leads to more robust and meaningful features. In dropout, 199 00:14:04,910 --> 00:14:10,580 there's one parameter we use; it's called p, and p is the probability that the nodes are kept or dropped out 200 00:14:11,270 --> 00:14:12,930 in the training process. 201 00:14:12,980 --> 00:14:18,620 So one of the consequences of using dropout is that it almost doubles the time to converge 202 00:14:18,620 --> 00:14:21,090 during training.
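To make the idea of ignoring nodes during training concrete, here is a minimal plain-Python sketch of so-called inverted dropout on a single list of activations. It is an illustration only, not the implementation used in the course: the function name `dropout` is hypothetical, and `keep_prob` is taken here to mean the probability that a node is kept. Be aware that the Keras `Dropout` layer's `rate` argument means the opposite, the fraction of units to drop.

```python
import random


def dropout(activations, keep_prob, rng, training=True):
    """Inverted dropout on a list of activations.

    During training each activation is kept with probability `keep_prob`
    and scaled by 1 / keep_prob, otherwise it is set to zero.
    At test time the activations pass through unchanged.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if not training:
        return list(activations)
    dropped = []
    for a in activations:
        if rng.random() < keep_prob:
            dropped.append(a / keep_prob)  # kept node, rescaled
        else:
            dropped.append(0.0)            # ignored node
    return dropped


rng = random.Random(42)  # seeded so the dropped nodes are repeatable
hidden = [1.0] * 10

print(dropout(hidden, keep_prob=0.5, rng=rng))                  # some nodes zeroed
print(dropout(hidden, keep_prob=0.5, rng=rng, training=False))  # unchanged
```

Scaling the kept activations by `1 / keep_prob` keeps their expected sum the same as without dropout, which is why nothing needs to be rescaled at test time.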
203 00:14:21,110 --> 00:14:23,960 So this is a good illustration of dropout. 204 00:14:24,020 --> 00:14:26,680 This is a standard neural network when we train it. 205 00:14:26,990 --> 00:14:30,670 And after applying dropout, let's say with a fairly high value, 206 00:14:31,100 --> 00:14:33,060 these nodes here are ignored, 207 00:14:33,650 --> 00:14:38,040 and these nodes are used in training. 208 00:14:38,140 --> 00:14:44,170 And lastly, well, not the last method of regularisation, but the last one I'll teach in this course, because 209 00:14:44,170 --> 00:14:48,690 the others are actually quite exotic and not commonly used. 210 00:14:48,730 --> 00:14:51,890 So this one is called data augmentation. 211 00:14:51,890 --> 00:14:55,230 Remember I said you need lots of data to train a network. 212 00:14:55,450 --> 00:15:02,800 What if you don't have it? Well, computer vision especially lends itself naturally to data augmentation. 213 00:15:02,800 --> 00:15:06,570 What we do is we take a dataset; say we have one picture of a dog. 214 00:15:06,910 --> 00:15:09,910 How about we just make some manipulations to this image? 215 00:15:09,910 --> 00:15:13,050 We can rotate it completely. 216 00:15:13,390 --> 00:15:19,450 We can, I think this one is just mirrored here, and this one is actually zoomed in a bit as well. 217 00:15:19,960 --> 00:15:24,400 So that is a way we can actually expand the dataset.
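The manipulations mentioned here (rotating, mirroring, zooming) can be sketched on a tiny image stored as a nested list of pixel values. This is a toy illustration in plain Python; the function names are hypothetical, the "zoom" is approximated by a centre crop without resizing, and a real pipeline would use an image library instead.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def center_crop(image, margin):
    """Crude 'zoom': drop `margin` pixels from every side."""
    return [row[margin:len(row) - margin] for row in image[margin:len(image) - margin]]


def augment(image):
    """Return the original image plus three augmented variants."""
    return [
        image,
        flip_horizontal(image),
        rotate_90_clockwise(image),
        center_crop(image, margin=1),
    ]


# A 4x4 stand-in for the single dog picture, with pixel values 0..15.
dog = [[row * 4 + col for col in range(4)] for row in range(4)]

for variant in augment(dog):
    print(variant)
```

One labeled picture becomes four training examples, all carrying the same label, which is how augmentation expands a small dataset.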