1 00:00:00,670 --> 00:00:06,470 OK, so in 7.2 we are going to learn what convolutions are and what image features are. 2 00:00:06,630 --> 00:00:08,960 So let's get started. 3 00:00:09,490 --> 00:00:11,460 So before we dive into convolutions, 4 00:00:11,470 --> 00:00:14,300 let's take a look at what image features actually are. 5 00:00:14,830 --> 00:00:21,280 So when I say image features, I'm talking about interesting things in an image. It's kind of a vague term, 6 00:00:21,450 --> 00:00:25,590 but it basically encapsulates things like edges, colors, patterns and shapes. 7 00:00:25,600 --> 00:00:26,690 This is a dog here. 8 00:00:26,700 --> 00:00:27,430 In this image, 9 00:00:27,520 --> 00:00:30,490 the edges have been extracted using a Canny edge detector. 10 00:00:30,910 --> 00:00:38,200 So an image feature is basically just that: one narrow thing that we find interesting in 11 00:00:38,200 --> 00:00:43,410 an image. By narrow I mean a type or category, like edges or colors. 12 00:00:43,440 --> 00:00:53,270 This could just as easily have been a brown color, or blue, or whatever other color or shape. So before 13 00:00:53,270 --> 00:00:58,650 CNNs came into the picture, scientists did feature engineering manually. 14 00:00:58,940 --> 00:01:05,390 You see how I just mentioned that these are edge or color extractors, or patterns or shapes. 15 00:01:05,390 --> 00:01:10,730 Now what scientists did, and I actually had to do this at one time, was extract different features 16 00:01:10,730 --> 00:01:17,300 such as histograms of gradients, color histograms, binarizations, means of a structural image. 17 00:01:17,630 --> 00:01:18,740 Many different things. 18 00:01:18,860 --> 00:01:23,930 And it was tedious to actually do this feature engineering, because a lot of the time you're just kind of 19 00:01:23,930 --> 00:01:27,890 messing around trying different things and you don't even know what works.
20 00:01:27,920 --> 00:01:33,410 And in the end, because you don't have a good, complex model that handles non-linear representations 21 00:01:33,440 --> 00:01:39,190 well, you still end up with basically not that great accuracy. 22 00:01:41,250 --> 00:01:47,490 So these are examples of filters learned by this guy here in this publication. 23 00:01:47,490 --> 00:01:53,700 There are multiple edge detectors here in black and white, with, as you can see, stripes all 24 00:01:53,700 --> 00:01:54,550 over here, 25 00:01:54,780 --> 00:01:56,400 and different color patterns together. 26 00:01:56,590 --> 00:02:03,180 Now, the image features here are exactly what I'm talking about with image features, but what does this have 27 00:02:03,180 --> 00:02:04,680 to do with convolutions? 28 00:02:04,890 --> 00:02:08,160 So what are convolutions? Now, a convolution, 29 00:02:08,160 --> 00:02:14,010 before I even tell you how it relates to features, is effectively a mathematical term that 30 00:02:14,010 --> 00:02:18,600 describes a process of combining two functions to produce a third function. 31 00:02:18,600 --> 00:02:24,810 Now that sounds kind of vague until I tell you that third function is a feature map, and a feature map is effectively 32 00:02:24,810 --> 00:02:25,870 these things here. 33 00:02:26,220 --> 00:02:30,820 So now imagine we're applying a convolution to an image. 34 00:02:30,860 --> 00:02:38,750 So that's applying a process to two functions: we apply the convolution to an image to get the feature map. 35 00:02:38,840 --> 00:02:44,560 So convolution is the action of using a filter, or kernel (we use both interchangeably in this course and 36 00:02:44,560 --> 00:02:47,180 in research and theory). 37 00:02:47,190 --> 00:02:53,300 So it's applied to the input, in our case the input being the input image. 38 00:02:53,370 --> 00:02:55,110 This is basically the convolution process. 39 00:02:55,110 --> 00:02:57,240 Now let me just go back to the slide here.
40 00:02:57,390 --> 00:03:03,870 So I just want to reiterate that in our case the input, which is the first function here, is applied 41 00:03:03,960 --> 00:03:06,450 to another function, called the filter or kernel. 42 00:03:06,750 --> 00:03:15,530 And that gives the feature map. So the convolution process is basically executed by sliding the filter, that's 43 00:03:15,530 --> 00:03:22,820 the filter function, over the input image, and this sliding process is basically a simple multiplication, 44 00:03:22,910 --> 00:03:27,940 an element-wise multiplication and sum, or dot product, to produce the third function. 45 00:03:27,950 --> 00:03:28,990 So how is it done? 46 00:03:29,150 --> 00:03:32,090 So imagine this is basically an input image. 47 00:03:32,090 --> 00:03:38,330 This is 2D, which is not true in reality, but this is for explanation purposes. And this is a convolution 48 00:03:38,330 --> 00:03:41,570 filter here, just some values in a smaller matrix. 49 00:03:41,840 --> 00:03:43,870 And this is the output feature map. 50 00:03:43,880 --> 00:03:47,220 So what's going to happen in the convolution process? 51 00:03:47,340 --> 00:03:58,070 Well, we're going to basically slide this filter here over, let me go back, over this area here, and then again 52 00:03:58,280 --> 00:04:00,490 and again, and you'll see it slowly. 53 00:04:00,500 --> 00:04:06,020 So when I say convolve, that basically means we multiply them here. 54 00:04:06,200 --> 00:04:12,050 So as you can see, the values are 1, 0, 1, 1, 0, 0, 0, 0, 1, 1. 55 00:04:12,050 --> 00:04:20,090 And these values here, with 0, 1, 0, 1, 0 above. The ones below that, I actually didn't use these values, mainly for 56 00:04:20,090 --> 00:04:21,760 simplicity purposes. 57 00:04:21,760 --> 00:04:26,070 So I changed to these values here to make this calculation far easier for us. 58 00:04:26,090 --> 00:04:33,980 So by multiplying these two together we get 0 by 1, 1 by 0, 0 by 1. 59 00:04:33,980 --> 00:04:37,490 You'll see it here: 1 by 0, 0 by 1, 1 by 0.
60 00:04:37,730 --> 00:04:42,040 And so on and so on, and we just add it up and we get 2. 61 00:04:42,280 --> 00:04:46,050 And that forms our first output feature in this box here. 62 00:04:46,400 --> 00:04:54,350 So how many times can this three by three matrix, the one that we're going to be sliding, be slid 63 00:04:54,470 --> 00:04:56,220 over this, 64 00:04:56,220 --> 00:04:58,310 this image here? 65 00:04:58,310 --> 00:05:00,500 So imagine this is the grid here. 66 00:05:00,740 --> 00:05:05,990 We have one box here, and we can shift it again here, two, just like this here. 67 00:05:06,350 --> 00:05:08,020 And then three, again. 68 00:05:08,420 --> 00:05:15,260 So by sliding it to this box we now fill in the second value of the feature matrix. You don't have to add 69 00:05:15,260 --> 00:05:21,100 one, but imagine it as you'd want to here, and then start again at the second row here. 70 00:05:21,410 --> 00:05:28,940 So we have one here, two here, three here, and then again four, five, six. 71 00:05:28,940 --> 00:05:29,530 All right. 72 00:05:29,780 --> 00:05:34,030 So we have basically enough values. 73 00:05:34,300 --> 00:05:39,030 So we have in each row one, two, three, and three times it can be slid across. 74 00:05:39,050 --> 00:05:40,880 We have nine values in all. 75 00:05:41,300 --> 00:05:47,450 So we can actually fill out this entire thing by sliding it across nine times. 76 00:05:47,600 --> 00:05:49,200 That's how we build those features. 77 00:05:49,400 --> 00:05:56,630 So by using what I would call a three by three filter, or convolution kernel, we produce a feature map that is three by 78 00:05:56,630 --> 00:06:00,260 three, where it produces the features here. 79 00:06:00,770 --> 00:06:03,270 Now you understand this process. 80 00:06:03,350 --> 00:06:04,930 Basically it's simple, no? 81 00:06:05,030 --> 00:06:11,870 But what exactly are the effects of doing this, and why is this important?
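To make the sliding process described above concrete, here is a minimal sketch in plain Python. The image and kernel values are made up for illustration (they are not the values on the slide), and the function name `convolve2d_valid` is hypothetical. As in most CNN code, the kernel is not flipped, so strictly speaking this is a cross-correlation; with the symmetric kernel used here the result is the same either way.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    At each position, multiply the overlapping values element by element
    and add them up to get one cell of the feature map.
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1  # 5 - 3 + 1 = 3 positions per axis
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Illustrative 5 by 5 binary image (assumed values).
image = [
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
]

# Illustrative 3 by 3 kernel (assumed values).
kernel = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

feature_map = convolve2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

The kernel fits in three positions per row and three rows, so the feature map has nine cells, matching the count in the walkthrough. With these assumed values the top-left cell also comes out as 2.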
82 00:06:12,050 --> 00:06:22,700 Firstly, depending on the values of the kernel, the kernel being this blue box here, under convolution we produce 83 00:06:22,700 --> 00:06:27,860 different maps, obviously, because we can have different kernels with different values and they'll all 84 00:06:27,860 --> 00:06:29,720 produce different feature maps. 85 00:06:29,720 --> 00:06:36,050 So applying a kernel is simple, but as we just saw, convolving with different kernels produces interesting 86 00:06:36,050 --> 00:06:38,890 feature maps that can be used to detect different features. 87 00:06:38,900 --> 00:06:40,240 This is what makes it important. 88 00:06:40,310 --> 00:06:49,670 So imagine we have several filters here, each with different sets of values here, and we're sliding them 89 00:06:49,700 --> 00:06:50,290 over here. 90 00:06:50,290 --> 00:06:51,990 We're producing different feature maps. 91 00:06:52,250 --> 00:06:57,890 So what this means now is that we've now processed the input image into basically features that have been 92 00:06:57,890 --> 00:06:58,710 extracted. 93 00:06:59,980 --> 00:07:00,920 So let's keep going. 94 00:07:02,000 --> 00:07:08,570 So it's important to know that convolution keeps the spatial relationship between pixels by learning image features 95 00:07:08,630 --> 00:07:11,120 over the small segments we pass over. 96 00:07:11,120 --> 00:07:17,530 This means that the convolution output, even though it's reduced in size here, still sort of retains some of the 97 00:07:17,540 --> 00:07:18,470 spatial information 98 00:07:18,470 --> 00:07:22,400 in this large image; it's just now in a more compressed type of form. 99 00:07:25,770 --> 00:07:32,070 So these are all examples of kernels here. Basically, the identity kernel does nothing. 100 00:07:32,250 --> 00:07:38,460 We have edge detection kernels, where simply having these values in the kernel changes an input image into this, 101 00:07:38,740 --> 00:07:40,590 which is quite remarkable.
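As a small illustration of how different kernel values give different feature maps, the sketch below runs an identity kernel and an edge detection kernel over the same tiny image. Everything here is an assumption chosen for the example: the image is a made-up 5 by 5 grid with a vertical edge, the edge kernel is a Sobel-style horizontal-gradient kernel (the lecture does not say which edge kernel the slide shows), and `convolve2d_valid` is a hypothetical helper name.

```python
def convolve2d_valid(image, kernel):
    """Stride-1, no-padding sliding window: multiply element-wise, then sum."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k_h)
                for j in range(k_w)
            )
            for left in range(out_w)
        ]
        for top in range(out_h)
    ]


# Made-up image: dark on the left, bright on the right, so there is one vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Identity kernel: only the centre weight is 1, so each output cell copies the pixel under it.
identity_kernel = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

# Sobel-style kernel (an assumed choice): responds where brightness changes from left to right.
edge_kernel = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

identity_map = convolve2d_valid(image, identity_kernel)
edge_map = convolve2d_valid(image, edge_kernel)

print(identity_map)  # each row is [0, 1, 1]: the central part of the image, unchanged
print(edge_map)      # each row is [4, 4, 0]: strong response at the edge, none in the flat area
```

The same sliding code is used both times; only the nine kernel values change, and that alone decides whether the output copies the image or highlights its edge.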
102 00:07:40,590 --> 00:07:44,660 But you can actually write some code, or try it in OpenCV, and see for yourself. 103 00:07:44,670 --> 00:07:50,910 You can specify kernels, define your own kernels in OpenCV, and run some convolutions and produce 104 00:07:51,050 --> 00:07:53,910 blurs, sharpened images, 105 00:07:53,920 --> 00:07:55,730 edge detection. It's actually pretty cool. 106 00:07:56,090 --> 00:08:03,000 So let's take a look at an example of a convolution kernel applied to an image that 107 00:08:03,000 --> 00:08:04,690 extracts features here. 108 00:08:04,770 --> 00:08:11,970 So this is an example GIF I've taken from... You can actually see how, when they apply this and slide it 109 00:08:11,970 --> 00:08:15,750 across the image, what the actual convolution filter output looks like. 110 00:08:15,960 --> 00:08:17,670 So this is the edge detector here. 111 00:08:17,750 --> 00:08:19,300 And the other one. 112 00:08:19,350 --> 00:08:20,760 It's actually pretty cool. 113 00:08:20,760 --> 00:08:22,390 Look at it again there. 114 00:08:23,250 --> 00:08:26,500 And the other one too. They're awesome. 115 00:08:28,100 --> 00:08:30,690 So now, as you know, that was just one filter. 116 00:08:30,690 --> 00:08:35,520 So we need many filters in our CNNs, at least within reason. 117 00:08:35,520 --> 00:08:40,290 You don't want to use too many, although there's nothing actually wrong with using too many; it just increases 118 00:08:40,290 --> 00:08:47,250 your training time and model complexity, and it may be redundant depending on your image data set. 119 00:08:47,280 --> 00:08:50,040 So let's assume we're using 12 filters. 120 00:08:50,040 --> 00:08:56,180 How do we actually visualize how that CNN actually looks here? 121 00:08:56,190 --> 00:09:04,460 So imagine we have an image of size 28 by 28 with three dimensions: red, green and blue. 122 00:09:04,530 --> 00:09:06,360 So that's why it has some depth here.
123 00:09:06,960 --> 00:09:11,930 And this is a convolution filter, which is basically this size here, 124 00:09:12,000 --> 00:09:18,270 one by one by one. That's actually the output size of the convolution filter. 125 00:09:18,290 --> 00:09:21,310 Each grid, this is our convolution filter box here. 126 00:09:21,320 --> 00:09:22,160 All right. 127 00:09:22,310 --> 00:09:25,690 So we're actually doing a one-to-one mapping of a convolution filter. 128 00:09:26,120 --> 00:09:32,390 So it's also 28 by 28 by 1, but now we're using 12 filters here. 129 00:09:32,510 --> 00:09:41,540 So each yellow block here represents a single convolution filter, and there are 12 blocks stacked here. 130 00:09:41,900 --> 00:09:47,580 So what happens is that for each filter we slide it across and fill in our values here, 131 00:09:48,790 --> 00:09:55,020 basically 12 times, and we get a box of convolutions, or a box of filters, here. 132 00:09:55,060 --> 00:09:58,820 Or feature maps, I should say: this is a box of feature maps. 133 00:09:58,840 --> 00:10:01,380 This is our convolution kernel matrix. 134 00:10:01,510 --> 00:10:05,920 And in case you're wondering, because it actually just slipped my mind when I was explaining this to 135 00:10:05,920 --> 00:10:10,230 you, because I did this slide a couple of weeks before explaining it in this video: 136 00:10:10,330 --> 00:10:18,430 you noticed that before we had a filter that was, say, three by three, and it produced a smaller convolution, 137 00:10:18,490 --> 00:10:21,160 a smaller feature map, here. 138 00:10:21,160 --> 00:10:25,310 However, in this example I'm producing basically the same size feature map. 139 00:10:25,330 --> 00:10:32,230 And this is actually what we need to do in most cases. You don't have to, but I'll actually explain to 140 00:10:32,230 --> 00:10:34,880 you how we actually end up with the same size image later on.
141 00:10:34,990 --> 00:10:40,330 But for now just assume we run this, let's say this is a three by three or five by five convolution here, 142 00:10:40,870 --> 00:10:47,440 we get the output here, and we fill in our matrix here of feature maps. 143 00:10:47,560 --> 00:10:52,660 So as you can see, this is how the filters look stacked up. Visually you can see it quite clearly there. 144 00:10:54,050 --> 00:11:00,620 So the outputs of our convolutions: from the last slide, we saw that applying 12 filters of size three by three by three 145 00:11:01,310 --> 00:11:04,830 to an image which was 28 by 28 by 3, 146 00:11:04,840 --> 00:11:08,750 we produce 12 feature maps, also called activation maps. 147 00:11:08,750 --> 00:11:15,060 Now these outputs are stacked together and treated as one big 3D matrix of output size 28 by 28 148 00:11:15,080 --> 00:11:16,170 by 12. 149 00:11:16,520 --> 00:11:20,190 And this is important. Let's go back to this. 150 00:11:20,390 --> 00:11:27,630 This now forms the input, this big matrix here, to our next layer in the CNN. 151 00:11:27,650 --> 00:11:32,780 So now let's talk more about what these feature maps or activation maps actually are and how they represent 152 00:11:32,810 --> 00:11:34,190 image features. 153 00:11:34,220 --> 00:11:42,380 So now each cell, and by cell I mean each one by one point in the activation map matrix, is considered 154 00:11:42,410 --> 00:11:45,100 basically a feature extractor, or a single neuron. 155 00:11:45,470 --> 00:11:50,470 And that single neuron is basically looking at a specific region as it slides over the image. 156 00:11:50,470 --> 00:11:54,500 Or a specific feature, I should say, as it slides over the image. 157 00:11:54,500 --> 00:12:01,460 So we have basically a feature map of 28 by 28, like we just did. That feature map basically 158 00:12:01,460 --> 00:12:07,790 has each neuron, each cell, activate depending on what it sees in the image.
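The stacking described above, 12 filters of size 3 by 3 by 3 over a 28 by 28 by 3 image giving a 28 by 28 by 12 output, can be sketched roughly as follows. This is a plain Python illustration under stated assumptions: stride 1 and one pixel of zero padding are assumed so that the output keeps the 28 by 28 size (the lecture leaves how the size is kept for later), the image and filter values are random placeholders, and `conv_layer_same` is a hypothetical name.

```python
import random


def conv_layer_same(image, filters):
    """Apply each k x k x C filter to an H x W x C image (zero padding, stride 1).

    Returns an H x W x len(filters) nested list: one feature map per filter,
    stacked along the last axis.
    """
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    k = len(filters[0])  # filter side length, assumed odd
    pad = k // 2         # assumed zero padding that keeps the output size
    output = [[[0.0] * len(filters) for _ in range(width)] for _ in range(height)]
    for f_idx, filt in enumerate(filters):
        for y in range(height):
            for x in range(width):
                total = 0.0
                for i in range(k):
                    for j in range(k):
                        yy, xx = y + i - pad, x + j - pad
                        # Positions outside the image count as zeros (padding).
                        if 0 <= yy < height and 0 <= xx < width:
                            for c in range(channels):
                                total += image[yy][xx][c] * filt[i][j][c]
                output[y][x][f_idx] = total
    return output


random.seed(0)

# Placeholder 28 x 28 RGB image with random values.
image = [[[random.random() for _ in range(3)] for _ in range(28)] for _ in range(28)]

# Twelve placeholder 3 x 3 x 3 filters with random weights.
filters = [
    [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)] for _ in range(3)]
    for _ in range(12)
]

feature_maps = conv_layer_same(image, filters)
shape = (len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0]))
print(shape)  # (28, 28, 12)
```

Each filter spans all three color channels and still yields a single 28 by 28 map, so the depth of the output equals the number of filters, not the number of input channels. That stacked output is what the next layer receives as its input.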
159 00:12:08,330 --> 00:12:15,030 And at the beginning of a neural network, or of a CNN I should say, the early convolutional 160 00:12:15,170 --> 00:12:22,400 layers are basically low-level feature detectors, and low-level feature detectors are basically looking 161 00:12:22,400 --> 00:12:24,650 for simple things in images. Simple things 162 00:12:24,650 --> 00:12:29,730 meaning maybe edges, maybe specific colors, maybe a blob here and there. 163 00:12:29,750 --> 00:12:36,200 However, if we have consecutive, concatenated convolutional layers, as in deep networks with lots 164 00:12:36,650 --> 00:12:43,400 of convolutional layers, we can start detecting more specialized features, like the ears of a cat, or the 165 00:12:43,400 --> 00:12:46,160 shape of a bicycle, or the shape of a face. 166 00:12:46,250 --> 00:12:52,610 So that's how CNNs actually use these convolutional feature maps to detect features. 167 00:12:52,800 --> 00:12:58,450 So you've seen so far that we just used a standard, an arbitrary, filter size of three by three. 168 00:12:58,860 --> 00:13:01,200 But can we use other sizes? 169 00:13:01,210 --> 00:13:08,310 And how do they affect the convolution size and the later parts of the convolutional neural 170 00:13:08,310 --> 00:13:09,410 net? 171 00:13:09,420 --> 00:13:13,290 So basically that's called tweaking the hyperparameters. 172 00:13:13,290 --> 00:13:18,870 So in the next section, Section 7.3, we look at depth, stride and padding.