1
00:00:00,670 --> 00:00:06,470
OK, so in 7.2 we are going to learn what convolutions are and what image features are.

2
00:00:06,630 --> 00:00:08,960
So let's get started.

3
00:00:09,490 --> 00:00:11,460
So before we dive into convolutions,

4
00:00:11,470 --> 00:00:14,300
let's take a look at what image features actually are.

5
00:00:14,830 --> 00:00:21,280
So when I say image features, I'm talking about interesting things in an image. It's kind of a vague term,

6
00:00:21,450 --> 00:00:25,590
but it basically encapsulates things like edges, colors, patterns, and shapes.

7
00:00:25,600 --> 00:00:26,690
This is a dog here.

8
00:00:26,700 --> 00:00:27,430
It has been.

9
00:00:27,520 --> 00:00:30,490
The edges have been extracted using a Canny edge detector.

10
00:00:30,910 --> 00:00:38,200
So an image feature is basically just that: one narrow thing that we find interesting in

11
00:00:38,200 --> 00:00:43,410
an image. By narrow I mean a type or category, like edges or colors.

12
00:00:43,440 --> 00:00:53,270
This could just as easily have been a brown color, or blue, or whatever, a particular color or shape. So before

13
00:00:53,270 --> 00:00:58,650
CNNs came into the picture, scientists did feature engineering manually.

14
00:00:58,940 --> 00:01:05,390
You see how I just mentioned that these are edge or color extractors, or patterns or shapes.

15
00:01:05,390 --> 00:01:10,730
Now what scientists did, and I actually had to do this at one time, was extract different features

16
00:01:10,730 --> 00:01:17,300
such as histogram of gradients, color histograms, by intercessions means a structural image.

17
00:01:17,630 --> 00:01:18,740
Many different things.

18
00:01:18,860 --> 00:01:23,930
And it was tedious to actually do this feature engineering, because a lot of times you're just kind of

19
00:01:23,930 --> 00:01:27,890
messing around, trying different things, and you don't even know what works.

20
00:01:27,920 --> 00:01:33,410
And in the end, because you don't have a good, complicated model that learns non-linear representations,

21
00:01:33,440 --> 00:01:39,190
well, you still end up getting accuracy that's basically not that great.

22
00:01:41,250 --> 00:01:47,490
So these are examples of filters learned by this guy here, in this publication.

23
00:01:47,490 --> 00:01:53,700
You can see here multiple edge detectors, with black-and-white stripes all

24
00:01:53,700 --> 00:01:54,550
over here.

25
00:01:54,780 --> 00:01:56,400
Different color patterns together.

26
00:01:56,590 --> 00:02:03,180
Now, these are image features; this is exactly what I'm talking about with image features. But what does this have

27
00:02:03,180 --> 00:02:04,680
to do with convolutions?

28
00:02:04,890 --> 00:02:08,160
So what are convolutions? Now, a convolution...

29
00:02:08,160 --> 00:02:14,010
Before I even tell you how it relates to features: a convolution is effectively a mathematical term that

30
00:02:14,010 --> 00:02:18,600
describes a process of combining two functions to produce a third function.

31
00:02:18,600 --> 00:02:24,810
Now that sounds kind of vague until I tell you the third function is a feature map, and a feature map is effectively

32
00:02:24,810 --> 00:02:25,870
these things here.

33
00:02:26,220 --> 00:02:30,820
So now imagine we're applying a convolution to an image.

34
00:02:30,860 --> 00:02:38,750
So that's applying a process to two functions: we apply the convolution to an image to get a feature map.

35
00:02:38,840 --> 00:02:44,560
So convolution is the action of using a filter, or kernel; we use both interchangeably in this course and

36
00:02:44,560 --> 00:02:47,180
in research and theory.

37
00:02:47,190 --> 00:02:53,300
So it's applied to the input, in our case the input being the input image.

38
00:02:53,370 --> 00:02:55,110
This is basically the convolution process.

39
00:02:55,110 --> 00:02:57,240
Now let me just go back to the slide here.

40
00:02:57,390 --> 00:03:03,870
So I just want to reiterate that in our case the input, which is the first function here, is applied

41
00:03:03,960 --> 00:03:06,450
to another function, called the filter or kernel.

42
00:03:06,750 --> 00:03:15,530
And the result is the feature map. So the convolution process is basically executed by sliding the filter, that's

43
00:03:15,530 --> 00:03:22,820
the filter function, over the input image, and this sliding process is basically a simple multiplication,

44
00:03:22,910 --> 00:03:27,940
an element-wise multiply-and-sum, that is, a dot product, that produces a third function.

45
00:03:27,950 --> 00:03:28,990
So how is it done?

46
00:03:29,150 --> 00:03:32,090
So imagine this is basically an input image.

47
00:03:32,090 --> 00:03:38,330
This is 2D, which isn't true in reality, but this is for explanation purposes, and this is a convolution

48
00:03:38,330 --> 00:03:41,570
filter here, just some values in a smaller matrix.

49
00:03:41,840 --> 00:03:43,870
And this is the output feature map.

50
00:03:43,880 --> 00:03:47,220
So what's going to happen in the convolution process?

51
00:03:47,340 --> 00:03:58,070
Well, we're going to basically slide this filter here over, let me go back, over this area here, and then again

52
00:03:58,280 --> 00:04:00,490
and again and you'll see it slowly.

53
00:04:00,500 --> 00:04:06,020
So when I say convolve, that means we basically multiply them here.

54
00:04:06,200 --> 00:04:12,050
So as you can see, it has the values 1, 0, 1, 1, 0, 0, 0, 0, 1, 1.

55
00:04:12,050 --> 00:04:20,090
And these values here, with 0, 1, 0, 1, 0, and above and below that. I actually didn't use these values, mainly for

56
00:04:20,090 --> 00:04:21,760
simplicity purposes.

57
00:04:21,760 --> 00:04:26,070
So I changed to these values here to make this calculation far easier for us.

58
00:04:26,090 --> 00:04:33,980
So by multiplying these two together we get 0 by 1, 1 by 0, 0 by 1.

59
00:04:33,980 --> 00:04:37,490
You'll see it here: 1 by 0, 0 by 1, 1 by 0.

60
00:04:37,730 --> 00:04:42,040
And so on and so on, and we just add it all up and we get 2.

61
00:04:42,280 --> 00:04:46,050
And that forms our first output feature in this box here.

62
00:04:46,400 --> 00:04:54,350
So how many times can this three-by-three matrix, this sliding window, be slid

63
00:04:54,470 --> 00:04:56,220
over this.

64
00:04:56,220 --> 00:04:58,310
This image here.

65
00:04:58,310 --> 00:05:00,500
So imagine this is the grid here.

66
00:05:00,740 --> 00:05:05,990
We have one box here, and we can shift it again here, two, just like this here.

67
00:05:06,350 --> 00:05:08,020
And then three, again.

68
00:05:08,420 --> 00:05:15,260
So by sliding it to this box, we now fill in the second value of the feature matrix, and you don't have to add

69
00:05:15,260 --> 00:05:21,100
one, but imagine it as one, two here, and then start again at the second row here.

70
00:05:21,410 --> 00:05:28,940
So we have one here, two here, three here, and then again four, five, six.

71
00:05:28,940 --> 00:05:29,530
All right.

72
00:05:29,780 --> 00:05:34,030
So we have basically enough values.

73
00:05:34,300 --> 00:05:39,030
So we have in each row one, two, three, and three times it can be slid across.

74
00:05:39,050 --> 00:05:40,880
We have nine values in all.

75
00:05:41,300 --> 00:05:47,450
So we can actually fill out this entire thing by sliding it across nine times.

76
00:05:47,600 --> 00:05:49,200
That's how we build those features.
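A minimal sketch of the sliding process just described, assuming a 5 by 5 input, a 3 by 3 kernel, and a stride of one. The pixel and kernel values are hypothetical, not the ones on the slide, and `convolve2d_valid` is an illustrative name. Like CNN layers, it does not flip the kernel.
```python
import numpy as np
def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1); at each
    position multiply element-wise and sum to get one output value."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map
# Hypothetical binary image and kernel, chosen to keep the arithmetic easy.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
feature_map = convolve2d_valid(image, kernel)
print(feature_map.shape)  # three positions per row, three rows: nine values
print(feature_map)
```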

77
00:05:49,400 --> 00:05:56,630
So by using what I would call a three-by-three filter, or convolution kernel, we produce a feature map, three by

78
00:05:56,630 --> 00:06:00,260
three, where it produces the features here.

79
00:06:00,770 --> 00:06:03,270
Now you understand this process.

80
00:06:03,350 --> 00:06:04,930
Basically it's simple, no?

81
00:06:05,030 --> 00:06:11,870
But what exactly are the effects of doing this, and why is this important? So firstly,

82
00:06:12,050 --> 00:06:22,700
depending on the values of the kernel, the kernel being this blue box here, in the convolution we produce

83
00:06:22,700 --> 00:06:27,860
different maps, obviously, because we can have different kernels with different values, and they'll all

84
00:06:27,860 --> 00:06:29,720
produce different feature maps.

85
00:06:29,720 --> 00:06:36,050
So applying a kernel is simple, but as we just saw, convolving with different kernels produces interesting

86
00:06:36,050 --> 00:06:38,890
feature maps that can be used to detect different features.

87
00:06:38,900 --> 00:06:40,240
This is what makes it important.

88
00:06:40,310 --> 00:06:49,670
So imagine we have several filters here, each with a different set of values, and we're sliding them

89
00:06:49,700 --> 00:06:50,290
over here.

90
00:06:50,290 --> 00:06:51,990
We're producing different feature maps.

91
00:06:52,250 --> 00:06:57,890
So what this means now is that we've processed the input image into basically features that have been

92
00:06:57,890 --> 00:06:58,710
extracted.

93
00:06:59,980 --> 00:07:00,920
So let's keep going.

94
00:07:02,000 --> 00:07:08,570
So it's important to know that convolution keeps the spatial relationship between pixels by learning image features

95
00:07:08,630 --> 00:07:11,120
over the small segments we pass over.

96
00:07:11,120 --> 00:07:17,530
This means that the convolution output, even though it's reduced in size here, has still sort of retained some of the

97
00:07:17,540 --> 00:07:18,470
spatial information

98
00:07:18,470 --> 00:07:22,400
in this large image; it's just now in a more compressed form.

99
00:07:25,770 --> 00:07:32,070
So these are all examples of kernels here. Basically, the identity kernel does nothing.

100
00:07:32,250 --> 00:07:38,460
We have edge detection kernels, where simply having these values in the kernel changes an input image into this.

101
00:07:38,740 --> 00:07:40,590
It's quite remarkable.

102
00:07:40,590 --> 00:07:44,660
But you can actually write some code, or try it in OpenCV, and see for yourself.

103
00:07:44,670 --> 00:07:50,910
You can specify kernels, define your own kernels in OpenCV, and run some convolutions and produce

104
00:07:51,050 --> 00:07:53,910
blurs, sharpened images,

105
00:07:53,920 --> 00:07:55,730
edge detection. It's actually pretty cool.
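A rough NumPy-only sketch of trying different kernels yourself. The kernel values below are common textbook choices, not necessarily the ones on the slide, and `apply_kernel` and the test image are hypothetical. In OpenCV, `cv2.filter2D` is the usual way to apply a custom kernel to an image.
```python
import numpy as np
def apply_kernel(image, kernel):
    """Same-size filtering: zero-pad the border, then multiply-and-sum
    the kernel against the window centred on every pixel."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out
identity = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float)
box_blur = np.ones((3, 3)) / 9.0
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)
edge = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
# Hypothetical image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0
same = apply_kernel(image, identity)   # identity kernel leaves the image unchanged
edges = apply_kernel(image, edge)      # responds only around the dark/bright boundary
print(edges[2])                        # one row through the middle of the image
```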

106
00:07:56,090 --> 00:08:03,000
So let's take a look at an example of a convolution kernel applied to an image that

107
00:08:03,000 --> 00:08:04,690
extracts features here.

108
00:08:04,770 --> 00:08:11,970
So this is an example GIF I've taken from... you can actually see how, when they apply this and slide it

109
00:08:11,970 --> 00:08:15,750
across the image, what the actual convolutional filter output looks like.

110
00:08:15,960 --> 00:08:17,670
So this is the edge detector here.

111
00:08:17,750 --> 00:08:19,300
And the other one.

112
00:08:19,350 --> 00:08:20,760
It's actually pretty cool.

113
00:08:20,760 --> 00:08:22,390
Look at it again there.

114
00:08:23,250 --> 00:08:26,500
And the other one too. They're awesome.

115
00:08:28,100 --> 00:08:30,690
So now, as you know, that was just one filter.

116
00:08:30,690 --> 00:08:35,520
So we need many filters in our CNNs, at least within reason.

117
00:08:35,520 --> 00:08:40,290
You don't want to use too many, although there's nothing actually wrong with using too many; it just increases

118
00:08:40,290 --> 00:08:47,250
your training time and model complexity and it may be redundant depending on your image data set.

119
00:08:47,280 --> 00:08:50,040
So let's assume we're using 12 filters.

120
00:08:50,040 --> 00:08:56,180
How do we actually visualize how that CNN looks here?

121
00:08:56,190 --> 00:09:04,460
So imagine we have an image of size 28 by 28 by 3, three dimensions: red, green, and blue.

122
00:09:04,530 --> 00:09:06,360
So that's why it has some depth here.

123
00:09:06,960 --> 00:09:11,930
And this is a convolutional filter, which is basically this size here.

124
00:09:12,000 --> 00:09:18,270
One by one by one; that's actually the output, sorry, of the convolutional filter.

125
00:09:18,290 --> 00:09:21,310
Each grid, this is our convolutional filter box here.

126
00:09:21,320 --> 00:09:22,160
All right.

127
00:09:22,310 --> 00:09:25,690
So we're actually doing a one to one mapping of a convolutional filter.

128
00:09:26,120 --> 00:09:32,390
So it's also 28 by 28 by 1, but now we're using 12 filters here.

129
00:09:32,510 --> 00:09:41,540
So each yellow block here represents a single convolutional filter, and there are 12 blocks stacked here.

130
00:09:41,900 --> 00:09:47,580
So what happens is that for each filter, we slide it across and fill in our values here.

131
00:09:48,790 --> 00:09:55,020
And we do that basically 12 times, and we get a box of convolutions, or a box of filters here.

132
00:09:55,060 --> 00:09:58,820
Feature maps, see; this is a box of feature maps.

133
00:09:58,840 --> 00:10:01,380
This is our convolution kernel matrix.

134
00:10:01,510 --> 00:10:05,920
And in case you're wondering because it actually just slipped my mind when I was explaining this to

135
00:10:05,920 --> 00:10:10,230
you because I did this slide a couple of weeks before explaining that in this video.

136
00:10:10,330 --> 00:10:18,430
Now, you noticed that before, we had a filter that was, say, three by three, and it produced a smaller convolution,

137
00:10:18,490 --> 00:10:21,160
a smaller feature map here.

138
00:10:21,160 --> 00:10:25,310
However, in this example I'm producing basically the same size feature map.

139
00:10:25,330 --> 00:10:32,230
And this is actually what we need to do in most cases. You don't have to, but I'll actually explain to

140
00:10:32,230 --> 00:10:34,880
you how we actually end up with the same size image later on.

141
00:10:34,990 --> 00:10:40,330
But for now just assume we run this; let's say this is a three-by-three or five-by-five convolution here,

142
00:10:40,870 --> 00:10:47,440
we get the output here, and we fill in our matrix here of feature maps.

143
00:10:47,560 --> 00:10:52,660
So as you can see, this is how the filters look stacked up; visually you can see it quite clearly there.

144
00:10:54,050 --> 00:11:00,620
So the outputs of our convolutions: from the last slide, we saw that applying 12 filters of size three by three by three

145
00:11:01,310 --> 00:11:04,830
to an image which was 28 by 28 by 3,

146
00:11:04,840 --> 00:11:08,750
we produce 12 feature maps, also called activation maps.

147
00:11:08,750 --> 00:11:15,060
Now these outputs are stacked together and treated as one big 3D matrix of output size 28 by 28

148
00:11:15,080 --> 00:11:16,170
by 12.
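A small sketch of that stacking, assuming zero padding of one pixel so each map stays 28 by 28 (how the size is kept comes up later in the course). The image and the filter weights here are random stand-ins, not learned values.
```python
import numpy as np
rng = np.random.default_rng(0)
image = rng.random((28, 28, 3))                   # hypothetical RGB image
filters = rng.standard_normal((12, 3, 3, 3))      # 12 filters, each 3 by 3 by 3
padded = np.pad(image, ((1, 1), (1, 1), (0, 0)))  # pad height and width only
feature_maps = np.zeros((28, 28, 12))
for k in range(12):                               # one feature map per filter
    for i in range(28):
        for j in range(28):
            window = padded[i:i + 3, j:j + 3, :]  # 3 by 3 patch across all 3 channels
            feature_maps[i, j, k] = np.sum(window * filters[k])
print(feature_maps.shape)                         # the stacked input to the next layer
```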

149
00:11:16,520 --> 00:11:20,190
And this is important; let's go back to this.

150
00:11:20,390 --> 00:11:27,630
This big matrix here now forms the input to our next layer in the CNN.

151
00:11:27,650 --> 00:11:32,780
So now let's talk more about what these feature maps, or activation maps, actually are and how they represent

152
00:11:32,810 --> 00:11:34,190
image features.

153
00:11:34,220 --> 00:11:42,380
So now each cell, and by cell I mean each one-by-one point in our activation map matrix, is considered

154
00:11:42,410 --> 00:11:45,100
basically a feature extractor, or a single neuron.

155
00:11:45,470 --> 00:11:50,470
And that single neuron is basically looking at a specific region as it slides over the image.

156
00:11:50,470 --> 00:11:54,500
Or a specific feature, I should say, as it slides over the image.

157
00:11:54,500 --> 00:12:01,460
So we have basically a feature map of 28 by 28, like we just did. That feature map basically

158
00:12:01,460 --> 00:12:07,790
has each neuron, each cell, activate depending on what it sees in the image.

159
00:12:08,330 --> 00:12:15,030
And at the beginning of the neural network, or of course the CNN I should say, our early convolutional

160
00:12:15,170 --> 00:12:22,400
layers are basically low-level feature detectors, and low-level feature detectors are basically looking

161
00:12:22,400 --> 00:12:24,650
for simple things in images. Simple things

162
00:12:24,650 --> 00:12:29,730
meaning maybe edges, maybe specific colors, maybe a blob here and there.

163
00:12:29,750 --> 00:12:36,200
However, if we have consecutive, concatenated convolutional layers, as in deep networks with rows

164
00:12:36,650 --> 00:12:43,400
of convolutional layers, we can start detecting more specialized features, like the ears of a cat, or the

165
00:12:43,400 --> 00:12:46,160
shape of a bicycle, or the shape of a face.

166
00:12:46,250 --> 00:12:52,610
So that's how CNNs actually use these convolutional feature maps to detect features.

167
00:12:52,800 --> 00:12:58,450
So you've seen that so far we've just used a standard, or arbitrary, filter size of three by three.

168
00:12:58,860 --> 00:13:01,200
But can we use other sizes?

169
00:13:01,210 --> 00:13:08,310
And how do they affect the convolution output size and the feature maps of the convolutional neural

170
00:13:08,310 --> 00:13:09,410
net?

171
00:13:09,420 --> 00:13:13,290
So basically that's called tweaking the hyperparameters.

172
00:13:13,290 --> 00:13:18,870
So in the next section, Section 7.3, we look at depth, stride, and padding.