1
00:00:00,670 --> 00:00:06,470
OK, so in 7.2 we are going to learn what convolutions are and what image features are.
2
00:00:06,630 --> 00:00:08,960
So let's get started.
3
00:00:09,490 --> 00:00:11,460
So before we dive into convolutions.
4
00:00:11,470 --> 00:00:14,300
Let's take a look at what image features actually are.
5
00:00:14,830 --> 00:00:21,280
So when I say image features, I'm talking about interesting things in an image. It's kind of a vague term,
6
00:00:21,450 --> 00:00:25,590
but it basically encapsulates things like edges, colors, patterns, and shapes.
7
00:00:25,600 --> 00:00:26,690
This is a dog here.
8
00:00:26,700 --> 00:00:27,430
Its edges have been
9
00:00:27,520 --> 00:00:30,490
extracted using a Canny edge detector.
10
00:00:30,910 --> 00:00:38,200
So an image feature is basically just that: one narrow thing that we find interesting in
11
00:00:38,200 --> 00:00:43,410
an image. By narrow I mean a type of category, like edges or colors.
12
00:00:43,440 --> 00:00:53,270
This could just as easily have been a brown color, or blue, or whatever, or a shape. So before
13
00:00:53,270 --> 00:00:58,650
CNNs came into the picture, scientists did feature engineering manually.
14
00:00:58,940 --> 00:01:05,390
You see how I just mentioned that these are edge extractors, or color extractors, or patterns or shapes.
15
00:01:05,390 --> 00:01:10,730
Now what we did, scientists did, and I actually had to do this at one time, was extract different features
16
00:01:10,730 --> 00:01:17,300
such as histograms of oriented gradients, color histograms, pixel intensities, and structural image features.
17
00:01:17,630 --> 00:01:18,740
Many different things.
18
00:01:18,860 --> 00:01:23,930
And it was tedious to actually do this feature engineering, because a lot of times you're just kind of
19
00:01:23,930 --> 00:01:27,890
messing around trying different things and you don't even know what works.
20
00:01:27,920 --> 00:01:33,410
And in the end, because you don't have a good, complicated model that learns non-linear representations,
21
00:01:33,440 --> 00:01:39,190
well, you still end up getting basically not that great accuracy.
22
00:01:41,250 --> 00:01:47,490
So here are some examples of filters learned in this publication here.
23
00:01:47,490 --> 00:01:53,700
You can see here multiple edge detectors in black and white, and various stripes all
24
00:01:53,700 --> 00:01:54,550
over here,
25
00:01:54,780 --> 00:01:56,400
different color patterns together.
26
00:01:56,590 --> 00:02:03,180
Now these are image features, exactly what I'm talking about when I say image features. But what do they have
27
00:02:03,180 --> 00:02:04,680
to do with convolutions?
28
00:02:04,890 --> 00:02:08,160
So what are convolutions? Now, a convolution.
29
00:02:08,160 --> 00:02:14,010
Before I even tell you how it relates to features: a convolution is effectively a mathematical term that
30
00:02:14,010 --> 00:02:18,600
describes a process of combining two functions to produce a third function.
31
00:02:18,600 --> 00:02:24,810
Now that sounds kind of vague, until I tell you the third function is a feature map, and a feature map is effectively
32
00:02:24,810 --> 00:02:25,870
these things here.
33
00:02:26,220 --> 00:02:30,820
So now imagine we're applying a convolution to an image.
34
00:02:30,860 --> 00:02:38,750
So that's combining two functions: we apply a convolution to an image to get a feature map.
35
00:02:38,840 --> 00:02:44,560
So convolution is the action of using a filter, or kernel. We use both terms interchangeably in this course and
36
00:02:44,560 --> 00:02:47,180
in the research literature.
37
00:02:47,190 --> 00:02:53,300
So the filter is applied to the input, in our case the input being the input image, to produce the convolution.
38
00:02:53,370 --> 00:02:55,110
This is basically the convolutional process.
39
00:02:55,110 --> 00:02:57,240
Now let me just go back to the slide here.
40
00:02:57,390 --> 00:03:03,870
So I just want to reiterate that in our case the input, which is the first function here, is applied
41
00:03:03,960 --> 00:03:06,450
to a second function called the filter.
42
00:03:06,750 --> 00:03:15,530
And that gives a feature map. So the convolution process is basically executed by sliding the filter, that's
43
00:03:15,530 --> 00:03:22,820
the filter function, over the input image, and this sliding process is basically a simple element-wise
44
00:03:22,910 --> 00:03:27,940
multiplication, or dot product, that produces a third function.
45
00:03:27,950 --> 00:03:28,990
So how is it done.
46
00:03:29,150 --> 00:03:32,090
So imagine this is basically an input image.
47
00:03:32,090 --> 00:03:38,330
This is 2D; it's not truly 2D in reality, but this is for explanation purposes. And this is a convolution
48
00:03:38,330 --> 00:03:41,570
filter here, with some values in a smaller matrix.
49
00:03:41,840 --> 00:03:43,870
And this is the output feature map.
50
00:03:43,880 --> 00:03:47,220
So what's going to happen in the convolution process.
51
00:03:47,340 --> 00:03:58,070
Well, we're going to basically slide this filter over the image, go back here, over this area here, and then again
52
00:03:58,280 --> 00:04:00,490
and again, and you'll see it slide slowly.
53
00:04:00,500 --> 00:04:06,020
So when I say convolve, that basically means we multiply them here.
54
00:04:06,200 --> 00:04:12,050
So as you can see, the values are 1 0 1 1 0 0 0 0 1 1.
55
00:04:12,050 --> 00:04:20,090
And these values here, with 0 1 0 1 0 above or below. I actually used these values mainly for
56
00:04:20,090 --> 00:04:21,760
simplicity purposes.
57
00:04:21,760 --> 00:04:26,070
I've chosen these values here to make this calculation far easier for us.
58
00:04:26,090 --> 00:04:33,980
So by multiplying these two together we get 0 times 1, 1 times 0, 0 times 1.
59
00:04:33,980 --> 00:04:37,490
You'll see it here: 1 times 0, 0 times 1, 1 times 0.
60
00:04:37,730 --> 00:04:42,040
And so on and so on, and we just add it all up and we get two.
61
00:04:42,280 --> 00:04:46,050
And that forms our first output feature value in this box here.
62
00:04:46,400 --> 00:04:54,350
So how many times can this three-by-three matrix be slid, or how many times are we going to slide it,
63
00:04:54,470 --> 00:04:56,220
over this,
64
00:04:56,220 --> 00:04:58,310
this image here?
65
00:04:58,310 --> 00:05:00,500
So imagine this grid here.
66
00:05:00,740 --> 00:05:05,990
We have one box here, and we can shift it again here, two, just like this here.
67
00:05:06,350 --> 00:05:08,020
And then three again.
68
00:05:08,420 --> 00:05:15,260
So by sliding it up, this box, we now fill in a second value of the feature matrix. And you don't have to add
69
00:05:15,260 --> 00:05:21,100
one by one, but imagine it as one, two here, and then starting again at the second row here.
70
00:05:21,410 --> 00:05:28,940
So we have one here, two here, three here, and then again four, five, six.
71
00:05:28,940 --> 00:05:29,530
All right.
72
00:05:29,780 --> 00:05:34,030
So we have basically enough values.
73
00:05:34,300 --> 00:05:39,030
So we have in each row one, two, three, and three times it can be slid across.
74
00:05:39,050 --> 00:05:40,880
We have nine values in all.
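The count the narration arrives at follows from the standard output-size formula for a valid (no-padding) convolution, output = (N − K) / stride + 1. A quick sketch in Python (the function name is my own, not from the lecture):

```python
def output_size(n: int, k: int, stride: int = 1) -> int:
    """Number of positions a k-wide kernel can take when slid
    over an n-wide input with the given stride and no padding."""
    return (n - k) // stride + 1

# A 3x3 kernel over a 5x5 image: 3 positions per row, 3 rows.
side = output_size(5, 3)
print(side * side)  # 9 slide positions in all
```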
75
00:05:41,300 --> 00:05:47,450
So we can actually fill out this entire thing by sliding it across nine times.
76
00:05:47,600 --> 00:05:49,200
That's how we build those features.
77
00:05:49,400 --> 00:05:56,630
So by using what we call a three-by-three filter, or convolution kernel, we produce a three-by-three feature
78
00:05:56,630 --> 00:06:00,260
map, which it fills in here.
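The sliding process just described can be sketched in NumPy. The input and kernel values below are invented for illustration, not the exact numbers on the slide:

```python
import numpy as np

# 5x5 binary "image" (values invented for illustration)
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

# 3x3 convolution kernel
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

# Slide the kernel over the image: at each of the 3x3 = 9 positions,
# multiply element-wise and sum to get one feature-map value.
feature_map = np.zeros((3, 3), dtype=int)
for i in range(3):
    for j in range(3):
        feature_map[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(feature_map.shape)  # a 3x3 feature map from a 5x5 input
```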
79
00:06:00,770 --> 00:06:03,270
Now you understand this process.
80
00:06:03,350 --> 00:06:04,930
Basically, it's simple, right?
81
00:06:05,030 --> 00:06:11,870
But what exactly are the effects of doing this, and why is this important? So firstly.
82
00:06:12,050 --> 00:06:22,700
Depending on the values of the kernel, the kernel being this blue box here, on convolution we produce
83
00:06:22,700 --> 00:06:27,860
different feature maps, obviously, because we can have different kernels with different values, and they'll all
84
00:06:27,860 --> 00:06:29,720
produce different feature maps.
85
00:06:29,720 --> 00:06:36,050
Plain and obvious as it sounds, as we just saw, convolving with different kernels produces interesting
86
00:06:36,050 --> 00:06:38,890
feature maps that can be used to detect different features.
87
00:06:38,900 --> 00:06:40,240
This is what makes it important.
88
00:06:40,310 --> 00:06:49,670
So imagine we have several filters here, each with different sets of values, and we're sliding them
89
00:06:49,700 --> 00:06:50,290
over here.
90
00:06:50,290 --> 00:06:51,990
We're producing different feature maps.
91
00:06:52,250 --> 00:06:57,890
So what this means is that we've now processed the input image into basically features that have been
92
00:06:57,890 --> 00:06:58,710
extracted.
93
00:06:59,980 --> 00:07:00,920
So let's keep going.
94
00:07:02,000 --> 00:07:08,570
So it's important to know that convolution keeps a spatial relationship between pixels by learning image features
95
00:07:08,630 --> 00:07:11,120
over the small segments we pass over.
96
00:07:11,120 --> 00:07:17,530
This means that the convolution output, even though it's reduced in size here, still sort of retains some of the
97
00:07:17,540 --> 00:07:18,470
spatial information
98
00:07:18,470 --> 00:07:22,400
in this large image, just now in a more compressed form.
99
00:07:25,770 --> 00:07:32,070
So these are all examples of kernels here. Basically, the identity kernel does nothing.
100
00:07:32,250 --> 00:07:38,460
We have edge detection kernels: simply having these values in the kernel changes an input image into this, which
101
00:07:38,740 --> 00:07:40,590
is quite, quite remarkable.
102
00:07:40,590 --> 00:07:44,660
But you can actually write some code, or try it in OpenCV, and see for yourself.
103
00:07:44,670 --> 00:07:50,910
You can specify kernels, define your own kernels in OpenCV, and run and perform convolutions to produce
104
00:07:51,050 --> 00:07:53,910
blurs, sharpened images.
105
00:07:53,920 --> 00:07:55,730
Edge detection. It's actually pretty cool.
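As a concrete sketch of the OpenCV idea (without requiring OpenCV itself), here is what defining your own kernel and convolving looks like in NumPy; the sharpen kernel below is a common choice, and `convolve2d` is a hand-rolled stand-in for OpenCV's `cv2.filter2D`:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid convolution: slide the kernel over the image with
    stride 1 and no padding, summing element-wise products."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A classic sharpen kernel, like those shown on the slide.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

gray = np.random.rand(28, 28)      # stand-in grayscale image
sharpened = convolve2d(gray, sharpen)
print(sharpened.shape)             # (26, 26) without padding
```

With OpenCV installed, the roughly equivalent call is `cv2.filter2D(gray, -1, sharpen)`, which also pads the borders so the output keeps the input size.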
106
00:07:56,090 --> 00:08:03,000
So let's take a look at an example of a convolution kernel applied to an image that
107
00:08:03,000 --> 00:08:04,690
extracts features here.
108
00:08:04,770 --> 00:08:11,970
So this is an example GIF I've taken from, and you can actually see how, when they apply this and slide it
109
00:08:11,970 --> 00:08:15,750
across the image, what the actual convolution filter output looks like.
110
00:08:15,960 --> 00:08:17,670
So this is the edge detector here.
111
00:08:17,750 --> 00:08:19,300
And the other one there.
112
00:08:19,350 --> 00:08:20,760
It's actually pretty cool.
113
00:08:20,760 --> 00:08:22,390
Look at it again there.
114
00:08:23,250 --> 00:08:26,500
And the other one too. They're awesome.
115
00:08:28,100 --> 00:08:30,690
So now, as you know, that was just one filter.
116
00:08:30,690 --> 00:08:35,520
So we need many filters in our CNNs, at least within reason.
117
00:08:35,520 --> 00:08:40,290
You don't want to do too much, although there's nothing actually wrong with doing too much. It just increases
118
00:08:40,290 --> 00:08:47,250
your training time and model complexity, and it may be redundant depending on your image data set.
119
00:08:47,280 --> 00:08:50,040
So let's assume we're using 12 filters.
120
00:08:50,040 --> 00:08:56,180
How do we actually visualize how that looks in a CNN?
121
00:08:56,190 --> 00:09:04,460
So imagine we have an image of size 28 by 28, and three dimensions: red, green, and blue.
122
00:09:04,530 --> 00:09:06,360
So that's why it has some depth here.
123
00:09:06,960 --> 00:09:11,930
And this is a convolutional cell, which is basically the size here,
124
00:09:12,000 --> 00:09:18,270
one by one by one. That's actually the output size of the convolutional filter.
125
00:09:18,290 --> 00:09:21,310
Each grid cell here is our convolutional filter box.
126
00:09:21,320 --> 00:09:22,160
All right.
127
00:09:22,310 --> 00:09:25,690
So we're actually doing a one to one mapping of a convolutional filter.
128
00:09:26,120 --> 00:09:32,390
So it's also 28, 28 by 28 by one, but now we're using 12 filters here.
129
00:09:32,510 --> 00:09:41,540
So each yellow block here represents a single convolutional filter, and there are 12 blocks stacked here.
130
00:09:41,900 --> 00:09:47,580
So what happens is that, for each filter, we slide it across and fill in our values here,
131
00:09:48,790 --> 00:09:55,020
basically 12 times, and we get a box of convolutions, or a box of filters here.
132
00:09:55,060 --> 00:09:58,820
If you come up close, you'll see this is a box of feature maps.
133
00:09:58,840 --> 00:10:01,380
This is our whole convolution kernel matrix.
134
00:10:01,510 --> 00:10:05,920
And in case you're wondering, because it actually just slipped my mind when I was explaining this to
135
00:10:05,920 --> 00:10:10,230
you, because I did this slide a couple of weeks before explaining it in this video.
136
00:10:10,330 --> 00:10:18,430
Now you noticed that before we had a filter that was, say, three by three, and it produced a smaller convolution,
137
00:10:18,490 --> 00:10:21,160
a smaller feature map here.
138
00:10:21,160 --> 00:10:25,310
However, in this example I'm producing a feature map of basically the same size.
139
00:10:25,330 --> 00:10:32,230
And this is actually what we need to do in most cases. You don't have to, but I'll actually explain to
140
00:10:32,230 --> 00:10:34,880
you how we actually end up with the same size image later on.
141
00:10:34,990 --> 00:10:40,330
But for now, just assume we run this, let's say this is a three-by-three or five-by-five convolution here;
142
00:10:40,870 --> 00:10:47,440
we get the output here, and we fill in our matrix here for each feature map.
143
00:10:47,560 --> 00:10:52,660
So as you can see, this is how the filters look stacked up visually. You can see it quite clearly there.
144
00:10:54,050 --> 00:11:00,620
So the output of all the convolutions from this layer, after applying 12 filters of size three by three by three
145
00:11:01,310 --> 00:11:04,830
to an image which was 28 by 28 by three:
146
00:11:04,840 --> 00:11:08,750
we produce 12 feature maps, also called activation maps.
147
00:11:08,750 --> 00:11:15,060
Now these outputs are stacked together and treated as one big 3D matrix of output size 28 by 28
148
00:11:15,080 --> 00:11:16,170
by 12.
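That stacking can be sketched as follows: 12 filters of size 3×3×3, each applied with "same" zero padding over a 28×28×3 image, yield a 28×28×12 output. A minimal NumPy version (the loop-based layer and its name are my own sketch, not an efficient or official implementation):

```python
import numpy as np

def conv_layer_same(image, kernels):
    """Apply each 3x3xC kernel over the zero-padded image
    ('same' padding, stride 1) and stack the resulting
    feature maps along the depth axis."""
    h, w, _ = image.shape
    padded = np.pad(image, ((1, 1), (1, 1), (0, 0)))  # pad H and W by 1
    out = np.zeros((h, w, len(kernels)))
    for k, kernel in enumerate(kernels):
        for i in range(h):
            for j in range(w):
                out[i, j, k] = np.sum(padded[i:i+3, j:j+3, :] * kernel)
    return out

image = np.random.rand(28, 28, 3)                       # 28x28 RGB input
kernels = [np.random.rand(3, 3, 3) for _ in range(12)]  # 12 filters
activations = conv_layer_same(image, kernels)
print(activations.shape)  # (28, 28, 12): one map per filter
```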
149
00:11:16,520 --> 00:11:20,190
And this is important, so let's go back to this.
150
00:11:20,390 --> 00:11:27,630
This now forms the input, this big matrix here, to our next layer in the CNN.
151
00:11:27,650 --> 00:11:32,780
So now let's talk more about what these feature maps, or activation maps, actually are and how they represent
152
00:11:32,810 --> 00:11:34,190
image features.
153
00:11:34,220 --> 00:11:42,380
So now, each cell, and by cell I mean each one-by-one point in the activation map matrix, is considered
154
00:11:42,410 --> 00:11:45,100
basically a feature extractor, or a single neuron.
155
00:11:45,470 --> 00:11:50,470
And that single neuron is basically looking at a specific region as it slides over the image.
156
00:11:50,470 --> 00:11:54,500
A specific feature, I should say, as it slides over the image.
157
00:11:54,500 --> 00:12:01,460
So we have basically a feature map of 28 by 28, like we just did, and that feature map basically
158
00:12:01,460 --> 00:12:07,790
has each neuron, each cell, activate depending on what it sees in the image.
159
00:12:08,330 --> 00:12:15,030
And in the beginning of your neural network, or CNN I should say, the early convolutional layers
160
00:12:15,170 --> 00:12:22,400
are basically low-level feature detectors, and low-level feature detectors are basically looking
161
00:12:22,400 --> 00:12:24,650
for simple things in images. Simple things,
162
00:12:24,650 --> 00:12:29,730
meaning like maybe edges, maybe specific colors, maybe a blob here and there.
163
00:12:29,750 --> 00:12:36,200
However, if we have consecutive, concatenated convolutional layers, as in deep networks with rows
164
00:12:36,650 --> 00:12:43,400
of convolutional layers, we can start detecting more specialized features, like the face of a cat, or the
165
00:12:43,400 --> 00:12:46,160
shape of a bicycle, or the shape of a face.
166
00:12:46,250 --> 00:12:52,610
So that's how CNNs actually use these convolutional feature maps to detect features.
167
00:12:52,800 --> 00:12:58,450
So as you've seen so far, we just used a standard, arbitrary filter size of three by three.
168
00:12:58,860 --> 00:13:01,200
But can we use other sizes?
169
00:13:01,210 --> 00:13:08,310
And how do they affect the convolution size and the feature maps and other parts of the convolutional neural
170
00:13:08,310 --> 00:13:09,410
net?
171
00:13:09,420 --> 00:13:13,290
So basically, that's called tweaking the hyperparameters.
172
00:13:13,290 --> 00:13:18,870
So in the next section, Section 7.3, we'll look at depth, stride, and padding.