Convolutional Neural Networks (CNN)- Step 2 – Max Pooling
What is Pooling?
Instead of verbally defining pooling, we’ll start off this tutorial with an example right away.
The Cheetah Example
In the example above, the same cheetah image is presented in different ways. It is normal in its first version, rotated in the second, and horizontally squashed in the third.
The purpose of max pooling is to enable the convolutional neural network to detect the cheetah no matter which of these ways the image is presented.
This second example is more advanced. Here we have 6 different images of 6 different cheetahs (or 5, there is 1 that seems to appear in 2 photos) and they are each posing differently in different settings and from different angles.
Again, max pooling is concerned with teaching your convolutional neural network to recognize that, despite all of the differences we mentioned, these are all images of cheetahs. In order to do that, the network needs to acquire a property known as “spatial invariance.”
This property makes the network capable of detecting the object in the image without being confused by the differences in the image’s textures, the distances from where they are shot, their angles, or otherwise.
In order to reach the pooling step, we need to have finished the convolution step, which means that we would have a feature map ready.
Types of Pooling
Before getting into the details, you should know that there are several types of pooling. These include among others the following:
- Mean pooling
- Max pooling
- Sum pooling
Our main focus here will be max pooling.
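To make the difference between the three types concrete, here is a minimal sketch in Python with NumPy. The 2×2 `window` and its values are made up for illustration and are not taken from the tutorial’s figures.

```python
import numpy as np

# Hypothetical 2x2 group of feature-map cells (illustrative values only).
window = np.array([[1, 0],
                   [4, 2]])

mean_pooled = window.mean()  # average of the four cells: 1.75
max_pooled = window.max()    # largest of the four cells: 4
sum_pooled = window.sum()    # total of the four cells: 7

print(mean_pooled, max_pooled, sum_pooled)
```

All three collapse the same group of cells into a single number; they only differ in which summary they keep.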
Pooled Feature Map
The process of filling in a pooled feature map differs from the one we used to come up with the regular feature map.
This time you’ll place a 2×2 box at the top-left corner, and move along the row. For every 4 cells your box stands on, you’ll find the maximum numerical value and insert it into the pooled feature map. In the figure below, for instance, the box currently contains a group of cells where the maximum value is 4.
If you remember the convolution operation example from the previous tutorial, we were using strides of one pixel. In this example, we are using 2-pixel strides, so the box never covers the same cell twice. That’s why we end up with a 3×3 pooled feature map. Strides of two are the most common choice.
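The size of the pooled map can be worked out with a little arithmetic. The sketch below assumes a feature map that is 5 cells wide (two full steps of the box plus the one leftover column described next) and counts a final, partly empty box as a valid position. The helper name `pooled_size` is hypothetical.

```python
import math

def pooled_size(n, window=2, stride=2):
    """Number of box positions along one axis of length n.

    Assumes n >= window and that a final box hanging off the edge
    still counts as a position, as it does in this tutorial.
    """
    return math.ceil((n - window) / stride) + 1

print(pooled_size(5))            # 2x2 box, stride 2 on 5 cells -> 3
print(pooled_size(5, stride=1))  # same box with stride 1 -> 4
```

Applied to both axes of a 5×5 feature map, this gives the 3×3 pooled feature map.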
Note that in the third movement along the same row, you will find yourself stuck with one lonely column.
You would still proceed despite the fact that half of your box will be empty. You still find your maximum value and put it in the pooled feature map. In the last step, you will face a situation where the box contains a single cell. You will take that value to be the maximum value.
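The whole procedure can be sketched in a few lines of Python with NumPy. This is a minimal illustration rather than how any particular library implements pooling. The 5×5 `feature_map` values are assumed for the example, and `max_pool` is a hypothetical helper.

```python
import math
import numpy as np

def max_pool(feature_map, window=2, stride=2):
    """Max pooling that keeps partly empty boxes at the right and bottom edges."""
    rows, cols = feature_map.shape
    out_rows = math.ceil((rows - window) / stride) + 1
    out_cols = math.ceil((cols - window) / stride) + 1
    pooled = np.empty((out_rows, out_cols), dtype=feature_map.dtype)
    for i in range(out_rows):
        for j in range(out_cols):
            r, c = i * stride, j * stride
            # NumPy slicing clips at the array edge, so a box hanging off
            # the edge simply covers fewer cells (down to a single cell).
            pooled[i, j] = feature_map[r:r + window, c:c + window].max()
    return pooled

# Assumed 5x5 feature map (illustrative values).
feature_map = np.array([
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 1, 2, 1],
    [1, 4, 2, 1, 0],
    [0, 0, 1, 2, 1],
])

pooled = max_pool(feature_map)
print(pooled)
# [[1 1 0]
#  [4 2 1]
#  [0 2 1]]
```

The last column and last row of the result come from the half-empty boxes, and the bottom-right value comes from the box that holds a single cell.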
Just like in the convolution step, the creation of the pooled feature map also makes us dispose of unnecessary information or features. In this case, we have lost roughly 75% of the original information found in the feature map since for each 4 pixels in the feature map we ended up with only the maximum value and got rid of the other 3. These are the details that are unnecessary and without which the network can do its job more efficiently.
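The “roughly 75%” figure can be checked with quick arithmetic. The sketch below assumes a 2×2 box and, for the overall count, a 5×5 feature map pooled down to 3×3. The variable names are only for illustration.

```python
# Each full 2x2 box keeps 1 value out of 4.
window_cells = 2 * 2
discarded_per_box = (window_cells - 1) / window_cells  # 0.75

# Counting whole maps instead (assumed sizes: 5x5 pooled down to 3x3).
cells_before = 5 * 5
cells_after = 3 * 3
discarded_overall = 1 - cells_after / cells_before  # about 0.64

print(discarded_per_box, round(discarded_overall, 2))
```

The overall share is a little below 75% here because the half-empty boxes at the edges discard fewer cells than a full box does.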
The reason we extract the maximum value, which is actually the point of the whole pooling step, is to account for distortions. Let’s say we have three cheetah images, and in each image the cheetah’s tear lines are at a different angle.
The feature after it has been pooled will be detected by the network despite these differences in its appearance between the three images. Consider the tear line feature to be represented by the 4 in the feature map above.
Imagine that instead of the four appearing in cell 4×2, it appeared in 3×1. Both cells fall inside the same 2×2 box, so when pooling the feature we would still end up with 4 as the maximum value from that group, and thus we would get the same result in the pooled version.
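Here is a small sketch of that thought experiment in Python with NumPy. It assumes “4×2” means row 4, column 2 and uses a made-up 4×4 feature map so that the boxes tile it exactly. The helper `max_pool_2x2` is hypothetical.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 for maps with even height and width."""
    h, w = feature_map.shape
    # Split the map into non-overlapping 2x2 blocks and take each block's max.
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# The 4 sits in row 4, column 2 (counting from 1).
original = np.array([
    [0, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 2],
    [1, 4, 2, 1],
])

# Same map, but the 4 has moved to row 3, column 1.
shifted = original.copy()
shifted[3, 1] = 1
shifted[2, 0] = 4

print(max_pool_2x2(original))
print(max_pool_2x2(shifted))
print(np.array_equal(max_pool_2x2(original), max_pool_2x2(shifted)))  # True
```

Note that the tolerance is local: if the 4 moved far enough to land in a different box, the pooled map would change.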
This process is what provides the convolutional neural network with the “spatial invariance” capability. In addition to that, pooling shrinks the feature maps, which reduces the number of parameters needed in the layers that follow and, in turn, helps reduce the risk of “overfitting.”
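To get a feel for the effect on parameters, consider a rough back-of-the-envelope sketch. All the sizes below (28×28 feature maps, 32 of them, feeding a fully connected layer of 128 units) are assumptions chosen for illustration, and `dense_weights` is a hypothetical helper.

```python
def dense_weights(height, width, channels, units):
    """Weights needed to connect flattened feature maps to a dense layer (biases ignored)."""
    return height * width * channels * units

# Assumed sizes for illustration only.
without_pooling = dense_weights(28, 28, 32, 128)
with_pooling = dense_weights(14, 14, 32, 128)  # after 2x2 pooling with stride 2

print(without_pooling, with_pooling, without_pooling // with_pooling)
```

The pooling step has no weights of its own. The saving comes from the next layer having a quarter as many inputs to connect to.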
Overfitting, in a nutshell, is when a model becomes so complex that it fits the idiosyncrasies of its training images, such as the distortions we just mentioned, instead of the features that carry over to new images. Again, this is an abstract explanation of the pooling concept without digging into the mathematical and technical aspects of it.
We can draw an analogy here from the human brain. Our brains, too, conduct a pooling step of sorts: the input image is received through our eyes and is then distilled multiple times until, as far as possible, only the most relevant information is preserved for us to recognize what we are looking at.
Additional Reading
If you want to do your own reading on the subject, check out this 2010 paper titled “Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition” by Dominik Scherer and others from the University of Bonn.
It’s a pretty simple read, only 10 pages long, and will boil down the concept of pooling for you just perfectly. You can even skip the second part titled “Related Work” if you find it irrelevant to what you want to understand.
Now, let’s explore a more fun example.
The Number Game
The image above shows an online tool. You’ll find it hard to find this page using Google, so you can visit it through this link.
Let’s get to it
What we did in this screenshot is draw a number; in this case, we entered the number 4.
In the line-up of images in the middle, the box standing alone in the bottom row represents the input image. The next row represents the convolution operation, and the row after that represents the pooling phase.
You’ll see the term “downsampling” used in the “layer visibility” section on the left. Downsampling is simply another word for pooling.
If you look at the various versions of the original image that appear in the convolution row, you’ll be able to recognize the filters being used for the convolution operation and the features that the application is focusing on.
You’ll notice that in the pooling row, the images have more or less the same features as their convolved versions minus some information. You can still recognize it as the same image.
On a side note:
- You can ignore the other rows for now since we haven’t covered these processes yet. Just keep in mind that, just as pooling was similar in its steps to the convolution operation, these, too, are further layers of the same process.
- If you hover your mouse over any image, it will send out a ray that points at the source of the pixels you’re hovering over in the version that came before it (the pooling version will point to the convolution version, and that one will point to the input image).