The ability to recognize and categorize objects in everyday life comes very naturally to humans. When you walk down the street, you can easily distinguish the various objects that you encounter. You can see people walking their dogs, cars driving down the road and airplanes flying across the sky. What if we wanted machines to recognize these objects? If we had a lot of images that we wanted to analyze and categorize, we could use machines to automate the task. Let's take a look at how image recognition and classification works!

How are images represented?

Your eyes register colors as a combination of red, green and blue light. Computers also store and display colors using red, green and blue. Every pixel on the screen you are looking at right now can be represented by just three values. For example, the color fuchsia ██ has a red value of 255, a green value of 0 and a blue value of 255. We would say that the RGB value for this color is (255, 0, 255). To display fuchsia, your computer uses an additive process to blend these values together. Experiment with the puppy below to see how it works!
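In code, a pixel really is just three numbers. Here is a minimal sketch using NumPy (the tiny 2×2 image is made up purely for illustration):

```python
import numpy as np

# A tiny 2x2 RGB image: each pixel holds (red, green, blue) values in 0-255.
image = np.zeros((2, 2, 3), dtype=np.uint8)

# Set the top-left pixel to fuchsia: full red, no green, full blue.
image[0, 0] = (255, 0, 255)

print(image[0, 0])  # -> [255   0 255]
```

A real photo is the same idea at a larger scale: a height × width × 3 grid of numbers.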

Now, what is a convolution?

In the context of image processing, a convolution is the process of combining each pixel of an image with its local neighbors, weighted by a small matrix called a "kernel". Experiment with the visual below to see how convolutions are calculated. The left matrix represents a portion of an image and the right matrix represents the values in our kernel.
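The calculation in the visual above boils down to an elementwise multiply-and-sum. Here is a rough sketch for a single output pixel; the patch values are made up, and the kernel shown is a simple box blur:

```python
import numpy as np

# A 3x3 patch of an image (made-up grayscale values).
patch = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

# A 3x3 kernel (here: an averaging "box blur" kernel).
kernel = np.ones((3, 3)) / 9.0

# The convolved value for the center pixel: multiply each pixel by the
# matching kernel entry, then sum everything up.
value = np.sum(patch * kernel)
print(value)  # -> 5.0
```

(Strictly speaking, a convolution flips the kernel first; many image libraries compute this unflipped "cross-correlation" instead, and for symmetric kernels like this one the two are identical.)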

The bigger picture

Kernels can have different values depending on the features we want to extract. There are kernels that are designed to sharpen images and even to detect edges! How does an image look after we apply those convolutions to the entire image? Experiment with the visualization below to see how convolutions are calculated and what the resulting convolved images look like. The original image is on the left and the resulting convolved image is on the right.
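Sliding that multiply-and-sum over every pixel gives the full convolved image. Below is a minimal sketch of that loop, with two classic kernels: a sharpen kernel and a Laplacian-style edge-detection kernel (the random input image is a stand-in for a real photo):

```python
import numpy as np

def convolve2d(image, kernel):
    """Apply a kernel at every pixel of a grayscale image (zero padding)."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)
    out = np.zeros(image.shape)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

# Two classic kernels: sharpening and edge detection.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])
edges = np.array([[-1, -1, -1],
                  [-1,  8, -1],
                  [-1, -1, -1]])

image = np.random.rand(8, 8)           # a made-up grayscale image
print(convolve2d(image, edges).shape)  # same size as the input: (8, 8)
```

Note that both kernels sum to a fixed total: the edge kernel sums to 0, so flat regions of the image map to 0 and only changes in brightness (edges) survive.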

Stacking convolutions

We can also construct a "network" of convolutions that feed into each other; i.e. convolving an image that has already been convolved. A single convolution already extracts useful information from an image, so why not string several together? We might be able to detect even more specific features this way. Experiment with the visualization below to see how this could be helpful.
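In code, stacking simply means feeding one convolution's output into the next. A hedged sketch (the helper and kernels mirror the earlier example; blur-then-edge-detect is one common combination, since blurring first suppresses noise before edges are extracted):

```python
import numpy as np

def convolve2d(image, kernel):
    """Minimal 2D convolution with zero padding (output same size as input)."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)
    out = np.zeros(image.shape)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

blur = np.ones((3, 3)) / 9.0
edges = np.array([[-1, -1, -1],
                  [-1,  8, -1],
                  [-1, -1, -1]])

image = np.random.rand(16, 16)  # a made-up grayscale image

# Stacking: the output of the first convolution is the input to the second.
blurred = convolve2d(image, blur)
blurred_edges = convolve2d(blurred, edges)  # edges of the *blurred* image
print(blurred_edges.shape)  # still (16, 16)
```

A convolutional neural network does exactly this, except it learns the kernel values instead of using hand-picked ones.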

Putting it together

Below is a simple neural network for classifying a picture as one of ten categories: Plane, Car, Bird, Cat, Deer, Dog, Frog, Horse, Ship and Truck. Hover over the pixels in any layer to see the path of a pixel through the network. Both the relevant pixel and the kernel are highlighted in alternating colors at each layer.
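The probabilities on the right typically come from a "softmax" over the network's ten raw output scores: exponentiate each score and normalize so they sum to 1. A sketch with made-up scores (the class list is the one above):

```python
import numpy as np

classes = ["Plane", "Car", "Bird", "Cat", "Deer",
           "Dog", "Frog", "Horse", "Ship", "Truck"]

# Made-up raw scores ("logits") from the network's final layer.
logits = np.array([0.5, 0.1, 0.2, 1.8, 0.0, 3.2, 0.3, 0.1, 0.4, 0.2])

# Softmax: exponentiate and normalize so the scores sum to 1.
probs = np.exp(logits) / np.sum(np.exp(logits))

# Top-3 predicted classes, like the panel on the right.
for idx in np.argsort(probs)[::-1][:3]:
    print(f"{classes[idx]}: {probs[idx]:.1%}")
```

With these made-up scores, "Dog" gets the highest probability because it has the largest raw score.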

On the right, the top 3 predicted probabilities for each class are displayed. Right now the network's prediction is horribly wrong. Use the slider to adjust the weights of the network so that it predicts the image to be a dog with 100% probability and everything else with 0% probability. As you adjust the slider, watch the network's prediction and note how the output of the first convolution becomes more and more similar to the convolved images we saw earlier. This means the network is learning useful kernels.

Once you are done, try selecting some of the other images above!

A real convolutional neural network

In reality, we don't tune networks manually like this. Instead, the computer adjusts these values automatically, using a process called "backpropagation". The process is similar to what you did above! The computer makes some predictions, then adjusts the values in the direction that seems to produce better predictions. Through this process, the network is able to automatically learn complex kernels that extract useful information from images.
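The adjustment step can be illustrated with a single toy weight: nudge it opposite the gradient of the error. This is a deliberately simplified sketch of the idea, not the actual training procedure of the network above:

```python
# A toy illustration of the idea behind backpropagation: repeatedly adjust
# a weight in the direction that reduces the prediction error.
x, target = 2.0, 10.0   # made-up input and desired output
w = 0.5                 # initial weight (a bad guess)
lr = 0.1                # learning rate: how big each adjustment is

for step in range(50):
    prediction = w * x
    error = (prediction - target) ** 2        # squared error
    gradient = 2 * (prediction - target) * x  # d(error)/d(w)
    w -= lr * gradient                        # step "downhill"

print(round(w, 3))  # w converges toward 10 / 2 = 5.0
```

A real network does this for millions of weights at once, with the gradients computed layer by layer, which is where the "back" in backpropagation comes from.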

Below you'll find a small, fully trained network that runs directly in your browser. The network is similar to the VGG architecture. There are 32 images from each of the 10 classes that the network has never seen before. Click on any image to select it and have the network predict what the image is. Try to find some images that the network classifies correctly and some that it misclassifies!