# A Simple Explanation of how Computers Recognize Images

April 06, 2016

Last Updated 11/01/2016

How do computers recognize complicated images? There are a few algorithms for doing so, but an increasingly popular one is known as a Convolutional Neural Network, which is a specific type of Neural Network. To give readers a basic understanding of what that is, I will answer the following questions:

1. What is a neural network?
2. How is an image represented on a computer?
3. How does a "neural network" come into play and recognize images?
4. How are these networks created?

## What is a neural network?

A neural network is fundamentally a flowchart. Imagine that you want to determine whether or not you should buy a soda. Here are some things you might take into consideration:

• Is the drink too unhealthy for me?
• Is the drink too expensive?
• Do I like the drink that's offered?

If the drink is too unhealthy or expensive, or if you don't like it, you probably don't want to buy it. You can create a flowchart to tell if you'll buy it or not.

Fig. 1 - A flowchart posing questions that determine whether or not you would buy a soda.

This flowchart is fairly straightforward, but what if you have some answers that lead to "Yes, I'll buy a soda" and others that lead you to "No, I won't buy a soda"? You can assign points to each answer and add them up. After that, you set a cutoff value for the total and use that to determine if you want to buy a soda or not.

Fig. 2 - A more complex version of the flowchart in Figure 1. Here you add up points from each question to determine if you'll buy a soda or not.

Since "No" always results in adding zero points to the score, we can remove it and simplify the flowchart as seen below:

Fig. 3 - A flowchart describing the questions used to determine if you should buy a soda. For each question, if you answer "yes", you add the points in the blue circle (or subtract if the points are negative) to the score. If the answer is "no", add 0 points. Then, if the total number of points is at least the value in the yellow diamond, your answer to "Should I buy the drink?" is "yes".

In the above example, "points" are added to the question "Should I buy the drink?", and there is a minimum number of points needed for you to answer "yes". Specifically, you need to think the drink is tasty, and it can't be too expensive or unhealthy. If you were a bit less picky, you may lower your requirements, and the graph could look like this:

Fig. 4 - The same flowchart as Figure 3, except the threshold for buying a soda is lower (0 instead of 1).

In the above example, as long as either

• You liked the drink and it wasn't both unhealthy and expensive, or
• The drink was neither expensive nor unhealthy

you would decide to buy the drink.
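The point-adding idea can be sketched in a few lines of Python. This is only an illustration: the point values (+1 for liking the drink, -1 each for being too expensive or too unhealthy) are my assumption, chosen because they reproduce the behavior described for Figures 3 and 4. The names `POINTS` and `should_buy` are made up for this example.

```python
# Assumed point values; the real ones are in the figures, not the text.
POINTS = {"tasty": 1, "too_expensive": -1, "too_unhealthy": -1}

def should_buy(answers, threshold):
    """Add the points for every question answered "yes", then compare to the cutoff."""
    score = sum(POINTS[question] for question, yes in answers.items() if yes)
    return score >= threshold

picky = 1     # threshold described for Figure 3
relaxed = 0   # threshold described for Figure 4

tasty_but_pricey = {"tasty": True, "too_expensive": True, "too_unhealthy": False}
print(should_buy(tasty_but_pricey, picky))    # False
print(should_buy(tasty_but_pricey, relaxed))  # True
```

Lowering the threshold changes the decision without touching any of the questions, which is exactly the difference between the two flowcharts.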

Now, each of these questions is a bit subjective. Imagine you wanted to make a decision based off objective observations. You may ask three other questions:

1. Does the soda come in a bottle? (Assume bottles are more expensive than cans/fountain drinks)
2. Am I dieting?
3. Is the soda a diet soda?

Using this information, you may come up with a flowchart that looks like this:

Fig. 5 - An extension of Figure 4, where objective criteria are used to determine the answers to the 3 questions that are asked to determine if a soda should be bought. First, you answer the three questions in blue boxes by answering the questions in green boxes. Then you use the answers in the blue boxes to determine if you should buy a drink. As a side note, this assumes diet soda tastes worse than regular, which isn't true for everyone (but is true for the sake of this example).

You could tweak the values or add/remove connections to your liking, and then the chart could easily describe your thought process behind buying a soda. Both the two- and three-layer charts above are examples of neural networks. In this case, you are asking yes/no questions, each of which adds points to a score for the questions above it. If the total score for a question is at least a certain value, you answer "yes". Otherwise, the answer is "no".
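As a rough sketch of a three-layer chart, the code below answers the subjective questions from the objective ones and then scores the final decision. All weights and thresholds here are hypothetical stand-ins (the real ones are in Figure 5), and `fires` and `buy_soda` are names invented for this example.

```python
def fires(score, threshold):
    """Answer "yes" (1) if the score reaches the threshold, otherwise "no" (0)."""
    return 1 if score >= threshold else 0

def buy_soda(in_bottle, dieting, is_diet_soda):
    # First layer: objective facts (0 or 1) -> subjective answers.
    # Every weight and threshold below is a hypothetical value.
    too_expensive = fires(1 * in_bottle, 1)
    too_unhealthy = fires(1 * dieting - 1 * is_diet_soda, 1)
    tasty = fires(-1 * is_diet_soda, 0)   # assumes diet soda tastes worse
    # Second layer: add up points from the answers, with a threshold of 0.
    score = 1 * tasty - 1 * too_expensive - 1 * too_unhealthy
    return fires(score, 0) == 1

print(buy_soda(in_bottle=0, dieting=0, is_diet_soda=0))  # True
print(buy_soda(in_bottle=1, dieting=1, is_diet_soda=0))  # False
```

Each question works the same way at every level: add up points from the answers below it and compare the total to a cutoff.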

## How does a computer represent an image?

A computer generally represents an image using pixels, which are tiny squares that usually contain 1 to 4 numbers. If a pixel has one number, then it is either black-and-white or grayscale. With 3 numbers, it is usually made of red, green, and blue (RGB), and with 4 numbers, it is usually RGB plus "alpha" (RGBA), which describes how transparent the pixel is. Here are examples of each of these using a picture of my friend's cat, Judy.

Fig. 6 - Examples of different color schemes for digital images. Note that the RGBA image has 50% transparency added. Less complex color schemes are simpler to analyze.

For demonstration purposes, I will be using black-and-white images, but the process generalizes to grayscale and colored images. Below is a 5 pixel by 5 pixel image, shown beside the same image with the value a computer might assign to each pixel's color.

Fig. 7 - A simple black-and-white 5-by-5 image next to the same image with the numeric value of each pixel. Grayscale would have values ranging from 0 (black) to 255 (white), with values in between being shades of gray. RGB images would have three numbers in each pixel, representing each of red, green, and blue, and RGBA would have a fourth number representing the transparency.
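In code, such an image is just a grid of numbers. The sketch below uses plain Python lists. The pixel pattern is made up, and treating 1 as black and 0 as white is a convention I am assuming for this example; the figure may use different values.

```python
# Assumed convention for this sketch: 1 = black pixel, 0 = white pixel.
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
]

height = len(image)
width = len(image[0])
black_pixels = sum(sum(row) for row in image)
print(height, width, black_pixels)  # 5 5 7

# Richer color schemes store more numbers per pixel (assuming 8-bit channels).
gray_pixel = 128                # one number, 0 (black) to 255 (white)
rgb_pixel = (255, 0, 0)         # red, green, blue
rgba_pixel = (255, 0, 0, 128)   # RGB plus alpha (roughly 50% transparent)
```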

## How does a "neural network" come into play and recognize images?

### Flowchart Representation

Imagine that you want to determine if an image is a face. You would probably look for eyes, a nose, and a mouth. In a similar manner to the flowcharts above, you could make another flowchart that looks like this:

Fig. 8 - A flowchart describing how you can determine if something is a face by asking three questions.

One could obviously look for more features of a face, like eyebrows, eyelashes, etc., but if you see eyes, a nose, and a mouth, that is usually sufficient to determine if something is a face.

How does a computer know what a "nose", "eye", or "mouth" is, though? In the same spirit as the soda-buying example, you can add another "layer" of objects, which has more objective criteria for determining what objects are:

Fig. 9 - An expanded flowchart showing what kinds of features a computer may look for in order to determine if an image has eyes, a mouth, and a nose.

### A Visual Example

The million-dollar question: What does a computer actually do to determine if something is a dot, a line, or some other arbitrary shape?

Answer: A computer can use templates (AKA "convolutions") of shapes or parts of shapes by overlapping them with parts of an image and seeing how closely they match. This is typically done by taking a relatively small template (maybe 7-15 pixels in height & width) and testing every possible placement of the template over the image.

Look at the 5x5 grid with the red and white boxes.

Fig. 10 - A 5x5 template used to detect 3x3 square dots in an image. The shape roughly corresponds to that of an eye.

Consider overlapping this on top of part of an image, calculating the following result as a "score":

1. For each white tile (template) overlapping a black tile (image sample), subtract 1 point.
2. For each red tile (template) overlapping a black tile (image sample), add 2 points.
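The two scoring rules above can be sketched as a small function. The template layout (a 3x3 red block inside a white border) is my assumption based on the caption of Figure 10, and the 1 = black, 0 = white convention and the name `score_patch` are likewise choices made for this example.

```python
# "R" = red tile, "W" = white tile. Assumed layout: 3x3 red block, white border.
TEMPLATE = [
    "WWWWW",
    "WRRRW",
    "WRRRW",
    "WRRRW",
    "WWWWW",
]

def score_patch(patch, template=TEMPLATE):
    """Score one 5x5 image patch; 1 = black pixel, 0 = white (assumed convention)."""
    score = 0
    for patch_row, template_row in zip(patch, template):
        for pixel, tile in zip(patch_row, template_row):
            if pixel == 1:  # only black image tiles change the score
                score += 2 if tile == "R" else -1
    return score

dot = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
all_black = [[1] * 5 for _ in range(5)]
print(score_patch(dot))        # 18
print(score_patch(all_black))  # 2
```

A perfect 3x3 dot earns the maximum score, while a solid black patch scores poorly because the white border tiles subtract points.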

With the following image of a face, we can calculate scores at various 5x5 samples of it:

Fig. 11 - A 25x25 image of a face with grid lines separating each pixel.

"Scoring" near the eyes:

Fig. 12 - The picture in Figure 11 being "scored" by the template in Figure 10 around its left eye region.

"Scoring" near the nose:

Fig. 13 - The "eye" template from Figure 10 being applied around the nose region, resulting in a smaller score.

"Scoring" near the mouth:

Fig. 14 - The "eye" template from Figure 10 being applied around the mouth region, resulting in an even smaller score than the nose region.

If we repeat this over the entire 25 by 25 image, we get a 21 by 21 grid of scores (because the 5 by 5 template fits in 25 - 5 + 1 = 21 positions along each side before it hits the border). Shown as a heatmap, this grid tells us where the image matches the template most closely.

Fig. 15 - The "convolution" of the "eye" template from Figure 10 on the entirety of the face image. Note that the highest values are located around the eyes. The computer could assume that the tiles with dark red are where eyes are located.
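Here is a minimal sketch of sliding a template over a whole image to build such a grid of scores. The test image (a blank 25x25 grid with a single 3x3 black dot standing in for an eye) is hypothetical, as are the template layout and the function name `convolve`. As before, 1 means black and 0 means white.

```python
def convolve(image, weights):
    """Score `weights` at every position where it fits entirely inside `image`."""
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(weights), len(weights[0])
    heatmap = []
    for top in range(img_h - tpl_h + 1):
        row = []
        for left in range(img_w - tpl_w + 1):
            row.append(sum(
                image[top + i][left + j] * weights[i][j]
                for i in range(tpl_h) for j in range(tpl_w)
            ))
        heatmap.append(row)
    return heatmap

# +2 for a red tile, -1 for a white tile (assumed 3x3 red center, white border).
weights = [[2 if 1 <= i <= 3 and 1 <= j <= 3 else -1 for j in range(5)]
           for i in range(5)]

# Hypothetical image: blank 25x25 grid with one 3x3 black dot.
image = [[0] * 25 for _ in range(25)]
for r in range(6, 9):
    for c in range(7, 10):
        image[r][c] = 1

heatmap = convolve(image, weights)
print(len(heatmap), len(heatmap[0]))  # 21 21

# Highest score and the top-left corner of the template placement that produced it.
best = max((value, top, left)
           for top, row in enumerate(heatmap)
           for left, value in enumerate(row))
print(best)  # (18, 5, 6)
```

The best placement puts the template's red center exactly over the dot, which is how the highest value in the heatmap points to the feature's location.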

The computer can detect the locations of the eyes by looking at where the values are the highest (the darkest red color). Similarly, we can come up with a template for the nose:

Fig. 16 - A template that can be used to find a nose on the face. In practice you would have more than one template for the nose so you could detect different shapes of noses and smaller parts of it.

And the corresponding heatmap for detecting where the nose could be:

Fig. 17 - The "convolution" of the "nose" template from Figure 16 on the face image. Note that while the largest value is at the center, the fact that only one square is dark red implies that this is not a very reliable template.

And finally a mouth template and its heatmap:

Fig. 18 - A simple template that can be used for the long, flat part of the mouth.
Fig. 19 - The "convolution" of the "mouth" template from Figure 18 on the face image. There is a large region of dark red, which the computer would likely assume corresponds to the mouth.

A few notes:

• Each convolution (template) corresponds to a different "question" that the computer asks an image several times ("Is this area a(n) eye/nose/mouth?"). The minimum score for something to be considered a specific feature, like a face, isn't usually the same as for another feature, like a nose.
• The convolution can have more complex rules. For example, in detecting eyes, the middle pixel could be worth more points than the other ones.
• I will add more notes if I can think of important ones.

### Further Computing

At this point, the computer repeats the same "convolution" process, but this time on the "convolved" images (heatmaps) from Figures 15, 17, and 19, to see whether the combination of facial features and their locations matches what a face should look like. Features can be rotated, magnified/shrunk, or shifted up/down/left/right, so there will likely be more than one template for each of the eyes, nose, and mouth (e.g., a sideways picture of a face should still be recognized as a face).

Additionally, detecting the face may not even be the final task. If you wanted to detect, for example, a monkey, recognizing the face might be part of the task, but you would have higher levels of the "flowchart" that would combine the facial recognition with other recognized objects like a body, tail, arms, and legs.

## How are These Networks Created?

While it is nice to have something that can recognize images, creating such a network can be a difficult task for 3 reasons:

1. A lot of images are needed to "train" the network to recognize them.
2. The exact structure of the network is often very complicated, especially when color is used.
3. Training on images, especially large ones, takes a lot of time and computational power.

If you are lucky, you will start with a network that someone else already created, and it can detect most important features of an image that you would need. In that case, all you have to do is train the network to connect the features to the labels you have on images.

If you are not so lucky, you will have to train the whole network, which can easily contain hundreds of thousands of interconnected values that are initially random. The values that you "train" correspond to the numbers on the flowcharts in previous sections.

The process is complicated, but in layman's terms, you check to see how well the current network can label an image, and then you make very tiny corrections to each value in the network to make the predicted label slightly better. The corrections are small for the same reason you don't draw a straight line through a maze: the "solution" requires changing directions many times over short distances.
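To make "tiny corrections" concrete, here is a toy sketch that learns the point values for the soda flowchart from labeled examples. It uses the classic perceptron update rule as a simple stand-in; real image networks are typically trained with more sophisticated gradient-based methods. The labels follow the "picky" rule from earlier (buy only if the drink is tasty and neither too expensive nor too unhealthy). The starting values are zeros rather than random numbers so the run is reproducible, and `bias` plays the role of a negated threshold. All names here are invented for the example.

```python
from itertools import product

def predict(weights, bias, answers):
    """Answer "yes" (1) if the weighted points plus the bias reach zero."""
    score = sum(w * a for w, a in zip(weights, answers)) + bias
    return 1 if score >= 0 else 0

# Every combination of (tasty, too_expensive, too_unhealthy) with its label.
examples = [(a, 1 if a == (1, 0, 0) else 0) for a in product([0, 1], repeat=3)]

weights, bias = [0.0, 0.0, 0.0], 0.0  # zeros instead of random values
step = 0.125                           # size of each small correction

for epoch in range(1000):
    mistakes = 0
    for answers, label in examples:
        error = label - predict(weights, bias, answers)  # -1, 0, or +1
        if error != 0:
            mistakes += 1
            # Nudge each value slightly in the direction that reduces the error.
            weights = [w + step * error * a for w, a in zip(weights, answers)]
            bias += step * error
    if mistakes == 0:  # every example is labeled correctly, so stop
        break

print(all(predict(weights, bias, a) == label for a, label in examples))  # True
```

The learned values end up with the same signs as the hand-picked points: positive for liking the drink, negative for cost and unhealthiness.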

Usually the images used to train the algorithm will be shifted left/right/up/down, grown/shrunk, rotated, and sometimes even smudged/warped. This is a way of artificially increasing the effective number of images one has.
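Shifting is the simplest of these transformations to sketch. The helper below is a hypothetical `shift` function written for this example (not taken from any library), operating on an image stored as a grid of numbers.

```python
def shift(image, down=0, right=0, fill=0):
    """Return a copy of `image` moved by (down, right) pixels; gaps get `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            new_r, new_c = r + down, c + right
            if 0 <= new_r < height and 0 <= new_c < width:
                shifted[new_r][new_c] = image[r][c]
    return shifted

original = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
# One labeled image becomes five training images: the original plus four shifts.
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
augmented = [original] + [shift(original, d, r) for d, r in moves]
print(len(augmented))               # 5
print(shift(original, right=1)[1])  # [0, 0, 1]
```

Every shifted copy keeps the label of the original image, so the network sees the same object in slightly different places.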

## Summary

Computers can use neural networks, which are essentially a type of flowchart, to recognize images, which are grids of numbers. They use layers of "templates" to detect simple objects, such as lines, dots, and curves, in images and check whether a combination of them forms a more complicated object, such as a face.

## Sketching Recognition Project

If you want to see a fun application of image recognition, check out my sketching and guessing app on this website. You can draw something on a digital canvas, submit it, and have the computer guess what it is. It is being updated actively.

## Thanks

I would like to thank Daniel Jumper for providing feedback throughout the writing process, and for recommending draw.io for creating the flowcharts.
