April 06, 2016
Last Updated 11/01/2016
How do computers recognize complicated images? There are a few algorithms for doing so, but an increasingly popular one is known as a Convolutional Neural Network, which is a specific type of Neural Network. I will explain the following in order to hopefully grant readers a fundamental understanding of what that is:
A neural network is fundamentally a flowchart. Imagine that you want to determine whether or not you should buy a soda. Here are some things you might take into consideration:
If the drink is too unhealthy or expensive, or if you don't like it, you probably don't want to buy it. You can create a flowchart to tell if you'll buy it or not.
This flowchart is fairly straightforward, but what if you have some answers that lead to "Yes, I'll buy a soda" and others that lead you to "No, I won't buy a soda"? You can assign points to each answer and add them up. After that, you set a cutoff value for the total and use that to determine if you want to buy a soda or not.
Since "No" always results in adding zero points to the score, we can remove it and simplify the flowchart as seen below:
In the above example, "points" are added to the question "Should I buy the drink?", and there is a minimum number of points needed for you to answer "yes". Specifically, you need to think the drink is tasty, and it can't be too expensive or unhealthy. If you were a bit less picky, you might lower your requirements, and the graph could look like this:
In the above example, as long as either
you would decide to buy the drink.
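The points-and-cutoff idea can be sketched in a few lines of Python. This is only an illustration: the question names and point values below are hypothetical stand-ins, not the numbers from the flowcharts above.

```python
# Hypothetical point values; the numbers on the original flowcharts are not reproduced here.
POINTS = {"tasty": 1, "affordable": 1, "healthy": 1}

def should_buy(answers, cutoff):
    """Add up the points for every "yes" answer and compare the total to the cutoff."""
    score = sum(POINTS[question] for question, yes in answers.items() if yes)
    return score >= cutoff

answers = {"tasty": True, "affordable": True, "healthy": False}
picky = should_buy(answers, cutoff=3)       # every answer must be "yes"
less_picky = should_buy(answers, cutoff=2)  # any two "yes" answers are enough
print(picky, less_picky)
```

Lowering the cutoff is the code equivalent of being less picky: the same answers now clear the bar.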
Now, each of these questions is a bit subjective. Imagine you wanted to make a decision based on objective observations. You might ask three other questions:
Using this information, you may come up with a flowchart that looks like this:
You could tweak the values, or add and remove connections, to your liking, and the chart could then describe your thought process behind buying a soda. Both the two- and three-layer charts above are examples of neural networks. In this case, you are asking yes/no questions, each of which adds points to the score of the questions above it. If the total score for a question is at least a certain value, you answer "yes". Otherwise, the answer is "no".
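A layered chart like this might be sketched as follows. The three observations, the connections, and every point value are assumptions made up for the example, since the original chart's questions and numbers are not available here.

```python
def unit(inputs, points, cutoff):
    """Answer "yes" (1) if the points earned from "yes" inputs reach the cutoff, else "no" (0)."""
    score = sum(p * x for p, x in zip(points, inputs))
    return 1 if score >= cutoff else 0

def buy_soda(low_calorie, low_price, favorite_flavor):
    """Hypothetical three-layer chart: observations -> opinions -> decision."""
    observations = [low_calorie, low_price, favorite_flavor]
    # Middle layer: each opinion is answered from the objective observations.
    healthy = unit(observations, points=[1, 0, 0], cutoff=1)
    affordable = unit(observations, points=[0, 1, 0], cutoff=1)
    tasty = unit(observations, points=[0, 0, 1], cutoff=1)
    # Top layer: "tasty" is worth more points than the other two opinions.
    return unit([healthy, affordable, tasty], points=[1, 1, 2], cutoff=3)

print(buy_soda(low_calorie=0, low_price=1, favorite_flavor=1))
print(buy_soda(low_calorie=1, low_price=1, favorite_flavor=0))
```

Every question in the chart is the same small function; only the points and the cutoff differ, which is what makes the values easy to tweak.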
A computer generally represents an image using pixels, which are tiny squares that usually contain 1 to 4 numbers. If a pixel has one number, then it is either black-and-white or grayscale. With 3 numbers, it is often made of red, green, and blue (RGB), and with 4 numbers, it is often RGB plus "alpha" (RGBA), which describes how transparent the pixel is. Here are examples of each of these using a picture of my friend's cat, Judy.
For demonstration purposes, I will be using black and white images, but the process generalizes to grayscale and colored images. An example of how a computer may put values on such an image can be seen below in an example of a 5 pixel by 5 pixel image beside the same image with the value representing the color for each pixel.
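As a rough sketch, an image like this can be stored as a list of rows. The pixel pattern below is hypothetical (it is not the figure's image), and using 1 for black and 0 for white is just a convention chosen for the example.

```python
# A hypothetical 5x5 black-and-white image: 1 = black pixel, 0 = white pixel.
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
]
height = len(image)
width = len(image[0])
black_pixels = sum(sum(row) for row in image)

# Pixels that carry more numbers, using the common 0-255 range per channel.
gray_pixel = 128                 # one number: a shade of gray
rgb_pixel = (255, 0, 0)          # three numbers: pure red
rgba_pixel = (255, 0, 0, 128)    # four numbers: red at roughly half transparency

print(height, width, black_pixels)
```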
Imagine that you want to determine if an image is a face. You would probably look for eyes, a nose, and a mouth. In a similar manner to the flowcharts I have above, you could make another flowchart that looks like this:
One could obviously look for more features of a face, like eyebrows, eyelashes, etc., but if you see eyes, a nose, and a mouth, that is usually sufficient to determine if something is a face.
How does a computer know what a "nose", "eye", or "mouth" is, though? In the same spirit as the soda-buying example, you can add another "layer" of objects, which has more objective criteria for determining what objects are:
The million-dollar question: What does a computer actually do to determine if something is a dot, a line, or some other arbitrary shape?
Answer: A computer can use templates (usually called "filters" or "kernels"; the process of sliding them over an image is the "convolution") of shapes or parts of shapes by overlapping them with parts of an image and seeing how closely they match. This is typically done by taking a relatively small template (maybe 7-15 pixels in height and width) and testing every possible placement of the template over the image.
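Here is a minimal sketch of scoring a single placement. It assumes the score is the sum of template value times pixel value over the overlapped region, which is the usual convolution-style score and may differ from the exact scoring rule used in the figures. The template and image are hypothetical.

```python
def placement_score(image, template, top, left):
    """Sum of (template value * pixel value) where the template overlaps the image."""
    return sum(
        template[r][c] * image[top + r][left + c]
        for r in range(len(template))
        for c in range(len(template[0]))
    )

# Hypothetical 3x3 "dot" template: +1 where black is expected, -1 where white is expected.
dot = [
    [-1, -1, -1],
    [-1,  1, -1],
    [-1, -1, -1],
]
# Hypothetical image: 1 = black, 0 = white.
image = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
centered = placement_score(image, dot, top=0, left=0)    # the dot lines up with the template
off_center = placement_score(image, dot, top=1, left=1)  # black pixels land where white is expected
print(centered, off_center)
```

A good match earns a positive score, while black pixels in the wrong places pull the score down.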
Look at the 5x5 grid with the red and white boxes.
Consider overlapping this on top of part of an image, calculating the following result as a "score":
With the following image of a face, we can calculate scores at various 5x5 samples of it:
"Scoring" near the eyes:
"Scoring" near the nose:
"Scoring" near the mouth:
If we repeat this over the entire 25 by 25 image, we get a 21 by 21 grid (because the 5 by 5 template fits in 25 - 5 + 1 = 21 positions in each direction before it hits the border). This heatmap shows us where the computer recognizes the template the most.
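The sliding step can be sketched like this. The "eye" template and the test image are hypothetical, and the score is again assumed to be a sum of products.

```python
def heatmap(image, template):
    """Score the template at every placement that fits entirely inside the image."""
    th, tw = len(template), len(template[0])
    rows = len(image) - th + 1
    cols = len(image[0]) - tw + 1
    return [
        [
            sum(template[r][c] * image[top + r][left + c]
                for r in range(th) for c in range(tw))
            for left in range(cols)
        ]
        for top in range(rows)
    ]

# Hypothetical 5x5 "eye" template (1 = black, 0 = white).
template = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
]
# Blank 25x25 image with one copy of the eye pasted at row 6, column 4.
image = [[0] * 25 for _ in range(25)]
for r in range(5):
    for c in range(5):
        image[6 + r][4 + c] = template[r][c]

scores = heatmap(image, template)
# The placement with the highest score is where the template matches best.
best = max(
    ((top, left) for top in range(len(scores)) for left in range(len(scores[0]))),
    key=lambda pos: scores[pos[0]][pos[1]],
)
print(len(scores), len(scores[0]), best)
```

The grid of scores has 21 rows and 21 columns, and its peak sits exactly where the eye was pasted.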
The computer can detect the locations of the eyes by looking at where the values are the highest (the darkest red color). Similarly, we can come up with a template for the nose:
And the corresponding heatmap for detecting where the nose could be:
And finally a mouth template and its heatmap:
At this point, using the same process as the "convolution" above, but on the "convolved" images from Figures 13, 15, and 17, the computer will look to see if the combinations of facial features and their locations match up to what a face should be. Features can be rotated, magnified/shrunk, or shifted up/down/left/right, so there will likely be more than one template for each of the eyes, nose, and mouth (e.g., a sideways picture of a face should still be recognized as a face).
Additionally, detecting the face may not even be the final task. If you wanted to detect, for example, a monkey, recognizing the face might be part of the task, but you would have higher levels of the "flowchart" that would combine the facial recognition with other recognized objects like a body, tail, arms and legs.
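To picture why one feature may need several templates, here is a small sketch that builds rotated and mirrored copies of a template by hand. This is only an illustration with hypothetical templates. A trained network typically learns its own set of templates rather than generating copies this way.

```python
def rotate_90(template):
    """Rotate a template 90 degrees clockwise."""
    return [list(row) for row in zip(*template[::-1])]

def flip_horizontal(template):
    """Mirror a template left to right."""
    return [row[::-1] for row in template]

def variants(template):
    """Collect the distinct rotations and mirror images of a template."""
    seen = []
    current = template
    for _ in range(4):
        for candidate in (current, flip_horizontal(current)):
            if candidate not in seen:
                seen.append(candidate)
        current = rotate_90(current)
    return seen

line = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
corner = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(len(variants(line)), len(variants(corner)))
```

A straight line only has two distinct orientations here (vertical and horizontal), while the corner shape has four, one for each corner of the grid.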
While it is nice to have something that can recognize images, creating such a network can be a difficult task for 3 reasons:
If you are lucky, you will start with a network that someone else already created, and it can detect most important features of an image that you would need. In that case, all you have to do is train the network to connect the features to the labels you have on images.
If you are not so lucky, you will have to train the whole network, which can easily contain hundreds of thousands of interconnected values that are initially random. The values that you "train" correspond to the numbers on the flowcharts in previous sections.
The process is complicated, but in layman's terms, you check to see how well the current network can label an image, and then you make very tiny corrections to each value in the network to make the predicted label slightly better. The corrections are small for the same reason you don't draw a straight line through a maze: the "solution" requires changing directions many times over short distances.
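The "tiny corrections" idea can be sketched on a toy problem with only three trainable values. Everything here is an assumption made for illustration: the data, the step size, and the squared-error measure. The values also start at zero so the run is repeatable, whereas a real network would start from random values.

```python
# Toy data: two yes/no inputs and the wanted label ("yes" only when both inputs are "yes").
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def predict(weights, bias, inputs):
    """Score for one example: points for each "yes" input plus a constant."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def total_error(weights, bias):
    """How far the scores are from the wanted labels, summed over all examples."""
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in examples)

weights, bias = [0.0, 0.0], 0.0
step = 0.05  # keeps every correction tiny
error_before = total_error(weights, bias)

for _ in range(2000):
    for inputs, label in examples:
        miss = predict(weights, bias, inputs) - label
        # Nudge each value slightly in the direction that shrinks the miss.
        weights = [w - step * miss * x for w, x in zip(weights, inputs)]
        bias -= step * miss

error_after = total_error(weights, bias)
print(error_before, error_after)
```

After many small nudges, using 0.5 as the cutoff on the score separates the "yes" example from the three "no" examples.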
Usually, images used to train the algorithm will be shifted left/right/up/down, grown/shrunk, rotated, and sometimes even smudged/warped. This is a way of artificially increasing the effective number of images one has.
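Shifting is the simplest of these tricks to sketch. The helper below is hypothetical and fills the uncovered border with white (0); other fill choices are possible.

```python
def shift(image, down=0, right=0, fill=0):
    """Return a copy of the image moved by (down, right) pixels, padding with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            new_r, new_c = r + down, c + right
            if 0 <= new_r < height and 0 <= new_c < width:
                shifted[new_r][new_c] = image[r][c]
    return shifted

image = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
# One original image becomes nine training images (including the unshifted one).
augmented = [shift(image, down=d, right=r) for d in (-1, 0, 1) for r in (-1, 0, 1)]
print(len(augmented))
```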
Computers can use neural networks, which are essentially a type of flowchart, to recognize images, which are grids of numbers. They use layers of "templates" to detect simple objects, such as lines, dots, and curves, in images and check to see if a combination of them forms a more complicated object, such as a face.
If you want to see a fun application of image recognition, check out my sketching and guessing app on this website. You can draw something on a digital canvas, submit it, and have the computer guess what it is. It is being updated actively.
I would like to thank Daniel Jumper for providing feedback throughout the writing process, and for recommending draw.io for creating the flowcharts.