Teaching computers to understand what they see is the problem that keeps computer vision engineers awake at night. Although the field of Image Recognition has made a lot of progress over the past few years, many puzzle pieces are still missing before they fit together into a complete and clear picture of how to teach machines to make sense of what they see.
For a long time, Image Classification was not treated as a statistical problem, until a partial solution came from the Machine Learning field in the form of Neural Networks, in particular Convolutional Neural Networks (CNNs). A CNN is a special type of Artificial Neural Network that can offer human-like results in image classification tasks. This article explains how the human brain reconstructs the visual world, how machines learn to understand visuals, and what the applications of Image Classification are.
Image Recognition: How hard can it be?
Image Recognition is the process of identifying what an image depicts. For humans, interpreting the visual world comes easily. When we see something, we have an inherent understanding of what it is, and in most cases we do not need to study the object consciously to make sense of it. For computers, however, it is a challenging task because they can only manipulate numbers. For example, to a computer a 3x3-pixel square on Albert Einstein’s forehead is a three-dimensional array of numbers: three rows, three columns, and three color channels, one for each of the primary colors red, green, and blue.
Even though humans can interpret images in a fraction of a second, a complex cognitive process takes place in the visual cortex of the brain. The visual cortex is divided into layers (V1-V8) and processes the visual information coming from the eyes. When a stimulus appears in the receptive field, its representation first reaches the V1 layer; in other words, the neurons in V1 fire first. This layer is a map that preserves the spatial information of the stimulus in the world and also detects its edges. V1 is strongly connected to V2, which in turn is involved in discriminating shapes, orientations, colors, and other low-level features. Higher-level visual features, which involve the brain’s understanding of the context and relationships in an image, are perceived only in the higher layers, such as V6-V8.
Let’s say the perceived stimulus is your dad. The initial detection of the object, its edges and outline, is accomplished in V1. However, the semantic information, that this is your dad, is perceived only in layers V6-V8. It is important to stress that what exactly each layer is responsible for remains controversial, as research keeps bringing new discoveries. However, it is generally accepted that the higher the layer, the more abstract the representation becomes. Apart from this high-level architecture, the mechanics of individual neurons have also been used, at the micro level, to simulate the processes in the layers of the visual cortex.
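To make the point about pixels concrete, here is a minimal sketch of how a 3x3 patch of an image might look to a program. The pixel values are invented for illustration (they are not taken from any real photograph), and the helper names `shape`, `channel`, and `to_grayscale` are hypothetical. Plain nested lists are used so that nothing beyond the Python standard library is needed.

```python
# A hypothetical 3x3 patch of skin-toned pixels: rows x columns x channels.
# Each innermost list is one pixel [red, green, blue], each value 0-255.
# The numbers are made up for illustration only.
patch = [
    [[224, 172, 150], [226, 175, 152], [221, 170, 147]],
    [[223, 171, 149], [228, 177, 155], [222, 169, 148]],
    [[220, 168, 146], [225, 174, 151], [219, 167, 145]],
]


def shape(image):
    """Return (height, width, channels) of a nested-list image."""
    return (len(image), len(image[0]), len(image[0][0]))


def channel(image, index):
    """Extract one color channel (0=red, 1=green, 2=blue) as a 2-D grid."""
    return [[pixel[index] for pixel in row] for row in image]


def to_grayscale(image):
    """Average the three channels of every pixel into one brightness value."""
    return [[sum(pixel) // len(pixel) for pixel in row] for row in image]


print(shape(patch))       # (3, 3, 3)
print(channel(patch, 0))  # the red intensities only
print(to_grayscale(patch))
```

Everything the computer "knows" about the forehead is in these 27 numbers; any notion of skin, face, or Einstein has to be derived from them by arithmetic.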
Each neuron receives input through its dendrites and applies a complex non-linearity to it; the neuron fires if the resulting summed input exceeds some threshold. Although this explanation is very simplified, it was enough for researchers to invent the first Artificial Neural Network. Inspired by the human visual system, engineers tried to replicate this process with machines. To enable computers to understand objects, it was necessary to create a system that would extract high-level features from visual “stimuli” using only numerical manipulations. That is where Convolutional Neural Nets come into play. When fed with enough clean and well-defined data, a CNN can extract high-level features that are common to each category the data encompasses.
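The two ideas above, a neuron that fires past a threshold and a network that pulls features out of raw numbers, can be sketched in a few lines. This is a deliberately simplified illustration, not how a production CNN is implemented: the `neuron` function replaces the "complex non-linearity" with a plain weighted sum and a hard threshold, the kernel weights are hand-picked rather than learned from data, and the function names and the toy image are assumptions made for this example.

```python
def neuron(inputs, weights, threshold):
    """Simplified artificial neuron: fire (1) if the weighted sum exceeds the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0


def convolve2d(image, kernel):
    """Slide a square kernel over a 2-D image (no padding, stride 1).

    Like most CNN libraries, this does not flip the kernel, so strictly
    speaking it computes a cross-correlation.
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# Toy 5x5 grayscale image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9] for _ in range(5)]

# Hand-picked kernel that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)

# Pass every response through the threshold neuron to get an "edge / no edge" map.
edges = [[neuron([v], [1], threshold=10) for v in row] for row in feature_map]

print(feature_map)
print(edges)
```

The feature map is zero over the uniformly dark region and large wherever the kernel window straddles the brightness change, which loosely mirrors the edge detection attributed to V1 earlier. In a real CNN the kernel values are learned during training, and many such layers are stacked so that later layers respond to increasingly abstract features.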


