The field of Computer Vision has been experiencing significant momentum since the advent of Deep Neural Networks, and in particular Convolutional Neural Networks (CNNs). Although the ideas behind CNNs trace back to visual cortex research of the late 1960s, their full potential remained hidden until recently. The arrival of computationally powerful hardware made it possible to experiment with CNNs and tap into their real value.
In 2012, Alex Krizhevsky designed a CNN called AlexNet, which was trained on the large-scale ImageNet dataset and run on GPUs. The results were so promising that Deep Neural Network research has dominated the Computer Vision field ever since. In fact, many new CNN architectures are introduced every year, and Deep Learning has become a buzzword.
Given that designing a CNN architecture that performs well is not a trivial problem but requires genuine scientific expertise, the progress witnessed in recent years proves the importance of this technology.
In particular, computer vision problems such as image tagging, object detection, and image generation have improved tremendously thanks to Convolutional Neural Networks. First, this new approach eliminated the need for the hand-engineered features that were previously used to solve these problems. Second, the results produced by Deep Neural Networks outperformed the older techniques.
So, let’s take a look at the most common technologies that are powered by CNNs.
- Image Tagging
- Reverse Image Search
- Image Captioning
- Object Detection
- Image Segmentation / Semantic Segmentation
- Image Denoising
- Image Generation
1. Image Tagging
What it is
Image tagging is a CNN-based technology that enables a computer to assign a category to an image.
When to use it
Image tagging can be used to bring structure to unstructured image datasets.
How it works
- We feed input data, in the form of batches of images, into the first convolutional layer.
- A convolutional layer performs cross-correlation to find the features that matter most for identifying the category an image belongs to.
- A pooling (subsampling) layer reduces the number of neurons produced by the previous convolutional layer to avoid memorization and bias. This makes the model more robust, so it performs accurately on unseen data.
- Depending on the CNN architecture, we might need to repeat the previous two steps multiple times.
- Finally, we have a fully connected layer. It connects every neuron in one layer to every neuron in the next to produce the predictions.
- The output is then the probability that the image belongs to each category in our dataset.
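To make these steps concrete, here is a minimal sketch in PyTorch. The layer sizes, image resolution, and class count are illustrative assumptions, not a recommended architecture:

```python
import torch
import torch.nn as nn

class SimpleTagger(nn.Module):
    """A minimal CNN mirroring the steps above: conv -> pool, repeated, then fully connected."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer: cross-correlation over the image
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling layer: subsample to reduce neurons
            nn.Conv2d(16, 32, kernel_size=3, padding=1), # repeat conv + pool, as noted above
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)               # extract features
        x = torch.flatten(x, start_dim=1)  # flatten for the fully connected layer
        logits = self.classifier(x)
        return logits.softmax(dim=1)       # probability of belonging to each category

# A batch of four 64x64 RGB images produces one probability row per image.
probs = SimpleTagger()(torch.randn(4, 3, 64, 64))
print(probs.shape)  # torch.Size([4, 10])
```

For actual training you would typically drop the final softmax and apply nn.CrossEntropyLoss to the raw logits.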
Business use cases
Companies seeking to organize massive datasets into categories that are meaningful to them can take advantage of this technology. Its applications are extensive, ranging from identifying defects on a product line to diagnosing diseases from MRI scans. Another example is applying image tagging to improve product discovery. Content management platforms, like ProcessMaker IDP, leverage machine vision to streamline the labeling of large visual datasets for retail companies.
2. Reverse Image Search
What it is
Reverse Image Search is a method that extracts image representations using CNNs and compares them with one another to find conceptually similar images.
When to use it
Reverse Image Search is used to find similar images in an unstructured data space.
How it works
Reverse Image Search extracts image representations from the last convolutional layer of a neural network. These representations are then compared to one another using a distance metric, such as cosine or Euclidean distance.
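As a sketch of how this might look in practice, assuming torchvision's pre-trained ResNet-18 as the backbone and cosine similarity as the distance metric (any backbone and metric could be substituted; newer torchvision releases take a weights= argument instead of pretrained=):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Pre-trained ResNet-18 with its classification head removed, so the output of
# its final convolutional stage (after global pooling) serves as the embedding.
backbone = models.resnet18(pretrained=True)
backbone.fc = torch.nn.Identity()  # drop the classifier, keep the representation
backbone.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """Map a batch of images (N, 3, 224, 224) to L2-normalized embeddings."""
    return F.normalize(backbone(images), dim=1)

query = embed(torch.randn(1, 3, 224, 224))      # the query image
gallery = embed(torch.randn(100, 3, 224, 224))  # the dataset to search

# Cosine similarity as the distance metric; the highest scores are the most similar images.
scores = gallery @ query.T
top5 = scores.squeeze(1).topk(5).indices
print(top5)
```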
Business use cases
Reverse Image Search is the simplest way to quickly group image datasets into conceptually “correct” categories. It can also be viewed as a way to cluster images.
3. Image Captioning
What it is
Image Captioning enables computers to generate image descriptions.
When to use it
Image Captioning can be used when we are interested in representing the image content in words.
How it works
Image Captioning can be framed within the encoder-decoder paradigm. First, image embeddings are extracted using a pre-trained CNN (the encoding step). The embeddings are then fed into a Long Short-Term Memory (LSTM) network, a type of neural network that can process sequences of data and is therefore widely used for text, which learns to decode the embeddings into text.
- An image is fed into a CNN to extract feature maps, which are abstract representations of the image.
- The LSTM uses these feature maps to produce a distribution over words given the input. The LSTM then samples the next word from that distribution, and the process repeats until the caption is complete.
- It is important to stress at this point that these feature maps tell us which points in the image matter most (i.e., attention).
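The sketch below shows the skeleton of such an encoder-decoder model in PyTorch, trained with teacher forcing. It uses a single global image embedding rather than the attention mechanism mentioned above, and all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """Encoder-decoder sketch: a pre-trained CNN encodes the image,
    an LSTM decodes the embedding into a sequence of word tokens."""
    def __init__(self, vocab_size: int, embed_dim: int = 256):
        super().__init__()
        cnn = models.resnet18(pretrained=True)
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)  # project image features to the embedding size
        self.encoder = cnn
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, 512, batch_first=True)
        self.vocab_proj = nn.Linear(512, vocab_size)       # distribution over the vocabulary

    def forward(self, images, captions):
        img = self.encoder(images).unsqueeze(1)    # (N, 1, embed_dim): the image acts as the first "token"
        words = self.word_embed(captions)          # (N, T, embed_dim): ground-truth caption tokens
        hidden, _ = self.decoder(torch.cat([img, words], dim=1))
        return self.vocab_proj(hidden)             # per-step word distributions (logits)

model = CaptionModel(vocab_size=5000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 5000])
```

At inference time, the word sampled at each step is fed back in as the next input, exactly as described in the second bullet above.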
Business use cases
Image Captioning can be used in assistance systems for the blind, image metadata generation systems, and robotics.
4. Object Detection
What it is
Object Detection is the technology that identifies not only what object is depicted in an image or video but also where it is located.
When to use it
Object Detection is used when the position of a particular object or subject is required; it is the foundation of tracking technology.
How it works
CNNs are the primary technology here: they extract regions of interest, which are then categorized, and the bounding boxes are derived.
- A Feature Pyramid Network (FPN) uses the inherent multi-scale, pyramidal hierarchy of deep CNNs to create feature pyramids, which help detect objects at different scales.
- Attached to the FPN are two subnetworks: the top one predicts classes, and the bottom one performs bounding-box regression.
It is important to note that this approach is only one of many that exist for object detection.
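In practice you rarely build a detector from scratch. The sketch below runs one of torchvision's pre-trained detection models (Faster R-CNN over a ResNet-50 FPN backbone; torchvision's RetinaNet follows the FPN-plus-two-subnets design described above). Newer torchvision releases take a weights= argument instead of pretrained=:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# A pre-trained detector whose backbone is a Feature Pyramid Network over ResNet-50.
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

with torch.no_grad():
    # In eval mode the model accepts a list of 3-channel images of arbitrary size.
    predictions = model([torch.rand(3, 480, 640)])

# Each prediction holds bounding boxes, class labels, and confidence scores.
boxes = predictions[0]["boxes"]    # (num_detections, 4) in (x1, y1, x2, y2) format
labels = predictions[0]["labels"]  # class indices into the COCO label set
scores = predictions[0]["scores"]  # confidence in each detection
print(boxes[scores > 0.8])         # keep only high-confidence detections
```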
Business use cases
Facial detection is one of the most common use cases of Object Detection technology. It can be used as a security measure to let only certain people into an office building, or to recognize and tag your friends on Facebook. Last year Instagram added a new feature based on this technology, designed to make the platform easier to use for visually impaired people. The feature uses object recognition to generate a description of photos; while scrolling through the app, anyone using a screen reader can hear a list of the items a photo contains.
5. Image Segmentation / Semantic Segmentation
What it is
Image Segmentation is a technology that segments an image into conceptual parts; in contrast to object detection, every pixel in the image is assigned a category.
When to use it
Image Segmentation can be used to locate objects and their boundaries.
How it works
Usually, the algorithms employed in such tasks are based on convolution-deconvolution methods. For example, one family of algorithms uses CNNs to create feature maps while introducing subsampling layers to keep the whole process computationally feasible. The computational burden lies in the fact that a classification decision is made for every pixel, so reducing the number of neurons improves computational efficiency. The next step is then to apply transposed convolutions, during which the network learns to reconstruct the previously reduced feature maps back to the input resolution.
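A minimal convolution-deconvolution sketch of this idea in PyTorch (layer widths and the class count are illustrative assumptions; real models such as U-Net or SegNet add skip connections and far more capacity):

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Convolution-deconvolution sketch: downsample to stay computationally
    feasible, then use transposed convolutions to recover a per-pixel label map."""
    def __init__(self, num_classes: int = 21):
        super().__init__()
        self.encoder = nn.Sequential(                     # conv + subsampling
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(                     # transposed convolutions upsample back
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 2, stride=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))  # (N, num_classes, H, W): one class score per pixel

logits = TinySegNet()(torch.randn(1, 3, 128, 128))
print(logits.shape)  # torch.Size([1, 21, 128, 128])
```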
Business use cases
This technology is mainly used in medical imaging, GeoSensing, and precision agriculture.
6. Image Denoising
What it is
Image Denoising is a technology that uses self-supervised learning to generate images free of noise or blurring. It is based on autoencoder algorithms, which learn to encode images into a lower-dimensional feature space and decode them back, generating the data distribution of interest.
When to use it
Image Denoising can be used with some success to remove noise or blurring from images.
How it works
The algorithm first encodes the input data into a smaller number of dimensions, the latent representation (compression), and then reconstructs the input from that representation (decoding). In more formal language, the autoencoder learns to approximate the identity function while passing through fewer dimensions, which is why the technique is also suitable for dimensionality reduction. In the context of image denoising, we can set up a convolutional autoencoder to learn to generate high-quality images by training it on low-quality inputs paired with ground-truth high-quality images. In this way, the decoder learns to reconstruct a higher-quality version of the input.
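A minimal sketch of a denoising convolutional autoencoder in PyTorch, with synthetic Gaussian noise standing in for real low-quality inputs (all sizes and the noise model are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Compress the noisy input into a smaller latent representation,
    then decode it back into a clean image."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),  # compression
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),   # decoding
            nn.ConvTranspose2d(16, 3, 2, stride=2), nn.Sigmoid(), # pixel values in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.rand(8, 3, 64, 64)                             # ground-truth high-quality images
noisy = (clean + 0.1 * torch.randn_like(clean)).clamp(0, 1)  # simulated low-quality inputs

# The network is trained to reconstruct the clean image from the noisy one.
loss = nn.MSELoss()(model(noisy), clean)
loss.backward()
optimizer.step()
```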
Business use cases
Applications like Let’sEnhance.io use this technology to improve the quality and resolution of images.
7. Image Generation
What it is
Generative Adversarial Networks (GANs) are a type of unsupervised learning that learns to generate realistic images.
When to use it
This technique can be used in applications that generate photorealistic images, for example in interior or industrial design, or in computer game scenes.
How it works
When generating an image, we want to be able to sample from a complex, high-dimensional space, which is impossible to do directly. Instead, we can approximate this space using a CNN. GANs do this in the manner of a game.
- First, given random noise, we use a simple generative network to produce fake images, which, together with real training samples, are sent to the discriminative network.
- The purpose of the discriminative network is then to discern which of the images are fake and which are real.
- If the generator can fool the discriminative network, we have succeeded in finding a proper distribution from which to generate realistic images.
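One round of this game, sketched in PyTorch with tiny fully connected networks for brevity (in practice both players are convolutional, DCGAN-style; the image size, widths, and learning rates are illustrative assumptions):

```python
import torch
import torch.nn as nn

latent_dim = 64

# Generator: maps random noise to a fake image; discriminator: real-vs-fake score.
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(32, 28 * 28)  # stand-in for a batch of real training images

# Discriminator step: learn to tell real images from generated ones.
fake = G(torch.randn(32, latent_dim))
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_D.zero_grad()
d_loss.backward()
opt_D.step()

# Generator step: learn to fool the discriminator into labeling fakes as real.
# (In a full training loop, both steps repeat and gradients are re-zeroed each pass.)
g_loss = bce(D(fake), torch.ones(32, 1))
opt_G.zero_grad()
g_loss.backward()
opt_G.step()
```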
Business use cases
With proper training, GANs can produce more precise and sharper 2D textures at higher magnification: the quality is higher, while the level of detail and color remains unchanged. NVIDIA uses this technology to transform sketches into photorealistic landscapes.