A high-speed, high-accuracy barcode scanner developed with reference to the AI model behind the Google Meet virtual background.
This article is also available here (Japanese).
In my previous blog post, I introduced how to run the AI models for Google Meet virtual backgrounds with TensorFlow Lite (TFLite) in a browser. In this article, I would like to introduce one application of this technology: a lightweight, high-speed barcode scanner.
This is how it works. Multiple barcodes facing different directions can be read at high speed.
Lightweight Semantic Segmentation Model
The AI model used for the virtual background is called a Semantic Segmentation model. This kind of model classifies what each pixel of an image shows. For example, as shown in the figure below, if you input the image of a cute cat on the left, the model outputs an image separating the cat from the background, as shown in the middle. Applied to a person, it identifies the area of the person and the area of the background, and the background area can then be replaced with another image to realize a virtual background.
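As a rough illustration of that last step, the sketch below replaces the background of a frame using a per-pixel mask. It is a minimal example under stated assumptions: the function name `composite`, the 0.5 threshold, and the nested-list image representation are hypothetical and are not taken from Google Meet or from the demo code.

```python
def composite(frame, mask, background, threshold=0.5):
    """Keep frame pixels where the mask says "person", else use the background.

    frame, mask and background are nested lists with the same height and width.
    mask holds a per-pixel person score in [0, 1] (assumed model output).
    """
    output = []
    for frame_row, mask_row, background_row in zip(frame, mask, background):
        output.append([
            pixel if score >= threshold else replacement
            for pixel, score, replacement in zip(frame_row, mask_row, background_row)
        ])
    return output


# Tiny 2x2 example: pixel values are plain integers for readability.
frame = [[1, 2], [3, 4]]
mask = [[0.9, 0.1], [0.2, 0.8]]
background = [[9, 9], [9, 9]]
print(composite(frame, mask, background))  # [[1, 9], [9, 4]]
```

A real implementation would typically blend along the mask edge instead of using a hard threshold, which is one reason mask accuracy matters for this use case.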
While the accuracy of AI models is often emphasized, their lightness is also important when they need to be processed in real time, such as in virtual backgrounds, or when they are run on devices with limited computing resources. Many researchers are still working to improve the accuracy and reduce the weight of AI models, and are devising new AI model architectures.
The architecture of Semantic Segmentation models often consists of two components (encoder and decoder). According to the Google AI Blog, Google Meet uses a lightweight model called MobileNetV3-small for the encoder, as shown in the figure below. The decoder is symmetric to MobileNetV3-small, but no details are given.
So I studied the paper on MobileNetV3, the model used as the encoder, and found that it also covers Semantic Segmentation. It describes a decoder as well, and says that it uses a module called Lite R-ASPP to speed up the process.
Note: In the paper, the encoder is called the backbone, and the decoder is called the segmentation head.
These may help you to create a lightweight Semantic Segmentation model.
Home-made lightweight Semantic Segmentation model for virtual backgrounds
So, we designed the architecture based on the above information and trained it.
The accuracy of the segmentation is shown in the figure below. (1) is the result of the lightweight Semantic Segmentation model we created. (2) and (3) are the models for the virtual background of Google Meet. (4) is the model for Selfie Segmentation published by Google. The models in (2), (3), and (4) have almost the same architecture. The resolution of the input image to each model is shown in parentheses.
We can see that the accuracy of (1) is a little worse than the others, as parts of the body are blurred. Generally speaking, the accuracy of an AI model depends heavily on the quality and quantity of the training data. In this case, we did our best to collect and annotate images of people (and applied augmentation), but even so, I think the amount of annotated data is far smaller than what Google can prepare. I believe that this is the main reason for the lower accuracy. The Google architecture may also contain tricks that have not been published, and it is possible that the model we created has a bug.
The processing time per frame is shown below. The processing time includes not only the inference by the model but also the drawing process. The time in parentheses is the processing time when SIMD is enabled. Considering the resolution, I think the processing time is almost the same as that of the Google Meet model. (I have no idea why (1) is faster than (2) with SIMD.)
I think the above can be briefly summarized as follows.
- The accuracy is several steps inferior to the Google Meet model.
(This is a rough feeling that has not been quantitatively evaluated.)
- The processing speed is almost the same as the Google Meet model.
The main purpose of the virtual background is to protect the user’s privacy. In light of this, I think the accuracy of the Semantic Segmentation model needs to be reasonably high. If you do not expect to be able to collect a large amount of high-quality training data, it is better to use the Selfie Segmentation model described above as (4), which is licensed under Apache-2.0 (model card).
Note: The license of the Google Meet segmentation model has been changed from Apache-2.0. We do not know how long a model under the Apache-2.0 license will continue to be available, so please be careful when using it!
This demo is available at the following URL.
Application of a home-made lightweight Semantic Segmentation model to a barcode scanner
We found that using our own lightweight Semantic Segmentation model for virtual backgrounds was not a good idea, as there are alternatives with higher accuracy. On the other hand, we found it to be quite fast. Therefore, in use cases where moderate accuracy is sufficient, this model could still be useful.
One example is the application to barcode scanners.
As far as I can tell from the source code of various barcode scanners available on GitHub, a barcode scanner starts from the upper left corner of the image, detects edges sequentially, and decodes the barcode from there. The processing time therefore increases when reading barcodes from a large image. If only a single barcode is to be scanned, the scanner can return early once it has been read, but if multiple barcodes are to be scanned, the image must be scanned all the way to the end, so the increase in processing time is likely to be noticeable (figure below).
Note: ZBar, for example, starts from the top-left corner of the image, successively detects edges using a derivative filter, and decodes the barcode when an edge is detected (image scan (src1) ⇛ edge detection and decoding call (src2)). This process is done in both portrait and landscape orientation. (It was a bit difficult to understand, so I apologize if my understanding is wrong.)
The idea this time is to skip barcode detection in areas where no barcode exists, by using my own lightweight Semantic Segmentation model as preprocessing to cut out the areas where barcodes are likely to exist (see the figure below). From a slightly different point of view, this replaces part of the sequential edge detection with a step that finds barcode areas all at once using the optimized matrix operations of TensorFlow Lite (+XNNPACK). Since this project is intended to run in a browser, enabling SIMD, which allows multiple calculations to be performed simultaneously, is expected to be particularly effective.
In terms of accuracy, even if the model also cuts out areas that contain no barcode, whether a barcode really exists there can be determined by whether or not the barcode scanner can read it. The assumption is therefore that somewhat low segmentation accuracy is acceptable.
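To make the "cut out the likely areas" step concrete, here is a minimal sketch that turns a binary segmentation mask into bounding boxes using a flood fill over 4-connected pixels. The function name `find_regions`, the `min_area` parameter, and the choice of connected components are assumptions for illustration; the demo may extract regions differently.

```python
from collections import deque


def find_regions(mask, min_area=1):
    """Return bounding boxes (x0, y0, x1, y1) of connected non-zero areas."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill starting from this unvisited mask pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            area = 0
            while queue:
                cx, cy = queue.popleft()
                area += 1
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < width and 0 <= ny < height and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if area >= min_area:  # drop tiny specks of segmentation noise
                boxes.append((x0, y0, x1, y1))
    return boxes


mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(find_regions(mask))  # [(1, 0, 2, 1), (4, 2, 4, 3)]
```

Only the pixels inside these boxes would then be handed to the barcode scanner, so the scanner's sequential pass covers a small fraction of the original image.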
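The filtering idea above can be sketched as follows: every candidate crop is passed to the decoder, and crops that yield nothing are simply dropped. `scan_candidates` and the `decode` callable are hypothetical names; in the real application `decode` would wrap the WASM build of ZBar, whose interface is not shown here.

```python
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

Crop = TypeVar("Crop")


def scan_candidates(
    crops: Iterable[Crop],
    decode: Callable[[Crop], Sequence[str]],
) -> List[Tuple[int, str]]:
    """Run the decoder on each candidate crop and keep only successful reads.

    Returns (crop index, decoded text) pairs. A crop produced by a
    segmentation false positive decodes to nothing and vanishes here.
    """
    results = []
    for index, crop in enumerate(crops):
        for text in decode(crop):
            results.append((index, text))
    return results


# Stand-in decoder for the example: a "crop" is a string, and only
# all-digit strings count as readable barcodes.
def fake_decode(crop: str) -> List[str]:
    return [crop] if crop.isdigit() else []


fake_crops = ["4901234567894", "", "not-a-barcode"]
print(scan_candidates(fake_crops, fake_decode))  # [(0, '4901234567894')]
```

The cost of a false positive is therefore one wasted decode attempt on a small crop, not a wrong result.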
I have created a barcode scanner (web version) using my own lightweight Semantic Segmentation as a preprocessor. The barcode scanner is a wasm version of the OSS ZBar. To make it easier for the barcode scanner to read, I added a preprocessing step to correct the tilt of the area detected by Semantic Segmentation.
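The article does not spell out how the tilt is corrected, so the following is only one plausible sketch: estimate the dominant direction of the detected mask pixels from their covariance (a principal-axis estimate), then rotate the coordinates by the opposite angle. The names `principal_angle` and `deskew` are hypothetical, and coordinates follow the image convention of y pointing down.

```python
import math


def _centroid(points):
    n = len(points)
    return sum(x for x, _ in points) / n, sum(y for _, y in points) / n


def principal_angle(points):
    """Angle (radians) of the main axis of a set of (x, y) mask pixels."""
    if not points:
        raise ValueError("points must not be empty")
    mx, my = _centroid(points)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Orientation of the largest-variance axis of the 2x2 covariance matrix.
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)


def deskew(points):
    """Rotate points about their centroid so the main axis becomes horizontal."""
    angle = principal_angle(points)
    mx, my = _centroid(points)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [
        ((x - mx) * cos_a + (y - my) * sin_a, -(x - mx) * sin_a + (y - my) * cos_a)
        for x, y in points
    ]


tilted = [(i, i) for i in range(10)]  # pixels along a 45 degree line
print(round(math.degrees(principal_angle(tilted)), 6))  # 45.0
```

In practice the same angle would be applied to the cropped image itself before handing it to the scanner; a region whose long side is not the scan direction would need an extra 90 degree check, which this sketch leaves out.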
Here is how it behaves when run against a FullHD image. On the left is (1) the version with Semantic Segmentation as preprocessing, in the middle is (2) the version with SIMD-enabled Semantic Segmentation as preprocessing, and on the right is (3) the version without preprocessing (ZBar simply compiled to WASM). The ‘processing time’ below the animation is the processing time (msec) per frame. The device used is a Pixel 4. Note that no overlay is drawn for (3), because ZBar alone cannot detect the barcode area.
(1) and (2) are much more accurate and run faster. Since the barcode scanning itself uses the same software (ZBar), the higher accuracy of (1) and (2) comes from correcting the tilt of the barcode area.
(You can also watch it on YouTube: https://youtu.be/Lv5dr2KD0H8 )
This is what it looks like in HD. The processing speed is higher, but the detection accuracy is a little lower. This is because the resolution is lower. In general, there is a required size (resolution) for scanning barcodes (see reference), and the closer you get to this size, the harder it is to scan. This is probably the reason why the accuracy of HD is lower.
(You can also watch it on YouTube: https://youtu.be/2aYugJze0UE )
The processing time per frame is shown below. For both FullHD and HD, (1) and (2) are faster than (3), and (2) with SIMD enabled is faster than (1). Note that (1) and (2) are faster than (3) even though they include an additional process to correct the tilt of the barcode. As for (3), the processing time for FullHD and HD is almost proportional to the total number of pixels, as expected. On the other hand, for (1) and (2), the processing time does not increase that much from HD to FullHD. Whether this can be extrapolated needs to be verified, but it is likely that the advantage of (1) and (2) increases when processing images with higher resolutions.
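For reference, the pixel-count ratio behind the "almost proportional" observation for (3) can be worked out directly. The sketch assumes that FullHD means 1920x1080 and HD means 1280x720, which the article does not state explicitly, and the 100 ms figure is a made-up input, not a measurement.

```python
FULL_HD = (1920, 1080)
HD = (1280, 720)  # assumption: "HD" refers to 1280x720


def pixel_count(size):
    width, height = size
    return width * height


def scaled_time(time_ms, from_size, to_size):
    """Predict processing time if cost is strictly proportional to pixel count."""
    return time_ms * pixel_count(to_size) / pixel_count(from_size)


ratio = pixel_count(FULL_HD) / pixel_count(HD)
print(ratio)  # 2.25

# A hypothetical 100 ms HD frame would take 225 ms at FullHD under this model.
print(scaled_time(100.0, HD, FULL_HD))  # 225.0
```

A scanner without preprocessing should therefore slow down by roughly 2.25x from HD to FullHD, while the preprocessed versions run the model on a fixed-size input and only decode the cropped areas, which is consistent with their smaller increase.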
From the above, it can be said that preprocessing with lightweight Semantic Segmentation has enabled us to create a barcode scanner that is faster and more accurate than the conventional barcode scanner (WASM version). In other words, depending on the use case, a model using the lightweight Semantic Segmentation architecture may work well even if the accuracy is low to some extent.
This demo is available at the following URL.
This time, by using lightweight Semantic Segmentation as preprocessing to cut out the areas where barcodes are likely to be found, we were able to skip part of the barcode scanner's subsequent detection process, thereby speeding up the entire pipeline. We believe that this is the effect of replacing the sequential edge detection process with barcode detection based on optimized matrix operations. If this is correct, the barcode scanner itself could be modified to detect barcodes by matrix calculation, without using AI (DNN) in preprocessing. In my opinion, if that is possible, it is the proper approach. However, rewriting the internals of software built on an established technology, such as a barcode scanner, is quite costly and carries a high risk of regressions. So I think it is more practical and less costly to add a lightweight preprocessing step using TensorFlow Lite (+XNNPACK), which provides optimized matrix operations as a framework.
There is also the question of whether building ZBar with SIMD would speed up the process. I actually tried building it with the SIMD option, but it had almost no effect, so I decided to skip it this time. There is some talk about not writing SIMD code on your own (Japanese) as much as possible, but I think that is because the code has to be written with SIMD in mind to some extent in order to really benefit from it.
I noticed that Google has generously provided instructions on how to make a lightweight Semantic Segmentation model, so I tried to make my own. As a result, I found that I could create a fairly fast model, although the accuracy is not great. I thought that this model would be more suitable for use cases where a certain level of accuracy is sufficient, such as pre-processing for barcode scanners, rather than use cases that require a highly accurate model, such as virtual backgrounds, so I implemented it experimentally. As a result, we were able to realize a barcode scanner that is faster and more accurate than conventional barcode scanners. The application to barcode scanners is just one example of where our own lightweight Semantic Segmentation can be applied. I believe that there are other places where fast Semantic Segmentation with some accuracy can be used. I will continue to experiment with it whenever I come up with something.
A demo of the above barcode scanner is stored in the following repository.
I used images of people, videos of people, and background images from this site.