Google Meet Virtual Background with Amazon Chime SDK

dannadori
9 min read · Jan 19, 2021

Note:
This article is also available in Japanese here:
https://cloud.flect.co.jp/entry/2021/01/19/130105

Note: Additional verification on an M1 Mac has been added. I have also updated the BodyPix results to reflect changes in processing time. [28/Mar/2021]

> See also: Build TFLite Wasm/SIMD and run Google Meet Virtual Background

Introduction

In my previous post, I showed how to achieve virtual backgrounds using the Video Processing API of the Amazon Chime SDK. In that post, I used BodyPix to segment people from the background. This time, I would like to implement and evaluate a virtual background feature using the Google Meet model, which is more lightweight.

Here is what we have created.

Google Meet Virtual Background

The Google AI team published an article about Google Meet's virtual background on the Google AI Blog in October 2020. Despite being a browser-based app (not a native app), the virtual background feature is fast and accurate, which made it a hot topic. Various ideas have been incorporated to achieve this speed and accuracy. The outline is as follows.

From Google AI Blog

This feature is built on a framework called MediaPipe. The segmentation model is run with TFLite, using XNNPACK, which handles the floating-point computation in neural networks efficiently. In addition, the resolution of the image fed into the model is reduced, which lowers the amount of computation and speeds up inference. The model has an encoder-decoder architecture with MobileNetV3-Small as the encoder, and the encoder part is tuned with neural architecture search (NAS) to run with fewer computational resources. Reducing the input resolution is expected to degrade segmentation accuracy, so post-processing such as joint bilateral filtering and light wrapping is applied to compensate. This post-processing can be skipped on devices that do not have much computing power.

For more details, please refer to the Google AI Blog.

Model Conversion from TFLite to TensorFlow.js

As mentioned above, Google Meet's virtual background was created as a MediaPipe solution. Although MediaPipe itself is available as OSS, there are no plans to release the JavaScript source code (Issue). Some features are becoming available as APIs (Issue, Issue), but the virtual background feature is not among them yet.

On the other hand, the segmentation model (in tflite format) used inside the virtual background feature is released under the Apache-2.0 license (UPDATE [2/Feb/2021]: this license seems to have changed). This model is only one part of Google Meet's virtual background, so using it alone will not reproduce the same performance as Google Meet. However, as mentioned above, the model is tuned to run with fewer computational resources, and you may still benefit from that.

A model in tflite format cannot be used in TensorFlow.js as-is, but it can be converted by extracting the weights and replacing the operations with standard TensorFlow ones. This procedure is described by PINTO; please refer to PINTO's blog (Japanese) for a detailed explanation. In this article, I use a model converted by PINTO, although it is not an official TensorFlow.js model.
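As a quick illustration, loading and running the converted model from TensorFlow.js might look like the sketch below. This is an assumption-laden sketch, not the demo's actual code: the model URL, the 128x128 input size, the [0, 1] normalization, and the two-channel output layout all need to be checked against the particular converted model you use.

```typescript
import * as tf from '@tensorflow/tfjs';

// Hypothetical URL; point this at wherever you host the converted model.json and weights.
const MODEL_URL = './meet-segmentation-128x128/model.json';

export async function loadSegmentationModel(): Promise<tf.GraphModel> {
  // The converted model is a TFJS GraphModel, so it is loaded with loadGraphModel.
  return tf.loadGraphModel(MODEL_URL);
}

export async function predictMask(
  model: tf.GraphModel,
  frame: HTMLVideoElement | HTMLCanvasElement
): Promise<Float32Array> {
  const output = tf.tidy(() => {
    // Resize the camera frame to the model's input resolution and normalize
    // pixel values to [0, 1] (assumed preprocessing; verify against your model).
    const input = tf.browser.fromPixels(frame)
      .toFloat()
      .resizeBilinear([128, 128])
      .div(255.0)
      .expandDims(0);
    // Assumed output: a low-resolution score map with two channels (background, person).
    return model.predict(input) as tf.Tensor;
  });
  const scores = (await output.data()) as Float32Array; // GPU -> CPU download
  output.dispose();
  return scores;
}
```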

Virtual Background with Google Meet Segmentation Model TFJS

The Google Meet segmentation model is provided in three input resolutions (128x128, 96x160, 144x256). Here are the results using each model. The upper row shows the results without post-processing, and the lower row shows the results with post-processing. Note that the post-processing is a rough implementation of my own.

With the 128x128 model, the top of the woman's head on the left side is a little distorted, but the 96x160 and 144x256 models improve on this progressively. After post-processing, I think the results are almost indistinguishable from one another.
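For reference, a minimal compositing sketch along the lines of that rough post-processing could look like the following. It is not the exact code behind these screenshots: the two-channel score layout is an assumption, and a simple canvas blur stands in for the joint bilateral filtering and light wrapping that Google Meet itself uses.

```typescript
// Rough compositing sketch: softmax the (background, person) scores into an alpha
// mask, soften it with a cheap blur, cut the person out of the camera frame, and
// draw it over the background image.
export function drawWithMask(
  scores: Float32Array,                         // raw model output, w * h * 2 values
  w: number,                                    // model output width,  e.g. 128
  h: number,                                    // model output height, e.g. 128
  frame: HTMLVideoElement | HTMLCanvasElement,  // current camera frame
  background: HTMLImageElement,                 // virtual background image
  output: HTMLCanvasElement                     // canvas to render the final frame to
): void {
  // 1. Two-channel scores -> alpha mask.
  const maskCanvas = document.createElement('canvas');
  maskCanvas.width = w;
  maskCanvas.height = h;
  const maskCtx = maskCanvas.getContext('2d')!;
  const mask = maskCtx.createImageData(w, h);
  for (let i = 0; i < w * h; i++) {
    const bg = Math.exp(scores[i * 2]);
    const person = Math.exp(scores[i * 2 + 1]);
    mask.data[i * 4 + 3] = Math.round((person / (person + bg)) * 255); // alpha only
  }
  maskCtx.putImageData(mask, 0, 0);

  // 2. Cut the person out of the camera frame, using the blurred mask as alpha.
  const personCanvas = document.createElement('canvas');
  personCanvas.width = output.width;
  personCanvas.height = output.height;
  const personCtx = personCanvas.getContext('2d')!;
  personCtx.drawImage(frame, 0, 0, output.width, output.height);
  personCtx.globalCompositeOperation = 'destination-in';
  personCtx.filter = 'blur(4px)'; // cheap stand-in for joint bilateral filtering
  personCtx.drawImage(maskCanvas, 0, 0, output.width, output.height);

  // 3. Background first, then the cut-out person on top.
  const outCtx = output.getContext('2d')!;
  outCtx.drawImage(background, 0, 0, output.width, output.height);
  outCtx.drawImage(personCanvas, 0, 0);
}
```

In a real implementation the temporary canvases would be created once and reused across frames rather than allocated per frame.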

Next, let's compare with BodyPix. BodyPix has a variety of parameters, all of which I left at their default values. The resolution of the input image is variable and has no default value, so I set it to 480x640 (the size used in the BodyPix repository sample). For the backbone, I tried both MobileNetV1 and ResNet50. Image (1) is from the Google Meet model at 128x128 with no post-processing. Images (2) and (3) are from BodyPix. Image (4) uses the Google Meet model at 144x256 with post-processing.
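For reference, the BodyPix setup assumed in this comparison looks roughly like the sketch below. The parameter values shown are the library's documented defaults, and segmentWithBodyPix is just an illustrative name; the 480x640 resolution comes from the video element that is passed in, not from a BodyPix parameter.

```typescript
import * as bodyPix from '@tensorflow-models/body-pix';

// Load BodyPix with default parameters, switching only the backbone, and run
// person segmentation on a camera frame.
export async function segmentWithBodyPix(video: HTMLVideoElement, useResNet: boolean) {
  const net = await bodyPix.load(
    useResNet
      ? { architecture: 'ResNet50', outputStride: 32, quantBytes: 2 }
      : { architecture: 'MobileNetV1', outputStride: 16, multiplier: 0.75, quantBytes: 2 }
  );
  // Default segmentation parameters; the returned mask has the resolution of the input frame.
  return net.segmentPerson(video, {
    flipHorizontal: false,
    internalResolution: 'medium',
    segmentationThreshold: 0.7,
  });
}
```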

(2) BodyPix with MobileNetV1 is a little rough, and the background is visible. (3) ResNet50 removes the background quite nicely, as expected. (1) The Google Meet model (128x128, no post-processing) is a little jagged, but I think it is as good as ResNet50. It may not be fair to include post-processing in the comparison, but I think (4) the Google Meet model (144x256 with post-processing) looks the best of these.

I did not compare IoU this time, so no quantitative judgment can be made, but just by looking at the images, I think you can see that the accuracy is quite high.

Evaluation of Processing Time

Let's take a look at the processing time. The figure below shows an overview of the virtual background pipeline. First, the image from the camera is converted to a tensor and uploaded to the GPU. Then pre-processing such as normalization is performed. After pre-processing, inference is executed and the result is downloaded to the CPU. Post-processing is applied to the result, and finally it is rendered.

For the evaluation, I measured the following three ranges of processing time. (a) is the model inference time only. (b) adds the pre-processing time (e.g., normalization) and the time to download the inference result from the GPU. (c) is the total time including post-processing. I compare against BodyPix, but BodyPix hides part (a), so it cannot be measured there; for BodyPix only (b) and (c) are compared.
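As an illustration, the three ranges could be bracketed roughly like this. It is a sketch only; measureOneFrame and postprocessAndRender are placeholder names, not functions from the demo, and the preprocessing matches the assumptions made earlier.

```typescript
import * as tf from '@tensorflow/tfjs';

// Placeholder for the rough post-processing and canvas compositing shown earlier.
declare function postprocessAndRender(scores: Float32Array): void;

export async function measureOneFrame(
  model: tf.GraphModel,
  video: HTMLVideoElement
): Promise<{ a: number; b: number; c: number }> {
  const t0 = performance.now();

  // (b) starts here: tensorize, upload, and normalize the camera frame...
  const input = tf.tidy(() =>
    tf.browser.fromPixels(video).toFloat().resizeBilinear([128, 128]).div(255.0).expandDims(0)
  );

  // (a): inference only. Note this only measures until predict() returns control;
  // the GPU may still be working asynchronously at this point.
  const tPredict = performance.now();
  const output = model.predict(input) as tf.Tensor;
  const a = performance.now() - tPredict;

  // ...and (b) ends once the result has been downloaded from the GPU to the CPU.
  const scores = (await output.data()) as Float32Array;
  const b = performance.now() - t0;

  // (c): everything, including post-processing and rendering.
  postprocessAndRender(scores);
  const c = performance.now() - t0;

  tf.dispose([input, output]);
  return { a, b, c };
}
```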

I measured the processing time on a MacBook Pro 2019 (2.4 GHz quad-core Intel Core i5), while the Google AI Blog used a MacBook Pro 2018 (2.2 GHz 6-core Intel Core i7). The environments differ a little, but the numbers may still be useful for reference. The results were as follows. (I have updated the results after upgrading the versions of TFJS and BodyPix, which improved BodyPix's performance. I have also improved the GoogleMeet post-processing.)

For (a), all Google Meet models took 8~9 ms, which agrees closely with the Google AI Blog, which reports inference in about 8.3 ms. (To be honest, this measurement only captures the time until the predict function returns control, and I am not entirely sure whether the GPU-side processing is completely finished within this time or whether it continues asynchronously. <- With additional verification, it is starting to look like it does indeed run asynchronously.)
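For what it's worth, one way to probe this is to force a GPU readback right after predict() and compare the two timings. The sketch below assumes the WebGL backend, where data() resolves only once the ops producing the tensor have finished on the GPU, so the gap between the two timestamps corresponds to the asynchronously executed part.

```typescript
import * as tf from '@tensorflow/tfjs';

// Compare "predict() returned" against "GPU work actually finished".
export async function probeAsyncExecution(model: tf.GraphModel, input: tf.Tensor): Promise<void> {
  const t0 = performance.now();
  const output = model.predict(input) as tf.Tensor;
  const returned = performance.now() - t0;   // when predict() hands back control

  await output.data();                       // waits for the GPU to finish this tensor
  const finished = performance.now() - t0;

  output.dispose();
  console.log(
    `predict() returned after ${returned.toFixed(1)} ms, ` +
    `GPU work finished after ${finished.toFixed(1)} ms`
  );
}
```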

For (b), the Google Meet models took between 22.0 and 28.5 ms. Looking at the details, the time to download the GPU results to the CPU seems to grow in proportion to the amount of data. BodyPix with MobileNetV1 took 55.2 ms, and BodyPix with ResNet50 took 114.4 ms. However, this may be because the 640x480 input is higher resolution and the result has a similar resolution, so downloading it takes longer. As an additional case, I tried feeding a 300x300 image into BodyPix MobileNetV1 (case (6)). This reduced the processing time to 34.5 ms, but the result is degraded by the reduced resolution, as the following figure shows: a lot of the background is visible.

(c) is around 40 ms for all models. Post-processing takes about as long as the inference, but since this is a rough implementation, there is still room for improvement. Even so, it can process at roughly 25 FPS. BodyPix performs almost as well in case (6), but as mentioned above, its accuracy is a bit rough.

As mentioned earlier, Google Meet's virtual background is designed so that post-processing can be skipped on low-resource devices. The processing time with post-processing skipped is shown in the line (3)-p.p. It is 23~27 ms, which is about 60% faster than (6). As mentioned above, the segmentation remains quite accurate even in this case.

From the above, using the TFJS version of Google Meet's segmentation model (even without the MediaPipe framework), we can say the following:

  • On low-spec devices: by skipping post-processing, virtual backgrounds can be achieved with higher quality and roughly 60% faster than BodyPix.
  • On high-spec devices: by adding post-processing, a much higher-quality virtual background can be achieved at a speed not much different from BodyPix.
  • Here, BodyPix is assumed to use MobileNetV1 with a low-resolution input of about 300x300.

Additional Evaluation

After I published this post, I received a suggestion that an M1 Mac, which uses UMA (Unified Memory Architecture), would be even faster because it would eliminate the overhead of transferring memory from the GPU to the CPU. I see! So I tried it out on my own MacBook Air (M1 model). The results are as follows. The overall speedup is as expected. On the other hand, with UMA I was hoping that the difference between (a) and (b) would almost disappear for the Google Meet model, but unfortunately it remains. This result suggests that the longer time of (b) compared to (a) may not be due to the overhead of copying memory from the GPU to the CPU, but rather to waiting for the GPU to finish its processing.

Evaluation with M1 MacBook Air

Demo

You can try the demo I used for the evaluation.

Virtual Background with Amazon Chime SDK

The implementation of virtual backgrounds using the Amazon Chime SDK's Video Processing API was described in a previous article; it can be done simply by replacing BodyPix with the Google Meet model, so I will not describe the implementation here. If you want the details, please refer to the repository linked below. The result looks like the following.
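To give a rough idea of where the swap happens, here is a minimal sketch of a custom VideoFrameProcessor that uses the converted model. predictMask and drawWithMask are the placeholder helpers from the sketches above, the module paths are hypothetical, and the demo repository's actual code differs.

```typescript
import {
  CanvasVideoFrameBuffer,
  VideoFrameBuffer,
  VideoFrameProcessor,
} from 'amazon-chime-sdk-js';
import * as tf from '@tensorflow/tfjs';
import { predictMask } from './meetSegmentation'; // hypothetical module paths
import { drawWithMask } from './compositing';

// A VideoFrameProcessor that segments each frame with the Google Meet model and
// composites the person over a background image.
export class GoogleMeetSegmentationProcessor implements VideoFrameProcessor {
  private canvas = document.createElement('canvas');
  private outputBuffer = new CanvasVideoFrameBuffer(this.canvas);

  constructor(
    private model: tf.GraphModel,
    private background: HTMLImageElement
  ) {}

  async process(buffers: VideoFrameBuffer[]): Promise<VideoFrameBuffer[]> {
    const source = buffers[0].asCanvasElement?.();
    if (!source) {
      return buffers;
    }
    this.canvas.width = source.width;
    this.canvas.height = source.height;

    // Segment the current frame, then composite the person over the background.
    const scores = await predictMask(this.model, source);
    drawWithMask(scores, 128, 128, source, this.background, this.canvas);

    buffers[0] = this.outputBuffer;
    return buffers;
  }

  async destroy(): Promise<void> {
    // Release the model and canvases here if needed.
  }
}
```

Such a processor would then be wrapped in a DefaultVideoTransformDevice and selected as the video input device, just as in the previous BodyPix-based article.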

Repository

The source code for the demo is in this repository.

The Amazon Chime SDK demo is here.

Finally

This time, I used the TFJS version of Google Meet's segmentation model to implement a virtual background feature with the Amazon Chime SDK. Comparing and evaluating it against BodyPix, I found that it can realize faster and higher-quality virtual backgrounds than BodyPix (although it is still far from the original Google Meet virtual background feature, which uses MediaPipe).

Note that, as mentioned above, this is not the official model but a volunteer's TFJS conversion of the official TFLite model. Please use it at your own risk; in particular, if you want to incorporate it into a product, you should pay special attention to checking its quality. Also, I have no experience with inference on WASM, so further improvements will be needed there.

Coffee has run out. Please help me.

Acknowledgements

I used images of people, videos of people, and background images from these sites.

https://pixabay.com/ja/

https://www.irasutoya.com/
