Build TFLite Wasm/SIMD and run Google Meet Virtual Background

Note:
A Japanese version of this article is available here:
https://cloud.flect.co.jp/entry/2021/03/29/131955

Introduction

I previously posted an article on running Google Meet's virtual background model (a segmentation model) with Tensorflowjs. In this article, I will try to run it with TFLite built as Wasm, in order to further improve performance.

As a result, it performed quite well.

Review of the previous work and performance improvement

In my previous article, I found that we could expect roughly a 1.6x speedup compared to BodyPix, a commonly used virtual background model. However, it seemed difficult to reach the impressive speed of the original Google Meet. According to the Google AI blog, the original Google Meet runs its model with TFLite compiled to Wasm. I think this approach is faster because there is no overhead of copying data from GPU to CPU. So, we will try to run TFLite as Wasm.

Note: The Google Meet virtual background model is no longer available under Apache 2.0 due to a license change (see here for the history). In general, license changes do not apply retroactively to earlier works (see reference (Japanese)), but if you wish to reproduce the model using this blog as a reference, please obtain and use the model at your own risk.

Wasm and Emscripten

Wasm (WebAssembly) is a binary instruction format that can be executed in a web browser (reference). Programs written in C and C++ are commonly compiled to Wasm using a toolchain called Emscripten, and since TFLite is written in C++, we will use Emscripten to compile it to Wasm.

Build TFLite as a Wasm

TFLite is built using a build tool called Bazel. I didn't know this, but it seems that Emscripten only started supporting Bazel quite recently, last September (see reference). Since Bazel can now be used with Emscripten, I would like to build with it. OpenCV, which is used for pre-processing and post-processing, will also be built as Wasm.

The procedure is as follows.

  1. Install Emscripten
  2. Build OpenCV as a Wasm
  3. Clone Tensorflow and MediaPipe
  4. Write code
  5. Build TFLite as a Wasm

The following is a description of the work. The source code used in this project is available in the repository below. Please refer to the repository for details. The following steps are based on the Dockerfile in the repository, so it is recommended to build the container from the Dockerfile if you want to reproduce the actual process.

Install Emscripten

The official documentation recommends installing Emscripten via emsdk, and we will follow that recommendation, using the latest version at the time of writing, v2.0.14.
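The original commands did not survive extraction; as a sketch, the standard emsdk installation looks like this:

```shell
# Get emsdk and install the Emscripten version used in this article.
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install 2.0.14
./emsdk activate 2.0.14

# Put emcc and related tools on PATH for the current shell.
source ./emsdk_env.sh

# Sanity check.
emcc --version
```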

Build OpenCV as a Wasm

Clone the OpenCV source code from GitHub and build it using emsdk. We use the latest version, v4.5.1. With the steps above, Emscripten is installed in emsdk/upstream/emscripten; create a config file that points to this path. The build command creates a build_wasm folder, so change into it and run make.
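The exact config file is in the repository's Dockerfile; as a sketch, one standard way to build OpenCV for Wasm is via its helper script (the `--emscripten_dir` path assumes the emsdk layout described above):

```shell
# Clone OpenCV at the version used in this article.
git clone --depth 1 --branch 4.5.1 https://github.com/opencv/opencv.git
cd opencv

# build_js.py drives cmake and make, and creates the build_wasm folder.
# --build_wasm targets WebAssembly instead of asm.js.
python3 ./platforms/js/build_js.py build_wasm --build_wasm \
    --emscripten_dir "$EMSDK/upstream/emscripten"
```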

Clone Tensorflow and MediaPipe

The source code for TFLite is included in the Tensorflow repository, so let's get the Tensorflow source. Clone Tensorflow at the latest version, v2.4.1, and also clone the MediaPipe source code. You will need to configure the toolchain in Tensorflow's BUILD file. Note that some of the modules Tensorflow depends on are not designed for the version of Emscripten we will use, so some of the dependent modules must be replaced with newer ones. Please refer to the Dockerfile in the repository for these changes.
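The clone step can be sketched as follows (MediaPipe is cloned at its then-current default branch, since the article does not pin a version):

```shell
# Tensorflow at the tag used in this article (TFLite lives in this repo).
git clone --depth 1 --branch v2.4.1 https://github.com/tensorflow/tensorflow.git

# MediaPipe, for the Google Meet segmentation pre/post-processing references.
git clone --depth 1 https://github.com/google/mediapipe.git
```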

Write code

A template for building with Bazel in Emscripten is provided in <emsdk_dir>/bazel. Copy and edit the contents of this folder, and refer to the official documentation for how to set up a Bazel build environment. You will need to create a WORKSPACE file listing dependent modules, plus a BUILD file and .bazelrc for compilation options. In order to communicate with Javascript, it is important to provide functions that return memory addresses. Functions that need to be called from Javascript should be annotated with EMSCRIPTEN_KEEPALIVE.

In the example below, (1) returns the address of the buffer where the model binary is stored and (2) returns the address of the buffer containing the input image.
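The original code block did not survive extraction; below is a minimal sketch of what such glue code looks like. The buffer names and sizes are my own assumptions for illustration, not the repository's actual code:

```cpp
// EMSCRIPTEN_KEEPALIVE prevents the compiler/linker from stripping
// functions that are only called from Javascript. The fallback macro
// lets this sketch also compile with a plain native toolchain.
#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#else
#define EMSCRIPTEN_KEEPALIVE
#endif

namespace {
// Hypothetical fixed-size buffers shared with Javascript.
constexpr int kMaxModelSize = 1024 * 1024;      // model binary (.tflite)
constexpr int kMaxImageSize = 512 * 512 * 4;    // RGBA input image
unsigned char modelBuffer[kMaxModelSize];
unsigned char inputImageBuffer[kMaxImageSize];
}  // namespace

extern "C" {
// (1) Returns the address of the buffer where the model binary is stored.
//     Javascript copies the fetched .tflite bytes to this offset.
EMSCRIPTEN_KEEPALIVE
unsigned char* getModelBufferMemoryOffset() { return modelBuffer; }

// (2) Returns the address of the buffer containing the input image.
//     Javascript writes the RGBA pixels here before invoking inference.
EMSCRIPTEN_KEEPALIVE
unsigned char* getInputImageBufferOffset() { return inputImageBuffer; }
}
```

On the Javascript side these addresses are used as offsets into the module's heap, which is why plain pointers to static buffers are enough.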

Build TFLite as a Wasm

Execute the following command in the folder where you copied <emsdk_dir>/bazel.
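The original command did not survive extraction; as a sketch, assuming a `wasm` config defined in .bazelrc (as in the emsdk Bazel template) and a cc_binary target named `tflite` (the real target name is in the repository's BUILD file):

```shell
# Build the TFLite glue code as Wasm using the Emscripten toolchain
# that the "wasm" config in .bazelrc selects.
bazel build --config=wasm //:tflite
```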

If you want to enable SIMD, build with the following option.
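As a sketch, SIMD is enabled by passing Emscripten's `-msimd128` flag (the `tflite-simd` target name is a placeholder for the one in your BUILD file):

```shell
# Same build with Wasm SIMD enabled via Emscripten's -msimd128 flag.
bazel build --config=wasm --copt=-msimd128 //:tflite-simd
```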

The output file is archived, so extract it to a location of your choice.

Call from Javascript

From index.html (or similar), load the .js file generated in the steps above.

Then call the function specified by EXPORT_NAME in the BUILD file to load the TFLite Wasm module. (This assumes -s EXPORT_NAME=createTFLiteModule is set in the BUILD file.)
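As a sketch of the loading step (the script name `tflite.js`, the exported function `getModelBufferMemoryOffset`, and the variable `modelData` are assumptions for illustration):

```javascript
// In index.html, load the glue script generated by the build:
//   <script src="tflite.js"></script>
// Then instantiate the module; the factory name comes from
// "-s EXPORT_NAME=createTFLiteModule" in the BUILD file.
createTFLiteModule().then((tflite) => {
  // Exported C functions appear on the module with an underscore prefix.
  const modelBufferOffset = tflite._getModelBufferMemoryOffset();

  // modelData: ArrayBuffer with the .tflite binary, fetched elsewhere.
  // Copy it into Wasm memory at the offset the C++ side handed back.
  tflite.HEAPU8.set(new Uint8Array(modelData), modelBufferOffset);
});
```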

Enable SIMD

At the time of writing, Wasm SIMD is an experimental feature of Chrome. In order to use it, you need to register for an origin trial. Generate a SIMD origin trial token here, and put it in a meta tag in index.html.

Evaluate the accuracy

The model for the Google Meet virtual background is basically the same as the one used in the previous article. The results will be almost identical, so we will only show the results for the 96x160 model as a representative example.

Evaluate the processing time

Now I would like to evaluate the processing time, which is the real purpose of this article. The image sizes and the computer used are the same as in the previous article. The range of processing measured this time also follows the previous article. However, for implementation reasons, some parts of the post-processing are included in (b), so the TFLite version is at a slight disadvantage for (b). Also, for implementation reasons, we did not measure (a).

This time, the results were as follows.

(1)-(4) are copies of the previous results. (5)-(7) are the processing times using TFLite Wasm. Data in parentheses are with SIMD enabled.

In the range of (b), the processing time is much shorter than that of (1)-(3), which use Tensorflowjs. In (5) and (6), which are relatively lightweight models, the processing time is less than 10ms. The processing time of (7) is much longer by comparison, but it is still only 65% of that of (3). If SIMD is enabled, the processing time is further reduced: in (5) and (6) it drops to 5-6ms, which is astounding. (7) can also be processed in less than 10ms, which means it can easily handle more than 100 frames per second.

In the range of (c), the processing time for (5)-(7) is about +6ms over the time in (b). When SIMD is enabled, the increase in processing time for (5) and (6) is larger than that for (7). The reason for this is not known at present. The post-processing part in (c) uses methods of the HTMLCanvas element, so a wait may be applied internally. For (c)-p.p., which only performs masking and displays the result on an HTMLCanvas, the difference in processing time between (7) and (5)/(6) also shrinks slightly, so a certain amount of waiting may occur for each HTMLCanvas operation (this is purely a guess).

We found that we were able to achieve a significant speedup compared to the Tensorflowjs version. A graphical comparison of part (b) looks like this.

Evaluate the processing time 2

We also ran the same evaluation on a MacBook Air with the M1 chip. The overall speed is faster, but the basic trend is the same.

What happens in larger models?

In the experiment using the Google Meet virtual background model, the processing time was shorter than with Tensorflowjs, as described above. However, as can be seen in the results for the 144x256 model, the reduction in processing time becomes smaller as the scale of the model increases. Intuitively, the larger the model, the more advantageous a GPU that is good at parallel processing becomes. So, I would like to check what happens when a larger model is processed with TFLite Wasm.

In this case, I experimented with the White-box-Cartoonization model used in this article.

The processing range to be measured is from passing the image to Tensorflowjs/TFLite to displaying the processing result. In short, this is the range corresponding to (c) of the Google Meet virtual background model above. The resolution of the image is 192x192. The result is as follows.

Tensorflowjs (WebGL) is still by far the fastest, well ahead of TFLite Wasm. Surprisingly, however, TFLite Wasm with SIMD is also quite fast, and on the MacBook Pro the difference is not so overwhelming. Google's blog says that SIMD can be made even faster with multithreading, so the day may come when we can use models of a certain scale without WebGL.

Demo

You can see a demo here.

  • Google Meet Virtual Background
  • White-box-Cartoonization

Finally

This time, we built TFLite as Wasm (Wasm SIMD) to run the Google Meet virtual background model. Measuring the processing time, we found that this lightweight model runs much faster than with Tensorflowjs. Experiments with a larger model also showed that SIMD-enabled inference is quite fast, although not as fast as Tensorflowjs (WebGL). It seems that further speedups can be expected in the future.

Repository

The source code used in this blog can be found in this repository. See tfl001_google-meet-segmentation and tfl002_white-box-cartoonization.

https://github.com/w-okada/image-analyze-workers

Acknowledgements

I used images of people, videos of people, and background images from this site.

https://pixabay.com/ja/
