Noise suppression with Amazon Chime SDK

Feb 16, 2021

Note:
This article is also available in Japanese here:
https://cloud.flect.co.jp/entry/2021/02/16/113455

Introduction

In my previous post, I showed how to achieve virtual backgrounds using the Video Processing API of the Amazon Chime SDK. In this post, I introduce the noise suppression feature provided in the Amazon Chime SDK, also known as Voice Focus. I will show how to use it and then take a look at its effect.

This video shows the actual application of Voice Focus (00:11-). You can see that the noise is suppressed quite nicely.

Voice Focus

Voice Focus is a feature that has been available in Amazon Chime (the application) since around August of last year, but it only became available in the SDK in November of last year (Ref).

According to the official documentation, noise suppression was already included in earlier versions of the SDK, and it has now been enhanced with deep learning/machine learning to remove noise such as wind, fans, water, lawn mowers, barking dogs, typing, and flipping paper. Personally, I remember that the previous version did not remove most noise. But I don’t want to pursue that too much, because I might be misremembering, it’s in the past, and above all, the performance of Voice Focus is pretty good.

As stated in the official documentation, there may be use cases where you want environmental sounds to come through. Also, Voice Focus is CPU-intensive, so depending on the specs of your PC, it may not be suitable. For this reason, it is not enabled by default; if you want to use it, you need to enable it explicitly. In the following, I will briefly explain how to enable it, and then look at its effect and how much load it actually adds.

Spec of Voice Focus

In Voice Focus, you can configure the noise reduction with parameters. The defaults are basically fine, but you can change them to suit your application. Typical parameters are as follows.

usagePreference
You can select how the denoising process is affected by activity other than denoising, such as user-initiated tab switching. The ‘interactivity’ option produces a smoother output. On the other hand, if you can accept the risk of glitching, you can select ‘quality’, which produces higher-quality processing. The default setting is ‘interactivity’.
variant
You can select the quality of the output. The options are ‘C100’, ‘C50’, ‘C20’, ‘C10’, and ‘auto’. ‘C100’ has the highest quality but also the highest processing cost. ‘C10’ has the lowest quality but also the lowest processing cost. With ‘auto’, the SDK selects the variant automatically. The default is ‘auto’.

In addition, you can choose whether to use SIMD, a Web Worker, and more.
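
As a minimal sketch, the two parameters described above can be collected into a plain spec object. The helper name buildVoiceFocusSpec is hypothetical and not part of the SDK, and the lowercase string values ('c10', 'auto', and so on) are an assumption about how the SDK spells them, so check the SDK's type definitions before relying on them.

```javascript
// Allowed values as described in this article (the lowercase spelling is an assumption).
const USAGE_PREFERENCES = ["interactivity", "quality"];
const VARIANTS = ["c100", "c50", "c20", "c10", "auto"];

/**
 * Build a spec object, falling back to the defaults described above.
 * buildVoiceFocusSpec is a hypothetical helper, not an SDK function.
 */
function buildVoiceFocusSpec(overrides = {}) {
    const spec = {
        usagePreference: "interactivity", // default: smoother output
        variant: "auto",                  // default: let the SDK choose
        ...overrides,
    };
    if (!USAGE_PREFERENCES.includes(spec.usagePreference)) {
        throw new Error(`Unknown usagePreference: ${spec.usagePreference}`);
    }
    if (!VARIANTS.includes(spec.variant)) {
        throw new Error(`Unknown variant: ${spec.variant}`);
    }
    return spec;
}

// Example: lowest processing cost, default usage preference.
const lowCostSpec = buildVoiceFocusSpec({ variant: "c10" });
console.log(lowCostSpec);
```

Validating the values up front is a design choice of this sketch; it catches a misspelled variant before the spec is handed to the SDK.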

Implementation

The flow of the setup process to use Voice Focus is as shown below.

In the Amazon Chime SDK, you can specify the device ID of a microphone or an HTML5 MediaStream as the source of the audio input (chooseAudioInputDevice). To use Voice Focus, specify a VoiceFocusTransformDevice instead. The VoiceFocusTransformDevice is a virtual audio input device that outputs the audio from the microphone or MediaStream with the noise removed. It is created from an instance of VoiceFocusDeviceTransformer, which in turn is instantiated using the spec described above as input.

Before using Voice Focus, make sure that it can be used. This check is done in the following two steps.

(1) Check whether Voice Focus can be used in the environment and whether its initialization can be performed.
(2) Check in more detail, including the spec, whether the desired processing can be done.

The code looks something like the following. Step (a) performs check (1). Step (b) creates an instance of the noise reduction VoiceFocusDeviceTransformer with the spec as input. Step (c) then checks whether that instance is supported.

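// Assumes: import { VoiceFocusDeviceTransformer } from "amazon-chime-sdk-js"
// Check whether Voice Focus can run in this environment at all.
if (await VoiceFocusDeviceTransformer.isSupported() === false) { // <------ (a)
    console.log("Voice Focus is not supported in this browser.")
    return
}
// Create the transformer from the spec (usagePreference, variant, ...).
this.voiceFocusDeviceTransformer = await VoiceFocusDeviceTransformer.create(spec) // <----- (b)
// Check whether the created instance can run with that spec.
if (this.voiceFocusDeviceTransformer.isSupported() === false) { // <------ (c)
    console.log("The input spec of Voice Focus is not supported in this browser.")
    return
}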

After that, create a VoiceFocusTransformDevice and pass it to chooseAudioInputDevice, and you are done.

// `device` is the microphone device ID or MediaStream to be denoised.
this.voiceFocusTransformDevice = await this.voiceFocusDeviceTransformer.createTransformDevice(device)
await this.meetingSession.audioVideo.chooseAudioInputDevice(this.voiceFocusTransformDevice)

It’s very easy.
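
Putting the steps together, the following sketch wraps the checks in one function and falls back to the raw device when Voice Focus is unavailable. The transformer class is passed in as a parameter so that the sketch does not import the SDK. The function name selectAudioInput and its return shape are hypothetical, and the fallback behavior is my assumption rather than something the SDK prescribes.

```javascript
/**
 * Hypothetical helper: decides which device to pass to chooseAudioInputDevice.
 * `Transformer` is expected to behave like VoiceFocusDeviceTransformer as used
 * in this article (static isSupported/create, instance isSupported/createTransformDevice).
 */
async function selectAudioInput(Transformer, spec, device) {
    // (a) Can Voice Focus run in this environment at all?
    if (!(await Transformer.isSupported())) {
        return { device, voiceFocus: false };
    }
    // (b) Create the transformer from the spec.
    const transformer = await Transformer.create(spec);
    // (c) Can this particular instance run?
    if (!transformer.isSupported()) {
        return { device, voiceFocus: false };
    }
    const transformDevice = await transformer.createTransformDevice(device);
    // Assumption: fall back to the raw device if no transform device comes back.
    if (!transformDevice) {
        return { device, voiceFocus: false };
    }
    return { device: transformDevice, voiceFocus: true };
}
```

In an application, this could be called as selectAudioInput(VoiceFocusDeviceTransformer, spec, deviceId), with the returned device then passed to chooseAudioInputDevice. The voiceFocus flag lets the UI show whether noise suppression is actually active.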

Effect of noise suppression

Now, let’s see the effect of noise suppression.
As shown in the figure below, we set up clients built with the Amazon Chime SDK on both the sending and receiving sides, simulating an actual application in use. We then used a movie as the video and audio input on the sending side, and recorded the received data on the receiving side.

The following video shows the noise reduction while the variant is changed over time. The other parameters are left at their default values. The video used for this test was borrowed from YouTube (see Acknowledgments for the original video).

Timeline

00:00~00:11 original
00:11~00:20 applying Voice Focus (C10)
00:20~00:35 applying Voice Focus (C20)
00:35~00:44 applying Voice Focus (C50)
00:44~00:54 applying Voice Focus (C100)

The first 11 seconds are the original audio. The sound of the rain is very intrusive. After that, the variant is changed to ‘C10’, ‘C20’, ‘C50’, and ‘C100’ about every 10 seconds. Even with the low-quality ‘C10’, the sound of rain is almost completely suppressed. I can hear a slightly mechanical quality in the voice, but I don’t think it’s a problem at all for video conferencing. To be honest, I can’t tell the difference from ‘C20’ onward, though I’m sure it’s getting better little by little. In any case, I found that it does a very high level of noise reduction.

CPU load

Since there is a trade-off between the quality of Voice Focus and CPU load, I would like to see how the CPU load changes with each quality setting. For this evaluation, I used a ThinkPad (Core i5-8350U, 1.7 GHz).

From top to bottom, these are the CPU usage rates of Chrome when ‘C100’, ‘C50’, ‘C20’, ‘C10’, and ‘None (unused)’ are selected. There seems to be a difference of about 5% between ‘C100’ and ‘C10’. The difference was 5% in this evaluation on a ThinkPad (a notebook PC), but it may be bigger on a smartphone with more limited CPU performance. Considering the quality results above as well, I think ‘C10’ may be a good choice for smartphones.

That’s all for the evaluation. Here is a summary of the results of using Voice Focus.

The noise reduction by Voice Focus was quite powerful.
Changing the quality parameter did not make a big audible difference.
On the other hand, there was a clear difference in CPU usage. (CPU usage was the only thing I evaluated quantitatively, which may be why the difference stands out.)
Basically, the lowest quality ‘C10’ seems to be good enough.

Of course, someone with a more discerning ear may notice a big difference in quality, and I think the quality parameter should be adjusted according to the use case, so it is better to check it yourself.
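
One way to act on these results is to choose the variant from a rough device class and the needs of the use case. The sketch below is illustrative only: pickVariant and both of its flags are hypothetical, how a device gets classified as low-power is left to the application, and the lowercase variant strings are an assumption about the SDK's spelling.

```javascript
/**
 * Hypothetical helper that encodes the rule of thumb from this evaluation:
 * use the cheapest variant by default, and opt in to the most expensive one
 * only when the use case needs it and the device can afford it.
 */
function pickVariant({ isLowPowerDevice = false, needsHighQuality = false } = {}) {
    // Smartphones and other constrained devices: keep the CPU cost low.
    if (isLowPowerDevice) {
        return "c10";
    }
    // Otherwise, use the highest quality only when explicitly requested.
    return needsHighQuality ? "c100" : "c10";
}

console.log(pickVariant());                           // default case
console.log(pickVariant({ needsHighQuality: true })); // quality-sensitive use case
```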


Repository

The source code for the demo used in this evaluation is available in the following repository.

w-okada/flect-chime-sdk-demo (github.com)
This is a video conference system with amazon chime sdk and its component library. This software is based on the demo…

Finally

In this article, I tried removing noise from voice using the Voice Focus feature of the Amazon Chime SDK. I found that it removes noise quite well. However, it seems risky to run the high-quality mode (‘C100’) uniformly, because it adds a certain CPU load. It would probably be better to run in the low-quality mode (‘C10’) by default, and use the high-quality mode depending on the use case.

Acknowledgments

We used this “Creative Commons Attribution License (Permission to Reuse)” video for the demo.