BGM/SE with Amazon Chime SDK

Note:
This article is also available in Japanese here.
https://cloud.flect.co.jp/entry/2021/03/01/130602

Introduction

In the previous article, we showed how to use Voice Focus in Amazon Chime SDK for JS to remove noise. In this article, I would like to take it a step further and show how to add background music (BGM) or sound effects (SE) after denoising. I guess this is the audio version of the virtual background that I have introduced several times before (ref).

What I will create is something like the video below. Instead of the sound of rain, I added some space-like BGM (00:16-). If you change the background as well, you can change the situation from a person in heavy rain to a person talking in space in a matter of seconds.

In this article, I will not explain the details of Voice Focus, the noise reduction feature of Amazon Chime SDK. Please refer to the previous article.

Audio Input and Mixing in Amazon Chime SDK

I started working with the Amazon Chime SDK just about a year ago. Since that time, I’ve found the Amazon Chime SDK to be very scalable and flexible in that it can handle HTML5 MediaStream as input(ref). Naturally, Audio Input can also handle MediaStream as input.

So, to add BGM or SE to audio (microphone input, etc.), you can use the Web Audio API to mix the background music into the original audio, extract a MediaStream from the result, and set it as Audio Input in the Amazon Chime SDK. The Amazon Chime SDK provides an API called mixIntoAudioInput that does this internally.
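The manual version of this mixing step can be sketched as follows. This is a minimal sketch, not the SDK's implementation: `mixStreams`, `NodeLike`, `ContextLike`, and the `bgmVolume` parameter are hypothetical names introduced here, and the small interfaces only describe the parts of the Web Audio API the sketch touches, so that it also compiles without DOM typings.

```typescript
// Minimal structural types (assumptions for this sketch). A browser
// AudioContext and its nodes are expected to fit these shapes.
interface NodeLike {
  connect(destination: NodeLike): unknown;
}
interface ContextLike<S> {
  createMediaStreamSource(stream: S): NodeLike;
  createMediaStreamDestination(): NodeLike & { stream: S };
  createGain(): NodeLike & { gain: { value: number } };
}

// Mix a microphone stream and a BGM stream into one output stream.
// `bgmVolume` is a hypothetical knob so the music does not drown out the voice.
function mixStreams<S>(ctx: ContextLike<S>, mic: S, bgm: S, bgmVolume = 0.3): S {
  const destination = ctx.createMediaStreamDestination();

  // Microphone goes straight to the destination.
  ctx.createMediaStreamSource(mic).connect(destination);

  // BGM goes through a gain node first, then to the same destination.
  const bgmGain = ctx.createGain();
  bgmGain.gain.value = bgmVolume;
  ctx.createMediaStreamSource(bgm).connect(bgmGain);
  bgmGain.connect(destination);

  // The destination node exposes the mixed audio as a stream.
  return destination.stream;
}
```

In a browser, `ctx` would be an `AudioContext` and `S` a `MediaStream`. The returned stream could then be set as the audio input, for example with `audioVideo.chooseAudioInputDevice(mixed)` in the SDK versions available when this article was written; please check the method name against the SDK version you use.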

The Problem When Used with Voice Focus

OK, that’s the end of the matter… No, actually, there is a problem with this method. If you use mixIntoAudioInput together with Voice Focus, Voice Focus will cleanly remove the BGM and SE as noise. Ah, well… impressive performance. Since the documentation gives no details, let’s check the source code of DefaultDeviceController, which provides mixIntoAudioInput.

mixIntoAudioInput(stream: MediaStream): MediaStreamAudioSourceNode {
  <snip...>
  node = DefaultDeviceController.getAudioContext().createMediaStreamSource(stream); // <---(1)
  node.connect(this.getMediaStreamOutputNode()); // <---(2)
  <snip...>
}

(1) creates an input node for the MediaStream to be mixed, and (2) connects it to an output node. The output node is retrieved with getMediaStreamOutputNode. Let’s also take a look at getMediaStreamOutputNode.

private getMediaStreamOutputNode(): AudioNode {
  return this.transform?.nodes?.start || this.getMediaStreamDestinationNode();
}

When Voice Focus is used, its input node (the start node of the transform device) is returned as the output node. As a result, as shown in the figure below, the BGM and SE inputs to be mixed are connected to the Voice Focus input node (red line). If you don’t use Voice Focus, they are connected directly to the destination node (blue line).

In other words, if Voice Focus recognizes the BGM or SE passed to mixIntoAudioInput as noise, it removes it. The BGM or SE that you have added will be cleanly filtered out. This is a big problem.
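The routing described above can be reproduced with a toy model. This is only an illustration, not SDK code: nodes are plain strings, and `mixTarget`, `route`, and `buildGraph` are hypothetical helpers. Only the selection rule in `mixTarget` is taken from the getMediaStreamOutputNode excerpt.

```typescript
// Toy model only: each node name maps to the single node it feeds.
type Edges = Record<string, string>;

// Same selection rule as getMediaStreamOutputNode in the excerpt:
// prefer the transform's start node, otherwise use the destination.
function mixTarget(voiceFocusStart?: string): string {
  return voiceFocusStart || "destination";
}

// Follow the outgoing edges from `from` until the graph ends.
function route(edges: Edges, from: string): string[] {
  const path = [from];
  let next = edges[from];
  while (next !== undefined) {
    path.push(next);
    next = edges[next];
  }
  return path;
}

// Build the graph with or without Voice Focus in the chain.
function buildGraph(useVoiceFocus: boolean): Edges {
  const edges: Edges = {};
  if (useVoiceFocus) {
    edges["mic"] = "voiceFocus";
    edges["voiceFocus"] = "destination";
    edges["bgm"] = mixTarget("voiceFocus"); // red line in the figure
  } else {
    edges["mic"] = "destination";
    edges["bgm"] = mixTarget(); // blue line in the figure
  }
  return edges;
}

console.log(route(buildGraph(true), "bgm").join(" -> "));
// bgm -> voiceFocus -> destination
console.log(route(buildGraph(false), "bgm").join(" -> "));
// bgm -> destination
```

With Voice Focus enabled, the BGM has no path to the destination that avoids the noise suppressor.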

This is internal behavior that is not described in the official documentation, so it may change in the future. But at least for now, you can’t use mixIntoAudioInput to add BGM while Voice Focus is enabled.

One possible solution is to inherit from DefaultDeviceController and override mixIntoAudioInput. In part (2) of mixIntoAudioInput above, you could replace this.getMediaStreamOutputNode with this.getMediaStreamDestinationNode to connect to the destination node directly. However, since getMediaStreamDestinationNode is a private method, it cannot be called from a subclass. We are stuck.

Do I have to give up? Can’t I take the professor to space? No, let’s think about it some more.

Flexibility of Amazon Chime SDK

What I want to revisit here is the point that MediaStream can be used as Audio Input. The point is that no matter what kind of processing you do, if you can output it as a MediaStream, you can input it into the Amazon Chime SDK. So I thought it would be a good idea to remove noise and mix the BGM before inputting it into the Amazon Chime SDK. In other words, we can make the following configuration.

To do this, we need a noise suppression function. Of course, we can use the one the Amazon Chime SDK already has: Voice Focus. In the previous article, we passed the TransformDevice created by Voice Focus directly as audioInput. This time, we will extract the output of the TransformDevice and mix it with the BGM using the Web Audio API. Then we input the result of the mix (a MediaStream) into the Amazon Chime SDK as audioInput.
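The configuration described here can be sketched roughly as below. This is a sketch under assumptions, not the code of the demo repository: `denoiseThenMix`, `DenoiserLike`, and `bgmVolume` are hypothetical names, and the sketch assumes that the transform device can build a subgraph with `start` and `end` nodes through a `createAudioNode(context)` call, matching the `transform.nodes.start` seen in the SDK excerpt. Please verify that call against your SDK version.

```typescript
// Minimal structural types (assumptions for this sketch).
interface NodeLike {
  connect(destination: NodeLike): unknown;
}
interface Subgraph {
  start: NodeLike;
  end: NodeLike;
}
interface ContextLike<S> {
  createMediaStreamSource(stream: S): NodeLike;
  createMediaStreamDestination(): NodeLike & { stream: S };
  createGain(): NodeLike & { gain: { value: number } };
}
// Hypothetical stand-in for the Voice Focus transform device.
interface DenoiserLike<S> {
  createAudioNode(context: ContextLike<S>): Promise<Subgraph | undefined>;
}

// Denoise the microphone first, then add the BGM after the denoiser,
// so the noise suppressor never sees the music.
async function denoiseThenMix<S>(
  ctx: ContextLike<S>,
  denoiser: DenoiserLike<S>,
  mic: S,
  bgm: S,
  bgmVolume = 0.3
): Promise<S> {
  const destination = ctx.createMediaStreamDestination();
  const micSource = ctx.createMediaStreamSource(mic);

  const subgraph = await denoiser.createAudioNode(ctx);
  if (subgraph) {
    // mic -> denoiser -> destination
    micSource.connect(subgraph.start);
    subgraph.end.connect(destination);
  } else {
    // No denoiser available: pass the microphone through unchanged.
    micSource.connect(destination);
  }

  // bgm -> gain -> destination, bypassing the denoiser entirely.
  const bgmGain = ctx.createGain();
  bgmGain.gain.value = bgmVolume;
  ctx.createMediaStreamSource(bgm).connect(bgmGain);
  bgmGain.connect(destination);

  return destination.stream;
}
```

In a browser, `ctx` would be an `AudioContext`, `mic` a stream from `navigator.mediaDevices.getUserMedia({ audio: true })`, and `denoiser` the Voice Focus TransformDevice from the previous article. The returned MediaStream is what gets passed to the Amazon Chime SDK as audioInput.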

Demo

The following video is the result of actual operation. We can remove noise and add BGM.

Repository

The source code for the demo used in this evaluation is available in the following repository.

https://github.com/w-okada/flect-chime-sdk-demo

Noise Suppression (Voice Focus) and virtual background can be set from the Setting menu bar at the top of the screen. BGM/SE can also be played from BGM/SE in the menu.

In addition to the functions described here, the application in this repository also includes chat and whiteboard functions. Cognito integration has also been implemented.

Finally

This time, I tried adding BGM/SE with the Amazon Chime SDK. If you combine mixIntoAudioInput with Voice Focus in the normal way, Voice Focus removes the BGM/SE, so I hooked the output of Voice Focus and added the BGM/SE to it. As you can see in the demo, I think it works well. In the future, mixIntoAudioInput may be improved to let us add BGM/SE more directly, but for now we can do it this way.

Acknowledgments

The demo uses the following video, which is published under the Creative Commons Attribution license (reuse allowed).

https://www.youtube.com/watch?v=6gBtE-n8j2E

BGM is from this site.

https://otologic.jp

Image is from this site.

https://www.irasutoya.com/

Software researcher and engineer