This article is also available here (Japanese).
In this system, the game screens of up to 15 users are distributed in real time, so each client needs a certain amount of network bandwidth. In addition, the CPU load on the browser increases because every received video has to be decoded.
In this article, I would like to introduce an idea that lets users with limited network bandwidth or CPU power still receive and display the videos when such a large amount of video data is being distributed.
The following figure shows the result of the idea. The red arrow indicates the amount of data received, which was reduced from 4.4Mbps to 1.6Mbps.
Data Transfer of Amazon Chime SDK JS
Amazon Chime SDK JS, like many other online conferencing systems, uses WebRTC to send and receive video. Since WebRTC and SFU have been explained in many places, I will only introduce the key points here. For more information on WebRTC and SFU, please refer to, for example, here (Japanese) and here (Japanese).
WebRTC was originally a P2P technology that allows clients (browsers) to communicate directly with each other to send and receive media data. This works well for one-to-one use between two users, but it becomes a problem when three or more users are involved.
The figure below shows the case of an online conference with three users. (a) shows the diagram when P2P is used. As shown in (a), with P2P, each client has to send media data to the other two clients. Also, each client needs to receive media data from the other two clients. In this way, data is exchanged on a full mesh between each client, and a large amount of data has to be sent and received depending on the number of clients.
The technologies used to alleviate this problem are SFU (Selective Forwarding Unit) and MCU (Multi-point Control Unit). Both route the data via a server to reduce the amount of data transferred. Figure (b) above shows the SFU case: each client sends its data only once, to the server, and the server forwards it to each of the other clients as is, without any special processing. Therefore, the amount of data received by each client is the same as in P2P. The MCU shown in (c), on the other hand, composes the data received from the clients into a single stream on the server and sends that stream to each client.
To summarize the comparison: an SFU puts little load on the server, but the data transfer and video decoding load on each client grows with the number of users. An MCU performs heavier processing on the server, so the load is concentrated there, but the amount of data transferred to each client and the load on each client can be kept constant.
SFU and MCU each have their advantages and disadvantages, but SFU seems to be the more common choice at present.
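To make the comparison concrete, here is a minimal sketch that counts how many video streams each client sends and receives under the three topologies described above. The function name `streamsPerClient` and the types are made up for illustration, and the sketch assumes every client publishes exactly one video.

```typescript
type Topology = "P2P" | "SFU" | "MCU";

interface StreamCount {
  sentPerClient: number;     // video streams each client uploads
  receivedPerClient: number; // video streams each client downloads and decodes
}

// Assumes every client publishes exactly one video stream.
function streamsPerClient(topology: Topology, clients: number): StreamCount {
  if (!Number.isInteger(clients) || clients < 2) {
    throw new RangeError("clients must be an integer >= 2");
  }
  const others = clients - 1;
  switch (topology) {
    case "P2P": // full mesh: send to and receive from every other client
      return { sentPerClient: others, receivedPerClient: others };
    case "SFU": // send once to the server, receive every other client's stream
      return { sentPerClient: 1, receivedPerClient: others };
    case "MCU": // send once, receive one composed stream
      return { sentPerClient: 1, receivedPerClient: 1 };
  }
}

for (const topology of ["P2P", "SFU", "MCU"] as const) {
  const count = streamsPerClient(topology, 16);
  console.log(`${topology}: send ${count.sentPerClient}, receive ${count.receivedPerClient}`);
}
// P2P: send 15, receive 15
// SFU: send 1, receive 15
// MCU: send 1, receive 1
```

With 16 clients, only the MCU keeps the receive side constant, which is the property the rest of this article tries to borrow.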
Controlling the data transfer in Amazon Chime SDK JS
With Amazon Chime SDK JS, which employs SFU, each client needs to receive data in proportion to the number of other clients, as described above. Amazon Chime SDK JS has a feature to mitigate this by selecting source clients whose video you stop receiving (Figure (a) below). However, once you stop receiving, you will not be able to see the video of that client at all until you resume. This makes it unsuitable for situations where you want to see all users’ game screens in real time, such as Among Us gameplay.
In addition, Amazon Chime SDK JS can change the quality (resolution and bit rate) of the video data being sent. By building in a function that changes the quality on the sending side according to the status of the receiving side, it is possible to control the amount of data received by each client (Figure (b) below). However, with this approach the quality change affects all users, which may be annoying for clients with enough network bandwidth and CPU.
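As a rough sketch of the "stop receiving" approach: Amazon Chime SDK JS exposes `pauseVideoTile(tileId)` and `unpauseVideoTile(tileId)` on its audio-video facade. The code below does not import the SDK. Instead it declares a minimal interface, `TilePauser`, that the real facade is assumed to satisfy. The helper `applyReceiveSelection` and the `RemoteTile` shape are hypothetical names for illustration, not part of the SDK or the demo.

```typescript
// Minimal structural stand-in for the part of the SDK's audio-video facade used here.
interface TilePauser {
  pauseVideoTile(tileId: number): void;
  unpauseVideoTile(tileId: number): void;
}

// Hypothetical shape: one entry per remote video tile known to the application.
interface RemoteTile {
  tileId: number;
  attendeeId: string;
}

// Unpauses tiles of attendees we want to watch and pauses all others.
// Returns the tile ids that were paused.
function applyReceiveSelection(
  audioVideo: TilePauser,
  remoteTiles: RemoteTile[],
  attendeesToShow: ReadonlySet<string>,
): number[] {
  const pausedTileIds: number[] = [];
  for (const tile of remoteTiles) {
    if (attendeesToShow.has(tile.attendeeId)) {
      audioVideo.unpauseVideoTile(tile.tileId);
    } else {
      audioVideo.pauseVideoTile(tile.tileId);
      pausedTileIds.push(tile.tileId);
    }
  }
  return pausedTileIds;
}
```

In an application, `audioVideo` would be the facade of the running meeting session, and `remoteTiles` would be built from the tiles the SDK reports. A paused tile shows nothing until it is unpaused, which is exactly the drawback described above.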
So I’d like to think of another way.
Idea: Video compose and resend with dedicated client
This time, we’ll set up a client (Dummy Client) separate from the clients used by the users, and use that client to compose multiple videos into a single video. The composed video is then sent back to the Amazon Chime server (the SFU in the context of this article) for transmission to each user’s client. In other words, it behaves like a pseudo-MCU.
Each user can choose between two ways of receiving data:
- Receive the data as it is sent from each user’s client, as usual.
- Receive the composed video data described above.
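One part of the compose step is deciding where each user's video goes inside the single output frame. The following is a minimal sketch of a near-square grid layout. The function `gridLayout` and the `Rect` type are illustrative names, and the layout rule is an assumption rather than what the demo necessarily does. In a browser, each rectangle could then be filled with `CanvasRenderingContext2D.drawImage()` and the canvas turned into a video stream with `HTMLCanvasElement.captureStream()`.

```typescript
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Splits a canvas into a near-square grid and returns one cell per tile,
// filled left to right, top to bottom.
function gridLayout(tileCount: number, canvasWidth: number, canvasHeight: number): Rect[] {
  if (tileCount <= 0) return [];
  const cols = Math.ceil(Math.sqrt(tileCount));
  const rows = Math.ceil(tileCount / cols);
  const cellWidth = Math.floor(canvasWidth / cols);
  const cellHeight = Math.floor(canvasHeight / rows);
  const cells: Rect[] = [];
  for (let i = 0; i < tileCount; i++) {
    cells.push({
      x: (i % cols) * cellWidth,
      y: Math.floor(i / cols) * cellHeight,
      width: cellWidth,
      height: cellHeight,
    });
  }
  return cells;
}

// 4 game screens on a 1280x720 canvas: a 2x2 grid of 640x360 cells.
console.log(gridLayout(4, 1280, 720));
```

For 16 camera images on the same canvas the grid becomes 4x4 with 320x180 cells, so the receiver decodes one 1280x720 stream instead of 16 separate ones.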
This allows each user to change the amount of data to be received depending on the network and CPU limitation.
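The choice between the two options can be thought of as a simple bandwidth comparison. The sketch below is only an illustration: `suggestReceiveMode`, `requiredKbps`, and the `BandwidthEstimate` fields are hypothetical names, and the bitrates in the usage example are rough figures derived from the Among Us measurement later in this article (about 4.4Mbps for four individual streams, about 1.5Mbps for the composed stream).

```typescript
type ReceiveMode = "individual" | "composed";

interface BandwidthEstimate {
  senderCount: number;        // number of remote users sending video
  perStreamKbps: number;      // assumed average bitrate of one individual stream
  composedStreamKbps: number; // assumed bitrate of the single composed stream
}

// Estimated downlink bandwidth needed for a given receive mode.
function requiredKbps(mode: ReceiveMode, estimate: BandwidthEstimate): number {
  return mode === "individual"
    ? estimate.senderCount * estimate.perStreamKbps
    : estimate.composedStreamKbps;
}

// Suggests individual streams when they fit in the available bandwidth,
// otherwise falls back to the composed stream.
function suggestReceiveMode(availableKbps: number, estimate: BandwidthEstimate): ReceiveMode {
  return requiredKbps("individual", estimate) <= availableKbps ? "individual" : "composed";
}

const amongUsEstimate: BandwidthEstimate = {
  senderCount: 4,
  perStreamKbps: 1100,
  composedStreamKbps: 1500,
};
console.log(suggestReceiveMode(10000, amongUsEstimate)); // individual
console.log(suggestReceiveMode(2000, amongUsEstimate));  // composed
```

A helper like this could only suggest a default. Network bandwidth is not the only constraint, since decoding many streams also costs CPU, so the final choice is best left to the user.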
One concern with this idea is the time lag of the video. By design, the composed video inevitably incurs the overhead of one extra round trip between the server and the client that composes the video, plus the time taken by the composition itself. How much the network load will actually be reduced, and how large the time lag will be, is unknown until we try it.
So, I would like to create a demo and experiment with it.
Structure of the experiment
The time lag of the video depends on the network between each user’s client, Amazon Chime’s server, and the client that composes the video. In this case, we will run the video compose client on Fargate, based on the assumption that it would be better to run it within the same Amazon network. The clients for each user will be run outside the Amazon network, keeping in mind the actual use case. Specifically, we will run them on GCP. In order to monitor the amount of data transfer, we will also launch a client dedicated to receiving video on a local PC.
I would like to use the Among Us screen, but I couldn’t prepare enough accounts for 15 users, so I am experimenting with 4 accounts (4 users). I think you can probably see the effect even with this. I would also like to check the behavior with 16 users’ camera images to see the effect more clearly.
Result of the experiment
Case 1. Among Us
I measured the network usage when receiving four Among Us screens. Please see the YouTube video below. From 0:00 to 0:20, the data from each client is received as it is; from 0:20 to 0:50, the composed video is received.
The figure below captures the amount of data received. On the left is the situation when the data from each client is received as is. On the right is the situation when receiving the composed data. It can be seen that the receive rate dropped from about 4.4Mbps to about 1.5Mbps.
Case 2. Video conference
Assuming a normal video conference, I measured the network usage when receiving 16 camera images. I also looked at the time lag caused by composing the data. This time, dummy images were used for the camera images, and one user used a stopwatch screen as a pseudo camera image to measure the time lag. Please see the following YouTube video. From 0:00 to 0:07, each client’s data is received as it is. From 0:07 to 0:30, reception of each client’s data is stopped, and you can see the data rate gradually decreasing. From 0:30 to 0:55, the composed video is received. From 1:15 to 1:50, you can see the time lag.
The figure below captures the amount of data received. On the left is the situation when the data from each client is received as is. On the right is the situation when receiving the synthesized data. It can be seen that the data transfer has been reduced from about 5.0Mbps to about 1.0Mbps.
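To put the two measurements side by side, the relative reduction can be computed directly from the approximate figures reported above (about 4.4Mbps to 1.5Mbps for Among Us, about 5.0Mbps to 1.0Mbps for the 16 camera images). The function name `reductionPercent` is just for illustration.

```typescript
// Relative reduction of the receive rate, in percent.
function reductionPercent(beforeMbps: number, afterMbps: number): number {
  if (beforeMbps <= 0) {
    throw new RangeError("beforeMbps must be positive");
  }
  return ((beforeMbps - afterMbps) / beforeMbps) * 100;
}

console.log(reductionPercent(4.4, 1.5).toFixed(1)); // 65.9 (Case 1, 4 game screens)
console.log(reductionPercent(5.0, 1.0).toFixed(1)); // 80.0 (Case 2, 16 camera images)
```

Since the measured rates are approximate, these percentages are only indicative, but they suggest the saving grows with the number of videos being composed.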
The figure below shows the time lag. On the left is the image received directly from the SFU, and on the right is the composed image. There is a time lag of about 1 to 2 seconds. Also, as you can see in the YouTube video, the composed video has an FPS of only about 3.
From the above experiments, we found that the amount of data received can be greatly reduced by this idea. We also found that the time lag is about 1–2 seconds. This lag may make the system a little difficult to use for video conferencing, because the other party’s facial expressions and voice may be noticeably out of sync, or there may be subtle pauses. On the other hand, for game streaming such as Among Us screen streaming, it seems to be usable enough, since it is hard to feel discomfort even if the audio and video do not match perfectly. I have actually played Among Us using this system several times, and I have not received any feedback of discomfort in this regard.
We have received feedback that the FPS is only about 3, which is a weak point. One of the main reasons is that the processing load of composing the video is too high for Fargate to handle. In fact, it is a bit better on a local Core i9 machine. Also, the FPS improves as the number of participating users decreases. For example, with 4 users you can get a reasonable FPS, as you can see in the Among Us YouTube video above.
Since the purpose of this idea is to reduce the amount of data received by changing the quality of the video received according to the user’s environment, we will not pursue the improvement of FPS too much at this time. However, we would like to consider reducing the weight of the processing a little more as a future task.
I also mentioned above that it might be a little difficult to use in video conferencing due to the time lag. However, when we think about actual use cases, there will basically be one user who is speaking (the active user). If the system is designed so that only the active user’s video is received uncomposed, and the rest of the users are shown through the composed video, this problem will be improved considerably. This corresponds to the user who appears on the left side of the YouTube video above where the time lag was measured.
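A minimal sketch of that design, under the assumption that the application already knows the attendee id of the composing client and of the current active speaker: the helper below only decides which remote videos to receive directly and which to stop receiving. `planForActiveSpeaker` and `ReceivePlan` are hypothetical names, not part of Amazon Chime SDK JS or of the demo.

```typescript
interface ReceivePlan {
  direct: string[]; // attendee ids whose video is received as is
  paused: string[]; // attendee ids whose individual video is not received
}

// Receives the composed video plus the active speaker's own video,
// and stops receiving every other individual video.
function planForActiveSpeaker(
  remoteAttendeeIds: string[],
  activeSpeakerId: string | null,
  composerAttendeeId: string,
): ReceivePlan {
  const plan: ReceivePlan = { direct: [], paused: [] };
  for (const attendeeId of remoteAttendeeIds) {
    if (attendeeId === composerAttendeeId || attendeeId === activeSpeakerId) {
      plan.direct.push(attendeeId);
    } else {
      plan.paused.push(attendeeId);
    }
  }
  return plan;
}

console.log(planForActiveSpeaker(["alice", "bob", "carol", "composer"], "bob", "composer"));
// { direct: [ 'bob', 'composer' ], paused: [ 'alice', 'carol' ] }
```

With this plan a receiver handles two streams regardless of the number of participants, and the plan would need to be recomputed whenever the active speaker changes.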
The demos used in this experiment are stored in the repository below.
Please clone it with the branch name blog002-reduce-traffic. Please refer to the Readme of the repository for deployment instructions.
$ git clone https://github.com/w-okada/flect-chime-sdk-demo.git -b blog002-reduce-traffic
$ cd flect-chime-sdk-demo/
In this article, I introduced an idea to enable users with limited network bandwidth and CPU to receive and display video in use cases where a lot of video data is exchanged in applications using Amazon Chime SDK JS. Due to the structure of the idea, there will be a time lag, but depending on the use case, this may not be a major problem. Also, even if there is a problem as described in the discussion above, there may be a workaround.
On the other hand, as for the FPS degradation, it is true that the environment of the experiment was a bit limited, but it seems that the load of the video synthesis process is too high, which is something that should be improved. We would like to make improvements in the future to make the system even better.
Recording or capturing Amazon Chime SDK meetings with the demo in this blog may be subject to laws or regulations regarding the recording of electronic communications. It is your and your end users’ responsibility to comply with all applicable laws regarding the recordings, including properly notifying all participants in a recorded session, or communication that the session or communication is being recorded, and obtain their consent.