Hello everyone. My name is Mao Jinxin, and my English name is Jason. I am the Audio Solution Manager at Espressif Systems. I'm glad to share my content with you today. Before we get started, allow me to introduce our team. Our team focuses on in-depth research into audio, video, and image algorithms, designing multimedia software development frameworks, and integrating artificial intelligence technology to develop multimedia product solutions for smart home, children's education, multimedia entertainment, and other fields.
Today, I will share three parts with you: market overview, multimedia framework, and multimedia solutions. Let's look at the first part, the market overview.
Let's look at what we call multimedia products. First of all, it is an audio product or a video product. It uses the latest artificial intelligence technology, such as voice and image recognition. Its application scenarios include smart home appliances, children's education, entertainment, and other areas.
It may be an upgrade of a traditional product or a newly designed product. Among traditional products, smart home appliances are the most typical. Let's take a look at two products. With a smart washing machine, you can use voice control to easily operate it and enjoy a smarter and more convenient laundry process.
It can be integrated with the smart home system to realise remote control and monitoring of laundry progress. In addition to the functions of a traditional alarm clock, the smart alarm clock also has practical functions such as weather information, schedule reminders, and so on. It can set music as an alarm according to your preference to help you wake up easily. It has a voice control function, which allows you to stop the alarm without touching any switch.
The best representative of the new products is the Wi-Fi smart speaker. It has a built-in smart assistant, which can provide you with information, play music, and control smart home devices through voice commands. The speaker has high-quality audio output, providing an immersive music experience and clear calls. It also integrates various entertainment functions, such as intelligent Q&A, storytelling, and weather forecasts, to meet your needs.
Another type is the smart toy, which is designed for children and focuses on safety and education. It has functions such as interactive games, storytelling, and children's songs to stimulate interest in learning. It can provide useful educational content through voice interaction to help children develop social skills. In the field of video, the home security system is an important application scenario for audio and video.
Whether you are at home or not, you can remotely monitor and control the system through your mobile phone application so that your family and property are always protected. Here are some of our commonly used home security products. The Smart Cat's Eye, a smart door viewer, is the home security system's first line of defense. It has a high-resolution camera, night vision, face recognition, and other functions. It can monitor activity at the door in real time when you are not at home to ensure your family's safety.
The home camera is an important part of the home security system, covering the key areas of the home, such as the living room, porch, and garden. The camera can provide real-time video monitoring and round-the-clock protection so that you can keep an eye on the safety of your family and property all the time. The smart door lock is a new category, especially the Wi-Fi smart door lock. The smart door lock supports many unlocking methods, such as passwords, fingerprints, and mobile phone applications, which can help you avoid the hassle of dealing with traditional keys. The high-resolution camera on the lock can monitor suspicious people at the door in real time, and video of abnormal behaviour will be transmitted to your phone or the cloud for safe door access management.
Now let me introduce our multimedia development framework. The multimedia software framework consists of four layers: protocols, middle-level API, audio and video codecs, and the hardware abstraction layer. ADF offers basic elements, pipeline components, and button services, as well as Wi-Fi and OTA connection services. AEC comes from ESP-SR; it removes echoes during calls and improves the quality of audio and video calls.
The protocols include ESP-RTC, RTSP, RTMP, etc. The audio codecs include G711, AAC, OPUS, etc. The image codecs include MJPEG, MPEG-4 (H264), etc. The processing includes Resample, EQ, etc. This section will introduce the commonly used codecs, which include the image codec, the video encoder, and the AAC encoder.
The JPEG encoder is used to convert raw image data into JPEG format for transmission and storage. The original JPEG encoding has a low frame rate and uses a lot of memory. We used the AI instructions on ESP32-S3 to optimise the JPEG encoder, improving its performance by about two to three times. The JPEG encoder has the following features. The input supports RGB888, RGBA, YCbYCr, GRAY, and YCbY2YCrY2, and the rotation supports 0, 90, 180, and 270 degrees. The table on the right is our performance test results.
QVGA is about 54 FPS. VGA is about 14 FPS. As with the JPEG encoder, the performance of the JPEG decoder has also been improved by about two to three times. The JPEG decoder has the following features.
The output supports RGB888, RGB565 (big-endian), and RGB565 (little-endian), and the rotation supports 0, 90, 180, and 270 degrees. The table on the right is our performance test results. QVGA is about 70 FPS.
VGA is about 16 FPS. MPEG-4 AVC (H.264) is a video codec whose original encoder implementation is unsuitable for lightweight MCUs. We are reducing memory usage by modifying the code structure and optimising CPU usage by replacing time-consuming C code with faster assembly code. We aim to make it compatible with ESP32-S3 so that it can run efficiently on this platform. Currently, the following features are supported. The table on the right is our performance test results.
At QVGA resolution with YUV422 input, the frame rate is 10.86 FPS, using 1 MB of memory. AAC is commonly used for audio encoding in video streams, and we have developed the ESP-AAC encoder. It supports the following features. The AAC low complexity profile is supported.
The sample rate ranges from 8 kHz to 96 kHz. Mono and dual channel are supported. The bit depth is 16 bits per sample. The constant bit rate ranges from 12 kbps to 160 kbps, and encoding without an ADTS header is supported.
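For readers unfamiliar with ADTS, the sketch below shows what that header looks like when it is present: a 7-byte prefix in front of each raw AAC frame that carries the profile, sample rate, channel count, and frame length. It follows the standard ADTS bit layout and is not the ESP-AAC encoder API; the function name `adts_header` and the example values are illustrative assumptions.

```python
# Sampling-frequency index table defined by the MPEG-4 audio standard.
SAMPLE_RATE_INDEX = {
    96000: 0, 88200: 1, 64000: 2, 48000: 3, 44100: 4, 32000: 5,
    24000: 6, 22050: 7, 16000: 8, 12000: 9, 11025: 10, 8000: 11,
}


def adts_header(payload_len, sample_rate, channels, object_type=2):
    """Build a 7-byte ADTS header (no CRC) for one raw AAC frame.

    object_type is the MPEG-4 audio object type; 2 means AAC-LC.
    adts_header itself is a hypothetical helper, not an ESP-AAC function.
    """
    sf_index = SAMPLE_RATE_INDEX[sample_rate]
    frame_len = payload_len + 7          # the length field includes the header
    if frame_len >= (1 << 13):
        raise ValueError("frame too large for the 13-bit length field")

    header = bytearray(7)
    header[0] = 0xFF                     # syncword, high 8 bits
    header[1] = 0xF1                     # syncword low 4 bits, MPEG-4, layer 0, no CRC
    header[2] = ((object_type - 1) << 6) | (sf_index << 2) | ((channels >> 2) & 0x1)
    header[3] = ((channels & 0x3) << 6) | ((frame_len >> 11) & 0x3)
    header[4] = (frame_len >> 3) & 0xFF
    header[5] = ((frame_len & 0x7) << 5) | 0x1F   # buffer fullness 0x7FF, high bits
    header[6] = 0xFC                     # buffer fullness low bits, one AAC frame
    return bytes(header)


# Example: a 100-byte mono AAC-LC frame sampled at 16 kHz.
example = adts_header(100, 16000, 1)
print(example.hex())
```

A stream encoded without this header carries only the raw AAC frames, so the receiver has to learn the sample rate and channel count some other way, for example from the container or the signalling protocol.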
The table on the right is our performance test results. The memory usage is 52 KB. Next, I will introduce the protocols we have developed for multimedia applications, such as ESP-RTC, ESP-RTSP, and ESP-RTMP. The first one is ESP-RTC, a lightweight real-time communication protocol that is suitable for voice calls, video calls, and voice conferences. The interface is straightforward, and it offers a flexible API.
The software structure is illustrated in the diagram on the right. The main features include SIP version 2.0, and the transport layer supports UDP, TCP, and TLS. The audio codec supports G711A, G711U, and OPUS. The video codec supports MJPEG and H.264. The grey parts of the picture are functions we are still developing, such as WebRTC, ICE, and PLC.
The ESP-RTSP protocol is developed by Espressif Systems based on standard RTSP. RTSP is an application-layer protocol used for controlling the delivery of real-time data, such as audio and video, over IP networks. It uses TCP or UDP as the transport protocol and typically works together with RTP, which carries the media stream. ESP-RTSP is well suited to IoT applications that need real-time media communication with low latency, like video monitoring and audio streaming. ESP-RTSP has these features. It supports RTSP version 1.0, and it supports the pusher server and client.
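To make the RTSP 1.0 wire format concrete: requests are plain text, similar to HTTP, with a request line, a CSeq header, and a blank line at the end. The sketch below only builds and parses such messages; it is not the ESP-RTSP API, and the helper names and the URL `rtsp://192.168.1.10:554/live` are made up for illustration.

```python
def build_rtsp_request(method, url, cseq, headers=None):
    """Format one RTSP/1.0 request as text (hypothetical helper)."""
    lines = [f"{method} {url} RTSP/1.0", f"CSeq: {cseq}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # A blank line terminates the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_status_line(response):
    """Split the first line of an RTSP response into (version, code, reason)."""
    status_line = response.split("\r\n", 1)[0]
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


URL = "rtsp://192.168.1.10:554/live"   # assumed address, for illustration only

options_request = build_rtsp_request("OPTIONS", URL, 1)
describe_request = build_rtsp_request("DESCRIBE", URL, 2, {"Accept": "application/sdp"})
print(options_request)
print(describe_request)

# A server reply to the OPTIONS request could start like this:
print(parse_status_line("RTSP/1.0 200 OK\r\nCSeq: 1\r\n\r\n"))
```

The CSeq number increases with each request so that a reply can be matched to the request it answers.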
The public methods include OPTIONS, DESCRIBE, SETUP, PLAY, and TEARDOWN. It supports the most popular RTSP servers, such as EasyDarwin. The last protocol is ESP-RTMP. RTMP is an Adobe protocol for real-time multimedia streaming. It's used for live streaming, interactive media, and video platforms.
It runs over TCP, along with other protocols like AMF for data serialization. RTMP allows for low-latency streaming and dynamic bit rate changes, making it suitable for real-time applications like live video broadcasting, online games, and interactive web applications. ESP-RTMP is developed based on RTMP version 3. It has three roles: publisher (rtmp_push), player (rtmp_src), and server (rtmp_server). The RTMP server receives the audio, video, and data streams from the source and distributes them to clients or other servers. The RTMP player connects to the RTMP server to receive and play the live stream or interactive content.
The RTMP publisher is the source of the audio, video, or data stream, such as a camera or an encoder that captures and sends the content to the RTMP server. ESP-RTMP supports the audio codecs PCM, G711, AAC, and MP3. The supported video codecs are H264 and MJPEG. Note that MJPEG is not supported in the standard RTMP protocol.
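As a small illustration of the AMF serialization mentioned above, the sketch below encodes two AMF0 values, a string and a number, the way an RTMP command message begins. It assumes AMF0 (the older of the two AMF versions) and shows only the byte layout; the helper names are hypothetical and are not part of ESP-RTMP.

```python
import struct


def amf0_number(value):
    """AMF0 number: type marker 0x00 followed by a big-endian 64-bit float."""
    return b"\x00" + struct.pack(">d", value)


def amf0_string(text):
    """AMF0 short string: marker 0x02, 16-bit big-endian length, UTF-8 bytes."""
    data = text.encode("utf-8")
    if len(data) > 0xFFFF:
        raise ValueError("too long for an AMF0 short string")
    return b"\x02" + struct.pack(">H", len(data)) + data


# An RTMP "connect" command starts with the command name and a transaction ID.
# The command object that normally follows is omitted in this sketch.
command_start = amf0_string("connect") + amf0_number(1.0)
print(command_start.hex())
```

Every value is prefixed with a one-byte type marker, which is how the receiver knows how many bytes to read next.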
This section first discusses smart home application products and then introduces the application solutions we already support. Smart homes, also known as home automation, utilize advanced technologies to provide people with increased convenience, comfort, and security. Here are some common application scenarios for smart homes.
The first one is home security. Smart homes employ a range of security features, such as smart door locks, surveillance cameras, motion sensors, and window sensors, allowing homeowners to monitor their property and control access to their homes remotely. The second one is smart home appliances. Appliances like smart washing machines, air conditioners, smart lighting, and coffee machines can connect to the internet and be remotely controlled. For instance, users can receive a notification about the progress of their laundry, start their air conditioner before arriving home, or adjust the brightness level based on natural light.
The third one is voice assistants. Smart homes can integrate with popular virtual assistants like Amazon Alexa, Google Assistant, and Apple HomeKit, enabling voice control for a wide range of connected devices and services. Through smartphone apps or voice assistants, homeowners can remotely control aspects of their homes, including security, lighting, temperature, and appliances, providing increased convenience and peace of mind.
The following solutions showcase how using smart home technology can greatly enhance the overall quality of life and provide a secure and convenient environment for those residing in the home. The first solution is video call, which can be applied to smart door locks, surveillance cameras, baby monitors, and other similar products. Our video call project is based on ESP-RTC, which supports MJPEG and H.264 for image compression and AAC or G711, etc., for audio compression. It also uses AEC, NS, and other technologies to improve the sound quality of calls.
At the same time, it's a very lightweight project and can run on the ESP32 and ESP32-S3 platforms. It also supports video calls from device to device or from device to app and PC. In the video call project, the image collected by the camera is encoded as Motion JPEG or H.264 and transmitted to the receiver. After the receiver gets the data, it is decoded and displayed. At the same time, the collected audio data is processed by AEC and then encoded and transmitted to the receiver.
Next, the data is decoded and played on the receiver. This project works in full-duplex mode. In other words, at any given time it runs at most one image encoding, one image decoding, one audio encoding, and one audio decoding.
You can imagine its complexity. We handle this by combining the performance optimisations of Motion JPEG or H.264 and AAC on ESP32-S3 that were introduced earlier. To view the code of the video call project, please go to ADF's ESP-RTC project. In this picture, you can see the “esp_rtc_init” configuration process code.
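As a rough sanity check on why the codec optimisations matter for a two-way call, the sketch below takes the JPEG figures quoted earlier (encode about 54 FPS at QVGA and 14 FPS at VGA, decode about 70 FPS at QVGA and 16 FPS at VGA) and estimates how much of one core a call would spend on image coding. The 10 FPS call frame rate, and the idea that encode and decode share a single core with nothing else running, are assumptions made only for this arithmetic.

```python
# Approximate JPEG throughput quoted in the talk, in frames per second.
ENCODE_FPS = {"QVGA": 54, "VGA": 14}
DECODE_FPS = {"QVGA": 70, "VGA": 16}


def codec_load(resolution, call_fps):
    """Fraction of one core spent on JPEG encode plus decode at call_fps.

    Assumes encode and decode run back to back on the same core, which is
    a simplification and not a measured result.
    """
    seconds_per_frame = 1.0 / ENCODE_FPS[resolution] + 1.0 / DECODE_FPS[resolution]
    return call_fps * seconds_per_frame


CALL_FPS = 10  # assumed target frame rate for the call

for resolution in ("QVGA", "VGA"):
    load = codec_load(resolution, CALL_FPS)
    print(f"{resolution}: {load:.0%} of one core at {CALL_FPS} FPS")
```

Under these assumptions QVGA leaves roughly two thirds of the core free for audio, AEC, and networking, while VGA would need more than one core's worth of time for image coding alone.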
In this project, the buttons on the board have the following functions. Press the play button to call this special ID. Press the record button to answer the incoming call. Press the mute button to end the call.
Press the set button to enter the meeting room if the server supports it. You can adjust the voice volume by using the volume plus or volume minus button. If you have an Android phone, you can also download the ESP-RTC APK. After installation, configure your own server port, and then the device and the mobile phone can have a video conversation.
The picture on the right is a screenshot of a call in the mobile phone application. Here, we introduce an RTMP-based application solution that uses the ESP32-S3-Korvo-2 development board for live streaming on YouTube. This development board serves as the RTMP client, using AAC for audio encoding and H.264 for video encoding to upload the captured video content to the YouTube server.
This is the software diagram, where audio and images go through the AV recorder and are sent to the server using the RTMP protocol. The image on the right is a live scene: on the computer, a video of the 2022 Espressif Developer Conference is being played, while the ESP32-S3-Korvo-2 is capturing the computer screen and pushing it to YouTube for live streaming. For the code of this example, please refer to ADF's RTMP project for instructions on how to use it. You can watch a live video by clicking here. In this application, you can also configure JPEG for image encoding, although this may reduce its universality, as most servers do not support Motion JPEG video streams. The encoding speed of JPEG at different resolutions is shown in these two tables.
The difference between these two tables is whether the camera output is JPEG or YUV format data. In the application scenario of smart door locks, people used to have to go to the door to check and unlock it after the doorbell rang. However, with the invention of the remote control panel for the door lock, people can now view visitors and unlock the door from anywhere in the room, and the door lock can connect to multiple remote control panels. In the diagram on the right, the server has established a broadcast stream and a chat stream with each of the two clients. The chat stream is responsible for transmitting some control information, while the broadcast stream serves as the channel for pushing audio and video data. The SSDP in the diagram is used for discovery between the clients and the server, and it allows for the addition or removal of panels as needed.
In the actual application, the server is the smart door lock, while the client is the remote door panel. This picture illustrates the initialisation process of the RTMP server. It first calls 'start_rtmp_server' to start the RTMP server, and then creates the pusher and puller channels. On the panel side, it first looks for the server and joins it, and then opens the puller channel.
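The discovery step on the panel side can be pictured with standard SSDP messages. The sketch below builds an M-SEARCH query and parses a reply; it shows only the generic SSDP text format, since the actual messages used by the door lock solution have not been published. The search target string, the sample reply, and the `rtmp://192.168.1.20:1935/live` address are invented for illustration.

```python
SSDP_ADDR = "239.255.255.250"   # standard SSDP multicast address
SSDP_PORT = 1900                # standard SSDP port


def build_msearch(search_target, mx=2):
    """Build the M-SEARCH datagram a panel could multicast to find the lock."""
    return (
        "M-SEARCH * HTTP/1.1\r\n"
        f"HOST: {SSDP_ADDR}:{SSDP_PORT}\r\n"
        'MAN: "ssdp:discover"\r\n'
        f"MX: {mx}\r\n"
        f"ST: {search_target}\r\n"
        "\r\n"
    ).encode("ascii")


def parse_ssdp_response(data):
    """Return (status_line, headers) from a unicast SSDP reply."""
    text = data.decode("ascii", errors="replace")
    head = text.partition("\r\n\r\n")[0]
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().upper()] = value.strip()
    return lines[0], headers


# Hypothetical search target and reply; real values depend on the product.
query = build_msearch("urn:example:service:doorlock-rtmp:1")
reply = (
    b"HTTP/1.1 200 OK\r\n"
    b"ST: urn:example:service:doorlock-rtmp:1\r\n"
    b"LOCATION: rtmp://192.168.1.20:1935/live\r\n"
    b"\r\n"
)
status, headers = parse_ssdp_response(reply)
print(status, headers["LOCATION"])
```

In a real panel the query would be sent over a UDP socket to the multicast address, and the address in the reply would then be used to join the server and open the puller channel.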
This is the software diagram. Data captured by the camera and the ADC is processed through AV_Recorder and then sent via RTMP to the receiver. The receiver processes the received audio and video data for rendering. The code for this solution has not been released yet. If you are interested, please contact our sales team.
That's all for my presentation. Thanks for watching. If you are interested in any of the solutions or have any questions, please feel free to contact us through GitHub or our sales team.
2023-09-21