libaom Case Study & Coding Tools Analysis (part 2)
This article comprises two parts. Part one analyzed the AV1 coding tools; this second part builds on that analysis to examine libaom and determine which tools are most useful in the pursuit of better coding performance (speed) and efficiency (bitrate savings).
Some video engineers claim that libaom exposes too many parameters. There certainly are many choices, and this can lead to confusion about which to use, but I’ve performed much of the analysis and hope that this post will be useful as you consider your own application and use case.
As illustrated in Figure 4, much progress has been made by the libaom development community since Jan 2019 in both coding efficiency and speed. It should also be noted that during the past few months there have been a large number of libaom code submissions dedicated to code refactoring. Documentation and comments are also constantly being added to the repository, making the codebase more legible and easy to understand.
libaom Optimization - General Mode Optimization
In libaom, there are more than 100 speed-related features in the codebase, such as:
- Sequence/frame level speed features
- Global motion speed features
- Partition search speed features
- Motion search speed features
- Inter mode search speed features
- Intra mode search speed features
- Interpolation filter search speed features
All the speed features in libaom have been defined through the use of a dedicated Speed feature data structure in the codebase, allowing the developers and reviewers to easily track the implementation of each feature.
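To make the idea of a dedicated speed feature structure concrete, here is a minimal sketch. It is not libaom's actual structure: the class names, the flag names, and the speed thresholds are all hypothetical and only illustrate how grouping features by category keeps each one easy to track.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionSpeedFeatures:
    # Hypothetical flags, not libaom's real field names.
    early_termination: bool = False
    skip_sub8x8_search: bool = False


@dataclass
class MotionSearchSpeedFeatures:
    # Hypothetical flag: shrink the motion search range at faster speeds.
    reduced_search_range: bool = False


@dataclass
class SpeedFeatures:
    # One sub-structure per feature category, mirroring the list above.
    partition: PartitionSpeedFeatures = field(default_factory=PartitionSpeedFeatures)
    motion: MotionSearchSpeedFeatures = field(default_factory=MotionSearchSpeedFeatures)


def set_speed_features(speed: int) -> SpeedFeatures:
    """Enable progressively more shortcuts as the speed level rises."""
    sf = SpeedFeatures()
    if speed >= 1:
        sf.partition.early_termination = True
    if speed >= 2:
        sf.motion.reduced_search_range = True
    if speed >= 3:
        sf.partition.skip_sub8x8_search = True
    return sf
```

Because every shortcut is a named field set in one place, a reviewer can see which speed level switches it on without hunting through the search code.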
Temporal Filtering for Denoising
Temporal filtering is used in libaom for denoising. Note that this scheme is not part of the AV1 standard; rather, it is an encoder-side optimization. Hence the same scheme can be applied in other encoders as well.
Before digging further into this scheme, there is a need to understand the frame structure in AV1. As shown in Figure 3-1, in a GF group, a hierarchical layer is adopted and frames are allocated to different layers.
The frame of ALTREF plays the most significant role in a GF group, as it is the last frame in one GF group, hence it can be used as a backward prediction reference for all frames in front of it. Meanwhile, it is the first frame in the next GF group, hence it can be used as a forward prediction reference for all frames following it. ALTREF is the frame that has been referenced the most across all frames. Therefore, the encoding of ALTREF is very important and its quality will have an impact on many other frames in the sequence.
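The layering can be sketched with a generic pyramid ordering. This is an illustrative assumption rather than libaom's exact GF group construction: the function name and the midpoint-splitting rule are hypothetical, and the real encoder adapts the structure to the content. The sketch only shows why ALTREF is coded first and sits at the base of the hierarchy.

```python
def gf_group_order(gf_length: int):
    """Return (display_index, layer) pairs in a pyramid coding order.

    Display index 0 is the previous KEY/ALTREF frame (already coded);
    index gf_length is the ALTREF that closes the group.
    """
    order = [(gf_length, 0)]  # ALTREF is coded first, at the base layer

    def visit(lo: int, hi: int, layer: int) -> None:
        # Code the midpoint between two already coded frames, then recurse.
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        order.append((mid, layer))
        visit(lo, mid, layer + 1)
        visit(mid, hi, layer + 1)

    visit(0, gf_length, 1)
    return order
```

For a group of 16 frames, `gf_group_order(16)` starts with frame 16 (the ALTREF) at layer 0, then frame 8 at layer 1, and so on, so every other frame in the group can reference the ALTREF.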
Temporal filtering was first applied to ALTREF before its encoding. Altogether seven frames are involved in the temporal filtering: the 3 adjacent frames prior to the current ALTREF, the 3 right after it, and the ALTREF itself.
A temporal domain search is first conducted, followed by a noise strength estimation. A weighted sum is then used to implement the temporal filtering. As denoising is applied to the ALTREF frame, the denoised version differs from the original source frame. Hence a frame named the OVERLAY frame is introduced to compensate for the difference between the encoded ALTREF version and its corresponding original source.
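The weighted-sum step can be sketched as follows. This is a simplified illustration, not libaom's filter: it assumes the seven frames have already been motion-aligned by the temporal search, it treats each frame as a flat list of pixel values, and the Gaussian-style weight is an assumed form chosen for clarity. The function and parameter names are hypothetical.

```python
import math


def temporal_filter(frames, center, noise_sigma):
    """Denoise frames[center] with a per-pixel weighted sum over all frames.

    frames: list of equally sized pixel lists, assumed already motion-aligned
    noise_sigma: estimated noise strength (> 0); larger values blend more
    """
    target = frames[center]
    filtered = []
    for i, ref_pixel in enumerate(target):
        weight_sum = 0.0
        value_sum = 0.0
        for frame in frames:
            diff = frame[i] - ref_pixel
            # Similar pixels get weights near 1; dissimilar ones near 0.
            w = math.exp(-(diff * diff) / (2.0 * noise_sigma * noise_sigma))
            weight_sum += w
            value_sum += w * frame[i]
        # The center frame always contributes weight 1, so weight_sum > 0.
        filtered.append(value_sum / weight_sum)
    return filtered
```

Tying the weights to the estimated noise strength means that small, noise-like differences are averaged away, while large differences (real content changes) are mostly ignored.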
This source frame is positioned as ALTREF and is in fact compressed twice. The ALTREF version is used as a reference, whereas the OVERLAY is the frame that is finally shown for display. Hence, there is an additional cost in compressing this frame.
Such additional cost also exists at the decoder side. For instance, for a GF group of 16 frames that adopt an ALTREF and an OVERLAY, the decoder must decode 17 frames instead of 16.
With the evolution of libaom, temporal filtering for denoising is also applied to KEY frames. Again seven adjacent frames are involved in the temporal filtering, adopting similar procedures such as searching and the weighted sum.
For a KEY frame, it is not reasonable to encode an extra OVERLAY because, for the decoder, the KEY frame provides the synchronization point, and in “trick mode” it’s possible that only a single frame will be decoded. Remember, if a KEY frame is a “no-show,” it can cause problems with some players.
Consequently, the use of temporal filtering on KEY frames will inevitably cause an initial distortion on those KEY frames, since the denoised version has been treated as the source version for the temporally filtered KEY frames. We will examine how much gain can be obtained by KEY frame temporal filtering a little later.
From Table 7, you will see that an overall BD-rate gain of 8.67% was achieved, which is larger than the gain of many individual tools introduced in AV1. Meanwhile, at speed level CPU3, temporal filtering accounts for a relatively small share of CPU usage, approximately 5% of the total encoding CPU consumption. After applying temporal denoising, video scenes become smoother and more amenable to faster encoding tools without sacrificing image quality.
Note that temporal filtering does not always bring a positive impact. A counterexample is provided in the last row of Table 7 for the video clip “wikipedia”, a screen capture video in which most of the motion is screen scrolling.
Figure 5 shows a frame with its motion vectors overlaid on the coded source frame. This frame is third in the encoding order shown in Figure 3 and has two references, namely the KEY frame as its forward prediction reference and ALTREF as its backward prediction reference, both of which have been temporally filtered.
In Figure 5(b), without temporal filtering, the estimated motion field is relatively clean and shows the overall downward scrolling motion. However, when temporal filtering is active, new noise is introduced and wild motion vectors can be observed, as shown in the yellow-framed area in Figure 5(a). With temporal filtering applied, we can see a loss in coding efficiency, as illustrated in Table 7.
Because not every block is suitable for this algorithm, scene-adaptive or region-based adaptive solutions would be useful when adopting temporal filtering for denoising.
Resolution Differentiated Optimization
Resolution differentiated optimization is closely related to block partition strategies. The maximum partition size in libaom is 128×128, whereas the smallest partition size is 4×4. As shown in Figure 6, we have tried turning off all the partitions with a size smaller than 8×8.
Over the designated test set, an encoder speedup is achieved across various bitrate settings, corresponding to different cq-level setups. Smaller cq-levels result in higher bitrates, and this is where the larger speedup is observed: approximately 9% at a cq-level of 20.
For lower bitrates associated with larger cq-levels, the speedup is smaller, indicating that smaller partitions are less likely to be used. This is mainly because more of libaom’s speed features are turned on in the lower bitrate range, and the smaller partition search may already have been skipped as a result.
Partition choices are evaluated in a top-down manner in libaom, i.e. starting from the largest allowed partition, say 128×128, and iterating down the partition tree to smaller partitions.
The speed features in libaom include early termination strategies that stop the search of smaller partitions if the encoder decides no further coding gain can be achieved. Proactively turning off smaller partitions therefore has limited effect at lower bitrates (larger cq-levels), as the smaller partitions may be skipped regardless.
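The top-down search with early termination can be sketched as a recursive quadtree walk. This is a simplified illustration under stated assumptions: only square splits are considered (AV1 also has rectangular, ab, and 1to4 partition types), the RD cost is supplied by a caller-provided `cost_fn`, and the termination rule (stop when the whole-block cost is at or below a threshold) is hypothetical rather than libaom's actual criterion.

```python
def search_partition(size, min_size, cost_fn, x=0, y=0, stop_threshold=0.0):
    """Return (best_cost, tree) for the square block at (x, y).

    cost_fn(x, y, size) is a caller-supplied cost for coding the block whole.
    tree is either the block size (no split) or a list of four sub-trees.
    """
    whole_cost = cost_fn(x, y, size)
    # Smallest allowed partition reached: nothing left to split.
    if size <= min_size:
        return whole_cost, size
    # Early termination (hypothetical rule): the block is already cheap enough.
    if whole_cost <= stop_threshold:
        return whole_cost, size
    half = size // 2
    split_cost = 0.0
    children = []
    for dy in (0, half):
        for dx in (0, half):
            cost, tree = search_partition(
                half, min_size, cost_fn, x + dx, y + dy, stop_threshold
            )
            split_cost += cost
            children.append(tree)
    # Keep the split only if it is strictly cheaper than coding the block whole.
    if split_cost < whole_cost:
        return split_cost, children
    return whole_cost, size
```

Raising `min_size` from 4 to 8 in this sketch corresponds to turning off the sub-8×8 partitions: the saving comes only from branches that early termination had not already pruned.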
As shown in Table 8, skipping smaller partitions has the least impact on coding efficiency for 1080p videos. As the resolution decreases, the loss in coding efficiency relative to having all partitions switched on grows.
Note that for screen content there is a big efficiency loss as a result of skipping smaller partitions. This is because screen content characteristics vary across different blocks even within one frame. Some areas of the screen may have very fine details regardless of the high resolution, and these need smaller partitions to capture the detail required for efficient encoding.
Qindex Based Optimization
Q-based optimization strategies have recently been added to libaom. The reason is that different coding tools may deliver different coding gains at different bitrates. In particular, at higher bitrates certain coding tools may not deliver the larger coding gains they achieve at lower bitrates.
In Figure 7, we proactively turn off two types of partitions, namely ab-partitions and 1to4-partitions. Both are new coding tools added in AV1.
We compare the resulting RD performance, represented by the blue curve in the figure, against the baseline where both partition types are left on, denoted by the red curve. The RD results are collected from our own AWCY platform. We owe a great big thank you to Thomas Daede of Mozilla for his work in developing this amazing tool (2).
When the two partition types are turned off, there is barely any coding efficiency loss for Q values from 32 down to 20, i.e. the high bitrate range, as shown by the large overlap between the red curve and the blue curve. libaom proactively turns off the two new partition types specifically at a q-level of 20.
We have conducted further experiments with different partition subset selection strategies and observed the resulting coding performance over different Q values. As shown in Figure 8 through Figure 10, we constrained the maximum partition size, specified by the parameter --max-partition-size, to 64, 32, and 16, respectively. When larger partition sizes are proactively skipped in this way, coding performance is not impacted significantly at lower Q values, i.e. in the higher bitrate range. Important insight: this indicates that higher bitrates favor smaller partitions, since more detail has to be preserved.
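An experiment grid like the one described above could be scripted along these lines. This is a sketch rather than the setup actually used: the file names and helper functions are hypothetical, the cq-levels are just the ones mentioned in this article, and the flags (`--end-usage=q`, `--cq-level`, `--cpu-used`, `--max-partition-size`, `--ivf`) reflect my understanding of the aomenc command line and should be checked against `aomenc --help` for your build.

```python
def aomenc_command(source, output, cq_level, max_partition_size=None, cpu_used=3):
    """Build an aomenc argument list for one constant-quality encode."""
    args = [
        "aomenc",
        source,
        "--ivf",
        "--end-usage=q",
        f"--cq-level={cq_level}",
        f"--cpu-used={cpu_used}",
    ]
    # None means "leave the encoder default" (the baseline run).
    if max_partition_size is not None:
        args.append(f"--max-partition-size={max_partition_size}")
    args += ["-o", output]
    return args


def experiment_grid(source, cq_levels=(20, 32, 63), max_sizes=(None, 64, 32, 16)):
    """One command per (cq-level, max partition size) combination."""
    commands = []
    for cq in cq_levels:
        for size in max_sizes:
            label = "full" if size is None else f"max{size}"
            output = f"out_cq{cq}_{label}.ivf"  # hypothetical naming scheme
            commands.append(aomenc_command(source, output, cq, size))
    return commands
```

Each returned list can be passed to `subprocess.run` as-is, and timing each run alongside its output size gives the speed and efficiency numbers to compare per Q value.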
Coding efficiency is also not affected much at a Q level of 20, while at the same time an encoder speedup is possible. Nonetheless, at larger Q values, i.e. in the lower bitrate range, coding efficiency drops even though encoder speed improves.
One interesting thing is that when --max-partition-size is set to 16, meaning libaom turns off all partitions larger than 16×16, there is no speedup at a Q level of 63, yet there is still a sacrifice in coding efficiency. This shows that doing less does not always mean running faster.
Therefore, when we design speed features and tune encoder parameters, there is a need to pay attention to the Q levels. It’s important to adopt Q adaptive strategies.
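A Q-adaptive strategy can be as simple as choosing the partition toolset from the Q level. The sketch below is illustrative only: the function name, the dictionary keys, and the cutoff of 20 are assumptions drawn from the observations above, not libaom's actual decision logic.

```python
def partition_tools_for_q(cq_level, high_bitrate_cutoff=20):
    """Hypothetical Q-adaptive policy for partition tools.

    At small Q values (high bitrates) the ab and 1to4 partitions and the
    largest block sizes contributed little in the experiments above, so
    they are disabled to save encoding time. At larger Q values they stay on.
    """
    high_bitrate = cq_level <= high_bitrate_cutoff
    return {
        "enable_ab_partitions": not high_bitrate,
        "enable_1to4_partitions": not high_bitrate,
        "max_partition_size": 64 if high_bitrate else 128,
    }
```

In practice the cutoff would be tuned per content type and speed level, since the Q level at which a tool stops paying for itself is not fixed.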
Content Adaptive Optimization & Screen Content
Recognizing that screen content is a special content category requiring different tools than synthetic or real-life video, AV1 provides specific tools for encoding screen content, starting with the detection of screen content.
Coding of Screen Content
libaom has adopted specific strategies for handling encoding of screen content. New content categories may be added, such as video conferencing, gaming, and security, as this content presents unique characteristics from other content categories, making it worth pursuing specific encoding algorithms to optimize the coding performance specifically for these content types.
Screen Content – Hash Motion Search
Hash motion search is a technique used in libaom when the input is detected as screen content. It uses crc32 to generate hash tables for each possible block size at each possible full-pixel position.
The computation and memory needed to generate the tables are significant, so currently it is only used in the intrabc mode for a key frame. As illustrated in Figure 11, with hash motion search enabled, many more motion vectors were found in the intrabc mode, which saved about 25% of the bits: 514006 bits compared to 688350 bits without hash motion search (hashme).
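The hash table idea can be sketched for a single block size as follows. This is a toy illustration, not libaom's implementation: it uses Python's `zlib.crc32`, assumes 8-bit pixels stored as a list of rows, and ignores the intrabc restriction that only already coded areas of the frame may be referenced. The function names are hypothetical.

```python
import zlib
from collections import defaultdict


def block_bytes(frame, x, y, size):
    """Serialize the size x size block at (x, y); pixels must be 0..255."""
    return bytes(frame[y + r][x + c] for r in range(size) for c in range(size))


def build_hash_table(frame, size):
    """Map crc32(block) to every full-pixel position holding that block."""
    height, width = len(frame), len(frame[0])
    table = defaultdict(list)
    for y in range(height - size + 1):
        for x in range(width - size + 1):
            table[zlib.crc32(block_bytes(frame, x, y, size))].append((x, y))
    return table


def find_match(table, frame, x, y, size):
    """Return a (dx, dy) vector to an identical block elsewhere, or None."""
    target = block_bytes(frame, x, y, size)
    for cx, cy in table.get(zlib.crc32(target), []):
        # Compare the pixels too, since different blocks can share a hash.
        if (cx, cy) != (x, y) and block_bytes(frame, cx, cy, size) == target:
            return (cx - x, cy - y)
    return None
```

The table has one entry per pixel position per block size, which makes the memory and computation cost noted above easy to see, while a lookup finds exact repeats (common in text and UI content) without a conventional search.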
Quality Metric Guided Optimization
SSIM Tuned Optimization
libaom tune options include:
- AOM_TUNE_PSNR (default)
VMAF Tuned Optimization
Visionular is open to tuning the encoder parameters and developing quality metric specific optimization algorithms provided we align the effort with the optimization of subjective evaluation techniques like MOS.
How the Visionular Aurora1 AV1 Encoder Compares
The purpose of this article is to show that libaom is a very capable encoder which, if one understands the AV1 encoding toolset, can deliver good encoding performance and bitrate efficiency.
Aurora1 vs. libaom
As a video technology engineering company that has been focused on AV1 from the very beginning, starting with our work at Google, we’ve built an AV1 encoder called Aurora1 that is able to achieve a 5 to 10% BD-rate gain against libaom at the same speed, using all objective quality metrics (PSNR, SSIM, VMAF).
This is especially remarkable when you consider that Aurora1 is able to achieve very high quality while encoding 40% faster and using 40% less CPU with the same or better RD performance. The secret to this performance is in the extensive optimizations we’ve implemented and that Aurora1 provides a higher number of speed modes than libaom, allowing it to meet VOD, live streaming, and low-delay real-time communications requirements. Aurora1 delivers superior subjective quality for both user-generated content and professionally produced content with superior Film Grain preservation.
Aurora1 vs. x265
Since many services are deploying HEVC today, and knowing that AV1 is a more complex video standard, it’s instructive to look at how Aurora1 compares against x265 in its slower preset. Here, Aurora1 is able to achieve an SSIM-based BD-rate gain >24%, with an encoding speedup >2x for VOD.
For the live streaming use case, Aurora1 is able to deliver 1080p60 live streaming by taking full advantage of the powerful multi-core processors available from AMD, such as the new 32-core Threadripper 3970X and the EPYC 7742, while maintaining superior RD performance. With Aurora1 you can get speed without sacrificing quality and efficiency.
Aurora1 and low-latency RTC applications
For extreme low-latency workflows, Aurora1 delivers 720p30 real-time encoding with less than 25% CPU usage on Intel i5/i7 processors with quality typical of on-device camera captured content such as video conferencing and real-time communication applications, while maintaining excellent RD performance. For screen content, Aurora1 can encode 1080p25 using just 25% of the Intel i5/i7 CPU, while demonstrating up to a 75% BD-rate savings under all objective quality metrics (PSNR, SSIM, VMAF), compared with x264 operating in its superfast zerolatency mode.
(2) AWCY (Are We Compressed Yet): https://arewecompressedyet.com/. Originally a quality evaluation platform for the Daala video codec, it evolved into a thorough video encoder quality evaluation platform adopted by AOM for both AV1 and AV2 development. Owner: Thomas Daede.