Analysis of AV1 Encoding Tools & libaom Case Study
This article comprises two parts. In part one, I analyze the AV1 coding tools. In part two, that analysis becomes the basis for exploring, with libaom, which tools are most useful in the pursuit of better coding performance (speed) and efficiency (bitrate savings).
As video engineers and codec practitioners know, a coding tool must show a reasonable bitrate-efficiency advantage to be adopted. However, we must also consider the tool’s processing cost, because some tools yield efficiency benefits yet remain impractical for a given use case.
Let’s dig into libaom as the case study for analyzing AV1’s coding tools. Table 1 reveals no fewer than 35 control flags that can be toggled on or off in libaom, while Table 2 shows the libaom parameters used for testing.
A few notes on testing:
We chose libaom CPU3 because it is a medium-speed preset, and we focus mainly on the High Delay mode. We turned off the modes that directly affect the QP setup inside libaom.
The videos that were used for testing are common open-source clips that are available here. You can find the entire list in Table 3.
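The on/off testing described above can be sketched as a small script that builds one baseline encode plus one encode per disabled tool. This is only a sketch under stated assumptions: it targets the aomenc command-line encoder, the rate-control settings (--end-usage=q and the --cq-level value) are placeholders rather than the Table 2 parameters, the flag list covers only the tools discussed in this article rather than all of Table 1, and the clip and output file names are hypothetical.

```python
import shlex

# Tool flags discussed in this article, as spelled on the aomenc command line.
# This is a subset chosen for illustration, not the full list from Table 1.
TOOL_FLAGS = [
    "enable-ab-partitions",
    "enable-1to4-partitions",
    "enable-warped-motion",
    "enable-obmc",
    "enable-interintra-comp",
    "enable-onesided-comp",
]


def aomenc_command(clip, output, disabled_tool=None, cq_level=32):
    """Build an aomenc argument list with at most one tool switched off."""
    if disabled_tool is not None and disabled_tool not in TOOL_FLAGS:
        raise ValueError(f"unknown tool flag: {disabled_tool}")
    cmd = [
        "aomenc",
        "--cpu-used=3",            # the CPU3 speed preset used in the article
        "--end-usage=q",           # assumption: constant-quality rate control
        f"--cq-level={cq_level}",  # assumption: placeholder quality level
    ]
    if disabled_tool is not None:
        cmd.append(f"--{disabled_tool}=0")
    cmd += ["-o", output, clip]
    return cmd


def ablation_commands(clip):
    """Return a baseline command plus one command per disabled tool."""
    runs = {"baseline": aomenc_command(clip, "baseline.ivf")}
    for flag in TOOL_FLAGS:
        runs[flag] = aomenc_command(clip, f"no-{flag}.ivf", disabled_tool=flag)
    return runs


if __name__ == "__main__":
    # "clip.y4m" is a hypothetical input file name.
    for name, cmd in ablation_commands("clip.y4m").items():
        print(name, "->", shlex.join(cmd))
```

Comparing each single-tool-off run against the baseline, in both bitrate and encoding time, gives the kind of per-tool cost and gain figures reported in the tables.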
Figure 1 shows the partition types of AV1 compared with those of VP9. Table 4 demonstrates that the AV1 partition types generate the largest coding gain. At CPU level 3, however, the newly added AB and 1:4 partitions do not generate noticeably larger gains, so turning them off is worth considering, especially for the fast speed levels and the very low delay (real-time) modes.
Single Motion Modes
Table 5 lists the three main single motion modes found in libaom: warped motion, OBMC (overlapped block motion compensation), and inter-intra. One motion mode excluded from our analysis is global motion, whose pre-analysis is turned off in libaom CPU3, mainly as a trade-off between coding efficiency and encoder complexity.
Inter-intra, on the other hand, is usually considered a compound mode because it combines inter and intra prediction to encode one block. Since this mode uses only one motion vector per block, we list it among the single motion modes.
Table 5 shows that, for the objective-1-fast test set, warped motion is the coding tool with the best cost-performance ratio: a coding gain of 1.07% is achieved using just 2% of the CPU’s resources.
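To make the cost-performance ratio concrete, here is a minimal sketch that expresses it as coding gain per unit of CPU cost. The function name and the ratio definition are assumptions made for illustration, not a metric defined by libaom, and only the warped motion figures quoted above are used.

```python
def gain_per_cost(coding_gain_pct, cpu_cost_pct):
    """Return coding gain (in %) obtained per 1% of CPU cost."""
    if cpu_cost_pct <= 0:
        raise ValueError("CPU cost must be positive")
    return coding_gain_pct / cpu_cost_pct


if __name__ == "__main__":
    # Warped motion on objective-1-fast: 1.07% gain for 2% CPU cost.
    ratio = gain_per_cost(1.07, 2.0)
    print(f"warped motion: {ratio:.3f}% gain per 1% CPU cost")
```

Computing the same ratio for each tool allows the tools to be ranked for a given test set.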
Figure 2 shows frame samples from the two video clips in the objective-1-fast test set that contributed most to the 1.07% gain for warped motion. Both videos contain many scenes with rotational motion. Specifically, the Netflix clip in Figure 2(a) provides a gain of ~4%, whereas the blue_sky clip in Figure 2(b) yields a gain of ~8%.
Note that with another test set, the coding tools above may show completely different cost-performance numbers. It is therefore reasonable to select different coding tools for different content scenarios.
Compared to single motion modes, many more compound modes are available. In essence, any pair of references can form a compound mode. In AV1, for each superblock, there are 28 single motion modes, whereas the number of compound modes increases to 128.
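One decomposition that reproduces these two totals is sketched below. The breakdown itself is an assumption based on the AV1 reference frame and inter mode names, since the text above states only the totals: 7 reference frames times 4 single-reference modes gives 28, and 16 allowed reference pairs (12 two-sided plus 4 one-sided) times 8 compound modes gives 128.

```python
from itertools import product

# AV1 reference frame names, split by prediction direction.
FORWARD_REFS = ["LAST", "LAST2", "LAST3", "GOLDEN"]
BACKWARD_REFS = ["BWDREF", "ALTREF2", "ALTREF"]

# Inter prediction modes for one reference and for a pair of references.
SINGLE_MODES = ["NEARESTMV", "NEARMV", "GLOBALMV", "NEWMV"]
COMPOUND_MODES = [
    "NEAREST_NEARESTMV", "NEAR_NEARMV", "NEAREST_NEWMV", "NEW_NEARESTMV",
    "NEAR_NEWMV", "NEW_NEARMV", "GLOBAL_GLOBALMV", "NEW_NEWMV",
]

# Two-sided pairs: one forward reference combined with one backward reference.
TWO_SIDED_PAIRS = list(product(FORWARD_REFS, BACKWARD_REFS))

# One-sided pairs: both references lie in the same direction.
# Assumption: only these four same-direction pairs are allowed.
ONE_SIDED_PAIRS = [
    ("LAST", "LAST2"), ("LAST", "LAST3"), ("LAST", "GOLDEN"),
    ("BWDREF", "ALTREF"),
]

single_candidates = list(product(FORWARD_REFS + BACKWARD_REFS, SINGLE_MODES))
compound_candidates = list(
    product(TWO_SIDED_PAIRS + ONE_SIDED_PAIRS, COMPOUND_MODES)
)

if __name__ == "__main__":
    print("single motion candidates:", len(single_candidates))
    print("compound candidates:", len(compound_candidates))
```

Under this breakdown, the one-sided pairs account for a quarter of all compound candidates (4 of 16 pairs), which is why skipping them is a meaningful speedup when they bring little gain.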
libaom CPU3 includes many speedup algorithms, the majority of which come from one simple idea: check the results of the single motion evaluation, then determine which compound candidates are worth evaluating and, further, which modes under those candidates should be considered. Compound modes that are unlikely to yield a noticeable coding gain are skipped.
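The pruning idea can be illustrated with a deliberately simplified sketch. It is not libaom's actual logic: the function name, the slack threshold, and the rate-distortion costs in the example are all hypothetical. A reference pair is kept for compound evaluation only if both of its references performed close to the best single-reference result.

```python
def select_compound_pairs(single_rd_cost, candidate_pairs, slack=1.25):
    """Keep pairs whose references both scored near the best single result.

    single_rd_cost maps a reference name to its best single-reference
    rate-distortion cost (lower is better). slack is a hypothetical
    tolerance: a reference qualifies if its cost is within
    slack * (best single cost).
    """
    limit = min(single_rd_cost.values()) * slack
    return [
        (ref_a, ref_b)
        for ref_a, ref_b in candidate_pairs
        if single_rd_cost[ref_a] <= limit and single_rd_cost[ref_b] <= limit
    ]


if __name__ == "__main__":
    # Hypothetical costs from the single motion evaluation of one block.
    costs = {"LAST": 1000, "GOLDEN": 1100, "BWDREF": 1200, "ALTREF": 2000}
    pairs = [
        ("LAST", "BWDREF"), ("LAST", "ALTREF"),
        ("GOLDEN", "BWDREF"), ("LAST", "GOLDEN"),
    ]
    # ALTREF scored poorly on its own, so the pair using it is skipped.
    print(select_compound_pairs(costs, pairs))
```

A tighter slack skips more compound candidates and saves more encoding time, at the risk of missing a pair that would have produced a gain.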
In particular, one compound coding tool, one-sided compound, uses a pair of references from the same prediction direction; that is, both references are positioned on the same side of the current frame. As shown in Figure 3, in the High Delay mode libaom generally uses a golden frame (GF) group, also referred to as a GOP, comprising 16 frames.
Figure 3 shows the hierarchical structure that is adopted in many cases. All 16 frames except the ALTREF frame, the very last frame in the current GOP, have two-sided references. A two-sided compound should therefore be preferred over a one-sided compound, as it usually achieves larger coding gains.
For ALTREF, all reference frames are positioned more than 16 frames away in the forward direction. Hence, in the High Delay scenario, a one-sided compound does not contribute much to coding efficiency for the frames within the GOP.
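The argument about reference sides can be checked with a small model of a 16-frame hierarchical GOP. This is a simplified sketch, not libaom's actual GF group construction: it assumes a plain dyadic pyramid in which ALTREF (display position 16) is coded first, position 0 stands for the last frame of the previous GOP, and any already coded frame may serve as a reference.

```python
def hierarchical_coding_order(gop_size):
    """Return display positions 1..gop_size in dyadic pyramid coding order."""
    order = [gop_size]  # ALTREF, the last frame of the GOP, is coded first

    def split(lo, hi):
        # Code the middle frame of the interval, then recurse on both halves.
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        order.append(mid)
        split(lo, mid)
        split(mid, hi)

    split(0, gop_size)
    return order


def reference_sides(gop_size):
    """Classify each frame by whether coded frames exist on both sides of it."""
    coded = {0}  # position 0: last frame of the previous GOP
    sides = {}
    for pos in hierarchical_coding_order(gop_size):
        has_past = any(p < pos for p in coded)
        has_future = any(p > pos for p in coded)
        sides[pos] = "two-sided" if has_past and has_future else "one-sided"
        coded.add(pos)
    return sides


if __name__ == "__main__":
    sides = reference_sides(16)
    one_sided = [pos for pos, side in sides.items() if side == "one-sided"]
    print("coding order:", hierarchical_coding_order(16))
    print("frames with one-sided references only:", one_sided)
```

In this model only the ALTREF frame lacks a reference on one side, which matches the observation that two-sided compound prediction is available to every other frame in the GOP.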
For very low delay (zero-latency) scenarios, all frames have their references positioned in the same direction: forward prediction only, without any bidirectionally predicted frames. The use of a one-sided compound can then be considered. Ryan Lei’s presentation also shows that for low delay scenarios a one-sided compound can be quite helpful, resulting in a ~6% coding efficiency increase.