
A novel macroblock-tree algorithm for high-performance optimization of
dependent video coding in H.264/AVC
Jason Garrett-Glaser
Department of Computer Science
Harvey Mudd College
Abstract
A novel fast heuristic algorithm for approximating rate-distortion-optimal dependent video coding is presented.  Previous algorithms that solve this problem assume a constant quantizer within each frame and rely on impractically slow approaches, such as dozens of repeated encodes of each frame.  This novel macroblock-tree approach provides PSNR improvements of up to 1.2 dB and SSIM improvements of up to 2.3 dB over existing fast ratecontrol algorithms at very low computational cost.  Macroblock-tree has been implemented in and committed to the open source H.264/AVC encoder x264.
1. Introduction
Intelligent bit allocation across a sequence of frames is critical to achieving high rates of compression in video coding.  The standard approach to optimizing this tradeoff is rate-distortion optimization.[1]  However, finding a rate-distortion-optimal solution for bit allocation across multiple dependent frames is typically infeasible due to its high complexity.
Most existing solutions, both heuristic and optimal, have impractically high complexity or provide only a small compression improvement.  Furthermore, despite their high complexity, most existing solutions still assume a constant quantizer within each frame.  We propose a novel macroblock-tree algorithm to optimize per-block quantizer selection across multiple dependent frames at negligible computational cost.
This paper is organized as follows.  Section 2 provides background for the problem of rate-distortion-optimal ratecontrol and existing heuristics.  Section 3 gives a high-level overview of the macroblock-tree algorithm and its purpose.  Section 4 explains the lookahead framework of x264, which was used as a basis for our implementation of the macroblock-tree algorithm.  Section 5 introduces the macroblock-tree algorithm itself.  Section 6 contains an analysis of the typical consequences of the algorithm.  Perceptual considerations related to the macroblock-tree are discussed in section 7.  Numerical quality results are presented in section 8, with performance analysis in section 9.  The paper is concluded in section 10.
2. Background
The simplest possible ratecontrol method is one which, given some set of constraints, attempts to target a constant quantizer.  It was realized very early in the development of video coding techniques that this was suboptimal: by varying the quantizers of frames using rate-distortion optimization techniques, one could improve quality, usually measured in the form of Peak Signal-to-Noise Ratio (PSNR) or sometimes Structural Similarity (SSIM), at a given rate.
Early algorithms to optimize this problem were typically brute-force, trying many quantizers for each frame in an attempt to pick the best one.  The advent of inter-frame compression complicated the matter, as the number of possible quantizer combinations grew exponentially and frames could no longer be optimized independently.  This was solved using Viterbi algorithms, as in Ramchandran et al.[2]
However, while Viterbi made the optimal solution tractable, such algorithms were still very slow and in some cases still had exponential worst-case convergence time.  One “fast” algorithm by Sermadevi et al had a runtime of O(Q*N*M), where Q is the number of quantizers to search, N is the number of frames, and M is the number of frames affected by a change in allocation to any given frame.[3]  Even as improved in Toivonen et al, this class of algorithms still typically took dozens or hundreds of encode calls per frame[4], putting it out of the reach of most practical encoders.
Nevertheless, many simple heuristics have been derived from this research and used in practical encoders.  One described in Ramchandran et al is that the “... I-frame is the most important of the group of pictures and must not be compromised,” in other words, that I-frames should be given higher quality than other frames.[2]  Another common heuristic is assigning lower quality to unreferenced B-frames, as their pixels are not reused for prediction.
These heuristics are extremely common in modern video encoders.  x264 in particular uses an I-frame offset of 1.4: 1.4x higher quality than P-frames, measured in linearized quantizer scale.  In H.264, this maps to approximately -3 QP.  Similarly, x264 uses a B-frame offset of 1.3, or 1.3x lower quality than P-frames, approximately +2 QP.
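As a quick worked check of those mappings (a throwaway sketch; only the 1.4x and 1.3x ratios come from the text, and the helper below is purely illustrative): in H.264 the quantizer step size doubles every 6 QP, so a ratio r in linearized quantizer scale corresponds to roughly 6*log2(r) QP.

    #include <math.h>
    #include <stdio.h>

    /* Convert a linearized quantizer-scale ratio into an approximate H.264 QP
     * delta: the quantizer step size doubles every 6 QP, so delta = 6*log2(r). */
    static double ratio_to_qp_delta(double ratio)
    {
        return 6.0 * log2(ratio);
    }

    int main(void)
    {
        printf("I-frame offset 1.4x -> %+.1f QP\n", -ratio_to_qp_delta(1.4)); /* ~-2.9, about -3 QP */
        printf("B-frame offset 1.3x -> %+.1f QP\n",  ratio_to_qp_delta(1.3)); /* ~+2.3, about +2 QP */
        return 0;
    }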
Another common heuristic is known as “quantizer curve compression”, or “qcomp”.  qcomp attempts to compensate for the variance in RD curves among frames without the complexity of calculating the actual RD curves.  It does this by leveraging the correlation between the inter residual of a frame and its importance for predicting future frames.
Typically inter prediction is less useful in sections of video with high inter residual, and thus the value of a higher quality reference frame is lower.  As such, qcomp adjusts the quality of frames in inverse proportion to their inter residual.  This algorithm was originally invented for use in libavcodec's MPEG video encoder.  x264's implementation of qcomp measures the Sum of Absolute Hadamard-Transformed Differences (SATD) residuals of frames, performs a Gaussian blur over the residuals to limit local variation, then scales the quantizer scale of each frame by (SATD residual)^0.4, giving higher-residual frames proportionally lower quality.  Combined with heuristics such as constant I-frame and B-frame offsets, qcomp helps approximate the effect of a much slower RD-optimal ratecontrol algorithm with negligible computational cost.[5]
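A minimal sketch of this weighting, assuming a simple 3-tap blur and per-frame arrays purely for illustration (x264's actual implementation differs in details such as the blur width and units):

    #include <math.h>
    #include <stddef.h>

    /* Hypothetical qcomp-style weighting, not x264's actual code.
     * satd[]   - per-frame SATD inter residuals from the lookahead
     * qscale[] - per-frame quantizer scales, adjusted in place
     * qcomp    - curve compression factor (0.6 is x264's default) */
    static void qcomp_weight(const double *satd, double *qscale, size_t n, double qcomp)
    {
        for (size_t i = 0; i < n; i++) {
            /* Small blur to limit local variation between neighboring frames. */
            double prev = satd[i > 0 ? i - 1 : i];
            double next = satd[i + 1 < n ? i + 1 : i];
            double blurred = 0.25 * prev + 0.5 * satd[i] + 0.25 * next;
            /* A larger quantizer scale means coarser quantization, so frames
             * with high residual end up with proportionally lower quality. */
            qscale[i] *= pow(blurred, 1.0 - qcomp);   /* exponent 0.4 at qcomp = 0.6 */
        }
    }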
In 2006, the algorithm described by Toivonen (colloquially dubbed RDRC, or Rate-Distortion Rate Control) was implemented in x264.[6]  Testing showed significant quality improvements at constant rate, in some cases upwards of 1 dB PSNR.  This demonstrated that there was still a large gap between the qcomp fast approximation and the optimal solution.
One weakness common to every algorithm mentioned so far is that they all operate on a per-frame level as opposed to a per-block level, ignoring variation within a frame.  Ramchandran et al briefly mentioned a possible extension of their algorithm to this, but even when ignoring inter-block dependencies within a frame, their modification introduces an additional O(4^M) complexity factor.
3. High-level overview of macroblock-tree
The purpose of the macroblock-tree algorithm is to estimate the amount of information that each macroblock contributes to the prediction of future frames.  This information allows macroblock-tree to weight the quality of each macroblock based on its contribution.  To do this, macroblock-tree works in the opposite direction of prediction, propagating information from future frames back to the current frame to be encoded.
In order to do this, macroblock-tree needs to know various pieces of information, or at least approximations thereof.  First, it must know the frame types of the future frames to be analyzed.  Second, it must know the motion vectors of those frames.  Third, it must know how much information is to be propagated at each step, which will be calculated based on the inter and intra costs.  The x264 lookahead, described next, is how macroblock-tree gets this information.
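These per-macroblock inputs can be pictured as one small record per lowres macroblock; the structure below is purely illustrative (field names and widths are assumptions, not x264's actual data layout):

    #include <stdint.h>

    /* Illustrative per-macroblock lookahead record (not x264's real structures). */
    typedef struct {
        int32_t intra_cost;      /* estimated SATD cost of the best intra mode      */
        int32_t inter_cost;      /* estimated SATD cost of the best inter mode      */
        int16_t mv[2][2];        /* lowres motion vectors for list 0 and list 1     */
        int32_t propagate_cost;  /* future dependency in SATD units, accumulated by
                                    the macroblock-tree propagate step (section 5)  */
    } lookahead_mb_info;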
4. The x264 lookahead
x264 has a complex lookahead module designed to estimate the coding cost of frames that have not yet been analyzed by the main encoder module.  It uses these estimations to make a variety of decisions, such as adaptive B-frame placement, explicit weighted prediction, and bit allocation for buffer-constrained ratecontrol.  For performance reasons, it operates on a half-resolution version of the frame and calculates SATD residuals only, doing no quantization or reconstruction.[5]
The core of the lookahead is the x264_slicetype_frame_cost function, which is called repeatedly to calculate the cost of a frame given p0, p1, and b values.  p0 is the list-0 (past) reference frame of the frame to be analyzed.  p1 is the list-1 (future) reference frame of the frame to be analyzed.  b is the frame to be analyzed.  If p1 is equal to b, the frame is inferred to be a P-frame.  If p0 is equal to b, the frame is inferred to be an I-frame.  As x264_slicetype_frame_cost may be called repeatedly on the same arguments as part of the algorithms that use it, the results of each call are cached for future usage.[7]  x264_slicetype_frame_cost operates by calling x264_slicetype_mb_cost for each macroblock in the frame.  As the frame is half-resolution, each “macroblock” is 8x8 pixels instead of 16x16.  x264_slicetype_mb_cost performs a motion search against each reference frame (past for P-frames, past and future for B-frames).  This motion search is typically a hexagon motion search with subpel refinement, as described in Zhu et al.[8]
For B-frames it also checks a few possible bidirectional modes: a mode similar to H.264/AVC's “temporal direct”, the zero motion vector, and a mode using the motion vectors resulting from the list-0 and list-1 motion searches.  x264_slicetype_mb_cost also calculates an approximate intra cost.  All of these costs are stored for potential future usage.  This is important for macroblock-tree, which will need this information for its calculations.
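The p0/p1/b convention and the caching behaviour described above can be sketched as follows (the function names, the stub analysis routine, and the cache layout are simplifications for illustration, not x264's internal API):

    #define LOOKAHEAD 64   /* assumed lookahead size for this sketch */

    typedef enum { SLICE_I, SLICE_P, SLICE_B } slice_type;

    /* p0: list-0 (past) reference, p1: list-1 (future) reference, b: analyzed frame. */
    static slice_type infer_slice_type(int p0, int p1, int b)
    {
        if (p0 == b) return SLICE_I;   /* no past reference: intra-coded           */
        if (p1 == b) return SLICE_P;   /* no future reference: forward-predicted   */
        return SLICE_B;                /* distinct past and future: bi-predicted   */
    }

    /* Stand-in for the half-resolution motion search and SATD summation. */
    static int analyse_lowres_frame(int p0, int p1, int b, slice_type type)
    {
        (void)p0; (void)p1; (void)type;
        return 1000 + b;               /* dummy cost, for the sketch only */
    }

    static int cost_cache[LOOKAHEAD][LOOKAHEAD][LOOKAHEAD];  /* 0 = not yet computed */

    /* Cached frame-cost query: repeated calls with the same (p0, p1, b) triple
     * reuse the earlier result, as the lookahead does. */
    static int frame_cost(int p0, int p1, int b)
    {
        if (!cost_cache[p0][p1][b])
            cost_cache[p0][p1][b] = analyse_lowres_frame(p0, p1, b, infer_slice_type(p0, p1, b));
        return cost_cache[p0][p1][b];
    }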
The results of this analysis are used primarily in a Viterbi algorithm for adaptive B-frame placement.  The output of this Viterbi algorithm is not merely the next frame type to use, but also a plan for the frame types to use for the next N frames, where N is the size of the lookahead.  This plan is effectively a queue: it changes over time as frames are pulled from one end and encoded using the specified frame types, frames are added to the other end as new frames enter the encoder, and the plan is recalculated.  The existence of this plan is important for macroblock-tree: it means that any algorithm that needs to know frame types for future frames has a reasonably accurate estimation of what they will be, even if the GOP structure isn't constant.
As a result of this, macroblock-tree is aware of the frame types of the next N frames, approximate motion vectors and mode decisions, and inter/intra mode costs (in the form of SATD scores).  The computational cost of this is effectively zero, as this data was already being calculated for other purposes within the encoder.  Even so, with respect to total encoding time, the computational cost of the lookahead is low.
5. The macroblock-tree algorithm
In order to perform the aforementioned estimation of information propagation, macroblock-tree keeps track of a propagate_cost for each macroblock in each lookahead frame: a numerical estimate, in units of SATD residual, of how much future residual depends on that macroblock.  This is initialized to zero for all frames in the lookahead.
Macroblock-tree begins its operation in the last (futuremost) minigop in the lookahead, first operating on B-frames, then P-frames.  It then works its way backwards to the first frame in the lookahead (the next frame to be encoded).  In this way, macroblock-tree works backwards: it “propagates” dependencies backwards in time.  Thus, when we “propagate” a dependency, we are operating in the opposite direction of the actual dependency itself: moving information from a frame to its reference frames.
For each frame, we run the propagate step on all macroblocks.  The propagate step of macroblock-tree operates as follows (a code sketch is given after the list):
1. For the current macroblock, we load the following variables:
◦intra_cost: the estimated SATD cost of the intra mode for this macroblock.
◦inter_cost: the estimated SATD cost of the inter mode for this macroblock.  If this value is greater than intra_cost, it should be set to intra_cost.
◦propagate_in: the propagate_cost for the current macroblock.  It is intentional that  propagate_cost is zero for the first frame that propagate is run on, as no information has been collected yet for that frame.
2. We calculate the fraction of information from this macroblock to be propagated to macroblocks in its reference frame(s), called propagate_fraction.  This is approximated by the formula 1 - inter_cost / intra_cost.  As an example, if the inter cost for a macroblock is 80% of the intra cost for that macroblock, we say that 20% of the information in that macroblock is sourced from its reference frame(s).  This is clearly a very rough approximation, but it is fast and simple.
3. The total amount of information that depends on this macroblock is equal to (intra_cost + propagate_in).  This is the sum of all future dependencies (up to the edge of the lookahead) and the intra cost of the current macroblock.  We multiply this by propagate_fraction, resulting in the approximate amount of information that should be propagated to this macroblock's reference frames, propagate_amount.
4. We split propagate_amount among the macroblocks in its reference frame that are used to predict the current block.  The splitting is weighted based on the number of pixels used from each macroblock to predict the current macroblock.  This can be calculated based on the motion vector of the current macroblock.  If the block has two reference frames, as in the case of biprediction, the propagate_amount is split between the two equally, or, if weighted B-frame prediction is enabled, according to the biprediction weight.  The properly split portions of propagate_amount are then added to the propagate_cost of each of the macroblocks used for prediction.  Note that, if we ignore the effects of interpolation filters for simplicity, at most 4 macroblocks in each frame can be used for prediction of the current macroblock.
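Putting the four steps together, a simplified propagate step might look like the following sketch.  It assumes a single list-0 reference per block, whole-pixel motion vectors, and illustrative data structures; it is not x264's actual implementation, which also handles biprediction, fractional motion, and fixed-point arithmetic.

    #define MB_SIZE 8   /* lowres "macroblocks" are 8x8 pixels */

    typedef struct {
        int   intra_cost;      /* estimated SATD cost of intra coding             */
        int   inter_cost;      /* estimated SATD cost of inter coding             */
        int   mvx, mvy;        /* motion vector into the reference, whole pixels  */
        float propagate_cost;  /* accumulated future dependency, in SATD units    */
    } mb_info;

    static int floor_div(int a, int b) { return (a >= 0 ? a : a - b + 1) / b; }

    /* Propagate dependencies from frame `cur` back into its reference `ref`;
     * mb_w and mb_h are the frame dimensions in lowres macroblocks. */
    static void propagate_frame(const mb_info *cur, mb_info *ref, int mb_w, int mb_h)
    {
        for (int y = 0; y < mb_h; y++)
        for (int x = 0; x < mb_w; x++) {
            const mb_info *mb = &cur[y * mb_w + x];
            if (mb->intra_cost <= 0)
                continue;

            /* Steps 1-2: clamp the inter cost and compute propagate_fraction. */
            int inter = mb->inter_cost < mb->intra_cost ? mb->inter_cost : mb->intra_cost;
            float fraction = 1.0f - (float)inter / (float)mb->intra_cost;

            /* Step 3: total dependent information times the propagated fraction. */
            float amount = (mb->intra_cost + mb->propagate_cost) * fraction;

            /* Step 4: split `amount` over the (up to) four reference macroblocks
             * covered by the motion-compensated block, weighted by pixel overlap. */
            int px = x * MB_SIZE + mb->mvx, py = y * MB_SIZE + mb->mvy;
            int bx = floor_div(px, MB_SIZE), by = floor_div(py, MB_SIZE);
            int wx[2] = { MB_SIZE - (px - bx * MB_SIZE), px - bx * MB_SIZE };
            int wy[2] = { MB_SIZE - (py - by * MB_SIZE), py - by * MB_SIZE };

            for (int j = 0; j < 2; j++)
            for (int i = 0; i < 2; i++) {
                int rx = bx + i, ry = by + j, w = wx[i] * wy[j];
                if (!w || rx < 0 || ry < 0 || rx >= mb_w || ry >= mb_h)
                    continue;
                ref[ry * mb_w + rx].propagate_cost += amount * w / (MB_SIZE * MB_SIZE);
            }
        }
    }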
The result of this process is that the propagate_cost values of the references of the current frame have been increased based on the contents of the current frame.  By repeating this in reverse order for all the frames in the lookahead, we approximate the contribution of each macroblock in the next frame to the quality of the rest of the frames in the lookahead.
Finally, the finish step is applied to the macroblocks in any frame we wish to acquire final quantizer deltas for. This is typically the next few frames to be encoded.  The finish step of macroblock-tree operates with the same inputs as propagate, but instead outputs an H.264 quantizer delta:
Macroblock QP Delta = -strength * log2((intra_cost + propagate_cost) / intra_cost)
where strength is an arbitrary factor derived from experimentation.  Testing suggests that 2 is a near-optimal value for most videos.  As the H.264 quantizer scale doubles every 6 QPs, this means that the optimal quantizer precision appears to scale roughly with the cube root of the resulting (1 + propagate_cost / intra_cost) value used in the finish step.
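As a sketch, the finish step reduces to a single expression per macroblock (the helper name is illustrative; the strength default of 2 is the experimentally chosen value quoted above):

    #include <math.h>

    /* Convert a macroblock's accumulated propagate_cost into a QP delta,
     * following the finish-step formula above. */
    static float mbtree_qp_delta(float intra_cost, float propagate_cost, float strength)
    {
        if (intra_cost <= 0.0f)
            return 0.0f;                 /* nothing to weight */
        return -strength * log2f((intra_cost + propagate_cost) / intra_cost);
    }

    /* Example: a block whose future dependencies total three times its own intra
     * cost gets mbtree_qp_delta(c, 3*c, 2.0f) = -2 * log2(4) = -4 QP. */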
It should be noted that this algorithm results in unreferenced frames having QP deltas entirely of zero, as nothing is ever propagated to them:
Macroblock QP Delta = -strength * log2((intra_cost + 0) / intra_cost) = -strength * log2(1) = 0
This is intentional: by the logic of macroblock-tree, all unreferenced macroblocks have equal (and thus minimal) value.
Macroblock-tree derives its name from the fact that its operation can be represented as a tree structure over the macroblocks of a video, where the nodes are macroblocks and the edges are prediction dependencies.  The weights of the edges map to the propagate_fraction value.  Despite the name, as seen above, macroblock-tree can be implemented without mapping the lookahead into an explicit tree structure: the motion vectors and costs are sufficient to implicitly store the information for the tree.
It is important to note that x264 already has a variance-based adaptive quantization (VAQ) algorithm implemented.  Macroblock-tree adds quantizer deltas on top of the effects of the existing adaptive quantization algorithm.  While it might be tempting to consider adaptive quantization a part of macroblock-tree to inflate its SSIM gain, they are two separate algorithms, and VAQ will not be covered or benchmarked in this paper.  This is important when comparing to other ratecontrol work that optimizes for SSIM.
6. Analysis
Macroblock-tree has a number of consistent effects with regard to bit distribution.  One of these is the effect on B-frame quantizers.  As mentioned in the introduction, B-frame quantizers are typically derived via an offset from the P-frame quantizer.  Macroblock-tree is effectively the opposite: unreferenced B-frame quantizers are always the same, whereas the neighboring P-frame quantizers vary.
The result of this is effectively an adaptive B-frame quantizer offset.  In areas of high motion, B-frames tend to get quantizers not much higher than those of nearby P-frames.  In areas of low motion, the quantizer difference is much higher: +4-6 QP or more in some cases.  It is important to note that, in x264, the regular B-frame quantizer offsets are disabled when macroblock-tree is on, since they serve the same role.
One might assume similarly that macroblock-tree can replace keyframe quantizer offsets.  However, testing suggested this was not the case: macroblock-tree typically lowered quantizers for an entire scene, not solely the first frame in the scene.  Keyframe quantizer offsets remained useful for PSNR, and so were kept in x264.  Algorithmic derivation of keyframe quantizer offsets likely requires some form of lookahead quantization, as in Schmitsch et al.[9]
Though macroblock-tree is too complex to allow a closed-form general solution, some insight can be derived from a solution for a special case input: all-zero motion vectors, no B-frames, constant intra_cost, and constant propagate_fraction.  Though such an input is entirely unrealistic, it gives insight into the way that macroblock-tree scales as a function of propagate_cost.
We define macroblock-tree as follows:
Let TREE[N] be the propagate_cost after propagating through N frames.  Let Y be the constant intra_cost and Z be the constant propagate_fraction, range 0-1.
TREE[0] = 0
TREE[N] = (TREE[N-1] + Y)*Z
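This recurrence converges quickly to a fixed point; a throwaway numerical check (the values of Y and Z below are chosen purely for illustration) makes the limit stated next concrete:

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative constants: Y = constant intra_cost, Z = constant
         * propagate_fraction, as defined above. */
        const double Y = 1000.0, Z = 0.8;

        double tree = 0.0;                        /* TREE[0] = 0                    */
        for (int n = 1; n <= 60; n++)
            tree = (tree + Y) * Z;                /* TREE[N] = (TREE[N-1] + Y) * Z  */

        printf("TREE[60]      = %.2f\n", tree);             /* ~4000        */
        printf("Y * Z/(1 - Z) = %.2f\n", Y * Z / (1 - Z));  /* exactly 4000 */
        return 0;
    }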
This can trivially be shown to converge to Y * (Z/(1-Z)) for large N.  Accordingly, the quantizer delta of macroblock-tree in this case scales as follows:
Macroblock QP Delta = -strength * log2((Y + Y*(Z/(1-Z))) / Y) = -strength * log2(1 / (1 - Z)) = strength * log2(1 - Z)
7. Perceptual considerations
As macroblock-tree redistributes bits both within frames and across the frames of a video, it is particularly likely to have perceptual consequences, both positive and negative.  There are two primary categories of these impacts that we have noticed during visual comparisons: motion-adaptive quantization and pre-echo.
Decreased visual quality in higher-motion sections of a video is the perceptual interpretation of qcompress and other similar algorithms.  Macroblock-tree performs the same role, except localized within each frame.  Thus, unlike qcompress, macroblock-tree will not lower the quality of static sections of a frame merely because other portions of the frame are temporally complex.  This is particularly important in the case of static backgrounds and overlaid graphics.  In this sense, macroblock-tree is a motion-adaptive quantization algorithm focused on coding efficiency.
“Pre-echo” is a natural consequence of the design of macroblock-tree.  A macroblock's quality depends on how much it is referenced in the future; if that macroblock will soon be occluded (as in the case of a moving object) or completely replaced (as in the case of a scene change), macroblock-tree will reduce its quality.
Typically this “pre-echo” is only visible in the couple of frames immediately prior to the occlusion or scene change and is almost entirely hidden by inter prediction.  Furthermore, prior work by Lee et al suggests that scene changes cause a backwards temporal masking effect that helps hide such artifacts.[10]  This backwards temporal masking effect lasts a few tens of milliseconds, enough to cover one or two frames, and is believed to be caused by the processing delay in the human visual system.[11]  This agrees with our own informal visual testing of macroblock-tree, which also suggests that such artifacts are visually invisible.  As such, the bits saved in such macroblocks are free to be used elsewhere in the video.
There is one particular case where pre-echo is visible: that of forced keyframes.  The naïve macroblock-tree
