Diary Of An x264 Developer

http://x264dev.multimedia.cx/

Articles published on the site

  • Announcing the world’s fastest VP8 decoder: ffvp8

    24 July 2010, by Dark Shikari.  Tags: ffmpeg, google, speed, VP8

    Back when I originally reviewed VP8, I noted that the official decoder, libvpx, was rather slow.  While there was no particular reason that it should be much faster than a good H.264 decoder, it shouldn’t have been that much slower either!  So, I set out with Ronald Bultje and David Conrad to make a better one in FFmpeg.  This one would be community-developed and free from the beginning, rather than the proprietary code-dump that was libvpx.  A few weeks ago the decoder was complete enough to be bit-exact with libvpx, making it the first independent free implementation of a VP8 decoder.  Now, with the first round of optimizations complete, it should be ready for primetime.  I’ll go into some detail about the development process, but first, let’s get to the real meat of this post: the benchmarks.

    We tested on two 1080p clips: Parkjoy, a live-action 1080p clip, and the Sintel trailer, a CGI 1080p clip.  Testing was done using “time ffmpeg -vcodec {libvpx or vp8} -i input -vsync 0 -an -f null -”.  We all used the latest SVN FFmpeg at the time of this posting; the last revision optimizing the VP8 decoder was r24471.

    [Figures: Parkjoy benchmark graph; Sintel benchmark graph]

    As these benchmarks show, ffvp8 is clearly much faster than libvpx, particularly on 64-bit.  It’s even faster by a large margin on Atom, despite the fact that we haven’t even begun optimizing for it.  In many cases, ffvp8’s extra speed can make the difference between a video that plays and one that doesn’t, especially in modern browsers with software compositing engines taking up a lot of CPU time.  Want to get faster playback of VP8 videos?  The next versions of FFmpeg-based players, like VLC, will include ffvp8.  Want to get faster playback of WebM in your browser?  Lobby your browser developers to use ffvp8 instead of libvpx.  I expect Chrome to switch first, as they already use libavcodec for most of their playback system.

    Keep in mind ffvp8 is not “done” — we will continue to improve it and make it faster.  We still have a number of optimizations in the pipeline that aren’t committed yet.

    Developing ffvp8

    The initial challenge, primarily pioneered by David and Ronald, was constructing the core decoder and making it bit-exact to libvpx.  This was rather challenging, especially given the lack of a real spec.  Many parts of the spec were outright misleading and contradicted libvpx itself.  It didn’t help that the suite of official conformance tests didn’t even cover all the features used by the official encoder!  We’ve already started adding our own conformance tests to deal with this.  But I’ve complained enough in past posts about the lack of a spec; let’s get onto the gritty details.

    The next step was adding SIMD assembly for all of the important DSP functions.  VP8’s motion compensation and deblocking filter are by far the most CPU-intensive parts, much the same as in H.264.  Unlike H.264, the deblocking filter relies on a lot of internal saturation steps, which are free in SIMD but costly in a normal C implementation, making the plain C code even slower.  Of course, none of this is a particularly large problem; any sane video decoder has all this stuff in SIMD.

    I tutored Ronald in x86 SIMD and wrote most of the motion compensation, intra prediction, and some inverse transforms.  Ronald wrote the rest of the inverse transforms and a bit of the motion compensation.  He also did the most difficult part: the deblocking filter.  Deblocking filters are always a bit difficult because every one is different.  Motion compensation, by comparison, is usually very similar regardless of video format; a 6-tap filter is a 6-tap filter, and most of the variation going on is just the choice of numbers to multiply by.

    The biggest challenge in a SIMD deblocking filter is to avoid unpacking, that is, going from 8-bit to 16-bit.  Many operations in deblocking filters would naively appear to require more than 8-bit precision.  A simple example in the case of x86 is abs(a-b), where a and b are 8-bit unsigned integers.  The result of “a-b” requires a 9-bit signed integer (it can be anywhere from -255 to 255), so it can’t fit in 8-bit.  But this is quite possible to do without unpacking: (satsub(a,b) | satsub(b,a)), where “satsub” performs a saturating subtract on the two values.  If the difference is positive, satsub yields it; if the difference is negative, satsub yields zero.  At most one of the two terms is therefore nonzero, so ORing them together yields the desired result.  This requires 4 ops on x86; unpacking would probably require at least 10, including the unpack and pack steps.
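
    The trick can be checked with a scalar model.  The sketch below emulates one 8-bit lane in Python; the function names are made up for illustration, and real SIMD code would apply the same two saturating subtracts and one OR to a whole register of bytes at once.

```python
def satsub(a: int, b: int) -> int:
    """Unsigned 8-bit saturating subtract: negative results clamp to 0."""
    return max(a - b, 0)


def absdiff_u8(a: int, b: int) -> int:
    """abs(a - b) for unsigned bytes without widening to 16 bits."""
    # At most one of the two terms is nonzero, so OR merges them losslessly.
    return satsub(a, b) | satsub(b, a)


# Exhaustive check over every pair of byte values.
assert all(absdiff_u8(a, b) == abs(a - b)
           for a in range(256) for b in range(256))
```

    Every intermediate value stays within 0 to 255, which is the whole point: nothing ever needs a ninth bit.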

    After the SIMD came optimizing the C code, which still took a significant portion of the total runtime.  One of my biggest optimizations was adding aggressive “smart” prefetching to reduce cache misses.  ffvp8 prefetches the reference frames (PREVIOUS, GOLDEN, and ALTREF)… but only the ones which have been used reasonably often this frame.  This lets us prefetch everything we need without prefetching things that we probably won’t use.  libvpx very often encodes frames that almost never (but not quite never) use GOLDEN or ALTREF, so this optimization greatly reduces time spent prefetching in a lot of real videos.  There are of course countless other optimizations we made that are too long to list here as well, such as David’s entropy decoder optimizations.  I’d also like to thank Eli Friedman for his invaluable help in benchmarking a lot of these changes.
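
    The prefetching decision can be sketched roughly as follows.  This is an illustration only, not the ffvp8 code: the function name, the bookkeeping, and the one-in-32 cutoff are all assumptions chosen to make “used reasonably often this frame” concrete.

```python
REFS = ("PREVIOUS", "GOLDEN", "ALTREF")


def refs_worth_prefetching(ref_counts, mbs_decoded, min_share=1 / 32):
    """Pick the reference frames used often enough so far in this frame.

    ref_counts maps a reference name to the number of macroblocks in the
    current frame that have used it so far.  min_share is a hypothetical
    cutoff, not a value taken from ffvp8.
    """
    return [ref for ref in REFS
            if ref_counts.get(ref, 0) > mbs_decoded * min_share]


# A frame that almost never touches GOLDEN or ALTREF only prefetches PREVIOUS.
print(refs_worth_prefetching({"PREVIOUS": 900, "GOLDEN": 3, "ALTREF": 0}, 1000))
```

    Because the counts are reset per frame, a reference that suddenly becomes popular starts being prefetched again after a few dozen macroblocks.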

    What next?  Altivec (PPC) assembly is almost nonexistent, with the only functions being David’s motion compensation code.  NEON (ARM) is completely nonexistent: we’ll need that to be fast on mobile devices as well.  Of course, all this will come in due time — and as always — patches welcome!

    Appendix: the raw numbers

    Here are the raw numbers (in fps) for the graphs at the start of this post, each followed by its standard error:

    Core i7 620QM (1.6 GHz), Windows 7, 32-bit:
    Parkjoy ffvp8: 44.58 ± 0.44
    Parkjoy libvpx: 33.06 ± 0.23
    Sintel ffvp8: 74.26 ± 1.18
    Sintel libvpx: 56.11 ± 0.96

    Core i5 520M (2.4 GHz), Linux, 64-bit:
    Parkjoy ffvp8: 68.29 ± 0.06
    Parkjoy libvpx: 41.06 ± 0.04
    Sintel ffvp8: 112.38 ± 0.37
    Sintel libvpx: 69.64 ± 0.09

    Core 2 T9300 (2.5 GHz), Mac OS X 10.6.4, 64-bit:
    Parkjoy ffvp8: 54.09 ± 0.02
    Parkjoy libvpx: 33.68 ± 0.01
    Sintel ffvp8: 87.54 ± 0.03
    Sintel libvpx: 52.74 ± 0.04

    Core Duo (2 GHz), Mac OS X 10.6.4, 32-bit:
    Parkjoy ffvp8: 21.31 ± 0.02
    Parkjoy libvpx: 17.96 ± 0.00
    Sintel ffvp8: 41.24 ± 0.01
    Sintel libvpx: 29.65 ± 0.02

    Atom N270 (1.6 GHz), Linux, 32-bit:
    Parkjoy ffvp8: 15.29 ± 0.01
    Parkjoy libvpx: 12.46 ± 0.01
    Sintel ffvp8: 26.87 ± 0.05
    Sintel libvpx: 20.41 ± 0.02
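
    For readers who prefer ratios to raw fps, a short script can turn the table above into speedup factors.  The figures are copied from the appendix; the dictionary layout and function name are just one way to organize them.

```python
# (machine, bits, clip) -> (ffvp8 fps, libvpx fps), copied from the appendix.
RESULTS = {
    ("Core i7 620QM", 32, "Parkjoy"): (44.58, 33.06),
    ("Core i7 620QM", 32, "Sintel"): (74.26, 56.11),
    ("Core i5 520M", 64, "Parkjoy"): (68.29, 41.06),
    ("Core i5 520M", 64, "Sintel"): (112.38, 69.64),
    ("Core 2 T9300", 64, "Parkjoy"): (54.09, 33.68),
    ("Core 2 T9300", 64, "Sintel"): (87.54, 52.74),
    ("Core Duo", 32, "Parkjoy"): (21.31, 17.96),
    ("Core Duo", 32, "Sintel"): (41.24, 29.65),
    ("Atom N270", 32, "Parkjoy"): (15.29, 12.46),
    ("Atom N270", 32, "Sintel"): (26.87, 20.41),
}


def speedup(ffvp8_fps, libvpx_fps):
    """How many times faster ffvp8 decoded than libvpx."""
    return ffvp8_fps / libvpx_fps


for (machine, bits, clip), (fast, slow) in RESULTS.items():
    print(f"{machine:14s} {bits}-bit {clip:8s} {speedup(fast, slow):.2f}x")
```

    On these numbers the 64-bit machines land a little above 1.6x, while the 32-bit ones sit between roughly 1.2x and 1.4x.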

  • VP8: a retrospective

    13 July 2010, by Dark Shikari.  Tags: DCT, VP8, speed

    I’ve been working the past few weeks to help finish up the ffmpeg VP8 decoder, the first community implementation of On2’s VP8 video format.  Now that I’ve written a thousand or two lines of assembly code and optimized a good bit of the C code, I’d like to look back at VP8 and comment on a variety of things, both good and bad, that slipped the net the first time, along with things that have changed since the time of that blog post.

    These are mostly not issues related to compression.  That topic has been beaten to death, particularly in MSU’s recent comparison, where x264 beat the crap out of VP8 and the VP8 developers pulled a Pinocchio in the developer comments.  But that was expected and isn’t particularly interesting, so I won’t go into that.  VP8 doesn’t have to be the best in the world in order to be useful.

    When the ffmpeg VP8 decoder is complete (just a few more asm functions to go), we’ll hopefully be able to post some benchmarks comparing it to libvpx.

    1.  The spec, er, I mean, bitstream guide.

    Google has reneged on their claim that a spec existed at all and renamed it a “bitstream guide”.  This probably happened after it was found that the document was not merely incomplete: at least a dozen places in the spec differed wildly from what was actually in their own encoder and decoder software!  The deblocking filter, motion vector clamping, probability tables, and many more parts simply disagreed flat-out with the spec.  Fortunately, Ronald Bultje, one of the main authors of the ffmpeg VP8 decoder, is rather skilled at reverse-engineering, so we were able to put together a matching implementation regardless.

    Most of the differences aren’t particularly important (they don’t have a huge effect on compression or anything), but they make it vastly more difficult to implement a “working” VP8 decoder, or for that matter, decide what “working” really is.  For example, Google’s decoder will, if told to “swap the ALT and GOLDEN reference frames”, overwrite both with GOLDEN, because it first sets ALT = GOLDEN, and then sets GOLDEN = ALT.  Is this a bug?  Or is this how it’s supposed to work?  It’s hard to tell, since there isn’t a spec to say so.  Google says that whatever libvpx does is right, but I doubt they intended this.
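
    Why two back-to-back copies cannot swap anything is easy to show in a few lines.  This is a toy model, not libvpx code: the dictionary and function names are invented, and the copy order shown is the one that leaves both slots holding GOLDEN, the outcome described above (with the copies the other way round, both would end up holding ALT instead).

```python
def naive_swap(refs):
    """Two sequential copies; the second reads an already-overwritten slot."""
    refs = dict(refs)
    refs["ALT"] = refs["GOLDEN"]
    refs["GOLDEN"] = refs["ALT"]  # ALT already holds GOLDEN's frame here
    return refs


def real_swap(refs):
    """A true exchange, which needs a temporary (implicit in Python)."""
    refs = dict(refs)
    refs["ALT"], refs["GOLDEN"] = refs["GOLDEN"], refs["ALT"]
    return refs


frames = {"GOLDEN": "frame_g", "ALT": "frame_a"}
print(naive_swap(frames))  # both slots now reference the same frame
print(real_swap(frames))   # the two references are exchanged
```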

    I expect a spec will eventually be written, but it was a bit obnoxious of Google — both to the community and to their own developers — to release so early that they didn’t even have their own documentation ready.

    2.  The TM intra prediction mode.

    One thing I glossed over in the original piece was that On2 had added an extra intra prediction mode to the standard batch that H.264 came with: they replaced Planar with “TM pred”.  For i4x4, which didn’t have a Planar mode, they just added it without replacing an old one, resulting in a total of 10 modes to H.264’s 9.  After understanding and writing assembly code for TM pred, I have to say that it is quite a cool idea.  Here’s how it works:

    1.  Let us take a block of size 4×4, 8×8, or 16×16.

    2.  Define the pixels bordering the top of this block (starting from the left) as T[0], T[1], T[2]…

    3.  Define the pixels bordering the left of this block (starting from the top) as L[0], L[1], L[2]…

    4.  Define the pixel above the top-left of the block as TL.

    5.  Predict every pixel <X,Y> in the block to be equal to clip3( T[X] + L[Y] - TL, 0, 255).

    It’s effectively a generalization of gradient prediction to the block level — predict each pixel based on the gradient between its top and left pixels, and the topleft.  According to the VP8 devs, it’s chosen by the encoder quite a lot of the time, which isn’t surprising; it seems like a pretty good idea.  As just one more intra pred mode, it’s not going to do magic for compression, but it’s a cool idea and elegantly simple.
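
    The five steps above translate almost directly into code.  A minimal sketch, with the argument layout (plain Python lists for the top row and left column) chosen purely for illustration:

```python
def clip3(x, lo, hi):
    """Clamp x into the range [lo, hi]."""
    return max(lo, min(hi, x))


def tm_predict(top, left, top_left):
    """TM prediction for a square block.

    top:      pixels bordering the top of the block, left to right (T[X])
    left:     pixels bordering the left of the block, top to bottom (L[Y])
    top_left: the pixel above the top-left corner (TL)
    Returns the predicted block as rows, indexed [Y][X].
    """
    return [[clip3(t + l - top_left, 0, 255) for t in top] for l in left]


# A 4x4 block: each row repeats the top row, shifted by how far
# that row's left pixel is from TL.
for row in tm_predict([10, 20, 30, 40], [10, 12, 14, 16], 10):
    print(row)
```

    The clip only matters near the extremes, for example when the top and left neighbours are both bright and TL is dark.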

    3.  Performance and the deblocking filter.

    On2 advertised for quite some time that VP8’s goal was to be significantly faster to decode than H.264.  When I saw the spec, I waited for the punchline, but apparently they were serious.  There’s nothing wrong with being of similar speed or a bit slower, but I was rather confused by the fact that their design didn’t match their stated goal at all.  What apparently happened is they had multiple profiles of VP8: high and low complexity profiles.  They marketed the performance of the low complexity ones while touting the quality of the high complexity ones, which is a tad dishonest.  More importantly though, practically nobody is using the low complexity modes, so anyone writing a decoder has to be prepared to handle the high complexity ones, which are the default.

    The primary time-eater here is the deblocking filter.  VP8, being an H.264 derivative, has much the same problem as H.264 does in terms of deblocking — it spends an absurd amount of time there.  As I write this post, we’re about to finish some of the deblocking filter asm code, but before it’s committed, up to 70% or more of total decoding time is spent in the deblocking filter!  Like H.264, it suffers from the 4×4 transform problem: a 4×4 transform requires a total of 8 length-16 and 8 length-8 loopfilter calls per macroblock, while Theora, with only an 8×8 transform, requires half that.
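
    The edge counts quoted above can be reproduced with a little arithmetic.  The sketch assumes 4:2:0 video (a 16×16 luma block plus two 8×8 chroma blocks per macroblock) and counts one filter call per full-length edge, including the macroblock’s own top and left edge; the function name is made up.

```python
def loopfilter_calls(transform_size, luma_size=16, chroma_size=8, chroma_planes=2):
    """Return (length-16 calls, length-8 calls) per macroblock.

    Each plane has one vertical and one horizontal edge every
    transform_size pixels, hence the factor of 2.
    """
    luma = 2 * (luma_size // transform_size)
    chroma = 2 * (chroma_size // transform_size) * chroma_planes
    return luma, chroma


print(loopfilter_calls(4))  # 4x4 transform
print(loopfilter_calls(8))  # 8x8 transform
```

    With a 4×4 transform this gives 8 and 8; with an 8×8 transform it gives 4 and 4, which is the factor of two mentioned above.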

    This problem is aggravated in VP8 by the fact that the deblocking filter isn’t strength-adaptive; if even one 4×4 block in a macroblock contains coefficients, every single edge has to be deblocked.  Furthermore, the deblocking filter itself is quite complicated; the “inner edge” filter is a bit more complex than H.264’s and the “macroblock edge” filter is vastly more complicated, having two entirely different codepaths chosen on a per-pixel basis.  Of course, in SIMD, this means you have to do both and mask them together at the end.
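
    “Do both and mask them together” looks roughly like this when modelled lane by lane.  The two filter paths and the selection rule below are placeholders, not the VP8 filters; only the blend pattern is the point.

```python
def select_bytes(mask, if_set, if_clear):
    """Per-lane blend; mask lanes are 0xFF or 0x00, as SIMD compares produce."""
    return [(m & a) | (~m & 0xFF & b)
            for m, a, b in zip(mask, if_set, if_clear)]


def filter_lanes(pixels, use_path_a, path_a, path_b):
    """Run both code paths on every lane, then keep one result per lane."""
    mask = [0xFF if use_path_a(p) else 0x00 for p in pixels]
    return select_bytes(mask,
                        [path_a(p) for p in pixels],
                        [path_b(p) for p in pixels])


# Hypothetical paths: halve bright pixels, leave dark ones alone.
print(filter_lanes([10, 200, 90, 255],
                   use_path_a=lambda p: p > 128,
                   path_a=lambda p: p // 2,
                   path_b=lambda p: p))
```

    The cost is that every lane pays for both paths, which is why a per-pixel branch in the format design is expensive for SIMD decoders.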

    There’s nothing wrong with a good-but-slow deblocking filter.  But given the amount of deblocking one needs to do in a 4×4-transform-based format, it might have been a better choice to make the filter simpler.  It’s pretty difficult to beat H.264 on compression, but it’s certainly not hard to beat it on speed — and yet it seems VP8 missed a perfectly good chance to do so.  Another option would have been to pick an 8×8 transform instead of 4×4, reducing the amount of deblocking by a factor of 2.

    And yes, there’s a simple filter available in the low complexity profile, but it doesn’t help if nobody uses it.

    4.  Tree-based arithmetic coding.

    Binary arithmetic coding has become the standard entropy coding method for a wide variety of compressed formats, ranging from LZMA to VP6, H.264 and VP8.  It’s simple, relatively fast compared to other arithmetic coding schemes, and easy to make adaptive.  The problem with this is that you have to come up with a method for converting non-binary symbols into a list of binary symbols, and then choosing what probabilities to use to code each one.  Here’s an example from H.264, the sub-partition mode symbol, which is either 8×8, 8×4, 4×8, or 4×4.  encode_decision( context, bit ) writes a binary decision (bit) into a numbered context (context).

    8×8: encode_decision( 21, 0 );

    8×4: encode_decision( 21, 1 ); encode_decision( 22, 0 );

    4×8: encode_decision( 21, 1 ); encode_decision( 22, 1 ); encode_decision( 23, 1 );

    4×4: encode_decision( 21, 1 ); encode_decision( 22, 1 ); encode_decision( 23, 0 );

    As can be seen, this is clearly like a Huffman tree.  Wouldn’t it be nice if we could represent this in the form of an actual tree data structure instead of code?  On2 thought so — they designed a simple system in VP8 that allowed all binarization schemes in the entire format to be represented as simple tree data structures.  This greatly reduces the complexity — not speed-wise, but implementation-wise — of the entropy coder.  Personally, I quite like it.
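
    To make the idea concrete, here is the H.264 sub-partition example from above expressed as data instead of code.  The node layout and helper names are an illustrative sketch, not the table format VP8 actually uses.

```python
# Each node is (context, child_for_bit_0, child_for_bit_1).
# A child is either a leaf symbol (str) or the index of another node (int).
SUB_PARTITION_TREE = [
    (21, "8x8", 1),
    (22, "8x4", 2),
    (23, "4x4", "4x8"),
]


def binarize(tree, symbol, node=0):
    """Return the (context, bit) decisions leading to symbol, or None."""
    context, *children = tree[node]
    for bit, child in enumerate(children):
        if child == symbol:
            return [(context, bit)]
        if isinstance(child, int):
            rest = binarize(tree, symbol, child)
            if rest is not None:
                return [(context, bit)] + rest
    return None


def decode_symbol(tree, read_decision):
    """Walk the tree with read_decision(context) -> bit until a leaf."""
    node = 0
    while True:
        context, zero, one = tree[node]
        child = one if read_decision(context) else zero
        if isinstance(child, str):
            return child
        node = child


print(binarize(SUB_PARTITION_TREE, "4x8"))
```

    One generic encoder and one generic decoder then serve every symbol type; adding a new binarization means adding a table, not a function.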

    5.  The inverse transform ordering.

    I should at some point write a post about common mistakes made in video formats that everyone keeps making.  These are not issues that are patent worries or huge issues for compression — just stupid mistakes that are repeatedly made in new video formats, probably because someone just never asked the guy next to him “does this look stupid?” before sticking it in the spec.

    One common mistake is the problem of transform ordering.  Every sane 2D transform is “separable” — that is, it can be done by doing a 1D transform vertically and doing the 1D transform again horizontally (or vice versa).  The original iDCT as used in JPEG, H.263, and MPEG-1/2/4 was an “idealized” iDCT — nobody had to use the exact same iDCT, theirs just had to give very close results to a reference implementation.  This ended up resulting in a lot of practical problems.  It was also slow; the only way to get an accurate enough iDCT was to do all the intermediate math in 32-bit.

    Practically every modern format, accordingly, has specified an exact iDCT.  This includes H.264, VC-1, RV40, Theora, VP8, and many more.  Of course, with an exact iDCT comes an exact ordering — while the “real” iDCT can be done in any order, an exact iDCT usually requires an exact order.  That is, it specifies horizontal and then vertical, or vertical and then horizontal.

    All of these transforms end up being implemented in SIMD.  In SIMD, a vertical transform is generally the only option, so a transpose is added to the process instead of doing a horizontal transform.  Accordingly, there are two ways to do it:

    1.  Transpose, vertical transform, transpose, vertical transform.

    2.  Vertical transform, transpose, vertical transform, transpose.

    These may seem to be equally good, but there’s one catch — if the transpose is done first, it can be completely eliminated by merging it into the coefficient decoding process.  On many modern CPUs, particularly x86, transposes are very expensive, so eliminating one of the two gives a pretty significant speed benefit.
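
    The difference between the two orderings can be sketched with a toy 4×4 transform.  The matrix below is an arbitrary integer transform picked for illustration, and it has no intermediate rounding, so both ways give identical output here; real exact iDCTs typically round or shift between passes, which is why a format has to fix the order.

```python
# Arbitrary integer 1D transform, for illustration only.
M = [[1, 1, 1, 1],
     [2, 1, -1, -2],
     [1, -1, -1, 1],
     [1, -2, 2, -1]]


def transpose(x):
    return [list(row) for row in zip(*x)]


def vertical(x):
    """Apply the 1D transform to every column of x."""
    return [[sum(M[i][k] * x[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def way1(x):
    """Transpose, vertical transform, transpose, vertical transform."""
    return vertical(transpose(vertical(transpose(x))))


def way2(x):
    """Vertical transform, transpose, vertical transform, transpose."""
    return transpose(vertical(transpose(vertical(x))))


def way1_pretransposed(xt):
    """Way 1 when the coefficient decoder already stored the block transposed."""
    return vertical(transpose(vertical(xt)))


block = [[4 * r + c for c in range(4)] for r in range(4)]
assert way1(block) == way1_pretransposed(transpose(block))
```

    In way 1 the leading transpose only rearranges the input, so it can be folded into the order in which coefficients are written.  In way 2 the trailing transpose acts on computed results, so there is nothing to fold it into.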

    H.264 did it way 1).

    VC-1 did it way 1).

    Theora (inherited from VP3) did it way 1).

    But no.  VP8 has to do it way 2), where you can’t eliminate the transpose.  Bah.  It’s not a huge deal; probably only ~1-2% overall at most speed-wise, but it’s just a needless waste.  What really bugs me is that VP3 got it right — why in the world did they screw it up this time around if they got it right beforehand?

    RV40 is the other modern format I know that made this mistake.

    (NB: You can do transforms without a transpose, but it’s generally not worth it unless the intermediate needs 32-bit math, as in the case of the “real” iDCT.)

    6.  Not supporting interlacing.

    THANK YOU THANK YOU THANK YOU THANK YOU THANK YOU THANK YOU THANK YOU.

    Interlacing was the scourge of H.264.  It weaseled its way into every nook and cranny of the spec, making every decoder a thousand lines longer.  H.264 even included a highly complicated (and effective) dedicated interlaced coding scheme, MBAFF.  The mere existence of MBAFF, despite its usefulness for broadcasters and others still stuck in the analog age with their 1080i, 576i, and 480i content, was a blight upon the video format.

    VP8 has once and for all avoided it.

    And if anyone suggests adding interlaced support to the experimental VP8 branch, find a straitjacket and padded cell for them before they cause any real damage.

  • How to cheat on video encoder comparisons

    21 June 2010, by Dark Shikari.  Tags: H.264, benchmark, stupidity, test sequences

  • How to cheat on video encoder comparisons

    21 June 2010, by Dark Shikari (tags: benchmark, H.264, stupidity, test sequences)

    Over the past few years, practically everyone and their dog has published some sort of encoder comparison.  Sometimes they’re actually intended to be something for the world to rely on, like the old Doom9 comparisons and the MSU comparisons.  Other times, they’re just to scratch an itch — someone wants to decide for themselves what is better.  And sometimes they’re just there to outright lie in favor of whatever encoder the author likes best.  The latter is practically an expected feature on the websites of commercial encoder vendors.

    One thing almost all these comparisons have in common — particularly (but not limited to!) the ones done without consulting experts — is that they are horribly done.  They’re usually easy to spot: for example, two videos at totally different bitrates are being compared, or the author complains about one of the videos being “washed out” (i.e. he screwed up his colorspace conversion).  Or the results are simply nonsensical.  Many of these problems result from the person running the test not “sanity checking” the results to catch mistakes that he made in his test.  Others are just outright intentional.

    The result of all these mistakes, both intentional and accidental, is that the results of encoder comparisons tend to be all over the map, to the point of absurdity.  For any pair of encoders, it’s practically a given that a comparison exists somewhere that will “prove” any result you want to claim, even if the result would be beyond impossible in any sane situation.  This often results in the appearance of a “controversy” even if there isn’t any.

    Keep in mind that every single mistake I mention in this article has actually been done, usually in more than one comparison.  And before I offend anyone, keep in mind that when I say “cheating”, I don’t mean to imply that everyone that makes the mistake is doing it intentionally.  Especially among amateur comparisons, most of the mistakes are probably honest.

    So, without further ado, we will investigate a wide variety of ways, from the blatant to the subtle, with which you too can cheat on your encoder comparisons.

    Blatant cheating

    1.  Screw up your colorspace conversions.  A common misconception is that converting from YUV to RGB and back is a simple process where nothing can go wrong.  This is quite untrue.  There are two primary attributes of YUV: PC range (0-255) vs TV range (16-235) and BT.709 vs BT.601 conversion coefficients.  That makes four possible types of YUV.  When people compare encoders, they often use different frontends, some of which make incorrect assumptions about these attributes.

    Incorrect assumptions are so common that it’s often a matter of luck whether the tool gets it right or not.  It doesn’t help that most videos don’t even properly signal which they are to begin with!  Often even the tool that the person running the comparison is using to view the source material gets the conversion wrong.

    Subsampled YUV (aka what everyone uses) adds yet another dimension to the problem: the locations that the chroma samples represent (“chroma siting”) aren’t constant.  For example, JPEG and MPEG-2 define different positions.  This is even worse because almost nobody actually handles this correctly; the best approach is to simply make sure none of your software is doing any conversion.  A mistake in chroma siting is what created that infamous PSNR graph showing Theora beating x264, which has been cited for ages since despite the developers themselves retracting it after realizing their mistake.

    Keep in mind that the video encoder is not responsible for colorspace conversion — almost all video encoders operate in the YUV domain (usually subsampled 4:2:0 YUV, aka YV12).  Thus any problem in colorspace conversion is usually the fault of the tools used, not the actual encoder.

    How to spot it: “The color is a bit off” or “the contrast of the video is a bit duller”.  There were a staggering number of “H.264 vs Theora” encoder comparisons which came out in favor of one or the other solely based on “how well the encoder kept the color” — making the results entirely bogus.
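
    To make the four combinations concrete, here is a minimal sketch that decodes the same 8-bit YUV sample under each assumption.  The function name, the sample pixel, and the rounding are illustrative choices of mine, not taken from any particular tool; the luma coefficients are the standard BT.601 and BT.709 values.

```python
# Luma coefficients (Kr, Kb) for the two common matrix standards.
MATRICES = {"bt601": (0.299, 0.114), "bt709": (0.2126, 0.0722)}


def yuv_to_rgb(y, cb, cr, matrix="bt709", tv_range=True):
    """Convert one 8-bit YUV sample to 8-bit RGB under the given assumptions."""
    kr, kb = MATRICES[matrix]
    kg = 1.0 - kr - kb
    if tv_range:
        # TV (limited) range: luma spans 16-235, chroma 16-240 centred on 128.
        yn = (y - 16) / 219.0
        pb = (cb - 128) / 224.0
        pr = (cr - 128) / 224.0
    else:
        # PC (full) range: everything spans 0-255.
        yn = y / 255.0
        pb = (cb - 128) / 255.0
        pr = (cr - 128) / 255.0
    r = yn + 2.0 * (1.0 - kr) * pr
    b = yn + 2.0 * (1.0 - kb) * pb
    g = (yn - kr * r - kb * b) / kg

    def to_8bit(v):
        return max(0, min(255, round(v * 255.0)))

    return to_8bit(r), to_8bit(g), to_8bit(b)


sample = (81, 90, 240)  # arbitrary illustrative pixel
for matrix in ("bt601", "bt709"):
    for tv_range in (True, False):
        label = "tv" if tv_range else "pc"
        print(matrix, label, yuv_to_rgb(*sample, matrix=matrix, tv_range=tv_range))
```

    Reading TV-range black (16, 128, 128) as PC range gives a dark grey instead of black, which is exactly the “washed out” look; picking the wrong matrix shifts the colors instead.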

    2.  Don’t compare at the same (or nearly the same) bitrate. I saw a VP8 vs x264 comparison the other day that gave VP8 30% more bitrate and then proceeded to demonstrate that it got better PSNR. You would think this is blindingly obvious, but people still make this mistake!  The most common cause of this is assuming that encoders will successfully reach the target bitrate you ask of them — particularly with very broken encoders that don’t.  Always check the output filesizes of your encodes.

    How to spot it: The comparison lists perfectly round bitrates for every single test, as opposed to the actual bitrates achieved by the encoders, which will never be exactly matching in any real test.
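
    Checking this takes a few lines.  The sketch below derives the achieved bitrate from a file size and a duration, and reports how far apart two encodes are; the function names are mine, and it assumes the files hold only the video stream (container overhead and audio would inflate the number).

```python
def bitrate_kbps(size_bytes, duration_seconds):
    """Average bitrate actually achieved, in kilobits per second."""
    return size_bytes * 8 / duration_seconds / 1000.0


def bitrate_mismatch(kbps_a, kbps_b):
    """Relative difference between two bitrates, as a fraction of the smaller one."""
    return abs(kbps_a - kbps_b) / min(kbps_a, kbps_b)


# Hypothetical sizes for two encodes of the same 10-second clip.
encode_a = bitrate_kbps(1_250_000, 10.0)
encode_b = bitrate_kbps(1_625_000, 10.0)
print(f"A: {encode_a:.0f} kbps, B: {encode_b:.0f} kbps, "
      f"mismatch: {bitrate_mismatch(encode_a, encode_b):.0%}")
```

    In practice the sizes would come from something like os.path.getsize on the output files rather than from hard-coded numbers.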

    3.  Use unfair encoding settings. This is a bit of a wide topic: there are many ways to do this.  We’ll cover the more blatant ones in this part.  Here’s some common ones:

    a.  Simply cheat. Intentionally pick awful settings for the encoder you don’t like.

    b.  Don’t consider performance. Pick encoding settings without any regard for some particular performance goal.  For example, it’s perfectly reasonable to say “use the best settings possible, regardless of speed”.  It’s also reasonable to look for a particular encoding speed target.  But what isn’t reasonable is to pick extremely fast settings for one encoder and extremely slow settings for another encoder.

    c.  Don’t attempt to match compatibility options when it’s reasonable to do so. Keyframe interval is a classic one of these: shorter values reduce compression but improve seeking.  An easy way to cheat is to simply not set them to the same value, biasing towards whatever encoder has the longer interval.  This is most common as an accidental mistake with comparisons involving ffmpeg, where the default keyframe interval is an insanely low 12 frames.

    How to spot it: The comparison doesn’t document its approach regarding choice of encoding settings.

    4.  Use ratecontrol methods unfairly. Constant bitrate is not the same as average bitrate — using one instead of the other is a great way to completely ruin a comparison.  Another method is to use 1-pass bitrate mode for one encoder and 2-pass or constant quality for another.  A good general approach is that, for any given encoder, one should use 2-pass if available and constant quality if not (it may take a few runs to get the bitrate you want, of course).

    It’s also fine to run a comparison with a particular mode in mind: for example, a comparison targeted at streaming applications might want to test using 1-pass CBR.  In that case, if CBR is not available in an encoder, you can’t compare to that encoder.

    How to spot it: It’s usually pretty obvious if the encoding settings are given.

    5.  Use incredibly old versions of encoders. As it happens, Debian stable is not the best source for the most recent encoding software.  Equally bad is using recent versions that are known to be buggy.

    6.  Don’t distinguish between video formats and the software that encodes them. This is incredibly common: I’ve seen tests that claim to compare “H.264” against something else while in fact actually comparing “Quicktime” against something else.  It’s impossible to compare all H.264 encoders at once, so don’t even try; just call the comparison “Quicktime versus X” instead of “H.264 versus X”.  Or better yet, use a good H.264 encoder, like x264, and don’t bother testing awful encoders to begin with.

    Less-obvious cheating

    1.  Pick a bitrate that’s way too low. Low bitrate testing is very effective at making differences between encoders obvious, particularly if doing a visual comparison.  But past a certain point, it becomes impossible for some encoders to keep up.  This is usually an artifact of the video format itself — a scalability limitation.  Practically all DCT-based formats have this kind of limitation (wavelets are mostly immune).

    In reality, this is rarely a problem, because one could merely downscale the video to resolve the problem — lower resolutions need fewer bits.  But people rarely do this in comparisons (it’s hard to do it fairly), so the best approach is to simply not use absurdly low bitrates.  What is “absurdly low”?  That’s a hard question — it ends up being a matter of using one’s best judgement.

    This tends to be less of a problem in larger-scale tests that use many different bitrates.

    How to spot it: At least one of the encoders being compared falls apart completely and utterly in the screenshots.

    Biases towards, a lot: Video formats with completely scalable coding methods (Dirac, Snow, JPEG-2000, SVC).

    Biases towards, a little: Video formats with coding methods that improve scalability, such as arithmetic coding, B-frames, and run-length coding.  For example, H.264 and Theora tend to be more scalable than MPEG-4.

    2.  Pick a bitrate that’s way too high. This is a staggeringly common mistake: pick a bitrate so high that all of the resulting encodes look absolutely perfect.  The claim is then made that “there’s no significant difference” between any of the encoders tested.  This is surprisingly easy to do inadvertently on sources like Big Buck Bunny, which looks transparent at relatively low bitrates.  An equally common but similar mistake is to test at a bitrate that isn’t so high that the videos look perfect, but high enough that they all look very good.  The claim is then made that “the difference between these encoders is small”.  Well, of course, if you give everything tons of bitrate, the difference between encoders is small.

    How to spot it: You can’t tell which image is the source and which is the encode.

    3.  Making invalid comparisons using objective metrics. I explained this earlier in the linked blog post, but in short, if you’re going to measure PSNR, make sure all the encoders are optimized for PSNR.  Equally, if you’re going to leave the encoder optimized for visual quality, don’t measure PSNR — post screenshots instead.  Same with SSIM or any other objective metric.  Furthermore, don’t blindly do metric comparisons — always at least look at the output as a sanity test.  Finally, do not claim that PSNR is particularly representative of visual quality, because it isn’t.

    How to spot it: Encoders with psy optimizations, such as x264 or Theora 1.2, do considerably worse than expected in PSNR tests, but look much better in visual comparisons.

    4.  Lying with graphs. Using misleading scales on graphs is a great way to make the differences between encoders seem larger or smaller than they actually are.  A common mistake is to scale SSIM linearly: in fact, 0.99 is about twice as good as 0.98 (the distance from a perfect 1.0 is halved), not 1% better.  One solution for this is to compare SSIM values in dB.
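
    As a rough sketch of the dB idea, one common convention is -10·log10(1 − SSIM), which turns each halving of the remaining error into a step of about 3 dB.  The function name here is mine, and other scalings are possible.

```python
import math


def ssim_to_db(ssim):
    """Map an SSIM score in [0, 1] onto a decibel scale; 1.0 maps to infinity."""
    if ssim >= 1.0:
        return float("inf")
    return -10.0 * math.log10(1.0 - ssim)


for score in (0.98, 0.99):
    print(f"SSIM {score} -> {ssim_to_db(score):.2f} dB")
```

    On this scale the gap between 0.98 and 0.99 is about 3 dB, which plots as the large difference it really is instead of a sliver at the top of a linear axis.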

    5.  Using lossy screenshots. Posting screenshots as JPEG is a silly, pointless way to worsen an encoder comparison.

    Subtle cheating

    1.  Unfairly pick screenshots for comparison. Comparing based on stills is not ideal, but it’s often vastly easier than comparing videos in motion.  But it also opens up the door to unfairness.  One of the most common mistakes is to pick a frame that falls on (or immediately after) a keyframe for one encoder but not for the other.  Particularly in the case of encoders that massively boost keyframe quality, this will unfairly bias in favor of the one with the recent keyframe.

    How to spot it: It’s very difficult to tell, if not impossible, unless they provide the video files to inspect.
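
    If you do have the files and know where each encoder placed its keyframes, one way to avoid this trap is to only take screenshots from frames that are well clear of a keyframe in both encodes.  This is just one possible policy, sketched under the assumption that the keyframe positions are available as lists of frame numbers; the function names and the threshold of 10 frames are arbitrary choices of mine.

```python
from bisect import bisect_right


def frames_since_keyframe(frame, keyframes):
    """Distance from `frame` back to the most recent keyframe at or before it."""
    i = bisect_right(keyframes, frame) - 1
    if i < 0:
        raise ValueError("no keyframe at or before this frame")
    return frame - keyframes[i]


def fair_frames(keyframes_a, keyframes_b, total_frames, min_distance=10):
    """Frames at least `min_distance` frames past a keyframe in both encodes."""
    ka, kb = sorted(keyframes_a), sorted(keyframes_b)
    return [n for n in range(total_frames)
            if frames_since_keyframe(n, ka) >= min_distance
            and frames_since_keyframe(n, kb) >= min_distance]


# Hypothetical keyframe positions for two encodes of a 70-frame clip.
candidates = fair_frames([0, 50], [0, 30, 60], total_frames=70)
print(candidates)
```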

    2.  Cherry-pick source videos. Good source videos are incredibly hard to come by — almost everything is already compressed and what’s left is usually a very poor example of real content.  Here’s some common ways to bias unfairly using cherry-picking:

    a.  Pick source videos that are already heavily compressed. Pre-compressed source isn’t much of an issue if your target quality level for testing is much lower than that of the source, since any compression artifacts in the source will be a lot smaller than those created by the encoders.  But if the source is already very compressed, or you’re testing at a relatively high quality level, this becomes a significant issue.

    Biases towards: Anything that uses a similar transform to the source content.  For MPEG-2 source material, this biases towards formats that use the 8×8 DCT or a very close approximation: MPEG-1/2/4, H.263, and Theora.  For H.264 source material, this biases towards formats that use a 4×4 transform: H.264 and VP8.

    b.  Pick standard test clips that were not intended for this purpose. There are a wide variety of uncompressed “standard test clips”.  Some of these are not intended for general-purpose use, but rather exist to test specific encoder capabilities.  For example, Mobile Calendar (“mobcal”) is extremely sharp and low motion, serving to test interpolation capabilities.  It will bias incredibly heavily towards whatever encoder uses more B-frames and/or has higher-precision motion compensation.  Other test clips are almost completely static, such as the classic “akiyo”.  These are also not particularly representative of real content.

    c.  Pick very noisy content. Noise is — by definition — not particularly compressible.  Both in terms of PSNR and visual quality, a very noisy test clip will tend to reduce the differences between encoders dramatically.

    d.  Pick a test clip to exercise a specific encoder feature. I’ve often used short clips from Touhou games to demonstrate the effectiveness of x264’s macroblock-tree algorithm.  I’ve sometimes even used them to compare to other encoders as part of such a demonstration.  I’ve also used the standard test clip “parkrun” as a demonstration of adaptive quantization.  But claiming that either is representative of most real content, and thus can be used as a general determinant of how good encoders are, is of course insane.

    e.  Simply encode a bunch of videos and pick the one your favorite encoder does best on.

    3.  Preprocessing the source. An encoder test is a test of encoders, not preprocessing.  Some encoding apps may add preprocessors to the source, such as noise reduction.  This may make the video look better (possibly even better than the source), but it’s not a fair part of comparing the actual encoders.

    4.  Screw up decoding. People often forget that in addition to encoding, a test also involves decoding — a step which is equally possible to do wrong.  One common error caused by this is in tests of Theora on content whose resolution isn’t divisible by 16.  Decoding is often done with ffmpeg — which doesn’t crop the edges properly in some cases.  This isn’t really a big deal visually, but in a PSNR comparison, misaligning the entire frame by 4 or 8 pixels is a great way of completely invalidating the results.
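
    To see how badly a small misalignment hurts, here is a toy sketch that measures PSNR on a synthetic frame before and after shifting the “decoded” picture by 4 pixels.  Everything in it is an assumption for illustration: the frame is random noise (a worst case; real content is smoother and loses less), the “encode” is just the source with an error of at most 1 per pixel, and the shift wraps around at the edge for simplicity.

```python
import math
import random


def psnr(a, b, peak=255):
    """PSNR in dB between two equally sized frames given as lists of rows."""
    squared_error = sum((x - y) ** 2
                        for row_a, row_b in zip(a, b)
                        for x, y in zip(row_a, row_b))
    mse = squared_error / (len(a) * len(a[0]))
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def shift_right(frame, pixels):
    """Shift every row right by `pixels`, wrapping around at the edge."""
    return [row[-pixels:] + row[:-pixels] for row in frame]


rng = random.Random(0)  # fixed seed so the run is repeatable
source = [[rng.randint(0, 255) for _ in range(64)] for _ in range(64)]
# Stand-in for a good encode: the source plus an error of at most 1 per pixel.
decoded = [[min(255, max(0, p + rng.randint(-1, 1))) for p in row]
           for row in source]

aligned_db = psnr(source, decoded)
misaligned_db = psnr(source, shift_right(decoded, 4))
print(f"aligned: {aligned_db:.1f} dB, shifted by 4 pixels: {misaligned_db:.1f} dB")
```

    The pixel values are identical in both measurements; only the alignment differs, yet the second number says nothing about the encoder at all.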

    The greatest mistake of all

    Above all, the biggest and most common mistake (and the one that leads to many of the problems mentioned here) is the mistaken belief that one, or even a few, tests can really represent all usage fairly.  Any comparison has to have some specific goal: to compare something in some particular case, whether it be “maximum offline compression ignoring encoding speed” or “real-time high-speed video streaming” or whatnot.  And even then, no comparison can represent all use-cases in that category alone.  An encoder comparison can only be honest if it’s aware of its limitations.