Breaking Eggs And Making Omelettes

A blog dealing with technical multimedia matters, binary reverse engineering, and the occasional video game hack.

http://multimedia.cx/eggs/

Articles published on the site

  • The Big VP8 Debug

    20 November 2010, by Multimedia Mike (VP8)

    I hope my previous walkthrough of the VP8 4x4 intra coding process was educational. Today, I'll be walking through an example of what happens when my toy VP8 encoder encodes an intra 16x16 block. This may prove educational to those who have never been exposed to the deep details of this or related algorithms. Also, I wanted to illustrate where I think my VP8 encoder process is going bad and generating such grotesque results.

    Before I start, let me give a shout-out to Google Docs' Drawing tool which I used to generate these diagrams. It works quite well.

    Results

    (Always cut to the chase in a blog post; results first.) I'm glad I composed this post. In the course of doing so, I found the problem, fixed it, and am now able to present this image that was decoded from the bitstream encoded by my now-working toy VP8 encoder:



    Yeah, I know that image doesn't look like anything you haven't seen before. The difference is that it has made a successful trip through my VP8 encoder.

    Follow along through the encoding process and learn of the mistake...

    Original Block and Subblocks

    Here is the 16x16 block to be encoded:



    The block is broken down into 16 4x4 subblocks for further encoding:
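
    In code, that decomposition is a simple copy loop. Here is a minimal sketch (split_into_subblocks, mb, and stride are illustrative names, not identifiers from my encoder):

     #include <string.h>

     /* Carve a 16x16 luma macroblock into 16 4x4 subblocks in raster
      * order.  'mb' points to the macroblock's top-left sample and
      * 'stride' is the frame's row stride. */
     static void split_into_subblocks(const unsigned char *mb, int stride,
                                      unsigned char subblock[16][16])
     {
         int b, row;
         for (b = 0; b < 16; b++) {
             const unsigned char *src = mb + (b / 4) * 4 * stride + (b % 4) * 4;
             for (row = 0; row < 4; row++)
                 memcpy(&subblock[b][row * 4], src + row * stride, 4);
         }
     }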



    Prediction

    The first step is to pick a prediction mode, generate a prediction block, and subtract the predictors from the macroblock. In this case, we will use DC prediction which means the predictor will be the same for each element.

    In 4x4 VP8 DC intra prediction, samples outside of the frame are assumed to be 128. It's a little different in 16x16 DC intra prediction-- samples above the top row are assumed to be 127 while samples left of the leftmost column are assumed to be 129. For the top left macroblock, this still works out to 128.
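
    Expressed as code, the 16x16 DC predictor computation looks something like the following sketch (my own paraphrase of the rule just stated, not libvpx code; 'above' and 'left' are NULL when those neighbors fall outside the frame):

     /* 16x16 DC prediction: average the 16 samples above and the 16
      * samples to the left, substituting 127 above the top row and 129
      * left of the leftmost column. */
     static int dc_predict_16x16(const unsigned char *above,
                                 const unsigned char *left)
     {
         int i, sum = 16;                /* +16 rounds the >> 5 below */
         for (i = 0; i < 16; i++) {
             sum += above ? above[i] : 127;
             sum += left  ? left[i]  : 129;
         }
         return sum >> 5;                /* 32 samples -> divide by 32 */
     }

    For the top left macroblock, that is (16*127 + 16*129 + 16) >> 5 = 4112 >> 5 = 128, confirming the claim above.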

    Subtract 128 from each of the samples:



    Forward Transform

    Run each of the 16 prediction-removed subblocks through the forward transform. This example uses the forward transform from libvpx 0.9.5:



    I have highlighted the DC coefficients in each subblock. That's because those receive special consideration in 16x16 intra coding.
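
    For reference, here is the shape of that forward transform, lightly paraphrased from libvpx 0.9.5's vp8_short_fdct4x4_c (consult the real source before borrowing; note the pitch parameter, which will matter shortly):

     void vp8_short_fdct4x4_c(short *input, short *output, int pitch)
     {
         int i;
         int a1, b1, c1, d1;
         short *ip = input;
         short *op = output;

         /* first pass: transform each row of the 4x4 block */
         for (i = 0; i < 4; i++) {
             a1 = (ip[0] + ip[3]) << 3;
             b1 = (ip[1] + ip[2]) << 3;
             c1 = (ip[1] - ip[2]) << 3;
             d1 = (ip[0] - ip[3]) << 3;

             op[0] = a1 + b1;
             op[2] = a1 - b1;
             op[1] = (c1 * 2217 + d1 * 5352 + 14500) >> 12;
             op[3] = (d1 * 2217 - c1 * 5352 +  7500) >> 12;

             ip += pitch / 2;   /* advance one input row; more on the /2 below */
             op += 4;
         }

         /* second pass: transform each column of the intermediate result */
         ip = output;
         op = output;
         for (i = 0; i < 4; i++) {
             a1 = ip[0] + ip[12];
             b1 = ip[4] + ip[8];
             c1 = ip[4] - ip[8];
             d1 = ip[0] - ip[12];

             op[0] = (a1 + b1 + 7) >> 4;
             op[8] = (a1 - b1 + 7) >> 4;
             op[4]  = ((c1 * 2217 + d1 * 5352 + 12000) >> 16) + (d1 != 0);
             op[12] =  (d1 * 2217 - c1 * 5352 + 51000) >> 16;

             ip++;
             op++;
         }
     }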

    Quantization

    The Y plane AC quantizer is 4 in this example, the minimum allowed. (The Y plane DC quantizer is also 4 but doesn't come into play for intra 16x16 coding since the DC coefficients follow a different process.) Thus, quantize (integer divide) each AC element in each subblock (we'll ignore the DC coefficient for this part):
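
    In code, this step is nothing more than a division loop; roughly (with coeffs[16][16] holding the 16 transformed subblocks):

     /* Quantize the 15 AC coefficients of each transformed subblock with
      * the Y plane AC quantizer.  Index 0 (the DC coefficient) is skipped;
      * it travels through the separate Y2 path described next. */
     short coeffs[16][16];   /* 16 subblocks of 16 coefficients each */
     const int y_ac_q = 4;
     int b, i;

     for (b = 0; b < 16; b++)
         for (i = 1; i < 16; i++)
             coeffs[b][i] /= y_ac_q;   /* C integer division truncates */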



    The Y2 Round Trip

    Those highlighted DC coefficients from each of the 16 subblocks comprise the Y2 block. This block is transformed with a slightly different algorithm called the Walsh-Hadamard Transform (WHT). The results of this transform are then quantized (using 8 for both Y2 DC and AC in this example, as those are the smallest Y2 quantizers that VP8 allows), then zigzagged and entropy-coded along with the rest of the macroblock coefficients.

    On the decoder side, the Y2 coefficients are decoded, de-zigzagged, dequantized and run through the inverse WHT.
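
    The inverse WHT is the one piece of this path that the spec does pin down. Here is my paraphrase of it (double-check against the spec or FFmpeg before trusting this transcription; input is the dequantized Y2 block in raster order):

     /* Inverse Walsh-Hadamard transform for the Y2 block: columns first,
      * then rows, with the final results rounded and scaled by >> 3.
      * Each output element becomes the DC coefficient of one subblock. */
     static void inverse_wht4x4(const short *input, short *output)
     {
         int i, a1, b1, c1, d1;
         short tmp[16];

         for (i = 0; i < 4; i++) {       /* columns */
             a1 = input[i]     + input[12 + i];
             b1 = input[4 + i] + input[8 + i];
             c1 = input[4 + i] - input[8 + i];
             d1 = input[i]     - input[12 + i];

             tmp[i]      = a1 + b1;
             tmp[4 + i]  = c1 + d1;
             tmp[8 + i]  = a1 - b1;
             tmp[12 + i] = d1 - c1;
         }

         for (i = 0; i < 4; i++) {       /* rows, with rounding */
             a1 = tmp[i * 4]     + tmp[i * 4 + 3];
             b1 = tmp[i * 4 + 1] + tmp[i * 4 + 2];
             c1 = tmp[i * 4 + 1] - tmp[i * 4 + 2];
             d1 = tmp[i * 4]     - tmp[i * 4 + 3];

             output[i * 4]     = (a1 + b1 + 3) >> 3;
             output[i * 4 + 1] = (c1 + d1 + 3) >> 3;
             output[i * 4 + 2] = (a1 - b1 + 3) >> 3;
             output[i * 4 + 3] = (d1 - c1 + 3) >> 3;
         }
     }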

    And this is where I suspect that most of the error is creeping into my VP8 encoder. Observe the round-trip through the Y2 process:



    As intimated, this part causes me consternation due to the wide discrepancy between the original and the reconstructed Y2 blocks. Observe the absolute difference between the 2 vectors:



    That's really significant and leads me to believe that this is where the big problem is.

    What's Wrong?

    My first suspicion was that the quantization was throwing off the process. I was disabused of this idea when I removed quantization from the equation and immediately reversed the transform:



    So perhaps there is a problem with the forward WHT. Just like with the usual subblock transform, the VP8 spec doesn't define how to perform the forward WHT, only the inverse WHT. Do I need to audition different forward WHTs from various versions of libvpx, similar to what I did with the other transform? That doesn't make a lot of sense-- libvpx doesn't seem to have so much trouble with basic encoding.

    The Punchline

    I reviewed the forward WHT code, the stuff that I plagiarized from libvpx 0.9.0. The function takes, among other parameters, a pitch value. There are 2 loops in the code. The first iterates through the rows of the input matrix-- which I assumed was a 4x4 matrix. I was puzzled that during each iteration of the row loop, the input pointer was only being advanced by (pitch/2). I removed the division by 2 and the problem went away. I.e., the encoded image looks correct.
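
    To make the bug concrete, the row loop had this shape (schematic only, with the butterfly details elided; see vp8_short_walsh4x4_c in libvpx 0.9.0 for the genuine article):

     /* Forward WHT row loop, as borrowed.  My encoder hands this function
      * a tightly packed 4x4 block of shorts, so advancing the input
      * pointer by pitch/2 shorts landed it in the middle of a row. */
     for (i = 0; i < 4; i++) {
         a1 = ip[0] + ip[3];
         b1 = ip[1] + ip[2];
         c1 = ip[1] - ip[2];
         d1 = ip[0] - ip[3];

         /* ... butterfly writes to op[0..3] ... */

         ip += pitch / 2;   /* the culprit; ip += pitch works for my layout */
         op += 4;
     }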

    What's up with the (pitch/2), anyway? It seems that the encoder likes to pack 2 4x4 subblocks into an 8x4 block data structure. In fact, the forward DCTs in the libvpx encoder have the same artifact. Remember how I surveyed several variations of forward DCT from different versions of libvpx? The one that proved most accurate in that test was the one I had already modified to advance the input pointer properly. Fixing the other 2 candidates yields similar results:

    input:   92 91 89 86 91 90 88 86 89 89 89 88 89 87 88 93
    short 0.9.0: -311 6 2 0 0 11 -6 1 2 -3 3 0 0 0 -2 1
    inverse: 92 91 89 86 91 90 88 87 90 89 89 88 89 87 88 93
    fast  0.9.0: -313 5 1 0 1 11 -6 1 3 -3 4 0 0 0 -2 1
    inverse: 91 91 89 86 90 90 88 86 89 89 89 88 89 87 88 93
    short 0.9.5: -312 7 1 0 1 12 -5 2 2 -3 3 -1 1 0 -2 1
    inverse: 92 91 89 86 91 90 88 86 89 89 89 88 89 87 88 93
    

    Code cribber beware!

    Corrected Y2 Round Trip

    Let's look at that Y2 round trip one more time:



    And another look at the error between the original and the reconstruction:



    Better.

    Dequantization, Prediction, Inverse Transforms, and Reconstruction

    To be honest, now that I solved the major problem, I'm getting a little tired of making these pictures. Long story short, all elements of the original 16 subblocks are dequantized and their DC coefficients are filled in with the appropriate item from the reconstructed Y2 block. A base predictor block is generated (all 128 values in this case). And each Y block is run through the inverse transform and added to the predictor block. The following is the reconstruction:



    And if you compare that against the original luma macroblock (I don't feel like doing it right now), you'll find that it's pretty close.

    I can't believe how close I was all this time, and how long that pitch bug held me up.

  • Tour of Part of the VP8 Process

    18 November 2010, by Multimedia Mike (VP8)

    My toy VP8 encoder outputs a lot of textual data to illustrate exactly what it’s doing. For those who may not be exactly clear on how this or related algorithms operate, this may prove illuminating.

    Let’s look at subblock 0 of macroblock 0 of a luma plane:

     subblock 0 (original)
      92  91  89  86
      91  90  88  86
      89  89  89  88
      89  87  88  93
    

    Since it’s in the top-left corner of the image to be encoded, the phantom samples above and to the left are implicitly 128 for the purpose of intra prediction (in the VP8 algorithm).

     subblock 0 (with predictors)
         128 128 128 128
     128  92  91  89  86
     128  91  90  88  86
     128  89  89  89  88
     128  89  87  88  93
    


    Using the 4×4 DC prediction mode means averaging the 4 top predictors and 4 left predictors. So, the predictor is 128. Subtract this from each element of the subblock:

     subblock 0, predictor removed
     -36 -37 -39 -42
     -37 -38 -40 -42
     -39 -39 -39 -40
     -39 -41 -40 -35
    

    Next, run the subblock through the forward transform:

     subblock 0, transformed
     -312   7   1   0
        1  12  -5   2
        2  -3   3  -1
        1   0  -2   1
    

    Quantize (integer divide) each element; the DC (first element) and AC (rest of the elements) quantizers are both 4:

     subblock 0, quantized
     -78   1   0   0
       0   3  -1   0
       0   0   0   0
       0   0   0   0
    

    The above block contains the coefficients that are actually transmitted (zigzagged and entropy-encoded) through the bitstream and decoded on the other end.
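
    The zigzag step is just a table-driven reorder; VP8’s default 4x4 scan order is fixed by the spec, so this sketch should be safe to take at face value (the function name is mine):

     /* Reorder a quantized 4x4 block (raster order) into VP8's zigzag
      * scan order before entropy coding: low frequencies first. */
     static const int zigzag[16] = {
         0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
     };

     static void zigzag_block(const short quantized[16], short scanned[16])
     {
         int i;
         for (i = 0; i < 16; i++)
             scanned[i] = quantized[zigzag[i]];
     }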

    The decoding process looks something like this– after the same coefficients are decoded and rearranged, they are dequantized (multiplied) by the original quantizers:

     subblock 0, dequantized
     -312   4   0   0
        0  12  -4   0
        0   0   0   0
        0   0   0   0
    

    Note that these coefficients are not exactly the same as the original, pre-quantized coefficients. This is a large part of where the “lossy” in “lossy video compression” comes from.
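
    The entire lossy mechanism fits in a few lines of C. Using the second coefficient from the blocks above:

     short coeff = 7;           /* original transformed AC coefficient */
     short q     = 4;           /* AC quantizer                        */
     short level = coeff / q;   /* transmitted in the bitstream: 1     */
     short recon = level * q;   /* what the decoder recovers: 4, not 7 */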

    Next, the decoder generates a base predictor subblock. In this case, it’s all 128 (DC prediction for top-left subblock):

     subblock 0, predictor
      128 128 128 128
      128 128 128 128
      128 128 128 128
      128 128 128 128
    

    Finally, the dequantized coefficients are shoved through the inverse transform and added to the base predictor block:

     subblock 0, reconstructed
      91  91  89  85
      90  90  89  87
      89  88  89  90
      88  88  89  92
    

    Again, not exactly the same as the original block, but an incredible facsimile thereof.

    Note that this decoding-after-encoding demonstration is not merely pedagogical– the encoder has to decode the subblock because the encoding of successive subblocks may depend on this subblock. The encoder can’t rely on the original representation of the subblock because the decoder won’t have that– it will have the reconstructed block.
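
    In other words, every VP8 encoder carries a decoder inside it. Schematically (all of these function names are illustrative, not my encoder’s actual API):

     /* Per-subblock encode loop: prediction always reads from the
      * reconstructed frame, never the original, so the encoder stays in
      * lockstep with any decoder. */
     for (b = 0; b < num_subblocks; b++) {
         predict(pred, recon, b);        /* uses prior reconstructed pixels */
         subtract(residual, orig, pred, b);
         fdct(residual, coeffs);
         quantize(coeffs, levels, q);
         encode_coefficients(levels);    /* what actually hits the bitstream */

         /* in-loop decode: rebuild exactly what the decoder will see */
         dequantize(levels, coeffs, q);
         idct(coeffs, residual);
         add(recon, pred, residual, b);
     }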

    For example, here’s the next subblock:

     subblock 1 (original)
      84  84  87  90
      85  85  86  93
      86  83  83  89
      91  85  84  87
    

    Let’s assume DC prediction once more. The 4 top predictors are still all 128 since this subblock lies along the top row. However, the 4 left predictors are the right edge of the subblock reconstructed in the previous example:

     subblock 1 (with predictors)
        128 128 128 128
     85  84  84  87  90
     87  85  85  86  93
     90  86  83  83  89
     92  91  85  84  87
    

    The DC predictor is computed as (128 + 128 + 128 + 128 + 85 + 87 + 90 + 92 + 4) / 8 = 108 (the extra +4 is for rounding considerations). (Note that in this case, using the original subblock’s right edge would also have resulted in 108, but that’s beside the point.)
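
    That computation, as a sketch (my own phrasing of the rule, not lifted code):

     /* 4x4 DC prediction: average of the 4 samples above and the 4
      * samples to the left, with +4 for rounding before dividing by 8. */
     static int dc_predict_4x4(const unsigned char above[4],
                               const unsigned char left[4])
     {
         int i, sum = 4;
         for (i = 0; i < 4; i++)
             sum += above[i] + left[i];
         return sum >> 3;
     }
     /* here: (128+128+128+128 + 85+87+90+92 + 4) >> 3 == 108 */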

    Continuing through the same process as in subblock 0:

     subblock 1, predictor removed
     -24 -24 -21 -18
     -23 -23 -22 -15
     -22 -25 -25 -19
     -17 -23 -24 -21
    
     subblock 1, transformed
     -173  -9  14  -1
        2 -11  -4   0
        1   6  -2   3
       -5   1   0   1
    
     subblock 1, quantized
     -43  -2   3   0
       0  -2  -1   0
       0   1   0   0
      -1   0   0   0
    
     subblock 1, dequantized
     -172  -8  12   0
        0  -8  -4   0
        0   4   0   0
       -4   0   0   0
    
     subblock 1, predictor
      108 108 108 108
      108 108 108 108
      108 108 108 108
      108 108 108 108
    
     subblock 1, reconstructed
      84  84  87  89
      86  85  87  91
      86  83  84  89
      90  85  84  88
    

    I hope this concrete example (straight from a working codec) clarifies this part of the VP8 process.

  • Minimal Understanding of VP8’s Forward Transform

    16 November 2010, by Multimedia Mike (VP8)

    Regarding my toy VP8 encoder, Pengvado mentioned in the comments of my last post, “x264 looks perfect using only i16x16 DC mode. You must be doing something wrong in computing residual or fdct or quantization.” This makes a lot of sense. The encoder generates a series of elements which describe how to reconstruct the original image. Intra block reconstruction takes into consideration the following elements:



    I have already verified that both my encoder and FFmpeg’s VP8 decoder agree precisely on how to reconstruct blocks based on the predictors, coefficients, and quantizers. Thus, if the decoded image still looks crazy, the elements the encoder is generating to describe the image must be wrong.

    So I started studying the forward DCT, which I had cribbed wholesale from the original libvpx 0.9.0 source code. It should be noted that the formal VP8 spec only defines the inverse transform process, not the forward process. I was using a version designated as the “short” version, vs. the “fast” version. Then I looked at the 0.9.5 FDCT. Then I got the idea of comparing the results of each.

    input: 92 91 89 86 91 90 88 86 89 89 89 88 89 87 88 93

    • libvpx 0.9.0 “short”:
      forward: -314 5 1 5 4 5 -2 0 0 1 -1 -1 1 11 -3 -4
      inverse: 92 91 89 86 89 86 91 90 91 90 88 86 88 86 89 89
      
    • libvpx 0.9.0 “fast”:
      forward: -314 4 0 5 4 4 -2 0 0 1 0 -1 1 11 -2 -5
      inverse: 91 91 89 86 88 86 91 90 91 90 88 86 88 86 89 89
      
    • libvpx 0.9.5 “short”:
      forward: -312 7 1 0 1 12 -5 2 2 -3 3 -1 1 0 -2 1
      inverse: 92 91 89 86 91 90 88 86 89 89 89 88 89 87 88 93
      

    I was surprised when I noticed that input[] != idct(fdct(input[])) in some of the above cases. Then I remembered that the aforementioned property isn’t what is meant by a “bit-exact” transform– only that all implementations of the inverse transform are supposed to produce bit-exact output for a given vector of input coefficients.
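
    The comparison itself is a trivial harness, something like this (fdct_candidate() and idct_reference() are placeholders for whichever implementations are under audition; prediction is simplified to a flat 128):

     #include <stdio.h>

     void fdct_candidate(const short *in, short *out);  /* under test */
     void idct_reference(const short *in, short *out);  /* bit-exact  */

     int main(void)
     {
         short in[16] = { 92, 91, 89, 86,  91, 90, 88, 86,
                          89, 89, 89, 88,  89, 87, 88, 93 };
         short residual[16], coeffs[16];
         int i, mismatches = 0;

         for (i = 0; i < 16; i++)
             residual[i] = in[i] - 128;  /* remove the flat DC predictor */
         fdct_candidate(residual, coeffs);
         idct_reference(coeffs, residual);
         for (i = 0; i < 16; i++)
             if (residual[i] + 128 != in[i])
                 mismatches++;
         printf("%d samples differ after the round trip\n", mismatches);
         return 0;
     }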

    Anyway, I tried applying each of these forward transforms. I got slightly differing results, with the latest one I tried (the fdct from libvpx 0.9.5) producing the best results (to my eye). At least the trees look better in the Big Buck Bunny logo image:



    The dense trees of the Big Buck Bunny logo using one of the libvpx 0.9.0 forward transforms



    The same segment of the image using the libvpx 0.9.5 forward transform

    Then again, it could be that the different numbers generated by the newer forward transform triggered different prediction modes to be chosen. Overall, adapting the newer FDCT did not dramatically improve the encoding quality.

    My work on intra 4×4 mode encoding is generating rather more accurate blocks than my intra 16×16 encoder does. Pengvado indicated that x264 generates perfectly legible results when forced to use only intra 16×16 mode. To be honest, I’m having trouble understanding how that can possibly be, given the Walsh-Hadamard transform (WHT)– I think that’s where a lot of the error is creeping in with my intra 16×16 encoder. Then again, FFmpeg implements an inverse WHT function that bears ‘vp8’ in its name, which implies that it’s specific to this algorithm and not exactly shared with H.264.

  • Technically Correct VP8 Encoding

    26 October 2010, by Multimedia Mike (VP8)

    I know people are anxious to see what happens next with my toy VP8 encoder. First and foremost, I corrected the encoder’s DC prediction. A lot of rules govern that mode and if you don’t have it right, error cascades through the image. Now the encoder and decoder both agree on every fine detail of the bitstream syntax and rendering thereof. It still encodes to a neo-impressionist mosaic piece, but at least I’ve ironed the bugs out of this phase:



    I also made it possible to adjust the quantization levels inside the encoder. This means that I’m finally getting some compression out of the thing, vs. the original approach of hardcoding the minimum quantizers.

  • VP8 Misplaced Plane

    16 October 2010, by Multimedia Mike (VP8)

    So I’m stubbornly plugging away at my toy VP8 encoder and I managed to produce this gem. See if you can spot the subtle mistake:



    The misplaced color plane resulted from using the luma plane stride where it was not appropriate. I fixed that, and now the chroma planes are wired to use the same naive prediction algorithm as the luma plane.
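
    The fix amounts to giving each plane its own stride instead of letting the luma stride leak everywhere. Schematically (the field and function names are illustrative):

     /* Chroma planes are half-sized in each dimension, so they carry
      * their own stride. */
     predict_plane(frame->y_data, frame->y_stride,  16);  /* luma          */
     predict_plane(frame->u_data, frame->uv_stride,  8);  /* not y_stride! */
     predict_plane(frame->v_data, frame->uv_stride,  8);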

    Also, I fixed the entropy encoder so that end of block conditions are signaled correctly (instead of my original, suboptimal hack to just encode all zeros). I was disappointed to see that this did not result in a major compression improvement. Then again, I’m using the lowest possible quantization settings for this outing, so perhaps this is to be expected.

    Sigh… 4×4 luma prediction is next. Wish me luck.