Breaking Eggs And Making Omelettes
A blog dealing with technical multimedia matters, binary reverse engineering, and the occasional bit of video game hacking.
Articles published on the site
-
Developing A Shader-Based Video Codec
June 22, 2013, by Multimedia Mike — Outlandish Brainstorms

Early last month, this thing called ORBX.js was in the news. It ostensibly has something to do with streaming video and codec technology, which naturally catches my interest. The hype was kicked off by Mozilla honcho Brendan Eich when he posted an article asserting that HD video decoding could be performed entirely in JavaScript. We've seen this kind of thing before with Broadway, an H.264 decoder implemented entirely in JS. But that approach exposes some very obvious limitations (notably CPU usage).
But this new video codec promises 1080p HD playback directly in JavaScript, which is a lofty claim. How could it possibly do this? I got the impression that performance was achieved using WebGL, an extension that allows JavaScript access to accelerated 3D graphics hardware. Browsing through the conversations surrounding the ORBX.js announcement, I found this confirmation from Eich himself:
You’re right that WebGL does heavy lifting.
As of this writing, ORBX.js remains some kind of private tech demo. If a public demo were available, it would necessarily be possible to reverse engineer the downloadable JavaScript decoder.
But the announcement was enough to make me wonder how it could be possible to create a video codec which effectively leverages 3D hardware.
Prior Art
In theorizing about this, it continually occurs to me that I can't possibly be the first person to attempt this (nor the ORBX.js people, for that matter). In googling the matter, I found various forum and Q&A posts where people asked whether it were possible to, e.g., accelerate JPEG decoding and presentation using 3D hardware, with no answers. I also found a blog post describing a plan to use 3D hardware to accelerate VP8 video decoding. It was a project done under the banner of Google's Summer of Code in 2011, though I'm not sure which open source group mentored the effort. The project did not end up producing the shader-based VP8 codec originally chartered, but it mentions that "The 'client side' of the VP8 VDPAU implementation is working and is currently being reviewed by the libvdpau maintainers." I'm not sure what that means. Perhaps it includes modifications to the public API that supports VP8, but is waiting for the underlying hardware to actually implement VP8 decoding blocks.

What's So Hard About This?
Video decoding is a computationally intensive task. GPUs are known to be really awesome at chewing through computationally intensive tasks. So why aren't GPUs a natural fit for decoding video codecs?

Generally, it boils down to parallelism, or rather the lack of opportunities for it. GPUs are really good at performing the exact same operations over lots of data at once. The problem is that decoding compressed video usually requires multiple phases that cannot be overlapped, and the work within each individual phase often cannot be parallelized either. In strictly mathematical terms, a compressed data stream needs to be decoded by applying a function f(x) over each data element, x_0 .. x_n. However, the function relies on having applied itself to the previous data element, i.e.:
f(x_n) = f(f(x_{n-1}))
What happens when you try to parallelize such an algorithm? Temporal rifts in the space/time continuum, if you're in a Star Trek episode. If you're in the real world, you get incorrect, unusable data, because the parallel computation is seeded with invalid values at multiple points (which is illustrated in some of the pictures in the aforementioned blog post about accelerated VP8).
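To make the serial dependency concrete, here is a minimal JavaScript sketch (illustrative only, not taken from any real decoder) of delta decoding in the style of JPEG's DC prediction: each output value requires the previous output value, so no element can be computed until its predecessor is finished.

    // Each decoded value depends on the previously decoded value,
    // so the loop iterations cannot be run in parallel.
    function deltaDecode(deltas) {
      const out = new Array(deltas.length);
      let previous = 0;
      for (let i = 0; i < deltas.length; i++) {
        previous = previous + deltas[i];  // needs the result of iteration i-1
        out[i] = previous;
      }
      return out;
    }

    // e.g., deltaDecode([5, -2, 3]) -> [5, 3, 6]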
Example: JPEG
Let’s take a very general look at the various stages involved in decoding the ubiquitous JPEG format:
What are the opportunities to parallelize these various phases?
- Huffman decoding (run-length decoding and zig-zag reordering are assumed to be rolled into this phase): there are not many opportunities for parallelizing the various Huffman formats out there, including this one. Decoding most Huffman streams is necessarily a sequential operation. I once hypothesized that it would be possible to engineer a codec to achieve some parallelism during the entropy decoding phase, and later found that On2's VP8 codec employs such a scheme. However, such a scheme is unlikely to break the work down to the fine granularity that WebGL would require.
- Reverse DC prediction: JPEG — and many other codecs — doesn't store full DC coefficients. It stores the differences between successive DC coefficients. Reversing this process can't be parallelized. See the discussion (and the delta-decoding sketch) in the previous section.
- Dequantize coefficients: This could be parallelized heavily. It should be noted that software decoders often don't dequantize all coefficients. Many coefficients are 0 and it's a waste of a multiplication operation to dequantize them. Thus, this phase is sometimes rolled into the Huffman decoding phase.
- Invert discrete cosine transform: This seems like it could be highly parallelizable. I will be exploring this further in this post.
- Convert YUV -> RGB for final display: This is a well-established use case for 3D acceleration (a minimal shader sketch follows this list).
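As a reference point, here is a minimal sketch of what that last conversion looks like as a WebGL fragment shader, written as a JavaScript string the way WebGL expects shader source to be supplied. The texture names, the interpolated texture coordinate, and the BT.601 full-range conversion coefficients are my own assumptions, not anything from ORBX.js:

    // Fragment shader: sample one Y, U, and V texture and emit RGB.
    const yuvToRgbFragmentShader = `
      precision mediump float;
      varying vec2 vTexCoord;          // supplied by the vertex shader
      uniform sampler2D yTexture;
      uniform sampler2D uTexture;
      uniform sampler2D vTexture;
      void main() {
        float y = texture2D(yTexture, vTexCoord).r;
        float u = texture2D(uTexture, vTexCoord).r - 0.5;
        float v = texture2D(vTexture, vTexCoord).r - 0.5;
        gl_FragColor = vec4(
          y + 1.402 * v,
          y - 0.344 * u - 0.714 * v,
          y + 1.772 * u,
          1.0);
      }
    `;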
Crash Course in 3D Shaders and Humility
So I wanted to see if I could accelerate some parts of JPEG decoding using something called shaders. I made an effort to understand 3D programming and its associated math throughout the 1990s, but 3D technology left me behind a very long time ago while I got mixed up in this multimedia stuff. So I plowed through a few books concerning WebGL (thanks to my new Safari Books Online subscription). After I learned enough about WebGL/JS to be dangerous and just enough about shader programming to be absolutely lethal, I set out to try my hand at optimizing IDCT using shaders.

Here's my extremely high-level (and probably hopelessly naive) view of the modern GPU shader programming model:
The WebGL program written in JavaScript drives the show. It sends a set of vertices into the WebGL system and each vertex is processed through a vertex shader. Then, each pixel that falls within a set of vertices is sent through a fragment shader to compute the final pixel attributes (R, G, B, and alpha value). Another consideration is textures: data that the program uploads to GPU memory and that the shaders can read programmatically.
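Here is a bare-bones sketch of that flow from the JavaScript side, assuming a canvas element on the page and vertex/fragment shader source strings (such as the fragment shader sketched earlier); the variable and attribute names are hypothetical:

    // Compile one shader stage (type is gl.VERTEX_SHADER or gl.FRAGMENT_SHADER).
    function compileShader(gl, type, source) {
      const shader = gl.createShader(type);
      gl.shaderSource(shader, source);
      gl.compileShader(shader);
      return shader;
    }

    const gl = document.querySelector('canvas').getContext('webgl');
    const program = gl.createProgram();
    gl.attachShader(program, compileShader(gl, gl.VERTEX_SHADER, vertexSource));
    gl.attachShader(program, compileShader(gl, gl.FRAGMENT_SHADER, fragmentSource));
    gl.linkProgram(program);
    gl.useProgram(program);

    // Textures: data uploaded to GPU memory that the shaders can sample.
    const texture = gl.createTexture();
    gl.bindTexture(gl.TEXTURE_2D, texture);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.LUMINANCE, width, height, 0,
                  gl.LUMINANCE, gl.UNSIGNED_BYTE, coefficientData);

    // Vertices go in, get processed by the vertex shader, and every covered
    // pixel is then handed to the fragment shader.
    const buffer = gl.createBuffer();
    gl.bindBuffer(gl.ARRAY_BUFFER, buffer);
    gl.bufferData(gl.ARRAY_BUFFER, new Float32Array(vertices), gl.STATIC_DRAW);
    const positionLoc = gl.getAttribLocation(program, 'aPosition');
    gl.enableVertexAttribArray(positionLoc);
    gl.vertexAttribPointer(positionLoc, 3, gl.FLOAT, false, 0, 0);
    gl.drawArrays(gl.TRIANGLES, 0, vertices.length / 3);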
These shaders (vertex and fragment) are key to the GPU's programmability. How are they programmed? Using a special C-like shading language. Thought I: "C-like language? I know C! I should be able to master this in short order!" So I charged forward with my assumptions and proceeded to get smacked down repeatedly by the overall programming paradigm. I came to recognize this as a variation of the scientific method: Develop a hypothesis (in my case, a mental model of how the system works); develop an experiment (short program) to prove or disprove the model; realize something fundamental that I was overlooking; formulate a new hypothesis; repeat.
First Approach: Vertex Workhorse
My first pitch goes like this:
- Upload DCT coefficients to GPU memory in the form of textures
- Program a vertex mesh that encapsulates 16×16 macroblocks
- Distribute the IDCT effort among multiple vertex shaders
- Pass transformed Y, U, and V blocks to the fragment shader, which will convert the samples to RGB
So the idea is that decoding of 16×16 macroblocks is parallelized. A macroblock embodies 6 blocks:
It would be nice to process one of these 6 blocks in each vertex. But that means drawing a square with 6 vertices. How do you do that? I eventually realized that drawing a square with 6 vertices is the recommended method for drawing a square on 3D hardware. Using 2 triangles, each with 3 vertices (0, 1, 2; 3, 4, 5):
A vertex shader knows which (x, y) coordinates it has been assigned, so it could figure out which sections of coefficients it needs to access within the textures. But how would a vertex shader know which of the 6 blocks it should process? Solution: Misappropriate the vertex’s z coordinate. It’s not used for anything else in this case.
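As an illustration of that trick, the vertex data for one macroblock's quad might be laid out like this (my own sketch; the x/y values are arbitrary clip-space coordinates, and the z component simply carries the block number for the vertex shader to read):

    // Two triangles forming a square; each vertex's z value smuggles in
    // the index (0..5) of the block that this vertex shader invocation
    // should transform.
    const macroblockVertices = new Float32Array([
      // x,    y,   z (block index)
      -1.0, -1.0,  0.0,
       1.0, -1.0,  1.0,
      -1.0,  1.0,  2.0,
      -1.0,  1.0,  3.0,
       1.0, -1.0,  4.0,
       1.0,  1.0,  5.0,
    ]);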
So I set all of that up. Then I hit a new roadblock: How to get the reconstructed Y, U, and V samples transported to the fragment shader? I have found that communicating between shaders is quite difficult. Texture memory? WebGL doesn’t allow shaders to write back to texture memory; shaders can only read it. The standard way to communicate data from a vertex shader to a fragment shader is to declare variables as “varying”. Up until this point, I knew about varying variables but there was something I didn’t quite understand about them and it nagged at me: If 3 different executions of a vertex shader set 3 different values to a varying variable, what value is passed to the fragment shader?
It turns out that the varying variable varies, which means that the GPU passes interpolated values to each fragment shader invocation. This completely destroys this idea.
Second Idea: Vertex Workhorse, Take 2
The revised pitch is to work around the interpolation issue by just having each vertex shader invocation perform all 6 block transforms. That seems like a lot of redundant work. However, I figured out that I can draw a square with only 4 vertices by arranging them in an 'N' pattern and asking WebGL to draw a TRIANGLE_STRIP instead of TRIANGLES. Now it's only doing 4x the extra work, not 6x. GPUs are supposed to be great at this type of work, so it shouldn't matter, right?

I wired up an experiment and then ran into a new problem: While I was able to transform a block (or at least pretend to), and load up a varying array (which wouldn't vary since all vertex shaders wrote the same values) to transmit to the fragment shader, the fragment shader can't access specific values within the varying block. To clarify, a WebGL shader can use a constant value — or a value that can be evaluated as a constant at compile time — to index into an array; a WebGL shader can not compute an index into an array. Per my reading, this is a WebGL security consideration and the limitation may not be present in other OpenGL(-ES) implementations.
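For reference, the 4-vertex strip version of the quad looks something like this sketch (again with my own illustrative values, reusing the gl context from the earlier sketch and assuming the position attribute is set up for 2 components):

    // Four vertices in an 'N' pattern; TRIANGLE_STRIP turns the consecutive
    // triplets (0,1,2) and (1,2,3) into the two triangles of the square.
    const quadStrip = new Float32Array([
      -1.0, -1.0,
       1.0, -1.0,
      -1.0,  1.0,
       1.0,  1.0,
    ]);
    gl.bufferData(gl.ARRAY_BUFFER, quadStrip, gl.STATIC_DRAW);
    gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);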
Not Giving Up Yet: Choking The Fragment Shader
You might want to be sitting down for this pitch:
- Vertex shader only interpolates texture coordinates to transmit to the fragment shader
- Fragment shader performs IDCT for a single Y sample, U sample, and V sample
- Fragment shader converts YUV -> RGB
Seems straightforward enough. However, that step concerning IDCT for Y, U, and V entails a gargantuan number of operations. When computing the IDCT for an entire block of samples, it's possible to leverage a lot of redundancy in the math, which equates to far fewer overall operations. If you absolutely have to compute each sample individually, an 8×8 block requires 64 multiplication/accumulation (MAC) operations per sample. For 3 color planes, and including a few extra multiplications involved in the RGB conversion, that tallies up to about 200 MACs per pixel. Then there's the fact that this approach means 4x redundant operations on the color planes.
It’s crazy, but I just want to see if it can be done. My approach is to pre-compute a pile of IDCT constants in the JavaScript and transmit them to the fragment shader via uniform variables. For a first order optimization, the IDCT constants are formatted as 4-element vectors. This allows computing 16 dot products rather than 64 individual multiplication/addition operations. Ideally, GPU hardware executes the dot products faster (and there is also the possibility of lining these calculations up as matrices).
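A sketch of what that fragment-shader arithmetic might look like for one sample, assuming the 64 coefficients and the 64 matching IDCT basis values for this sample position have been packed into vec4 uniforms (the uniform and function names are my own invention), again written as a GLSL string on the JavaScript side:

    const idctSnippet = `
      // 16 vec4s hold the 64 DCT coefficients; 16 more hold the precomputed
      // IDCT basis values for this one sample position. The sum of the 16
      // dot products reconstructs the sample.
      uniform vec4 coeffs[16];
      uniform vec4 basis[16];

      float reconstructSample() {
        float result = 0.0;
        for (int i = 0; i < 16; i++) {
          result += dot(coeffs[i], basis[i]);
        }
        return result;
      }
    `;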
I can report that I actually got a sample correctly transformed using this approach. Just one sample, though. Then I ran into some new problems:
Problem #1: Computing sample #1 vs. sample #0 requires a different table of 64 IDCT constants. Okay, so create a long table of 64 * 64 IDCT constants. However, this suffers from the same problem as seen in the previous approach: I can’t dynamically compute the index into this array. What’s the alternative? Maintain 64 separate named arrays and implement 64 branches, when branching of any kind is ill-advised in shader programming to begin with? I started to go down this path until I ran into…
Problem #2: Shaders can only be so large. 64 * 64 floats (4 bytes each) requires 16 kbytes of data and this well exceeds the amount of shader storage that I can assume is allowed. That brings this path of exploration to a screeching halt.
Further Brainstorming
I suppose I could forgo pre-computing the constants and directly compute the IDCT for each sample, which would entail many more multiplications as well as 128 cosine calculations per sample (384 considering all 3 color planes). I'm a little stuck with the transform idea right now. Maybe there are some other transforms I could try.

Another idea would be vector quantization. What little ORBX.js literature is available indicates that there is a method to allow real-time streaming but that it requires GPU assistance to yield enough horsepower to make it feasible. When I think of such severe asymmetry between compression and decompression, my mind drifts towards VQ algorithms. As I come to understand the benefits and limitations of GPU acceleration, I think I can envision a way that something similar to SVQ1, with its copious, hierarchical vector tables stored as textures, could be implemented using shaders.
So far, this all pertains to intra-coded video frames. What about opportunities for inter-coded frames? The only approach that I can envision here is to use WebGL’s readPixels() function to fetch the rasterized frame out of the GPU, and then upload it again as a new texture which a new frame processing pipeline could reference. Whether this idea is plausible would require some profiling.
Using interframes in such a manner seems to imply that the entire codec would need to operate in RGB space and not YUV.
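A rough sketch of that round trip in WebGL, under the assumption that the previous frame was just rendered to the default framebuffer (width and height are assumed to be known):

    // Pull the rasterized frame back out of the GPU...
    const pixels = new Uint8Array(width * height * 4);
    gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, pixels);

    // ...and push it right back in as a texture that the next frame's
    // shaders can sample as a reference. Note that readPixels hands back
    // RGBA data, which is what pushes the codec towards RGB space.
    const referenceTexture = gl.createTexture();
    gl.bindTexture(gl.TEXTURE_2D, referenceTexture);
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, width, height, 0,
                  gl.RGBA, gl.UNSIGNED_BYTE, pixels);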
Conclusions
The people behind ORBX.js have apparently figured out a way to create a shader-based video codec. I have yet to even begin to reason out a plausible approach. However, I'm glad I did this exercise since I have finally broken through my ignorance regarding modern GPU shader programming. It's nice to have a topic like multimedia that affords me a jumping-off point to explore other areas.

-
Making Sure The PNG Gets There
June 14, 2013, by Multimedia Mike — General

Rewind to 1999. I was developing an HTTP-based remote management interface for an embedded device. The device sat on an ethernet LAN and you could point a web browser at it. The pitch was to transmit an image of the device's touch screen and the user could click on the picture to interact with the device. So we needed an image format. If you were computing at the time, you know that the web was insufferably limited back then. Our choice basically came down to GIF and JPEG. Being the office's annoying free software zealot, I was championing a little-known up-and-coming format named PNG.
So the challenge was to create our own PNG encoder (incorporating a library like libpng wasn’t an option for this platform). I seem to remember being annoyed at having to implement an integrity check (CRC) for the PNG encoder. It’s part of the PNG spec, after all. It just seemed so redundant. At the time, I reasoned that there were 5 layers of integrity validation in play.
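For reference, the check in question is the standard CRC-32 that terminates every PNG chunk, computed over the chunk type and chunk data fields. A minimal JavaScript sketch of what such an encoder has to implement (not the original 1999 code, obviously):

    // Standard CRC-32 (polynomial 0xEDB88320) as required by the PNG spec.
    const crcTable = new Uint32Array(256);
    for (let n = 0; n < 256; n++) {
      let c = n;
      for (let k = 0; k < 8; k++) {
        c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
      }
      crcTable[n] = c;
    }

    // 'bytes' covers the 4-byte chunk type plus the chunk data.
    function pngCrc(bytes) {
      let crc = 0xFFFFFFFF;
      for (let i = 0; i < bytes.length; i++) {
        crc = crcTable[(crc ^ bytes[i]) & 0xFF] ^ (crc >>> 8);
      }
      return (crc ^ 0xFFFFFFFF) >>> 0;
    }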
I don’t know why, but I was reflecting on this episode recently and decided to revisit it. Here are all the encapsulation layers of a PNG file when flung over an ethernet network:
So there are up to 5 encapsulations for the data in this situation. At the innermost level is the image data, which is compressed with the zlib DEFLATE method. At first, I thought that this also had a CRC or checksum. It turns out the zlib wrapper does carry one: an Adler-32 checksum of the uncompressed data (the raw DEFLATE blocks themselves have no integrity check of their own). Further, I don't think we bothered to compress the PNG data in this project long ago. It was a small image, monochrome, and transferring via LAN, so the encoder could get away with signaling uncompressed data.
The graphical data gets wrapped up in a PNG chunk and all PNG chunks have a CRC. To transmit via the network, it goes into a TCP segment, which also has a checksum. That goes into an IP packet. I previously believed that this represented another integrity check. While an IP packet does have a checksum, the checksum only covers the IP header and not the payload. So that doesn't really count towards this goal.
Finally, the data gets encapsulated into an ethernet frame which has — you guessed it — a CRC.
I see that other link layer protocols like PPP and wireless ethernet (802.11) also feature frame CRCs. So I guess what I’m saying is that, if you transfer a PNG file over the network, you can be confident that the data will be free of any errors.
-
How To Play Hardware Accelerated Video on A Mac
May 28, 2013, by Multimedia Mike — General

I have a friend who was considering purchasing a Mac Mini recently. At the time of this writing, there are 3 desktop models (and 2 more "server" models).
The cheapest one is a Core i5 2.5 GHz. Then there are 2 Core i7 models: 2.3 GHz and 2.6 GHz. The difference between the latter 2 is US$100 and the only appreciable technical difference is the extra 0.3 GHz, so the choice came down to those 2.
He asked me which one would be able to play HD video at full frame rate. I found this query puzzling. But then, I have been “in the biz” for a bit too long. Whether or not a computer or device can play a video well depends on a lot of factors.
Hardware Support
First of all, looking at the raw speed of the general-purpose CPU inside a computer as a gauge of video playback performance is generally misguided in this day and age. In general, we have a video standard (H.264, which I'll focus on for this post) and many bits of hardware are able to accelerate decoding it. So the question is not whether the CPU can decode the data in real time, but whether any other hardware in the device (likely the graphics hardware) can handle it. These machines have Intel HD 4000 graphics and, per my reading of the literature, they are capable of accelerating H.264 video decoding.

Great, so the hardware supports accelerated decoding. So it's a done deal, right? Not quite…
Operating System Support
An application can't do anything pertaining to hardware without permission from the operating system. So the next question is: Does Mac OS X allow an application to access the accelerated video decoding hardware if it's available? This used to be a contentious matter (notably, Adobe Flash Player was unable to accelerate H.264 playback on the Mac in the absence of such an API), but then Apple released an official API detailed in Technical Note TN2267.

So, does this mean that video is magically accelerated? Nope, we're still not there yet…
Application Support
It's great that all of these underlying pieces are in place, but if an individual application chooses to decode the video directly on the CPU, it's all for naught. An application needs to query the facilities and direct data through the API if it wants to leverage the acceleration. Obviously, at this point it becomes a matter of "which application?"

My friend eventually opted for the pricier of the desktop Mac Mini models and we ran some ad-hoc tests, since I was curious how widespread the acceleration support is among Mac multimedia players. Here are some programs I wanted to test, playing 1080p H.264:
- Apple QuickTime Player
- VLC
- YouTube with Flash Player (any browser)
- YouTube with Safari/HTML5
- YouTube with Chrome/HTML5
- YouTube with Firefox/HTML5
- Netflix
I didn’t take exhaustive notes but my impromptu tests revealed QuickTime Player was, far and away, the most performant player, occupying only around 5% of the CPU according to the Mac OS X System Profiler graph (which is likely largely spent on audio decoding).
VLC consistently required 20-30% CPU, so it’s probably leveraging some acceleration facilities. I think that Flash Player and the various HTML5 elements performed similarly (their multi-process architectures can make such a trivial profiling test difficult).
The outlier was Netflix running in Firefox via Microsoft's Silverlight plugin. Of course, the inner workings of Netflix's technology are opaque to outsiders and we don't even know if it uses H.264. It may very well use Microsoft's VC-1, which is not a capability provided by the Mac OS X acceleration API (and it doesn't look like the Intel HD 4000 chip can handle it either). I have never seen any data one way or the other about how Netflix encodes video. However, I was able to see that Netflix required an enormous amount of CPU muscle on the Mac platform.
Conclusion
The foregoing is a slight simplification of the video playback pipeline. There are some other considerations, most notably how the video is displayed afterwards. To circle back around to the original question: Can the Mac Mini handle full HD video playback? As my friend found, the meager Mac Mini can do an admirable job of playing full HD video without loading down the CPU.

-
Survey of CD Image Formats
April 30, 2013, by Multimedia Mike — General

In the course of exploring and analyzing the impressive library of CD images curated at the Internet Archive's Shareware CD collection, one encounters a wealth of methods for copying a complete CD image onto other media for transport. In researching the formats, I have found that many of them are native to various binary, proprietary CD programs that operate under Windows. Since I have an interest in interpreting these image formats, and I would also like to do so outside of Windows, I thought to conduct a survey to determine whether enough information exists to write processing tools of my own.
Remember from my Grand Unified Theory of Compact Disc that CDs, from a high enough level of software abstraction, are just strings of 2352-byte sectors broken up into tracks. The difference among various types of CDs comes down to the specific meaning of these 2352 bytes.
Most imaging formats rip these strings of sectors into a giant file and then record some metadata about the tracks and sectors.
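As a trivial illustration of that model, pulling raw sector N out of a straight sector dump is just an offset calculation. A sketch using Node.js file APIs (the file name is made up):

    const fs = require('fs');

    const SECTOR_SIZE = 2352;

    // Read one raw 2352-byte sector from a plain sector-dump image.
    function readSector(fd, sectorNumber) {
      const sector = Buffer.alloc(SECTOR_SIZE);
      fs.readSync(fd, sector, 0, SECTOR_SIZE, sectorNumber * SECTOR_SIZE);
      return sector;
    }

    const fd = fs.openSync('image.bin', 'r');
    const firstSector = readSector(fd, 0);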
ISO
This is perhaps the most common method for storing CD images. It's generally only applicable to data CD-ROMs. File images generally end with a .iso extension. This refers to ISO-9660, which is the standard CD filesystem.

Sometimes, disc images ripped from other types of discs (like Xbox/360 or GameCube discs) bear the extension .iso, which is a bit of a misnomer since they aren't formatted using the ISO-9660 filesystem. But the extension sort of stuck.
BIN / CUE
I see the BIN & CUE file format combination quite frequently. Reportedly, a program named CDRWIN deployed this format first. This format can handle a mixed-mode CD (e.g., one that starts with a data track and is followed by a series of audio tracks), whereas ISO can only handle the data track. The BIN file contains the raw sector data while the CUE file is a text file that describes how the BIN file is formatted (how many bytes in a sector, how many sectors belong to each individual track); a small example follows.
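A hypothetical CUE sheet for a mixed-mode disc (one data track followed by two audio tracks; the file name and timestamps are made up) looks roughly like this:

    FILE "image.bin" BINARY
      TRACK 01 MODE1/2352
        INDEX 01 00:00:00
      TRACK 02 AUDIO
        INDEX 01 23:41:17
      TRACK 03 AUDIO
        INDEX 01 27:03:65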
CDI

This originates from a program called DiscJuggler. It is extremely prevalent in the Sega Dreamcast hobbyist community for some reason. I studied the raw hex dumps of some sample CDI files but there was no obvious data (mostly 0s). There is an open source utility called cdi2iso which is able to extract an ISO image from a CDI file. The program's source clued me in that the metadata is actually sitting at the end of the image file. This makes sense when you consider how a ripping program needs to operate: copy tracks, sector by sector, and then do something with the metadata after the fact. Options include: 1) write the metadata at the end of the file (as seen here); 2) write the metadata into a separate file (seen in other formats on this list); 3) write the metadata at the beginning of the file, which would require a full rewrite of the entire (usually large) image file (I haven't seen this yet).

Anyway, I believe I have enough information to write a program that can interpret a CDI file. The reason this format is favored for Dreamcast disc images is likely the extreme weirdness of Dreamcast discs (it's complicated, but it eventually fits into my Grand Unified Theory of CDs, if you look at it from a high level).
MDF / MDS
MDF and MDS pairs come from a program called Alcohol 120%. The MDF file has the data while the MDS file contains the metadata. The metadata is in an opaque binary format, though. Thankfully, the Wikipedia page links to a description of the format. That's another image format down.

CCD / SUB / IMG
The CloneCD Control File is one I just ran across today thanks to a new image posted at the IA Shareware Archive (see Super Duke Volume 2). I haven't found any definitive documentation on this, but it also doesn't seem too complicated. The .ccd file is a text file that is pretty self-explanatory. The sample linked above, however, only has a .ccd file and a .sub file. I'm led to believe that the .sub file contains subchannel information while a .img file is supposed to contain the binary data. So this rip might be incomplete (nope, the .img file is on the page, in the sidebar; thanks to Phil in the comments for pointing this out). The .sub file is a bit short compared to the Archive's description of the disc's contents (only about 4.6 MB of data) and, when I briefly scrolled through it, it didn't look like it contains any real computer data. So it probably is just the disc's subchannel data (something I glossed over in my Grand Unified Theory).

CSO
I have dealt with the CISO (compressed ISO) format before. It's basically the same as a .iso file described above except that each individual 2048-byte data sector is compressed using zlib. The format boasts up to 9 compression levels, which shouldn't be a big surprise since that correlates to zlib's own compression tiers.

Others
Wikipedia has a category for optical disc image formats. Of course, there are numerous others. However, I haven't encountered them in the wild for the purpose of broad image distribution.

-
Sequencing MIDI From A Chiptune
April 28, 2013, by Multimedia Mike — Outlandish Brainstorms

The feature requests for my game music appreciation website project continue to pour in. Many of them take the form of "please add player support for system XYZ and the chiptune library to go with it." Most of these requests are A) plausible and B) in process. I have also received recommendations for UI improvements, which I take under consideration. Then there are the numerous requests to port everything from Native Client to JavaScript so that it will work everywhere, even on mobile, a notion which might take a separate post to debunk entirely.
But here’s an interesting request about which I would like to speculate: Automatically convert a chiptune into a MIDI file. I immediately wanted to dismiss it as impossible or highly implausible. But, as is my habit, I started pondering the concept a little more carefully and decided that there’s an outside chance of getting some part of the idea to work.
Intro to MIDI
MIDI stands for Musical Instrument Digital Interface. It's a standard musical interchange format that allows musical instruments and computers to exchange musical information. The file interchange format bears the extension .mid and contains a sequence of numbers that translate into commands separated by time deltas. E.g.: turn key on (this note, this velocity); wait x ticks; turn key off; wait y ticks; etc. I'm vastly oversimplifying, as usual.

MIDI fascinated me back in the days of dialup internet and discrete sound cards (see also my write-up on the Gravis Ultrasound). Typical song-length MIDI files often ranged from a few kilobytes to a few tens of kilobytes. They were significantly smaller than the MOD et al. family of tracker music formats, mostly by virtue of the fact that MIDI files aren't burdened with transporting digital audio samples.
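To make the "commands separated by time deltas" idea concrete, here is a sketch of a few raw bytes from a Standard MIDI File track, annotated as a JavaScript array (the note and velocity values are arbitrary examples):

    // delta-time byte(s), then event bytes: one note played on channel 0
    const trackEvents = [
      0x00,              // delta time: 0 ticks
      0x90, 0x3C, 0x64,  // note on, channel 0, note 60 (middle C), velocity 100
      0x60,              // delta time: 96 ticks later
      0x80, 0x3C, 0x40,  // note off, channel 0, note 60, release velocity 64
    ];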
I know I'm missing a lot of details. I haven't dealt much with MIDI in the past… 15 years or so (ever since computer audio became a blur of MP3 and AAC audio). But I'm led to believe it's still relevant. The individual who requested this feature expressed an interest in being able to import the sequenced data into any of the many music programs that can interpret .mid files.

The Pitch
To limit the scope, let's focus on music that comes from the 8-bit Nintendo Entertainment System or the original Game Boy. The former features 2 square wave channels, a triangle wave, a noise channel, and a limited digital channel. The latter creates music via 2 square waves, a wave channel, and a noise channel. The roles that these various channels play typically break down as: the square waves carry the primary melody, the triangle wave simulates a bass line, the noise channel approximates a variety of percussive sounds, and the DPCM/wave channels are fairly free-form. They can carry random game sound effects or, if they are to assist in the music, are often used for more authentic percussive sounds.

The various channels are controlled via an assortment of memory-mapped hardware registers. These registers are fed values such as frequency, volume, and duty cycle. My idea is to modify the music playback engine to track when various events occur. Whenever a channel is turned on or off, that corresponds to a MIDI key-on or key-off event. If a channel is already playing but a new frequency is written, that would likely count as a note change, so log a key-off event followed by a new key-on event.
There is the major obstacle of determining which specific note is represented by a channel in a particular state. The MIDI standard defines 128 different notes spanning 11 octaves. Empirically, I wonder if I could create a table which maps the assorted frequencies to different MIDI notes?
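There is also a closed-form shortcut that could build or sanity-check such a table: MIDI note 69 is defined as A440 and notes are spaced one semitone apart, so a channel frequency can be mapped to the nearest note number directly. A quick JavaScript sketch:

    // Map a channel frequency in Hz to the nearest MIDI note number
    // (note 69 = A440, 12 semitones per octave).
    function frequencyToMidiNote(freqHz) {
      return Math.round(69 + 12 * Math.log2(freqHz / 440));
    }

    // e.g., frequencyToMidiNote(261.63) -> 60 (middle C)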
I think this strategy would only work with the square and triangle waves. Noise and digital channels? I’m not prepared to tackle that challenge.
Prior Work?
I have to wonder if there is any existing work in this area. I'm certain that people have wanted to do this before; I wonder if anyone has succeeded?

Just like reverse engineering a binary program entails trying to obtain a higher level abstraction of a program from a very low level representation, this challenge feels like reverse engineering a piece of music as it is being performed and automatically expressing it in a higher level form.