Breaking Eggs And Making Omelettes

A blog dealing with technical multimedia matters, binary reverse engineering, and the occasional video game hacking.

http://multimedia.cx/eggs/

Articles published on the site

  • Xbox Sphinx Protocol

    21 October 2013, by Multimedia Mike (DRM, xbox)

    I’ve gone down the rabbit hole of trying to read the Xbox DVD drive from Linux. Honestly, I’m trying to remember why I even care at this point. Perhaps it’s just my metagame of trying to understand how games and related technologies operate. In my last post on the matter, I determined that it is possible to hook an Xbox drive up to a PC using a standard 40-pin IDE interface and read data sectors. However, I learned that just because the Xbox optical drive is reading an Xbox disc, that doesn’t mean it’s just going to read the sectors in response to a host request.

    Oh goodness, no. The drive is going to make the host work for those sectors.

    To help understand the concept of locked/unlocked sectors on an Xbox disc, I offer this simplistic diagram:


    Xbox locked disc diagram

    Any DVD drive (including the Xbox drive) is free to read those first 6992 sectors (about 14 MB of data), which just contain a short DVD video asking the user to insert the disc into a proper Xbox console. Reading the remaining sectors involves performing a sequence of SCSI commands that I have taken to calling the “Sphinx Protocol” for reasons I will explain later in this post.

    References
    Doing a little Googling after my last post on the matter produced this site hosting deep, technical Xbox information. It even has a page about exactly what I am trying to achieve: Use an Xbox DVD Drive in Your PC. The page provides a tool named dvdunlocker written by “The Specialist” to perform the necessary unlocking. The archive includes a compiled Windows binary as well as its source code. The source code is written in Delphi Pascal and leverages Windows SCSI APIs. Still, it is well commented and provides a roadmap, which I will try to describe in this post.

    Sphinx Protocol
    Here is a rough flowchart of the steps that are (probably) involved in the unlocking of those remaining sectors. I reverse engineered this based on the Pascal tool described in the previous section. Disclaimer: at the time of this writing, I haven’t tested all of the steps due to some Linux kernel problems, described later.


    Xbox SCSI Unlock Protocol

    Concerning the challenge/response table that the drive sends back, it’s large (0x664, or 1636, bytes), and not all of the bytes’ meanings are known. However, these are the bytes that seem to be necessary (all multi-byte numbers are big-endian):

     bytes 0-1        Size of mode page payload data (should be 0x0662)
     bytes 2-771      Unknown
     byte  772        Should be 1
     byte  773        Number of entries in challenge/response table
     bytes 774-1026   Encrypted challenge/response table
     bytes 1027-1186  Unknown
     bytes 1187-1230  Key basis (44 bytes)
     bytes 1231-1635  Unknown
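
    For reference, here is the same layout captured as C offset constants; the names are mine, not anything from the original tool:

     /* offsets into the 0x664-byte mode page payload; multi-byte values
      * are big-endian */
     #define XBOX_PAYLOAD_SIZE_OFFSET  0     /* 2 bytes; should read 0x0662 */
     #define XBOX_VALID_FLAG_OFFSET    772   /* 1 byte; should be 1 */
     #define XBOX_TABLE_COUNT_OFFSET   773   /* number of challenge/response entries */
     #define XBOX_TABLE_OFFSET         774   /* 253 encrypted table bytes */
     #define XBOX_TABLE_LEN            253
     #define XBOX_KEY_BASIS_OFFSET     1187  /* 44 bytes that feed the key derivation */
     #define XBOX_KEY_BASIS_LEN        44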
    

    The challenge/response table is the interesting part, but it’s encrypted with RC4, a.k.a. ARCFOUR. The key is derived from the 44 bytes I have labeled “key basis” (cryptographic literature probably has a better term for it; chime in if you know what that might be). An SHA-1 hash is computed over the 44 bytes.

    The resulting SHA-1 hash (the first part of it, to be exact) is fed as the key into the RC4 decryption. The output of SHA-1 contains 160 bits of information, and 160 / 8 = 20 bytes. Expressing this as a printable hex digest requires 40 characters. The SHA-1 hash is converted to a hex digest and then the first 7 characters are fed into the RC4 initialization function as the key. Then, the RC4 decrypter does its work on the 253 bytes of the challenge/response table.
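
    To make that derivation concrete, here is a minimal C sketch of the scheme as I understand it from the Pascal source. OpenSSL provides the SHA-1; the function names, and the assumption of a lowercase hex digest, are mine:

     /* Key derivation and table decryption as described above: SHA-1 over
      * the 44-byte key basis, hex digest, first 7 digest characters used
      * as the RC4 key. Requires OpenSSL (link with -lcrypto). */
     #include <stdio.h>
     #include <openssl/sha.h>

     static void rc4_crypt(const unsigned char *key, int key_len,
                           unsigned char *buf, int buf_len)
     {
         unsigned char S[256], t;
         int i, j, k;

         /* key-scheduling algorithm (KSA) */
         for (i = 0; i < 256; i++)
             S[i] = i;
         for (i = 0, j = 0; i < 256; i++) {
             j = (j + S[i] + key[i % key_len]) & 0xFF;
             t = S[i]; S[i] = S[j]; S[j] = t;
         }

         /* pseudo-random generation algorithm (PRGA); XOR keystream in place */
         for (i = 0, j = 0, k = 0; k < buf_len; k++) {
             i = (i + 1) & 0xFF;
             j = (j + S[i]) & 0xFF;
             t = S[i]; S[i] = S[j]; S[j] = t;
             buf[k] ^= S[(S[i] + S[j]) & 0xFF];
         }
     }

     /* key_basis: the 44 bytes at offset 1187; table: the 253 bytes at
      * offset 774, decrypted in place */
     void decrypt_challenge_table(const unsigned char *key_basis,
                                  unsigned char *table)
     {
         unsigned char hash[SHA_DIGEST_LENGTH];   /* 20 bytes */
         char digest[SHA_DIGEST_LENGTH * 2 + 1];  /* 40 hex chars + NUL */
         int i;

         SHA1(key_basis, 44, hash);
         for (i = 0; i < SHA_DIGEST_LENGTH; i++)
             sprintf(digest + 2 * i, "%02x", hash[i]);  /* assuming lowercase */

         /* only the first 7 characters of the hex digest form the RC4 key */
         rc4_crypt((const unsigned char *)digest, 7, table, 253);
     }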

    So that’s why I took to calling this the “Sphinx Protocol”: I felt like I was being challenged with a bizarre riddle. Perhaps that describes a lot of cryptosystems, though. You have to admit it sounds kind of cool.

    The challenge/response table contains 23 11-byte records. The format of this table is (again, multi-byte numbers are big-endian):

     byte  0     This is 1 if this challenge/response pair is valid
     byte  1     Challenge ID
     bytes 2-5   Challenge
     byte  6     Response ID
     bytes 7-10  Response
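
    As a quick sketch, each 11-byte record can be pulled apart in C like this (helper names are mine); the printf format mirrors the table dumps shown below:

     /* Walk the 23 records (23 * 11 = 253 bytes) of a challenge/response
      * table and print them. */
     #include <stdio.h>

     static unsigned int be32(const unsigned char *p)
     {
         return ((unsigned int)p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3];
     }

     void dump_cr_table(const unsigned char *table)
     {
         int i;
         for (i = 0; i < 23; i++) {
             const unsigned char *rec = &table[i * 11];
             printf("%4d: %3d, %3d: 0x%08X; %3d: 0x%08X%s\n",
                    i,
                    rec[0],         /* valid flag (1 = valid pair) */
                    rec[1],         /* challenge ID */
                    be32(&rec[2]),  /* challenge */
                    rec[6],         /* response ID */
                    be32(&rec[7]),  /* response */
                    rec[0] == 1 ? " ****" : "");
         }
     }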
    

    Example
    It’s useful to note that the challenge/response table and the associated key are different for every disc (at least all the ones I have looked at). So this might be data that comes from the disc itself, since the values will always be the same for a given disc.

    Let’s examine Official Xbox Magazine disc #16 (Indiana Jones and The Emperor’s Tomb):


    Xbox Magazine #16 featuring Indiana Jones

    Before I decrypt the challenge/response table, it looks like this:

       0: 180, 172: 0xEB100059;  66: 0xD56AFB56
       1:  34,  71: 0x8F9BF03A; 192: 0xC32CBDF8
       2: 226, 216: 0xA29B77F2;  12: 0x4474A6F1
       3:  72, 122: 0x9F5ABF33; 255: 0xC5E3C304
       4:   1, 103: 0x76142ADA; 233: 0xDE145D42 ****
       5:  49, 193: 0xA1CD6192; 189: 0x2169DBA5
       6: 182, 250: 0x9977894F;  96: 0x5A929E2B
       7: 148,  71: 0x6DD10A54; 115: 0xF0BDAC4F
       8:  12,  45: 0x5D5EB6FD; 148: 0x84E60A00
       9:  99, 121: 0xFEAED372; 201: 0xDA9986F9
      10: 172, 230: 0xE6C0D0B4; 214: 0x9050C250
      11:  84,  65: 0x95CB8775; 104: 0x550886C6
      12: 210,  65: 0x1ED23619; 171: 0x6DF4A35B
      13:   2, 155: 0xD0AAE1E0; 130: 0x00D1FFCF
      14:  40,   2: 0x172EFEB8; 159: 0x37E03E50
      15:  49,  15: 0x43E5E378; 223: 0x267F9C9A
      16: 240, 173: 0x357D5D1C; 250: 0x24965D67
      17:  80, 184: 0x5E7AF1A3;  81: 0x3A8F69A7
      18: 154, 186: 0x6626BEAC; 245: 0xE639540A
      19: 231, 249: 0xFABAAFB7; 227: 0x4C686A07
      20: 150, 186: 0x9A6D7AA3; 133: 0x25971CF0
      21: 236, 192: 0x5CD97DD4; 247: 0x26655EFB
      22:  68, 173: 0xE2D372E4; 207: 0x103FBF94
    there are 1 valid pairs in the list: 4
    

    My best clue that it’s not right is that there is only 1 valid entry (denoted by my tool using ****). The source I reverse engineered for this data indicates that there needs to be at least 2 valid pairs. After running the RC4 decryption on the table, it looks like this and I get far more valid pairs:

       0:   1, 174: 0xBD628255;   0: 0x9F0A31AF ****
       1:   2, 176: 0x3151B341;   2: 0x9C87C180
       2:   3, 105: 0x018879E5;   1: 0xFF068B5C
       3:   2,   7: 0x1F316AAF;   3: 0xF420D3ED
       4:   3,  73: 0xC2EBFBE9;   0: 0x17062B5B
       5: 252, 163: 0xFF14B5CB; 236: 0xAF813FBC
       6:   2, 233: 0x5EE95C49;   1: 0x37AA5511
       7:   1, 126: 0xBD628255;   0: 0x5BA3FBD4 ****
       8:   3,   4: 0xB68BFEE6;   3: 0xA8F3B918
       9:   3,  32: 0xEA614943;   2: 0xA678D715
      10:   2, 248: 0x1BDD374E;   0: 0x8D2AC2C7
      11:   3,  17: 0x0EABCE81;   2: 0xC90A7242
      12:   1, 186: 0xBD628255;   0: 0xC4820242 ****
      13:   3, 145: 0xB178F942;   3: 0x4D78AD62
      14:   3,  37: 0x4A6CE5E2;   2: 0xBF94E1C6
      15:   1, 102: 0xBD628255;   0: 0xFFB83D8D ****
      16:   3, 122: 0xF97B0905;   1: 0x38533125
      17:   3, 197: 0x57A6865D;   2: 0xA61D31EF
      18:   3,  27: 0xC7227D7C;   2: 0xA3F9BA1E
      19:   1,  16: 0xBD628255;   0: 0x8557CCC8 ****
      20:   2,  53: 0x1DA9D156;   3: 0xC9051754
      21:   2,  90: 0x3CD66BEE;   3: 0xFD851D3E
      22:   1, 252: 0xBD628255;   0: 0xB3F22701 ****
    there are 6 valid pairs in the list: 0 7 12 15 19 22
    

    So, hopefully, I have the decryption correct.

    Also of note is that you only get one chance to get this unlocking correct: fail, and the drive won’t return a valid DVD structure block again. You will either need to reboot the Xbox or eject & close the tray before you get to try again.

    Problems Making It Work In Linux
    There are a couple of ways to play with SCSI protocols under Linux. In more recent kernels, block devices are named /dev/sda, /dev/sdb, etc. Each of these block devices has a corresponding character device named /dev/sg0, /dev/sg1, etc.; ‘sg’ stands for SCSI generic. These character devices can be opened readable and/or writable, and SCSI commands can be freely written to them with write() and the resulting data retrieved with read(). Pretty powerful.

    Except that the one machine I still possess that supports 40-pin IDE/ATAPI devices is running Linux kernel 2.6.24, which dates back to early 2008 and still enumerates the IDE block devices as /dev/hda, /dev/hdb, etc. There are no corresponding /dev/sgX character devices. What to do? It seems that a program can still issue SCSI commands using an ioctl() facility named SG_IO.
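
    For illustration, here is roughly what driving a drive through SG_IO looks like in C: a 10-byte MODE SENSE CDB requesting the 0x664-byte payload described earlier. The page code 0x3E is a placeholder, not a documented Xbox value, and error handling is trimmed:

     /* Minimal SG_IO sketch: send one SCSI command and report the status. */
     #include <fcntl.h>
     #include <stdio.h>
     #include <string.h>
     #include <sys/ioctl.h>
     #include <scsi/sg.h>

     int main(void)
     {
         /* MODE SENSE(10); byte 2 = page code (placeholder),
          * bytes 7-8 = allocation length (0x0664) */
         unsigned char cdb[10] = { 0x5A, 0, 0x3E, 0, 0, 0, 0, 0x06, 0x64, 0 };
         unsigned char data[0x664];
         unsigned char sense[32];
         sg_io_hdr_t io;

         int fd = open("/dev/hdd", O_RDONLY | O_NONBLOCK);
         if (fd < 0) { perror("open"); return 1; }

         memset(&io, 0, sizeof(io));
         io.interface_id    = 'S';
         io.cmdp            = cdb;
         io.cmd_len         = sizeof(cdb);
         io.dxfer_direction = SG_DXFER_FROM_DEV;  /* drive -> host */
         io.dxferp          = data;
         io.dxfer_len       = sizeof(data);
         io.sbp             = sense;
         io.mx_sb_len       = sizeof(sense);
         io.timeout         = 5000;  /* milliseconds */

         if (ioctl(fd, SG_IO, &io) < 0)
             perror("SG_IO");
         else
             printf("SCSI status = 0x%02X\n", io.status);
         return 0;
     }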

    I was able to make the SG_IO ioctl() work for the most part (except for the discovery that the Xbox drive doesn’t respond to a basic SCSI Inquiry command). However, I ran into a serious limitation: a program can only open a /dev/hdX block device in read-only mode if the device corresponds to a read-only drive like, for example, a DVD-ROM drive. This means that a program can’t issue SCSI mode select commands to the drive, since that counts as writing. This in turn means that my tool can’t unlock the drive.

    Current Status
    So this is where my experiment is blocked right now. I have been trying to compile various Linux kernels to remedy the situation. But I always seem to find myself stuck in one of 2 situations, depending on the configuration options I choose: Either the drives are enumerated with the /dev/hdX convention and I am stuck in read-only mode (with no mode select); or the drives are enumerated with /dev/sdX along with corresponding /dev/sgN character devices, in which case the kernel does not recognize the Xbox DVD-ROM drive.

    This makes me wonder if there’s a discrepancy between the legacy ATA/ATAPI drivers (which see the drive) and the newer SATA/PATA subsystem (which doesn’t). I also wonder about hacking the kernel logic to allow SCSI mode select commands to proceed to the device on a read-only file handle.

  • Interfacing to an Xbox Optical Drive

    1 October 2013, by Multimedia Mike (xbox)

    The next generation Xbox is going to hit the streets soon. But for some reason, I’m still interested in the previous generation’s unit (i.e., the original Xbox). Specifically, I’ve always wondered if it’s possible to use the original Xbox’s optical drive in order to read Xbox discs from Linux. I was never curious enough to actually buy an Xbox just to find out but I eventually came across a cast-off console on a recycle pile.

    I have long known that the Xbox has what appears to be a more or less standard optical drive with a 40-pin IDE connector. The only difference is the power adapter which I surmise is probably the easiest way to turn a bit of standardized hardware into a bit of proprietary hardware. The IDE and power connectors look like this:


    Xbox optical drive connections

    Thus, I wanted to try opening an Xbox and plugging the optical drive into a regular PC, albeit one that supports IDE cables, and allow the Xbox to supply power to the drive. Do you still have hardware lying around that has 40-pin IDE connectors? I guess my Mac Mini PPC fits the bill, but I’ll be darned if I’m going to pry that thing open again. I have another IDE-capable machine buried in my closet, last called into service when I needed a computer with a native RS-232 port 3 years ago. The ordeal surrounding making this old computer useful right now can be another post entirely.

    Here’s what the monstrosity looks like thanks to characteristically short IDE cable lengths:


    Xbox optical drive connected directly to PC



    Process:

    1. Turn on Xbox first
    2. Turn on PC

    Doing these things in the opposite order won’t work since the kernel really wants to see the drive when booting up. Inspecting the 'dmesg' log afterward reveals interesting items:

    hdd: PHILIPS XBOX DVD DRIVE, ATAPI CD/DVD-ROM drive
    hdd: host max PIO5 wanted PIO255(auto-tune) selected PIO4
    hdd: UDMA/33 mode selected
    [...]
    hdd: ATAPI DVD-ROM drive, 128kB Cache

    Why is that interesting? When is the last time you saw disk devices prefixed by ‘hd’ rather than ‘sd’? Blast from the past. Oh, and the optical drive’s vendor string clearly indicates that this is an Xbox drive saying ‘hi!’.

    Time To Read
    When I first studied an Xbox disc in a normal optical drive, I noticed that I was able to read 6992 2048-byte sectors — about 14 MB of data — as reported by the disc table of contents (TOC). This is just enough data to play a standard DVD video animation that kindly instructs the viewer to please use a proper Xbox. At this point, I estimated that there must be something special about Xbox optical drive firmware that knows how to read alternate information on these discs and access further sectors.

    I ran my TOC query tool with an Xbox Magazine demo disc in the optical drive and it reported substantially more than 6992 sectors, enough to account for more than 2 GB of data. That’s promising. I then tried running 'dd' against the device and it was able to read… about 14 MB, an exact quantity of bytes that, when divided by 2048 bytes/sector, yields 6992 sectors.
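
    The read test amounts to nothing more than the following sketch (the device node is whatever the kernel assigned; /dev/hdd in my case):

     /* Count how many 2048-byte sectors the drive will surrender; with an
      * Xbox disc in a locked drive this stops at 6992 (about 14 MB). */
     #include <fcntl.h>
     #include <stdio.h>
     #include <unistd.h>

     int main(void)
     {
         unsigned char sector[2048];
         long count = 0;
         int fd = open("/dev/hdd", O_RDONLY);
         if (fd < 0) { perror("open"); return 1; }
         while (read(fd, sector, sizeof(sector)) == (ssize_t)sizeof(sector))
             count++;
         printf("read %ld sectors (%ld bytes)\n", count, count * 2048L);
         close(fd);
         return 0;
     }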

    Future (Past?) Work
    Assuming Google is your primary window into the broader internet, the world is beginning to lose its memory of things pertaining to the original Xbox (Microsoft’s naming scheme certainly doesn’t help searches). What I’m saying is that it can be difficult to find information about this stuff now. However, I was able to learn that a host needs to perform a sort of cryptographic handshake with the drive at the SCSI level before it is allowed to access the forbidden areas of the disc. I think. I’m still investigating this and will hopefully post more soon.

  • Game Music Appreciation, One Year Later

    1 August 2013, by Multimedia Mike (General)

    I released my game music website last year about this time. It was a good start and had potential to grow in a lot of directions. But I’m a bit disappointed that I haven’t evolved it as quickly as I would like to. I have made a few improvements, like adjusting the play lengths of many metadata-less songs and revising the original atrocious design of the website using something called Twitter Bootstrap (and, wow, once you know what Bootstrap is, you start noticing it everywhere on the modern web). However, here are a few of the challenges that have slowed me down over the year:

    Problems With Native Client – Build System
    The technology which enables this project — Google’s Native Client (NaCl) — can be troublesome. One of my key frustrations with the environment is that every single revision of the NaCl SDK seems to adopt a completely new build system layout. If you want to port your NaCl project forward to newer revisions, you have to spend time wrapping your head around whatever the favored build system is. When I first investigated NaCl, I think it was using vanilla GNU Make. Then it switched to SCons. Then I forgot about NaCl for about a year and when I came back, the SDK had reverted back to GNU Make. While that has been consistent, the layout of the SDK sometimes changes and a different example Makefile shows the way.

    The very latest version of the API has required me to really overhaul the Makefile and to truly understand the zen of Makefile programming. I’m even starting to grasp the relationship it has to functional programming.

    Problems With Native Client – API Versions and Chrome Bugs
    I built the original Salty Game Music Player when NaCl API version 16 was current. By the time I published the v16 version, v19 was available. I made the effort to port forward (a few APIs had superficially changed, nothing too dramatic). However, when I would experiment with this new player, I would see intermittent problems on my Windows 7 desktop. Because of this, I was hesitant to make a new player release.

    Around the end of May, I started getting bug reports from site users that their Chrome browsers weren’t allowing them to activate the Salty Game Music Player — the upshot was that they couldn’t play music unless they manually flipped a setting in their browser configuration. It turns out that Chrome 27 introduced a bug that caused this problem. Not only that, but my player was one of only 2 known NaCl apps that used the problematic feature (the other was developed by the Google engineer who entered the bug).

    After feeling negligent for a long while about not doing anything to fix the bug, I made a concerted and creative effort to work around the bug and pushed out a new version of the player (based on API v25). My effort didn’t work and I had to roll it back somewhat (but still using the new player binaries). The bug was something that I couldn’t work around. However, at about the same time that I was attempting to do this, Google was rolling out Chrome 28 which fixed the bug, rendering my worry and effort moot.

    Problems With Native Client – Still Not In The Clear
    I felt reasonably secure about releasing the updated player since I couldn’t make my aforementioned problem occur on my Windows 7 setup anymore. I actually have a written test plan for this player, believe it or not. However, I quickly started receiving new bug reports from Windows users. Mostly, these are Windows 8 users. The player basically doesn’t work at all for them now. One user reports the problem on Windows 7 (and another on Windows 2008 Server, I think). But I can’t see it.

    I have a theory about what might be going wrong, but of course I’ll need to test it, and determine how to fix it.

    Database Difficulties
    The player is only half of the site; the other half is the organization of music files. Working on this project has repeatedly reminded me of my fundamental lack of skill concerning databases. I have a ‘production’ database; now I’m afraid to do anything with it for fear of messing it up. It’s an SQLite3 database, so it’s easy to make backups and to create a copy in order to test and debug a new script. Still, I feel like I’m missing an entire career path’s worth of database best practices.

    There is also the matter of ongoing database maintenance. There are graphical frontends for SQLite3 which make casual updates easier and obviate the need for anything more sophisticated (like a custom web app). However, I have a slightly more complicated database entry task that I fear will require, well, a custom web app in order to smoothly process hundreds, if not thousands of new song files (which have quirks which prohibit the easy mass processing I have been able to get away with so far).

    Going Forward
    I remain hopeful that I’ll gradually overcome these difficulties. I still love this project and I have received nothing but positive feedback over the past year (modulo the assorted recommendations that I port the entire player to pure JavaScript).

    You would think I would learn a lesson about building anything on top of a Google platform in the future, especially Native Client. Despite all this, I have another NaCl project planned.

  • Managing Music Playback Channels

    30 June 2013, by Multimedia Mike (General)

    My Game Music Appreciation site allows users to interact with old video game music by toggling various channels, as long as the underlying synthesizer engine supports it.


    5 NES voices

    Users often find their way to the Nintendo DS section pretty quickly. This is when they notice an obnoxious quirk with the channel toggling feature: specifically, one channel doesn’t seem to map to a particular instrument or track.

    When it comes to computer music playback methodologies, I have long observed that there are 2 general strategies: Fixed channel and dynamic channel allocation.

    Fixed Channel Approach
    One of my primary sources of computer-based entertainment used to be watching music. Sure, I listened to it as well. But for things like Amiga MOD files and related tracker formats, there was a rich ecosystem of fun music playback programs that visualized the music. There exist music visualization modes in various music players these days (such as iTunes and Windows Media Player), but those largely just show you a single waveform. Tracker playback, by contrast, was real-time synthesis over multiple audio channels, and the players usually showed some form of analysis for each channel. My personal favorite was Cubic Player:


    Open Cubic Player -- oscilloscopes

    Most of these players supported the concept of masking individual channels. In doing so, the user could isolate, study, and enjoy different components of the song. For many 4-channel Amiga MOD files, I observed that the common arrangement was to use the 4 channels for beat (percussion track), bass line, chords, and melody. Thus, it was easy to just listen to, e.g., the bass line in isolation.

    MODs and similar formats specified precisely which digital audio sample to play at what time and on which specific audio channel. Viewing the internals of one of these formats, one gets the impression that they contain an extremely computer-centric view of music.

    Dynamic Channel Allocation Algorithm
    MODs et al. enjoyed a lot of popularity, but the standard for computer music is MIDI. While MOD and friends took a computer-centric view of music, MIDI takes, well, a music-centric view of music.

    There are MIDI visualization programs as well. The one that came with my Gravis Ultrasound was called PLAYMIDI.EXE. It looked like this…


    Gravis Ultrasound PLAYMIDI.EXE application

    … and it confused me. There are 16 distinct channels being visualized but some channels are shown playing multiple notes. When I dug into the technical details, I learned that MIDI just specifies what notes need to be played, at what times and frequencies and using which instrument samples, and it was the MIDI playback program’s job to make it happen.

    Thus, if a MIDI file specifies that track 1 should play a C major chord consisting of notes C, E, and G, it would transmit events “key-on C; delta time 0; key-on E; delta time 0; key-on G; delta time …; [other commands]”. If the playback program has access to multiple channels (say, up to 32, in the case of the GUS), the intuitive approach would be to maintain a pool of all available channels. Then, when it’s time to process the “key-on C” event, fetch the first available channel from the pool, mark it as in-use, play C on the channel, and return that channel to the pool when either the sample runs its course or the corresponding “key-off C” event is encountered in the MIDI command stream.
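
    In code form, that strategy boils down to something like this toy C sketch; the channel count and the commented-out synthesizer hooks are illustrative, not any particular player’s internals:

     /* Toy model of dynamic channel allocation: claim the first free
      * channel on key-on, release it on key-off. */
     #define NUM_CHANNELS 32  /* e.g., a Gravis Ultrasound */

     typedef struct {
         int in_use;
         int note;  /* note currently sounding on this channel */
     } channel_t;

     static channel_t channels[NUM_CHANNELS];

     /* key-on: fetch the first available channel from the pool */
     int key_on(int note)
     {
         int i;
         for (i = 0; i < NUM_CHANNELS; i++) {
             if (!channels[i].in_use) {
                 channels[i].in_use = 1;
                 channels[i].note = note;
                 /* start_note(i, note);  -- synthesizer-specific */
                 return i;
             }
         }
         return -1;  /* pool exhausted; a real player would steal a voice */
     }

     /* key-off (or the sample ran its course): return the channel to the pool */
     void key_off(int note)
     {
         int i;
         for (i = 0; i < NUM_CHANNELS; i++) {
             if (channels[i].in_use && channels[i].note == note) {
                 /* stop_note(i);  -- synthesizer-specific */
                 channels[i].in_use = 0;
                 return;
             }
         }
     }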

    About That Game Music
    Circling back around to my game music website, numerous supported systems use the fixed channel approach for playback while others use the dynamic channel allocation approach, including every Nintendo DS game I have analyzed so far.

    Which approach is better? As in many technical matters, there are trade-offs either way. The fixed channel approach is often necessary because, on many older audio synthesis systems, different channels had very specific purposes. The 8-bit NES had 5 channels: 2 square wave generators (used musically for melody/treble), 1 triangle wave generator (usually used for the bass line), a noise generator (subverted for all manner of percussive sounds), and a limited digital channel (sometimes assigned richer percussive sounds). Dynamic channel allocation wouldn’t work here.

    But the dynamic approach works great on hardware with 16 digital channels available like, for example, the Nintendo DS. Digital channels are very general-purpose. What about the SNES, with its 8 digital channels? Either approach could work. In practice, most games used a fixed channel approach: Games might use 4-6 channels for music while reserving the remainder for various in-game sound effects. Some notable exceptions to this pattern were David Wise’s compositions for Rare’s SNES games (think Battletoads and the various Donkey Kong Country titles). These clearly use some dynamic channel approach since masking all but one channel will give you a variety of instrument sounds.

    Epilogue
    There! That took a long time to explain but I find it fascinating for some reason. I need to distill it down to far fewer words because I want to make it a FAQ on my website for “Why can’t I isolate specific tracks for Nintendo DS games?”

    Actually, perhaps I should remove the ability to toggle Nintendo DS channels in the first place. Here’s a funny tale of needless work: I found the Vio2sf engine for synthesizing Nintendo DS music and incorporated it into the program. It didn’t support toggling of individual channels, so I figured out a way to add that feature to the engine. And then I noticed that most Nintendo DS games render that feature moot. After I released the webapp, I learned that I was out of date on the Vio2sf engine. The final insult was that the latest version already supports channel toggling. So I did the work for nothing; and since I want to remove that feature from the UI anyway, it was doubly for nothing.

  • Developing A Shader-Based Video Codec

    22 June 2013, by Multimedia Mike (Outlandish Brainstorms)

    Early last month, this thing called ORBX.js was in the news. It ostensibly has something to do with streaming video and codec technology, which naturally catches my interest. The hype was kicked off by Mozilla honcho Brendan Eich when he posted an article asserting that HD video decoding could be entirely performed in JavaScript. We’ve seen this kind of thing before using Broadway, an H.264 decoder implemented entirely in JS. But that exposes some very obvious limitations (notably CPU usage).

    But this new video codec promises 1080p HD playback directly in JavaScript which is a lofty claim. How could it possibly do this? I got the impression that performance was achieved using WebGL, an extension which allows JavaScript access to accelerated 3D graphics hardware. Browsing through the conversations surrounding the ORBX.js announcement, I found this confirmation from Eich himself:

    You’re right that WebGL does heavy lifting.

    As of this writing, ORBX.js remains some kind of private tech demo. If there were a public demo available, it would necessarily be easy to reverse engineer the downloadable JavaScript decoder.

    But the announcement was enough to make me wonder how it could be possible to create a video codec which effectively leverages 3D hardware.

    Prior Art
    In theorizing about this, it continually occurs to me that I can’t possibly be the first person to attempt to do this (or the ORBX.js people, for that matter). In googling on the matter, I found various forums and Q&A posts where people asked if it were possible to, e.g., accelerate JPEG decoding and presentation using 3D hardware, with no answers. I also found a blog post which describes a plan to use 3D hardware to accelerate VP8 video decoding. It was a project done under the banner of Google’s Summer of Code in 2011, though I’m not sure which open source group mentored the effort. The project did not end up producing the shader-based VP8 codec originally chartered but mentions that “The ‘client side’ of the VP8 VDPAU implementation is working and is currently being reviewed by the libvdpau maintainers.” I’m not sure what that means. Perhaps it includes modifications to the public API that supports VP8, but is waiting for the underlying hardware to actually implement VP8 decoding blocks in hardware.

    What’s So Hard About This?
    Video decoding is a computationally intensive task. GPUs are known to be really awesome at chewing through computationally intensive tasks. So why aren’t GPUs a natural fit for decoding video codecs?

    Generally, it boils down to parallelism, or the lack of opportunities for it. GPUs are really good at performing the exact same operation over lots of data at once. The problem is that decoding compressed video usually requires multiple sequential phases, and the individual phases often cannot be parallelized internally. In strictly mathematical terms, a compressed data stream needs to be decoded by applying a function f(x) over each data element, x_0 .. x_n. However, computing the function for one element relies on having already applied it to the previous element, i.e.:

     f(x_n) = f(f(x_{n-1}))
    

    What happens when you try to parallelize such an algorithm? Temporal rifts in the space/time continuum, if you’re in a Star Trek episode. If you’re in the real world, you’ll get incorrect, unusable data as the parallel computation is seeded with a bunch of invalid data at multiple points (which is illustrated in some of the pictures in the aforementioned blog post about accelerated VP8).

    Example: JPEG
    Let’s take a very general look at the various stages involved in decoding the ubiquitous JPEG format:


    High level JPEG decoding flow

    What are the opportunities to parallelize these various phases?

    • Huffman decoding (run-length decoding and zig-zag reordering are assumed to be rolled into this phase): not many opportunities to parallelize the various Huffman formats out there, including this one. Decoding most Huffman streams is necessarily a sequential operation. I once hypothesized that it would be possible to engineer a codec to achieve some parallelism during the entropy decoding phase, and later found that On2’s VP8 codec employs such a scheme. However, such a scheme is unlikely to break down to the fine level that WebGL would require.
    • Reverse DC prediction: JPEG, like many other codecs, doesn’t store full DC coefficients; it stores the differences between successive DC coefficients. Reversing this process can’t be parallelized (see the discussion in the previous section and the sketch after this list).
    • Dequantize coefficients: This could be heavily parallelized. It should be noted that software decoders often don’t dequantize all coefficients; many coefficients are 0 and it’s a waste of a multiplication operation to dequantize them. Thus, this phase is sometimes rolled into the Huffman decoding phase.
    • Invert discrete cosine transform: This seems like it could be highly parallelizable. I will be exploring this further in this post.
    • Convert YUV -> RGB for final display: This is a well-established use case for 3D acceleration.
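
    To see why reverse DC prediction resists parallelization, consider this trivial C rendering of the process; each iteration depends on the previous iteration’s result:

     /* Reverse DC prediction: each block's DC coefficient is stored as a
      * difference from the previous block's DC, so iteration i cannot
      * begin until iteration i-1 has finished. */
     void reverse_dc_prediction(int *dc, int num_blocks)
     {
         int i;
         for (i = 1; i < num_blocks; i++)
             dc[i] += dc[i - 1];  /* serial dependency defeats the GPU */
     }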

    Crash Course in 3D Shaders and Humility
    So I wanted to see if I could accelerate some parts of JPEG decoding using something called shaders. I made an effort to understand 3D programming and its associated math throughout the 1990s but 3D technology left me behind a very long time ago while I got mixed up in this multimedia stuff. So I plowed through a few books concerning WebGL (thanks to my new Safari Books Online subscription). After I learned enough about WebGL/JS to be dangerous and just enough about shader programming to be absolutely lethal, I set out to try my hand at optimizing IDCT using shaders.

    Here’s my extremely high level (and probably hopelessly naive) view of the modern GPU shader programming model:


    Basic WebGL rendering pipeline

    The WebGL program written in JavaScript drives the show. It sends a set of vertices into the WebGL system and each vertex is processed through a vertex shader. Then, each pixel that falls within a set of vertices is sent through a fragment shader to compute the final pixel attributes (R, G, B, and alpha value). Another consideration is textures: data that the program uploads to GPU memory and that the shaders can access programmatically.

    These shaders (vertex and fragment) are key to the GPU’s programmability. How are they programmed? Using a special C-like shading language. Thought I: “C-like language? I know C! I should be able to master this in short order!” So I charged forward with my assumptions and proceeded to get smacked down repeatedly by the overall programming paradigm. I came to recognize this as a variation of the scientific method: Develop a hypothesis– in my case, a mental model of how the system works; develop an experiment (short program) to prove or disprove the model; realize something fundamental that I was overlooking; formulate new hypothesis and repeat.

    First Approach: Vertex Workhorse
    My first pitch goes like this:

    • Upload DCT coefficients to GPU memory in the form of textures
    • Program a vertex mesh that encapsulates 16×16 macroblocks
    • Distribute the IDCT effort among multiple vertex shaders
    • Pass transformed Y, U, and V blocks to fragment shader which will convert the samples to RGB

    So the idea is that decoding of 16×16 macroblocks is parallelized. A macroblock embodies 6 blocks:


    JPEG macroblocks

    It would be nice to process one of these 6 blocks in each vertex. But that means drawing a square with 6 vertices. How do you do that? I eventually realized that drawing a square with 6 vertices is the recommended method for drawing a square on 3D hardware. Using 2 triangles, each with 3 vertices (0, 1, 2; 3, 4, 5):


    2 triangles make a square

    A vertex shader knows which (x, y) coordinates it has been assigned, so it could figure out which sections of coefficients it needs to access within the textures. But how would a vertex shader know which of the 6 blocks it should process? Solution: Misappropriate the vertex’s z coordinate. It’s not used for anything else in this case.

    So I set all of that up. Then I hit a new roadblock: How to get the reconstructed Y, U, and V samples transported to the fragment shader? I have found that communicating between shaders is quite difficult. Texture memory? WebGL doesn’t allow shaders to write back to texture memory; shaders can only read it. The standard way to communicate data from a vertex shader to a fragment shader is to declare variables as “varying”. Up until this point, I knew about varying variables but there was something I didn’t quite understand about them and it nagged at me: If 3 different executions of a vertex shader set 3 different values to a varying variable, what value is passed to the fragment shader?

    It turns out that the varying variable varies, which means that the GPU passes interpolated values to each fragment shader invocation. This completely destroys this idea.

    Second Idea: Vertex Workhorse, Take 2
    The revised pitch is to work around the interpolation issue by just having each vertex shader invocation perform all 6 block transforms. That seems like a lot of redundancy. However, I figured out that I can draw a square with only 4 vertices by arranging them in an ‘N’ pattern and asking WebGL to draw a TRIANGLE_STRIP instead of TRIANGLES. Now it’s only doing 4x the extra work, not 6x. GPUs are supposed to be great at this type of work, so it shouldn’t matter, right?

    I wired up an experiment and then ran into a new problem: while I was able to transform a block (or at least pretend to) and load up a varying array (that wouldn’t vary, since all vertex shaders wrote the same values) to transmit to the fragment shader, the fragment shader can’t access specific values within the varying block. To clarify, a WebGL shader can use a constant value (or a value that can be evaluated as a constant at compile time) to index into arrays; a WebGL shader cannot compute an index into an array. Per my reading, this is a WebGL security consideration and the limitation may not be present in other OpenGL(-ES) implementations.

    Not Giving Up Yet: Choking The Fragment Shader
    You might want to be sitting down for this pitch:

    • Vertex shader only interpolates texture coordinates to transmit to fragment shader
    • Fragment shader performs IDCT for a single Y sample, U sample, and V sample
    • Fragment shader converts YUV -> RGB

    Seems straightforward enough. However, the step concerning IDCT for Y, U, and V entails a gargantuan number of operations. When computing the IDCT for an entire block of samples, it’s possible to leverage a lot of redundancy in the math, which equates to far fewer overall operations. If you absolutely have to compute each sample individually, an 8×8 block requires 64 multiplication/accumulation (MAC) operations per sample. For 3 color planes, and including a few extra multiplications involved in the RGB conversion, that tallies up to about 200 MACs per pixel. Then there’s the fact that this approach performs 4x redundant operations on the color planes.

    It’s crazy, but I just want to see if it can be done. My approach is to pre-compute a pile of IDCT constants in the JavaScript and transmit them to the fragment shader via uniform variables. For a first order optimization, the IDCT constants are formatted as 4-element vectors. This allows computing 16 dot products rather than 64 individual multiplication/addition operations. Ideally, GPU hardware executes the dot products faster (and there is also the possibility of lining these calculations up as matrices).
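
    To make the arithmetic concrete, here is a CPU-side C model of what each fragment shader invocation would compute for a single sample; the vec4 grouping mirrors how GLSL dot products would consume the uniform constants (types and names are mine):

     /* One output sample is the dot product of the block's 64 DCT
      * coefficients with 64 precomputed IDCT basis constants for that
      * sample position, grouped as 16 4-element vectors. */
     typedef struct { float v[4]; } vec4;

     static float dot4(vec4 a, vec4 b)
     {
         return a.v[0] * b.v[0] + a.v[1] * b.v[1] +
                a.v[2] * b.v[2] + a.v[3] * b.v[3];
     }

     /* coeffs: the 64 coefficients of an 8x8 block; consts: the 64 IDCT
      * basis constants for this sample position; both packed as 16 vec4s */
     float idct_one_sample(const vec4 coeffs[16], const vec4 consts[16])
     {
         float sample = 0.0f;
         int i;
         for (i = 0; i < 16; i++)
             sample += dot4(coeffs[i], consts[i]);  /* 16 dot products */
         return sample;
     }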

    I can report that I actually got a sample correctly transformed using this approach. Just one sample, though. Then I ran into some new problems:

    Problem #1: Computing sample #1 vs. sample #0 requires a different table of 64 IDCT constants. Okay, so create a long table of 64 * 64 IDCT constants. However, this suffers from the same problem as seen in the previous approach: I can’t dynamically compute the index into this array. What’s the alternative? Maintain 64 separate named arrays and implement 64 branches, when branching of any kind is ill-advised in shader programming to begin with? I started to go down this path until I ran into…

    Problem #2: Shaders can only be so large. 64 * 64 floats (4 bytes each) requires 16 kbytes of data and this well exceeds the amount of shader storage that I can assume is allowed. That brings this path of exploration to a screeching halt.

    Further Brainstorming
    I suppose I could forgo pre-computing the constants and directly compute the IDCT for each sample which would entail lots more multiplications as well as 128 cosine calculations per sample (384 considering all 3 color planes). I’m a little stuck with the transform idea right now. Maybe there are some other transforms I could try.

    Another idea would be vector quantization. What little ORBX.js literature is available indicates that there is a method to allow real-time streaming but that it requires GPU assistance to yield enough horsepower to make it feasible. When I think of such severe asymmetry between compression and decompression, my mind drifts towards VQ algorithms. As I come to understand the benefits and limitations of GPU acceleration, I think I can envision a way that something similar to SVQ1, with its copious, hierarchical vector tables stored as textures, could be implemented using shaders.

    So far, this all pertains to intra-coded video frames. What about opportunities for inter-coded frames? The only approach that I can envision here is to use WebGL’s readPixels() function to fetch the rasterized frame out of the GPU, and then upload it again as a new texture which a new frame processing pipeline could reference. Whether this idea is plausible would require some profiling.

    Using interframes in such a manner seems to imply that the entire codec would need to operate in RGB space and not YUV.

    Conclusions
    The people behind ORBX.js have apparently figured out a way to create a shader-based video codec. I have yet to even begin to reason out a plausible approach. However, I’m glad I did this exercise since I have finally broken through my ignorance regarding modern GPU shader programming. It’s nice to have a topic like multimedia that allows me a jumping-off point to explore other areas.