Breaking Eggs And Making Omelettes

A blog dealing with technical multimedia matters, binary reverse engineering, and the occasional video game hacking.

http://multimedia.cx/eggs/

Articles published on the site

  • Further Dreamcast Hacking

    3 February 2011, by Multimedia Mike (Sega Dreamcast)

    I'm still haunted by Sega Dreamcast programming, specifically the fact that I used to be able to execute custom programs on the thing (roughly 8-10 years ago) and now I cannot. I'm going to compose a post to describe my current adventures on this front. There are 3 approaches I have been using: Raw, KallistiOS, and the almighty Linux.


    Raw
    What I refer to as "raw" is an assortment of programs that lived in a small number of source files (sometimes just one ASM file) and could be compiled with the most basic SH-4 toolchain. The advantage here is that there aren't many moving parts and not many things that can possibly go wrong, so it provides a good functional baseline.

    One of the original Dreamcast hackers was Marcus Comstedt, who still has his original DC material hosted at the reasonably easy-to-remember URL mc.pp.se/dc. I can get some of these simple demos to work, but not others.

    I also successfully assembled and ran a pair of 256-byte (!!) demos from this old DC scene page.

    KallistiOS
    KallistiOS (or just KOS) was a real-time OS developed for the DC and was popular among the DC homebrew community. All the programming I did back in the day was based around KOS. Now I can't get any of it to work. More specifically, KOS can't seem to make it past a certain point in its system initialization.

    The Linux Option
    I was never that excited about running Linux on my Dreamcast. For some hackers, running Linux on a given piece of consumer electronics is the highest attainable goal. Back in the day, I looked at it from a much more pragmatic perspective-- I didn't see much use in running Linux on the DC, not as much as running KOS, which was developed to be a much more appropriate fit for the hardware.

    However, I was able to burn a CD-R of an old binary image of Linux 2.4.5 compiled for the Dreamcast and boot it some months ago. So I at least have a feeling that this should work. I have never cross-compiled a kernel of my own (though I have compiled many, many x86 kernels in my time, so I'm not a total n00b in this regard). I figured this might be a good time to start.

    The first item that worries me is getting a functional cross-compiling toolchain. Fortunately, a little digging in the Linux kernel documentation pointed me in the direction of a bunch of ready-made toolchains hosted at kernel.org. So I grabbed one of the SH toolchains (gcc-4.3.3-nolibc) and got rolling.

    I'm well familiar with the cycle of 'make menuconfig' in order to pick configuration options, and then 'make' to build a kernel (or usually 'make zImage' or 'make bzImage' to create compressed images). For cross compiling, the primary difference seems to be editing the root Makefile in the Linux source code tree (I'm using 2.6.37, the latest stable as of this writing) and setting a value for the CROSS_COMPILE variable. Then, run 'make menuconfig' followed by 'make' as normal.

    The Linux 2.6 series is supposed to support a range of Renesas (formerly Hitachi) SH processors and board configurations. This includes reasonable defaults for the Sega Dreamcast hardware. I got it all compiling except for a series of .S files. Linus Torvalds once helped me debug a program I work on, so I thought I'd see if there was something I could help debug here.

    The first issue was with ASM statements of a form similar to:

    mov #0xffffffe0, r1
    

    Now, the DC's SH-4 is a RISC CPU. A lot of RISC architectures adopt a fixed instruction size of 32 bits. You can't encode an entire 32-bit immediate value inside of a 32-bit instruction (there would be no room left for the rest of the instruction encoding). Further, the SH series encodes its instructions in a mere 16 bits. The move-immediate-data instruction only allows for an 8-bit, sign-extended value.

    I decided that the above statement is equivalent to:

    mov #-32, r1
    

    I'll give this statement the benefit of the doubt that it used to work with the gcc toolchain somewhere along the line. I assume that the assembler is supposed to know enough to substitute the first form with the second.
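
    For what it's worth, a quick sanity check of the arithmetic (in Python, purely to verify the claim; it has nothing to do with the toolchain) confirms that the two forms describe the same register contents once the 8-bit immediate is sign-extended:

    # sign-extend an 8-bit immediate the way SH-4's 'mov #imm8, Rn' does
    def sign_extend_8(value):
        value &= 0xFF
        return value - 0x100 if value & 0x80 else value

    imm = sign_extend_8(0xE0)                  # low 8 bits of 0xffffffe0
    print(imm)                                 # -32
    print((imm & 0xFFFFFFFF) == 0xFFFFFFE0)    # True: same 32-bit pattern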

    The next problem is that an 'sti' instruction shows up in a number of spots. Using Intel x86 conventions, this is a "set interrupt flag" instruction (I remember that the 6502 CPU had a similar instruction, SEI, though its interrupt flag's operation was opposite that of the x86). The SH-4 reference manual lists no 'sti' instruction. When it gets to these lines, the assembler complains about immediate move instructions with data that is too large, just as with the instruction above. I'm guessing they must be macro'd to something else but I failed to find where. I commented out those lines for the time being. Probably not that smart, but I want to keep this moving for now.

    So I got the code to compile into a kernel file called 'vmlinux'. I've seen this file many times before but never thought about how to get it to run directly. The process has usually been to compress it and send it over to lilo or grub for loading, as that is the job of the bootloader. I had never even wondered what format the vmlinux file takes until now. It seems that 'vmlinux' is just a plain old ELF file:

    $ file vmlinux
    vmlinux: ELF 32-bit LSB executable, Renesas SH,
    version 1 (SYSV), statically linked, not stripped
    

    The 'dc-tool' program that uploads executables to the waiting bootloader on the Dreamcast is perfectly cool accepting ELF files (and S-record files, and raw binary files). After a very lengthy upload process, execution fails (resets the system).

    For the sake of comparison, I dusted off that Linux 2.4.5 bootable Dreamcast CD-ROM and directly uploaded the vmlinux file from that disc. That works just fine (until it's time to go to the next loading phase, i.e., finding a filesystem). Possible issues here could include the commented-out 'sti' instructions (could be that they aren't just decoration). I'm also trying to understand the memory organization-- perhaps the bootloader wants the ELF to be based at a different address. Or maybe the kernel and the bootloader don't like each other in the first place-- in this case, I need to study the bootable Linux CD-ROM to see how it's done.

    Optimism
    Even though I'm meeting with rather marginal success, this is tremendously educational. I greatly enjoy these exercises if only for the deeper understanding they bring for the lowest-level system details.

  • Heroic Defender of the Stack

    27 January 2011, by Multimedia Mike (Programming)

    Problem Statement

    I have been investigating stack smashing and countermeasures (stack smashing prevention, or SSP). Briefly, stack smashing occurs when a function allocates a static array on the stack and writes past the end of it, onto other local variables and eventually onto other function stack frames. When it comes time to return from the function, the return address has been corrupted and the program ends up some place it really shouldn't. In the best case, the program just crashes; in the worst case, a malicious party crafts code to exploit this malfunction.

    Further, debugging such a problem is especially obnoxious because by the time the program has crashed, it has already trashed any record (on the stack) of how it got into the errant state.

    Preventative Countermeasure

    GCC has had SSP since version 4.1. The compiler inserts the SSP code when the -fstack-protector command line switch is specified. Implementation-wise, SSP basically inserts a special value (the literature refers to this as the 'canary', as in "canary in the coal mine") at the top of the stack frame when entering the function, and adds code before leaving the function to make sure the canary didn't get stepped on. If something happens to the canary, the program is immediately aborted with a message to stderr about what happened. Further, gcc's man page on my Ubuntu machine proudly trumpets that this functionality has been enabled by default ever since Ubuntu 6.10.

    And that's really all there is to it. Your code is safe from stack smashing by default. Or so the hand-wavy documentation would have you believe.

    Not exactly

    Exercising the SSP

    I wanted to see the SSP in action to make sure it was a real thing. So I wrote some code that smashes the stack in pretty brazen ways so that I could reasonably expect to trigger the SSP (see later in this post for the code). Here's what I learned that wasn't in any documentation:

    SSP is only emitted for functions that have static arrays of 8-bit data (i.e., [unsigned] chars). If you have static arrays of other data types (like, say, 32-bit ints), those are still fair game for stack smashing.

    Evaluating the security vs. speed/code size trade-offs, it makes sense that the compiler wouldn't apply this protection everywhere (I can only muse about how my optimization-obsessed multimedia hacking colleagues would absolutely freak out if this code were unilaterally added to all functions). So why are only static char arrays deemed to be "vulnerable objects" (the wording that the gcc man page uses)? A security hacking colleague suggested that this is probably because the kind of data which poses the highest risk is arrays of 8-bit input data from, e.g., network sources.

    The gcc man page also lists an option -fstack-protector-all that is supposed to protect all functions. The man page's definition of "all functions" perhaps differs from my own, since invoking the option does not produce a different result from plain, vanilla -fstack-protector.

    The Valgrind Connection

    "Memory trouble? Run Valgrind!" That may as well be Valgrind's marketing slogan. Indeed, it's the go-to utility for finding troublesome memory-related problems and has saved me on a number of occasions. However, it must be noted that it is useless for debugging this type of problem. If you understand how Valgrind works, this makes perfect sense. Valgrind operates by watching all memory accesses and ensuring that the program is only accessing memory to which it has privileges. In the stack smashing scenario, the program is fully allowed to write to that stack space; after all, the program recently, legitimately pushed that return value onto the stack when calling the errant, stack smashing function.

    Valgrind embodies a suite of tools. My idea for an addition to this suite would be a mechanism which tracks return addresses every time a call instruction is encountered. The tool could track the return addresses in a separate stack data structure, though this might have some thorny consequences for some more unusual program flows. Instead, it might track them in some kind of hash/dictionary data structure and warn the programmer whenever a 'ret' instruction is returning to an address that isn't in the dictionary.

    Simple Stack Smashing Code

    Here's the code I wrote to test exactly how SSP gets invoked in gcc. Compile with 'gcc -g -O0 -Wall -fstack-protector-all -Wstack-protector stack-fun.c -o stack-fun'.

    stack-fun.c:

    /* keep outside of the stack frame */
    static int i;

    void stack_smasher32(void)
    {
      int buffer32[8];
      // uncomment this array and compile without optimizations
      // in order to force this function to compile with SSP
    //  char buffer_to_trigger_ssp[8];

      for (i = 0; i < 50; i++)
        buffer32[i] = 0xA5;
    }

    void stack_smasher8(void)
    {
      char buffer8[8];
      for (i = 0; i < 50; i++)
        buffer8[i] = 0xA5;
    }

    int main()
    {
    //  stack_smasher8();
      stack_smasher32();
      return 0;
    }

    The above incarnation should just produce the traditional "Segmentation fault". However, uncommenting the call to stack_smasher8() in main() (in place of stack_smasher32()) should result in "*** stack smashing detected ***: ./stack-fun terminated", followed by the venerable "Segmentation fault".

    As indicated in the comments for stack_smasher32(), it's possible to trick the compiler into emitting SSP for a function by inserting an array of at least 8 bytes (anything smaller and SSP won't be emitted, as documented, unless gcc's ssp-buffer-size parameter is tweaked). This has to be compiled with no optimization at all (-O0) or else the compiler will (quite justifiably) optimize away the unused buffer and omit SSP.

    For reference, I ran my tests on Ubuntu 10.04.1 with gcc 4.4.3 compiling the code for both x86_32 and x86_64.

  • The First Problem

    19 January 2011, by Multimedia Mike (HTML5)

    A few years ago, The Linux Hater made the following poignant observation regarding Linux driver support:

    Drivers are only just the beginning... But for some reason y’all like to focus on the drivers. You know why lusers do that? Because it just happens to be the problem that people notice first.

    And so it is with the HTML5 video codec debate, re-invigorated in the past week by Google's announcement of dropping native H.264 support in their own HTML5 video tag implementation. As I read up on the fiery debate, I kept wondering why people are so obsessed with this issue. Then I remembered the Linux Hater's post and realized that the video codec issue is simply the first problem that most people notice regarding HTML5 video.

    I appreciate that the video codec debate has prompted Niedermayer to post on his blog once more. Otherwise, I'm just munching popcorn on the sidelines, amused and mildly relieved that the various factions are vociferously attacking each other rather than that little project I help with at work.

    Getting back to the "first problem" aspect-- there's so much emphasis on the video codec; I wonder why no one ever, ever mentions word one about an audio codec. AAC is typically the codec that pairs with H.264 in the MPEG stack. Dark Shikari once mentioned that "AAC’s licensing terms are exponentially more onerous than H.264′s. If Google didn’t want to use H.264, they would sure as hell not want to use AAC." Most people are probably using "H.264" to refer to the entire MPEG/H.264/AAC stack, even if they probably don't understand what all of those pieces mean.

    Anyway, The Linux Hater's driver piece continues:

    Once y’all have drivers, the fight will move to the next layer up. And like I said, it’s a lot harder at that layer.

    A few months ago, when I wanted to post the WebM output of my new VP8 encoder and thought it would be a nice touch to deliver it via a video tag, I ignored the video codec problem (just encoded a VP8/WebM file) only to immediately discover a problem at a different layer-- specifically, embedding a file using a video tag triggers a full file download when the page is loaded, which is unacceptable from end user and web hosting perspectives. This is a known issue but doesn't get as much attention, I guess because there are bigger problems to solve first (c.f. video codec issue).

    For other issues, check out the YouTube blog's HTML5 post or Hulu's post that also commented on HTML5. Issues such as video streaming flexibility, content protection, fullscreen video, webcam/microphone input, and numerous others are rarely mentioned in the debates. Only "video codec" is of paramount importance.

    But I'm lending too much weight to the cacophony of a largely uninformed internet debate. Realistically, I know there are many talented engineers down in the trenches working to solve at least some of these problems. To tie this in with the Linux driver example, I'm consistently stunned these days regarding how simple it is to get Linux working on a new computer-- most commodity consumer hardware really does just work right out of the box. Maybe one day, we'll wake up and find that HTML5 video has advanced to the point that it solves all of the relevant problems to make it the simple and obvious choice for delivering web video in nearly all situations.

    It won't be this year.

  • Processing Big Data Problems

    8 January 2011, by Multimedia Mike (Big Data)

    I'm becoming more interested in big data problems, i.e., extracting useful information out of absurdly sized sets of input data. I know it's a growing field and there is a lot to read on the subject. But you know how I roll-- just think of a problem to solve and dive right in.

    Here's how my adventure unfolded.

    The Corpus
    I need to run a command line program on a set of files I have collected. This corpus is on the order of 350,000 files. The files range from 7 bytes to 175 MB. Combined, they occupy around 164 GB of storage space.

    Oh, and said storage space resides on an external, USB 2.0-connected hard drive. Stop laughing.

    A file is named according to the SHA-1 hash of its data. The files are organized in a directory hierarchy according to the first 6 hex digits of the SHA-1 hash (e.g., a file named a4d5832f... is stored in a4/d5/83/a4d5832f...). All of this file hash, path, and size information is stored in an SQLite database.
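
    In Python terms, the naming scheme works out to something like this (a hypothetical helper, not part of the actual tooling):

    import hashlib
    import os

    def corpus_path(root, data):
        """Map a blob of data to its storage path under the SHA-1 naming scheme."""
        digest = hashlib.sha1(data).hexdigest()
        # the first 6 hex digits become 3 directory levels:
        # a4d5832f... -> a4/d5/83/a4d5832f...
        return os.path.join(root, digest[0:2], digest[2:4], digest[4:6], digest)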

    First Pass
    I wrote a Python script that read all the filenames from the database, fed them into a pool of worker processes using Python's multiprocessing module, and wrote some resulting data for each file back to the SQLite database. My Eee PC has a single-core, hyperthreaded Atom which presents 2 CPUs to the system. Thus, 2 worker processes crunched the corpus. It took a while. It took somewhere on the order of 9 or 10 or maybe even 12 hours. It took long enough that I'm in no hurry to re-run the test and get more precise numbers.
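
    The structure of that first pass was roughly as follows (a simplified sketch; the database schema and the per-file work are stand-ins for the real thing):

    import sqlite3
    from multiprocessing import Pool

    def process_file(path):
        # stand-in for the real work: run the external program against 'path'
        # and return whatever data should be recorded for it
        with open(path, 'rb') as f:
            return (path, len(f.read()))

    if __name__ == '__main__':
        db = sqlite3.connect('corpus.db')        # hypothetical schema
        paths = [row[0] for row in db.execute('SELECT path FROM files')]
        with Pool(processes=2) as pool:          # the Eee PC presents 2 CPUs
            for path, result in pool.imap_unordered(process_file, paths):
                db.execute('UPDATE files SET result = ? WHERE path = ?',
                           (result, path))
        db.commit()
        db.close()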

    At least I extracted my initial set of data from the corpus. Or did I?

    Think About The Future

    A few days later, I went back to revisit the data only to notice that the SQLite database was corrupted. To add insult to that bit of injury, the script I had written to process the data was also completely corrupted (overwritten with something unrelated to Python code). BTW, this was on a RAID brick configured for redundancy. So that's strike 3 in my personal dealings with RAID technology.

    I moved the corpus to a different external drive and also verified the files after writing (easy to do since I already had the SHA-1 hashes on record).

    The corrupted script was pretty simple to rewrite, even a little better than before. Then I got to re-run it. However, this run was on a faster machine, a hyperthreaded, quad-core beast that exposes 8 CPUs to the system. The reason I wasn't too concerned about the poor performance with my Eee PC is that I knew I was going to be able to run it on this monster later.

    So I let the rewritten script rip. The script gave me little updates regarding its progress. As it did so, I ran some rough calculations and realized that it wasn't predicted to finish much sooner than it would have if I were running it on the Eee PC.

    Limiting Factors
    It had been suggested to me that I/O bandwidth of the external USB drive might be a limiting factor. This is when I started to take that idea very seriously.

    The first idea I had was to move the SQLite database to a different drive. The script records data to the database for every file processed, though it only commits once every 100 UPDATEs, so at least it's not constantly syncing the disk. I ran before and after tests with a small subset of the corpus and noticed a substantial speedup thanks to this policy change.
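
    The commit batching amounts to something like this (a sketch; the table and column names are made up):

    import sqlite3

    def record_results(db_path, results):
        """Write (path, result) pairs, committing only once per 100 UPDATEs."""
        db = sqlite3.connect(db_path)
        for count, (path, result) in enumerate(results, 1):
            db.execute('UPDATE files SET result = ? WHERE path = ?',
                       (result, path))
            if count % 100 == 0:
                db.commit()          # avoid forcing a disk sync for every file
        db.commit()                  # flush whatever is left at the end
        db.close()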

    Then I remembered hearing something about "atime", which is access time. Linux filesystems, by default, record the time that a file was last accessed. You can watch this in action by running 'stat <file> ; cat <file> > /dev/null ; stat <file>' and observing that the "Access" field has been updated to NOW(). This also means that every single file that gets read from the external drive still causes an additional write. To avoid this, I started mounting the external drive with '-o noatime' which instructs Linux not to record "last accessed" time for files.

    On the limited subset test, this more than doubled script performance. I then wondered about mounting the external drive as read-only. This had the same performance as noatime. I thought about using both options together, but verified that access times are not updated on a read-only filesystem anyway.

    A Note On Profiling
    Once you start accessing files in Linux, those files start getting cached in RAM. Thus, if you profile, say, reading a gigabyte file from a disk and get 31 MB/sec, and then repeat the same test, you're likely to see the test complete instantaneously. That's because the file is already sitting in memory, cached. This is useful in general application use, but not if you're trying to profile disk performance.

    Thus, in between runs, do (as root) 'sync; echo 3 > /proc/sys/vm/drop_caches' in order to wipe caches (explained here).

    Even Better?
    I re-ran the test using these little improvements. Now it takes somewhere around 5 or 6 hours to run.

    I contrived an artificially large file on the external drive and did some 'dd' tests to measure what the drive could really do. The drive consistently measured a bit over 31 MB/sec. If I could read and process the data at 30 MB/sec, the script would be done in about 95 minutes.

    But it's probably rather unreasonable to expect that kind of transfer rate for lots of smaller files scattered around a filesystem. However, it can't be that helpful to have 8 different processes constantly asking the HD for 8 different files at any one time.

    So I wrote a script called stream-corpus.py which simply fetched all the filenames from the database and loaded the contents of each in turn, leaving the data to be garbage-collected at Python's leisure. This test completed in 174 minutes, just shy of 3 hours. I computed an average read speed of around 17 MB/sec.
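
    The essence of stream-corpus.py is tiny. A sketch (again assuming a hypothetical database schema):

    # stream-corpus.py (sketch): read every file once so the OS caches what it can
    import sqlite3

    db = sqlite3.connect('corpus.db')
    for (path,) in db.execute('SELECT path FROM files ORDER BY path'):
        with open(path, 'rb') as f:
            f.read()        # contents are discarded; only the read matters
    db.close()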

    Single-Reader Script
    I began to theorize that if I only have one thread reading, performance should improve greatly. To test this hypothesis without having to do a lot of extra work, I cleared the caches and ran stream-corpus.py until 'top' reported that about half of the real memory had been filled with data. Then I let the main processing script loose on the data. As both scripts were using sorted lists of files, they iterated over the filenames in the same order.

    Result: The processing script tore through the files that had obviously been cached thanks to stream-corpus.py, degrading drastically once it had caught up to the streaming script.

    Thus, I was motivated to reorganize the processing script just slightly. Now, there is a reader thread which reads each file and stuffs the name of the file into an IPC queue that one of the worker threads can pick up and process. Note that no file data is exchanged between threads. No need-- the operating system is already implicitly holding onto the file data, waiting in case someone asks for it again before something needs that bit of RAM. Technically, this approach accesses each file multiple times. But it makes little practical difference thanks to caching.
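
    The reorganized layout looks roughly like this (a sketch built on Python's multiprocessing primitives; process_file() stands in for the external-program invocation and the database schema is hypothetical):

    from multiprocessing import Process, Queue

    def process_file(path):
        # stand-in for invoking the external program on 'path'
        with open(path, 'rb') as f:
            return len(f.read())    # the re-read hits the page cache, not the disk

    def reader(paths, queue, num_workers):
        for path in paths:
            with open(path, 'rb') as f:
                f.read()            # pull the file's data into the page cache
            queue.put(path)         # hand only the *name* to a worker over IPC
        for _ in range(num_workers):
            queue.put(None)         # one sentinel per worker: no more work

    def worker(queue):
        while True:
            path = queue.get()
            if path is None:
                break
            process_file(path)

    if __name__ == '__main__':
        import sqlite3
        db = sqlite3.connect('corpus.db')
        paths = [row[0] for row in
                 db.execute('SELECT path FROM files ORDER BY path')]
        queue = Queue()
        workers = [Process(target=worker, args=(queue,)) for _ in range(8)]
        for w in workers:
            w.start()
        reader(paths, queue, len(workers))
        for w in workers:
            w.join()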

    Result: About 183 minutes to process the complete corpus (which works out to a little over 16 MB/sec).

    Why Multiprocess
    Is it even worthwhile to bother multithreading this operation? Monitoring the whole operation via 'top', most instances of the processing script are barely using any CPU time. Indeed, it's likely that only one of the worker threads is doing any work most of the time, pulling a file out of the IPC queue as soon as the reader thread triggers its load into cache. Right now, the processing is usually pretty quick. There are cases where the processing (external program) might hang (one of the reasons I'm running this project is to find those cases); the multiprocessing architecture at least allows other processes to take over until a hanging process is timed out and killed by its monitoring process.

    Further, the processing is pretty simple now but is likely to get more intensive in future iterations. Plus, there's the possibility that I might move everything onto a more appropriately-connected storage medium which should help alleviate the bottleneck bravely battled in this post.

    There's also the theoretical possibility that the reader thread could read too far ahead of the processing threads. Obviously, that's not too much of an issue in the current setup. But to guard against it, the processes could share a variable that tracks the total number of bytes that have been processed. The reader thread adds filesizes to the count while the processing threads subtract file sizes. The reader thread would delay reading more if the number got above a certain threshold.
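
    One way to express that guard with multiprocessing primitives (a sketch; the 512 MB limit is an arbitrary placeholder):

    import os
    import time
    from multiprocessing import Value

    outstanding = Value('q', 0)          # bytes read ahead but not yet processed

    def reader_admit(path, limit=512 * 1024 * 1024):
        """Called by the reader before loading a file; blocks if too far ahead."""
        size = os.path.getsize(path)
        while outstanding.value > limit:
            time.sleep(0.1)              # wait for the workers to catch up
        with outstanding.get_lock():
            outstanding.value += size
        return size

    def worker_done(size):
        """Called by a worker after processing a file; credits the bytes back."""
        with outstanding.get_lock():
            outstanding.value -= size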

    Leftovers
    I wondered if the order of accessing the files mattered. I didn't write them to the drive in any special order. The drive is formatted with Linux ext3. I ran stream-corpus.py on all the filenames sorted by filename (remember the SHA-1 naming convention described above) and also by sorting them randomly.

    Result: It helps immensely for the filenames to be sorted. The sorted variant was a little more than twice as fast as the random variant. Maybe it has to do with accessing all the files in a single directory before moving on to another directory.

    Further, I have long been under the impression that the best read speed you can expect from USB 2.0 is 27 Mbytes/sec (even though 480 Mbit/sec is bandied about in relation to the spec). This comes from profiling I performed with an external enclosure that supports both USB 2.0 and FireWire-400 (and eSATA). FW-400 was able to read the same file at nearly 40 Mbytes/sec, while USB 2.0 could only manage 27 Mbytes/sec. Other sources I have read corroborate this number. But this test (using different hardware) achieved over 31 Mbytes/sec.

  • Learn Multimedia Programming By Writing A JPEG Decoder

    6 January 2011, by Multimedia Mike (Programming)

    For those of you who hack on multimedia tech, how did you get started? Did you begin by studying the mathematical underpinnings of multimedia codec algorithms? Or did you find a practical problem and jump right in by writing code? (Personally, I was always more of a nuts & bolts hacker than a math guy.) I ask because I occasionally get emails from aspiring multimedia hackers who want to know where to begin. Invariably, they want to go the math-first route. I heavily discourage this approach.

    I have a crazy idea for anyone who wants a crash course on multimedia hacking: write a JPEG decoder. In doing so, you will be exposed to a lot of key domain concepts such as bitstream parsing, Huffman decoding, dequantization, zigzagging, the dreaded (inverse) discrete cosine transform, YUV vs. RGB colorspaces, macroblock organization, delta coding, and run length coding.
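
    To pick just one of those concepts as a taste: here is a short sketch (illustrative Python, not taken from any decoder) of the zigzag ordering that serializes an 8x8 block of DCT coefficients from low to high frequencies:

    def zigzag_order(n=8):
        """Return (row, col) pairs of an n x n block in JPEG zigzag scan order."""
        cells = [(r, c) for r in range(n) for c in range(n)]
        # walk the anti-diagonals, alternating direction on odd/even diagonals
        return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                      rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

    print(zigzag_order()[:10])
    # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2), (2, 1), (3, 0)]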

    Sure, JPEG decoding is a solved problem. But that's hardly the point. Why would you enter an unfamiliar field and hope to come up to speed on the basics by leaping straight into the domain's unsolved problems? If you are successful in this exercise, no one will ever use the fruits of your labor, but that doesn't really matter.

    So, do you want to learn multimedia hacking quickly? Then grab a JPEG file (maybe create a few contrived ones that are small, have friendly dimensions, and feature predictable patterns), grab a good JPEG reference, and implement the decoding algorithm in the language and platform of your choice.

    On the matter of the reference, my personal favorite reference has always been A note about the JPEG decoding algorithm by Cristi Cuturicu. The English grammar is a bit dodgy, but overall it might be the best reference you'll find on the matter-- as simple as it needs to be, but no simpler.

    Good luck!