Breaking Eggs And Making Omelettes

A blog dealing with technical multimedia matters, binary reverse engineering, and the occasional video game hacking.

http://multimedia.cx/eggs/

Articles published on the website

  • Cracking Aztec Game Audio

    7 June 2011, by Multimedia Mike, in Game Hacking

    Here's a mild multimedia-related reverse engineering challenge for you. It's pretty straightforward for those skilled in the art.

    The Setup
    One side effect of running this ridiculously niche interest blog at the intersection of multimedia, reverse engineering, and game hacking is that people occasionally contact me for assistance on those very matters. So it was when one of my MobyGames peers asked if I could help extract some music from a game called Aztec Wars. The game consists of 2 discs, each with a music.xbe file that contains multiple tunes and is hundreds of megabytes in size.



    That's all the data I received in the first email. At first I wondered what makes people think I have some magical insight into cracking these formats with so little information. Ordinarily, I would need to have the entire data file to work with and possibly the game binaries. But I didn't want to ask him to upload hundreds of megabytes of data and I didn't feel like downloading it; commitment issues and all.

    But then I gathered a little confidence and remembered that the .xbe files are probably just Game Resource Archive Formats (GRAF) which are, traditionally, absurdly simple. I asked my colleague to send me a hexdump of the first kilobyte of one of the .xbe GRAFs ('hexdump -C -n 1024 music.xbe > file') as well as the total file size of the GRAF.

    The Hexdump
    The first music.xbe file is 192817376 bytes large. These are the first 144 bytes (more than enough):

    00000000  01 00 00 00 60 04 00 00  14 00 00 00 01 00 00 00  |....`...........|
    00000010  0d 00 00 00 48 00 00 00  94 39 63 01 1c a4 21 03  |....H....9c..¤!.|
    00000020  7a d2 54 04 04 28 ad 05  d8 88 fd 06 d8 88 fd 06  |zÒT..(­.Ø.ý.Ø.ý.|
    00000030  2a 6e 46 08 2a 6e 46 08  2a 6e 46 08 2a 6e 46 08  |*nF.*nF.*nF.*nF.|
    00000040  50 13 2f 0a e0 28 7e 0b  52 49 46 46 44 39 63 01  |P./.à(~.RIFFD9c.|
    00000050  57 41 56 45 66 6d 74 20  10 00 00 00 01 00 02 00  |WAVEfmt ........|
    00000060  44 ac 00 00 10 b1 02 00  04 00 10 00 64 61 74 61  |D¬...±......data|
    00000070  fc 13 63 01 00 00 00 00  00 00 00 00 00 00 00 00  |ü.c.............|
    00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    

    The Challenge
    Armed with only the information in the foregoing section, figure out a method for extracting all the audio files in that file and advise on their playback/conversion. Ideally, this method should require minimal effort from both you and the person on the other end of the conversation.

    The Resolution
    The reason I ask is because I came up with a solution but knew, deep down, that there must be a slightly easier way. How would you solve this?
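
    For what it's worth, here is one possible reading of the hexdump, sketched in Python. It is an inference from the bytes shown, not a documented format: the 32-bit little-endian value at 0x10 (13) looks like an entry count, and the 13 values starting at 0x14 look like absolute file offsets. The first is 0x48, exactly where the first RIFF header begins; that RIFF chunk's size (0x01633944) plus its 8-byte header lands on the second entry (0x01633994); and the last entry (0x0b7e28e0) equals the total file size of 192817376. Repeated offsets would then be empty slots. The function names and the output file naming are made up for illustration.

```python
import struct


def parse_segments(header: bytes) -> list[tuple[int, int]]:
    """Return (offset, length) for each non-empty entry in the offset table.

    Assumed layout (inferred from the hexdump, not from any format spec):
    a little-endian 32-bit entry count at 0x10, followed at 0x14 by that
    many absolute file offsets, the last of which marks the end of the file.
    """
    (count,) = struct.unpack_from("<I", header, 0x10)
    offsets = struct.unpack_from(f"<{count}I", header, 0x14)
    segments = []
    for start, end in zip(offsets, offsets[1:]):
        if end > start:  # repeated offsets are treated as empty slots
            segments.append((start, end - start))
    return segments


def extract_all(path: str, prefix: str = "track") -> None:
    """Copy each non-empty segment of the archive to prefix-NN.wav."""
    with open(path, "rb") as archive:
        segments = parse_segments(archive.read(1024))
        for index, (offset, length) in enumerate(segments, start=1):
            archive.seek(offset)
            with open(f"{prefix}-{index:02d}.wav", "wb") as out:
                out.write(archive.read(length))
```

    If that reading holds, the person on the other end only needs to run something like extract_all("music.xbe"). The fmt chunk in the dump describes plain 16-bit stereo PCM at 44100 Hz, so the extracted pieces should be ordinary WAV files that need no conversion to play.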

    The music files in question are now preserved on YouTube (until they see fit to remove them for one reason or another).

  • Internecine Legal Threats

    1 June 2011, by Multimedia Mike, in Legal/Ethical

    FFmpeg and associated open source multimedia projects such as xine, MPlayer, and VLC have long had a rebel mystique about them; a bunch of hackers playing fast and loose with IP law in order to give the world the free multimedia experience it deserved. We figured out the algorithms using any tools available, including the feared technique of binary reverse engineering. When I gave a presentation about FFmpeg at Linuxtag in 2007, I created this image illustrating said mystique:



    It garnered laughs. But I made the point that we multimedia hackers just press on, doing our thing while ignoring legal threats. The policy has historically worked out famously for us: to date, I seem to be the only person on the receiving end of a sort-of legal threat from the outside world.

    Who would have thought that the most credible legal threat to an open source multimedia project would emanate from a fork of that very project? Because that’s exactly what has transpired:




    So it came to pass that Michael Niedermayer — the leader of the FFmpeg project — received a bona fide legal nastygram from Mans Rullgard, a representative of the FFmpeg-forked Libav project. The subject of dispute is a scorched-earth matter involving the somewhat iconic FFmpeg zigzag logo:

       
    Original 2D logo; enhanced 3D logo

    To think of all those years we spent worrying about legal threats from organizations outside the community. I’m reminded of that time-honored horror trope/urban legend staple: Get out! The legal threats are coming from inside the house!

    I’m interested to see how this all plays out, particularly regarding jurisdiction, as we have a U.K. resident engaging an Italian lawyer outfit to deliver a legal threat to an Austrian citizen regarding an image hosted on a server in Hungary. I suspect I know why that law firm was chosen, but it’s still a curious jurisdictional setup.

    People often used to ask me if we multimedia hackers would get sued to death for doing what we do. My response was always, “There’s only one way to know for sure,” by which I meant that we would just have to engage in said shady activities and determine empirically if lawsuits resulted. So I’m a strong advocate for experimentation to push the limits. Kudos to Michael and Mans for volunteering to push the legal limits.

  • Salty Game Music

    31 May 2011, by Multimedia Mike, in General

    Have you heard of Google’s Native Client (NaCl) project? Probably not. Basically, it allows native code modules to run inside a browser (where ‘browser’ is defined pretty narrowly as ‘Google Chrome’ in this case). Programs are sandboxed so they aren’t a security menace (or so the whitepapers claim) but are allowed to access a variety of APIs including video and audio. The latter API is significant because sound tends to be forgotten in all the hullabaloo surrounding non-Flash web technologies. At any rate, enjoy NaCl while you can because I suspect it won’t be around much longer.

    After my recent work upgrading some old music synthesis programs to use more modern audio APIs, I got the idea to try porting the same code to run under NaCl in Chrome (first Nosefart, then Game Music Emu/GME). In this exercise, I met with very limited success. This blog post documents some of the pitfalls in my excursion.



    Infrastructure
    People who know me know that I’m rather partial — to put it gently — to straight-up C vs. C++. The NaCl SDK is heavily skewed towards C++. However, it does provide a Python tool called init_project.py which can create the skeleton of a project and can do so in C with the '-c' option:

    ./init_project.py -c -n saltynosefart
    

    This generates something that can be built using a simple ‘make’. When I added Nosefart’s C files, I learned that the project Makefile has places for project-necessary CFLAGS but does not honor them. The problem is that the generated Makefile includes a broader system Makefile that overrides the CFLAGS in the project Makefile. Going into the system Makefile and changing "CFLAGS =" -> "CFLAGS +=" solves this problem.

    Still, maybe I’m the first person to attempt building something in Native Client so I’m the first person to notice this?

    Basic Playback
    At least the process to create an audio-enabled NaCl app is well-documented. Too bad it doesn’t seem to compile as advertised. According to my notes on the matter, I filled in PPP_InitializeModule() with the appropriate boilerplate as outlined in the docs but got a linker error concerning get_browser_interface().

    Plan B: C++
    Obviously, the straight C stuff is very much a second-class citizen in this NaCl setup. Fortunately, there is already that fully functional tone generator example program in the limited samples suite. Plan B is to copy that project and edit it until it accepts Nosefart/GME audio instead of a sine wave.

    The build system assumes all C++ files should have .cc extensions. I have to make some fixes so that it will accept .cpp files (either that, or rename all .cpp to .cc, but that’s not very clean).

    Making Noise
    You’ll be happy to know that I did successfully swap out the tone generator for either Nosefart or GME. Nosefart has a slightly fickle API that requires revving the emulator frame by frame and generating a certain number of audio samples. GME’s API is much easier to work with in this situation: just tell it how many samples it needs to generate and give it a pointer to a buffer. I played NES and SNES music through this ad-hoc browser plugin, and I’m confident all the other supported formats would have worked if I had gone through the bother of converting the music data files into C headers to be included in the NaCl executable binaries (dynamically loading data via the network promised to be a far more challenging prospect reserved for phase 3 of the project).

    Portable?
    I wouldn’t say so. I developed it on Linux and things ran fine there. I tried to run the same binaries on the Windows version of Chrome to no avail. It looks like it wasn’t even loading the .nexe files (NaCl executables).

    Thinking About The (Lack Of A) Future
    As I was working on this project, I noticed that the online NaCl documentation materialized explicit banners warning that my NaCl binaries compiled for Chrome 11 won’t work for Chrome 12 and that I need to code to the newly-released 0.3 SDK version. Not a fuzzy feeling. I also don’t feel good that I’m working from examples using bleeding edge APIs that feature deprecation as part of their naming convention, e.g., pp::deprecated::ScriptableObject().

    Ever-changing API + minimal API documentation + API that only works in one browser brand + requiring end user to explicitly enable feature = … well, that’s why I didn’t bother to release any showcase pertaining to this little experiment. Would have been neat, but I strongly suspect that this is yet another one of those APIs that Google decides to deprecate soon.


  • Revisiting Nosefart and Discovering GME

    30 May 2011, by Multimedia Mike, in Game Hacking

    I found the following screenshot buried deep in an old directory structure of mine:



    I tried to recall how this screenshot came to exist. Had I actually created a functional KDE frontend to Nosefart yet neglected to release it? I think it’s more likely that I used some designer tool (possibly KDevelop) to prototype a frontend. This would have been sometime in 2000.

    However, this screenshot prompted me to revisit Nosefart.

    Nosefart Background
    Nosefart is a program that can play Nintendo Sound Format (NSF) files. NSF files are files containing components that were surgically separated from Nintendo Entertainment System (NES) ROM dumps. These components contain the music playback engines for various games. An NSF player is a stripped down emulation system that can simulate the NES's 6502 CPU along with the custom audio hardware (2 square waves, 1 triangle wave, 1 noise generator, and 1 limited digital channel).

    Nosefart was written by Matt Conte and eventually imported into a Sourceforge project, though it has not seen any development since then. The distribution contains standalone command line players for Linux and DOS, a GTK frontend for the Linux command line version, and plugins for Winamp, XMMS, and CL-Amp.

    The Sourceforge project page notes that Nosefart is also part of XBMC. Let the record show that Nosefart is also incorporated into xine (I did that in 2002, I think).

    Upgrading the API
    When I tried running the command line version of Nosefart under Linux, I hit hard against the legacy audio API: OSS. Remember that?

    In fairly short order, I was able to upgrade the CL program to use PulseAudio. The program is not especially sophisticated. It’s a single-threaded affair which checks for a keypress, processes an audio frame, and sends the frame out to the OSS file interface. All that was needed was to rewrite open_hardware() and close_hardware() for PA and then replace the write statement in play(). The only quirk that stood out is that including <pulse/pulseaudio.h> is insufficient for programming PA’s simple API. <pulse/simple.h> must be included separately.

    For extra credit, I adapted the program to ALSA. The program uses the most simplistic audio output API possible — just keep filling a buffer and sending it out to the DAC.

    Discovering GME
    I’m not sure what to do with the program now since, during my research to attempt to bring Nosefart up to date, I became aware of a software library named Game Music Emu, or GME. It’s a pure C++ library that can play essentially any classic video game music format you can possibly name. Wow. A lot can happen in 10 years when you’re not paying attention.

    It’s such a well-written library that I didn’t need any tutorial or documentation to come up to speed. Just a quick read of the main gme.h header file enabled me in short order to whip up a quick C program that could play NSF and SPC files. Path of least resistance: Client program asks library to open a hardcoded file, synthesize 10 seconds of audio, and dump it into a file; ask the FLAC command line program to transcode the raw data to a .flac file; use ffplay to verify the results.

    I might develop some other uses for this library.

  • Method For Crawling Google

    28 May 2011, by Multimedia Mike, in Big Data

    I wanted to crawl Google in order to harvest a large corpus of certain types of data as yielded by a certain search term (we’ll call it “term” for this exercise). Google doesn’t appear to offer any API to automatically harvest their search results (why would they?). So I sat down and thought about how to do it. This is the solution I came up with.



    FAQ
    Q: Is this legal / ethical / compliant with Google’s terms of service?
    A: Does it look like I care? Moving right along…

    Manual Crawling Process
    For this exercise, I essentially automated the task that would be performed by a human. It goes something like this:

    1. Search for “term”
    2. On the first page of results, download each of the 10 results returned
    3. Click on the next page of results
    4. Go to step 2, until Google doesn’t return any more pages of search results

    Google returns up to 1000 results for a given search term. Fetching them 10 at a time is less than efficient. Fortunately, the search URL can easily be tweaked to return up to 100 results per page.

    Expanding Reach
    Problem: 1000 results for the “term” search isn’t that many. I need a way to expand the search. I’m not aiming for relevancy; I’m just searching for random examples of some data that occurs around the internet.

    My solution for this is to refine the search using the “site” wildcard. For example, you can ask Google to search for “term” at all Canadian domains using “site:.ca”. So, the manual process now involves harvesting up to 1000 results for every single internet top level domain (TLD). But many TLDs can be more granular than that. For example, there are 50 sub-domains under .us, one for each state (e.g., .ca.us, .ny.us). Those all need to be searched independently. Same for all the sub-domains under TLDs which don’t allow domains under the main TLD, such as .uk (search under .co.uk, .ac.uk, etc.).

    Another extension is to combine “term” searches with other terms that are likely to have a rich correlation with “term”. For example, if “term” is relevant to various scientific fields, search for “term” in conjunction with various scientific disciplines.
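
    As a rough illustration of the URL tweaking and the "site" refinement, here is a minimal Python sketch. The endpoint and the "num" and "start" query parameter names are my assumptions about how the search URL is shaped (the post doesn't spell them out), and build_search_url is a made-up helper name.

```python
from urllib.parse import urlencode

# Assumed search endpoint; not specified in the post.
SEARCH_BASE = "https://www.google.com/search"


def build_search_url(term, site=None, extra=None, start=0, per_page=100):
    """Build a search URL for `term`, optionally narrowed by site or extra terms.

    `num` (results per page) and `start` (index of the first result) are
    assumed parameter names.
    """
    query = term
    if extra:
        query += " " + extra      # e.g. a scientific discipline
    if site:
        query += " site:" + site  # e.g. ".ca" or ".ny.us"
    params = {"q": query, "num": per_page, "start": start}
    return SEARCH_BASE + "?" + urlencode(params)
```

    For example, build_search_url("term", site=".ca") asks for the first 100 results under Canadian domains, and passing start=100 asks for the next page.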

    Algorithmically
    My solution is to create an SQLite database that contains a table of search seeds. Each seed is essentially a “site:” string combined with a starting index.

    Each TLD and sub-TLD is inserted as a searchseed record with a starting index of 0.

    A script performs the following crawling algorithm:

    • Fetch the next record from the searchseed table which has not been crawled
    • Fetch search result page from Google
    • Scrape URLs from page and insert each into URL table
    • Mark the searchseed record as having been crawled
    • If the results page indicates there are more results for this search, insert a new searchseed for the same seed but with a starting index 100 higher
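
    The loop above might look something like the following sketch. The table and column names are guesses at a plausible schema, and fetch_page is a stand-in for the part that actually requests and scrapes a results page; it is assumed to return the scraped URLs plus a flag saying whether more results exist.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS searchseed (
    id INTEGER PRIMARY KEY,
    site TEXT NOT NULL,
    start INTEGER NOT NULL DEFAULT 0,
    crawled INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS url (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE
);
"""


def init_db(conn, sites):
    """Create the tables and insert one seed per TLD/sub-TLD at index 0."""
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO searchseed (site, start) VALUES (?, 0)",
        [(site,) for site in sites],
    )
    conn.commit()


def crawl_step(conn, fetch_page, page_size=100):
    """Process one uncrawled seed; return False when none are left.

    fetch_page(site, start) is a hypothetical callable returning
    (list_of_urls, has_more_results).
    """
    row = conn.execute(
        "SELECT id, site, start FROM searchseed "
        "WHERE crawled = 0 ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return False
    seed_id, site, start = row
    urls, has_more = fetch_page(site, start)
    # UNIQUE + OR IGNORE keeps duplicate URLs out of the table.
    conn.executemany(
        "INSERT OR IGNORE INTO url (url) VALUES (?)", [(u,) for u in urls]
    )
    conn.execute("UPDATE searchseed SET crawled = 1 WHERE id = ?", (seed_id,))
    if has_more:
        # Same seed, next page of results.
        conn.execute(
            "INSERT INTO searchseed (site, start) VALUES (?, ?)",
            (site, start + page_size),
        )
    conn.commit()
    return True
```

    Calling crawl_step in a loop until it returns False drains the seed table, including any follow-up seeds added along the way.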

    Digging Into Sites
    Sometimes, Google notes that certain sites are particularly rich sources of “term” and offers to let you search that site for “term”. This basically links to another search for “term site:somesite”. That site gets its own search seed and the program might harvest up to 1000 URLs from that site alone.

    Harvesting the Data
    Armed with a database of URLs, employ the following algorithm:

    • Fetch a random URL from the database which has yet to be downloaded
    • Try to download it
    • For goodness sake, have a mechanism in place to detect whether the download process has stalled and automatically kill it after a certain period of time
    • Store the data and update the database, noting where the information was stored and that it is already downloaded

    This step is easy to parallelize by simply executing multiple copies of the script. It is useful to update the URL table to indicate that one process is already trying to download a URL so multiple processes don’t duplicate work.
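
    Here is a minimal sketch of one harvesting worker, under some assumptions of mine: the url table carries status, owner, and path columns (my invention), and a socket timeout stands in for the stall detector. Note that the timeout only fires when the connection goes quiet for that long; it does not cap the total download time. The save callable is hypothetical and is expected to store the data and return where it put it.

```python
import sqlite3
import urllib.request
import uuid

# status is one of: pending, claimed, done, failed
URL_SCHEMA = """
CREATE TABLE IF NOT EXISTS url (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE,
    status TEXT NOT NULL DEFAULT 'pending',
    owner TEXT,
    path TEXT
);
"""


def claim_random_url(conn):
    """Claim one random pending URL with a single UPDATE statement.

    Returns (id, url), or None when nothing is pending. The unique token
    identifies which row this worker claimed.
    """
    token = uuid.uuid4().hex
    cur = conn.execute(
        "UPDATE url SET status = 'claimed', owner = ? WHERE id = ("
        "SELECT id FROM url WHERE status = 'pending' "
        "ORDER BY RANDOM() LIMIT 1)",
        (token,),
    )
    conn.commit()
    if cur.rowcount == 0:
        return None
    return conn.execute(
        "SELECT id, url FROM url WHERE owner = ?", (token,)
    ).fetchone()


def download(url, timeout=30):
    """Fetch a URL, giving up if the connection stalls for `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def harvest_one(conn, save, fetch=download):
    """Download one claimed URL; return False when no work remains."""
    claimed = claim_random_url(conn)
    if claimed is None:
        return False
    url_id, url = claimed
    try:
        path = save(url_id, fetch(url))
        conn.execute(
            "UPDATE url SET status = 'done', path = ? WHERE id = ?",
            (path, url_id),
        )
    except OSError:  # URLError and socket timeouts are OSError subclasses
        conn.execute("UPDATE url SET status = 'failed' WHERE id = ?", (url_id,))
    conn.commit()
    return True
```

    Several copies of a script built around harvest_one can share the same database file; claiming a row before downloading is what keeps them from duplicating work.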

    Acting Human
    A few factors here:

    • Google allegedly doesn’t like automated programs crawling its search results. Thus, at the very least, don’t let your script advertise itself as an automated program. At a basic level, this means forging the User-Agent: HTTP header. By default, Python’s urllib2 announces itself with a “Python-urllib” User-Agent string. Change this to a well-known browser string.
    • Be patient; don’t fire off these search requests as quickly as possible. My crawling algorithm inserts a random delay of a few seconds in between each request. This can still yield hundreds of useful URLs per minute.
    • On harvesting the data: Even though you can parallelize this and download data as quickly as your connection can handle, it’s a good idea to randomize the URLs. If you hypothetically had 4 download processes running at once and they got to a point in the URL table which had many URLs from a single site, the server might be configured to reject too many simultaneous requests from a single client.
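
    The first two points can be sketched as follows. This uses Python 3's urllib.request rather than the urllib2 module mentioned above; the browser string is just an example value and the 2 to 6 second delay range is an arbitrary choice on my part.

```python
import random
import time
import urllib.request

# Example browser identification string; any well-known one would do.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def browser_request(url, user_agent=BROWSER_UA):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def polite_delay(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Sleep for a random interval between requests; return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

    A crawl loop would call polite_delay() before each urllib.request.urlopen(browser_request(url)).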

    Conclusion
    Anyway, that’s just the way I would (and did) do it. What did I do with all the data? That’s a subject for a different post.

    Adorable spider drawing from here.