Breaking Eggs And Making Omelettes
A blog dealing with technical multimedia matters, binary reverse engineering, and the occasional video game hacking.
Articles published on the site
Developing MobyCAIRO
26 May 2021, by Multimedia Mike — General

I recently published a tool called MobyCAIRO. The ‘CAIRO’ part stands for Computer-Assisted Image ROtation, while the ‘Moby’ prefix refers to its role in helping process artifact image scans for submission to the MobyGames database. The tool is meant to provide an accelerated workflow for rotating and cropping image scans. It works on both Windows and Linux. Hopefully, it can solve similar workflow problems for other people.
As of this writing, MobyCAIRO has not been tested on Mac OS X yet; I expect some issues there that should be easily solvable if someone cares to test it.
The rest of this post describes my motivations and how I arrived at the solution.
Background
I have scanned well in excess of 2100 images for MobyGames and other purposes in the past 16 years or so. The workflow looks like this:
Image workflow
It should be noted that my original workflow featured me manually rotating the artifact on the scanner bed in order to ensure straightness, because I guess I thought that rotate functions in image editing programs constituted dark, unholy magic or something. So my workflow used to be even more arduous:
I can’t believe I had the patience to do this for hundreds of scans
Sometime last year, I was sitting down to perform some more scanning and found myself dreading the oncoming tedium of straightening and cropping the images. This prompted a pivotal question:
Why can’t a computer do this for me?
After all, I have always been a huge proponent of making computers handle the most tedious, repetitive, mind-numbing, and error-prone tasks. So I did some web searching to find if there were any solutions that dealt with this. I also consulted with some like-minded folks who have to cope with the same tedious workflow.
I came up empty-handed. So I endeavored to develop my own solution.
Problem Statement and Prior Work
I want to develop a workflow that can automatically rotate an image so that it is straight, and also find the most likely crop rectangle, uniformly whitening the area outside of the crop area (in the case of circles).

As mentioned, I checked to see if any other programs could handle this, starting with my usual workhorse, Photoshop Elements. But I can’t expect the trimmed-down version to do everything. I tried to find out if its big brother could handle the task, but couldn’t find a definitive answer. Nor could I find any other tools that seem to take an interest in optimizing this particular workflow.
When I brought this up to some peers, I received some suggestions, including an idea that the venerable GIMP had a feature like this, but I could not find any evidence. Further, I would get responses of “Program XYZ can do image rotation and cropping.” I had to tamp down on the snark to avoid saying “Wow! An image editor that can perform rotation AND cropping? What a game-changer!” Rotation and cropping features have been table stakes for any halfway competent image editor for at least the last 25 years or so. I am hoping to find or create a program which can lend a bit of programmatic assistance to the task.
Why can’t other programs handle this? The answer seems fairly obvious: Image editing tools are general tools and I want a highly customized workflow. It’s not reasonable to expect a turnkey solution to do this.
Brainstorming An Approach
I started with the happiest of happy cases: a disc that needed archiving (a marketing/press assets CD-ROM from a video game company, contents described here) which appeared to have some pretty clear straight lines:
My idea was to find straight lines in the image and then rotate the image, based on the longest single straight line detected, so that it becomes parallel to the horizontal.
I just needed to figure out how to find a straight line inside of an image. Fortunately, I quickly learned that this is very much a solved problem thanks to something called the Hough transform. As a bonus, I read that this is also the tool I would want to use for finding circles, when I got to that part. The nice thing about knowing the formal algorithm to use is being able to find efficient, optimized libraries which already implement it.
Early Prototype
A little searching for how to perform a Hough transform in Python led me first to scikit-image. I was able to rapidly produce a prototype that did some basic image processing. However, running the Hough transform directly on the image and rotating according to the longest line segment discovered turned out not to yield the expected results.
It also took a very long time to chew on the 3300×3300 raw image, certainly longer than I care to wait for an accelerated workflow concept. The key, however, is that you are apparently not supposed to run the Hough transform on a raw image: you need to compute the edges first, and then attempt to determine which edges are ‘straight’. The recommended algorithm for this step is the Canny edge detector. After applying this, I get the expected rotation:
The algorithm also completes in a few seconds. So this is a good early result and I was feeling pretty confident. But, again: the happiest of happy cases. I should also mention at this point that I had originally envisioned a tool that I would simply run against a scanned image and it would automatically/magically make the image straight, followed by a perfect crop.
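For the curious, here is a minimal sketch of that early edges-then-Hough idea, assuming scikit-image, with illustrative filenames and parameters. This is not MobyCAIRO’s actual code, and the rotation sign may need flipping depending on coordinate conventions:

```python
# Minimal sketch (not the original prototype): compute a Canny edge map,
# find straight segments with the probabilistic Hough transform, then
# rotate so the longest segment becomes horizontal.
import math

from skimage import color, feature, io, transform
from skimage.util import img_as_ubyte

image = io.imread("scan.png")            # hypothetical input scan
gray = color.rgb2gray(image)

edges = feature.canny(gray, sigma=2)     # edges first, then Hough
segments = transform.probabilistic_hough_line(
    edges, threshold=10, line_length=200, line_gap=5)

def length(seg):
    (x0, y0), (x1, y1) = seg
    return math.hypot(x1 - x0, y1 - y0)

(x0, y0), (x1, y1) = max(segments, key=length)
angle = math.degrees(math.atan2(y1 - y0, x1 - x0))

# Sign convention may need adjusting for image (y-down) coordinates.
straightened = transform.rotate(image, angle, resize=True, cval=1.0)
io.imsave("scan-straightened.png", img_as_ubyte(straightened))
```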
Along came my MobyGames comrade Foxhack to disabuse me of the hope of ever developing a fully automated tool. Just try and find a usefully long straight line in this:
Darn it, Foxhack…
There are straight edges, to be sure. But my initial brainstorm of rotating according to the longest straight edge looks infeasible. Further, it was at this point that we started brainstorming that perhaps we could match on rating badges, such as the standard ESRB badges omnipresent on U.S. video games. This gets into feature detection and complicates things.
This Needs To Be Interactive
At this point in the effort, I came to terms with the fact that the solution will need to have some element of interactivity. I will also need to get out of my safe Linux haven and figure out how to develop this on a Windows desktop, something I am not experienced with.

I initially dreamed up an impressive beast of a program written in C++ that leverages Windows desktop GUI frameworks, OpenGL for display and real-time rotation, GPU acceleration for image analysis and processing tricks, and some novel input concepts. I thought GPU acceleration would be crucial since I have a fairly good GPU on my main Windows desktop and I hear that these things are pretty good at image processing.
I created a list of prototyping tasks on a Trello board and made a decent amount of headway on prototyping all the various pieces that I would need to tie together in order to make this a reality. But it was ultimately slow going when you can only grab an hour or two here and there to try to get anything done.
Settling On A Solution
Recently, I was determined to get a set of old shareware discs archived. I ripped the data a year ago but was blocked on the scanning task because I knew it would also involve tedious straightening and cropping. So I finally got all the scans done, which was reasonably quick. But I was determined not to post-process them manually.

This was fairly recent, but I can’t quite recall how I managed to come across the OpenCV library and its Python bindings. OpenCV is an amazing library that provides a significant toolbox for performing image processing tasks. Not only that, it provides “just enough” UI primitives to be able to quickly create a basic GUI for your program, including image display via multiple windows, buttons, and keyboard/mouse input. Furthermore, OpenCV seems to be plenty fast enough to do everything I need in real time, just with (accelerated where appropriate) CPU processing.
So I went to work porting the ideas from the simple standalone Python/scikit-image tool. I thought of a refinement to the straight line detector: instead of just finding the longest straight edge, it creates a histogram of 360 rotation angles and builds a list of lines corresponding to each angle. Then it sorts the angles by cumulative line length and allows the user to iterate through this list, which will hopefully present the most likely straightened angle up front. Further, the tool allows making fine adjustments by 1/10 of a degree via the keyboard, not the mouse. It does all this while highlighting in red the straight line segments that are parallel to the horizontal axis, per the current candidate angle.
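To make the idea concrete, here is a rough sketch of that angle-histogram refinement using OpenCV’s HoughLinesP. The parameters and filenames are illustrative, and this is not the actual MobyCAIRO source:

```python
# Rough sketch: bucket every detected line segment by its angle, then rank
# the angles by total segment length so the most likely "straight" angle
# comes first for the user to review.
import math
from collections import defaultdict

import cv2
import numpy as np

image = cv2.imread("scan.png")                  # hypothetical input scan
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)

segments = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                           minLineLength=100, maxLineGap=10)

length_by_angle = defaultdict(float)
lines_by_angle = defaultdict(list)
for x1, y1, x2, y2 in (segments[:, 0] if segments is not None else []):
    angle = int(math.degrees(math.atan2(y2 - y1, x2 - x1)) % 360)
    length_by_angle[angle] += math.hypot(x2 - x1, y2 - y1)
    lines_by_angle[angle].append((x1, y1, x2, y2))

# Candidate rotation angles, most promising first; the interactive tool
# would let the user step through these and nudge by fractions of a degree.
candidates = sorted(length_by_angle, key=length_by_angle.get, reverse=True)
print(candidates[:5])
```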
The tool draws a light-colored grid over the frame to aid the user in visually verifying the straightness of the image. Further, the program has a mode that allows the user to see the algorithm’s detected edges:
For the cropping phase, the program uses the Hough circle transform in a similar manner, finding the most likely circles (if the image to be processed is supposed to be a circle) and allowing the user to cycle among them while making precise adjustments via the keyboard, again, rather than the mouse.
Running the Hough circle transform is a significantly more intensive operation than the line transform. When I ran it on a full 3300×3300 image, it ran for a long time; I didn’t let it go longer than a minute before forcibly ending the program. Is this approach unworkable? Not quite: it turns out that the transform is just as effective when shrinking the image to 400×400, and it completes in under 2 seconds on my Core i5 CPU.
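A sketch of that shrink-then-detect trick might look like the following; the HoughCircles parameters here are illustrative guesses rather than MobyCAIRO’s actual settings:

```python
# Run the Hough circle transform on a 400x400 reduction, then scale the
# winning circles back up to the full-resolution scan.
import cv2
import numpy as np

image = cv2.imread("disc-scan.png")             # hypothetical input scan
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

scale = 400.0 / max(gray.shape)
small = cv2.resize(gray, None, fx=scale, fy=scale)
small = cv2.medianBlur(small, 5)

circles = cv2.HoughCircles(small, cv2.HOUGH_GRADIENT, dp=1, minDist=50,
                           param1=100, param2=50,
                           minRadius=100, maxRadius=200)

if circles is not None:
    # Each candidate is (x, y, radius); map back to full-resolution pixels.
    for x, y, r in circles[0]:
        print(int(x / scale), int(y / scale), int(r / scale))
```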
For rectangular cropping, I just settled on using OpenCV’s built-in region-of-interest (ROI) facility. I tried to intelligently find the best candidate rectangle and allow fine adjustments via the keyboard, but I wasn’t having much success, so I took a path of lesser resistance.
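For illustration, a minimal crop flow built on cv2.selectROI, which is one plausible reading of that built-in ROI facility, might look like this (filenames are placeholders):

```python
# Sketch of a rectangular crop using OpenCV's interactive ROI selector
# (drag a rectangle, press ENTER/SPACE to confirm).
import cv2

image = cv2.imread("box-scan.png")              # hypothetical input scan
x, y, w, h = cv2.selectROI("select crop", image, showCrosshair=False)
cv2.destroyAllWindows()

if w and h:
    cv2.imwrite("box-scan-cropped.png", image[y:y + h, x:x + w])
```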
Packaging and Residual Weirdness
I realized that this tool would be more useful to a broader Windows-using base of digital preservationists if they didn’t have to install Python, establish a virtual environment, and install the prerequisite dependencies. Thus, I made the effort to figure out how to wrap the entire thing up into a monolithic Windows EXE binary. It is available from the project’s Github release page (another thing I figured out for the sake of this project!).

The binary is pretty heavy, weighing in at a bit over 50 megabytes. You might advise using compression; it IS compressed! Before I figured out the --onefile option for pyinstaller.exe, the generated dist/ subdirectory was 150 MB. Among other things, there’s a 30 MB FORTRAN BLAS library packaged in!

Conclusion and Future Directions
Once I got it all working, with a simple tkinter UI up front in order to select between circle and rectangle crop modes, I unleashed the tool on 60 or so scans in bulk, using the Windows forfiles command (another learning experience). I didn’t put a clock on the effort, but it felt faster. Of course, I was bursting with pride the whole time because I was using my own tool. I just wish I had thought of it sooner. But, really, with 2100+ scans under my belt, I’m just getting started; I literally have thousands more artifacts to scan for preservation.

The tool isn’t perfect, of course. Just tonight, I threw another scan at MobyCAIRO. Just go ahead and try to find straight lines in this specimen:
I eventually had to line up the text left and right of center against the grid using the manual keyboard adjustments. Still, I’m impressed by how these computer vision algorithms can see patterns I can’t, highlighting lines I never would have guessed at.
I’m eager to play with OpenCV some more, particularly the video processing functions, perhaps even some GPU-accelerated versions.
Small Time DevOps
1 January 2021, by Multimedia Mike — General

When you are a certain type of nerd who has been on the internet for long enough, you might run the risk of accumulating a lot of projects and websites. Website-wise, I have this multimedia.cx domain on which I host a bunch of ancient static multimedia documents as well as this PHP/MySQL-based blog. Further, there are 3 other PHP/MySQL-based blogs hosted on subdomains. Also, there is the wiki, another PHP/MySQL web app. A few other custom PHP- and Python-based apps are running around on the server as well.
While things largely run on auto-pilot, I need to concern myself every now and then with their ongoing upkeep.
If you ask N different people about the meaning of the term ‘DevOps’, you will surely get N different definitions. However, whenever I have to perform VM maintenance, I like to think I am at least dipping my toes into the DevOps domain. At the very least, the job seems to be concerned with making infrastructure setup and upgrades reliable and repeatable.
Even if it’s not fully automated, at the very least, I have generated a lot of lists for how to make things work (I’m a big fan of Trello’s Kanban boards for this), so it gets easier every time (ideally, anyway).
Infrastructure History
For a solid decade, from 2004 to 2014, everything was hosted on shared, cPanel-based web hosting. In mid-2014, I moved from the shared hosting over to my own VPSs, hosted on DigitalOcean. I must have used Ubuntu 14.04 at the time, as I look down the list of Ubuntu LTS releases. It was with much trepidation that I undertook this task (knowing that anything that might go wrong with the stack, from the OS up to the apps, would all be firmly my fault), but it turned out not to be that bad. The earliest lesson you learn for such a small-time setup is to have a frontend VPS (web server) and a backend VPS (database server). That way, a surge in HTTP requests has no chance of crashing the database server due to depleted memory.
At the end of 2016, I decided to refresh the VMs. I brought them up to Ubuntu 16.04 at the time.
Earlier this year, I decided it would be a good idea to refresh the VMs again since it had been more than 3 years. The VMs were getting long in the tooth. Plus, I had seen an article speculating that Azure, another notable cloud hosting environment, might be getting full. It made me feel like I should grab some resources while I still could (resource-hoarding was in this year).
I decided to use 18.04 for these refreshed VMs, even though 20.04 was available. I think I was a little nervous about 20.04 because I heard weird things about something called snap packages being the new standard for distributing software for the platform and I wasn’t ready to take that plunge.
Which brings me to this month’s VM refresh in which I opted to take the 20.04 plunge.
Oh MediaWiki
I’ve been the maintainer and caretaker of the MultimediaWiki for 15 years now (wow! Where does the time go?). It doesn’t see a lot of updating these days, but I know it still serves as a resource for lots of obscure technical multimedia information. I still get requests for new accounts because someone has uncovered some niche technical data and wants to make sure it gets properly documented.
MediaWiki is quite an amazing bit of software and it undergoes constant development and improvement. According to the version history, I probably started the MultimediaWiki with the 1.5 series. As of this writing, 1.35 is the latest and therefore greatest lineage.
This pace of development can make it a bit of a chore to keep up to date. This was particularly true in the old days of the shared hosting, when you didn’t have direct shell access, and so it was something you put off for a long time.
Honestly, to be fair, the upgrade process is pretty straightforward:
- Unpack a set of new files on top of the existing tree
- Run a PHP script to perform any database table upgrades
Pretty straightforward, assuming that there are no hiccups along the way, right? And the vast majority of the time, that’s the case. Until it’s not. I had an upgrade go south about a year and a half ago (mine wasn’t the only MW installation to have the problem at the time, I learned). While I do have proper backups, it still threw me for a loop and I worked for about an hour to restore the previous version of the site. That experience understandably left me a bit gun-shy about upgrading the wiki.
But upgrades must happen, especially when security notices come out. Eventually, I created a Trello template with a solid, 18-step checklist for upgrading MW as soon as a new version shows up. It’s still a chore, just not so nerve-wracking when the steps are all enumerated like that.
As I compose this post, I recall my impetus for wanting to refresh from the 16.04 VM: 16.04 used PHP 7.0. I wanted to upgrade to the latest MW, but if I tried to do so, it warned me that it needed PHP 7.4. So I initialized the new 18.04 VM as described above… only to realize that PHP 7.2 is the default on 18.04. You need to go all the way to 20.04 for 7.4 as the standard. I’m sure it’s possible to install later versions of PHP on 16.04 or 18.04, but I appreciate going with the defaults provided by the distro.
I figured I would just stay with MediaWiki 1.34 series and eschew 1.35 series (requiring PHP 7.4) for the time being… until I started getting emails that 1.34 would go end-of-life soon. Oh, and there are some critical security updates, but those are only for 1.35 (and also 1.31 series which is still stubbornly being maintained for some reason).
So here I am with a fresh Ubuntu 20.04 VM running PHP 7.4 and MediaWiki 1.35 series.
How Much Process?
Anyone who decides to host on VPSs vs., say, shared hosting is (or ought to be) versed in the fact that all your data is your own problem, that glitches sometimes happen, and that your VM might just suddenly disappear. (Indeed, I’ve read rants about VMs disappearing and taking entire un-backed-up websites with them, and also watched as the ranters got no sympathy: “yeah, it’s a VM; the data is your responsibility.”) So I like to make sure I have enough notes that I could bring up a new VM quickly if I ever needed to.
But the process is a lot of manual steps. Sometimes I wonder if I need to use some automation software like Ansible in order to bring a new VM to life. Why do that if I only update the VM once every 1-3 years? Well, perhaps I should update more frequently in order to ensure the process is solid?
Seems like a lot of effort for a few websites which really don’t see much traffic in the grand scheme of things. But it still might be an interesting exercise and might be good preparation for some other websites I have in mind.
Besides, if I really wanted to go off the deep end, I would wrap everything up in containers and deploy using D-O’s managed Kubernetes solution.
Reverse Engineering Clue Chronicles Compression
15 January 2019, by Multimedia Mike — Game Hacking

My last post described my exploration into the 1999 computer game Clue Chronicles: Fatal Illusion. Some readers expressed interest in the details, so I thought I would post a bit more about how I have investigated and what I have learned.
It’s frustrating to need to reverse engineer a compression algorithm that is only applied to a total of 8 files (out of a total set of ~140), but here we are. Still, I’m glad some others expressed interest in this challenge as it motivated me to author this post, which in turn prompted me to test and challenge some of my assumptions.
Spoiler: commenter ‘m’ gave me the clue I needed: the PKWARE Data Compression Library uses the ‘implode’ algorithm rather than deflate. I was able to run this .ini data through an open-source ‘explode’ implementation found in libmpq and got the correct data out.
Files To Study
I uploaded a selection of files for others to study, should they feel so inclined. These include the main game binary (if anyone has ideas about how to isolate the decompression algorithm from the deadlisting); compressed and uncompressed examples from 2 files (newspaper.ini and Drink.ini); and the compressed version of Clue.ini, which I suspect is the root of the game’s script.

The Story So Far
This ad-hoc scripting language found in the Clue Chronicles game is driven by a series of .ini files that are available in both compressed and uncompressed forms, save for a handful of them which only come in compressed flavor. I have figured out a few obvious details of the compressed file format:

bytes 0-3    "COMP"
bytes 4-11   unknown
bytes 12-15  size of uncompressed data
bytes 16-19  size of compressed data (filesize - 20 bytes)
bytes 20-    compressed payload
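For anyone who wants to poke at the samples, a quick Python sketch of parsing that header might look like this. Note that the byte order of the size fields is my assumption (little-endian seems likely for a Windows-era game), and "Clue.ini" stands in for any of the compressed files:

```python
# Sketch of parsing the "COMP" header laid out above.
import struct

def parse_comp_header(data):
    if data[0:4] != b"COMP":
        raise ValueError("not a COMP file")
    unknown = data[4:12]                                 # purpose unknown
    uncompressed_size, compressed_size = struct.unpack_from("<II", data, 12)
    payload = data[20:]
    assert compressed_size == len(data) - 20             # matches the layout above
    return uncompressed_size, compressed_size, unknown, payload

with open("Clue.ini", "rb") as f:
    uncomp, comp, _, payload = parse_comp_header(f.read())
    print("uncompressed:", uncomp, "compressed:", comp)
```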
The average compression ratio is on the same order as what could be achieved by running ‘gzip’ against the uncompressed files and using one of the lower number settings (i.e., favor speed vs. compression size, e.g., ‘gzip -2’ or ‘gzip -3’). Since the zlib/DEFLATE algorithm is quite widespread on every known computing platform, I thought that this would be a good candidate to test.
Exploration
My thinking was that I could load the bytes of the compressed .ini file and feed them into Python’s zlib library, sliding through the first 100 bytes to see if any of them “catch” on the zlib decompression algorithm.

Here is the exploration script:
https://gist.github.com/multimediamike/c95f1a9cc58b959f4d8b2a299927d35e
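For reference, the idea boils down to something like the following sketch (not the gist verbatim; "Clue.ini" is just one of the compressed samples):

```python
# Slide through the first 100 bytes of the payload and see whether zlib
# accepts a stream starting at any of those offsets.
import zlib

with open("Clue.ini", "rb") as f:
    payload = f.read()[20:]          # skip the 20-byte "COMP" header

for offset in range(100):
    try:
        zlib.decompress(payload[offset:])
        print("zlib stream found at offset", offset)
        break
    except zlib.error:
        pass
else:
    print("no valid zlib stream found in the first 100 bytes")
```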
It didn’t work, i.e., the script did not find any valid zlib data. A commenter on my last post suggested trying bzip2, so I tried the same script but with the bzip2 decompressor library. Still no luck.
Wrong Approach
I realized I had not tested to make sure that this exploratory script would work on known zlib data. So I ran it on a .gz file and it failed to find zlib data. So it looks like my assumptions were wrong. Meanwhile, I can instruct Python to compress data with zlib and dump the data to a file, and then run the script against that raw zlib output and the script recognizes the data.

I spent some time examining how zlib and gzip interact at the format level. It turns out that a gzip container wraps a raw DEFLATE stream with its own header rather than a zlib-style header, so a probe that expects zlib-wrapped data will never catch. So this approach was doomed to failure.
A Closer Look At The Executable
Installation of Clue Chronicles results in a main Windows executable named Fatal_Illusion.exe. It occurred to me to examine this again, specifically for references to something like zlib.dll. Nothing like that. However, a search for ‘compr’ shows various error messages which imply that there is PNG-related code inside (referencing IHDR and zTXt data types), even though PNG files are not present in the game’s asset mix.

But there are also strings like “PKWARE Data Compression Library for Win32”. So I have started going down the rabbit hole of determining whether the compression is part of a ZIP format file. After all, a ZIP local file header data structure has 4-byte compressed and uncompressed sizes, as seen in this format.
Binary Reverse Engineering
At one point, I took the approach of attempting to reverse engineer the binary. When studying a deadlisting of the code, it’s easy to search for the string “COMP” and find some code that cares about these compressed files. Unfortunately, the code quickly follows an indirect jump instruction, which makes it intractable to track the algorithm from a simple deadlisting.

I also tried installing some old Microsoft dev tools on my old Windows XP box, setting some breakpoints while the game was running, and doing some old-fashioned step debugging. That was a total non-starter. According to my notes:
Address 0x004A3C32 is the setup to the strncmp(“COMP”, ini_data, 4) function call. Start there.
Problem: The game forces 640x480x256 mode and that makes debugging very difficult.
Just For One Game?
I keep wondering if this engine was used for any other games. Clue Chronicles was created by EAI Interactive. As I review the list of games they are known to have created (ranging between 1997 and 2000), a few of them jump out at me as possibly being able to leverage the same engine. I have a few of them, so I checked those… nothing. Then I scrubbed some YouTube videos showing gameplay of other suspects. None of those strike me as having similar engine characteristics to Clue Chronicles. So this remains a mystery: did they really craft this engine with its own scripting language just for one game?
Parsing The Clue Chronicles
30 December 2018, by Multimedia Mike — Game Hacking

A long time ago, I procured a 1999 game called Clue Chronicles: Fatal Illusion, based on the classic board game Clue, a.k.a. Cluedo. At the time, I was big into collecting old, unloved PC games so that I could research obscure multimedia formats.
Surveying the 3 CD-ROMs contained in the box packaging revealed only Smacker (SMK) videos for full motion video which was nothing new to me or the multimedia hacking community at the time. Studying the mix of data formats present on the discs, I found a selection of straightforward formats such as WAV for audio and BMP for still images. I generally find myself more fascinated by how computer games are constructed rather than by playing them, and this mix of files has always triggered a strong “I could implement a new engine for this!” feeling in me, perhaps as part of the ScummVM project which already provides the core infrastructure for reimplementing engines for 2D adventure games.
Tying all of the assets together is a custom high-level programming language. I have touched on this before in a blog post over a decade ago. The scripts are in a series of files bearing the extension .ini (usually reserved for configuration scripts, but we’ll let that slide). A representative sample of such a script can be found here:
What Is This Language?
At the time I first analyzed this language, I was still primarily a C/C++-minded programmer, with a decent amount of Perl experience as a high-level language, and had just started to explore Python. I assessed this language to be “mildly object oriented with C++-type comments (‘//’) and reliant upon a number of implicit library functions”. Other people saw other properties. When I look at it nowadays, it reminds me a bit more of JavaScript than C++. I think it’s sort of a Rorschach test for programming languages.

Strangely, I sort of had this fear that I would put a lot of effort into figuring out how to parse the language, only for someone to come along and point out that it’s a well-known yet academic language that already has a great deal of supporting code and libraries available as open source. Google for “spanish dolphins far side comic” for an illustration of the feeling this would leave me with.
It doesn’t matter in the end. Even if such libraries exist, how easy would they be to integrate into something like ScummVM? Time to focus on a workable approach to understanding and processing the format.
Problem Scope
So I set about to see if I could write a program to parse the language seen in these INI files. Some questions:

- How large is the corpus of data that I need to be sure to support?
- What parsing approach should I take?
- What is the exact language format?
- Other hidden challenges?
To figure out how large the data corpus is, I counted all of the INI files on all of the discs. There are 138 unique INI files between the 3 discs. However, there are 146 unique INI files after installation. This leads to a hidden challenge described a bit later.
What parsing approach should I take? I worried a bit too much that I might not be doing this the “right” way. I’m trying to ignore doubts like this, like how “SQL Shame” blocked me on a task for a little while a few years ago as I concerned myself that I might not be using the purest, most elegant approach to the problem. I know I covered language parsing a long time ago in my university computer science education and there is a lot of academic literature on the matter. But sometimes, you just have to charge in and experiment and prototype and see what falls out. In doing so, I expect to have a better understanding of the problems that need to be solved and the right questions to ask, not unlike that time that I wrote a continuous integration system from scratch because I didn’t actually know that “continuous integration” was the keyword I needed.
Next, what is the exact language format? I realized that parsing the language isn’t the first and foremost problem here; I need to know exactly what the language is. I need to know what the grammar and keywords are. In essence, I need to reverse engineer the language before I write a proper parser for it. I guess that fits in nicely with the historical aim of this blog (reverse engineering).
Now, about the hidden challenges: I mentioned that there are 8 more INI files after the game installs itself. Okay, so what’s the big deal? For some reason, all of the INI files are in plaintext on the CD-ROM but get compressed (apparently, judging by file size ratios) when installed to the hard drive. This includes those 8 extra INI files. I thought to look inside the CAB installation archive file on the CD-ROM, and the files were there… but all in compressed form. I suspect that one of the files forms the “root” of the program and is the launching point for the game.
Parsing Approach
I took a stab at parsing an INI file. My approach was to first perform lexical analysis on the file and create a list of 4 types: symbols, numbers, strings, and language elements ([]{}()=.,:). Apparently, this is the kind of thing that Lex/Flex are good at. This prototyping tool is written in Python, but when I port this to ScummVM, it might be useful to call upon the services of Lex/Flex, or another lexical analyzer, for there are many. I have a feeling it will be easier to use better tools when I understand the full structure of the language based on the data available.
The purpose of this tool is to explore all the possibilities of the existing corpus of INI files. To that end, I ran all 138 of the plaintext files through it, collected all of the symbols, and massaged the results, assuming that the symbols that occurred most frequently are probably core language features. These are all the symbols which occur more than 1000 times among all the scripts:

6248 false
5734 looping
4390 scripts
3877 layer
3423 sequentialscript
3408 setactive
3360 file
3257 thescreen
3239 true
3008 autoplay
2914 offset
2599 transparent
2441 text
2361 caption
2276 add
2205 ge
2197 smackanimation
2196 graphicscript
2196 graphic
1977 setstate
1642 state
1611 skippable
1576 desc
1413 delayscript
1298 script
1267 seconds
1019 rect
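For anyone curious, a sketch of the kind of tokenizer-and-tally pass described above might look like this; the directory name is hypothetical and the token rules are my guesses, not the actual prototype:

```python
# Split each script into strings, comments, symbols, numbers, and
# punctuation, then tally the bare symbols across the whole corpus.
import re
from collections import Counter
from pathlib import Path

TOKEN = re.compile(r"""
      "[^"]*"                 # quoted strings
    | //[^\n]*                # C++-style comments
    | [A-Za-z_]\w*            # symbols / identifiers
    | -?\d+(?:\.\d+)?         # numbers
    | [\[\]{}()=.,:]          # language punctuation
""", re.VERBOSE)

counts = Counter()
for ini in Path("scripts").glob("*.ini"):
    for token in TOKEN.findall(ini.read_text(errors="replace")):
        if token[0].isalpha() or token[0] == "_":
            counts[token.lower()] += 1

for symbol, n in counts.most_common():
    if n > 1000:
        print(n, symbol)
```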
About That Compression
I have sorted out at least these few details of the compression:

bytes 0-3    "COMP" (a pretty strong sign that this is, in fact, compressed data)
bytes 4-11   unknown
bytes 12-15  size of uncompressed data
bytes 16-19  size of compressed data (filesize - 20)
bytes 20-    compressed payload
The compression ratios are on the same order as gzip’s. I was hoping that it was stock zlib data. However, I have been unable to prove this. I wrote a Python script that scrubbed through the first 100 bytes of payload data and tried to get Python’s zlib.decompress to initialize; no luck. It’s frustrating to know that I’ll have to reverse engineer a compression algorithm that deals with just 8 total text files if I want to see this effort through to fruition.
Update, January 15, 2019
Some folks expressed interest in trying to sort out the details of the compression format. So I have posted a followup in which I share some samples and go into deeper detail about the things I have tried:

Reverse Engineering Clue Chronicles Compression
Dreamcast Serial Extractor
31 December 2017, by Multimedia Mike — Sega Dreamcast

It has not been a very productive year for blogging. But I started the year by describing an unfinished project that I developed for the Sega Dreamcast, so I may as well end the year the same way. The previous project was a media player. That initiative actually met with some amount of success and could have developed into something interesting if I had kept at it.
By contrast, this post describes an effort that was ultimately a fool’s errand that I spent way too much time trying to make work.
Problem Statement
In my never-ending quest to analyze the structure of video games while also hoarding a massive collection of them (though I’m proud to report that I did play at least a few of them this past year), I wanted to be able to extract the data from my many Dreamcast titles, both games and demo discs. I had a tool called the DC Coder’s Cable, a serial cable that enables communication between a Dreamcast and a PC. With the right software, you could dump an entire Dreamcast GD-ROM, which contained a gigabyte worth of sectors.

Problem: the dumping software (named ‘dreamrip’ and written by noted game hacker BERO) operated in a very basic mode, methodically dumping sector after sector and sending it down the serial cable. This meant that it took about 28 hours to extract all the data on a single disc by running at the maximum speed of 115,200 bits/second, or about 11 kilobytes/second. I wanted to create a faster method.
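(For a rough sanity check on those numbers, under the usual assumption of 8-N-1 serial framing: 115,200 bits/second works out to about 11,520 bytes/second, and a full gigabyte at that rate is roughly 26 hours of raw transfer time before any protocol overhead, which lines up with the observed 28 hours.)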
The Pitch
I formed a mental model of dreamrip’s operation that looked like this:
As an improvement, I envisioned this beautiful architecture:
Architectural Assumptions
My proposed architecture was predicated on the assumption that the disc reading and serial output functions were both I/O-bound operations and that the CPU would be idle much of the time. My big idea was to use that presumably idle CPU time to compress the sectors before sending them over the wire. As long as the CPU can compress the data faster than 11 kbytes/sec, it should be a win. In order to achieve this, I broke the main program into 3 threads:

- The first thread reads the sectors; more specifically, it asks the drive firmware to please read the sectors and make the data available in system RAM
- The second thread waits for sector data to appear in memory and then compresses it
- The third thread takes the compressed data when it is ready and shuffles it out through the serial cable
Simple and elegant, right?
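To make the shape of that pipeline concrete, here is a host-side Python sketch of the same three-stage hand-off. The real tool would have run natively on the Dreamcast, so this is purely illustrative, with fake sector reads and a fake serial port:

```python
# Simulate the reader -> compressor -> sender hand-off using queues.
import io
import queue
import threading
import zlib

SECTOR_SIZE = 2048
NUM_SECTORS = 100

raw_q = queue.Queue(maxsize=8)     # sectors waiting to be compressed
out_q = queue.Queue(maxsize=8)     # compressed blocks waiting to be sent
serial_port = io.BytesIO()         # stand-in for the serial cable

def reader():
    for i in range(NUM_SECTORS):
        raw_q.put(bytes([i % 256]) * SECTOR_SIZE)   # fake "disc sector"
    raw_q.put(None)                                 # end-of-disc marker

def compressor():
    while (sector := raw_q.get()) is not None:
        out_q.put(zlib.compress(sector))
    out_q.put(None)

def sender():
    while (block := out_q.get()) is not None:
        serial_port.write(len(block).to_bytes(4, "little") + block)

threads = [threading.Thread(target=fn) for fn in (reader, compressor, sender)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("wrote", serial_port.tell(), "bytes to the fake serial port")
```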
For data track compression, I wanted to start with zlib in order to prove the architecture, but then also try bzip2 or lzma. As long as they could compress data faster than the serial port could write it, then it should be a win. For audio track compression, I wanted to use the Flake FLAC encoder. According to my notes, I did get both bzip2 compression and the Flake compressor working on the Dreamcast. I recall choosing Flake over the official FLAC encoder because it was much simpler and had fewer dependencies, always an important consideration for platforms such as this.
Problems
I worked for quite a while on this project. I have a lot of notes recorded, but many of the problems I had remain a bit vague in my memory. However, there was one problem I discovered that eventually sunk the entire initiative:

The serial output operation is CPU-bound.
My initial mental model was that a buffer could be “handed off” to the serial subsystem and the CPU could go back to doing other work. Nope. It turns out that the CPU was participating at every step of the serial transfer.
Further, I eventually dug into the serial driver code and learned that there was already some compression taking place via the miniLZO library.
Lessons Learned
- Recognize the assumptions that you’re making up front at the start of the project.
- Prototype in order to ensure plausibility
- Profile to make sure you’re optimizing the right thing (this is something I have learned again and again).
Another interesting tidbit from my notes: it doesn’t matter how many sectors you read at a time, the overall speed is roughly the same. I endeavored to read 1000 2048-byte data sectors, 1 or 10 or 100 at a time, or all 1000 at once. My results:
- 1: 19442 ms
- 10: 19207 ms
- 100: 19194 ms
- 1000: 19320 ms
No difference. That surprised me.
Side Benefits
At one point, I needed to understand how BERO’s dreamrip software was operating. I knew I used to have the source code but I could no longer find it. Instead, I decided to try to reverse engineer what I needed from the SH-4 binary image that I had. It wasn’t an ELF image; rather, it was a raw binary meant to be loaded at a particular memory location, which makes it extra challenging for ‘objdump’. This led to me asking my most viewed and upvoted question on Stack Overflow: “Disassembling A Flat Binary File Using objdump”. The next day, it also led me to post one of my most upvoted answers when I found the solution elsewhere.

Strangely, I have since tried out the command line shown in my answer and have been unable to make it work. But people keep upvoting both the question and the answer.
Eventually this all became moot when I discovered a misplaced copy of the source code on one of my computers.
I strongly recall binging through the Alias TV show while I was slogging away on this project, so I guess that’s a positive association since I got so many fun screenshots out of it.
The Final Resolution
Strangely, I was still determined to make this project work even though the Dreamcast SD adapter arrived for me about halfway through the effort. Part of this was just stubbornness, but part of it was my assumptions about serial port speeds; in particular, my assumption that there was a certain speed-of-light type of limitation on serial port speeds, so that the SD adapter, operating over the DC’s serial port, would not be appreciably faster than the serial cable.

This turned out to be very incorrect. In fact, the SD adapter is capable of extracting an entire gigabyte disc image in 35-40 minutes. This is the method I have since been using to extract Dreamcast disc images.