Breaking Eggs And Making Omelettes

A blog dealing with technical multimedia matters, binary reverse engineering, and the occasional video game hack.

http://multimedia.cx/eggs/

Articles published on the site

  • Evolution of Multimedia Fiefdoms

    October 1, 2014, by Multimedia Mike (General)

    I want to examine how multimedia fiefdoms have risen and fallen through the years.


    Medieval Castle

    Back in the day, the multimedia fiefdoms were built around the formats put forth by competing companies: there was Microsoft/WMV, Apple/MOV, and Real/RM as the big contenders. On2 always wanted to be a player in this arena but could never quite catch a break. A few brave contenders held the line for open source and also for the power users who desired one application that could handle everything (my original motivation for wanting to get into multimedia hacking).

    The computer desktop was the battleground for internet-based media streaming. Whatever happened to those days? Actually, if memory serves, Flash-based video streaming stepped on all of them.

    Over the last 6-7 years, the battleground has expanded to cover mobile devices, where Flash’s impact has… lessened. During this time, multimedia technology pretty well standardized on a particular stack, namely, the MPEG (MP4/H.264/AAC) stack.

    The belligerents in this war tried for years to effectively penetrate new territory, namely, the living room where the television lived. This had been slow going for years due to various user interface and content issues, but steadily improved.

    Last April, Amazon announced their entry into the set-top box market with the Fire TV. That was when it suddenly crystallized for me that the multimedia ecosystem has radically shifted. Now, the multimedia fiefdoms revolve around access to content via streaming services.

    Off the top of my head, here are some of the fiefdoms these days (fiefdoms I have experience using):

    • Netflix (subscription streaming)
    • Amazon (subscription, rental, and purchased streaming)
    • Hulu Plus (subscription streaming)
    • Apple (rental and purchased media)

    I checked some results on Can I Stream.It? (which I refer to often) and found a bunch more streaming fiefdoms such as Google (both Play and YouTube, which are separate services), Sony, Xbox 360, Crackle, Redbox Instant, Vudu, Target Ticket, Epix, SnagFilms, and XFINITY StreamPix. And these are probably just the services available in the United States; I know other geographical regions have their own fiefdoms.

    What happened?

    When I got into multimedia hacking, there were all these disparate, competing ecosystems. As a consumer, I didn’t care where the media came from, I just wanted to play it. That’s what inspired me to work on open source multimedia projects. Now I realize that I have the same problem 10-15 years later: there are multiple competing ecosystems. I might subscribe to fiefdoms X and Y, but am frustrated to learn that something I’d like to watch is only available through fiefdom Z. Very few of these fiefdoms can be penetrated using open source technology.

    I’m not really sure about the point of this whole post. Multimedia technology seems really standardized these days. But that’s probably just my perspective because I have spent way too long focusing on a few areas of multimedia technology such as audio and video coding. It’s interesting that all these services probably leverage the same limited number of codecs. Their differentiation comes from the catalog of content that each is able to license for streaming. There are different problems to solve in the multimedia arena now.

  • Visualizing Call Graphs Using Gephi

    September 1, 2014, by Multimedia Mike (General)

    When I was at university studying computer science, I took a basic chemistry course. During an accompanying lab, the teaching assistant chatted me up and asked about my major. He then said, “Computer science? Well, that’s just typing stuff, right?”

    My impulsive retort: “Sure, and chemistry is just about mixing together liquids and coming up with different colored liquids, as seen on the cover of my high school chemistry textbook, right?”


    Chemistry fun

    In fact, pure computer science has precious little to do with typing (as is joked in CS circles, computer science is about computers in the same way that astronomy is about telescopes). However, people who study computer science often pursue careers as programmers, or to put it in fancier professional language, software engineers.

    So, what’s a software engineer’s job? Isn’t it just typing? That’s where I’ve been going with this overly long setup. After thinking about it for long enough, I like to say that a software engineer’s trade is managing complexity.

    A few years ago, I discovered Gephi, an open source tool for graph and data visualization. It looked neat but I didn’t have much use for it at the time. Recently, however, I was trying to get a better handle on a large codebase. I.e., I was trying to manage the project’s complexity. And then I thought of Gephi again.

    Prior Work
    One way to get a grip on a large C codebase is to instrument it for profiling and extract details from the profiler. On Linux systems, this means compiling and linking the code using the -pg flag. After running the executable, there will be a gmon.out file which is post-processed using the gprof command.

    GNU software development tools have a reputation for being rather powerful and flexible, but also extremely raw. This first hit home when I was learning how to use the GNU tool for code coverage — gcov — and the way it outputs very raw data that you need to massage with other tools in order to get really useful intelligence.

    And so it is with gprof output. The output gives you a list of functions sorted by the amount of processing time spent in each. Then it gives you a flattened call tree. This is arranged as “during the profiled executions, function c was called by functions a and b and called functions d, e, and f; function d was called by function c and called functions g and h”.

    How can this call tree data be represented in a more instructive manner that is easier to navigate? My first impulse (and I don’t think I’m alone in this) is to convert the gprof call tree into a representation suitable for interpretation by Graphviz. Unfortunately, doing so tends to generate some enormous and unwieldy static images.

    Feeding gprof Data To Gephi
    I learned of Gephi a few years ago and recalled it when I developed an interest in gaining better perspective on a large base of alien C code. To understand what this codebase is doing for a particular use case, instrument it with gprof, gather execution data, and then study the code paths.

    How could I feed the gprof data into Gephi? Gephi supports numerous graphing formats including an XML-based format named GEXF.

    Thus, the challenge becomes converting gprof output to GEXF.

    Which I did.
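
    The converter itself is published on GitHub (linked at the end of this post). Purely as an illustration of the idea, and not the actual convert-gprof-to-gexf.py, here is a minimal Python sketch that writes a GEXF file Gephi can load from a hand-built (caller, callee, call count) edge list, reusing the a/b/c call-tree example from earlier; the edge list and call counts are invented:

    # Minimal sketch: write a directed call graph as GEXF 1.2 for Gephi.
    # The edge list below is invented; a real converter would first parse
    # gprof's call-graph section, which is the fiddly part.
    import xml.etree.ElementTree as ET

    edges = [("a", "c", 10), ("b", "c", 5), ("c", "d", 40),
             ("c", "e", 7), ("c", "f", 2), ("d", "g", 12), ("d", "h", 3)]

    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="directed")
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")

    node_ids = {}
    for caller, callee, _ in edges:
        for name in (caller, callee):
            if name not in node_ids:
                node_ids[name] = str(len(node_ids))
                ET.SubElement(nodes_el, "node", id=node_ids[name], label=name)

    # The edge weight carries the call count; Gephi can map it to edge thickness.
    for i, (caller, callee, count) in enumerate(edges):
        ET.SubElement(edges_el, "edge", id=str(i), source=node_ids[caller],
                      target=node_ids[callee], weight=str(count))

    ET.ElementTree(gexf).write("callgraph.gexf", encoding="UTF-8",
                               xml_declaration=True)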

    Demonstration
    I have been absent from FFmpeg development for a long time, which is a pity because a lot of interesting development has occurred over the last 2-3 years after a troubling period of stagnation. I know that 2 big video codec developments have been HEVC (next in the line of MPEG codecs) and VP9 (heir to VP8’s throne). FFmpeg implements them both now.

    I decided I wanted to study the code flow of VP9. So I got the latest FFmpeg code from git and built it using the options "--extra-cflags=-pg --extra-ldflags=-pg". Annoyingly, I also needed to specify "--disable-asm" because gcc complains of some register allocation snafus when compiling inline ASM in profiling mode (and this is on x86_64). No matter; ASM isn’t necessary for understanding overall code flow.
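
    Spelled out as a full build sequence, that works out to something like the following (a hypothetical invocation; the exact options may differ depending on the checkout):

    ./configure --extra-cflags=-pg --extra-ldflags=-pg --disable-asm
    make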

    After compiling, the binary ‘ffmpeg_g’ will have symbols and be instrumented for profiling. I grabbed a sample from this VP9 test vector set and went to work.

    ./ffmpeg_g -i vp90-2-00-quantizer-00.webm -f null /dev/null
    gprof ./ffmpeg_g > vp9decode.txt
    convert-gprof-to-gexf.py vp9decode.txt > ~/bigdisk/vp9decode.gexf
    

    Gephi loads vp9decode.gexf with no problem. Using Gephi, however, can be a bit challenging if one is not versed in data exploration jargon. I recommend this Gephi getting started guide in slide deck form. Here’s what the default graph looks like:


    gprof-ffmpeg-gephi-1

    Not very pretty or helpful. BTW, that beefy arrow running from mid-top to lower-right is the call from decode_coeffs_b -> iwht_iwht_4x4_add_c. There were 18774 calls from the former to the latter in this execution. Right now, the edge thicknesses correlate to the number of calls between the nodes, which I’m not sure is the best representation.

    Following the tutorial slide deck, I at least learned how to enable the node labels (function symbols in this case) and apply a layout algorithm. The tutorial shows the force atlas layout. Here’s what the node neighborhood looks like for probing file type:


    gprof-ffmpeg-gephi-2

    Okay, so that’s not especially surprising– avprobe_input_format3 calls all of the *_probe functions in order to automatically determine input type. Let’s find that decode_coeffs_b function and see what its neighborhood looks like:


    gprof-ffmpeg-gephi-3

    That’s not very useful. Perhaps another algorithm might help. I select the Fruchterman–Reingold algorithm instead and get a slightly more coherent representation of the decoding node neighborhood:


    gprof-ffmpeg-gephi-4

    Further Work
    Obviously, I’m just getting started with this data exploration topic. One thing I would really appreciate in such a tool is the ability to interactively travel the graph since that’s what I’m really hoping to get out of this experiment– watching the code flows.

    Perhaps someone else can find better use cases for visualizing call graph data. Thus, I have published the source code for this tool on GitHub.

  • Vedanti and Max Sound vs. Google

    August 14, 2014, by Multimedia Mike (Legal/Ethical)

    Vedanti Systems Limited (VSL) and Max Sound Corporation filed a lawsuit against Google recently. Ordinarily, I wouldn’t care about corporate legal battles. However, this one interests me because it’s multimedia-related. I’m curious to know how coding technology patents might hold up in a real court case.

    Here’s the most entertaining complaint in the lawsuit:

    Despite Google’s well-publicized Code of Conduct — “Don’t be Evil” — which it explains is “about doing the right thing,” “following the law,” and “acting honorably,” Google, in fact, has an established pattern of conduct which is the exact opposite of its claimed piety.

    I wonder if this is the first known case in which Google has been sued over its long-obsoleted “Don’t be evil” mantra?

    Researching The Plaintiffs

    I think I made a mistake by assuming this lawsuit might have merit. My first order of business was to see what the plaintiff organizations have produced. I have a strong feeling that these might be run-of-the-mill patent trolls.

    VSL currently has a blank web page. Further, the Wayback Machine only has pages reaching back to 2011. The earliest page lists these claims against a plain black background (I’ve highlighted some of the more boisterous claims and the passages that make it appear that Vedanti doesn’t actually produce anything but is strictly an IP organization):

    The inventions key :
    The patent and software reduced any data content, without compressing, up to a 97% total reduction of the data which also produces a lossless result. This physics based invention is often called the Holy Grail.

    Vedanti Systems Intellectual Property
    Our strategic IP portfolio is granted in all of the world’s largest technology development and use countries. A major value indemnification of our licensee products is the early date of invention filing and subsequent Issue. Vedanti IP has an intrinsic 20 year patent protection and valuation in royalties and licensing. The original data transmission art has no prior art against it.

    Vedanti Systems invented among other firsts, The Slice and Partitioning of Macroblocks within a RGB Tri level region in a frame to select or not, the pixel.

    Vedanti Systems invention is used in nearly every wireless chipset and handset in the world

    Our original pixel selection system revolutionized wireless handset communications. An example of this system “Slice” and “Macroblock Partitioning” is used throughout Satellite channel expansion, Wireless partitioning, Telecom – Video Conferencing, Surveillance Cameras, and 2010 developing Media applications.

    Vedanti Systems is a Semiconductor based software, applications, and IP Continuations Intellectual Property company.

    Let’s move on to the other plaintiff, Max Sound. They have a significantly more substantive website. They also have an Android app named Spins HD Audio, which appears to be little more than a music player based on the screenshots.

    Max Sound also has a stock ticker symbol: MAXD. Something clicked into place when I looked up their ticker symbol: While worth only a few pennies, it was worth a few more pennies after this lawsuit was announced, which might be one of the motivations behind the lawsuit.

    Here’s a trick I learned when I was looking for a new tech job last year: When I first look at a company’s website and am trying to figure out what they really do, I head straight to their jobs/careers page. A lot of corporate websites have way too much blathering corporatese that can be tough to cut through. But when I see what mix of talent and specific skills they are hoping to hire, that gives me a much better portrait of what the company does.

    The reason I bring this up is that this tech company doesn’t seem to have a jobs/careers page.

    The Lawsuit
    The core complaint centers on Patent 7974339: Optimized data transmission system and method. It was filed in July 2004 (or possibly as early as January 2002), issued in July 2011, and assigned to (purchased by?) Vedanti in May 2012. The lawsuit alleges that nearly everything Google has ever produced (or, more accurately, purchased) leverages the patented technology.

    The patent itself has 5 drawings. If you’ve ever seen a multimedia codec patent, or any whitepaper on a multimedia codec, you’ve seen these graphs before. E.g., “Raw pixels come in here -> some analysis happens here -> more analysis happens over here -> entropy coding -> final bitstream”. The text of a patent document isn’t meant to be particularly useful. I’ve tried to understand this stuff before and it never goes well. Skimming the text, I just see a blur of the words data, transmission, pixel, and matrix.

    So I read the complaint to try to figure out what this is all about. To summarize the storyline as narrated by the lawsuit, some inventors were unhappy with the state of video compression in 2001 and endeavored to create something better. So they did, and called it the VSL codec. This codec is so far undocumented on the MultimediaWiki, so it probably has yet to be seen “in the wild”. Good luck finding hard technical data on it now since searches for “VSL codec” are overwhelmed by articles about this lawsuit. Also, the original codec probably wasn’t called VSL because VSL is apparently an IP organization formed much later.

    Then, the protagonists of the lawsuit patented the codec. Then, years later, Google wanted to purchase a video codec that they could open source and use to supplant H.264.

    The complaint goes on to allege that in 2010, Google specifically contacted VSL to possibly license or acquire this mysterious VSL technology. Google was allegedly allowed to study the technology, eventually decided not to continue discussions, and shipped back the proprietary materials.

    Here’s where things get weird. When Google shipped back the materials, they allegedly shipped back a bunch of Post-It notes. The notes are alleged to contain a ton of incriminating evidence. The lawsuit claims that the notes contained such tidbits as:

    • Google was concerned that its infringement could be considered “recklessness” (the standard applicable to willful infringement);
    • Google personnel should “try” to destroy incriminating emails;
    • Google should consider a “design around” because it was facing a “risk of litigation.”

    Actually, given Google’s acquisition of On2, I can totally believe that last one (On2’s codecs have famously contained a lot of weirdness which is commonly suspected to be attributable to designing around known patents).

    Anyway, a lot of this case seems to hinge on the authenticity of these Post-It notes:

    “65. The Post-It notes are unequivocal evidence of Google’s knowledge of the ’339 Patent and infringement by Defendants”

    I wish I could find a stock photo of a stack of Post-It notes in an evidence bag.

    I’ve worked at big technology companies. Big tech companies these days are very diligent about indoctrinating employees about IP liability issues. The reason this Post-It situation strikes me as odd is because the alleged contents of the notes basically outline everything the corporate lawyers tell you NOT to do.

    Analysis
    I’m trying to determine what specific algorithms and coding techniques are actually at issue. I guess I was expecting to see a specific claim that, “Our patent outlines this specific coding technique and here is unequivocal proof that Google A) uses the same technique, and B) specifically did so after looking at our patent.” I didn’t find that (well, a bit of part B, cf. the Post-It note debacle), but maybe that’s not how these patent lawsuits operate. I’ve never kept up with one before.

    Maybe it’s just a patent troll. Maybe it’s for the stock bump. I’m expecting to see pump-n-dump stock spam featuring the stock symbol MAXD anytime now.

    I’ve never been interested in following a lawsuit case carefully before. I suddenly find myself wondering if I can subscribe to the RSS feed for this case? Too much to hope for. But I found this item through Pando and maybe they’ll stay on top of it.

  • Server Move For multimedia.cx

    August 1, 2014, by Multimedia Mike (General)

    I made a big change to multimedia.cx last week: I moved hosting from a shared web hosting plan that I had been using for 10 years to a dedicated virtual private server (VPS). In short, I now have no one to blame but myself for any server problems I experience from here on out.

    The tipping point occurred a few months ago when my game music search engine kept breaking regardless of what technology I was using. First, I had an admittedly odd C-based CGI solution which broke due to mysterious binary compatibility issues, the sort that are bound to occur when trying to make a Linux binary run on heterogeneous distributions. The second solution was an SQLite-based solution. Like the first solution, this worked great until it didn’t work anymore. Something else mysteriously broke vis-à-vis PHP and SQLite on my server. I started investigating a MySQL-based full text search solution but couldn’t make it work, and decided that I shouldn’t have to either.

    Ironically, just before I finished this entire move operation, I noticed that my SQLite-based FTS solution was working again on the old shared host. I’m not sure when that problem went away. No matter, I had already thrown the switch.

    How Hard Could It Be?
    We all have thresholds for the type of chores we’re willing to put up with and which we’d rather pay someone else to perform. For the past 10 years, I felt that administering a website’s underlying software is something that I would rather pay someone else to worry about. To be fair, 10 years ago, I don’t think VPSs were a thing, or at least a viable thing in the consumer space, and I wouldn’t have been competent enough to properly administer one. Though I would have been a full-time Linux user for 5 years at that point, I was still the type to build all of my own packages from source (I may have still been running Linux From Scratch 10 years ago) which might not be the most tractable solution for server stability.

    These days, VPSs are a much more affordable option (easily competitive with shared web hosting). I also realized I know exactly how to install and configure all the software that runs the main components of the various multimedia.cx sites, having done it on local setups just to ensure that my automated backups would actually be useful in the event of catastrophe.

    All I needed was the will to do it.

    The Switchover Process
    Here’s the rough plan:

    • Investigate options for both VPS providers and mail hosts– I might be willing to run a web server but NOT a mail server
    • Start plotting several months in advance of my yearly shared hosting renewal date
    • Screw around for several months, playing video games and generally finding reasons to put off the move
    • Panic when realizing there are only a few days left before the yearly renewal comes due

    So that’s the planning phase. BTW, I chose Digital Ocean for VPS and Zoho for email hosting. Here’s the execution phase I did last week:

    • Register with Digital Ocean and set up DNS entries to point to the old shared host for the time being
    • Once the D-O DNS servers respond correctly using a manual ‘dig’ command, use their servers as the authoritative ones for multimedia.cx
    • Create a new Droplet (D-O VPS), install all the right software, move the databases, upload the files; and exhaustively document each step, gotcha, and pitfall; treat a VPS as necessarily disposable and have an eye towards iterating the process with a new VPS
    • Use /etc/hosts on a local machine to point DNS to the new server and verify that each site is working correctly (see the sketch just after this list)
    • After everything looks all right, update the DNS records to point to the new server
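
    For the /etc/hosts step referenced above, the override on the local verification machine looks something like this. Note that 203.0.113.10 is a documentation-range placeholder rather than the Droplet’s real address, and the hostnames are only examples:

    # /etc/hosts on the local verification machine (placeholder values)
    203.0.113.10    multimedia.cx    wiki.multimedia.cx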

    Finally, flip the switch on the MX record by pointing it to the new email provider.

    Improvements and Problems
    Hosting on Digital Ocean is quite amazing so far. Maybe it’s the SSDs. Whatever it is, all the sites are performing far better than on the old shared web host. People who edit the MultimediaWiki report that changes get saved in less than the 10 or so seconds required on the old server.

    Again, all problems are now my problems. A sore spot with the shared web host was general poor performance. The hosting company would sometimes complain that my sites were using too much CPU. I would have loved to try to optimize things. However, the cPanel interface found on many shared hosts doesn’t give you a great deal of data for debugging performance problems. Meanwhile, the same sites, same software, and same load on the VPS are considerably more performant.

    Problem: I’ve already had the MySQL database die due to a spike in usage. I had to manually restart it. I was considering a cron-based solution to check if the server is running and restart it if not. When I reasoned that my databases are mostly read and rarely modified, so the occasional crash shouldn’t be too disastrous, a friend helpfully reminded me: “You would not make a good sysadmin with attitudes like ‘an occasional crash is okay’.”
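
    A minimal sketch of that cron idea (not something actually deployed; the mysqladmin and service commands, and the service name, vary by distribution) might look like:

    # mysql_watchdog.py: hypothetical check-and-restart script, run from cron, e.g.:
    #   */5 * * * * /usr/local/bin/mysql_watchdog.py
    # Assumes 'mysqladmin' is installed and the init service is named 'mysql'.
    import subprocess

    def mysql_is_up():
        # 'mysqladmin ping' exits 0 when the server answers, non-zero otherwise
        return subprocess.call(["mysqladmin", "ping", "--silent"]) == 0

    if not mysql_is_up():
        subprocess.call(["service", "mysql", "restart"])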

    To this end, I am planning to migrate the database server to a separate VPS. This is a strategy that even Digital Ocean recommends. I’m hoping that the MySQL server isn’t subject to such memory spikes, but I’ll continue to monitor it after I set it up.

    Overall, the server continues to get modest amounts of traffic. I predict it will remain that way unless Dark Shikari resurrects the x264dev blog. The biggest spike that multimedia.cx ever saw was when Steve Jobs linked to this WebM post.

    Dropped Sites
    There are a bunch of subdomains I dropped because I hadn’t done anything with them for years and I doubt anyone will notice they’re gone. One notable section that I decided to drop is the samples.mplayerhq.hu archive. It will live on, but it will be hosted by samples.ffmpeg.org, which had a full mirror anyway. The lower-end VPS instances don’t have the 53 GB necessary.

    Going Forward
    Here’s to another 10 years of multimedia.cx, even if multimedia isn’t as exciting as it was 10 years ago (personal opinion; I’ll have another post on this later). But at least I can get working on some other projects now that this is done. For the past 4 months or so, whenever I thought of doing some other project, I remembered that this server move took priority over everything else.

  • Reverse Engineering Italian Literature

    July 1, 2014, by Multimedia Mike (Reverse Engineering)

    Some time ago, Diego “Flameeyes” Pettenò tried his hand at reverse engineering a set of really old CD-ROMs containing even older Italian literature. The goal of this RE endeavor would be to extract the useful literature along with any structural metadata (chapters, etc.) and convert it to a more open format suitable for publication at, e.g., Project Gutenberg or Archive.org.

    Unfortunately, the structure of the data thwarted the more simplistic analysis attempts (like inspecting for blocks of textual data). This will require deeper RE techniques. Further frustrating the effort, however, is the fact that the binaries that implement the reading program are written for the now-archaic Windows 3.1 operating system.

    In pursuit of this RE goal, I recently thought of a way to glean more intelligence using DOSBox.

    Prior Work
    There are 6 discs in the full set (distributed along with 6 sequential issues of a print magazine named L’Espresso). Analysis of the contents of the various discs reveals that many of the files are the same on each disc. It was straightforward to identify the set of files which are unique on each disc. This set of files all end with the extension “LZn”, where n = 1..6 depending on the disc number. Further, the root directory of each disc has a file indicating the sequence number (1..6) of the CD. Obviously, these are the interesting targets.
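
    As a rough sketch of the kind of comparison involved (the mount points below are hypothetical, and any hash would do for spotting duplicate contents across discs):

    # Sketch: hash every file on each disc's mount point and report files whose
    # contents appear on exactly one disc. Mount points are hypothetical examples.
    import hashlib
    import os
    from collections import defaultdict

    discs = ["/mnt/liz%d" % n for n in range(1, 7)]
    locations = defaultdict(list)   # content hash -> [(disc, relative path), ...]

    for disc in discs:
        for root, _, files in os.walk(disc):
            for name in files:
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    digest = hashlib.md5(f.read()).hexdigest()
                locations[digest].append((disc, os.path.relpath(path, disc)))

    for digest, hits in locations.items():
        if len(hits) == 1:   # this content is unique to a single disc
            print(hits[0])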

    The LZ file extensions stand out to an individual skilled in the art of compression– could it be a variation of the venerable LZ compression? That’s actually unlikely because LZ — also seen as LIZ — stands for Letteratura Italiana Zanichelli (Zanichelli’s Italian Literature).

    The Unix ‘file’ command was of limited utility, unable to plausibly identify any of the files.

    Progress was stalled.

    Saying Hello To An Old Frenemy
    I have been showing this screenshot to younger coworkers to see if any of them recognize it:


    DOSBox running Windows 3.1

    Not a single one has seen it before. Senior computer citizen status: Confirmed.

    I recently watched an Ancient DOS Games video about Windows 3.1 games. This episode showed Windows 3.1 running under DOSBox. I had heard this was possible but that it took a little work to get running. I had a hunch that someone else had probably already done the hard stuff so I took to the BitTorrent networks and quickly found a download that had the goods ready to go– a directory of Windows 3.1 files that just had to be dropped into a DOSBox directory and they would be ready to run.

    Aside: Running OS software procured from a BitTorrent network? Isn’t that an insane security nightmare? I’m not too worried since it effectively runs under a sandboxed virtual machine, courtesy of DOSBox. I suppose there’s the risk of trojan’d OS software infecting binaries that eventually leave the sandbox.

    Using DOSBox Like ‘strace’
    strace is a tool available on some Unix systems, including Linux, which is able to monitor the system calls that a program makes. In reverse engineering contexts, it can be useful to monitor an opaque, binary program to see the names of the files it opens and how many bytes it reads, and from which locations. I have written examples of this before (wow, almost 10 years ago to the day; now I feel old for the second time in this post).
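
    For example, logging just the file-related calls of some arbitrary Linux binary (the program name here is a placeholder) looks like:

    strace -e trace=open,read,lseek -o trace.log ./some_program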

    Here’s the pitch: Make DOSBox perform as strace in order to serve as a platform for reverse engineering Windows 3.1 applications. I formed a mental model about how DOSBox operates — abstracted file system classes with methods for opening and reading files — and then jumped into the source code. Sure enough, the code was exactly as I suspected and a few strategic print statements gave me the data I was looking for.

    Eventually, I even took to running DOSBox under the GNU Debugger (GDB). This hasn’t proven especially useful yet, but it has led to an absurd level of nesting:


    GDB runs DOSBox runs Windows 3.1

    The target application runs under Windows 3.1, which is running under DOSBox, which is running under GDB. This led to a crazy situation in which DOSBox had the mouse focus when a GDB breakpoint was triggered. At this point, DOSBox had all desktop input focus and couldn’t surrender it because it wasn’t running. I had no way to interact with the Linux desktop and had to reboot the computer. The next time, I took care to only use the keyboard to navigate the application and trigger the breakpoint and not allow DOSBox to consume the mouse focus.

    New Intelligence

    By instrumenting the local file class (virtual HD files) and the ISO file class (CD-ROM files), I was able to watch which programs and dynamic libraries are loaded and which data files the code cares about. I was able to narrow down the fact that the most interesting programs are called LEGGENDO.EXE (‘reading’) and LEGGENDA.EXE (‘legend’; this has been a great Italian lesson as well as an RE puzzle). The former calls the latter, which displays this view of the data we are trying to get at:


    LIZ: Authors index

    When first run, the program takes an interest in a file called DBBIBLIO (‘database library’, I suspect):

    === Read('LIZ98\DBBIBLIO.LZ1'): req 337 bytes; read 337 bytes from pos 0x0
    === Read('LIZ98\DBBIBLIO.LZ1'): req 337 bytes; read 337 bytes from pos 0x151
    === Read('LIZ98\DBBIBLIO.LZ1'): req 337 bytes; read 337 bytes from pos 0x2A2
    [...]
    

    While we were unable to sort out all of the data files in our cursory investigation, a few things were obvious. The structure of this file looked like it contained 336-byte records. Turns out I was off by 1: the records are actually 337 bytes each. The count of records read from disc is equal to the number of items shown in the UI.
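
    Incidentally, that off-by-one is easy to catch mechanically from the instrumented log. Here is a quick sketch, assuming the log lines look exactly like the ones printed above and were saved to a hypothetical dosbox_read.log:

    # Sketch: infer per-file record strides from the DOSBox read log shown above.
    import re
    from collections import defaultdict

    read_re = re.compile(
        r"=== Read\('([^']+)'\): req \d+ bytes; read \d+ bytes from pos 0x([0-9A-Fa-f]+)")

    positions = defaultdict(list)
    with open("dosbox_read.log") as log:            # hypothetical log file name
        for line in log:
            m = read_re.search(line)
            if m:
                positions[m.group(1)].append(int(m.group(2), 16))

    for name, offsets in positions.items():
        strides = sorted({b - a for a, b in zip(offsets, offsets[1:])})
        print(name, strides)   # DBBIBLIO.LZ1 should show [337]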

    Next, the program is interested in a few more files:

    *** isoFile(): 'DEPOSITO\BLOKCTC.LZ1', offset 0x27D6000, 2911488 bytes large
    === Read('DEPOSITO\BLOKCTC.LZ1'): req 96 bytes; read 96 bytes from pos 0x0
    *** isoFile(): 'DEPOSITO\BLOKCTX0.LZ1', offset 0x2A9D000, 17152 bytes large
    === Read('DEPOSITO\BLOKCTX0.LZ1'): req 128 bytes; read 128 bytes from pos 0x0
    === Seek('DEPOSITO\BLOKCTX0.LZ1'): seek 384 (0x180) bytes, type 0
    === Read('DEPOSITO\BLOKCTX0.LZ1'): req 256 bytes; read 256 bytes from pos 0x180
    === Seek('DEPOSITO\BLOKCTC.LZ1'): seek 1152 (0x480) bytes, type 0
    === Read('DEPOSITO\BLOKCTC.LZ1'): req 32 bytes; read 32 bytes from pos 0x480
    === Read('DEPOSITO\BLOKCTC.LZ1'): req 1504 bytes; read 1504 bytes from pos 0x4A0
    [...]
    
    

    Eventually, it becomes obvious that BLOKCTC has the juicy meat. There are 32-byte records followed by variable-length encoded text sections. Since there is no plain text to be found in these files, the text is either compressed, encrypted, or both. Some rough counting (the program seems to disable copy/paste, which thwarts more precise counting) indicates that the text size is larger than the data chunks being read from disc, so compression seems likely. Encryption isn’t out of the question (especially since the program deems it necessary to disable copying and pasting of this public domain literary data), and if it’s in use, that means the key is being read from one of these files.
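
    One cheap way to back up the compression hunch is a byte-entropy measurement of the file (the path below is just an example). Values near 8 bits per byte are consistent with compressed or encrypted data, while plainly stored text sits much lower; the check cannot distinguish compression from encryption, but it does rule out uncompressed text:

    # Sketch: Shannon entropy of the file's bytes, in bits per byte.
    import math
    from collections import Counter

    with open("DEPOSITO/BLOKCTC.LZ1", "rb") as f:   # example path
        data = f.read()

    counts = Counter(data)
    total = float(len(data))
    entropy = -sum((c / total) * math.log(c / total, 2) for c in counts.values())
    print("%.2f bits/byte" % entropy)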

    Blocked On Disassembly
    So I’m a bit blocked right now. I know exactly where the data lives, but it’s clear that I need to reverse engineer some binary code. The big problem is that I have no idea how to disassemble Windows 3.1 binaries. These are NE-type executable files. Disassemblers abound for MZ files (MS-DOS executables) and PE files (executables for Windows 95 and beyond). NE files get no respect. It’s difficult (but not impossible) to even find data about the format anymore, and details are incomplete. It should be noted, however, that the DOSBox-as-strace method described here lends insight into how Windows 3.1 processes NE-type EXEs. You can’t get any more authoritative than that.

    So far, I have tried the freeware version of IDA Pro. Unfortunately, I haven’t been able to get the program to work on my Windows machine for a long time. Even if I could, I can’t find any evidence that it actually supports NE files (the free version specifically mentions MZ and PE, but does not mention NE or LE).

    I found an old copy of Borland’s beloved Turbo Assembler and Debugger package. It has Turbo Debugger for Windows, both regular and 32-bit versions. Unfortunately, the normal version just hangs Windows 3.1 in DOSBox. The 32-bit Turbo Debugger loads just fine but can’t load the NE file.

    I’ve also wondered if DOSBox contains any advanced features for trapping program execution and disassembling. I haven’t looked too deeply into this yet.

    Future Work
    NE files seem to be the executable format that time forgot. I have a crazy brainstorm about repacking NE files as MZ executables so that they could be taken apart with an MZ disassembler. But this will take some experimenting.

    If anyone else has any ideas about ripping open these binaries, I would appreciate hearing them.

    And I guess I shouldn’t be too surprised to learn that all the literature in this corpus is already freely available and easily downloadable anyway. But you shouldn’t be too surprised if that doesn’t discourage me from trying to crack the format that’s keeping this particular copy of the data locked up.