Breaking Eggs And Making Omelettes

A blog dealing with technical multimedia matters, binary reverse engineering, and the occasional video game hacking.

http://multimedia.cx/eggs/

Articles published on the site

Adventures in Unicode

November 29, 2012, by Multimedia Mike - Programming, php, Python, sqlite3, unicode
Tangential to multimedia hacking is proper metadata handling. Recently, I have gathered an interest in processing a large corpus of multimedia files which are likely to contain metadata strings which do not fall into the lower ASCII set. This is significant because the lower ASCII set intersects perfectly with my own programming comfort zone. Indeed, all of my programming life, I have insisted on covering my ears and loudly asserting “LA LA LA LA LA! ALL TEXT EVERYWHERE IS ASCII!” I suspect I’m not alone in this.

Thus, I took this as an opportunity to conquer my longstanding fear of Unicode. I developed a self-learning course composed of a series of exercises that add up to this diagram:

Part 1: Understanding Text Encoding
Python has regular strings by default and then it has Unicode strings. The latter are prefixed by the letter ‘u’. This is what ‘ö’ looks like encoded in each type.
```
>>> 'ö', u'ö'
('\xc3\xb6', u'\xf6')
```
A large part of my frustration with Unicode comes from Python yelling at me about UnicodeDecodeErrors and an inability to handle the number 0xc3 for some reason. This usually comes when I’m trying to wrap my head around an unrelated problem and don’t care to get sidetracked by text encoding issues. However, when I studied the above output, I finally understood where the 0xc3 comes from. I just didn’t understand what the encoding represents exactly.

I can see from assorted tables that ‘ö’ is character 0xF6 in various encodings (in Unicode and Latin-1), so u'\xf6' makes sense. But what does '\xc3\xb6' mean? It’s my style to excavate straight down to the lowest levels, and I wanted to understand exactly how characters are represented in memory. The UTF-8 encoding tables inform us that any Unicode code point above 0x7F but less than 0x800 will be encoded with 2 bytes:
```
 110xxxxx 10xxxxxx
```
Applying this pattern to the \xc3\xb6 encoding:
```
            hex: 0xc3      0xb6
           bits: 11000011  10110110
 important bits: ---00011  --110110
      assembled: 00011110110
     code point: 0xf6
```
I was elated when I drew that out and made the connection. Maybe I’m the last programmer to figure this stuff out. But I’m still happy that I actually understand those Python errors pertaining to the number 0xc3 and that I won’t have to apply canned solutions without understanding the core problem.
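The bit shuffling drawn out above can be replayed in code. This is a minimal sketch in Python 3 (the REPL sessions in this post are Python 2); the function name `decode_two_byte_utf8` is made up for illustration, and it handles only the 2-byte pattern discussed here, not UTF-8 in general.

```python
def decode_two_byte_utf8(b0, b1):
    """Decode one 2-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point."""
    # The lead byte must start with bits 110, the continuation byte with bits 10.
    if b0 & 0b11100000 != 0b11000000 or b1 & 0b11000000 != 0b10000000:
        raise ValueError("not a 2-byte UTF-8 sequence")
    # Keep the low 5 bits of the lead byte and the low 6 bits of the
    # continuation byte, then glue them together.
    return ((b0 & 0b00011111) << 6) | (b1 & 0b00111111)


code_point = decode_two_byte_utf8(0xC3, 0xB6)
print(hex(code_point))  # 0xf6
print(chr(code_point))  # ö

# Cross-check against Python's own UTF-8 codec.
assert bytes([0xC3, 0xB6]).decode("utf-8") == chr(code_point)
```

The same function applied to the bytes 0xC4 0x91 gives 0x111, which is the code point of ‘đ’.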

I’m cheating on this part of this exercise just a little bit since the diagram implied that the Unicode text needs to come from a binary file. I’ll return to that in a bit. For now, I’ll just contrive the following Unicode string from the Python REPL:
```
>>> u = u'Üñìçôđé'
>>> u
u'\xdc\xf1\xec\xe7\xf4\u0111\xe9'
```
Part 2: From Python To SQLite3
The next step is to see what happens when I use Python’s SQLite3 module to dump the string into a new database. Will the Unicode encoding be preserved on disk? What will UTF-8 look like on disk anyway?
```
>>> import sqlite3
>>> conn = sqlite3.connect('unicode.db')
>>> conn.execute("CREATE TABLE t (t text)")
>>> conn.execute("INSERT INTO t VALUES (?)", (u, ))
>>> conn.commit()
>>> conn.close()
```
Next, I manually view the resulting database file (unicode.db) using a hex editor and look for strings. Here we go:
```
000007F0   02 29 C3 9C  C3 B1 C3 AC  C3 A7 C3 B4  C4 91 C3 A9
```
Look at that! It’s just like the \xc3\xb6 encoding we see in the regular Python strings.
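The hex editor step can also be checked from Python itself. The following is a sketch in Python 3, assuming an in-memory database in place of unicode.db and relying on SQLite's default UTF-8 text encoding; `CAST(t AS BLOB)` is used here only as a way to look at the stored bytes.

```python
import sqlite3

# Same string as in the post, written with escapes to avoid any ambiguity.
text = "\xdc\xf1\xec\xe7\xf4\u0111\xe9"  # 'Üñìçôđé'

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (t text)")
conn.execute("INSERT INTO t VALUES (?)", (text,))
conn.commit()

# Fetch the value both as text and as the raw bytes SQLite holds for it.
stored_text, stored_bytes = conn.execute(
    "SELECT t, CAST(t AS BLOB) FROM t"
).fetchone()
conn.close()

# Print the bytes in the same style as the hex dump above.
print(" ".join(f"{b:02X}" for b in stored_bytes))
```

The printed bytes should match the tail of the hex dump: each character above 0x7F becomes a 2-byte sequence starting with C3 or C4.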

Part 3: From SQLite3 To A Web Page Via PHP
Finally, use PHP (love it or hate it, but it’s what’s most convenient on my hosting provider) to query the string from the database and display it on a web page, completing the outlined processing pipeline.
```
<?php
$dbh = new PDO("sqlite:unicode.db");
// The table holds a single row; the empty loop just leaves that row in $row.
foreach ($dbh->query("SELECT t from t") as $row);
$unicode_string = $row['t'];
?>

<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta></head>
<body><h1><?=$unicode_string?></h1></body>
</html>
```
I tested the foregoing PHP script on 3 separate browsers that I had handy (Firefox, Internet Explorer, and Chrome):

I’d say that counts as success! It’s important to note that the “meta http-equiv” tag is absolutely necessary. Omit it and you will see something like this:

Since we know what the UTF-8 stream looks like, it’s pretty obvious how the mapping is operating here: 0xc3 and 0xc4 correspond to ‘Ã’ and ‘Ä’, respectively. This corresponds to an encoding named ISO/IEC 8859-1, a.k.a. Latin-1. Speaking of which…
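The garbling can be reproduced without a browser. This is a small Python 3 sketch of the same mix-up: UTF-8 bytes read back as if they were Latin-1. It only approximates what a browser shows, since browsers may pick a slightly different fallback encoding for the second byte of each pair.

```python
text = "\xdc\xf1\xec\xe7\xf4\u0111\xe9"  # 'Üñìçôđé'

# What the database hands to the web page: UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# What a reader sees when those bytes are interpreted as Latin-1:
# every byte becomes its own character, so 0xC3 turns into 'Ã' and 0xC4 into 'Ä'.
misread = utf8_bytes.decode("latin-1")
print(repr(misread))

# The damage is reversible as long as nothing was dropped along the way.
assert misread.encode("latin-1").decode("utf-8") == text
```

Seven characters go in and fourteen come out, one per byte of the UTF-8 stream.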

Part 4: Converting Binary Data To Unicode
At the start of the experiment, I was trying to extract metadata strings from these binary multimedia files and I noticed characters like our friend ‘ö’ from above. In the bytestream, this was represented simply with 0xf6. I mistakenly believed that this was the on-disk representation of UTF-8. Wrong. Turns out it’s Latin-1.

However, I still need to solve the problem of transforming such strings into Unicode to be shoved through the pipeline diagrammed above. For this experiment, I created a 9-byte file with the Latin-1 string ‘Üñìçôdé’ bracketed by 0 bytes, to simulate yanking a string out of a binary file. Here’s unicode.file:
```
00000000   00 DC F1 EC  E7 F4 64 E9  00         ......d..
```
(Aside: this experiment uses a plain ‘d’ since the ‘đ’ with a bar through it doesn’t occur in Latin-1; it shows up all over the place in Vietnamese, at least.)

I’ve been mashing around Python code via the REPL, trying to get this string into a Unicode-friendly format. This is a successful method but it’s probably not the best:
```
>>> import struct
>>> # 'rb': the file is binary data, not text
>>> f = open('unicode.file', 'rb').read()
>>> u = u''
>>> for c in struct.unpack("B"*7, f[1:8]):
...     u += unichr(c)
...
>>> u
u'\xdc\xf1\xec\xe7\xf4d\xe9'
>>> print u
Üñìçôdé
```
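A shorter route is to let the codec do the work: every Latin-1 byte value equals its Unicode code point, which is exactly what the loop above reproduces by hand. A sketch in Python 3, with the 9 bytes of unicode.file inlined so it runs without the file:

```python
# Same 9 bytes as unicode.file: the Latin-1 string with a 0 byte on each side.
raw = bytes([0x00, 0xDC, 0xF1, 0xEC, 0xE7, 0xF4, 0x64, 0xE9, 0x00])

# Slice out the 7 string bytes and decode them as Latin-1.
text = raw[1:8].decode("latin-1")
print(text)  # Üñìçôdé

# From here the string can be re-encoded as UTF-8 for the rest of the pipeline.
utf8_bytes = text.encode("utf-8")
print(len(raw[1:8]), len(utf8_bytes))  # 7 13
```

The UTF-8 form is longer because six of the seven characters lie above 0x7F and take 2 bytes each.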
Conclusion
Dealing with text encoding matters reminds me of dealing with integer endian-ness concerns. When you’re just dealing with one system, you probably don’t need to think too much about it because the system is usually handling everything consistently underneath the covers.

However, when the data leaves one system and will be interpreted by another system, that’s when a programmer needs to be cognizant of matters such as integer endianness or text encoding.
Adding C64 SID Music

November 1, 2012, by Multimedia Mike - General

I have been working on adding support for SID files (the music format for the Commodore 64) to the game music website for a while. I feel a bit out of my element since I’m not that familiar with the C64. But why should I let that slow me down? Allow me to go through the steps I have previously outlined in order to make this happen.

I need to know what picture should represent the system in the search results page. The foregoing picture should be fine, but I’m getting way ahead of myself.

Phase 1 is finding adequate player software. The most venerable contender in this arena is libsidplay, or so I first thought. It turns out that there’s libsidplay (originally hosted at Geocities, apparently, and no longer on the net) and also libsidplay2. Both are kind of old (libsidplay2 was last updated in 2004). I tried to compile libsidplay2 and the C++ didn’t agree with the current version of g++.

However, a recent effort named libsidplayfp is carrying on the SID emulation tradition. It works rather well, notwithstanding the fact that compiling the entire library has a habit of apparently hanging the Linux VM where I develop this stuff.

Phase 2 is to develop a testbench app around the playback library. With the help of the libsidplayfp library maintainers, I accomplished this. The testbench app consistently requires about 15% of a single core of a fairly powerful Core i7. So I look forward to recommendations that I port that playback library to pure JavaScript.

Phase 3 is to plug this into the web player. I haven’t worked on this yet. I’m confident that this will work since phase 2 worked (plus, I have a plan to combine phases 2 and 3).

One interesting issue that has arisen is that proper operation of libsidplayfp requires that 3 C64 ROM files be present (the, ahem, KERNAL, BASIC interpreter, and character generator). While these are copyrighted ROMs, they are easily obtainable on the internet. The goal of my project is to eliminate as much friction as possible for enjoying these old tunes. To that end, I will just bake the ROM files directly into the player.

Phase 4 is collecting a SID song corpus. This is the simplest part of the whole process thanks to the remarkable curation efforts of the High Voltage SID Collection (HVSC). Anyone can download a giant archive of every known SID file. So that’s a done deal.

Or is it? One small issue is that I was hoping that the first iteration of my game music website would focus on, well, game music. There is a lot of music in the HVSC that are original compositions or come from demos. The way that the archive is organized makes it difficult to automatically discern whether a particular SID file comes from a game or not.

Phase 5 is munging the metadata. The good news here is that the files have the metadata built in. The not-so-great news is that there isn’t quite as much as I might like. Each file is tagged with title, author, and publisher/copyright. If there is more than one song in a file, they all have the same metadata. Fortunately, if I can import them all into my game music database, there is an opportunity to add a lot more metadata.

Further, there is no play length metadata for these files. This means I will need to set each to a default length like 2 minutes and do something like I did before in order to automatically determine if any songs terminate sooner.

Oddly, the issue I’m most concerned about is character encoding. This is the first project for which I’m making certain that I understand character encoding since I can’t reasonably get away with assuming that everything is ASCII. So far, based on the random sampling of SID files I have checked, there is a good chance of encountering metadata strings with characters that are not in the lower ASCII set. From what I have observed, these characters map to Unicode code points. So I finally get to learn about manipulating strings in such a way that it preserves the character encoding. At the very least, I need Python to rip the strings out of the binary SID files and make sure the Unicode remains intact while being inserted into an SQLite3 database.
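As a rough sketch of that extraction step in Python 3: the three tags can be sliced out of the file header and decoded before they go anywhere near a database. The offsets below (three 32-byte, NUL-padded fields at 0x16, 0x36 and 0x56 after a PSID/RSID magic) and the Latin-1 encoding are assumptions about the header layout, not something stated in this post, and the sample header and names are invented for the demonstration.

```python
# Assumed layout: three 32-byte, NUL-padded text fields in the header.
TITLE_OFFSET, AUTHOR_OFFSET, RELEASED_OFFSET = 0x16, 0x36, 0x56
FIELD_SIZE = 32


def read_field(data, offset, encoding="latin-1"):
    """Return one fixed-size text field, cut at the first NUL byte."""
    raw = data[offset:offset + FIELD_SIZE]
    return raw.split(b"\x00", 1)[0].decode(encoding)


def read_sid_metadata(data):
    """Extract title, author and publisher/copyright from SID header bytes."""
    if data[:4] not in (b"PSID", b"RSID"):
        raise ValueError("not a SID file")
    return {
        "title": read_field(data, TITLE_OFFSET),
        "author": read_field(data, AUTHOR_OFFSET),
        "released": read_field(data, RELEASED_OFFSET),
    }


# Build a fake header so the sketch runs without a real SID file.
header = bytearray(0x76)
header[0:4] = b"PSID"
title = b"Example Tune"
author = "J\xf6rg Example".encode("latin-1")  # contains a non-ASCII byte (0xF6)
header[TITLE_OFFSET:TITLE_OFFSET + len(title)] = title
header[AUTHOR_OFFSET:AUTHOR_OFFSET + len(author)] = author

metadata = read_sid_metadata(bytes(header))
print(metadata)
```

The values in `metadata` are ordinary Unicode strings, so they can be passed as parameters to an SQLite3 INSERT without further conversion.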
Trouble with CoCCA Registry

October 7, 2012, by Multimedia Mike - General
I’ve been rather despondent all week. People who see me daily could readily identify this fact. Unfortunately, the exact reason was difficult to adequately explain. The problems that nerds deal with…

When A Domain Expires
As a few people noticed, the multimedia.cx domain and all of its subdomains didn’t work this last week. The problem started on Monday, October 1. Whose fault? Well, fundamentally, I neglected to renew the domain name in time. However, I prefer to place the blame on the .cx domain registrar, CoCCA Registry. You see, they have never developed the technology to email a domain holder with a notice that their domain is about to expire or has already expired.

This domain is the only one I have ever held so I don’t have a lot of experience in this matter. I wondered if I was crazy for thinking it would be normal for a registrar to send an email or 2 with status updates about your domain. I get the impression from speaking with others that this is indeed normal. I have 3 different email addresses listed under my account at the registrar– 2 at multimedia.cx and a backup gmail account. I checked spam folders after this incident. Then I remembered that I have never received any email notifications from them (although password reset emails show up, so that part thankfully works). Also, their support emails are black holes.

So, I guess the moral is: be wary of dealing with CoCCA Registry. However, they seem to be the only way to register domains under a wide variety of uncommon country codes.

By Friday, the domain appeared to have been reinstated, even though the status was officially listed as “renewal-pending” according to the web-based management console. Eventually, as cached DNS results started to time out throughout the day, I started seeing subdomains come back. I excitedly used the ‘dig’ command to count down the seconds until gamemusic.multimedia.cx was accessible on the network I was on (the number after the domain name is the time-to-live or ‘TTL’ value):
```
$ dig +nocmd gamemusic.multimedia.cx +noall +answer
gamemusic.multimedia.cx. 3      IN      A       174.143.152.251
$ dig +nocmd gamemusic.multimedia.cx +noall +answer
gamemusic.multimedia.cx. 2      IN      A       174.143.152.251
$ dig +nocmd gamemusic.multimedia.cx +noall +answer
gamemusic.multimedia.cx. 1      IN      A       174.143.152.251
$ dig +nocmd gamemusic.multimedia.cx +noall +answer
gamemusic.multimedia.cx. 12962  IN      A       207.45.186.114
```
Finally, today (Saturday), I received a receipt confirming that the domain has been renewed.

8 Years Old
Incidentally, happy eighth birthday to multimedia.cx. It was September, 2004 when I decided to branch out from a simple ISP-based web presence.

People often ask why I went with the .cx TLD. When I decided I wanted a proper domain name 8 years ago, I found that multimedia.X was already taken for just about every TLD value of X. .cx was a notable exception and was distinctive enough (speaking of .X, though, I see that multimedia.xxx is still up for grabs as of this writing; I imagine that would come with a whole other set of problems).

It’s funny that tech nerds often rail against outsourcing too much — email, storage, computing power, web hosting — all to some type of cloud provider under the premise that it could easily be taken away. But this episode teaches me that even having your own domain name is no guarantee of a solid online presence.

Meanwhile, I have taken proactive steps to avert this same situation from arising again:

Since the registrar sends no automated emails, I hope a Google Calendar reminder set up a month ahead of expiration will do the trick.
Death of A Micro Center

September 21, 2012, by Multimedia Mike - History

The Micro Center computer store located in Santa Clara, CA, USA closed recently:

I liked Micro Center. I have liked Micro Center ever since I first visited their Denver, CO location 10 years ago. I would sometimes drive an hour in each direction just to visit that shop. I was excited to see that they had a location in the Bay Area when I moved here a few years ago (despite the preponderance of Fry’s stores).

Now this location is gone. I wonder how much of the “we couldn’t come to favorable terms on a lease” was true (vs. an excuse to close a retail store at a time when more business is moving online, particularly in the heart of Silicon Valley). But that’s not what I wanted to discuss. I came here to discuss…

The Micro Center Window Logos

The craziest part about shopping the Santa Clara Micro Center location was the logos they displayed on the window outside. Every time I saw it, it made me sentimental for a time when some of these logos were current, or when some of these companies were still in business. Some of the logos on their front window were for companies I’ve never heard of. It reminds me of the nearby 7-11 convenience stores when I was growing up– their walls were decorated with people sporting embarrassingly 1970s styles long after the 1970s had transpired.

I thought I would record what those front window logos were and try to pinpoint when the store launched exactly (assuming the logos have been there since the initial opening and never changed).

Here we have Lotus, Hewlett Packard/HP, Corel, Fuji, Power Macintosh, NEC, and Fujitsu. Lotus was purchased by IBM in 1995 and still seems to be maintained as a separate brand. The Power Macintosh was introduced as a brand in 1994. Corel’s logo has seen a few mutations over the years but I don’t know when this one fell out of favor.

Fuji (vs. Fujitsu) appears to refer to Fujifilm, though this logo is also obsolete.

Hayes– I specifically remember reading the Slashdot post announcing that Hayes is dead (followed by many comments reminiscing about the Hayes command set). Here is the post, from early 1999.

From Googling, it doesn’t appear IBM still has a presence in the consumer computing space (though they do have something pertaining to software for consumer products). Then there’s the good old rainbow Apple logo, something that went away in 1997. I suspect 1997 was also the last hurrah of the name ‘Macintosh’ (though I remember mistakenly referring to Apple computer products as Macintoshes well into the mid-2000s and inadvertently angering some Apple enthusiasts).

As for the next segment, obviously, both Sony and Toshiba are still very much alive. Iomega was acquired by EMC in 2008 but is still maintained as a separate brand. USRobotics is still around and making — what else? — 56K modems (and their current logo is slightly different than the one seen here).

Targus seems to be a case maker (“Leading Provider of Cases, Bags and Accessories for Laptops and Tablets”). I wonder if that’s just their current business or if they had more areas long ago? It seems strange that they would get brand billing like this.

Finally, searching for information about Practical Peripherals only produces sites about how they’re long dead (like this history lesson). It’s unclear when they died.

The interior of this store was also decorated with more technology company logos near the ceiling (I didn’t really register that fact until I had visited many times). Regrettably, I now won’t be able to see how up to date those logos were.

Based on the data points above, it’s safe to conclude that the store opened in 1995 or 1996 (again, assuming the logos were placed at opening and never changed).

Epilogue

Here’s one more curious item still visible from the outside:

“See the world’s fastest PC!” Featuring an Intel Core 2 Extreme? That CPU dates back to 2007 and was succeeded by Nehalem in late 2008. So even that sign, which is presumably easier and cleaner to replace than the window logos, was absurdly out of date.
Chiptune Database and API

September 14, 2012, by Multimedia Mike - General

So I set out to create a website that allows people to easily listen to video game music directly through their web browser. I succeeded in that goal. However, I must admit that the project has limited appeal since the web player is delivered via Chrome’s Native Client technology, somewhat limiting its audience. I’m not certain if anyone really expects NaCl to take off in any serious way, but I still have a few other projects in mind.

I recently realized that, as a side effect of this project, I accidentally created something of significant value to fans of old video games and associated music– a searchable database of chiptune music and metadata. To my knowledge, no one else has endeavored to create such a thing. I figured that I might as well make the database easily accessible with an API and see where it leads.

To that end, I created 2 API entry points. First, there is the search API located at http://gamemusic.multimedia.cx/api/search/. This can be exercised by ending the URL with a URL-encoded search string, e.g.: http://gamemusic.multimedia.cx/api/search/super+mario. This returns JSON data containing an array of results in decreasing order of relevance. Each result has a game title, database ID, media URL, system type, and an SHA-1 hash. This is the same API that the site’s own search page uses.

The database ID can be plugged into http://gamemusic.multimedia.cx/api/metadata/ to retrieve the song’s metadata in JSON format. E.g., the ID for Super Mario Bros. 3 on the NES is 161: http://gamemusic.multimedia.cx/api/metadata/161.
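For anyone wanting to call the two entry points from a script, here is a minimal Python 3 sketch. The helper names (`search_url`, `metadata_url`, `fetch_json`) are made up for illustration, and since the exact JSON field names are not spelled out above, the sketch stops at building the URLs and returning the parsed JSON as-is.

```python
import json
from urllib.parse import quote_plus
from urllib.request import urlopen

API_ROOT = "http://gamemusic.multimedia.cx/api"


def search_url(query):
    """Build the search URL; quote_plus turns spaces into '+'."""
    return f"{API_ROOT}/search/{quote_plus(query)}"


def metadata_url(song_id):
    """Build the metadata URL for a numeric database ID."""
    return f"{API_ROOT}/metadata/{int(song_id)}"


def fetch_json(url, timeout=10):
    """Fetch a URL and parse the body as JSON (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(search_url("super mario"))
print(metadata_url(161))
```

A caller would then pass either URL to `fetch_json` and walk the resulting list or dictionary, for example feeding a database ID from a search result into `metadata_url`.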

I recently read an article about sins against true RESTful API principles which led me to believe I’m almost certainly doing this web API stuff wrong. I don’t think it’s a huge deal, though, since I don’t think anyone actually listens to chiptunes any more. But if there are offline chiptune music players that are still in service and actively maintained, perhaps the authors would like to implement this API. It would require some type of HTTP networking library, a JSON parser, the embedded XZ decoder, and some new code to parse through my .gamemusic and .psfarchive formats.

This database could be a significant value-add to chiptune playback software, and could help people experience classic game music much more easily.