Advanced search

Media (0)

Word: - Tags -/xmlrpc

No media matching your criteria is available on this site.

Other articles (27)

  • Automatic backup of SPIP channels

    1 April 2010, by

    As part of setting up an open platform, it is important for hosts to have fairly regular backups on hand to guard against any potential problem.
    To carry out this task, two SPIP plugins are used: Saveauto, which performs a regular backup of the database as a MySQL dump (usable in phpMyAdmin), and mes_fichiers_2, which builds a zip archive of the site’s important data (the documents, the elements (...)

  • Automatic installation script for MediaSPIP

    25 April 2011, by

    To work around installation difficulties caused mainly by server-side software dependencies, an "all-in-one" installation script written in bash was created to make this step easier on a server running a compatible Linux distribution.
    You must have SSH access to your server and a "root" account in order to use it, which will allow the dependencies to be installed. Contact your hosting provider if you do not have these.
    The documentation for using the installation script (...)

  • Automated installation script of MediaSPIP

    25 April 2011, by

    To overcome difficulties caused mainly by server-side software dependencies, an "all-in-one" installation script written in bash was created to facilitate this step on a server running a compatible Linux distribution.
    You must have SSH access to your server and a root account to use it, which will allow the dependencies to be installed. Contact your provider if you do not have these.
    The documentation for using this installation script is available here.
    The code of this (...)

On other sites (4964)

  • Processing Big Data Problems

    8 January 2011, by Multimedia Mike — Big Data

    I’m becoming more interested in big data problems, i.e., extracting useful information out of absurdly sized sets of input data. I know it’s a growing field and there is a lot to read on the subject. But you know how I roll— just think of a problem to solve and dive right in.

    Here’s how my adventure unfolded.

    The Corpus
    I need to run a command line program on a set of files I have collected. This corpus is on the order of 350,000 files. The files range from 7 bytes to 175 MB. Combined, they occupy around 164 GB of storage space.

    Oh, and said storage space resides on an external, USB 2.0-connected hard drive. Stop laughing.

    A file is named according to the SHA-1 hash of its data. The files are organized in a directory hierarchy according to the first 6 hex digits of the SHA-1 hash (e.g., a file named a4d5832f... is stored in a4/d5/83/a4d5832f...). All of this file hash, path, and size information is stored in an SQLite database.
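
    A minimal Python sketch of that naming scheme (the helper names and the corpus root are mine, purely for illustration, not from the original scripts):

    import hashlib
    import os

    def hash_file(path):
        """Compute the SHA-1 hex digest of a file's contents."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def corpus_path(root, digest):
        """Map a digest to its place in the hierarchy described above:
        a4d5832f... lives at <root>/a4/d5/83/a4d5832f..."""
        return os.path.join(root, digest[0:2], digest[2:4], digest[4:6], digest)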

    First Pass
    I wrote a Python script that read all the filenames from the database, fed them into a pool of worker processes using Python’s multiprocessing module, and wrote some resulting data for each file back to the SQLite database. My Eee PC has a single-core, hyperthreaded Atom which presents 2 CPUs to the system. Thus, 2 worker threads crunched the corpus. It took a while. It took somewhere on the order of 9 or 10 or maybe even 12 hours. It took long enough that I’m in no hurry to re-run the test and get more precise numbers.
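
    The original script is not reproduced in the post; a rough sketch of the idea, with a hypothetical schema and a placeholder external command, might look like this:

    import sqlite3
    import subprocess
    from multiprocessing import Pool

    DB = "corpus.sqlite"    # hypothetical database file
    TOOL = "./analyze"      # stand-in for the real command line program

    def process_file(path):
        """Run the external tool on one file and capture its output."""
        result = subprocess.run([TOOL, path], capture_output=True, text=True)
        return path, result.stdout

    def main():
        db = sqlite3.connect(DB)
        paths = [row[0] for row in db.execute("SELECT path FROM files")]
        with Pool(processes=2) as pool:   # 2 workers for the Atom's 2 CPUs
            for path, output in pool.imap_unordered(process_file, paths):
                db.execute("UPDATE files SET result = ? WHERE path = ?",
                           (output, path))
        db.commit()
        db.close()

    if __name__ == "__main__":
        main()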

    At least I extracted my initial set of data from the corpus. Or did I?

    Think About The Future

    A few days later, I went back to revisit the data only to notice that the SQLite database was corrupted. To add insult to that bit of injury, the script I had written to process the data was also completely corrupted (overwritten with something unrelated to Python code). BTW, this was on a RAID brick configured for redundancy. So that’s strike 3 in my personal dealings with RAID technology.

    I moved the corpus to a different external drive and also verified the files after writing (easy to do since I already had the SHA-1 hashes on record).

    The corrupted script was pretty simple to rewrite, even a little better than before. Then I got to re-run it. However, this run was on a faster machine, a hyperthreaded, quad-core beast that exposes 8 CPUs to the system. The reason I wasn’t too concerned about the poor performance with my Eee PC is that I knew I was going to be able to run it on this monster later.

    So I let the rewritten script rip. The script gave me little updates regarding its progress. As it did so, I ran some rough calculations and realized that it wasn’t predicted to finish much sooner than it would have if I were running it on the Eee PC.

    Limiting Factors
    It had been suggested to me that I/O bandwidth of the external USB drive might be a limiting factor. This is when I started to take that idea very seriously.

    The first idea I had was to move the SQLite database to a different drive. The script records data to the database for every file processed, though it only commits once every 100 UPDATEs, so at least it’s not constantly syncing the disc. I ran before and after tests with a small subset of the corpus and noticed a substantial speedup thanks to this policy change.
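
    In sketch form, the batching policy amounts to committing only every 100th UPDATE rather than after each one (continuing the hypothetical schema from the earlier sketch):

    BATCH = 100

    def record_results(db, results):
        """Write per-file results, but only commit every BATCH updates
        so SQLite is not syncing the disk after every single file."""
        pending = 0
        for path, output in results:
            db.execute("UPDATE files SET result = ? WHERE path = ?",
                       (output, path))
            pending += 1
            if pending % BATCH == 0:
                db.commit()
        db.commit()   # flush whatever is left at the end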

    Then I remembered hearing something about "atime", which is access time. Linux filesystems, by default, record the time that a file was last accessed. You can watch this in action by running 'stat <file> ; cat <file> > /dev/null ; stat <file>' and observe that the "Access" field has been updated to NOW(). This also means that every single file that gets read from the external drive still causes an additional write. To avoid this, I started mounting the external drive with '-o noatime', which instructs Linux not to record "last accessed" time for files.
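
    The same effect can be observed from Python with os.stat; this is just an illustrative check (the file name is a placeholder), assuming the filesystem is not already mounted with noatime:

    import os, time

    def show_atime(path):
        print(path, "atime:", time.ctime(os.stat(path).st_atime))

    show_atime("somefile")              # placeholder file name
    with open("somefile", "rb") as f:   # reading the file...
        f.read()
    show_atime("somefile")              # ...bumps the access time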

    On the limited subset test, this more than doubled script performance. I then wondered about mounting the external drive as read-only, which had the same performance as noatime. I thought about using both options together, but verified that access times are not updated on a read-only filesystem anyway.

    A Note On Profiling
    Once you start accessing files in Linux, those files start getting cached in RAM. Thus, if you profile, say, reading a gigabyte file from a disk and get 31 MB/sec, and then repeat the same test, you’re likely to see the test complete instantaneously. That’s because the file is already sitting in memory, cached. This is useful in general application use, but not if you’re trying to profile disk performance.

    Thus, in between runs, do (as root) 'sync; echo 3 > /proc/sys/vm/drop_caches' in order to wipe caches (explained here).
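
    A small helper that wraps the same commands, for scripting repeated measurements (it has to run as root, exactly like the shell one-liner):

    import subprocess

    def drop_caches():
        """Flush dirty pages, then drop the page cache between profiling runs."""
        subprocess.run(["sync"], check=True)
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")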

    Even Better?
    I re-ran the test using these little improvements. Now it takes somewhere around 5 or 6 hours to run.

    I contrived an artificially large file on the external drive and did some 'dd' tests to measure what the drive could really do. The drive consistently measured a bit over 31 MB/sec. If I could read and process the data at 30 MB/sec, the script would be done in about 95 minutes.
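
    The same sort of dd-style sequential read can be timed from Python; a rough sketch (the test file path is a placeholder, and the cache must be dropped first or the number is meaningless):

    import time

    def measure_read(path, block=1 << 20):
        """Read a file sequentially in 1 MB blocks and return MB/sec."""
        total = 0
        start = time.time()
        with open(path, "rb") as f:
            while True:
                chunk = f.read(block)
                if not chunk:
                    break
                total += len(chunk)
        return total / (1 << 20) / (time.time() - start)

    # 164 GB at 30 MB/sec: 164 * 1024 / 30 / 60 is roughly 93 minutes,
    # consistent with the estimate above
    print("%.1f MB/sec" % measure_read("/mnt/corpus/bigfile"))   # placeholder path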

    But it’s probably rather unreasonable to expect that kind of transfer rate for lots of smaller files scattered around a filesystem. However, it can’t be that helpful to have 8 different processes constantly asking the HD for 8 different files at any one time.

    So I wrote a script called stream-corpus.py which simply fetched all the filenames from the database and loaded the contents of each in turn, leaving the data to be garbage-collected at Python’s leisure. This test completed in 174 minutes, just shy of 3 hours. I computed an average read speed of around 17 MB/sec.
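
    stream-corpus.py itself is not shown in the post; the gist of it is small enough to sketch here (the table and column names are guesses, and corpus_path is the helper sketched earlier):

    import sqlite3

    def stream_corpus(db_path, root):
        """Read every file in the corpus sequentially with a single reader,
        immediately discarding the contents; the point is only to measure
        throughput and to pull the data through the page cache."""
        db = sqlite3.connect(db_path)
        for (digest,) in db.execute("SELECT hash FROM files ORDER BY hash"):
            with open(corpus_path(root, digest), "rb") as f:
                f.read()   # data is dropped right away

    stream_corpus("corpus.sqlite", "/mnt/corpus")   # placeholder locations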

    Single-Reader Script
    I began to theorize that if I only have one thread reading, performance should improve greatly. To test this hypothesis without having to do a lot of extra work, I cleared the caches and ran stream-corpus.py until 'top' reported that about half of the real memory had been filled with data. Then I let the main processing script loose on the data. As both scripts were using sorted lists of files, they iterated over the filenames in the same order.

    Result: The processing script tore through the files that had obviously been cached thanks to stream-corpus.py, degrading drastically once it had caught up to the streaming script.

    Thus, I was prompted to reorganize the processing script just slightly. Now, there is a reader thread which reads each file and stuffs the name of the file into an IPC queue that one of the worker threads can pick up and process. Note that no file data is exchanged between threads. No need— the operating system is already implicitly holding onto the file data, waiting in case someone asks for it again before something needs that bit of RAM. Technically, this approach accesses each file multiple times. But it makes little practical difference thanks to caching.

    Result: About 183 minutes to process the complete corpus (which works out to a little over 16 MB/sec).
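
    A condensed sketch of that reader-plus-workers arrangement follows; the queue handling is simplified and process_file stands in for the real per-file work:

    from multiprocessing import Process, Queue

    NUM_WORKERS = 8
    SENTINEL = None

    def reader(paths, queue):
        """Read each file to pull it into the OS page cache, then hand
        only the file name to a worker through the IPC queue."""
        for path in paths:
            with open(path, "rb") as f:
                f.read()
            queue.put(path)
        for _ in range(NUM_WORKERS):
            queue.put(SENTINEL)   # tell each worker to stop

    def worker(queue):
        while True:
            path = queue.get()
            if path is SENTINEL:
                break
            process_file(path)    # the real work; the read is now cached

    def run(paths):
        queue = Queue()
        workers = [Process(target=worker, args=(queue,)) for _ in range(NUM_WORKERS)]
        for w in workers:
            w.start()
        reader(paths, queue)
        for w in workers:
            w.join()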

    Why Multiprocess
    Is it even worthwhile to bother multithreading this operation? Monitoring the whole operation via 'top', most instances of the processing script are barely using any CPU time. Indeed, it’s likely that only one of the worker threads is doing any work most of the time, pulling a file out of the IPC queue as soon as the reader thread triggers its load into cache. Right now, the processing is usually pretty quick. There are cases where the processing (external program) might hang (one of the reasons I’m running this project is to find those cases); the multiprocessing architecture at least allows other processes to take over until a hanging process is timed out and killed by its monitoring process.

    Further, the processing is pretty simple now but is likely to get more intensive in future iterations. Plus, there’s the possibility that I might move everything onto a more appropriately-connected storage medium which should help alleviate the bottleneck bravely battled in this post.

    There’s also the theoretical possibility that the reader thread could read too far ahead of the processing threads. Obviously, that’s not too much of an issue in the current setup. But to guard against it, the processes could share a variable that tracks the total number of bytes that have been processed. The reader thread adds filesizes to the count while the processing threads subtract file sizes. The reader thread would delay reading more if the number got above a certain threshold.
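
    That safeguard could be bolted onto the sketch above with a shared counter, roughly like this (the 1 GB ceiling is an arbitrary choice of mine):

    import os
    import time
    from multiprocessing import Value

    MAX_AHEAD = 1 << 30   # arbitrary 1 GB cap on cached-but-unprocessed bytes

    def reader_throttled(paths, queue, in_flight):
        for path in paths:
            while in_flight.value > MAX_AHEAD:
                time.sleep(0.1)               # let the workers catch up
            size = os.path.getsize(path)
            with open(path, "rb") as f:
                f.read()
            with in_flight.get_lock():
                in_flight.value += size
            queue.put((path, size))

    def worker_throttled(queue, in_flight):
        while True:
            item = queue.get()
            if item is None:
                break
            path, size = item
            process_file(path)                # placeholder per-file work
            with in_flight.get_lock():
                in_flight.value -= size

    # in_flight = Value('q', 0) is created once and passed to both sides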

    Leftovers
    I wondered if the order of accessing the files mattered. I didn’t write them to the drive in any special order. The drive is formatted with Linux ext3. I ran stream-corpus.py on all the filenames sorted by filename (remember the SHA-1 naming convention described above) and also by sorting them randomly.

    Result: It helps immensely for the filenames to be sorted. The sorted variant was a little more than twice as fast as the random variant. Maybe it has to do with accessing all the files in a single directory before moving on to another directory.

    Further, I have long been under the impression that the best read speed you can expect from USB 2.0 is 27 Mbytes/sec (even though 480 Mbit/sec is bandied about in relation to the spec). This comes from profiling I performed with an external enclosure that supports both USB 2.0 and FireWire-400 (and eSATA). FW-400 was able to read at nearly 40 Mbytes/sec a file that USB 2.0 could only read at 27 Mbytes/sec. Other sources I have read corroborate this number. But this test (using different hardware) achieved over 31 Mbytes/sec.

  • Reverse Engineering Italian Literature

    1 July 2014, by Multimedia Mike — Reverse Engineering

    Some time ago, Diego “Flameeyes” Pettenò tried his hand at reverse engineering a set of really old CD-ROMs containing even older Italian literature. The goal of this RE endeavor would be to extract the useful literature along with any structural metadata (chapters, etc.) and convert it to a more open format suitable for publication at, e.g., Project Gutenberg or Archive.org.

    Unfortunately, the structure of the data thwarted the more simplistic analysis attempts (like inspecting for blocks of textual data). This will require deeper RE techniques. Further frustrating the effort, however, is the fact that the binaries that implement the reading program are written for the now-archaic Windows 3.1 operating system.

    In pursuit of this RE goal, I recently thought of a way to glean more intelligence using DOSBox.

    Prior Work
    There are 6 discs in the full set (distributed along with 6 sequential issues of a print magazine named L’Espresso). Analysis of the contents of the various discs reveals that many of the files are the same on each disc. It was straightforward to identify the set of files which are unique to each disc. These files all end with the extension “LZn”, where n = 1..6 depending on the disc number. Further, the root directory of each disc has a file indicating the sequence number (1..6) of the CD. Obviously, these are the interesting targets.

    The LZ file extensions stand out to an individual skilled in the art of compression– could it be a variation of the venerable LZ compression? That’s actually unlikely because LZ — also seen as LIZ — stands for Letteratura Italiana Zanichelli (Zanichelli’s Italian Literature).

    The Unix ‘file’ command was of limited utility, unable to plausibly identify any of the files.

    Progress was stalled.

    Saying Hello To An Old Frenemy
    I have been showing this screenshot to younger coworkers to see if any of them recognize it:


    DOSBox running Windows 3.1

    Not a single one has seen it before. Senior computer citizen status: Confirmed.

    I recently watched an Ancient DOS Games video about Windows 3.1 games. This episode showed Windows 3.1 running under DOSBox. I had heard this was possible but that it took a little work to get running. I had a hunch that someone else had probably already done the hard stuff so I took to the BitTorrent networks and quickly found a download that had the goods ready to go– a directory of Windows 3.1 files that just had to be dropped into a DOSBox directory and they would be ready to run.

    Aside: Running OS software procured from a BitTorrent network? Isn’t that an insane security nightmare? I’m not too worried since it effectively runs under a sandboxed virtual machine, courtesy of DOSBox. I suppose there’s the risk of trojan’d OS software infecting binaries that eventually leave the sandbox.

    Using DOSBox Like ‘strace’
    strace is a tool available on some Unix systems, including Linux, which is able to monitor the system calls that a program makes. In reverse engineering contexts, it can be useful to monitor an opaque, binary program to see the names of the files it opens and how many bytes it reads, and from which locations. I have written examples of this before (wow, almost 10 years ago to the day; now I feel old for the second time in this post).

    Here’s the pitch: Make DOSBox behave like strace in order to serve as a platform for reverse engineering Windows 3.1 applications. I formed a mental model of how DOSBox operates — abstracted file system classes with methods for opening and reading files — and then jumped into the source code. Sure enough, the code was exactly as I suspected and a few strategic print statements gave me the data I was looking for.

    Eventually, I even took to running DOSBox under the GNU Debugger (GDB). This hasn’t proven especially useful yet, but it has led to an absurd level of nesting:


    GDB runs DOSBox runs Windows 3.1

    The target application runs under Windows 3.1, which is running under DOSBox, which is running under GDB. This led to a crazy situation in which DOSBox had the mouse focus when a GDB breakpoint was triggered. At this point, DOSBox had all desktop input focus and couldn’t surrender it because it wasn’t running. I had no way to interact with the Linux desktop and had to reboot the computer. The next time, I took care to only use the keyboard to navigate the application and trigger the breakpoint and not allow DOSBox to consume the mouse focus.

    New Intelligence

    By instrumenting the local file class (virtual HD files) and the ISO file class (CD-ROM files), I was able to watch which programs and dynamic libraries are loaded and which data files the code cares about. I was able to narrow down the fact that the most interesting programs are called LEGGENDO.EXE (‘reading’) and LEGGENDA.EXE (‘legend’; this has been a great Italian lesson as well as an RE puzzle). The former calls the latter, which displays this view of the data we are trying to get at:


    LIZ: Authors index

    When first run, the program takes an interest in a file called DBBIBLIO (‘database library’, I suspect):

    === Read(’LIZ98\DBBIBLIO.LZ1’) : req 337 bytes ; read 337 bytes from pos 0x0
    === Read(’LIZ98\DBBIBLIO.LZ1’) : req 337 bytes ; read 337 bytes from pos 0x151
    === Read(’LIZ98\DBBIBLIO.LZ1’) : req 337 bytes ; read 337 bytes from pos 0x2A2
    [...]
    

    While we were unable to sort out all of the data files in our cursory investigation, a few things were obvious. The structure of this file looked to contain 336-byte records. Turns out I was off by 1– the records are actually 337 bytes each. The count of records read from disc is equal to the number of items shown in the UI.
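
    A quick way to poke at a copy of such a file is to carve it into fixed-size records; a small sketch (the path mirrors the log output above, and nothing beyond the record length is assumed about the layout):

    RECORD_SIZE = 337

    def read_records(path):
        """Split the file into its fixed-size 337-byte records."""
        records = []
        with open(path, "rb") as f:
            while True:
                rec = f.read(RECORD_SIZE)
                if len(rec) < RECORD_SIZE:
                    break
                records.append(rec)
        return records

    recs = read_records("LIZ98/DBBIBLIO.LZ1")
    print(len(recs), "records")   # should match the item count shown in the UI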

    Next, the program is interested in a few more files:

    *** isoFile() : ’DEPOSITO\BLOKCTC.LZ1’, offset 0x27D6000, 2911488 bytes large
    === Read(’DEPOSITO\BLOKCTC.LZ1’) : req 96 bytes ; read 96 bytes from pos 0x0
    *** isoFile() : ’DEPOSITO\BLOKCTX0.LZ1’, offset 0x2A9D000, 17152 bytes large
    === Read(’DEPOSITO\BLOKCTX0.LZ1’) : req 128 bytes ; read 128 bytes from pos 0x0
    === Seek(’DEPOSITO\BLOKCTX0.LZ1’) : seek 384 (0x180) bytes, type 0
    === Read(’DEPOSITO\BLOKCTX0.LZ1’) : req 256 bytes ; read 256 bytes from pos 0x180
    === Seek(’DEPOSITO\BLOKCTC.LZ1’) : seek 1152 (0x480) bytes, type 0
    === Read(’DEPOSITO\BLOKCTC.LZ1’) : req 32 bytes ; read 32 bytes from pos 0x480
    === Read(’DEPOSITO\BLOKCTC.LZ1’) : req 1504 bytes ; read 1504 bytes from pos 0x4A0
    [...]

    Eventually, it becomes obvious that BLOKCTC has the juicy meat. There are 32-byte records followed by variable-length encoded text sections. Since there is no plain text to be found in these files, the text is either compressed, encrypted, or both. Some rough counting (the program seems to disable copy/paste, which thwarts more precise counting) indicates that the text size is larger than the data chunks being read from disc, so compression seems likely. Encryption isn’t out of the question (especially since the program deems it necessary to disable copy and pasting of this public domain literary data), and if it’s in use, that means the key is being read from one of these files.

    Blocked On Disassembly
    So I’m a bit blocked right now. I know exactly where the data lives, but it’s clear that I need to reverse engineer some binary code. The big problem is that I have no idea how to disassemble Windows 3.1 binaries. These are NE-type executable files. Disassemblers abound for MZ files (MS-DOS executables) and PE files (executables for Windows 95 and beyond). NE files get no respect. It’s difficult (but not impossible) to even find data about the format anymore, and details are incomplete. It should be noted, however, that the DOSBox-as-strace method described here lends insight into how Windows 3.1 processes NE-type EXEs. You can’t get any more authoritative than that.

    So far, I have tried the freeware version of IDA Pro. Unfortunately, I haven’t been able to get the program to work on my Windows machine for a long time. Even if I could, I can’t find any evidence that it actually supports NE files (the free version specifically mentions MZ and PE, but does not mention NE or LE).

    I found an old copy of Borland’s beloved Turbo Assembler and Debugger package. It has Turbo Debugger for Windows, both regular and 32-bit versions. Unfortunately, the normal version just hangs Windows 3.1 in DOSBox. The 32-bit Turbo Debugger loads just fine but can’t load the NE file.

    I’ve also wondered if DOSBox contains any advanced features for trapping program execution and disassembling. I haven’t looked too deeply into this yet.

    Future Work
    NE files seem to be the executable format that time forgot. I have a crazy brainstorm about repacking NE files as MZ executables so that they could be taken apart with an MZ disassembler. But this will take some experimenting.

    If anyone else has any ideas about ripping open these binaries, I would appreciate hearing them.

    And I guess I shouldn’t be too surprised to learn that all the literature in this corpus is already freely available and easily downloadable anyway. But you shouldn’t be too surprised if that doesn’t discourage me from trying to crack the format that’s keeping this particular copy of the data locked up.

  • ffmpeg - Dynamic letters and random position watermark to video?

    2 April 2016, by sekmo

    I am making an online course, and to discourage pirated distribution I thought of putting watermarks on the videos (including personal user information) so they cannot be uploaded to sharing websites. Now the hard part: I would like to move the watermark around during the video, to 3 or 4 random positions, every 30 seconds.
    Is this possible with ffmpeg?