Recherche avancée

Médias (0)

Mot : - Tags -/serveur

Aucun média correspondant à vos critères n’est disponible sur le site.

Autres articles (46)

  • XMP PHP

    13 mai 2011, par

    Dixit Wikipedia, XMP signifie :
    Extensible Metadata Platform ou XMP est un format de métadonnées basé sur XML utilisé dans les applications PDF, de photographie et de graphisme. Il a été lancé par Adobe Systems en avril 2001 en étant intégré à la version 5.0 d’Adobe Acrobat.
    Étant basé sur XML, il gère un ensemble de tags dynamiques pour l’utilisation dans le cadre du Web sémantique.
    XMP permet d’enregistrer sous forme d’un document XML des informations relatives à un fichier : titre, auteur, historique (...)

  • Publier sur MédiaSpip

    13 juin 2013

    Puis-je poster des contenus à partir d’une tablette Ipad ?
    Oui, si votre Médiaspip installé est à la version 0.2 ou supérieure. Contacter au besoin l’administrateur de votre MédiaSpip pour le savoir

  • ANNEXE : Les plugins utilisés spécifiquement pour la ferme

    5 mars 2010, par

    Le site central/maître de la ferme a besoin d’utiliser plusieurs plugins supplémentaires vis à vis des canaux pour son bon fonctionnement. le plugin Gestion de la mutualisation ; le plugin inscription3 pour gérer les inscriptions et les demandes de création d’instance de mutualisation dès l’inscription des utilisateurs ; le plugin verifier qui fournit une API de vérification des champs (utilisé par inscription3) ; le plugin champs extras v2 nécessité par inscription3 (...)

Sur d’autres sites (7105)

  • Ffmpeg set output format C++

    7 septembre 2022, par Turgut

    I made a program that encodes a video and I want to specify the format as h264 but I can't figure out how to do it. It automatically sets the format to mpeg4 and I can't change it. I got my code from ffmpegs official examples muxing.c and slightly edited it to fit my code (I haven't changed much especially did not touch the parts where it sets the format)

    


    Here is my code so for (I have trimmed down the code slightly, removing redundant parts)

    


    video_encoder.cpp :

    


    
video_encoder::video_encoder(int w, int h, float fps, unsigned int duration) 
 :width(w), height(h), STREAM_FRAME_RATE(fps), STREAM_DURATION(duration)
{
    std::string as_str = "./output/video.mp4";

    char* filename = const_cast(as_str.c_str());
    enc_inf.video_st, enc_inf.audio_st = (struct OutputStream) { 0 };
    enc_inf.video_st.next_pts = 1; 
    enc_inf.audio_st.next_pts = 1;
    enc_inf.encode_audio, enc_inf.encode_video = 0;
    int ret;
    int i;

    /* allocate the output media context */
    avformat_alloc_output_context2(&enc_inf.oc, NULL, NULL, filename);

    if (!enc_inf.oc) {
        std::cout << "FAILED" << std::endl;
        avformat_alloc_output_context2(&enc_inf.oc, NULL, "mpeg", filename);
    }

    enc_inf.fmt = enc_inf.oc->oformat;

    /* Add the audio and video streams using the default format codecs
     * and initialize the codecs. */
    if (enc_inf.fmt->video_codec != AV_CODEC_ID_NONE) {
        add_stream(&enc_inf.video_st, enc_inf.oc, &video_codec, enc_inf.fmt->video_codec);
        enc_inf.have_video = 1;
        enc_inf.encode_video = 1;
    }
    if (enc_inf.fmt->audio_codec != AV_CODEC_ID_NONE) {
        add_stream(&enc_inf.audio_st, enc_inf.oc, &audio_codec, enc_inf.fmt->audio_codec);
        enc_inf.have_audio = 1;
        enc_inf.encode_audio = 1;
    }

    /* Now that all the parameters are set, we can open the audio and
     * video codecs and allocate the necessary encode buffers. */
    if (enc_inf.have_video)
        open_video(enc_inf.oc, video_codec, &enc_inf.video_st, opt);

    if (enc_inf.have_audio)
        open_audio(enc_inf.oc, audio_codec, &enc_inf.audio_st, opt);
    av_dump_format(enc_inf.oc, 0, filename, 1);

    /* open the output file, if needed */
    if (!(enc_inf.fmt->flags & AVFMT_NOFILE)) {
        ret = avio_open(&enc_inf.oc->pb, filename, AVIO_FLAG_WRITE);
        if (ret < 0) {
            //VI_ERROR("Could not open '%s': %s\n", filename, ret);
            //return 1;
        }
    }

    /* Write the stream header, if any. */
    ret = avformat_write_header(enc_inf.oc, &opt);
    if (ret < 0) {
        VI_ERROR("Error occurred when opening output file:");
        //return 1;
    }
    
    //return 0;
}


/* Add an output stream. */
void video_encoder::add_stream(OutputStream *ost, AVFormatContext *oc,
                       const AVCodec **codec,
                       enum AVCodecID codec_id)
{
    AVCodecContext *c;
    int i;

    /* find the encoder */
    *codec = avcodec_find_encoder(codec_id);
    
    if (!(*codec)) {
        fprintf(stderr, "Could not find encoder for '%s'\n",
                avcodec_get_name(codec_id));
        exit(1);
    }

    ost->tmp_pkt = av_packet_alloc();

    if (!ost->tmp_pkt) {
        fprintf(stderr, "Could not allocate AVPacket\n");
        exit(1);
    }

    ost->st = avformat_new_stream(oc, NULL);
    if (!ost->st) {
        fprintf(stderr, "Could not allocate stream\n");
        exit(1);
    }
    ost->st->id = oc->nb_streams-1;
    c = avcodec_alloc_context3(*codec);
    if (!c) {
        fprintf(stderr, "Could not alloc an encoding context\n");
        exit(1);
    }
    ost->enc = c;


    switch ((*codec)->type) {
    case AVMEDIA_TYPE_AUDIO:
        ...
        break;
    case AVMEDIA_TYPE_VIDEO:
        c->codec_id = codec_id;

        c->bit_rate = 10000;
        /* Resolution must be a multiple of two. */
        c->width    = width;
        c->height   = height;
        /* timebase: This is the fundamental unit of time (in seconds) in terms
         * of which frame timestamps are represented. For fixed-fps content,
         * timebase should be 1/framerate and timestamp increments should be
         * identical to 1. */
        ost->st->time_base = (AVRational){ 1, STREAM_FRAME_RATE }; // *frame_rate
        c->time_base       = ost->st->time_base;

        c->gop_size      = 7; /* emit one intra frame every twelve frames at most */
        //c->codec_id      = AV_CODEC_ID_H264;
        c->pix_fmt       = STREAM_PIX_FMT;
        //if (c->codec_id == AV_CODEC_ID_MPEG2VIDEO) 
        //    c->max_b_frames = 2;
        if (c->codec_id == AV_CODEC_ID_MPEG1VIDEO) {
            /* Needed to avoid using macroblocks in which some coeffs overflow.
             * This does not happen with normal video, it just happens here as
             * the motion of the chroma plane does not match the luma plane. */
            c->mb_decision = 2;
        }

        if ((*codec)->pix_fmts){
            //c->pix_fmt = (*codec)->pix_fmts[0];
            std::cout << "NEW FORMAT : " << c->pix_fmt << std::endl;
        }

        break;
    }
     

    /* Some formats want stream headers to be separate. */
    if (oc->oformat->flags & AVFMT_GLOBALHEADER)
        c->flags |= AV_CODEC_FLAG_GLOBAL_HEADER;
}


    


    video_encoder.h

    


    
typedef struct OutputStream {
    AVStream *st;
    AVCodecContext *enc;

    /* pts of the next frame that will be generated */
    int64_t next_pts;
    int samples_count;

    AVFrame *frame;
    AVFrame *tmp_frame;

    AVPacket *tmp_pkt;

    float t, tincr, tincr2;

    struct SwsContext *sws_ctx;
    struct SwrContext *swr_ctx;
} OutputStream;

class video_encoder{
    private:
        typedef struct {
            OutputStream video_st, audio_st;
            const AVOutputFormat *fmt;
            AVFormatContext *oc;
            int have_video, have_audio, encode_video, encode_audio;
            std::string name;
        } encode_info;
    public:
        encode_info enc_inf;
        video_encoder(int w, int h, float fps, unsigned int duration);
        ~video_encoder();  
        ...
    private:
        ...
        void add_stream(OutputStream *ost, AVFormatContext *oc,
                       const AVCodec **codec,
                       enum AVCodecID codec_id);


    


    I'm thinking that the example sets the codec at avformat_alloc_output_context2(&enc_inf.oc, NULL, NULL, filename) but I'm not quite sure how to set it to h264.

    


    I've tried something like this avformat_alloc_output_context2(&enc_inf.oc, enc_inf.fmt, "h264", filename)

    


    But it just gives a seg fault. What am I supposed to do ?

    


    Edit : I've tried adding these two lines to video_encoder::video_encoder by deleting avformat_alloc_output_context2(&enc_inf.oc, NULL, NULL, filename); :

    


    
    video_codec = avcodec_find_encoder(AV_CODEC_ID_H264);
    enc_inf.video_st.enc = avcodec_alloc_context3(video_codec);



    


    But it resulted in these errors :
It says this every frame (A bunch of times)

    


    [mpeg @ 0x56057c465480] buffer underflow st=0 bufi=26822 size=31816


    


    Says this once when the frame encoding loop is over :

    


    [mpeg @ 0x5565ac4a04c0] start time for stream 0 is not set in estimate_timings_from_pts
[mpeg @ 0x5565ac4a04c0] stream 0 : no TS found at start of file, duration not set
[mpeg @ 0x5565ac4a04c0] Could not find codec parameters for stream 0 (Video: mpeg2video, none): unspecified size
Consider increasing the value for the 'analyzeduration' (0) and 'probesize' (5000000) options


    


  • Improve ffmpeg x11grab screen capture performance

    10 janvier 2020, par Toby Eggitt

    I have been doing screen-only (no sound) capture using ffmpeg with libx264 for the encoding quite successfully on an old machine built around a Core2 Quad Q6600 processor. I now need to include audio in this, but the fans on this ancient machine are too loud. So, I found a fanless motherboard (https://www.asrock.com/mb/Intel/J5005-ITX/index.asp) that has an Intel Pentium Silver J5005 processor and decided to use this instead. The CPU’s benchmarks put it in a similar bracket to the Q6600, and the general performance seems to be significantly better, presumably at least in part because it’s now using DDR4 memory that’s faster access.

    However, the machine fails horribly at the screen capture. It’s missing frames all over the place ; I actually end up with video that’s missing almost half the frames, and plays back at about double speed. Also, any audio is just messed up so badly I can hardly think how to describe it, best I can come up with is that I get perhaps a quarter second of sound then a few seconds pause (the video meanwhile is actually still playing back, albeit with no sense of time).

    Some things occur to me that might be the cause, or cure, of my troubles, some of which I might be able to fix, others not so much. What other things should I try ? (I’d prefer to avoid simply throwing money at the issue with random ideas that are baseless !)

    1) perhaps the CPU lacks some "extensions" to the instruction set (I recall years ago some CPUs gaining MMX extensions") so that the CPU is fast at mundane computing but sucks at video encoding.

    2) perhaps the fact that the old machine had a dedicated graphics card, while this new one is sharing main memory with the graphics system means that reading the screen pixels is much slower.

    3) perhaps the fact that this new machine has a single DDR4 memory stick in it means that I’m forcing all the memory reads and writes for the computations through the same memory as is holding the screen, and that’s too much (implying that adding an additional memory stick might jus possibly help ?)

    4) perhaps there’s some bios setting that would allow more efficient sharing of video memory ?

    5) my favorite, perhaps there’s a better compression library that I could use to get decent quality screen capture with much less CPU usage.

    I should also note that I have tried this with -threads 0, and the CPU usage hovers between 100% and 200% ; around 100% when the screen is static, and rising as I move windows around and otherwise create more output.

    6) the motherboard claims to have some kind of hardware video encoder built into it. I haven’t paid this any attention to this point, as I assumed it was for the purpose of taking HDMI input and encoding it, but maybe there’s a way to use this, if so, what libraries might I need to get ffmpeg to do this.

    Edits :

    • This is an off the shelf ffmpeg. I’m certainly willing to try building it myself if I have some idea what I should do different.
    • The motherboard claims to have hardware encoders, but I’m struggling to find out what they are (seems like it’s an Intel chip called "UHD Graphics 605" but nothing I can find suggests ffmpeg can work with that)
    • command line right now has been (without audio) :

      ffmpeg  -video_size 1280x720   -f x11grab  -i ${DISPLAY}+100,100  -vcodec libx264  -f alsa -i pulse -acodec ac3 -threads 0  ./video$(date +%F-%H-%M-%S).mp4

    Log from a short recording session is :

    ffmpeg version 3.4.6-0ubuntu0.18.04.1 Copyright (c) 2000-2019 the FFmpeg developers
     built with gcc 7 (Ubuntu 7.3.0-16ubuntu3)
     configuration: --prefix=/usr --extra-version=0ubuntu0.18.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared
     libavutil      55. 78.100 / 55. 78.100
     libavcodec     57.107.100 / 57.107.100
     libavformat    57. 83.100 / 57. 83.100
     libavdevice    57. 10.100 / 57. 10.100
     libavfilter     6.107.100 /  6.107.100
     libavresample   3.  7.  0 /  3.  7.  0
     libswscale      4.  8.100 /  4.  8.100
     libswresample   2.  9.100 /  2.  9.100
     libpostproc    54.  7.100 / 54.  7.100
    [x11grab @ 0x561a723e5ac0] Stream #0: not enough frames to estimate rate; consider increasing probesize
    Input #0, x11grab, from ':0+100,100':
     Duration: N/A, start: 1578693116.465807, bitrate: N/A
       Stream #0:0: Video: rawvideo (BGR[0] / 0x524742), bgr0, 1280x720, 29.97 fps, 1000k tbr, 1000k tbn, 1000k tbc
    Unknown decoder 'libx264'
    simon@studio:~$ ffmpeg  -video_size 1280x720   -f x11grab  -i ${DISPLAY}+100,100  -vcodec libx264 -threads 0  ./video$(date +%F-%H-%M-%S).mp4
    ffmpeg version 3.4.6-0ubuntu0.18.04.1 Copyright (c) 2000-2019 the FFmpeg developers
     built with gcc 7 (Ubuntu 7.3.0-16ubuntu3)
     configuration: --prefix=/usr --extra-version=0ubuntu0.18.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared
     libavutil      55. 78.100 / 55. 78.100
     libavcodec     57.107.100 / 57.107.100
     libavformat    57. 83.100 / 57. 83.100
     libavdevice    57. 10.100 / 57. 10.100
     libavfilter     6.107.100 /  6.107.100
     libavresample   3.  7.  0 /  3.  7.  0
     libswscale      4.  8.100 /  4.  8.100
     libswresample   2.  9.100 /  2.  9.100
     libpostproc    54.  7.100 / 54.  7.100
    [x11grab @ 0x558225bc29a0] Stream #0: not enough frames to estimate rate; consider increasing probesize
    Input #0, x11grab, from ':0+100,100':
     Duration: N/A, start: 1578693132.513351, bitrate: N/A
       Stream #0:0: Video: rawvideo (BGR[0] / 0x524742), bgr0, 1280x720, 29.97 fps, 1000k tbr, 1000k tbn, 1000k tbc
    Stream mapping:
     Stream #0:0 -> #0:0 (rawvideo (native) -> h264 (libx264))
    Press [q] to stop, [?] for help
    [libx264 @ 0x558225bcd360] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2
    [libx264 @ 0x558225bcd360] profile High 4:4:4 Predictive, level 3.1, 4:4:4 8-bit
    [libx264 @ 0x558225bcd360] 264 - core 152 r2854 e9a5903 - H.264/MPEG-4 AVC codec - Copyleft 2003-2017 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x1:0x111 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=0 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=4 threads=6 lookahead_threads=1 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
    Output #0, mp4, to './video2020-01-10-14-52-12.mp4':
     Metadata:
       encoder         : Lavf57.83.100
       Stream #0:0: Video: h264 (libx264) (avc1 / 0x31637661), yuv444p, 1280x720, q=-1--1, 29.97 fps, 30k tbn, 29.97 tbc
       Metadata:
         encoder         : Lavc57.107.100 libx264
       Side data:
         cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: -1
    Past duration 0.806847 too large     256kB time=00:00:00.43 bitrate=4835.3kbits/s dup=16 drop=0 speed=0.207x    
    frame=  371 fps= 29 q=-1.0 Lsize=     639kB time=00:00:12.27 bitrate= 426.6kbits/s dup=16 drop=14 speed=0.971x    
    video:634kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.813096%
    [libx264 @ 0x558225bcd360] frame I:2     Avg QP:18.16  size:221502
    [libx264 @ 0x558225bcd360] frame P:93    Avg QP:14.97  size:  2007
    [libx264 @ 0x558225bcd360] frame B:276   Avg QP:20.13  size:    69
    [libx264 @ 0x558225bcd360] consecutive B-frames:  0.8%  0.0%  0.0% 99.2%
    [libx264 @ 0x558225bcd360] mb I  I16..4: 44.6%  0.0% 55.4%
    [libx264 @ 0x558225bcd360] mb P  I16..4:  0.2%  0.0%  0.3%  P16..4:  0.7%  0.1%  0.1%  0.0%  0.0%    skip:98.5%
    [libx264 @ 0x558225bcd360] mb B  I16..4:  0.0%  0.0%  0.0%  B16..8:  1.0%  0.0%  0.0%  direct: 0.0%  skip:99.0%  L0:50.9% L1:49.0% BI: 0.1%
    [libx264 @ 0x558225bcd360] coded y,u,v intra: 41.3% 37.5% 37.4% inter: 0.1% 0.0% 0.0%
    [libx264 @ 0x558225bcd360] i16 v,h,dc,p: 58% 41%  1%  0%
    [libx264 @ 0x558225bcd360] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 33% 30% 14%  2%  4%  4%  5%  3%  5%
    [libx264 @ 0x558225bcd360] Weighted P-Frames: Y:0.0% UV:0.0%
    [libx264 @ 0x558225bcd360] ref P L0: 59.2%  8.8% 25.5%  6.5%
    [libx264 @ 0x558225bcd360] ref B L0: 59.4% 39.0%  1.6%
    [libx264 @ 0x558225bcd360] ref B L1: 96.5%  3.5%
    [libx264 @ 0x558225bcd360] kb/s:419.29
    Exiting normally, received signal 2.
  • Processing Big Data Problems

    8 janvier 2011, par Multimedia Mike — Big Data

    I’m becoming more interested in big data problems, i.e., extracting useful information out of absurdly sized sets of input data. I know it’s a growing field and there is a lot to read on the subject. But you know how I roll— just think of a problem to solve and dive right in.

    Here’s how my adventure unfolded.

    The Corpus
    I need to run a command line program on a set of files I have collected. This corpus is on the order of 350,000 files. The files range from 7 bytes to 175 MB. Combined, they occupy around 164 GB of storage space.

    Oh, and said storage space resides on an external, USB 2.0-connected hard drive. Stop laughing.

    A file is named according to the SHA-1 hash of its data. The files are organized in a directory hierarchy according to the first 6 hex digits of the SHA-1 hash (e.g., a file named a4d5832f... is stored in a4/d5/83/a4d5832f...). All of this file hash, path, and size information is stored in an SQLite database.

    First Pass
    I wrote a Python script that read all the filenames from the database, fed them into a pool of worker processes using Python’s multiprocessing module, and wrote some resulting data for each file back to the SQLite database. My Eee PC has a single-core, hyperthreaded Atom which presents 2 CPUs to the system. Thus, 2 worker threads crunched the corpus. It took awhile. It took somewhere on the order of 9 or 10 or maybe even 12 hours. It took long enough that I’m in no hurry to re-run the test and get more precise numbers.

    At least I extracted my initial set of data from the corpus. Or did I ?

    Think About The Future

    A few days later, I went back to revisit the data only to notice that the SQLite database was corrupted. To add insult to that bit of injury, the script I had written to process the data was also completely corrupted (overwritten with something unrelated to Python code). BTW, this is was on a RAID brick configured for redundancy. So that’s strike 3 in my personal dealings with RAID technology.

    I moved the corpus to a different external drive and also verified the files after writing (easy to do since I already had the SHA-1 hashes on record).

    The corrupted script was pretty simple to rewrite, even a little better than before. Then I got to re-run it. However, this run was on a faster machine, a hyperthreaded, quad-core beast that exposes 8 CPUs to the system. The reason I wasn’t too concerned about the poor performance with my Eee PC is that I knew I was going to be able to run in on this monster later.

    So I let the rewritten script rip. The script gave me little updates regarding its progress. As it did so, I ran some rough calculations and realized that it wasn’t predicted to finish much sooner than it would have if I were running it on the Eee PC.

    Limiting Factors
    It had been suggested to me that I/O bandwidth of the external USB drive might be a limiting factor. This is when I started to take that idea very seriously.

    The first idea I had was to move the SQLite database to a different drive. The script records data to the database for every file processed, though it only commits once every 100 UPDATEs, so at least it’s not constantly syncing the disc. I ran before and after tests with a small subset of the corpus and noticed a substantial speedup thanks to this policy chance.

    Then I remembered hearing something about "atime" which is access time. Linux filesystems, per default, record the time that a file was last accessed. You can watch this in action by running 'stat <file> ; cat <file> > /dev/null ; stat <file>' and observe that the "Access" field has been updated to NOW(). This also means that every single file that gets read from the external drive still causes an additional write. To avoid this, I started mounting the external drive with '-o noatime' which instructs Linux not to record "last accessed" time for files.

    On the limited subset test, this more than doubled script performance. I then wondered about mounting the external drive as read-only. This had the same performance as noatime. I thought about using both options together but verified that access times are not updated for a read-only filesystem.

    A Note On Profiling
    Once you start accessing files in Linux, those files start getting cached in RAM. Thus, if you profile, say, reading a gigabyte file from a disk and get 31 MB/sec, and then repeat the same test, you’re likely to see the test complete instantaneously. That’s because the file is already sitting in memory, cached. This is useful in general application use, but not if you’re trying to profile disk performance.

    Thus, in between runs, do (as root) 'sync; echo 3 > /proc/sys/vm/drop_caches' in order to wipe caches (explained here).

    Even Better ?
    I re-ran the test using these little improvements. Now it takes somewhere around 5 or 6 hours to run.

    I contrived an artificially large file on the external drive and did some 'dd' tests to measure what the drive could really do. The drive consistently measured a bit over 31 MB/sec. If I could read and process the data at 30 MB/sec, the script would be done in about 95 minutes.

    But it’s probably rather unreasonable to expect that kind of transfer rate for lots of smaller files scattered around a filesystem. However, it can’t be that helpful to have 8 different processes constantly asking the HD for 8 different files at any one time.

    So I wrote a script called stream-corpus.py which simply fetched all the filenames from the database and loaded the contents of each in turn, leaving the data to be garbage-collected at Python’s leisure. This test completed in 174 minutes, just shy of 3 hours. I computed an average read speed of around 17 MB/sec.

    Single-Reader Script
    I began to theorize that if I only have one thread reading, performance should improve greatly. To test this hypothesis without having to do a lot of extra work, I cleared the caches and ran stream-corpus.py until 'top' reported that about half of the real memory had been filled with data. Then I let the main processing script loose on the data. As both scripts were using sorted lists of files, they iterated over the filenames in the same order.

    Result : The processing script tore through the files that had obviously been cached thanks to stream-corpus.py, degrading drastically once it had caught up to the streaming script.

    Thus, I was incented to reorganize the processing script just slightly. Now, there is a reader thread which reads each file and stuffs the name of the file into an IPC queue that one of the worker threads can pick up and process. Note that no file data is exchanged between threads. No need— the operating system is already implicitly holding onto the file data, waiting in case someone asks for it again before something needs that bit of RAM. Technically, this approach accesses each file multiple times. But it makes little practical difference thanks to caching.

    Result : About 183 minutes to process the complete corpus (which works out to a little over 16 MB/sec).

    Why Multiprocess
    Is it even worthwhile to bother multithreading this operation ? Monitoring the whole operation via 'top', most instances of the processing script are barely using any CPU time. Indeed, it’s likely that only one of the worker threads is doing any work most of the time, pulling a file out of the IPC queue as soon the reader thread triggers its load into cache. Right now, the processing is usually pretty quick. There are cases where the processing (external program) might hang (one of the reasons I’m running this project is to find those cases) ; the multiprocessing architecture at least allows other processes to take over until a hanging process is timed out and killed by its monitoring process.

    Further, the processing is pretty simple now but is likely to get more intensive in future iterations. Plus, there’s the possibility that I might move everything onto a more appropriately-connected storage medium which should help alleviate the bottleneck bravely battled in this post.

    There’s also the theoretical possibility that the reader thread could read too far ahead of the processing threads. Obviously, that’s not too much of an issue in the current setup. But to guard against it, the processes could share a variable that tracks the total number of bytes that have been processed. The reader thread adds filesizes to the count while the processing threads subtract file sizes. The reader thread would delay reading more if the number got above a certain threshold.

    Leftovers
    I wondered if the order of accessing the files mattered. I didn’t write them to the drive in any special order. The drive is formatted with Linux ext3. I ran stream-corpus.py on all the filenames sorted by filename (remember the SHA-1 naming convention described above) and also by sorting them randomly.

    Result : It helps immensely for the filenames to be sorted. The sorted variant was a little more than twice as fast as the random variant. Maybe it has to do with accessing all the files in a single directory before moving onto another directory.

    Further, I have long been under the impression that the best read speed you can expect from USB 2.0 was 27 Mbytes/sec (even though 480 Mbit/sec is bandied about in relation to the spec). This comes from profiling I performed with an external enclosure that supports both USB 2.0 and FireWire-400 (and eSata). FW-400 was able to read the same file at nearly 40 Mbytes/sec that USB 2.0 could only read at 27 Mbytes/sec. Other sources I have read corroborate this number. But this test (using different hardware), achieved over 31 Mbytes/sec.