Recherche avancée

Médias (0)

Mot : - Tags -/auteurs

Aucun média correspondant à vos critères n’est disponible sur le site.

Autres articles (62)

  • Récupération d’informations sur le site maître à l’installation d’une instance

    26 novembre 2010, par

    Utilité
    Sur le site principal, une instance de mutualisation est définie par plusieurs choses : Les données dans la table spip_mutus ; Son logo ; Son auteur principal (id_admin dans la table spip_mutus correspondant à un id_auteur de la table spip_auteurs)qui sera le seul à pouvoir créer définitivement l’instance de mutualisation ;
    Il peut donc être tout à fait judicieux de vouloir récupérer certaines de ces informations afin de compléter l’installation d’une instance pour, par exemple : récupérer le (...)

  • Les tâches Cron régulières de la ferme

    1er décembre 2010, par

    La gestion de la ferme passe par l’exécution à intervalle régulier de plusieurs tâches répétitives dites Cron.
    Le super Cron (gestion_mutu_super_cron)
    Cette tâche, planifiée chaque minute, a pour simple effet d’appeler le Cron de l’ensemble des instances de la mutualisation régulièrement. Couplée avec un Cron système sur le site central de la mutualisation, cela permet de simplement générer des visites régulières sur les différents sites et éviter que les tâches des sites peu visités soient trop (...)

  • Submit bugs and patches

    13 avril 2011

    Unfortunately a software is never perfect.
    If you think you have found a bug, report it using our ticket system. Please to help us to fix it by providing the following information : the browser you are using, including the exact version as precise an explanation as possible of the problem if possible, the steps taken resulting in the problem a link to the site / page in question
    If you think you have solved the bug, fill in a ticket and attach to it a corrective patch.
    You may also (...)

Sur d’autres sites (8602)

  • Beware the builtins

    14 janvier 2010, par Mans — Compilers

    GCC includes a large number of builtin functions allegedly providing optimised code for common operations not easily expressed directly in C. Rather than taking such claims at face value (this is GCC after all), I decided to conduct a small investigation to see how well a few of these functions are actually implemented for various targets.

    For my test, I selected the following functions :

    • __builtin_bswap32 : Byte-swap a 32-bit word.
    • __builtin_bswap64 : Byte-swap a 64-bit word.
    • __builtin_clz : Count leading zeros in a word.
    • __builtin_ctz : Count trailing zeros in a word.
    • __builtin_prefetch : Prefetch data into cache.

    To test the quality of these builtins, I wrapped each in a normal function, then compiled the code for these targets :

    • ARMv7
    • AVR32
    • MIPS
    • MIPS64
    • PowerPC
    • PowerPC64
    • x86
    • x86_64

    In all cases I used compiler flags were -O3 -fomit-frame-pointer plus any flags required to select a modern CPU model.

    ARM

    Both __builtin_clz and __builtin_prefetch generate the expected CLZ and PLD instructions respectively. The code for __builtin_ctz is reasonable for ARMv6 and earlier :

    rsb     r3, r0, #0
    and     r0, r3, r0
    clz     r0, r0
    rsb     r0, r0, #31
    

    For ARMv7 (in fact v6T2), however, using the new bit-reversal instruction would have been better :

    rbit    r0, r0
    clz     r0, r0
    

    I suspect this is simply a matter of the function not yet having been updated for ARMv7, which is perhaps even excusable given the relatively rare use cases for it.

    The byte-reversal functions are where it gets shocking. Rather than use the REV instruction found from ARMv6 on, both of them generate external calls to __bswapsi2 and __bswapdi2 in libgcc, which is plain C code :

    SItype
    __bswapsi2 (SItype u)
    
      return ((((u) & 0xff000000) >> 24)
              | (((u) & 0x00ff0000) >>  8)
              | (((u) & 0x0000ff00) <<  8)
              | (((u) & 0x000000ff) << 24)) ;
    
    

    DItype
    __bswapdi2 (DItype u)

    return ((((u) & 0xff00000000000000ull) >> 56)
    | (((u) & 0x00ff000000000000ull) >> 40)
    | (((u) & 0x0000ff0000000000ull) >> 24)
    | (((u) & 0x000000ff00000000ull) >> 8)
    | (((u) & 0x00000000ff000000ull) << 8)
    | (((u) & 0x0000000000ff0000ull) << 24)
    | (((u) & 0x000000000000ff00ull) << 40)
    | (((u) & 0x00000000000000ffull) << 56)) ;

    While the 32-bit version compiles to a reasonable-looking shift/mask/or job, the 64-bit one is a real WTF. Brace yourselves :

    push    r4, r5, r6, r7, r8, r9, sl, fp
    mov     r5, #0
    mov     r6, #65280 ; 0xff00
    sub     sp, sp, #40 ; 0x28
    and     r7, r0, r5
    and     r8, r1, r6
    str     r7, [sp, #8]
    str     r8, [sp, #12]
    mov     r9, #0
    mov     r4, r1
    and     r5, r0, r9
    mov     sl, #255 ; 0xff
    ldr     r9, [sp, #8]
    and     r6, r4, sl
    mov     ip, #16711680 ; 0xff0000
    str     r5, [sp, #16]
    str     r6, [sp, #20]
    lsl     r2, r0, #24
    and     ip, ip, r1
    lsr     r7, r4, #24
    mov     r1, #0
    lsr     r5, r9, #24
    mov     sl, #0
    mov     r9, #-16777216 ; 0xff000000
    and     fp, r0, r9
    lsr     r6, ip, #8
    orr     r9, r7, r1
    and     ip, r4, sl
    orr     sl, r1, r2
    str     r6, [sp]
    str     r9, [sp, #32]
    str     sl, [sp, #36] ; 0x24
    add     r8, sp, #32
    ldm     r8, r7, r8
    str     r1, [sp, #4]
    ldm     sp, r9, sl
    orr     r7, r7, r9
    orr     r8, r8, sl
    str     r7, [sp, #32]
    str     r8, [sp, #36] ; 0x24
    mov     r3, r0
    mov     r7, #16711680 ; 0xff0000
    mov     r8, #0
    and     r9, r3, r7
    and     sl, r4, r8
    ldr     r0, [sp, #16]
    str     fp, [sp, #24]
    str     ip, [sp, #28]
    stm     sp, r9, sl
    ldr     r7, [sp, #20]
    ldr     sl, [sp, #12]
    ldr     fp, [sp, #12]
    ldr     r8, [sp, #28]
    lsr     r0, r0, #8
    orr     r7, r0, r7, lsl #24
    lsr     r6, sl, #24
    orr     r5, r5, fp, lsl #8
    lsl     sl, r8, #8
    mov     fp, r7
    add     r8, sp, #32
    ldm     r8, r7, r8
    orr     r6, r6, r8
    ldr     r8, [sp, #20]
    ldr     r0, [sp, #24]
    orr     r5, r5, r7
    lsr     r8, r8, #8
    orr     sl, sl, r0, lsr #24
    mov     ip, r8
    ldr     r0, [sp, #4]
    orr     fp, fp, r5
    ldr     r5, [sp, #24]
    orr     ip, ip, r6
    ldr     r6, [sp]
    lsl     r9, r5, #8
    lsl     r8, r0, #24
    orr     fp, fp, r9
    lsl     r3, r3, #8
    orr     r8, r8, r6, lsr #8
    orr     ip, ip, sl
    lsl     r7, r6, #24
    and     r5, r3, #16711680 ; 0xff0000
    orr     r7, r7, fp
    orr     r8, r8, ip
    orr     r4, r1, r7
    orr     r5, r5, r8
    mov     r9, r6
    mov     r1, r5
    mov     r0, r4
    add     sp, sp, #40 ; 0x28
    pop     r4, r5, r6, r7, r8, r9, sl, fp
    bx      lr
    

    That’s right, 91 instructions to move 8 bytes around a bit. GCC definitely has a problem with 64-bit numbers. It is perhaps worth noting that the bswap_64 macro in glibc splits the 64-bit value into 32-bit halves which are then reversed independently, thus side-stepping this weakness of gcc.

    As a side note, ARM RVCT (armcc) compiles those functions perfectly into one and two REV instructions, respectively.

    AVR32

    There is not much to report here. The latest gcc version available is 4.2.4, which doesn’t appear to have the bswap functions. The other three are handled nicely, even using a bit-reverse for __builtin_ctz.

    MIPS / MIPS64

    The situation MIPS is similar to ARM. Both bswap builtins result in external libgcc calls, the rest giving sensible code.

    PowerPC

    I scarcely believe my eyes, but this one is actually not bad. The PowerPC has no byte-reversal instructions, yet someone seems to have taken the time to teach gcc a good instruction sequence for this operation. The PowerPC does have some powerful rotate-and-mask instructions which come in handy here. First the 32-bit version :

    rotlwi  r0,r3,8
    rlwimi  r0,r3,24,0,7
    rlwimi  r0,r3,24,16,23
    mr      r3,r0
    blr
    

    The 64-bit byte-reversal simply applies the above code on each half of the value :

    rotlwi  r0,r3,8
    rlwimi  r0,r3,24,0,7
    rlwimi  r0,r3,24,16,23
    rotlwi  r3,r4,8
    rlwimi  r3,r4,24,0,7
    rlwimi  r3,r4,24,16,23
    mr      r4,r0
    blr
    

    Although I haven’t analysed that code carefully, it looks pretty good.

    PowerPC64

    Doing 64-bit operations is easier on a 64-bit CPU, right ? For you and me perhaps, but not for gcc. Here __builtin_bswap64 gives us the now familiar __bswapdi2 call, and while not as bad as the ARM version, it is not pretty :

    rldicr  r0,r3,8,55
    rldicr  r10,r3,56,7
    rldicr  r0,r0,56,15
    rldicl  r11,r3,8,56
    rldicr  r9,r3,16,47
    or      r11,r10,r11
    rldicr  r9,r9,48,23
    rldicl  r10,r0,24,40
    rldicr  r0,r3,24,39
    or      r11,r11,r10
    rldicl  r9,r9,40,24
    rldicr  r0,r0,40,31
    or      r9,r11,r9
    rlwinm  r10,r3,0,0,7
    rldicl  r0,r0,56,8
    or      r0,r9,r0
    rldicr  r10,r10,8,55
    rlwinm  r11,r3,0,8,15
    or      r0,r0,r10
    rldicr  r11,r11,24,39
    rlwinm  r3,r3,0,16,23
    or      r0,r0,r11
    rldicr  r3,r3,40,23
    or      r3,r0,r3
    blr
    

    That is 6 times longer than the (presumably) hand-written 32-bit version.

    x86 / x86_64

    As one might expect, results on x86 are good. All the tested functions use the available special instructions. One word of caution though : the bit-counting instructions are very slow on some implementations, specifically the Atom, AMD chips, and the notoriously slow Pentium4E.

    Conclusion

    In conclusion, I would say gcc builtins can be useful to avoid fragile inline assembler. Before using them, however, one should make sure they are not in fact harmful on the required targets. Not even those builtins mapping directly to CPU instructions can be trusted.

  • VP8 : a retrospective

    13 juillet 2010, par Dark Shikari — DCT, VP8, speed

    I’ve been working the past few weeks to help finish up the ffmpeg VP8 decoder, the first community implementation of On2′s VP8 video format. Now that I’ve written a thousand or two lines of assembly code and optimized a good bit of the C code, I’d like to look back at VP8 and comment on a variety of things — both good and bad — that slipped the net the first time, along with things that have changed since the time of that blog post.

    These are less-so issues related to compression — that issue has been beaten to death, particularly in MSU’s recent comparison, where x264 beat the crap out of VP8 and the VP8 developers pulled a Pinocchio in the developer comments. But that was expected and isn’t particularly interesting, so I won’t go into that. VP8 doesn’t have to be the best in the world in order to be useful.

    When the ffmpeg VP8 decoder is complete (just a few more asm functions to go), we’ll hopefully be able to post some benchmarks comparing it to libvpx.

    1. The spec, er, I mean, bitstream guide.

    Google has reneged on their claim that a spec existed at all and renamed it a “bitstream guide”. This is probably after it was found that — not merely was it incomplete — but at least a dozen places in the spec differed wildly from what was actually in their own encoder and decoder software ! The deblocking filter, motion vector clamping, probability tables, and many more parts simply disagreed flat-out with the spec. Fortunately, Ronald Bultje, one of the main authors of the ffmpeg VP8 decoder, is rather skilled at reverse-engineering, so we were able to put together a matching implementation regardless.

    Most of the differences aren’t particularly important — they don’t have a huge effect on compression or anything — but make it vastly more difficult to implement a “working” VP8 decoder, or for that matter, decide what “working” really is. For example, Google’s decoder will, if told to “swap the ALT and GOLDEN reference frames”, overwrite both with GOLDEN, because it first sets GOLDEN = ALT, and then sets ALT = GOLDEN. Is this a bug ? Or is this how it’s supposed to work ? It’s hard to tell — there isn’t a spec to say so. Google says that whatever libvpx does is right, but I doubt they intended this.

    I expect a spec will eventually be written, but it was a bit obnoxious of Google — both to the community and to their own developers — to release so early that they didn’t even have their own documentation ready.

    2. The TM intra prediction mode.

    One thing I glossed over in the original piece was that On2 had added an extra intra prediction mode to the standard batch that H.264 came with — they replaced Planar with “TM pred”. For i4x4, which didn’t have a Planar mode, they just added it without replacing an old one, resulting in a total of 10 modes to H.264′s 9. After understanding and writing assembly code for TM pred, I have to say that it is quite a cool idea. Here’s how it works :

    1. Let us take a block of size 4×4, 8×8, or 16×16.

    2. Define the pixels bordering the top of this block (starting from the left) as T[0], T[1], T[2]…

    3. Define the pixels bordering the left of this block (starting from the top) as L[0], L[1], L[2]…

    4. Define the pixel above the top-left of the block as TL.

    5. Predict every pixel <X,Y> in the block to be equal to clip3( T[X] + L[Y] – TL, 0, 255).

    It’s effectively a generalization of gradient prediction to the block level — predict each pixel based on the gradient between its top and left pixels, and the topleft. According to the VP8 devs, it’s chosen by the encoder quite a lot of the time, which isn’t surprising ; it seems like a pretty good idea. As just one more intra pred mode, it’s not going to do magic for compression, but it’s a cool idea and elegantly simple.

    3. Performance and the deblocking filter.

    On2 advertised for quite some that VP8′s goal was to be significantly faster to decode than H.264. When I saw the spec, I waited for the punchline, but apparently they were serious. There’s nothing wrong with being of similar speed or a bit slower — but I was rather confused as to the fact that their design didn’t match their stated goal at all. What apparently happened is they had multiple profiles of VP8 — high and low complexity profiles. They marketed the performance of the low complexity ones while touting the quality of the high complexity ones, a tad dishonest. More importantly though, practically nobody is using the low complexity modes, so anyone writing a decoder has to be prepared to handle the high complexity ones, which are the default.

    The primary time-eater here is the deblocking filter. VP8, being an H.264 derivative, has much the same problem as H.264 does in terms of deblocking — it spends an absurd amount of time there. As I write this post, we’re about to finish some of the deblocking filter asm code, but before it’s committed, up to 70% or more of total decoding time is spent in the deblocking filter ! Like H.264, it suffers from the 4×4 transform problem : a 4×4 transform requires a total of 8 length-16 and 8 length-8 loopfilter calls per macroblock, while Theora, with only an 8×8 transform, requires half that.

    This problem is aggravated in VP8 by the fact that the deblocking filter isn’t strength-adaptive ; if even one 4×4 block in a macroblock contains coefficients, every single edge has to be deblocked. Furthermore, the deblocking filter itself is quite complicated ; the “inner edge” filter is a bit more complex than H.264′s and the “macroblock edge” filter is vastly more complicated, having two entirely different codepaths chosen on a per-pixel basis. Of course, in SIMD, this means you have to do both and mask them together at the end.

    There’s nothing wrong with a good-but-slow deblocking filter. But given the amount of deblocking one needs to do in a 4×4-transform-based format, it might have been a better choice to make the filter simpler. It’s pretty difficult to beat H.264 on compression, but it’s certainly not hard to beat it on speed — and yet it seems VP8 missed a perfectly good chance to do so. Another option would have been to pick an 8×8 transform instead of 4×4, reducing the amount of deblocking by a factor of 2.

    And yes, there’s a simple filter available in the low complexity profile, but it doesn’t help if nobody uses it.

    4. Tree-based arithmetic coding.

    Binary arithmetic coding has become the standard entropy coding method for a wide variety of compressed formats, ranging from LZMA to VP6, H.264 and VP8. It’s simple, relatively fast compared to other arithmetic coding schemes, and easy to make adaptive. The problem with this is that you have to come up with a method for converting non-binary symbols into a list of binary symbols, and then choosing what probabilities to use to code each one. Here’s an example from H.264, the sub-partition mode symbol, which is either 8×8, 8×4, 4×8, or 4×4. encode_decision( context, bit ) writes a binary decision (bit) into a numbered context (context).

    8×8 : encode_decision( 21, 0 ) ;

    8×4 : encode_decision( 21, 1 ) ; encode_decision( 22, 0 ) ;

    4×8 : encode_decision( 21, 1 ) ; encode_decision( 22, 1 ) ; encode_decision( 23, 1 ) ;

    4×4 : encode_decision( 21, 1 ) ; encode_decision( 22, 1 ) ; encode_decision( 23, 0 ) ;

    As can be seen, this is clearly like a Huffman tree. Wouldn’t it be nice if we could represent this in the form of an actual tree data structure instead of code ? On2 thought so — they designed a simple system in VP8 that allowed all binarization schemes in the entire format to be represented as simple tree data structures. This greatly reduces the complexity — not speed-wise, but implementation-wise — of the entropy coder. Personally, I quite like it.

    5. The inverse transform ordering.

    I should at some point write a post about common mistakes made in video formats that everyone keeps making. These are not issues that are patent worries or huge issues for compression — just stupid mistakes that are repeatedly made in new video formats, probably because someone just never asked the guy next to him “does this look stupid ?” before sticking it in the spec.

    One common mistake is the problem of transform ordering. Every sane 2D transform is “separable” — that is, it can be done by doing a 1D transform vertically and doing the 1D transform again horizontally (or vice versa). The original iDCT as used in JPEG, H.263, and MPEG-1/2/4 was an “idealized” iDCT — nobody had to use the exact same iDCT, theirs just had to give very close results to a reference implementation. This ended up resulting in a lot of practical problems. It was also slow ; the only way to get an accurate enough iDCT was to do all the intermediate math in 32-bit.

    Practically every modern format, accordingly, has specified an exact iDCT. This includes H.264, VC-1, RV40, Theora, VP8, and many more. Of course, with an exact iDCT comes an exact ordering — while the “real” iDCT can be done in any order, an exact iDCT usually requires an exact order. That is, it specifies horizontal and then vertical, or vertical and then horizontal.

    All of these transforms end up being implemented in SIMD. In SIMD, a vertical transform is generally the only option, so a transpose is added to the process instead of doing a horizontal transform. Accordingly, there are two ways to do it :

    1. Transpose, vertical transform, transpose, vertical transform.

    2. Vertical transform, transpose, vertical transform, transpose.

    These may seem to be equally good, but there’s one catch — if the transpose is done first, it can be completely eliminated by merging it into the coefficient decoding process. On many modern CPUs, particularly x86, transposes are very expensive, so eliminating one of the two gives a pretty significant speed benefit.

    H.264 did it way 1).

    VC-1 did it way 1).

    Theora (inherited from VP3) did it way 1).

    But no. VP8 has to do it way 2), where you can’t eliminate the transpose. Bah. It’s not a huge deal ; probably only 1-2% overall at most speed-wise, but it’s just a needless waste. What really bugs me is that VP3 got it right — why in the world did they screw it up this time around if they got it right beforehand ?

    RV40 is the other modern format I know that made this mistake.

    (NB : You can do transforms without a transpose, but it’s generally not worth it unless the intermediate needs 32-bit math, as in the case of the “real” iDCT.)

    6. Not supporting interlacing.

    THANK YOU THANK YOU THANK YOU THANK YOU THANK YOU THANK YOU THANK YOU.

    Interlacing was the scourge of H.264. It weaseled its way into every nook and cranny of the spec, making every decoder a thousand lines longer. H.264 even included a highly complicated — and effective — dedicated interlaced coding scheme, MBAFF. The mere existence of MBAFF, despite its usefulness for broadcasters and others still stuck in the analog age with their 1080i, 576i , and 480i content, was a blight upon the video format.

    VP8 has once and for all avoided it.

    And if anyone suggests adding interlaced support to the experimental VP8 branch, find a straightjacket and padded cell for them before they cause any real damage.

  • Naive Sorenson Video 1 Encoder

    12 septembre 2010, par Multimedia Mike — General

    (Yes, the word is “naive” — or rather, “naïve” — not “native”. People always try to correct me when I use the word. Indeed, it should actually be written with 2 dots over the ‘i’ but who has a keyboard that can easily do that ?)

    At the most primitive level, programming a video encoder is about writing out a sequence of bits that the corresponding video decoder will understand. It’s sort of like creating a program — represented as a stream of opcodes — that will run on a given microprocessor or virtual machine. In fact, reading a video codec bitstream specification will reveal a lot of terminology along the lines of “transmitting information to the decoder” or “signaling the decoder to do xyz.”

    Creating a good encoder that will deliver decent quality at a reasonable bitrate is difficult. Creating a naive encoder that produces a technically compliant bitstream, not so much.



    When I wrote an FFmpeg encoder for Sorenson Video 1 (SVQ1), the first step was to just create a minimally compliant bitstream. The coarsest encoding mode that SVQ1 allows is to encode the average (mean) of each 16×16 block of samples. So I created an encoder that just encoded the mean of each block. Apple’s QuickTime Player was able to play the resulting video in all of its blocky glory. The result rather reminds me of the Super Nintendo’s mosaic effect.

    Level 5 blocks (mean-only 16×16 encoding) :



    Level 3 blocks (mean-only 8×8 encoding) :



    It’s one thing for your own decoder (in this case, FFmpeg’s own decoder) to be able to decode the data. The big test is whether the official decoder (in this case, Apple QuickTime Player) can decode the file.



    Now that’s a good feeling. After establishing that sort of baseline, it’s possible to adapt more and more features of the codec.