Hardwarebug : Everything is broken

Articles published on the site

Bit-field badness

30 January 2010, by Mans - Compilers, Optimisation
Consider the following C code, which is based on a real-world situation.
```
struct bf1_31 {
    unsigned a:1;
    unsigned b:31;
};

void func(struct bf1_31 *p, int n, int a)
{
    int i = 0;
    do {
        if (p[i].a)
            p[i].b += a;
    } while (++i < n);
}
```
How would we best write this in ARM assembler? This is how I would do it:
```
func:
        ldr     r3,  [r0], #4
        tst     r3,  #1
        add     r3,  r3,  r2,  lsl #1
        strne   r3,  [r0, #-4]
        subs    r1,  r1,  #1
        bgt     func
        bx      lr
```
The add instruction is unconditional to avoid a dependency on the comparison. Unrolling the loop would mask the latency of the ldr instruction as well, but that is outside the scope of this experiment.
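
For readers who prefer C, the hand-written loop can be approximated by treating each struct as a single 32-bit word. This is only a sketch: it assumes the layout implied by the assembly above, with `a` in bit 0 and `b` in bits 1 to 31, which the C standard does not guarantee, and the name `func_word` is made up for illustration.

```c
#include <stdint.h>

/* Assumed layout (implied by the assembly above, not guaranteed by C):
 * bit 0 holds field a, bits 1..31 hold field b. */
void func_word(uint32_t *p, int n, int a)
{
    int i = 0;
    do {
        uint32_t w = p[i];                  /* one load covers both fields */
        if (w & 1)                          /* test field a */
            p[i] = w + ((uint32_t)a << 1);  /* b += a; bit 0 is untouched */
    } while (++i < n);
}
```

Because `a << 1` always has a zero in bit 0, the add cannot disturb field `a`, and any carry out of `b` falls off the top of the word, which matches the wrap-around of a 31-bit unsigned field.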

Now compile this code with gcc -march=armv5te -O3 and watch in horror:
```
func:
        push    {r4}
        mov     ip, #0
        mov     r4, r2
loop:
        ldrb    r3, [r0]
        add     ip, ip, #1
        tst     r3, #1
        ldrne   r3, [r0]
        andne   r2, r3, #1
        addne   r3, r4, r3, lsr #1
        orrne   r2, r2, r3, lsl #1
        strne   r2, [r0]
        cmp     ip, r1
        add     r0, r0, #4
        blt     loop
        pop     {r4}
        bx      lr
```
This is nothing short of awful:
- The same value is loaded from memory twice.
- A complicated mask/shift/or operation is used where a simple shifted add would suffice.
- Write-back addressing is not used.
- The loop control counts up and compares instead of counting down.
- Useless mov in the prologue; swapping the roles of r2 and r4 would avoid this.
- Using lr in place of r4 would allow the return to be done with pop {pc}, saving one instruction (ignoring for the moment that no callee-saved registers are needed at all).
Even for this trivial function the gcc-generated code is more than twice the optimal size and slower by approximately the same factor.

The main issue I wanted to illustrate is the poor handling of bit-fields by gcc. When accessing bit-fields in memory, gcc issues a separate load for each field even when they are contained in the same aligned memory word. Although each load after the first will most likely hit L1 cache, this is still bad for several reasons:
- Loads typically have a result latency of two or three cycles, compared to one cycle for data processing instructions. Any bit-field can be extracted from a register with two shifts, and on ARM the second of these can generally be achieved using a shifted second operand to a following instruction. The ARMv6T2 instruction set also adds the SBFX and UBFX instructions for extracting any signed or unsigned bit-field in one cycle.
- Most CPUs have more data processing units than load/store units. It is thus more likely for an ALU instruction than a load/store to issue without delay on a superscalar processor.
- Redundant memory accesses can trigger early flushing of store buffers, rendering these less efficient.
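
The two-shift extraction mentioned in the first point can be sketched in C as follows. The function name `extract_ubf` and its parameters are hypothetical; only the unsigned case is shown, since a signed field would need an arithmetic right shift, whose behaviour on negative values is implementation-defined in C.

```c
#include <stdint.h>

/* Extract an unsigned field of `width` bits starting at bit `lsb`.
 * Requires 1 <= width <= 32 and lsb + width <= 32. */
uint32_t extract_ubf(uint32_t w, unsigned lsb, unsigned width)
{
    uint32_t top = w << (32 - lsb - width);  /* drop the bits above the field */
    return top >> (32 - width);              /* drop the bits below, right-align */
}
```

With the word already in a register, both fields of `struct bf1_31` come from a single load: `extract_ubf(w, 0, 1)` for `a` and `extract_ubf(w, 1, 31)` for `b`, assuming `a` occupies bit 0 as in the assembly shown earlier.
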
No gcc bashing is complete without a comparison with another compiler, so without further ado, here is the ARM RVCT output (armcc --cpu 5te -O3):
```
func:
        mov     r3, #0
        push    {r4, lr}
loop:
        ldr     ip, [r0, r3, lsl #2]
        tst     ip, #1
        addne   ip, ip, r2, lsl #1
        strne   ip, [r0, r3, lsl #2]
        add     r3, r3, #1
        cmp     r3, r1
        blt     loop
        pop     {r4, pc}
```
This is much better, the core loop using only one instruction more than my version. The loop control is counting up, but at least this register is reused as offset for the memory accesses. More remarkable is the push/pop of two registers that are never used. I had not expected to see this from RVCT.

Even the best compilers are still no match for a human.

ARM compiler update

15 January 2010, by Mans - ARM, Compilers
Since my last shootout, all the tested vendors have updated their compilers. Here is a quick update on each of them.

Both the 4.3 and 4.4 branches of FSF GCC have had bugfix releases, bringing them to 4.3.4 and 4.4.2, respectively. Neither update contains anything particularly noteworthy.

The CodeSourcery 2009q3 release sees an update to a GCC 4.4 base, a significant change from the 4.3 base used in 2009q1. The update is a mixed blessing. In fact, it is mostly a curse and hardly a blessing at all. On the bright side, the floating-point speed regressions in 2009q1 are gone, 2009q3 being a few per cent faster even than 2007q3. Unfortunately, this improvement is completely overshadowed by a major speed regression on integer code, a whopping 24% in one case. This ties in with the slowdown previously observed with FSF GCC 4.4 compared to 4.3.

ARM RVCT 4.0 is now at Build 697. This update fixes some bugs and introduces others. Notably, it no longer builds FFmpeg correctly. The issue has been reported to ARM.

Texas Instruments, finally, have made a formal release, v4.6.1, of their TMS470 compiler incorporating various fixes allowing it to build a moderately patched FFmpeg. The performance remains somewhere between GCC and RVCT on average.

In light of the above, my recommendations remain unchanged:
- For a free compiler, choose CodeSourcery 2009q1. It beats GCC 4.3.4 by 5-10% in most cases.
- GNU purists are best served by GCC 4.3.4, which is up to 20% faster than 4.4.2 and rarely slower.
- When price is not a concern, ARM RVCT is a good option, outperforming GCC by up to a factor of 2.
- In all cases, disable any auto-vectorisation features.
Regardless of which compiler is chosen, I cannot overstress the importance of testing. All compilers are crawling with bugs, and even the most innocent-looking code change can trigger one of them. When using a compiler other than GCC, extra caution is advised considering a lot of code is developed using only GCC and may thus fall prey to bugs unique to said other compiler.

Beware the builtins

14 January 2010, by Mans - Compilers
GCC includes a large number of builtin functions allegedly providing optimised code for common operations not easily expressed directly in C. Rather than taking such claims at face value (this is GCC after all), I decided to conduct a small investigation to see how well a few of these functions are actually implemented for various targets.

For my test, I selected the following functions:
- __builtin_bswap32: Byte-swap a 32-bit word.
- __builtin_bswap64: Byte-swap a 64-bit word.
- __builtin_clz: Count leading zeros in a word.
- __builtin_ctz: Count trailing zeros in a word.
- __builtin_prefetch: Prefetch data into cache.
To test the quality of these builtins, I wrapped each in a normal function, then compiled the code for these targets:
- ARMv7
- AVR32
- MIPS
- MIPS64
- PowerPC
- PowerPC64
- x86
- x86_64
In all cases the compiler flags used were -O3 -fomit-frame-pointer plus any flags required to select a modern CPU model.
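
The wrappers themselves are not shown in the post. A minimal sketch of what such a test file might look like is given here; the `wrap_*` names are hypothetical, while the builtins are the GCC ones listed above.

```c
#include <stdint.h>

/* Hypothetical wrapper names; each function only forwards to one builtin. */
uint32_t wrap_bswap32(uint32_t x) { return __builtin_bswap32(x); }
uint64_t wrap_bswap64(uint64_t x) { return __builtin_bswap64(x); }
int wrap_clz(unsigned x) { return __builtin_clz(x); }  /* undefined for x == 0 */
int wrap_ctz(unsigned x) { return __builtin_ctz(x); }  /* undefined for x == 0 */
void wrap_prefetch(const void *p) { __builtin_prefetch(p); }
```

Compiling a file like this with -S and reading the resulting assembly shows what each builtin expands to on a given target.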

ARM

Both __builtin_clz and __builtin_prefetch generate the expected CLZ and PLD instructions respectively. The code for __builtin_ctz is reasonable for ARMv6 and earlier:
```
rsb     r3, r0, #0
and     r0, r3, r0
clz     r0, r0
rsb     r0, r0, #31
```
For ARMv7 (in fact v6T2), however, using the new bit-reversal instruction would have been better:
```
rbit    r0, r0
clz     r0, r0
```
I suspect this is simply a matter of the function not yet having been updated for ARMv7, which is perhaps even excusable given the relatively rare use cases for it.
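
The two instruction sequences correspond to two well-known ways of counting trailing zeros, which can be sketched in C. The function names are made up for illustration, `rbit32` is a portable stand-in for the RBIT instruction, and the code assumes a 32-bit `unsigned int` so that `__builtin_clz` operates on 32 bits.

```c
#include <stdint.h>

/* Portable stand-in for the RBIT instruction: reverse all 32 bits. */
uint32_t rbit32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x & 0x0f0f0f0fu) << 4);
    x = ((x >> 8) & 0x00ff00ffu) | ((x & 0x00ff00ffu) << 8);
    return (x >> 16) | (x << 16);
}

/* rsb/and/clz/rsb: isolate the lowest set bit, then compute 31 - clz. */
int ctz_isolate(uint32_t x)
{
    return 31 - __builtin_clz(x & (0u - x));
}

/* rbit/clz: after reversal the trailing zeros become leading zeros. */
int ctz_rbit(uint32_t x)
{
    return __builtin_clz(rbit32(x));
}
```

Both functions are undefined for a zero argument, as is `__builtin_clz` itself.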

The byte-reversal functions are where it gets shocking. Rather than use the REV instruction found from ARMv6 on, both of them generate external calls to __bswapsi2 and __bswapdi2 in libgcc, which is plain C code:
```
SItype
__bswapsi2 (SItype u)
{
  return ((((u) & 0xff000000) >> 24)
          | (((u) & 0x00ff0000) >>  8)
          | (((u) & 0x0000ff00) <<  8)
          | (((u) & 0x000000ff) << 24));
}

DItype
__bswapdi2 (DItype u)
{
  return ((((u) & 0xff00000000000000ull) >> 56)
          | (((u) & 0x00ff000000000000ull) >> 40)
          | (((u) & 0x0000ff0000000000ull) >> 24)
          | (((u) & 0x000000ff00000000ull) >>  8)
          | (((u) & 0x00000000ff000000ull) <<  8)
          | (((u) & 0x0000000000ff0000ull) << 24)
          | (((u) & 0x000000000000ff00ull) << 40)
          | (((u) & 0x00000000000000ffull) << 56));
}
```
While the 32-bit version compiles to a reasonable-looking shift/mask/or job, the 64-bit one is a real WTF. Brace yourselves:
```
push    {r4, r5, r6, r7, r8, r9, sl, fp}
mov     r5, #0
mov     r6, #65280      ; 0xff00
sub     sp, sp, #40     ; 0x28
and     r7, r0, r5
and     r8, r1, r6
str     r7, [sp, #8]
str     r8, [sp, #12]
mov     r9, #0
mov     r4, r1
and     r5, r0, r9
mov     sl, #255        ; 0xff
ldr     r9, [sp, #8]
and     r6, r4, sl
mov     ip, #16711680   ; 0xff0000
str     r5, [sp, #16]
str     r6, [sp, #20]
lsl     r2, r0, #24
and     ip, ip, r1
lsr     r7, r4, #24
mov     r1, #0
lsr     r5, r9, #24
mov     sl, #0
mov     r9, #-16777216  ; 0xff000000
and     fp, r0, r9
lsr     r6, ip, #8
orr     r9, r7, r1
and     ip, r4, sl
orr     sl, r1, r2
str     r6, [sp]
str     r9, [sp, #32]
str     sl, [sp, #36]   ; 0x24
add     r8, sp, #32
ldm     r8, {r7, r8}
str     r1, [sp, #4]
ldm     sp, {r9, sl}
orr     r7, r7, r9
orr     r8, r8, sl
str     r7, [sp, #32]
str     r8, [sp, #36]   ; 0x24
mov     r3, r0
mov     r7, #16711680   ; 0xff0000
mov     r8, #0
and     r9, r3, r7
and     sl, r4, r8
ldr     r0, [sp, #16]
str     fp, [sp, #24]
str     ip, [sp, #28]
stm     sp, {r9, sl}
ldr     r7, [sp, #20]
ldr     sl, [sp, #12]
ldr     fp, [sp, #12]
ldr     r8, [sp, #28]
lsr     r0, r0, #8
orr     r7, r0, r7, lsl #24
lsr     r6, sl, #24
orr     r5, r5, fp, lsl #8
lsl     sl, r8, #8
mov     fp, r7
add     r8, sp, #32
ldm     r8, {r7, r8}
orr     r6, r6, r8
ldr     r8, [sp, #20]
ldr     r0, [sp, #24]
orr     r5, r5, r7
lsr     r8, r8, #8
orr     sl, sl, r0, lsr #24
mov     ip, r8
ldr     r0, [sp, #4]
orr     fp, fp, r5
ldr     r5, [sp, #24]
orr     ip, ip, r6
ldr     r6, [sp]
lsl     r9, r5, #8
lsl     r8, r0, #24
orr     fp, fp, r9
lsl     r3, r3, #8
orr     r8, r8, r6, lsr #8
orr     ip, ip, sl
lsl     r7, r6, #24
and     r5, r3, #16711680       ; 0xff0000
orr     r7, r7, fp
orr     r8, r8, ip
orr     r4, r1, r7
orr     r5, r5, r8
mov     r9, r6
mov     r1, r5
mov     r0, r4
add     sp, sp, #40     ; 0x28
pop     {r4, r5, r6, r7, r8, r9, sl, fp}
bx      lr
```
That’s right, 91 instructions to move 8 bytes around a bit. GCC definitely has a problem with 64-bit numbers. It is perhaps worth noting that the bswap_64 macro in glibc splits the 64-bit value into 32-bit halves which are then reversed independently, thus side-stepping this weakness of gcc.
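
The half-splitting approach can be sketched as below. This illustrates the idea only and is not glibc's actual macro; the names `swap32` and `swap64_split` are hypothetical.

```c
#include <stdint.h>

/* Plain shift/mask/or byte swap of one 32-bit word. */
uint32_t swap32(uint32_t u)
{
    return ((u & 0xff000000u) >> 24) | ((u & 0x00ff0000u) >>  8)
         | ((u & 0x0000ff00u) <<  8) | ((u & 0x000000ffu) << 24);
}

/* Swap each 32-bit half on its own, then exchange the halves. */
uint64_t swap64_split(uint64_t u)
{
    uint32_t lo = (uint32_t)u;
    uint32_t hi = (uint32_t)(u >> 32);
    return ((uint64_t)swap32(lo) << 32) | swap32(hi);
}
```

Written this way, the compiler only ever has to deal with 32-bit masks and shifts, and the 64-bit value is touched only when it is split and reassembled.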

As a side note, ARM RVCT (armcc) compiles those functions perfectly into one and two REV instructions, respectively.

AVR32

There is not much to report here. The latest gcc version available is 4.2.4, which doesn’t appear to have the bswap functions. The other three are handled nicely, even using a bit-reverse for __builtin_ctz.

MIPS / MIPS64

The situation on MIPS is similar to ARM. Both bswap builtins result in external libgcc calls, the rest giving sensible code.

PowerPC

I scarcely believe my eyes, but this one is actually not bad. The PowerPC has no byte-reversal instructions, yet someone seems to have taken the time to teach gcc a good instruction sequence for this operation. The PowerPC does have some powerful rotate-and-mask instructions which come in handy here. First the 32-bit version:
```
rotlwi  r0,r3,8
rlwimi  r0,r3,24,0,7
rlwimi  r0,r3,24,16,23
mr      r3,r0
blr
```
The 64-bit byte-reversal simply applies the above code on each half of the value:
```
rotlwi  r0,r3,8
rlwimi  r0,r3,24,0,7
rlwimi  r0,r3,24,16,23
rotlwi  r3,r4,8
rlwimi  r3,r4,24,0,7
rlwimi  r3,r4,24,16,23
mr      r4,r0
blr
```
Although I haven’t analysed that code carefully, it looks pretty good.
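
As a sanity check, the 32-bit sequence can be modelled in C. This sketch assumes the usual reading of `rlwimi rA,rS,SH,MB,ME`: rotate rS left by SH bits, then insert the bits selected by the mask MB..ME (numbered from the most significant bit) into rA. The function names are hypothetical.

```c
#include <stdint.h>

/* Rotate left by n bits, 1 <= n <= 31. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* C model of: rotlwi r0,r3,8; rlwimi r0,r3,24,0,7; rlwimi r0,r3,24,16,23 */
uint32_t bswap32_rlwimi(uint32_t x)
{
    uint32_t r = rotl32(x, 8);                   /* rotlwi: bytes 2 1 0 3 */
    uint32_t t = rotl32(x, 24);                  /* rotated source for both inserts */
    r = (r & ~0xff000000u) | (t & 0xff000000u);  /* mask bits 0..7: top byte */
    r = (r & ~0x0000ff00u) | (t & 0x0000ff00u);  /* mask bits 16..23 */
    return r;
}
```

Under that reading, the first rotate already puts two of the four bytes in their final positions, and each insert fixes one of the remaining two.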

PowerPC64

Doing 64-bit operations is easier on a 64-bit CPU, right? For you and me perhaps, but not for gcc. Here __builtin_bswap64 gives us the now familiar __bswapdi2 call, and while not as bad as the ARM version, it is not pretty:
```
rldicr  r0,r3,8,55
rldicr  r10,r3,56,7
rldicr  r0,r0,56,15
rldicl  r11,r3,8,56
rldicr  r9,r3,16,47
or      r11,r10,r11
rldicr  r9,r9,48,23
rldicl  r10,r0,24,40
rldicr  r0,r3,24,39
or      r11,r11,r10
rldicl  r9,r9,40,24
rldicr  r0,r0,40,31
or      r9,r11,r9
rlwinm  r10,r3,0,0,7
rldicl  r0,r0,56,8
or      r0,r9,r0
rldicr  r10,r10,8,55
rlwinm  r11,r3,0,8,15
or      r0,r0,r10
rldicr  r11,r11,24,39
rlwinm  r3,r3,0,16,23
or      r0,r0,r11
rldicr  r3,r3,40,23
or      r3,r0,r3
blr
```
That is 6 times longer than the (presumably) hand-written 32-bit version.

x86 / x86_64

As one might expect, results on x86 are good. All the tested functions use the available special instructions. One word of caution though: the bit-counting instructions are very slow on some implementations, specifically the Atom, AMD chips, and the notoriously slow Pentium4E.

Conclusion

In conclusion, I would say gcc builtins can be useful to avoid fragile inline assembler. Before using them, however, one should make sure they are not in fact harmful on the required targets. Not even those builtins mapping directly to CPU instructions can be trusted.

ARM compiler shoot-out, round 2

20 August 2009, by Mans - ARM, Compilers

In my recent test of ARM compilers, I had to leave out Texas Instruments’ compiler since it failed to build FFmpeg. Since then, the TI compiler team has been busy fixing bugs, and a snapshot I was given to test was able to build enough of a somewhat patched FFmpeg that I can now present round two in this shoot-out.

The contenders this time were the fastest GCC variant from round one, ARM RVCT, and newcomer TI TMS470. With the same rules as last time, the exact versions and optimisation options were like this:

- CodeSourcery GCC 2009q1 (based on 4.3.3), -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros -fno-tree-vectorize
- ARM RVCT 4.0 Build 591, -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a8 -std=c99 -fomit-frame-pointer -O3 -fno-math-errno -fno-signed-zeros
- TI TMS470 4.7.0-a9229, --float_support=vfpv3 -mv=7a8 -O3 -mf=5

To keep things fair, I also left the vectoriser off with the TI compiler. The table below lists the decoding times for the sample files, this time normalised against the participating GCC compiler. Remember, smaller numbers are better. Also keep in mind that this test was done with a development snapshot of TMS470, not an approved release.

| Sample name | Codec | Code type | GCC | RVCT | TI |
|---|---|---|---|---|---|
| cathedral | H.264 CABAC | integer | 1.00 | 0.95 | 1.02 |
| NeroAVC | H.264 CABAC | integer | 1.00 | 0.96 | 1.05 |
| indiana_jones_4 | H.264 CAVLC | integer | 1.00 | 0.92 | 1.02 |
| NeroRecodeSample | MPEG-4 ASP | integer | 1.00 | 1.01 | 1.08 |
| Silent_Light | MP3 | 64-bit integer | 1.00 | 0.48 | 0.72 |
| When_I_Grow_Up | FLAC | integer | 1.00 | 0.87 | 0.93 |
| Lumme-Badloop | Vorbis | float | 1.00 | 0.94 | 1.05 |
| Canyon | AC-3 | float | 1.00 | 0.88 | 1.01 |
| lotr | DTS | float | 1.00 | 1.00 | 1.08 |

Overall, the TI TMS470 compiler comes off slightly worse than GCC. In two cases, however, it was significantly better than GCC, but not as good as RVCT. Incidentally, those were also the ones where RVCT scored the biggest win over GCC.

My conclusions from this test are twofold:

- ARM’s own compiler is very hard to beat. They do seem to know how their chips work.
- GCC is incredibly bad at 64-bit arithmetic on 32-bit machines.
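
To make the second point concrete, here is the kind of 64-bit operation that commonly appears in fixed-point audio code. It is an assumed example for illustration, not code taken from FFmpeg or from this test, and the name `mul_high` is hypothetical.

```c
#include <stdint.h>

/* High 32 bits of a signed 32x32 -> 64-bit product.
 * Right-shifting a negative value is implementation-defined in C;
 * common compilers use an arithmetic shift. */
int32_t mul_high(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}
```

On ARM the whole function can in principle be a single SMULL, with the result taken from the high destination register, so any extra moves or stack traffic around the 64-bit intermediate are pure overhead.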

The logical next step is to test these compilers with vectorisation enabled. FFmpeg should offer plenty of opportunities for this feature to shine. Unfortunately, that test will have to wait until the RVCT vectoriser is fixed. The current release does not compile FFmpeg with vectorisation enabled.
