git.libav.org Git - libav.git/rss log

Libav master git repository

http://git.libav.org/?p=libav.git;a=summary

Articles published on the site

  • aarch64: vp9: use alternative returns in the core loop filter function

    14 November 2016, by Janne Grunau
    aarch64: vp9: use alternative returns in the core loop filter function
    
    Since aarch64 has enough free general purpose registers use them to
    branch to the appropriate storage code. 1-2 cycles faster for the
    functions using loop_filter 8/16, ... on a cortex-a53. Mixed results
    (up to 2 cycles faster/slower) on a cortex-a57.
    
    • [DBH] libavcodec/aarch64/vp9lpf_neon.S
  • libschroedingerdec: don't produce empty frames

    13 November 2016, by Andreas Cadhalpun
    libschroedingerdec: don't produce empty frames
    
    They are not valid and can cause problems/crashes for API users.
    
    Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com>
    
    • [DBH] libavcodec/libschroedingerdec.c
  • aarch64: vp9: Implement NEON loop filters

    13 November 2016, by Martin Storsjö
    aarch64: vp9: Implement NEON loop filters
    
    This work is sponsored by, and copyright, Google.
    
    These are ported from the ARM version; thanks to the larger
    number of registers available, we can do the loop filters with
    16 pixels at a time. The implementation is fully templated, with
    a single macro which can generate versions both 8 and
    16 pixels wide, for 4, 8 and 16 pixel loop filters
    (and the 4/8 mixed versions as well).
    
    For the 8 pixel wide versions, it is pretty close in speed (the
    v_4_8 and v_8_8 filters are the best examples of this; the h_4_8
    and h_8_8 filters seem to get some gain in the load/transpose/store
    part). For the 16 pixels wide ones, we get a speedup of around
    1.2-1.4x compared to the 32 bit version.
    
    Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                           ARM AArch64
    vp9_loop_filter_h_4_8_neon:          144.0   127.2
    vp9_loop_filter_h_8_8_neon:          207.0   182.5
    vp9_loop_filter_h_16_8_neon:         415.0   328.7
    vp9_loop_filter_h_16_16_neon:        672.0   558.6
    vp9_loop_filter_mix2_h_44_16_neon:   302.0   203.5
    vp9_loop_filter_mix2_h_48_16_neon:   365.0   305.2
    vp9_loop_filter_mix2_h_84_16_neon:   365.0   305.2
    vp9_loop_filter_mix2_h_88_16_neon:   376.0   305.2
    vp9_loop_filter_mix2_v_44_16_neon:   193.2   128.2
    vp9_loop_filter_mix2_v_48_16_neon:   246.7   218.4
    vp9_loop_filter_mix2_v_84_16_neon:   248.0   218.5
    vp9_loop_filter_mix2_v_88_16_neon:   302.0   218.2
    vp9_loop_filter_v_4_8_neon:           89.0    88.7
    vp9_loop_filter_v_8_8_neon:          141.0   137.7
    vp9_loop_filter_v_16_8_neon:         295.0   272.7
    vp9_loop_filter_v_16_16_neon:        546.0   453.7
    
    The speedup vs C code in checkasm tests is around 2-7x, which is
    pretty much the same as for the 32 bit version. Even if these functions
    are faster than their 32 bit equivalent, the C version that we compare
    to also became around 1.3-1.7x faster than the C version in 32 bit.
    
    Based on START_TIMER/STOP_TIMER wrapping around a few individual
    functions, the speedup vs C code is around 4-5x.
    
    Examples of runtimes vs C on a Cortex A57 (for a slightly older version
    of the patch):
                             A57 gcc-5.3  neon
    loop_filter_h_4_8_neon:        256.6  93.4
    loop_filter_h_8_8_neon:        307.3 139.1
    loop_filter_h_16_8_neon:       340.1 254.1
    loop_filter_h_16_16_neon:      827.0 407.9
    loop_filter_mix2_h_44_16_neon: 524.5 155.4
    loop_filter_mix2_h_48_16_neon: 644.5 173.3
    loop_filter_mix2_h_84_16_neon: 630.5 222.0
    loop_filter_mix2_h_88_16_neon: 697.3 222.0
    loop_filter_mix2_v_44_16_neon: 598.5 100.6
    loop_filter_mix2_v_48_16_neon: 651.5 127.0
    loop_filter_mix2_v_84_16_neon: 591.5 167.1
    loop_filter_mix2_v_88_16_neon: 855.1 166.7
    loop_filter_v_4_8_neon:        271.7  65.3
    loop_filter_v_8_8_neon:        312.5 106.9
    loop_filter_v_16_8_neon:       473.3 206.5
    loop_filter_v_16_16_neon:      976.1 327.8
    
    The speed-up compared to the C functions is 2.5 to 6 and the cortex-a57
    is again 30-50% faster than the cortex-a53.
    
    Signed-off-by: Martin Storsjö <martin@martin.st>
    
    • [DBH] libavcodec/aarch64/Makefile
    • [DBH] libavcodec/aarch64/vp9dsp_init_aarch64.c
    • [DBH] libavcodec/aarch64/vp9lpf_neon.S
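
The speedup figures quoted in the commit message above can be recomputed from the Cortex A53 table. The following is a minimal sketch, not part of the commit: the dictionary `A53_CYCLES` and the helper `speedup` are names made up for this illustration, and only a handful of rows from the table are copied in.

```python
# Cycle counts copied from the Cortex A53 table in the commit message:
# function name -> (32 bit ARM NEON, AArch64 NEON)
A53_CYCLES = {
    "vp9_loop_filter_h_4_8_neon": (144.0, 127.2),
    "vp9_loop_filter_v_4_8_neon": (89.0, 88.7),
    "vp9_loop_filter_h_16_16_neon": (672.0, 558.6),
    "vp9_loop_filter_v_16_16_neon": (546.0, 453.7),
    "vp9_loop_filter_mix2_v_88_16_neon": (302.0, 218.2),
}


def speedup(arm_cycles, aarch64_cycles):
    """Return how many times faster the AArch64 version runs."""
    return arm_cycles / aarch64_cycles


if __name__ == "__main__":
    for name, (arm, a64) in A53_CYCLES.items():
        print(f"{name}: {speedup(arm, a64):.2f}x")
```

The 8 pixel wide `v_4_8` row comes out very close to 1x, while the 16 pixel wide rows land in the range the message describes.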
  • aarch64: vp9: Add NEON itxfm routines

    13 November 2016, by Martin Storsjö
    aarch64: vp9: Add NEON itxfm routines
    
    This work is sponsored by, and copyright, Google.
    
    These are ported from the ARM version; thanks to the larger
    number of registers available, we can do the 16x16 and 32x32
    transforms in slices 8 pixels wide instead of 4. This gives
    a speedup of around 1.4x compared to the 32 bit version.
    
    The fact that aarch64 doesn't have the same d/q register
    aliasing makes some of the macros quite a bit simpler as well.
    
    Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                           ARM  AArch64
    vp9_inv_adst_adst_4x4_add_neon:       90.0     87.7
    vp9_inv_adst_adst_8x8_add_neon:      400.0    354.7
    vp9_inv_adst_adst_16x16_add_neon:   2526.5   1827.2
    vp9_inv_dct_dct_4x4_add_neon:         74.0     72.7
    vp9_inv_dct_dct_8x8_add_neon:        271.0    256.7
    vp9_inv_dct_dct_16x16_add_neon:     1960.7   1372.7
    vp9_inv_dct_dct_32x32_add_neon:    11988.9   8088.3
    vp9_inv_wht_wht_4x4_add_neon:         63.0     57.7
    
    The speedup vs C code (2-4x) is smaller than in the 32 bit case,
    mostly because the C code ends up significantly faster (around
    1.6x faster, with GCC 5.4) when built for aarch64.
    
    Examples of runtimes vs C on a Cortex A57 (for a slightly older version
    of the patch):
                                    A57 gcc-5.3   neon
    vp9_inv_adst_adst_4x4_add_neon:       152.2   60.0
    vp9_inv_adst_adst_8x8_add_neon:       948.2  288.0
    vp9_inv_adst_adst_16x16_add_neon:    4830.4 1380.5
    vp9_inv_dct_dct_4x4_add_neon:         153.0   58.6
    vp9_inv_dct_dct_8x8_add_neon:         789.2  180.2
    vp9_inv_dct_dct_16x16_add_neon:      3639.6  917.1
    vp9_inv_dct_dct_32x32_add_neon:     20462.1 4985.0
    vp9_inv_wht_wht_4x4_add_neon:          91.0   49.8
    
    The asm is around 3-4x faster than C on the cortex-a57, and the asm
    is around 30-50% faster on the a57 compared to the a53.
    
    Signed-off-by: Martin Storsjö <martin@martin.st>
    
    • [DBH] libavcodec/aarch64/Makefile
    • [DBH] libavcodec/aarch64/vp9dsp_init_aarch64.c
    • [DBH] libavcodec/aarch64/vp9itxfm_neon.S
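
To illustrate what "slices 8 pixels wide instead of 4" means for the column pass of a square transform, here is a rough sketch. It is not the NEON code: `column_slices`, `transform_columns` and the placeholder `running_sum` are hypothetical names, and the running sum merely stands in for a real 1-D inverse transform.

```python
from itertools import accumulate


def column_slices(block_width, slice_width):
    """Return (start, end) column ranges that cover the block in slices."""
    return [(x, min(x + slice_width, block_width))
            for x in range(0, block_width, slice_width)]


def transform_columns(block, slice_width, transform_1d):
    """Apply transform_1d to every column of a square block, slice by slice.

    Returns the transformed block and the number of slice passes needed.
    """
    n = len(block)
    out = [[0] * n for _ in range(n)]
    passes = 0
    for start, end in column_slices(n, slice_width):
        passes += 1  # one invocation of the (hypothetical) slice routine
        for x in range(start, end):
            column = [block[y][x] for y in range(n)]
            for y, value in enumerate(transform_1d(column)):
                out[y][x] = value
    return out, passes


def running_sum(column):
    """Placeholder 1-D transform; NOT the VP9 inverse DCT/ADST."""
    return list(accumulate(column))


if __name__ == "__main__":
    block = [[(3 * y + x) % 7 for x in range(16)] for y in range(16)]
    out8, passes8 = transform_columns(block, 8, running_sum)
    out4, passes4 = transform_columns(block, 4, running_sum)
    print(passes8, passes4, out8 == out4)
```

The result does not depend on the slice width; only the number of passes over the block changes, which is where wider slices save work.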
  • vp9: split superframes in the filtering stage before actual decoding

    13 November 2016, by Anton Khirnov
    vp9: split superframes in the filtering stage before actual decoding
    
    Significantly increases the efficiency of frame threading, since
    individual frames in a superframe can now be decoded in parallel.
    
    • [DBH] configure
    • [DBH] libavcodec/vp9.c
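
As a rough illustration of what splitting a superframe involves, the sketch below parses the superframe index that sits at the end of a VP9 packet, following the layout described in the VP9 bitstream specification (a marker byte with the bit pattern 110 in its top three bits, repeated at both ends of the index, with little-endian frame sizes in between). This is not the libav implementation; `split_superframe` is a hypothetical helper, and a real splitter needs more validation than shown here.

```python
from typing import List


def split_superframe(packet: bytes) -> List[bytes]:
    """Split a VP9 packet into its frames, or return it whole if it
    does not end in a valid superframe index."""
    if not packet:
        return []
    marker = packet[-1]
    # The top three bits of the marker byte must be 0b110.
    if (marker & 0xE0) != 0xC0:
        return [packet]
    n_frames = (marker & 0x07) + 1           # low 3 bits: frame count - 1
    size_bytes = ((marker >> 3) & 0x03) + 1  # next 2 bits: bytes per size - 1
    index_size = 2 + n_frames * size_bytes   # two marker bytes plus the sizes
    # The same marker byte must also open the index.
    if len(packet) < index_size or packet[-index_size] != marker:
        return [packet]

    data_end = len(packet) - index_size      # frame data stops where the index starts
    idx = data_end + 1                       # first size field, after the marker
    pos = 0
    frames = []
    for _ in range(n_frames):
        size = int.from_bytes(packet[idx:idx + size_bytes], "little")
        idx += size_bytes
        if pos + size > data_end:
            raise ValueError("frame size exceeds packet payload")
        frames.append(packet[pos:pos + size])
        pos += size
    return frames


if __name__ == "__main__":
    # Two made-up frames of 3 and 2 bytes, one byte per size field.
    demo = b"\x01\x02\x03" + b"\x04\x05" + bytes([0xC1, 3, 2, 0xC1])
    print(split_superframe(demo))
```

Once the frames are separated like this ahead of the decoder, each one can be handed to a different worker, which is what makes the frame threading gain described in the commit message possible.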