"Optimized implementations"?


"Optimized implementations"?

Hollman, Daisy Sophia via cfe-dev
<https://compiler-rt.llvm.org/index.html> boasts:

| The builtins library provides optimized implementations of this
| and other low-level routines, either in target-independent C form,
| or as a heavily-optimized assembly.

Really?

Left: underperforming code shipped    # Right: slightly improved code,
      in clang_rt.builtins-*          #        which the optimiser REALLY
                                      #        should have generated

___cmpdi2:
        mov     ecx, [esp+16]         #       mov     ecx, [esp+16]
        xor     eax, eax              #       xor     eax, eax
        cmp     [esp+8], ecx          #       cmp     ecx, [esp+8]
        jl      @f                    #       jg      @f
        mov     eax, 2                #       mov     eax, 2
        jg      @f                    #       jl      @f
        mov     ecx, [esp+4]          #
        mov     edx, [esp+12]         #       mov     ecx, [esp+12]
        mov     eax, 0                #       xor     eax, eax
        cmp     ecx, edx              #       cmp     ecx, [esp+4]
        jb      @f                    #       ja      @f
        cmp     edx, ecx              #
        mov     eax, 1                #
        adc     eax, 0                #       adc     eax, 1
@@:                                   # @@:
        ret                           #       ret

                                      # 3 fewer instructions, 10 bytes saved

___ucmpdi2:
        mov     ecx, [esp+16]         #       mov     ecx, [esp+16]
        xor     eax, eax              #       xor     eax, eax
        cmp     [esp+8], ecx          #       cmp     ecx, [esp+8]
        jb      @f                    #       ja      @f
        mov     eax, 2                #       mov     eax, 2
        ja      @f                    #       jb      @f
        mov     ecx, [esp+4]          #
        mov     edx, [esp+12]         #       mov     ecx, [esp+12]
        mov     eax, 0                #       xor     eax, eax
        cmp     ecx, edx              #       cmp     ecx, [esp+4]
        jb      @f                    #       ja      @f
        cmp     edx, ecx              #
        mov     eax, 1                #
        adc     eax, 0                #       adc     eax, 1
@@:                                   # @@:
        ret                           #       ret

                                      # 3 fewer instructions, 10 bytes saved


Now properly written code: branch-free, faster, and shorter:

# Copyright (C) 2004-2020, Stefan Kanthak <[hidden email]>

___cmpdi2:
        mov     ecx, [esp+4]
        mov     edx, [esp+12]
        cmp     ecx, edx
        mov     eax, [esp+8]
        sbb     eax, [esp+16]
        setl    ah
        cmp     edx, ecx
        mov     edx, [esp+16]
        sbb     edx, [esp+8]
        setl    al
        sub     al, ah
        movsx   eax, al
        inc     eax
        ret

___ucmpdi2:
        mov     ecx, [esp+4]
        mov     edx, [esp+12]
        cmp     ecx, edx
        mov     eax, [esp+8]
        sbb     eax, [esp+16]
        sbb     eax, eax
        cmp     edx, ecx
        mov     edx, [esp+16]
        sbb     edx, [esp+8]
        adc     eax, 1
        ret

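For reference, the contract these routines implement can be stated in a few lines of C; any ___cmpdi2/___ucmpdi2 implementation must agree with this on all inputs. The helper names below are illustrative, a validation sketch rather than the shipped code:

```c
#include <stdint.h>

/* Reference semantics of __cmpdi2/__ucmpdi2 (libgcc convention):
   return 0 if a < b, 1 if a == b, 2 if a > b.
   Helper names are illustrative, not the shipped implementations. */
int ref_cmpdi2(int64_t a, int64_t b) {
    return (a > b) - (a < b) + 1;   /* signed 64-bit compare */
}

int ref_ucmpdi2(uint64_t a, uint64_t b) {
    return (a > b) - (a < b) + 1;   /* unsigned 64-bit compare */
}
```

Checking the assembly above against this three-valued result (0/1/2) is how the branch-free versions were verified.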

AGAIN:
Remove every occurrence of the word "optimized" from the above web page.

'nuff said
Stefan
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: "Optimized implementations"?

Hollman, Daisy Sophia via cfe-dev
Clang never generates calls to ___paritysi2, ___paritydi2, ___cmpdi2, or ___ucmpdi2 on X86, so it's not clear the performance of these matters at all.

~Craig



Re: "Optimized implementations"?

Hollman, Daisy Sophia via cfe-dev
"Craig Topper" <[hidden email]> wrote:

> Clang never generates calls to ___paritysi2, ___paritydi2, ___cmpdi2, or
> ___ucmpdi2 on X86 so its not clear the performance of this matters at all.

So you can safely remove them for X86 and all the other targets where such
unoptimized code is never called!
But fix these routines for targets where they are called.

The statement does NOT make any exceptions, and it does not say
| ships unoptimized routines the compiler never calls
but
| optimized target-independent implementations

Stefan

BTW: do builtins like __builtin_*parity* exist?
     If yes: do they generate the same bad code?


Re: "Optimized implementations"?

Hollman, Daisy Sophia via cfe-dev
__builtin_parity uses setnp on older x86 and popcnt with SSE4.2.
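
As a quick sanity check of the builtin's semantics (independent of whether it lowers to setnp or popcnt, which depends on the -O level and -march), a minimal sketch:

```c
/* __builtin_parity is a GCC/Clang extension: it returns 1 when its
   argument has an odd number of set bits, 0 otherwise. This checks
   only the semantics, not the setnp/popcnt code generation. */
unsigned parity_of(unsigned x) {
    return __builtin_parity(x);
}
```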

--
~Craig


Re: "Optimized implementations"?

Hollman, Daisy Sophia via cfe-dev
Turn on optimizations.

~Craig


On Wed, Sep 9, 2020 at 4:28 AM Stefan Kanthak <[hidden email]> wrote:
"Craig Topper" <[hidden email]> wrote:

> __builtin_parity uses setnp on older x86 and popcnt with sse4.2

Reality check, PLEASE:

--- bug.c ---
int main(int argc, char *argv[]) {
    return __builtin_parity(argc);
}
--- EOF ---

clang -o- -target i386-pc-linux -S bug.c
clang version 10.0.0
Target: i386-pc-linux

        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        movl    8(%ebp), %eax
        movl    $0, -4(%ebp)
        movl    8(%ebp), %ecx
        movl    %ecx, %edx
        shrl    %edx
        andl    $1431655765, %edx       # imm = 0x55555555
        subl    %edx, %ecx
        movl    %ecx, %edx
        andl    $858993459, %edx        # imm = 0x33333333
        shrl    $2, %ecx
        andl    $858993459, %ecx        # imm = 0x33333333
        addl    %ecx, %edx
        movl    %edx, %ecx
        shrl    $4, %ecx
        addl    %ecx, %edx
        andl    $252645135, %edx        # imm = 0xF0F0F0F
        imull   $16843009, %edx, %ecx   # imm = 0x1010101
        shrl    $24, %ecx
        andl    $1, %ecx
        movl    %eax, -8(%ebp)          # 4-byte Spill
        movl    %ecx, %eax
        addl    $8, %esp
        popl    %ebp
        retl


clang -o- -target amd64-pc-linux -S bug.c

        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movl    $0, -4(%rbp)
        movl    %edi, -8(%rbp)
        movl    -8(%rbp), %eax
        movl    %eax, %ecx
        shrl    %ecx
        andl    $1431655765, %ecx       # imm = 0x55555555
        subl    %ecx, %eax
        movl    %eax, %ecx
        andl    $858993459, %ecx        # imm = 0x33333333
        shrl    $2, %eax
        andl    $858993459, %eax        # imm = 0x33333333
        addl    %eax, %ecx
        movl    %ecx, %eax
        shrl    $4, %eax
        addl    %eax, %ecx
        andl    $252645135, %ecx        # imm = 0xF0F0F0F
        imull   $16843009, %ecx, %eax   # imm = 0x1010101
        shrl    $24, %eax
        andl    $1, %eax
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq

JFTR: this is the same unoptimised code as shipped in the builtins library!

Stefan



Re: "Optimized implementations"?

Hollman, Daisy Sophia via cfe-dev
The -O0 code isn't the same as the builtins library; it's worse.

~Craig


On Wed, Sep 9, 2020 at 8:50 AM Craig Topper <[hidden email]> wrote:
Turn on optimizations.

~Craig


On Wed, Sep 9, 2020 at 4:28 AM Stefan Kanthak <[hidden email]> wrote:
"Craig Topper" <[hidden email]> wrote:

> __builtin_parity uses setnp on older x86 and popcnt with sse4.2

Reality check, PLEASE:

--- bug.c ---
int main(int argc, char *argv[]) {
    return __builtin_parity(argc);
}
--- EOF ---

clang -o- -target i386-pc-linux -S bug.c
clang version 10.0.0
Target: i386-pc-linux

        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        movl    8(%ebp), %eax
        movl    $0, -4(%ebp)
        movl    8(%ebp), %ecx
        movl    %ecx, %edx
        shrl    %edx
        andl    $1431655765, %edx       # imm = 0x55555555
        subl    %edx, %ecx
        movl    %ecx, %edx
        andl    $858993459, %edx        # imm = 0x33333333
        shrl    $2, %ecx
        andl    $858993459, %ecx        # imm = 0x33333333
        addl    %ecx, %edx
        movl    %edx, %ecx
        shrl    $4, %ecx
        addl    %ecx, %edx
        andl    $252645135, %edx        # imm = 0xF0F0F0F
        imull   $16843009, %edx, %ecx   # imm = 0x1010101
        shrl    $24, %ecx
        andl    $1, %ecx
        movl    %eax, -8(%ebp)          # 4-byte Spill
        movl    %ecx, %eax
        addl    $8, %esp
        popl    %ebp
        retl


clang -o- -target amd64-pc-linux -S bug.c

        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movl    $0, -4(%rbp)
        movl    %edi, -8(%rbp)
        movl    -8(%rbp), %eax
        movl    %eax, %ecx
        shrl    %ecx
        andl    $1431655765, %ecx       # imm = 0x55555555
        subl    %ecx, %eax
        movl    %eax, %ecx
        andl    $858993459, %ecx        # imm = 0x33333333
        shrl    $2, %eax
        andl    $858993459, %eax        # imm = 0x33333333
        addl    %eax, %ecx
        movl    %ecx, %eax
        shrl    $4, %eax
        addl    %eax, %ecx
        andl    $252645135, %ecx        # imm = 0xF0F0F0F
        imull   $16843009, %ecx, %eax   # imm = 0x1010101
        shrl    $24, %eax
        andl    $1, %eax
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq

JFTR: this is the same unoptimised code as shipped in the builtins library!

Stefan

> On Sun, Sep 6, 2020 at 1:32 PM Stefan Kanthak <[hidden email]>
> wrote:
>
>> "Craig Topper" <[hidden email]> wrote;
>>
>>
>>
>> > Clang never generates calls to ___paritysi2, ___paritydi2, ___cmpdi2, or
>>
>> > ___ucmpdi2 on X86 so its not clear the performance of this matters at
>> all.
>>
>>
>>
>> So you can safely remove them for X86 and all the other targets where such
>>
>> unoptimized code is never called!
>>
>> But fix these routines for targets where they are called.
>>
>>
>>
>> The statement does NOT make any exceptions, and it does not say
>>
>> | ships unoptimized routines the compiler never calls
>>
>> but
>>
>> | optimized target-independent implementations
>>
>>
>>
>> Stefan
>>
>>
>>
>> BTW: do builtins like __builtin_*parity* exist?
>>
>>      If yes: do they generate the same bad code?
>>
>>
>>
>> > On Sun, Sep 6, 2020 at 12:31 PM Stefan Kanthak via cfe-dev <
>>
>> > [hidden email]> wrote:
>>
>> >
>>
>> >> <https://compiler-rt.llvm.org/index.html> boasts:
>>
>> >>
>>
>> >> | The builtins library provides optimized implementations of this
>>
>> >> | and other low-level routines, either in target-independent C form,
>>
>> >> | or as a heavily-optimized assembly.
>>
>> >>
>>
>> >> Really?
>>
>> >>
>>
>> >> Left: inperformant code shipped in    # Right: slightly improved code,
>>
>> >>       clang_rt.builtins-*             #        which the optimiser
>> REALLY
>>
>> >>                                       #        should have generated
>>
>> >>
>>
>> >> ___cmpdi2:
>>
>> >>         mov     ecx, [esp+16]         #       mov     ecx, [esp+16]
>>
>> >>         xor     eax, eax              #       xor     eax, eax
>>
>> >>         cmp     [esp+8], ecx          #       cmp     ecx, [esp+8]
>>
>> >>         jl      @f                    #       jg      @f
>>
>> >>         mov     eax, 2                #       mov     eax, 2
>>
>> >>         jg      @f                    #       jl      @f
>>
>> >>         mov     ecx, [esp+4]          #
>>
>> >>         mov     edx, [esp+12]         #       mov     ecx, [esp+12]
>>
>> >>         mov     eax, 0                #       xor     eax, eax
>>
>> >>         cmp     ecx, edx              #       cmp     ecx, [esp+4]
>>
>> >>         jb      @f                    #       ja      @f
>>
>> >>         cmp     edx, ecx              #
>> >>         mov     eax, 1                #
>> >>         adc     eax, 0                #       adc     eax, 1
>> >> @@:                                   # @@:
>> >>         ret                           #       ret
>> >>
>> >>                                       # 3 instructions less, 10 bytes saved
>> >>
>> >> ___ucmpdi2:
>> >>         mov     ecx, [esp+16]         #       mov     ecx, [esp+16]
>> >>         xor     eax, eax              #       xor     eax, eax
>> >>         cmp     [esp+8], ecx          #       cmp     ecx, [esp+8]
>> >>         jb      @f                    #       ja      @f
>> >>         mov     eax, 2                #       mov     eax, 2
>> >>         ja      @f                    #       jb      @f
>> >>         mov     ecx, [esp+4]          #
>> >>         mov     edx, [esp+12]         #       mov     ecx, [esp+12]
>> >>         mov     eax, 0                #       xor     eax, eax
>> >>         cmp     ecx, edx              #       cmp     ecx, [esp+4]
>> >>         jb      @f                    #       ja      @f
>> >>         cmp     edx, ecx              #
>> >>         mov     eax, 1                #
>> >>         adc     eax, 0                #       adc     eax, 1
>> >> @@:                                   # @@:
>> >>         ret                           #       ret
>> >>
>> >>                                       # 3 instructions less, 10 bytes saved
>> >>
>> >> Now properly written code, of course branch-free, faster and shorter:
>> >>
>> >> # Copyright (C) 2004-2020, Stefan Kanthak <[hidden email]>
>> >>
>> >> ___cmpdi2:
>> >>         mov     ecx, [esp+4]
>> >>         mov     edx, [esp+12]
>> >>         cmp     ecx, edx
>> >>         mov     eax, [esp+8]
>> >>         sbb     eax, [esp+16]
>> >>         setl    ah
>> >>         cmp     edx, ecx
>> >>         mov     edx, [esp+16]
>> >>         sbb     edx, [esp+8]
>> >>         setl    al
>> >>         sub     al, ah
>> >>         movsx   eax, al
>> >>         inc     eax
>> >>         ret
>> >>
>> >> ___ucmpdi2:
>> >>         mov     ecx, [esp+4]
>> >>         mov     edx, [esp+12]
>> >>         cmp     ecx, edx
>> >>         mov     eax, [esp+8]
>> >>         sbb     eax, [esp+16]
>> >>         sbb     eax, eax
>> >>         cmp     edx, ecx
>> >>         mov     edx, [esp+16]
>> >>         sbb     edx, [esp+8]
>> >>         adc     eax, 1
>> >>         ret
>> >>
>> >> AGAIN:
>> >> Remove every occurrence of the word "optimized" on the above web page.
>> >>
>> >> 'nuff said
>> >> Stefan
>> >>
>> >> _______________________________________________
>> >> cfe-dev mailing list
>> >> [hidden email]
>> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>
> --
> ~Craig
>

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: "Optimized implementations"?

Hollman, Daisy Sophia via cfe-dev
In reply to this post by Hollman, Daisy Sophia via cfe-dev
The code in the builtins library is written in C and can't use __builtin_parity, since it needs to compile with compilers, such as MSVC, that don't provide that builtin. To get the optimized code we would have to ship a version written in assembly for X86, and probably two assembly versions, since MASM and gcc use different assembly syntax. That would mean 3 different versions of parity, for each of 3 different bit widths. Maybe some macros could let us share code across bit widths. Ultimately, though, it was a question of where to spend effort.

~Craig


On Wed, Sep 9, 2020 at 9:08 AM Stefan Kanthak <[hidden email]> wrote:
"Craig Topper" <[hidden email]> wrote:

> Turn on optimizations.

What's the (deeper) reason NOT always to generate the optimised code for
__builtins_*?
Alternative question: do you also ship a library with unoptimised code which
is linked when the -O option is not given?

Just wondering about surprising behaviour
Stefan

> On Wed, Sep 9, 2020 at 4:28 AM Stefan Kanthak <[hidden email]>
> wrote:
>
>> "Craig Topper" <[hidden email]> wrote:
>>
>> > __builtin_parity uses setnp on older x86 and popcnt with sse4.2
>>
>> Reality check, PLEASE:
>>
>> --- bug.c ---
>> int main(int argc, char *argv[]) {
>>     return __builtin_parity(argc);
>> }
>> --- EOF ---
>>
>> clang -o- -target i386-pc-linux -S bug.c
>> clang version 10.0.0
>> Target: i386-pc-linux
>>
>>         pushl   %ebp
>>         movl    %esp, %ebp
>>         subl    $8, %esp
>>         movl    8(%ebp), %eax
>>         movl    $0, -4(%ebp)
>>         movl    8(%ebp), %ecx
>>         movl    %ecx, %edx
>>         shrl    %edx
>>         andl    $1431655765, %edx       # imm = 0x55555555
>>         subl    %edx, %ecx
>>         movl    %ecx, %edx
>>         andl    $858993459, %edx        # imm = 0x33333333
>>         shrl    $2, %ecx
>>         andl    $858993459, %ecx        # imm = 0x33333333
>>         addl    %ecx, %edx
>>         movl    %edx, %ecx
>>         shrl    $4, %ecx
>>         addl    %ecx, %edx
>>         andl    $252645135, %edx        # imm = 0xF0F0F0F
>>         imull   $16843009, %edx, %ecx   # imm = 0x1010101
>>         shrl    $24, %ecx
>>         andl    $1, %ecx
>>         movl    %eax, -8(%ebp)          # 4-byte Spill
>>         movl    %ecx, %eax
>>         addl    $8, %esp
>>         popl    %ebp
>>         retl
>>
>>
>> clang -o- -target amd64-pc-linux -S bug.c
>>
>>         pushq   %rbp
>>         .cfi_def_cfa_offset 16
>>         .cfi_offset %rbp, -16
>>         movq    %rsp, %rbp
>>         .cfi_def_cfa_register %rbp
>>         movl    $0, -4(%rbp)
>>         movl    %edi, -8(%rbp)
>>         movl    -8(%rbp), %eax
>>         movl    %eax, %ecx
>>         shrl    %ecx
>>         andl    $1431655765, %ecx       # imm = 0x55555555
>>         subl    %ecx, %eax
>>         movl    %eax, %ecx
>>         andl    $858993459, %ecx        # imm = 0x33333333
>>         shrl    $2, %eax
>>         andl    $858993459, %eax        # imm = 0x33333333
>>         addl    %eax, %ecx
>>         movl    %ecx, %eax
>>         shrl    $4, %eax
>>         addl    %eax, %ecx
>>         andl    $252645135, %ecx        # imm = 0xF0F0F0F
>>         imull   $16843009, %ecx, %eax   # imm = 0x1010101
>>         shrl    $24, %eax
>>         andl    $1, %eax
>>         popq    %rbp
>>         .cfi_def_cfa %rsp, 8
>>         retq
>>
>> JFTR: this is the same unoptimised code as shipped in the builtins library!
>>
>> Stefan
>>
>> > On Sun, Sep 6, 2020 at 1:32 PM Stefan Kanthak <[hidden email]>
>> > wrote:
>> >
>> >> "Craig Topper" <[hidden email]> wrote;
>> >>
>> >> > Clang never generates calls to ___paritysi2, ___paritydi2, ___cmpdi2,
>> >> > or ___ucmpdi2 on X86, so it's not clear the performance of this
>> >> > matters at all.
>> >>
>> >> So you can safely remove them for X86 and all the other targets where
>> >> such unoptimized code is never called!
>> >> But fix these routines for targets where they are called.
>> >>
>> >> The statement does NOT make any exceptions, and it does not say
>> >> | ships unoptimized routines the compiler never calls
>> >> but
>> >> | optimized target-independent implementations
>> >>
>> >> Stefan
>> >>
>> >> BTW: do builtins like __builtin_*parity* exist?
>> >>      If yes: do they generate the same bad code?
>> >>
>>


Re: "Optimized implementations"?

Hollman, Daisy Sophia via cfe-dev
I guess you were probably asking about __builtin_parity, not the builtins library. For __builtin_parity, the frontend emits an llvm.ctpop followed by an 'and' with 1. The backend has a peephole that pattern-matches this to the parity sequence, but that peephole is part of a larger optimization pass that doesn't run at -O0, to save compile time. So the pattern isn't matched, and we emit the AND and an expanded popcount. To fix this correctly for -O0 we would need to add an llvm.parity intrinsic that the frontend can emit directly instead of the ctpop+and; then the peephole pattern match wouldn't be needed.


~Craig




Re: "Optimized implementations"?

Hollman, Daisy Sophia via cfe-dev
In reply to this post by Hollman, Daisy Sophia via cfe-dev
Currently we skip building __udivmodti4(), __multi3(), __modti3(), etc. when compiling compiler-rt with MSVC. The problem is more than just the ABI: a bunch of the C code also assumes, for example, that you can apply >> to a __uint128_t. If __uint128_t had to be a struct, as it would under MSVC, that code wouldn't compile; the same goes for addition, subtraction, etc.

Relevant code from the compiler-rt where we disable 128-bit.

// MSVC doesn't have a working 128bit integer type. Users should really compile
// compiler-rt with clang, but if they happen to be doing a standalone build for
// asan or something else, disable the 128 bit parts so things sort of work.
#if defined(_MSC_VER) && !defined(__clang__)
#undef CRT_HAS_128BIT
#endif


 ~Craig


On Wed, Sep 9, 2020 at 10:27 AM Stefan Kanthak <[hidden email]> wrote:
"Craig Topper" <[hidden email]> wrote:

I should have added "rhetorical" to my alternative question. Sorry.
The main question is but still open: is there a (deeper) reason to not
always generate the optimised code for __builtins_*?

> The code in the builtins library is written in C and can't use
> __builtin_parity since it needs to compile with compilers that don't have
> __builtin_parity like MSVC.

Aahhhhh: how do you build the __udivmodti4(), __multi3(), __modti3() etc.
routines for clang_rt.builtins-x86_64.lib with MSVC?
As you know by now, there are some fidgety but strictly necessary details
missing in the declaration of functions returning __uint128_t or __int128_t!

regards
Stefan

> So to get the optimized code, we would have to
> have a version written in assembly for X86. Probably two assembly versions
> since MASM and gcc use different assembly syntax. So we'd have 3 different
> versions of parity for 3 different bit widths. Maybe some macros could
> allow us to share some bit widths or something. But ultimately it was a
> question of where to spend effort.
>
> ~Craig
>
>
> On Wed, Sep 9, 2020 at 9:08 AM Stefan Kanthak <[hidden email]>
> wrote:
>
>> "Craig Topper" <[hidden email]> wrote:
>>
>> > Turn on optimizations.
>>
>> What's the (deeper) reason NOT always to generate the optimised code for
>> __builtins_*?
>> Alternative question: do you also ship a library with unoptimised code
>> which
>> is linked when the -O option is not given?
>>
>> Just wondering about surprising behaviour
>> Stefan
>>
>> > On Wed, Sep 9, 2020 at 4:28 AM Stefan Kanthak <[hidden email]>
>> > wrote:
>> >
>> >> "Craig Topper" <[hidden email]> wrote:
>> >>
>> >> > __builtin_parity uses setnp on older x86 and popcnt with sse4.2
>> >>
>> >> Reality check, PLEASE:
>> >>
>> >> --- bug.c ---
>> >> int main(int argc, char *argv[]) {
>> >>     return __builtin_parity(argc);
>> >> }
>> >> --- EOF ---
>> >>
>> >> clang -o- -target i386-pc-linux -S bug.c
>> >> clang version 10.0.0
>> >> Target: i386-pc-linux
>> >>
>> >>         pushl   %ebp
>> >>         movl    %esp, %ebp
>> >>         subl    $8, %esp
>> >>         movl    8(%ebp), %eax
>> >>         movl    $0, -4(%ebp)
>> >>         movl    8(%ebp), %ecx
>> >>         movl    %ecx, %edx
>> >>         shrl    %edx
>> >>         andl    $1431655765, %edx       # imm = 0x55555555
>> >>         subl    %edx, %ecx
>> >>         movl    %ecx, %edx
>> >>         andl    $858993459, %edx        # imm = 0x33333333
>> >>         shrl    $2, %ecx
>> >>         andl    $858993459, %ecx        # imm = 0x33333333
>> >>         addl    %ecx, %edx
>> >>         movl    %edx, %ecx
>> >>         shrl    $4, %ecx
>> >>         addl    %ecx, %edx
>> >>         andl    $252645135, %edx        # imm = 0xF0F0F0F
>> >>         imull   $16843009, %edx, %ecx   # imm = 0x1010101
>> >>         shrl    $24, %ecx
>> >>         andl    $1, %ecx
>> >>         movl    %eax, -8(%ebp)          # 4-byte Spill
>> >>         movl    %ecx, %eax
>> >>         addl    $8, %esp
>> >>         popl    %ebp
>> >>         retl
>> >>
>> >>
>> >> clang -o- -target amd64-pc-linux -S bug.c
>> >>
>> >>         pushq   %rbp
>> >>         .cfi_def_cfa_offset 16
>> >>         .cfi_offset %rbp, -16
>> >>         movq    %rsp, %rbp
>> >>         .cfi_def_cfa_register %rbp
>> >>         movl    $0, -4(%rbp)
>> >>         movl    %edi, -8(%rbp)
>> >>         movl    -8(%rbp), %eax
>> >>         movl    %eax, %ecx
>> >>         shrl    %ecx
>> >>         andl    $1431655765, %ecx       # imm = 0x55555555
>> >>         subl    %ecx, %eax
>> >>         movl    %eax, %ecx
>> >>         andl    $858993459, %ecx        # imm = 0x33333333
>> >>         shrl    $2, %eax
>> >>         andl    $858993459, %eax        # imm = 0x33333333
>> >>         addl    %eax, %ecx
>> >>         movl    %ecx, %eax
>> >>         shrl    $4, %eax
>> >>         addl    %eax, %ecx
>> >>         andl    $252645135, %ecx        # imm = 0xF0F0F0F
>> >>         imull   $16843009, %ecx, %eax   # imm = 0x1010101
>> >>         shrl    $24, %eax
>> >>         andl    $1, %eax
>> >>         popq    %rbp
>> >>         .cfi_def_cfa %rsp, 8
>> >>         retq
>> >>
>> >> JFTR: this is the same unoptimised code as shipped in the builtins
>> library!
>> >>
>> >> Stefan
>> >>
>> >> > On Sun, Sep 6, 2020 at 1:32 PM Stefan Kanthak <
>> [hidden email]>
>> >> > wrote:
>> >> >
>> >> >> "Craig Topper" <[hidden email]> wrote;
>> >> >>
>> >> >>
>> >> >>
>> >> >> > Clang never generates calls to ___paritysi2, ___paritydi2,
>> ___cmpdi2,
>> >> or
>> >> >>
>> >> >> > ___ucmpdi2 on X86 so its not clear the performance of this matters
>> at
>> >> >> all.
>> >> >>
>> >> >>
>> >> >>
>> >> >> So you can safely remove them for X86 and all the other targets where
>> >> such
>> >> >>
>> >> >> unoptimized code is never called!
>> >> >>
>> >> >> But fix these routines for targets where they are called.
>> >> >>
>> >> >>
>> >> >>
>> >> >> The statement does NOT make any exceptions, and it does not say
>> >> >>
>> >> >> | ships unoptimized routines the compiler never calls
>> >> >>
>> >> >> but
>> >> >>
>> >> >> | optimized target-independent implementations
>> >> >>
>> >> >>
>> >> >>
>> >> >> Stefan
>> >> >>
>> >> >>
>> >> >>
>> >> >> BTW: do builtins like __builtin_*parity* exist?
>> >> >>
>> >> >>      If yes: do they generate the same bad code?
>> >> >>
>> >> >>
>> >> >>
>> >> >> > On Sun, Sep 6, 2020 at 12:31 PM Stefan Kanthak via cfe-dev <
>> >> >>
>> >> >> > [hidden email]> wrote:
>> >> >>
>> >> >> >
>> >> >>
>> >> >> >> <https://compiler-rt.llvm.org/index.html> boasts:
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> | The builtins library provides optimized implementations of this
>> >> >>
>> >> >> >> | and other low-level routines, either in target-independent C
>> form,
>> >> >>
>> >> >> >> | or as a heavily-optimized assembly.
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> Really?
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> Left: inperformant code shipped in    # Right: slightly improved
>> >> code,
>> >> >>
>> >> >> >>       clang_rt.builtins-*             #        which the optimiser
>> >> >> REALLY
>> >> >>
>> >> >> >>                                       #        should have
>> generated
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> ___cmpdi2:
>> >> >>
>> >> >> >>         mov     ecx, [esp+16]         #       mov     ecx,
>> [esp+16]
>> >> >>
>> >> >> >>         xor     eax, eax              #       xor     eax, eax
>> >> >>
>> >> >> >>         cmp     [esp+8], ecx          #       cmp     ecx, [esp+8]
>> >> >>
>> >> >> >>         jl      @f                    #       jg      @f
>> >> >>
>> >> >> >>         mov     eax, 2                #       mov     eax, 2
>> >> >>
>> >> >> >>         jg      @f                    #       jl      @f
>> >> >>
>> >> >> >>         mov     ecx, [esp+4]          #
>> >> >>
>> >> >> >>         mov     edx, [esp+12]         #       mov     ecx,
>> [esp+12]
>> >> >>
>> >> >> >>         mov     eax, 0                #       xor     eax, eax
>> >> >>
>> >> >> >>         cmp     ecx, edx              #       cmp     ecx, [esp+4]
>> >> >>
>> >> >> >>         jb      @f                    #       ja      @f
>> >> >>
>> >> >> >>         cmp     edx, ecx              #
>> >> >>
>> >> >> >>         mov     eax, 1                #
>> >> >>
>> >> >> >>         adc     eax, 0                #       adc     eax, 1
>> >> >>
>> >> >> >> @@:                                   # @@:
>> >> >>
>> >> >> >>         ret                           #       ret
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >>                                       # 3 instructions less, 10
>> bytes
>> >> >> saved
>> >> >>
        mov     ecx, [esp+4]          #
        mov     edx, [esp+12]         #       mov     ecx, [esp+12]
        mov     eax, 0                #       xor     eax, eax
        cmp     ecx, edx              #       cmp     ecx, [esp+4]
        jb      @f                    #       ja      @f
        cmp     edx, ecx              #
        mov     eax, 1                #
        adc     eax, 0                #       adc     eax, 1
@@:                                   # @@:
        ret                           #       ret

                                      # 3 instructions less, 10 bytes saved

Now properly written code, of course branch-free, faster and shorter:

# Copyright (C) 2004-2020, Stefan Kanthak <[hidden email]>

___cmpdi2:
        mov     ecx, [esp+4]
        mov     edx, [esp+12]
        cmp     ecx, edx
        mov     eax, [esp+8]
        sbb     eax, [esp+16]
        setl    ah
        cmp     edx, ecx
        mov     edx, [esp+16]
        sbb     edx, [esp+8]
        setl    al
        sub     al, ah
        movsx   eax, al
        inc     eax
        ret

___ucmpdi2:
        mov     ecx, [esp+4]
        mov     edx, [esp+12]
        cmp     ecx, edx
        mov     eax, [esp+8]
        sbb     eax, [esp+16]
        sbb     eax, eax
        cmp     edx, ecx
        mov     edx, [esp+16]
        sbb     edx, [esp+8]
        adc     eax, 1
        ret

AGAIN:
Remove every occurrence of the word "optimized" on the above web page.

'nuff said
Stefan

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: "Optimized implementations"?

Hollman, Daisy Sophia via cfe-dev
Those two lines of code are slightly different. The first assumes sr to be 0-63. The second allows sr to be 0-127. So we might end up with a runtime check on bit 6 and a cmov to handle 127-64 and 63-0 differently. Maybe the compiler figures out from the surrounding code that bit 6 is 0, but I'd have to check.

   foo.high = (foo.high << sr) | (foo.low >> (64 - sr));
instead of just
    foo.all <<= sr, 


~Craig


On Wed, Sep 9, 2020 at 10:53 AM Stefan Kanthak <[hidden email]> wrote:
"Craig Topper" <[hidden email]> wrote:

> Currently we skip the build of __udivmodti4(), __multi3(), __modti3() etc.
> when compiling compiler-rt with MSVC. The problem is more than just the
> ABI. A bunch of the C code also assumes for example that you can do >> on
> __uint128_t.

Correct.
OTOH, at least udivmodti4.c contains a whole lot of rather ugly "compound"
shifts
    foo.high = (foo.high << sr) | (foo.low >> (64 - sr));
instead of just
    foo.all <<= sr,

The same holds of course for udivmoddi4.c too!
JFTR: ANSI/ISO C introduced 64-bit data types in the last millennium.

Again my question: who reviews such code?
Let alone its poor^Wdisastrous performance...

> But if the __uint128_t is a struct as would be needed for MSVC
> that code won't compile. Similar for addition, subtraction, etc.

I know (not that I tried, reading alone is sufficient to notice that).

> Relevant code from the compiler-rt where we disable 128-bit.
>
> // MSVC doesn't have a working 128bit integer type. Users should really
> // compile compiler-rt with clang, but if they happen to be doing a
> // standalone build for asan or something else, disable the 128 bit parts
> // so things sort of work.
> #if defined(_MSC_VER) && !defined(__clang__)
> #undef CRT_HAS_128BIT
> #endif

I looked only at COMPILER_RT_ABI and didn't notice that #undef.

Stefan

> On Wed, Sep 9, 2020 at 10:27 AM Stefan Kanthak <[hidden email]>
> wrote:
>
>> "Craig Topper" <[hidden email]> wrote:
>>
>> I should have added "rhetorical" to my alternative question. Sorry.
>> The main question is but still open: is there a (deeper) reason to not
>> always generate the optimised code for __builtins_*?
>>
>> [snip]


Re: "Optimized implementations"?

Hollman, Daisy Sophia via cfe-dev
Sorry, the first allows sr to be 1-63; the (64 - sr) makes sr == 0 not allowed.

~Craig


On Wed, Sep 9, 2020 at 11:03 AM Craig Topper <[hidden email]> wrote:
Those two lines of code are slightly different. [snip]
