Loop optimised into div (?)


Kristof Beyls via cfe-dev
Hi all,

For the following code 

int countHundred( int num )
{
  int count = 0;
  while ( num >= 100) { count++ ; num = num - 100; }
  return count;
}

the loop shown above is optimised away and replaced by a ‘udiv’ instruction.

This is undesirable for targets requiring libcalls for integer divisions, such as the MSP430. 

Where does that happen? Is there a way to prevent this?

Thanks

John.



_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: Loop optimised into div (?)

Kristof Beyls via cfe-dev
Am Mi., 23. Okt. 2019 um 12:29 Uhr schrieb Joan Lluch via cfe-dev
<[hidden email]>:

> For the following code
>
> int countHundred( int num )
> {
>   int count = 0;
>   while ( num >= 100) { count++ ; num = num - 100; }
>   return count;
> }
>
> the loop shown above is optimised away and replaced by a ‘udiv’ instruction.
>
> This is undesirable for targets requiring libcalls for integer divisions, such as the MSP430.
>
> Where does that happen? Is there a way to prevent this?

It is a capability of ScalarEvolution and may happen via e.g. the
LoopStrengthReduce pass.


Michael

Re: Loop optimised into div (?)

Kristof Beyls via cfe-dev
On Wed, Oct 23, 2019 at 10:29 AM Joan Lluch via cfe-dev
<[hidden email]> wrote:

>
> Hi all,
>
> For the following code
>
> int countHundred( int num )
> {
>   int count = 0;
>   while ( num >= 100) { count++ ; num = num - 100; }
>   return count;
> }
>
> the loop shown above is optimised away and replaced by ‘udiv’ instruction.
>
> This is undesirable for targets requiring libcalls for integer divisions, such as the MSP430.

That's not necessarily the case, unless you know in advance that "num"
is a relatively small positive number.

Even on a 16 bit machine, large positive numbers could have loop trip
counts up to 327, which is going to be much slower than a library call
using even a simple shift-and-subtract division algorithm.

I hope the udiv isn't used until after an initial test to make sure
"num" is >100 or at least positive, otherwise using udiv is an actual
bug.

I'd suggest doing some testing to find out at which point a library
call is actually faster than this loop (probably anything above 2000
or 4000 or so), modifying the compiler to not do the transform if the
trip count is *known* to be less than this (i.e. continue to use the
library routine if the trip count is unknown), and also rewrite the
code to:

#define ITER_LIMIT 24  // for example -- determine by experiment and modify compiler appropriately
int countHundred( int num )
{
  if (num > (ITER_LIMIT * 100)) return num / 100;
  int count = 0;
  while ( num >= 100) { count++ ; num = num - 100; }
  return count;
}

Re: Loop optimised into div (?)

Kristof Beyls via cfe-dev

I'm seeing the same behaviour with RISC-V rv32i (no multiply or divide
instructions), even with -O1.

So this is something that should be done in general for ISAs without
hardware divide.

Re: Loop optimised into div (?)

Kristof Beyls via cfe-dev
Hi Bruce,

I understand what you say, but the optimisation you suggest requires the compiler to generate run-time checks and to split code into several execution blocks, some of which may never get executed or may prevent further optimisation, and it adds some overhead in the best case. In my opinion that can be perceived as over-engineering that may not be required.

My preferred approach is to actually give some trust to the user source code and just 'relax' for some targets the optimisations that are known to be uncertain for such targets.

I found this when compiling a function that will convert a binary number to a sequence of decimal digits expressed in ASCII. Every digit is determined by counting the number of subtractions required for power-of-ten factors. For that algorithm, only up to 9 subtractions are required at every step (or factor). Replacing that by divisions is a lot more expensive for the affected targets.

My point is that the user programmer must be assumed to know what she/he is doing, and if she/he decided to place a loop instead of a division on a target with expensive division, then we must just keep the loop.

Based on a previous answer, I figured out that this optimisation is made in "IndVarSimplify::rewriteLoopExitValues”. This is part of the "target-independent” optimisations, and therefore there’s no target-specific hook that can be implemented to prevent it. There’s however a hidden "-mllvm -replexitval=never” clang command line option that will prevent ALL loop exit values from being replaced by linear expressions. This will solve this particular issue, but it will also stop other desired optimisations, so it’s not a perfect solution.

In relation to (supposed) “target-independent” optimisations, I presented related issues under the subject "[llvm-dev] [AVR] [MSP430] Code gen improvements for 8 bit and 16 bit targets”. One of my posts under that subject is a long introduction to the whole issue, starting from the observation that some LLVM target-independent passes make assumptions that are not really applicable to all targets.

John





Re: Loop optimised into div (?)

Kristof Beyls via cfe-dev
On Mon, Oct 28, 2019 at 2:27 AM Joan Lluch <[hidden email]> wrote:
>
> Hi Bruce,
>
> I understand what you say, but the optimisation you suggest requires the compiler to generate run-time checks and to split code into several execution blocks, some of them may not get ever executed, or prevent further optimisation, and it adds some overhead on the best case. In my opinion that can be perceived as some overengineering that may not be required.

No no -- I was not suggesting the compiler add runtime checks. I was
suggesting that if the compiler didn't know the input value then it
should use a divide, and keep the loop only if it knows for sure the
input number is very small -- for example because that code is guarded
by the programmer writing an if() statement as I showed. Or possibly
an assert().

> My preferred approach is to actually give some trust to the user source code and just 'relax' for some targets the optimisations that are known to be uncertain for such targets.

I have sympathy for this position, but at the same time a lot of code
is generated mechanically or using macros, and it's good for the
compiler to clean it up.

> I found this when compiling a function that will convert a binary number to a sequence of decimal digits expressed in ASCII. Every digit is determined by counting the number of subtractions required for power-of-ten factors. For that algorithm, only up to 9 subtractions are required at every step (or factor). Replacing that by divisions is a lot more expensive for the affected targets.

Aha. I see, yes.

I think the divisions are not as expensive as you might be assuming.
The key is to do the same as for the subtract version and divide by
10000, then when the remainder is less than 10000 divide by 1000, and
so on. Don't repeatedly divide by 10 as in the recursive algorithm --
that *will* be slow.

I don't have an MSP430 and libraries handy, but more or less
understand the instruction set, and I can tell LLVM to generate
assembly language for it.

I had a look at the actual code for RISC-V rv32i. If I force compiling
to a subtraction loop then the complete countHundred(num) function
executes 5 + 3 * floor(num/100) instructions. On MSP430 or ARM it
would be  5 + 4 * floor(num/100) as compare-and-branch needs two
instructions on those.

With the udiv the complete countHundred(num) function executes 20 + 10
* ceil(log2(num/100)) instructions on rv32i using Newlib.

Averaged over all values of num from 0 to 999, the iterative version
uses 18.5 instructions per call to countHundred() while the udiv
version uses an average of 39.8 instructions. The udiv version is
faster for all values of num >= 2200 (which I know won't happen in
your case, but the compiler doesn't know that).

So in this use-case the iterative subtraction is a little over twice
as fast. That's *something*, for sure, but it's not like there's an
order of magnitude difference or anything like that. Unless it's very
critical you're unlikely to notice.

I expect results would be roughly similar for the MSP430 version of udiv.

Re: Loop optimised into div (?)

Kristof Beyls via cfe-dev
On Mon, Oct 28, 2019 at 02:17:50PM -0700, Bruce Hoult via cfe-dev wrote:
> > I found this when compiling a function that will convert a binary number to a sequence of decimal digits expressed in ASCII. Every digit is determined by counting the number of subtractions required for power-of-ten factors. For that algorithm, only up to 9 subtractions are required at every step (or factor). Replacing that by divisions is a lot more expensive for the affected targets.
>
> Aha. I see, yes.
>
> I think the divisions are not as expensive as you might be assuming.
> The key is to do the same as for the subtract version and divide by
> 10000, then when the remainder is less than 10000 divide by 1000, and
> so on. Don't repeatedly divide by 10 as in the recursive algorithm --
> that *will* be slow.

It is also important to note that on any somewhat sane architecture, no
division will actually be created here. Now if your architecture has
neither a division nor a multiplication in hardware... Even then,
division by 10 needs ~10 adds and shifts in total.

Joerg

Re: Loop optimised into div (?)

Kristof Beyls via cfe-dev
On Mon, Oct 28, 2019 at 2:28 PM Joerg Sonnenberger via cfe-dev
<[hidden email]> wrote:

>
> It is also important to note that on any somewhat sane architecture, no
> division will actually be created here. Now if your architecture has
> neither a division nor a multiplication in hardware... Even then,
> division by 10 needs ~10 adds and shift in total.

How do you do that?

In the only ways I know of, division by 10 needs up to wordSize-3
iterations of a loop, with each loop containing two
compare-and-branches, a subtract and an add (or OR) controlled by one
of the compares, and two shifts. That's eight instructions on most
ISAs.

If you want to optimize for most of the inputs being small then you
need a second loop that initially shifts 10 left bit by bit until it
becomes bigger than the input you're dividing (or the MSB gets set).
That's six instructions per loop on most ISAs, so you win if most
inputs have fewer than about 57% of the bits significant.
Alternatively, if you've got a Count_Leading_Zeros instruction and
constant time multiple-bit dynamic count shifts then you can do this
second (initial) loop in constant time instead.

Re: Loop optimised into div (?)

Kristof Beyls via cfe-dev
On Mon, Oct 28, 2019 at 03:15:37PM -0700, Bruce Hoult via cfe-dev wrote:

> How do you do that?
>
> The only ways I know of, division by 10 needs up to wordSize-3
> iterations of a loop, with each loop containing two
> compare-and-branches, a subtract and an add (or OR) controlled by one
> of the compares, and two shifts. That's eight instructions on most
> ISAs.

Division by ten is essentially multiplication by 0xcccd for 16-bit ops
and shifting the result by 15 or so. That's multiplying by 0xc (a shift
and an add) and by 0x1111 = 0x101 * 0x11 (two adds and two shifts),
and adding the original value. Roughly, at least. Double the count if
there is no good way to express 32-bit numbers.

Joerg

Re: Loop optimised into div (?)

Kristof Beyls via cfe-dev



The book _Hacker's Delight_ has a whole chapter on integer division by
a constant, with proofs, turning into mostly multiplies and shifts.

I remember seeing a particularly clever sequence for converting binary
to decimal, years ago, that used these tricks in parallel to do more
than one digit at a time, allowing it to use fewer total multiplies.
(Which aren't exactly cheap, either.)
--paulr


Re: Loop optimised into div (?)

Kristof Beyls via cfe-dev
Hi Bruce,

I appreciate your testing of this for the rv32i target. I am doing some work on 8 bit and 16 bit targets. For these targets, I do not agree that it is best to replace loops by divisions even if the compiler doesn’t know the number of iterations. I’m not really aware of  “automatically generated” code, in particular for these targets, but if that causes a problem, then maybe we must blame the “automator”, not the compiler. In any case, I’m pretty sure that most code for these targets is implemented by humans.

So my preferred approach is still to trust the user source code and to avoid the introduction of expensive target operations if the user has not explicitly used them (division is very expensive for 8 bit targets with no hardware divide). If the user implemented a loop that could obviously be turned into a division, then there is a high chance that the loop is better than the division in THAT particular scenario. Any compiler attempt to reverse this will most probably result in worse performance. A compiler is just a tool; it should not attempt to replace human-designed algorithms, but just generate the best possible assembly for the original source code.

Please also understand that I only used the ‘countHundred’ function to show the subject. I strongly doubt that a normal programmer would use that function to implement a divide-by-one-hundred as opposed to an actual division, so the point remains that if a programmer wanted to use a loop then there must be a good reason for that, which the compiler should respect. I’m talking at all times about targets with no hardware divide, of course for the remaining targets the current transformation is not a problem.

John




Re: Loop optimised into div (?)

Kristof Beyls via cfe-dev
On Tue, Oct 29, 2019 at 12:54:38AM +0100, Joan Lluch via cfe-dev wrote:
> I’m talking at all times about targets with no hardware divide,
> of course for the remaining targets the current transformation is not
> a problem.

Again, the problem is not really the support for divide. If anything
at all, it is the lack of support for multiplication that would hurt here.
I'm going to dare to say that lack of hardware multiplication is enough
reason to maybe not want to use a general purpose compiler framework...

Joerg

Re: Loop optimised into div (?)

Kristof Beyls via cfe-dev
On Mon, Oct 28, 2019 at 4:26 PM Robinson, Paul via cfe-dev
<[hidden email]> wrote:

>
> The book _Hacker's Delight_ has a whole chapter on integer division by
> a constant, with proofs, turning into mostly multiplies and shifts.

Here are functions based on multiplying by 0xcccd and probably the
best Hacker's Delight algorithm:

uint16_t div10(uint16_t n){
  uint32_t p = n * (uint32_t)0xcccd;  /* 0xcccd = ceil(2^19 / 10) */
  return p>>19;
}

uint16_t divu10(uint16_t n) {
    uint16_t q, r;
    q = (n >> 1) + (n >> 2);  /* q ~= n * 0.11 (binary) */
    q = q + (q >> 4);         /* q ~= n * 0.110011 */
    q = q + (q >> 8);         /* q ~= n * 0.11001100110011 */
    //q = q + (q >> 16);      /* needed only for 32-bit operands */
    q = q >> 3;               /* q ~= n/10, possibly low by 1 */
    r = n - (((q << 2) + q) << 1);  /* r = n - q*10 */
    return q + (r > 9);       /* correct q when the remainder overshoots */
}

I've verified they both work for all values of n by compiling for
x86_64 and risc-v. I can compile for msp430, but I don't have runtime
libraries or an emulator for it.

The MSP430 code generated for the above functions:

00000018 <div10>:
  18: 0b 12        push r11
  1a: 0a 12        push r10
  1c: 09 12        push r9
  1e: 08 12        push r8
  20: 08 4f        mov r15, r8
  22: 09 43        clr r9
  24: 0c 48        mov r8, r12
  26: 0d 49        mov r9, r13
  28: 0c 5c        rla r12
  2a: 0d 6d        rlc r13
  2c: 0c 58        add r8, r12
  2e: 0d 69        addc r9, r13
  30: 0c 5c        rla r12
  32: 0d 6d        rlc r13
  34: 0c 5c        rla r12
  36: 0d 6d        rlc r13
  38: 0e 4c        mov r12, r14
  3a: 0f 4d        mov r13, r15
  3c: 0e 5e        rla r14
  3e: 0f 6f        rlc r15
  40: 0e 5e        rla r14
  42: 0f 6f        rlc r15
  44: 0e 5e        rla r14
  46: 0f 6f        rlc r15
  48: 0e 5e        rla r14
  4a: 0f 6f        rlc r15
  4c: 0e 5c        add r12, r14
  4e: 0f 6d        addc r13, r15
  50: 0a 4e        mov r14, r10
  52: 0b 4f        mov r15, r11
  54: 4b ee        xor.b r14, r11
  56: 0b ee        xor r14, r11
  58: 8b 10        swpb r11
  5a: 4a 4a        mov.b r10, r10
  5c: 8a 10        swpb r10
  5e: 0e 5a        add r10, r14
  60: 0f 6b        addc r11, r15
  62: 0e 58        add r8, r14
  64: 0f 69        addc r9, r15
  66: 0d 4f        mov r15, r13
  68: 0e 4d        mov r13, r14
  6a: 0f 43        clr r15
  6c: 12 c3        clrc
  6e: 0f 10        rrc r15
  70: 0e 10        rrc r14
  72: 12 c3        clrc
  74: 0f 10        rrc r15
  76: 0e 10        rrc r14
  78: 12 c3        clrc
  7a: 0f 10        rrc r15
  7c: 0e 10        rrc r14
  7e: 0f 4e        mov r14, r15
  80: 38 41        pop r8
  82: 39 41        pop r9
  84: 3a 41        pop r10
  86: 3b 41        pop r11
  88: 30 41        ret

0000008a <divu10>:
  8a: 0e 4f        mov r15, r14
  8c: 12 c3        clrc
  8e: 0e 10        rrc r14
  90: 12 c3        clrc
  92: 0e 10        rrc r14
  94: 0d 4f        mov r15, r13
  96: 12 c3        clrc
  98: 0d 10        rrc r13
  9a: 0d 5e        add r14, r13
  9c: 0e 4d        mov r13, r14
  9e: 12 c3        clrc
  a0: 0e 10        rrc r14
  a2: 12 c3        clrc
  a4: 0e 10        rrc r14
  a6: 12 c3        clrc
  a8: 0e 10        rrc r14
  aa: 12 c3        clrc
  ac: 0e 10        rrc r14
  ae: 0d 5e        add r14, r13
  b0: 0e 4d        mov r13, r14
  b2: 8e 10        swpb r14
  b4: 4e 4e        mov.b r14, r14
  b6: 0e 5d        add r13, r14
  b8: 12 c3        clrc
  ba: 0e 10        rrc r14
  bc: 12 c3        clrc
  be: 0e 10        rrc r14
  c0: 12 c3        clrc
  c2: 0e 10        rrc r14
  c4: 0d 4e        mov r14, r13
  c6: 0d 5d        rla r13
  c8: 0d 5d        rla r13
  ca: 0d 5e        add r14, r13
  cc: 0d 5d        rla r13
  ce: 0f 8d        sub r13, r15
  d0: 0d 4f        mov r15, r13
  d2: 1f 43        mov #1, r15 ;r3 As==01
  d4: 3d 90 0a 00 cmp #10, r13 ;#0x000a
  d8: 01 2c        jc $+4      ;abs 0xdc
  da: 0f 43        clr r15
  dc: 0f 5e        add r14, r15
  de: 30 41        ret

Not even counting the pushes, pops, and returns, those have 48 and 41
instructions, independent of the input data (except one conditionally
executed clr r15 in the 2nd one).

You will recall that the simple loop that subtracts a power of ten
between 0 and 9 times uses on average 19.5 instructions on MSP430.
Calling the risc-v general purpose udiv library function uses on
average 40 instructions. I don't have the MSP430 library routine, but
it will be similar.

Both of the above methods are just as bad or worse than a generic
division function known to return a value between 0 and 9.