

Hi all,
For the following code
int countHundred( int num )
{
    int count = 0;
    while ( num >= 100 ) { count++; num = num - 100; }
    return count;
}
the loop shown above is optimised away and replaced by a ‘udiv’ instruction.
This is undesirable for targets requiring libcalls for integer divisions, such as the MSP430.
Where does that happen? Is there a way to prevent this?
Thanks
John.
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


On Wed, 23 Oct 2019 at 12:29, Joan Lluch via cfe-dev
< [hidden email]> wrote:
> For the following code
>
> int countHundred( int num )
> {
> int count = 0;
> while ( num >= 100) { count++ ; num = num - 100; }
> return count;
> }
>
> the loop shown above is optimised away and replaced by ‘udiv’ instruction.
>
> This is undesirable for targets requiring libcalls for integer divisions, such as the MSP430.
>
> Where does that happen? Is there a way, to prevent this?
It is a capability of ScalarEvolution and may happen by, e.g.,
LoopStrengthReducePass.
Michael


On Wed, Oct 23, 2019 at 10:29 AM Joan Lluch via cfe-dev
< [hidden email]> wrote:
>
> Hi all,
>
> For the following code
>
> int countHundred( int num )
> {
> int count = 0;
> while ( num >= 100) { count++ ; num = num - 100; }
> return count;
> }
>
> the loop shown above is optimised away and replaced by ‘udiv’ instruction.
>
> This is undesirable for targets requiring libcalls for integer divisions, such as the MSP430.
That's not necessarily the case, unless you know in advance that "num"
is a relatively small positive number.
Even on a 16-bit machine, large positive numbers could have loop trip
counts up to 327, which is going to be much slower than a library call
using even a simple shift-and-subtract division algorithm.
I hope the udiv isn't used until after an initial test to make sure
"num" is >100 or at least positive, otherwise using udiv is an actual
bug.
I'd suggest doing some testing to find out at which point a library
call is actually faster than this loop (probably anything above 2000
or 4000 or so), modifying the compiler to not do the transform if the
trip count is *known* to be less than this (i.e. continue to use the
library routine if the trip count is unknown), and also rewrite the
code to:

#define ITER_LIMIT 24 // for example - determine by experiment and modify compiler appropriately

int countHundred( int num )
{
    if (num > (ITER_LIMIT * 100)) return num / 100;
    int count = 0;
    while ( num >= 100 ) { count++; num = num - 100; }
    return count;
}


On Sun, Oct 27, 2019 at 7:38 PM Bruce Hoult < [hidden email]> wrote:
>
> On Wed, Oct 23, 2019 at 10:29 AM Joan Lluch via cfe-dev
> < [hidden email]> wrote:
> >
> > Hi all,
> >
> > For the following code
> >
> > int countHundred( int num )
> > {
> > int count = 0;
> > while ( num >= 100) { count++ ; num = num - 100; }
> > return count;
> > }
> >
> > the loop shown above is optimised away and replaced by ‘udiv’ instruction.
> >
> > This is undesirable for targets requiring libcalls for integer divisions, such as the MSP430.
>
> That's not necessarily the case, unless you know in advance that "num"
> is a relatively small positive number.
>
> Even on a 16-bit machine, large positive numbers could have loop trip
> counts up to 327 which is going to be much slower than a library call
> using even a simple shift-and-subtract division algorithm.
>
> I hope the udiv isn't used until after an initial test to make sure
> "num" is >100 or at least positive, otherwise using udiv is an actual
> bug.
>
> I'd suggest doing some testing to find out at which point a library
> call is actually faster than this loop (probably anything above 2000
> or 4000 or so), modifying the compiler to not do the transform if the
> trip count is *known* to be less than this (i.e. continue to use the
> library routine if the trip count is unknown), and also rewrite the
> code to:
>
> #define ITER_LIMIT 24 // for example - determine by experiment and
> modify compiler appropriately
> int countHundred( int num )
> {
> if (num > (ITER_LIMIT * 100)) return num / 100;
> int count = 0;
> while ( num >= 100) { count++ ; num = num - 100; }
> return count;
> }
I'm seeing the same behaviour with RISC-V rv32i (no multiply or divide
instructions), even with -O1.
So this is something that should be done in general for ISAs without
hardware divide.


Hi Bruce,
I understand what you say, but the optimisation you suggest requires the compiler to generate runtime checks and to split the code into several execution blocks, some of which may never get executed or may prevent further optimisation, and it adds some overhead in the best case. In my opinion that can be perceived as over-engineering that may not be required.
My preferred approach is to actually give some trust to the user source code and just 'relax' for some targets the optimisations that are known to be uncertain for such targets.
I found this when compiling a function that converts a binary number to a sequence of decimal digits expressed in ASCII. Every digit is determined by counting the number of subtractions required for power-of-ten factors. For that algorithm, only up to 9 subtractions are required at every step (or factor). Replacing that by divisions is a lot more expensive for the affected targets.
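A minimal sketch of that kind of conversion routine, added here as an illustration (not the original code; the name `u16_to_ascii` and the 16-bit assumption are mine):

```c
#include <stdint.h>

/* Convert an unsigned 16-bit value to 5 ASCII decimal digits by counting
 * subtractions of power-of-ten factors: at most 9 subtractions per digit,
 * and no divisions anywhere. */
void u16_to_ascii(uint16_t num, char out[6]) {
    static const uint16_t factors[] = { 10000, 1000, 100, 10, 1 };
    for (int i = 0; i < 5; i++) {
        char digit = '0';
        while (num >= factors[i]) { /* runs at most 9 times per factor */
            num -= factors[i];
            digit++;
        }
        out[i] = digit;
    }
    out[5] = '\0';
}
```

For example, 4096 produces "04096": the 1000s digit takes 4 subtractions, the 10s digit takes 9, and the other factors take 6 or fewer.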
My point is that the user programmer must be assumed to know what she/he is doing, and if she/he decided to place a loop instead of a division on a target with expensive division, then we must just keep the loop.
Based on a previous answer, I figured out that this optimisation is made in “IndVarSimplify::rewriteLoopExitValues”. This is part of the “target-independent” optimisations, and therefore there’s no target-specific hook that can be implemented to prevent it. There’s however a hidden “-mllvm -replexitval=never” clang command line option that will prevent ALL loop exit values from being replaced by linear expressions, which will solve this particular issue, but it will also stop other desired optimisations, so it’s not a perfect solution.
In relation to (supposedly) “target-independent” optimisations, I presented related issues under the subject “[llvm-dev] [AVR] [MSP430] Code gen improvements for 8-bit and 16-bit targets”. One of my posts under that subject is a long introduction to the whole issue, starting from the eventuality that some LLVM target-independent passes make assumptions that are not really applicable to all targets.
John
> On 28 Oct 2019, at 04:30, Bruce Hoult < [hidden email]> wrote:
>
> On Sun, Oct 27, 2019 at 7:38 PM Bruce Hoult < [hidden email]> wrote:
>>
>> On Wed, Oct 23, 2019 at 10:29 AM Joan Lluch via cfe-dev
>> < [hidden email]> wrote:
>>>
>>> Hi all,
>>>
>>> For the following code
>>>
>>> int countHundred( int num )
>>> {
>>> int count = 0;
>>> while ( num >= 100) { count++ ; num = num - 100; }
>>> return count;
>>> }
>>>
>>> the loop shown above is optimised away and replaced by ‘udiv’ instruction.
>>>
>>> This is undesirable for targets requiring libcalls for integer divisions, such as the MSP430.
>>
>> That's not necessarily the case, unless you know in advance that "num"
>> is a relatively small positive number.
>>
>> Even on a 16-bit machine, large positive numbers could have loop trip
>> counts up to 327 which is going to be much slower than a library call
>> using even a simple shift-and-subtract division algorithm.
>>
>> I hope the udiv isn't used until after an initial test to make sure
>> "num" is >100 or at least positive, otherwise using udiv is an actual
>> bug.
>>
>> I'd suggest doing some testing to find out at which point a library
>> call is actually faster than this loop (probably anything above 2000
>> or 4000 or so), modifying the compiler to not do the transform if the
>> trip count is *known* to be less than this (i.e. continue to use the
>> library routine if the trip count is unknown), and also rewrite the
>> code to:
>>
>> #define ITER_LIMIT 24 // for example - determine by experiment and
>> modify compiler appropriately
>> int countHundred( int num )
>> {
>> if (num > (ITER_LIMIT * 100)) return num / 100;
>> int count = 0;
>> while ( num >= 100) { count++ ; num = num - 100; }
>> return count;
>> }
>
> I'm seeing the same behaviour with RISC-V rv32i (no multiply or divide
> instructions), even with -O1.
>
> So this is something that should be done in general for ISAs without
> hardware divide.


On Mon, Oct 28, 2019 at 2:27 AM Joan Lluch < [hidden email]> wrote:
>
> Hi Bruce,
>
> I understand what you say, but the optimisation you suggest requires the compiler to generate runtime checks and to split the code into several execution blocks, some of which may never get executed or may prevent further optimisation, and it adds some overhead in the best case. In my opinion that can be perceived as over-engineering that may not be required.
No no - I was not suggesting the compiler add runtime checks. I was
suggesting that if the compiler didn't know the input value then it
should use a divide, and keep the loop only if it knows for sure the
input number is very small - for example because that code is guarded
by the programmer writing an if() statement as I showed. Or possibly
an assert().
> My preferred approach is to actually give some trust to the user source code and just 'relax' for some targets the optimisations that are known to be uncertain for such targets.
I have sympathy for this position, but at the same time a lot of code
is generated mechanically or using macros, and it's good for the
compiler to clean it up.
> I found this when compiling a function that will convert a binary number to a sequence of decimal digits expressed in ascii. Every digit is determined by counting the number of subtractions required for power-of-ten factors. For that algorithm, only up to 9 subtractions are required at every step (or factor). Replacing that by divisions is a lot more expensive for the affected targets.
Aha. I see, yes.
I think the divisions are not as expensive as you might be assuming.
The key is to do the same as for the subtract version and divide by
10000, then when the remainder is less than 10000 divide by 1000, and
so on. Don't repeatedly divide by 10 as in the recursive algorithm -
that *will* be slow.
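The approach described can be sketched as follows (my illustration, not code from the thread; the name `u16_to_ascii_div` is hypothetical). One division per digit position, high factor first, rather than a divide-by-10 per digit in reverse order:

```c
#include <stdint.h>

/* One division (and one remainder) per digit position, using the same
 * descending power-of-ten factors as the subtraction version. */
void u16_to_ascii_div(uint16_t num, char out[6]) {
    static const uint16_t factors[] = { 10000, 1000, 100, 10, 1 };
    for (int i = 0; i < 5; i++) {
        out[i] = (char)('0' + num / factors[i]); /* quotient is 0..9 */
        num %= factors[i];                       /* keep the remainder */
    }
    out[5] = '\0';
}
```

The digits come out in printing order, so no recursion or buffer reversal is needed.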
I don't have an MSP430 and libraries handy, but more or less
understand the instruction set, and I can tell LLVM to generate
assembly language for it.
I had a look at the actual code for RISC-V rv32i. If I force compiling
to a subtraction loop then the complete countHundred(num) function
executes 5 + 3 * floor(num/100) instructions. On MSP430 or ARM it
would be 5 + 4 * floor(num/100) as compare-and-branch needs two
instructions on those.
With the udiv the complete countHundred(num) function executes 20 + 10
* ceil(log2(num/100)) instructions on rv32i using Newlib.
Averaged over all values of num from 0 to 999, the iterative version
uses 18.5 instructions per call to countHundred() while the udiv
version uses an average of 39.8 instructions. The udiv version is
faster for all values of num >= 2200 (which I know won't happen in
your case, but the compiler doesn't know that).
So in this use case the iterative subtraction is a little over twice
as fast. That's *something*, for sure, but it's not like there's an
order of magnitude difference or anything like that. Unless it's very
critical you're unlikely to notice.
I expect results would be roughly similar for the MSP430 version of udiv.


On Mon, Oct 28, 2019 at 02:17:50PM -0700, Bruce Hoult via cfe-dev wrote:
> > I found this when compiling a function that will convert a binary number to a sequence of decimal digits expressed in ascii. Every digit is determined by counting the number of subtractions required for power-of-ten factors. For that algorithm, only up to 9 subtractions are required at every step (or factor). Replacing that by divisions is a lot more expensive for the affected targets.
>
> Aha. I see, yes.
>
> I think the divisions are not as expensive as you might be assuming.
> The key is to do the same as for the subtract version and divide by
> 10000, then when the remainder is less than 10000 divide by 1000, and
> so on. Don't repeatedly divide by 10 as in the recursive algorithm -
> that *will* be slow.
It is also important to note that on any somewhat sane architecture, no
division will actually be created here. Now if your architecture has
neither a division nor a multiplication in hardware... Even then,
division by 10 needs ~10 adds and shifts in total.
Joerg


On Mon, Oct 28, 2019 at 2:28 PM Joerg Sonnenberger via cfe-dev
< [hidden email]> wrote:
>
> On Mon, Oct 28, 2019 at 02:17:50PM -0700, Bruce Hoult via cfe-dev wrote:
> > > I found this when compiling a function that will convert a binary number to a sequence of decimal digits expressed in ascii. Every digit is determined by counting the number of subtractions required for power-of-ten factors. For that algorithm, only up to 9 subtractions are required at every step (or factor). Replacing that by divisions is a lot more expensive for the affected targets.
> >
> > Aha. I see, yes.
> >
> > I think the divisions are not as expensive as you might be assuming.
> > The key is to do the same as for the subtract version and divide by
> > 10000, then when the remainder is less than 10000 divide by 1000, and
> > so on. Don't repeatedly divide by 10 as in the recursive algorithm -
> > that *will* be slow.
>
> It is also important to note that on any somewhat sane architecture, no
> division will actually be created here. Now if your architecture has
> neither a division nor a multiplication in hardware... Even then,
> division by 10 needs ~10 adds and shift in total.
How do you do that?
The only ways I know of, division by 10 needs up to wordSize-3
iterations of a loop, with each loop containing two
compare-and-branches, a subtract and an add (or OR) controlled by one
of the compares, and two shifts. That's eight instructions on most
ISAs.
If you want to optimize for most of the inputs being small then you
need a second loop that initially shifts 10 left bit by bit until it
becomes bigger than the input you're dividing (or the MSB gets set).
That's six instructions per loop on most ISAs, so you win if most
inputs have fewer than about 57% of the bits significant.
Alternatively, if you've got a Count_Leading_Zeros instruction and
constant-time multiple-bit dynamic-count shifts then you can do this
second (initial) loop in constant time instead.
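The two-loop scheme just described can be sketched as follows (my own illustration under the stated assumptions, not code from the thread; `udiv_aligned` is a hypothetical name). The first loop aligns the divisor just below the dividend, so the iteration count tracks the magnitude of the input rather than the word size:

```c
#include <stdint.h>

/* Long division with an initial alignment loop: shift the divisor left
 * until doubling it again would exceed the dividend, then shift back
 * down, subtracting wherever it fits. Assumes den != 0. */
uint16_t udiv_aligned(uint16_t num, uint16_t den) {
    uint32_t d = den;           /* widened so the alignment shift cannot overflow */
    uint16_t quot = 0;
    int shift = 0;
    while ((d << 1) <= num) {   /* alignment loop: runs once per significant bit */
        d <<= 1;
        shift++;
    }
    for (; shift >= 0; shift--) {
        quot <<= 1;
        if (num >= d) {         /* subtract the shifted divisor where it fits */
            num -= (uint16_t)d;
            quot |= 1;
        }
        d >>= 1;
    }
    return quot;
}
```

With a count-leading-zeros instruction the alignment loop collapses to a single shift, as noted above.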


On Mon, Oct 28, 2019 at 03:15:37PM -0700, Bruce Hoult via cfe-dev wrote:
> On Mon, Oct 28, 2019 at 2:28 PM Joerg Sonnenberger via cfe-dev
> < [hidden email]> wrote:
> >
> > On Mon, Oct 28, 2019 at 02:17:50PM -0700, Bruce Hoult via cfe-dev wrote:
> > > > I found this when compiling a function that will convert a binary number to a sequence of decimal digits expressed in ascii. Every digit is determined by counting the number of subtractions required for power-of-ten factors. For that algorithm, only up to 9 subtractions are required at every step (or factor). Replacing that by divisions is a lot more expensive for the affected targets.
> > >
> > > Aha. I see, yes.
> > >
> > > I think the divisions are not as expensive as you might be assuming.
> > > The key is to do the same as for the subtract version and divide by
> > > 10000, then when the remainder is less than 10000 divide by 1000, and
> > > so on. Don't repeatedly divide by 10 as in the recursive algorithm -
> > > that *will* be slow.
> >
> > It is also important to note that on any somewhat sane architecture, no
> > division will actually be created here. Now if your architecture has
> > neither a division nor a multiplication in hardware... Even then,
> > division by 10 needs ~10 adds and shift in total.
>
> How do you do that?
>
> The only ways I know of, division by 10 needs up to wordSize-3
> iterations of a loop, with each loop containing two
> compare-and-branches, a subtract and an add (or OR) controlled by one
> of the compares, and two shifts. That's eight instructions on most
> ISAs.
Division by ten is essentially multiplication by 0xcccd for 16-bit ops
and shifting the result by 15 or so. That's multiplying by 0xc (shift
and add), then by 0x1111 = 0x101 * 0x11 (two adds and two shifts),
and adding the original value. Roughly at least. Double the number if
there is no good way to express 32-bit numbers.
Joerg
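This decomposition can be checked with a small sketch (my own, not from the thread): 0xCCCD = 0xC * 0x1111 + 1 and 0x1111 = 0x101 * 0x11, so the whole multiply reduces to shifts and adds. Note the shift this particular magic constant needs is 19, not 15:

```c
#include <stdint.h>

/* n/10 via multiplication by the magic constant 0xCCCD, built entirely
 * from shifts and adds: n*0xCCCD = ((n*0xC)*0x11)*0x101 + n.
 * Exact for all 16-bit n; the 32-bit intermediate never overflows. */
uint16_t div10_shift_add(uint16_t n) {
    uint32_t t = ((uint32_t)n << 3) + ((uint32_t)n << 2); /* n * 0xC    */
    t += t << 4;                                          /* * 0x11     */
    t += t << 8;                                          /* * 0x101    */
    t += n;                                               /* = n*0xCCCD */
    return (uint16_t)(t >> 19);
}
```

That is four shifts and four adds, plus the final shift, which matches the "~10 adds and shifts" count.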


> -----Original Message-----
> From: cfe-dev < [hidden email]> On Behalf Of Joerg
> Sonnenberger via cfe-dev
> Sent: Monday, October 28, 2019 6:54 PM
> To: [hidden email]
> Subject: Re: [cfe-dev] Loop optimised into div (?)
>
> On Mon, Oct 28, 2019 at 03:15:37PM -0700, Bruce Hoult via cfe-dev wrote:
> > On Mon, Oct 28, 2019 at 2:28 PM Joerg Sonnenberger via cfe-dev
> > < [hidden email]> wrote:
> > >
> > > On Mon, Oct 28, 2019 at 02:17:50PM -0700, Bruce Hoult via cfe-dev
> wrote:
> > > > > I found this when compiling a function that will convert a binary
> number to a sequence of decimal digits expressed in ascii. Every digit is
> determined by counting the number of subtractions required for power-of-
> ten factors. For that algorithm, only up to 9 subtractions are required
> at every step (or factor). Replacing that by divisions is a lot more
> expensive for the affected targets.
> > > >
> > > > Aha. I see, yes.
> > > >
> > > > I think the divisions are not as expensive as you might be assuming.
> > > > The key is to do the same as for the subtract version and divide by
> > > > 10000, then when the remainder is less than 10000 divide by 1000,
> and
> > > > so on. Don't repeatedly divide by 10 as in the recursive algorithm -
> > > > that *will* be slow.
> > >
> > > It is also important to note that on any somewhat sane architecture,
> no
> > > division will actually be created here. Now if your architecture has
> > > neither a division nor a multiplication in hardware... Even then,
> > > division by 10 needs ~10 adds and shift in total.
> >
> > How do you do that?
> >
> > The only ways I know of, division by 10 needs up to wordSize-3
> > iterations of a loop, with each loop containing two
> > compare-and-branches, a subtract and an add (or OR) controlled by one
> > of the compares, and two shifts. That's eight instructions on most
> > ISAs.
>
> Division by ten is essentially multiplication by 0xcccd for 16-bit ops
> and shifting the result by 15 or so. That's multiplying by 0xc (shift
> and add), then by 0x1111 = 0x101 * 0x11 (two adds and two shifts),
> and adding the original value. Roughly at least. Double the number if
> there is no good way to express 32-bit numbers.
>
> Joerg
The book _Hacker's Delight_ has a whole chapter on integer division by
a constant, with proofs, turning into mostly multiplies and shifts.
I remember seeing a particularly clever sequence for converting binary
to decimal, years ago, that used these tricks in parallel to do more
than one digit at a time, allowing it to use fewer total multiplies.
(Which aren't exactly cheap, either.)
paulr


Hi Bruce,
I appreciate your testing of this for the rv32i target. I am doing some work on 8-bit and 16-bit targets. For these targets, I do not agree that it is best to replace loops by divisions even if the compiler doesn’t know the number of iterations. I’m not really aware of “automatically generated” code, in particular for these targets, but if that causes a problem then maybe we must blame the “automator”, not the compiler. In any case, I’m pretty sure that most code for these targets is implemented by humans.
So my preferred approach is still to trust the user source code and to avoid the introduction of expensive target operations if the user has not explicitly used them (division is very expensive for 8-bit targets with no hardware divide). If the user implemented a loop that could obviously be turned into a division, then there is a high chance that the loop is better than the division in THAT particular scenario. Any compiler attempt to reverse this will most probably result in worse performance. A compiler is just a tool; it should not attempt to replace human-designed algorithms, but just generate the best possible assembly for the original source code.
Please also understand that I only used the ‘countHundred’ function to illustrate the subject. I strongly doubt that a normal programmer would use that function to implement a divide-by-one-hundred as opposed to an actual division, so the point remains that if a programmer wanted to use a loop then there must be a good reason for that, which the compiler should respect. I’m talking at all times about targets with no hardware divide; of course, for the remaining targets the current transformation is not a problem.
John
> On 28 Oct 2019, at 22:17, Bruce Hoult < [hidden email]> wrote:
>
> On Mon, Oct 28, 2019 at 2:27 AM Joan Lluch < [hidden email]> wrote:
>>
>> Hi Bruce,
>>
>> I understand what you say, but the optimisation you suggest requires the compiler to generate runtime checks and to split the code into several execution blocks, some of which may never get executed or may prevent further optimisation, and it adds some overhead in the best case. In my opinion that can be perceived as over-engineering that may not be required.
>
> No no - I was not suggesting the compiler add runtime checks. I was
> suggesting that if the compiler didn't know the input value then it
> should use a divide, and keep the loop only if it knows for sure the
> input number is very small - for example because that code is guarded
> by the programmer writing an if() statement as I showed. Or possibly
> an assert().
>
>> My preferred approach is to actually give some trust to the user source code and just 'relax' for some targets the optimisations that are known to be uncertain for such targets.
>
> I have sympathy for this position, but at the same time a lot of code
> is generated mechanically or using macros, and it's good for the
> compiler to clean it up.
>
>> I found this when compiling a function that will convert a binary number to a sequence of decimal digits expressed in ascii. Every digit is determined by counting the number of subtractions required for power-of-ten factors. For that algorithm, only up to 9 subtractions are required at every step (or factor). Replacing that by divisions is a lot more expensive for the affected targets.
>
> Aha. I see, yes.
>
> I think the divisions are not as expensive as you might be assuming.
> The key is to do the same as for the subtract version and divide by
> 10000, then when the remainder is less than 10000 divide by 1000, and
> so on. Don't repeatedly divide by 10 as in the recursive algorithm -
> that *will* be slow.
>
> I don't have an MSP430 and libraries handy, but more or less
> understand the instruction set, and I can tell LLVM to generate
> assembly language for it.
>
> I had a look at the actual code for RISC-V rv32i. If I force compiling
> to a subtraction loop then the complete countHundred(num) function
> executes 5 + 3 * floor(num/100) instructions. On MSP430 or ARM it
> would be 5 + 4 * floor(num/100) as compare-and-branch needs two
> instructions on those.
>
> With the udiv the complete countHundred(num) function executes 20 + 10
> * ceil(log2(num/100)) instructions on rv32i using Newlib.
>
> Averaged over all values of num from 0 to 999, the iterative version
> uses 18.5 instructions per call to countHundred() while the udiv
> version uses an average of 39.8 instructions. The udiv version is
> faster for all values of num >= 2200 (which I know won't happen in
> your case, but the compiler doesn't know that).
>
> So in this use case the iterative subtraction is a little over twice
> as fast. That's *something*, for sure, but it's not like there's an
> order of magnitude difference or anything like that. Unless it's very
> critical you're unlikely to notice.
>
> I expect results would be roughly similar for the MSP430 version of udiv.


On Tue, Oct 29, 2019 at 12:54:38AM +0100, Joan Lluch via cfe-dev wrote:
> I’m talking at all times about targets with no hardware divide,
> of course for the remaining targets the current transformation is not
> a problem.
Again, the problem is not really the support for divide. If anything at
all, it is the lack of support for multiplication that would hurt this.
I'm going to dare to say that lack of hardware multiplication is enough
reason to maybe not want to use a general purpose compiler framework...
Joerg


On Mon, Oct 28, 2019 at 4:26 PM Robinson, Paul via cfe-dev
< [hidden email]> wrote:
>
>
>
> > -----Original Message-----
> > From: cfe-dev < [hidden email]> On Behalf Of Joerg
> > Sonnenberger via cfe-dev
> > Sent: Monday, October 28, 2019 6:54 PM
> > To: [hidden email]
> > Subject: Re: [cfe-dev] Loop optimised into div (?)
> >
> > On Mon, Oct 28, 2019 at 03:15:37PM -0700, Bruce Hoult via cfe-dev wrote:
> > > On Mon, Oct 28, 2019 at 2:28 PM Joerg Sonnenberger via cfe-dev
> > > < [hidden email]> wrote:
> > > >
> > > > On Mon, Oct 28, 2019 at 02:17:50PM -0700, Bruce Hoult via cfe-dev
> > wrote:
> > > > > > I found this when compiling a function that will convert a binary
> > number to a sequence of decimal digits expressed in ascii. Every digit is
> > determined by counting the number of subtractions required for power-of-
> > ten factors. For that algorithm, only up to 9 subtractions are required
> > at every step (or factor). Replacing that by divisions is a lot more
> > expensive for the affected targets.
> > > > >
> > > > > Aha. I see, yes.
> > > > >
> > > > > I think the divisions are not as expensive as you might be assuming.
> > > > > The key is to do the same as for the subtract version and divide by
> > > > > 10000, then when the remainder is less than 10000 divide by 1000,
> > and
> > > > > so on. Don't repeatedly divide by 10 as in the recursive algorithm -
> > > > > that *will* be slow.
> > > >
> > > > It is also important to note that on any somewhat sane architecture,
> > no
> > > > division will actually be created here. Now if your architecture has
> > > > neither a division nor a multiplication in hardware... Even then,
> > > > division by 10 needs ~10 adds and shift in total.
> > >
> > > How do you do that?
> > >
> > > The only ways I know of, division by 10 needs up to wordSize-3
> > > iterations of a loop, with each loop containing two
> > > compare-and-branches, a subtract and an add (or OR) controlled by one
> > > of the compares, and two shifts. That's eight instructions on most
> > > ISAs.
> >
> > Division by ten is essentially multiplication by 0xcccd for 16-bit ops
> > and shifting the result by 15 or so. That's multiplying by 0xc (shift
> > and add), then by 0x1111 = 0x101 * 0x11 (two adds and two shifts),
> > and adding the original value. Roughly at least. Double the number if
> > there is no good way to express 32-bit numbers.
> >
> > Joerg
>
> The book _Hacker's Delight_ has a whole chapter on integer division by
> a constant, with proofs, turning into mostly multiplies and shifts.
Here are functions based on multiplying by 0xcccd, and probably the
best Hacker's Delight algorithm:
uint16_t div10(uint16_t n) {
    uint32_t p = n * (uint32_t)0xcccd;
    return p >> 19;
}

uint16_t divu10(uint16_t n) {
    uint16_t q, r;
    q = (n >> 1) + (n >> 2);
    q = q + (q >> 4);
    q = q + (q >> 8);
    //q = q + (q >> 16);
    q = q >> 3;
    r = n - (((q << 2) + q) << 1);
    return q + (r > 9);
}
I've verified they both work for all values of n by compiling for
x86_64 and riscv. I can compile for msp430, but I don't have runtime
libraries or an emulator for it.
The MSP430 code generated for the above functions:
00000018 <div10>:
18: 0b 12 push r11
1a: 0a 12 push r10
1c: 09 12 push r9
1e: 08 12 push r8
20: 08 4f mov r15, r8
22: 09 43 clr r9
24: 0c 48 mov r8, r12
26: 0d 49 mov r9, r13
28: 0c 5c rla r12
2a: 0d 6d rlc r13
2c: 0c 58 add r8, r12
2e: 0d 69 addc r9, r13
30: 0c 5c rla r12
32: 0d 6d rlc r13
34: 0c 5c rla r12
36: 0d 6d rlc r13
38: 0e 4c mov r12, r14
3a: 0f 4d mov r13, r15
3c: 0e 5e rla r14
3e: 0f 6f rlc r15
40: 0e 5e rla r14
42: 0f 6f rlc r15
44: 0e 5e rla r14
46: 0f 6f rlc r15
48: 0e 5e rla r14
4a: 0f 6f rlc r15
4c: 0e 5c add r12, r14
4e: 0f 6d addc r13, r15
50: 0a 4e mov r14, r10
52: 0b 4f mov r15, r11
54: 4b ee xor.b r14, r11
56: 0b ee xor r14, r11
58: 8b 10 swpb r11
5a: 4a 4a mov.b r10, r10
5c: 8a 10 swpb r10
5e: 0e 5a add r10, r14
60: 0f 6b addc r11, r15
62: 0e 58 add r8, r14
64: 0f 69 addc r9, r15
66: 0d 4f mov r15, r13
68: 0e 4d mov r13, r14
6a: 0f 43 clr r15
6c: 12 c3 clrc
6e: 0f 10 rrc r15
70: 0e 10 rrc r14
72: 12 c3 clrc
74: 0f 10 rrc r15
76: 0e 10 rrc r14
78: 12 c3 clrc
7a: 0f 10 rrc r15
7c: 0e 10 rrc r14
7e: 0f 4e mov r14, r15
80: 38 41 pop r8
82: 39 41 pop r9
84: 3a 41 pop r10
86: 3b 41 pop r11
88: 30 41 ret
0000008a <divu10>:
8a: 0e 4f mov r15, r14
8c: 12 c3 clrc
8e: 0e 10 rrc r14
90: 12 c3 clrc
92: 0e 10 rrc r14
94: 0d 4f mov r15, r13
96: 12 c3 clrc
98: 0d 10 rrc r13
9a: 0d 5e add r14, r13
9c: 0e 4d mov r13, r14
9e: 12 c3 clrc
a0: 0e 10 rrc r14
a2: 12 c3 clrc
a4: 0e 10 rrc r14
a6: 12 c3 clrc
a8: 0e 10 rrc r14
aa: 12 c3 clrc
ac: 0e 10 rrc r14
ae: 0d 5e add r14, r13
b0: 0e 4d mov r13, r14
b2: 8e 10 swpb r14
b4: 4e 4e mov.b r14, r14
b6: 0e 5d add r13, r14
b8: 12 c3 clrc
ba: 0e 10 rrc r14
bc: 12 c3 clrc
be: 0e 10 rrc r14
c0: 12 c3 clrc
c2: 0e 10 rrc r14
c4: 0d 4e mov r14, r13
c6: 0d 5d rla r13
c8: 0d 5d rla r13
ca: 0d 5e add r14, r13
cc: 0d 5d rla r13
ce: 0f 8d sub r13, r15
d0: 0d 4f mov r15, r13
d2: 1f 43 mov #1, r15 ;r3 As==01
d4: 3d 90 0a 00 cmp #10, r13 ;#0x000a
d8: 01 2c jc $+4 ;abs 0xdc
da: 0f 43 clr r15
dc: 0f 5e add r14, r15
de: 30 41 ret
Not even counting the pushes, pops, and returns, those have 48 and 41
instructions, independent of the input data (except one conditionally
executed clr r15 in the 2nd one).
You will recall that the simple loop that subtracts a power of ten
between 0 and 9 times uses on average 19.5 instructions on MSP430.
Calling the riscv general purpose udiv library function uses on
average 40 instructions. I don't have the MSP430 library routine, but
it will be similar.
Both of the above methods are just as bad as, or worse than, a generic
division function known to return a value between 0 and 9.

