Virtual function call optimization (memoization) questions


Vassil Vassilev via cfe-dev
Hello,

I want to investigate whether there are any possibilities for optimizing virtual function calls. From my knowledge and reading, I understand that the overhead for this is two pointer dereferences. I know about alternatives like CRTP and std::variant, but for this investigation I'm interested only in traditional, dynamic polymorphism. One of my ideas is to use some form of caching of the computed address of the actual method that gets called. Obviously we cannot always do this, but there are some cases where we could.

One example:

I have a virtual function call in a loop. All the time, the same method is called:

The loop is in method work; the virtual call is to method id.

We can see that method work gets inlined, but inside the loop, there are always two pointer dereferences:
mov     rax, qword ptr [r14]
call    qword ptr [rax]

Since the object referred to by b never changes to a different object, this could (at least in this case) be cached.

My assembly might be rusty, but before the loop, we could have:
mov     r13, qword ptr [r14]
mov     r13, qword ptr [r13]

and inside the loop we would only have:

call    r13

My questions are:
Do we have a mechanism in C++ to explicitly store the result of the lookup in the vtable without additional overhead? Some sort of cache for this result so that we do not do the same computation over and over again? I'm specifically looking for this solution, not alternatives to dynamic polymorphism like CRTP or std::variant. I couldn't find one.
For functional programming style "pure functions", we would have:
    int res = pure_function();
    while(true) use(res);
instead of
    while(true) use(pure_function());

Is there anything in the C++ standard that prevents such an optimization?

How would we identify the cases in which such an optimization is possible and the ones in which it is not?

Are there any other reasons for which such an optimization would not be desired?


N.B.: I realize the last two questions might be difficult to answer, but could you at least point me in the right direction for investigating this myself?

Thanks,
Ninu.

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: Virtual function call optimization (memoization) questions

On Tue, Jun 16, 2020 at 2:31 AM Ninu-Ciprian Marginean via cfe-dev <[hidden email]> wrote:

I have a virtual function call in a loop. All the time, the same method is called: [...] Since the object referred to by b never changes to a different object, this could (at least in this case) be cached.

I believe that this is a similar problem to what was recently discussed in the thread titled "[cfe-dev] Smart Pointer Lifetime Optimizations".  In this case (as in that case), the human programmer knows that the call to `b->do_id()` does not actually change the value of b's vptr; but the compiler cannot make that assumption because the C++ Standard technically permits a virtual method to destroy and re-create `b` with a completely different dynamic type.

As in the smart-pointer case, Clang provides a "noescape" attribute that addresses half of the problem, but doesn't provide any "really_const" attribute that allows the human programmer to say "Look, I promise that this function call really does not change its argument!"  (I think LLVM's `readonly` attribute would help, but Clang does not expose that attribute to the C++ programmer.)
I tried slapping all the attributes I could think of onto this member function — noescape, const, pure — and none of them helped.
Even marking the stack objects as `const` didn't help, which surprises me in this case.

–Arthur


Re: Virtual function call optimization (memoization) questions

On Mon, 15 Jun 2020, 23:31 Ninu-Ciprian Marginean via cfe-dev, <[hidden email]> wrote:
[...]

Is there anything in the C++ standard that prevents such an optimization?

The optimization is valid, and clang performs it under -fstrict-vtable-pointers (https://godbolt.org/z/SrlUC8). Unfortunately, there are still some cases where this optimization can regress performance (the annotations that the frontend inserts to enable the optimization can get in the way of other transformations), so it's not enabled by default yet.


Re: Virtual function call optimization (memoization) questions

On Tue, 16 Jun 2020, 11:44 Richard Smith, <[hidden email]> wrote:
On Mon, 15 Jun 2020, 23:31 Ninu-Ciprian Marginean via cfe-dev, <[hidden email]> wrote:
[...]

N.B.: I realize the last two questions might be difficult to answer, but could you at least point me in the right direction for investigating this myself?

If you want more information, this research paper may be a good place to start: https://arxiv.org/pdf/2003.04228.pdf


Re: Virtual function call optimization (memoization) questions

On Tue, 16 Jun 2020, 11:44 Richard Smith, <[hidden email]> wrote:
On Mon, 15 Jun 2020, 23:31 Ninu-Ciprian Marginean via cfe-dev, <[hidden email]> wrote:
[...]

The optimization is valid, and clang performs it under -fstrict-vtable-pointers (https://godbolt.org/z/SrlUC8). Unfortunately, there are still some cases where this optimization can regress performance (the annotations that the frontend inserts to enable the optimization can get in the way of other transformations), so it's not enabled by default yet.

Actually, I think the above performance concern is way out of date. I think we decided that we are happy with the performance delta (it improves a lot more than it regresses) and the remaining blocker for enabling it by default was ensuring LLVM IR ABI compatibility (especially when performing LTO between compilations with the feature turned off and compilations with it turned on).


Re: Virtual function call optimization (memoization) questions

Hello Richard,

It seems the caching works only when the compiler is able to unroll the loop. Otherwise, it does not (in the example I tried, changing the work method's actual parameter to a constant makes the optimization kick in again).

Also, thanks for the reading material. I already went through it, but I need to read it again; some parts are not very clear to me. If you don't mind, I will get back to you with a reply when I reach my conclusions (hopefully soon); just two questions for now. Is this optimization supposed to always yield correct results, or are some cases just supposed to work differently (perhaps leading to undefined behaviour)? I don't have an actual example, but I suspect multithreading, even without exploiting race conditions, could be used to carefully craft a program that breaks this optimization.
When I was thinking about it, I had concerns about the possibility of somehow changing the result of the virtual call resolution, but it hadn't occurred to me that placement new could be used inside the actual call, so this idea with multithreading didn't seem possible. Also, are you aware of any other possibilities for invalidating the cached resolution?

Thanks,
Ninu.

On Tue, Jun 16, 2020 at 9:51 PM Richard Smith <[hidden email]> wrote:
[...]

Re: Virtual function call optimization (memoization) questions

"...but the compiler cannot make that assumption because the C++ Standard technically permits a virtual method to destroy and re-create `b` with a completely different dynamic type." Thanks, I wasn't aware of this. I'll look into the smart pointer lifetime optimizations thread soon.

On Tue, Jun 16, 2020 at 6:27 PM Arthur O'Dwyer <[hidden email]> wrote:
On Tue, Jun 16, 2020 at 2:31 AM Ninu-Ciprian Marginean via cfe-dev <[hidden email]> wrote:

[...]