Re: [libcxx] Extending the behavior of std::hash in libc++

Justin Lebar via cfe-dev
Hi, I wanted to leave some thoughts here from my perspective as an
application developer.  As context, I added heterogeneous lookup
support to Google's internal dense_hash_{set,map}, and my day-to-day
job involves working with lots of high-throughput hashtables.

From my perspective, custom hash algorithms are pretty important, and
I'm concerned that this standard may ossify the state of the world for
applications which could benefit from using a different hash
algorithm.

Even if the standard library randomizes its hash (which I completely
agree is a good idea), I contend that it's still going to be hard for
standard library maintainers to swap out hash functions, because of
the conservative nature of standard library writing.  (As evidence,
consider that, last time I checked, glibc still used a malloc
implementation whose throughput dropped by more than half as soon as
you created a thread.)

I expect that the result of this conservatism is that even if a good
hash function is chosen today, it will rather quickly become outdated,
as hashing is an area of active development in the community at large
(see e.g. xxHash, MetroHash).  I expect this trend to continue -- it's
kind of the perfect nerdsnipe, and it also matters for lots of
high-performance code.

Moreover, applications don't necessarily control the version of the
standard library they link with; even if one standard library
implementation uses a good hash, that's no guarantee that they all
will.  As soon as any one of the standard libraries I support uses a
hash that's too slow or somehow weak (perhaps its randomization is
weak and doesn't prevent flooding), I have to make nontrivial changes
to all of my code.

So I contend that "standard library writers should use a good, fast
hash" implies "at least some standard libraries will use a possibly
bad, probably slow hash" some time in the relatively near future, just
as today some standard libraries use crappy malloc implementations.
And so just as performance-sensitive applications commonly replace
malloc, I expect many performance-sensitive applications will want to
control their default hasher.

Indeed, as the proposal says, "we must plan for hash algorithms that
have a finite (and probably short) operational life," and my
expectation is that a lot of carefully-maintained user code is going
to be better able to play this cat-and-mouse game than the compiler /
standard library headers.

One option is to say, fine, these applications can continue to use a
custom hash functor just as they do today.  But as the proposal says,
ergonomics and experience suggest that most people are just going to
use the default.  Therefore my suggestion is, it would be really nice
if there were an easy way for application developers to swap in a new
hasher, even if it's only for a set of explicitly-listed types.
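
To make that concrete, here is roughly what performance-sensitive code
does today -- a sketch using xxHash's real C API (XXH64), with the
surrounding names invented for illustration:

  #include <cstddef>
  #include <string>
  #include <unordered_map>
  #include <xxhash.h>  // XXH64(ptr, len, seed); link against xxHash

  struct app_hash {
    std::size_t operator()(const std::string& s) const {
      return static_cast<std::size_t>(XXH64(s.data(), s.size(), /*seed=*/0));
    }
  };

  // Every container in the codebase has to name the hasher explicitly:
  template <typename K, typename V>
  using app_map = std::unordered_map<K, V, app_hash>;

What I'd like is a supported way to say "use this algorithm by
default" once, instead of plumbing app_hash through everything.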

I'll refrain from saying something dumb by not suggesting how this
might be specified.  :)

-Justin

Re: [libcxx] Extending the behavior of std::hash in libc++

Geoffrey Romer via cfe-dev
I'm confused. You start by saying that you're concerned that "this standard [meaning P0029, I guess?] may ossify the state of the world for applications which could benefit from using a different hash algorithm", but I don't see how the proposal is any worse than the status quo with respect to the problems you describe. Furthermore, you go on to express a desire for user code to be able to control the default hash algorithm, but if anything P0029 makes that easier, by giving you one central hash algorithm to control, instead of dealing with hundreds of independent per-type hash algorithms.

In any event, "an easy way for application developers to swap in a new hasher" is basically exactly what I'm asking for here. I take no position on whether it would be a good idea to put such a mechanism in the standard itself, but if you want to go that way, surely gaining implementation experience in libc++ would be a good first step?
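
To make the contrast concrete: under the proposed design a type opts
in once, via hash_value, and the algorithm lives behind hash_code.  A
rough sketch against the proposed API (not normative wording):

  struct Point {
    int x, y;
    friend std::hash_code hash_value(std::hash_code code, const Point& p) {
      return hash_combine(std::move(code), p.x, p.y);
    }
  };

Swap the algorithm behind hash_code and every type written this way
picks up the change; no per-type edits required.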


Re: [libcxx] Extending the behavior of std::hash in libc++

Justin Lebar via cfe-dev
On Mon, Dec 7, 2015 at 10:36 AM, Geoffrey Romer <[hidden email]> wrote:
> I'm confused. You start by saying that you're concerned that "this standard
> [meaning P0029, I guess?] may ossify the state of the world for applications
> which could benefit from using a different hash algorithm", but I don't see
> how the proposal is any worse than the status quo with respect to the
> problems you describe.

It's not.

My mail was more to express the position that switching out hashers is
important for many applications.  It seems that perhaps if hashing is
being reconsidered in a big way, the ability to switch out the hasher
should be considered at the same time, lest an opportunity be lost.
Even if switching out hashers is out of scope for P0029, it seems that
if it's desirable and important it should be carefully considered, in
the same way that heterogeneous lookups are being considered.

> Furthermore, you go on to express a desire for user
> code to be able to control the default hash algorithm, but if anything P0029
> makes that easier, by giving you one central hash algorithm to control,
> instead of dealing with hundreds of independent per-type hash algorithms.

I suppose this is true from one perspective.  In particular it's
easier to use a single hasher with complex types -- assuming that
everyone writes their boilerplate using the templated idiom.
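
(Concretely, by "the templated idiom" I mean writing the opt-in
generically over the hash-code type rather than hard-coding
std::hash_code -- a rough sketch against the proposed API:

  struct Point {
    int x, y;
    template <typename HashCode>
    friend HashCode hash_value(HashCode code, const Point& p) {
      return hash_combine(std::move(code), p.x, p.y);
    }
  };

Only types whose boilerplate is written this way can have a different
hasher slotted in without editing them.)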

On the other hand, the data structure using the hash -- unordered_map
or whatever -- still controls which hasher we use.  My contention in
the earlier e-mail is that because (a) people will largely use the
default and (b) library writers are conservative, we will likely end
up with possibly bad, probably slow standard-library hashers in the
near future.  Just as replacing malloc is often a good idea, it seems
to me that replacing the default hasher will also often be a good
idea.  So I was asking that we consider whether it's desirable to
design a system where it's easier to replace the default hasher.

Maybe this is the best we can reasonably do in that respect.  It would
sort of be a bummer (as I understand the proposal), but I get that
there are a lot of constraints on the problem.

> I take no position on
> whether it would be a good idea to put such a mechanism in the standard
> itself, but if you want to go that way, surely gaining implementation
> experience in libc++ would be a good first step?

Sure, and I feel like this thread is part of that process?


Re: [libcxx] Extending the behavior of std::hash in libc++

Geoffrey Romer via cfe-dev

On Mon, Dec 7, 2015 at 2:31 PM, Justin Lebar <[hidden email]> wrote:
> On Mon, Dec 7, 2015 at 10:36 AM, Geoffrey Romer <[hidden email]> wrote:
>> I'm confused. You start by saying that you're concerned that "this standard
>> [meaning P0029, I guess?] may ossify the state of the world for applications
>> which could benefit from using a different hash algorithm", but I don't see
>> how the proposal is any worse than the status quo with respect to the
>> problems you describe.
>
> It's not.
>
> My mail was more to express the position that switching out hashers is
> important for many applications.  It seems that perhaps if hashing is
> being reconsidered in a big way, the ability to switch out the hasher
> should be considered at the same time, lest an opportunity be lost.
> Even if switching out hashers is out of scope for P0029, it seems that
> if it's desirable and important it should be carefully considered, in
> the same way that heterogeneous lookups are being considered.

I see, that makes sense; I've made a note to discuss that in the next version of the paper. I don't intend to actually propose it at this juncture, but I agree it's worth considering.

>> Furthermore, you go on to express a desire for user
>> code to be able to control the default hash algorithm, but if anything P0029
>> makes that easier, by giving you one central hash algorithm to control,
>> instead of dealing with hundreds of independent per-type hash algorithms.
>
> I suppose this is true from one perspective.  In particular it's
> easier to use a single hasher with complex types -- assuming that
> everyone writes their boilerplate using the templated idiom.

Sorry, I don't follow.

> On the other hand, the data structure using the hash -- unordered_map
> or whatever -- still controls which hasher we use.  My contention in
> the earlier e-mail is that because (a) people will largely use the
> default and (b) library writers are conservative, we will likely end
> up with possibly bad, probably slow standard-library hashers in the
> near future.  Just as replacing malloc is often a good idea, it seems
> to me that replacing the default hasher will also often be a good
> idea.  So I was asking that we consider whether it's desirable to
> design a system where it's easier to replace the default hasher.

I get all that. My point is that even if P0029 doesn't directly give you a way to replace the hash algorithm, it gets you closer: under the status quo, even if the standard library gave you a way to change the default hasher, that would just give you a way to change the behavior for each type individually. Under P0029, by contrast, if the standard library gives you a way to change the hash algorithm, then you can change it for all types at a stroke.

> Maybe this is the best we can reasonably do in that respect.  It would
> sort of be a bummer (as I understand the proposal), but I get that
> there are a lot of constraints on the problem.
>
>> I take no position on
>> whether it would be a good idea to put such a mechanism in the standard
>> itself, but if you want to go that way, surely gaining implementation
>> experience in libc++ would be a good first step?
>
> Sure, and I feel like this thread is part of that process?

Agreed; I had thought you were objecting to my request, but it sounds like that's not the case.

Re: [libcxx] Extending the behavior of std::hash in libc++

Justin Lebar via cfe-dev
> I've made a note to discuss that in the next version of the paper. I don't intend to actually propose it at this juncture, but I agree it's worth considering

Fantastic, thank you!

> Under P0029, by contrast, if the standard library gives you a way to change the hash algorithm, then you can change it for all types at a stroke.

Strong agree.

> I had thought you were objecting to my request, but it sounds like that's not the case.

Sorry, there clearly are communication nuances on this list that I'm
not yet hep to.  Thanks for being patient and digging out what I was
trying to say.

-Justin


Re: [libcxx] Extending the behavior of std::hash in libc++

Geoffrey Romer via cfe-dev

On Mon, Dec 7, 2015 at 6:50 PM, Justin Lebar <[hidden email]> wrote:
>> I've made a note to discuss that in the next version of the paper. I don't intend to actually propose it at this juncture, but I agree it's worth considering.
>
> Fantastic, thank you!
>
>> Under P0029, by contrast, if the standard library gives you a way to change the hash algorithm, then you can change it for all types at a stroke.
>
> Strong agree.
>
>> I had thought you were objecting to my request, but it sounds like that's not the case.
>
> Sorry, there clearly are communication nuances on this list that I'm
> not yet hep to.  Thanks for being patient and digging out what I was
> trying to say.

I'm new here too, so this may just be down to my lack of reading comprehension skills.


Re: [libcxx] Extending the behavior of std::hash in libc++

Geoff Pike via cfe-dev
The thing I would like (also... a pony...) is a more explicit
discussion of hash function families, and the ability to select from a
family. There are relatively important hash-based data structures that
require 2+ hash functions, notably Bloom filters and Cuckoo hash
tables. Given the big push to revamp std::hash, this seems like a
great opportunity to address the needs of library authors who want to
be able to select randomly from a family of hash functions. Ideally
selecting new hash functions could happen "at will," e.g., at table
rehash in Cuckoo.

Apologies if this is already covered by your proposed changes to the
C++ standard.

thanks,

Geoff

Re: [libcxx] Extending the behavior of std::hash in libc++

Justin Lebar via cfe-dev
> The thing I would like is a more explicit discussion of hash function families, and the ability to select from a family.

How would this be different from the hashtable / Bloom filter hashing
a random number into the beginning or end of the bitstream?  Is the
idea that we could be more efficient than this, or that we could get
better hashes if we had explicit support for seeding the hash?


Re: [libcxx] Extending the behavior of std::hash in libc++

Geoff Pike via cfe-dev
On Tue, Dec 8, 2015 at 5:07 PM, Justin Lebar <[hidden email]> wrote:
>> The thing I would like is a more explicit discussion of hash function families, and the ability to select from a family.
>
> How would this be different from the hashtable / Bloom filter hashing
> a random number into the beginning or end of the bitstream?  Is the
> idea that we could be more efficient than this, or that we could get
> better hashes if we had explicit support for seeding the hash?

Would the container writer be able to manipulate a bitstream?
Currently, the container just gets a hash function to call, whose
input is a key and whose output is size_t.

Also, the ideal number of bits to prepend, if you're going that route,
would be different for different hash function families. It seems more
logical to me to make the interface more explicit about the existence
of hash function families.
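
To sketch what I mean -- all names here are invented for illustration,
not a concrete proposal -- a family would be a type whose keys
identify members, and a container would draw as many members as it
needs:

  #include <cstddef>

  template <typename Family>
  class family_member_hash {
    typename Family::key_type key_;  // identifies one member of the family
  public:
    explicit family_member_hash(typename Family::key_type k) : key_(k) {}

    template <typename T>
    std::size_t operator()(const T& t) const {
      return Family::hash(key_, t);  // key selects the function; t is the input
    }
  };

  // A Cuckoo table would draw two members, and could draw fresh ones
  // at rehash time:
  //   family_member_hash<F> h1(F::random_key(rng)), h2(F::random_key(rng));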

Geoff

Re: [libcxx] Extending the behavior of std::hash in libc++

Justin Lebar via cfe-dev
> Would the container writer be able to manipulate a bitstream?
> Currently, the container just gets a hash function to call, whose
> input is a key and whose output is size_t.

I don't see why it couldn't manipulate the bitstream?  If I'm reading
this API right, it could do something like

  size_t hash(const T& t) {
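    // mix an arbitrary seed into the stream ahead of the key, then
    // finalize down to size_t: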
    return hash_combine(std::hash_code(), seed, t)();
  }

> Also, the ideal number of bits to prepend, if you're going that route, would be different for different hash function families.

I guess what you're saying is that the algorithm keeps some number of
bits as internal state, and ideally you'd want to fill up all of those
bits with a seed, rather than just using a size_t?

But in practice we're not usually interested in exploring the full
range of possible hashers, especially given that flooding protection
is already being taken care of by randomizing at startup.  If you know
you'll need n different hashers, is it not sufficient to seed with
values in the range [0, n)?

I guess you could just pass initial state to the hash_code constructor
(again if I'm reading this API right).  But that's kind of an ugly API
if the constructor takes a different number of bytes depending on the
hash function family.  Maybe if this information is really needed it
would be better to expose a typedef on some hasher type that tells you
how many bits of entropy it wants -- then it's up to you to prepend
the right amount.
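
Something with this hypothetical shape (invented names, just to
illustrate the typedef idea):

  #include <cstddef>

  struct my_hasher {
    // how much seed material this hash function family can absorb:
    static constexpr std::size_t entropy_bits = 128;
    // ... the usual hashing interface ...
  };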


This all brings up another concern:

Looking at the sample implementation [1], hashing a small value, e.g.
a pointer, an int64, or even a short string seems necessarily slower
than if I just implemented std::hash<T> to call FarmHash64 directly.
The reason being that the new API doesn't know a priori how many bytes
I'm going to be hashing, and I'm free to split up those bytes into as
many calls to the hasher as I want.  Therefore it has to carefully
buffer things until I finalize the hasher.

In contrast, if I just call FarmHash64(some_int), it drops right into
a fast path and doesn't do any buffering and so on.

I like that the proposal lets us hash aggregate types (vectors,
structs, etc.) while preserving the full state of the hasher (that is,
without finalizing down to size_t), but in accordance with the
zero-cost principle of C++, it's probably important that simple types
are hashed as fast as you could hash them if you were invoking
FarmHash64 directly?

[1] https://github.com/google/hashing-demo/blob/N0029R0/farmhash.h


Re: [libcxx] Extending the behavior of std::hash in libc++

Joerg via cfe-dev
On Tue, Dec 08, 2015 at 11:33:19PM -0800, Justin Lebar via cfe-dev wrote:
> But in practice we're not usually interested in exploring the full
> range of possible hashers, especially given that flooding protection
> is already being taken care of by randomizing at startup.  If you know
> you'll need n different hashers, is it not sufficient to seed with
> values in the range [0, n)?

No. Depending on the specific context, you want either a family of
universal hash functions or even pairwise independent hash functions.
That's a lot stronger than just n different hash functions, and
whether hash combining is enough to qualify is quite questionable.
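
For concreteness, the textbook construction (Carter & Wegman): pick a
prime p larger than the key universe, draw a and b uniformly at
random, and use x -> (a*x + b) mod p; that family is pairwise
independent.  A minimal sketch, assuming __int128 (a GCC/Clang
extension) is available:

  #include <cstdint>
  #include <random>

  struct cw_hash {
    // Mersenne prime 2^61 - 1; keys must be smaller than p.
    static constexpr std::uint64_t p = (1ull << 61) - 1;
    std::uint64_t a, b;  // drawn uniformly from [0, p)

    std::uint64_t operator()(std::uint64_t x) const {
      // (a*x + b) mod p, computed in 128 bits to avoid overflow
      return static_cast<std::uint64_t>(
          (static_cast<unsigned __int128>(a) * x + b) % p);
    }

    static cw_hash draw(std::mt19937_64& rng) {
      std::uniform_int_distribution<std::uint64_t> dist(0, p - 1);
      return {dist(rng), dist(rng)};
    }
  };

Whether seeding a general-purpose hash via hash_combine gives you
anything with guarantees of that kind is exactly the open question.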

Joerg

Re: [libcxx] Extending the behavior of std::hash in libc++

Geoffrey Romer via cfe-dev

On Tue, Dec 8, 2015 at 11:33 PM, Justin Lebar <[hidden email]> wrote:
> [...]
>
> Looking at the sample implementation [1], hashing a small value, e.g.
> a pointer, an int64, or even a short string seems necessarily slower
> than if I just implemented std::hash<T> to call FarmHash64 directly.
> The reason being that the new API doesn't know a priori how many bytes
> I'm going to be hashing, and I'm free to split up those bytes into as
> many calls to the hasher as I want.  Therefore it has to carefully
> buffer things until I finalize the hasher.
>
> In contrast, if I just call FarmHash64(some_int), it drops right into
> a fast path and doesn't do any buffering and so on.
>
> I like that the proposal lets us hash aggregate types (vectors,
> structs, etc.) while preserving the full state of the hasher (that is,
> without finalizing down to size_t), but in accordance with the
> zero-cost principle of C++, it's probably important that simple types
> are hashed as fast as you could hash them if you were invoking
> FarmHash64 directly?
>
> [1] https://github.com/google/hashing-demo/blob/N0029R0/farmhash.h

The implementation is permitted to specialize std::hash<T> to call FarmHash64 (or whatever) directly when T doesn't depend on a user-defined type (and the user is permitted to do so when T does depend on a user-defined type), so it's possible to fix this as a QOI matter if it arises in practice. However, I don't expect this to be a practical issue. The API has been very carefully crafted to ensure that the optimizer can discover and take advantage of situations where only a fixed number of bytes are being hashed, and even if it can't eliminate the buffering, the cost of a single memcpy is quite small compared to the cost of evaluating a modern hash function.
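
For instance, the implementation could ship something along these
lines (my illustration, not proposal text; impl::Hash64 stands in for
however the library names its internal entry point):

  #include <cstddef>
  #include <cstdint>
  #include <cstring>
  #include <functional>

  namespace impl { std::uint64_t Hash64(const char* s, std::size_t len); }

  namespace std {
  template <>
  struct hash<long long> {  // only the implementation may specialize this
    size_t operator()(long long v) const noexcept {
      char buf[sizeof v];
      memcpy(buf, &v, sizeof v);  // well-defined bit copy of the key
      return static_cast<size_t>(impl::Hash64(buf, sizeof buf));
    }
  };
  }

Since the length is a compile-time constant here, there's no buffering
decision left for the optimizer to puzzle out.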
 


Re: [libcxx] Extending the behavior of std::hash in libc++

Geoff Pike via cfe-dev
>> Would the container writer be able to manipulate a bitstream?
>> Currently, the container just gets a hash function to call, whose
>> input is a key and whose output is size_t.
>
> I don't see why it couldn't manipulate the bitstream?  If I'm reading
> this API right, it could do something like
>
>   size_t hash(const T& t) {
>     return hash_combine(std::hash_code(), seed, t)();
>   }
>
>> Also, the ideal number of bits to prepend, if you're going that route, would be different for different hash function families.
>
> I guess what you're saying is that the algorithm keeps some number of
> bits as internal state, and ideally you'd want to fill up all of those
> bits with a seed, rather than just using a size_t?

Yes. Traditionally, when you select from a hash function family you
initialize some state. (In crypto literature, that's the "key," and
the thing to be hashed is the "message.") The size of the state and
the allowable range for each number in it depends on the family.
Sometimes an allowable range isn't a power of 2. (For a hairy example,
see vmac_set_key:
http://lxr.free-electrons.com/source/crypto/vmac.c#L487 .)

> But in practice we're not usually interested in exploring the full
> range of possible hashers, especially given that flooding protection
> is already being taken care of by randomizing at startup.  If you know

Randomization at startup would offer some protection, but it doesn't
solve the hash-flooding problem; even with various hash-function
families it is only conjectured to thwart the worst hash-flooding
attacks.  I strongly believe hash-based containers should have some
hash-flooding detection and evasion, and I'm working on that.  One
idea that looks promising is to have containers use a quick-and-dirty
hash function most of the time, with judicious use of
randomly-selected-slower-but-nicer hash functions as needed.
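
A rough sketch of the shape I have in mind -- every name here is
invented, and the "nicer" mixer is a placeholder, not a recommendation:

  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <random>

  class flood_aware_hash {
    std::uint64_t seed_ = 0;
    bool hardened_ = false;

  public:
    static constexpr std::size_t kSuspicionThreshold = 64;

    std::size_t operator()(std::uint64_t key) const {
      if (!hardened_)
        return std::hash<std::uint64_t>{}(key);   // quick-and-dirty path
      std::uint64_t x = key ^ seed_;              // stand-in for a keyed,
      x *= 0x9e3779b97f4a7c15ull;                 // slower-but-nicer function
      x ^= x >> 32;
      return static_cast<std::size_t>(x);
    }

    // The table reports the probe-sequence length it just observed;
    // true means "rehash everything with the newly selected function."
    bool note_probe_length(std::size_t n) {
      if (hardened_ || n < kSuspicionThreshold) return false;
      seed_ = (std::uint64_t{std::random_device{}()} << 32) ^
              std::random_device{}();             // select a random member
      hardened_ = true;
      return true;
    }
  };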

> you'll need n different hashers, is it not sufficient to seed with
> values in the range [0, n)?
>
> I guess you could just pass initial state to the hash_code constructor
> (again if I'm reading this API right).  But that's kind of an ugly API
> if the constructor takes a different number of bytes depending on the
> hash function family.  Maybe if this information is really needed it
> would be better to expose a typedef on some hasher type that tells you
> how many bits of entropy it wants -- then it's up to you to prepend
> the right amount.
>
>
> This all brings up another concern:
>
> Looking at the sample implementation [1], hashing a small value, e.g.
> a pointer, an int64, or even a short string seems necessarily slower
> than if I just implemented std::hash<T> to call FarmHash64 directly.
> The reason being that the new API doesn't know a priori how many bytes
> I'm going to be hashing, and I'm free to split up those bytes into as
> many calls to the hasher as I want.  Therefore it has to carefully
> buffer things until I finalize the hasher.
>
> In contrast, if I just call FarmHash64(some_int), it drops right into
> a fast path and doesn't do any buffering and so on.
>
> I like that the proposal lets us hash aggregate types (vectors,
> structs, etc.) while preserving the full state of the hasher (that is,
> without finalizing down to size_t), but in accordance with the
> zero-cost principle of C++, it's probably important that simple types
> are hashed as fast as you could hash them if you were invoking
> FarmHash64 directly?

Yes, from a performance standpoint, APIs that let me specify exactly
what I'm doing are probably going to be best. E.g., maybe I want to
initialize and use a few hash functions that map strings to integers
in [0, number of buckets). I'm guessing the counterargument relates to
API complexity.

Geoff