[RFC] new format string attributes

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

[RFC] new format string attributes

Fangrui Song via cfe-dev
Hey guys,

I'd like to add an extension to Clang's format string checking, as well as extend the format string attributes to add UTF-16 (u) and UTF-32 (U) length modifiers to the c, C, s, and S type specifiers.

I've chosen lower case and upper case u to match the C and C++ standards for string literal specifiers, u"" and U"" telling the compiler to create a UTF-16 and UTF-32 string literals respectively.

as for the function attributes, my plan is to create uprintf and Uprintf arguments to the gcc style attributes e.g. __attribute__(__format__(uprintf)) and __attribute__(__format__(Uprintf))

I've chosen uprintf and Uprintf to match C and C++'s string type indicators as well, though it might be a better idea to go with u16printf and u32printf to match C++'s u16string and u32string types?
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] new format string attributes

Fangrui Song via cfe-dev
Thank you, this sounds like interesting work. I have a few questions below.

On Tue, Mar 24, 2020 at 4:22 PM Marcus Johnson via cfe-dev
<[hidden email]> wrote:
>
> Hey guys,
>
> I'd like to add an extension to Clang's format string checking, as well as extend the format string attributes to add UTF-16 (u) and UTF-32 (U) length modifiers to the c, C, s, and S type specifiers.
>
> I've chosen lower case and upper case u to match the C and C++ standards for string literal specifiers, u"" and U"" telling the compiler to create a UTF-16 and UTF-32 string literals respectively.

Are you proposing you want to add these length modifiers to llvm-libc?
Or do you have an existing libc implementation that already supports
them and you would like Clang's format string checking functionality
to honor them?

You should not use 'u' as a length modifier as that's reserved for
future standardization because it's lowercase (see C17 7.31.11p1).
Perhaps U16 and U32 would be reasonable length modifiers though? Also,
what about U8 for UTF-8 strings, or are you planning to not handle
those for some reason?

Also, are there any active proposals within the C committee for this
functionality? Are you planning to introduce any? (This relates to
https://clang.llvm.org/get_involved.html "Contributing Extensions to
Clang" bullet 4.)

> as for the function attributes, my plan is to create uprintf and Uprintf arguments to the gcc style attributes e.g. __attribute__(__format__(uprintf)) and __attribute__(__format__(Uprintf))
>
> I've chosen uprintf and Uprintf to match C and C++'s string type indicators as well, though it might be a better idea to go with u16printf and u32printf to match C++'s u16string and u32string types?

Why do you need a new format archtype?

~Aaron

> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] new format string attributes

Fangrui Song via cfe-dev
Hey Aaron,

I'm only planning on adding support to Clang for static analysis, FixIts, etc.

I am planning on introducing a proposal to the C and C++ committees to add these length modifiers for c and s type specifiers.

As for UTF-8 I was planning on just using the no-length-modifier c or s type specifier.

Why do you need a new format archtype?

Once again, I'm just trying to add support for static analysis, checking, etc.

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] new format string attributes

Fangrui Song via cfe-dev
On Wed, Mar 25, 2020 at 10:32 AM Marcus Johnson
<[hidden email]> wrote:
>>
>> Hey Aaron,
> I'm only planning on adding support to Clang for static analysis, FixIts, etc.

This has me a bit confused then. If the libc implementation doesn't
know about the new length modifiers, why do you want to add them to
the format string checker? We don't want to suggest users use length
modifiers that are UB at runtime because the C library doesn't support
them.

> I am planning on introducing a proposal to the C and C++ committees to add these length modifiers for c and s type specifiers.
>
> As for UTF-8 I was planning on just using the no-length-modifier c or s type specifier.

Okay, so you hope that eventually the standards committees will adopt
this, and then drive the libc implementations to support it?

> > Why do you need a new format archtype?
>
> Once again, I'm just trying to add support for static analysis, checking, etc.

That doesn't answer why we need a new format archtype. The archtypes
are used because we want the check to model behavior specific to some
API. If I understand your proposal properly, you're not proposing to
add anything like uprintf() to a C library (and such an API doesn't
already exist), so adding a new archtype surprises me. I would have
thought the existing archtypes would suffice, but maybe I'm still
misunderstanding a part of your proposal.

~Aaron
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] new format string attributes

Fangrui Song via cfe-dev

> That doesn't answer why we need a new format archtype. The archtypes
> are used because we want the check to model behavior specific to some
> API. If I understand your proposal properly, you're not proposing to
> add anything like uprintf() to a C library (and such an API doesn't
> already exist), so adding a new archtype surprises me. I would have
> thought the existing archtypes would suffice, but maybe I'm still
> misunderstanding a part of your proposal.
>
> ~Aaron

I'm not against adding printf functions to the C standard, I've got my own formatters that work with UTF-16 and UTF-32 and I just want to add compile time checking to them really.

the point of adding new length modifiers to the standard is so the current functions can accept a UTF-16 or UTF-32 strings or characters as arguments
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] new format string attributes

Fangrui Song via cfe-dev
On Wed, Mar 25, 2020 at 11:51 AM Marcus Johnson
<[hidden email]> wrote:

> > That doesn't answer why we need a new format archtype. The archtypes
> > are used because we want the check to model behavior specific to some
> > API. If I understand your proposal properly, you're not proposing to
> > add anything like uprintf() to a C library (and such an API doesn't
> > already exist), so adding a new archtype surprises me. I would have
> > thought the existing archtypes would suffice, but maybe I'm still
> > misunderstanding a part of your proposal.
> >
> > ~Aaron
>
> I'm not against adding printf functions to the C standard, I've got my own formatters that work with UTF-16 and UTF-32 and I just want to add compile time checking to them really.
>
> the point of adding new length modifiers to the standard is so the current functions can accept a UTF-16 or UTF-32 strings or characters as arguments

Ah, I think I may be starting to understand better now, thank you! It
sounds like what you're looking for is a way to add length modifiers
and format specifier extensions for a custom printf-like API, with the
current use case being for UTF-16 and UTF-32 strings. Would that be an
accurate understanding?

~Aaron
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] new format string attributes

Fangrui Song via cfe-dev
Ah, I think I may be starting to understand better now, thank you! It
sounds like what you're looking for is a way to add length modifiers
and format specifier extensions for a custom printf-like API, with the
current use case being for UTF-16 and UTF-32 strings. Would that be an
accurate understanding?

I think so?

length modifiers for c and s, yes.

Also I'm trying to extend the attributes __attribute__(__format__(X, Y, Z)) where X is a type for UTF-16 and UTF-32, and Y and Z are unchanged

Not sure what you mean by format specifier extensions beyond these two.


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] new format string attributes

Fangrui Song via cfe-dev
In reply to this post by Fangrui Song via cfe-dev
On Wed, Mar 25, 2020 at 12:59 PM Aaron Ballman via cfe-dev <[hidden email]> wrote:
On Wed, Mar 25, 2020 at 11:51 AM Marcus Johnson <[hidden email]> wrote:
> > That doesn't answer why we need a new format archtype. The archtypes
> > are used because we want the check to model behavior specific to some
> > API. If I understand your proposal properly, you're not proposing to
> > add anything like uprintf() to a C library (and such an API doesn't
> > already exist), so adding a new archtype surprises me. I would have
> > thought the existing archtypes would suffice, but maybe I'm still
> > misunderstanding a part of your proposal.

In particular, if all you want to do is support `__attribute__((format(printf, x, y)))` on function parameters that happen to be of type `char8_t*`, `char16_t*` or `char32_t*`, that should be trivial.  Just look at how Clang works for arguments of type `wchar_t*` and copy that.

...Oh wait, it looks like neither GCC nor Clang actually implement format-string checking for wchar_t format strings!

    std::wprintf("%s", 42);  // no diagnostic emitted

So that would be a very good place to start, IMO. Once the code is in place to format-check wide string literals, it should be trivial to extend it to also format-check char{8,16,32}_t literals.
Here's the existing bug report: https://bugs.llvm.org/show_bug.cgi?id=16810

Orthogonally, you seem to be proposing that there should be some new printf format specifiers besides %s %c %[ (for char) and %ls %lc %l[ (for wchar_t).  This is not a Clang issue; this is a library-design issue that you should think about as you write your library function that takes a format string (you know, the one you want to apply __attribute__((format)) to).  If you are not writing a library function, then you have nothing to apply the attribute to, and therefore there's no reason for you to need anything changed.
You throw out the ideas of %us for char16_t, %Us for char32_t, and have no suggestion for char8_t. However, you cannot use %us as a format specifier, because printf already gives that sequence a valid meaning:

    printf("hello %us world", 42u);  // prints "hello 42s world"

My off-the-top-of-my-head idea is that you should take a hint from MSVC; they provide %I32d, %I64d, etc., for integer types, so how about %C8s, %C16s, and %C32s for Unicode character string types?  However, again, this is an issue to think about as you design your `MyPrintfLikeFunction` within your own codebase. Maybe you'll find that you don't even need a format specifier for those types.

(FWIW, the C and C++ party line seems to be that no "%C16s" or "%C32s" is needed, because the modern approach is to separate transcoding from output. You shouldn't be printf'ing Unicode strings directly; you should be first transcoding them into `char` or `wchar_t` strings, and then printf'ing or wprintf'ing those strings. Personally I don't think that approach is very helpful in practice, though.)

–Arthur

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: [RFC] new format string attributes

Fangrui Song via cfe-dev
Hey Arthur,

I really like your idea to just use the regular printf format attribute for all formats.

We can wrap the wprintf attribute as well in the same logic, all we’d have to check is that the format argument is a literal, a string, and that the strings designated type (via u8, u, U, L, etc prefixes) match the actual type expected by printf.

As for changing the length modifiers to U16, U32, etc sure it’s kinda disappointing, but it’s doable.

On Mar 25, 2020, at 1:53 PM, Arthur O'Dwyer via cfe-dev <[hidden email]> wrote:


On Wed, Mar 25, 2020 at 12:59 PM Aaron Ballman via cfe-dev <[hidden email]> wrote:
On Wed, Mar 25, 2020 at 11:51 AM Marcus Johnson <[hidden email]> wrote:
> > That doesn't answer why we need a new format archtype. The archtypes
> > are used because we want the check to model behavior specific to some
> > API. If I understand your proposal properly, you're not proposing to
> > add anything like uprintf() to a C library (and such an API doesn't
> > already exist), so adding a new archtype surprises me. I would have
> > thought the existing archtypes would suffice, but maybe I'm still
> > misunderstanding a part of your proposal.

In particular, if all you want to do is support `__attribute__((format(printf, x, y)))` on function parameters that happen to be of type `char8_t*`, `char16_t*` or `char32_t*`, that should be trivial.  Just look at how Clang works for arguments of type `wchar_t*` and copy that.

...Oh wait, it looks like neither GCC nor Clang actually implement format-string checking for wchar_t format strings!

    std::wprintf("%s", 42);  // no diagnostic emitted

So that would be a very good place to start, IMO. Once the code is in place to format-check wide string literals, it should be trivial to extend it to also format-check char{8,16,32}_t literals.
Here's the existing bug report: https://bugs.llvm.org/show_bug.cgi?id=16810

Orthogonally, you seem to be proposing that there should be some new printf format specifiers besides %s %c %[ (for char) and %ls %lc %l[ (for wchar_t).  This is not a Clang issue; this is a library-design issue that you should think about as you write your library function that takes a format string (you know, the one you want to apply __attribute__((format)) to).  If you are not writing a library function, then you have nothing to apply the attribute to, and therefore there's no reason for you to need anything changed.
You throw out the ideas of %us for char16_t, %Us for char32_t, and have no suggestion for char8_t. However, you cannot use %us as a format specifier, because printf already gives that sequence a valid meaning:

    printf("hello %us world", 42u);  // prints "hello 42s world"

My off-the-top-of-my-head idea is that you should take a hint from MSVC; they provide %I32d, %I64d, etc., for integer types, so how about %C8s, %C16s, and %C32s for Unicode character string types?  However, again, this is an issue to think about as you design your `MyPrintfLikeFunction` within your own codebase. Maybe you'll find that you don't even need a format specifier for those types.

(FWIW, the C and C++ party line seems to be that no "%C16s" or "%C32s" is needed, because the modern approach is to separate transcoding from output. You shouldn't be printf'ing Unicode strings directly; you should be first transcoding them into `char` or `wchar_t` strings, and then printf'ing or wprintf'ing those strings. Personally I don't think that approach is very helpful in practice, though.)

–Arthur
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev