
Supporting -finput-charset, -fexec-charset and -fwide-exec-charset?


Supporting -finput-charset, -fexec-charset and -fwide-exec-charset?

Scott Conger
Hi all,

I'm new to clang. I've been looking at adding support for
-finput-charset, -fexec-charset and -fwide-exec-charset. I took a
look through the mailing list archives and code, and I haven't seen a
lot of discussion of this except in a general sense. Has anyone taken
a more serious look at this?

From what I can tell, the general assumption inside clang is that
both the input and execution character sets are some form of
single-byte, ASCII-based encoding. Lexer.cpp actually has an ASCII
table in it. Multi-byte characters in encodings like UTF-8 are
incorrectly treated as multiple separate characters.

That is, clang is working like this:

input-charset: Bytes are taken straight from the file and assumed to
be ASCII, with no awareness of multi-byte encodings.
exec-charset: Same as the input.
wide-exec-charset: Not really a character set. A single byte from the
input file is copied into each wide character, with no awareness of
multi-byte encodings. It also seems to assume little-endian byte order.

GCC, by default, works like this:

input-charset: The locale-specified encoding, or UTF-8 if it cannot
detect the system encoding. Picks up byte order marks (sometimes?)
since it relies on iconv.
exec-charset: UTF-8
wide-exec-charset: UTF-16 or UTF-32 (depending on the size of wchar_t)
in the native byte order

Obviously, you can override these values with the command-line
options; gcc accepts anything the native iconv library supports.

Given that clang assumes single-byte ASCII all over the place, the
best way to go would seem to be to use iconv (or the native Windows
API) to convert input files into UTF-8 and use that internally.
Existing code will keep working for character values in the basic
ASCII set (that is, values smaller than 128), and multi-byte
characters should work correctly in comments. The one place that would
need a serious update is the parsing of string literals (i.e.
StringLiteralParser::init).
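
Roughly, what I have in mind is a helper like the following minimal
iconv sketch (the name and shape are made up, not actual clang code,
and error recovery is glossed over):

#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: transcode an input buffer into UTF-8 via iconv.
// A real version would live behind SourceManager and reuse its buffers.
std::string convertToUTF8(const char *data, size_t size,
                          const char *fromCharset) {
  iconv_t cd = iconv_open("UTF-8", fromCharset);
  if (cd == (iconv_t)-1)
    throw std::runtime_error("unsupported input charset");
  std::string result;
  char *in = const_cast<char *>(data);
  size_t inLeft = size;
  while (inLeft != 0) {
    char buf[4096];
    char *out = buf;
    size_t outLeft = sizeof(buf);
    size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
    result.append(buf, sizeof(buf) - outLeft); // keep what was converted
    if (rc == (size_t)-1 && errno != E2BIG) {  // E2BIG just means "loop"
      iconv_close(cd);
      throw std::runtime_error("invalid byte sequence in input file");
    }
  }
  iconv_close(cd);
  return result;
}

The interesting part is diagnostics: reporting the byte offset of a
bad sequence, and handling a truncated sequence at the end of the
file (EINVAL).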

Any comments, suggestions, etc? Specifically, input from people that
have looked at supporting universal-character-names or unicode string
literals?

-Scott
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev

Re: Supporting -finput-charset, -fexec-charset and -fwide-exec-charset?

Chris Lattner
On Jun 12, 2011, at 9:59 PM, Scott Conger wrote:
> I'm new to clang. I've been looking at adding support for
> -finput-charset, -fexec-charset and -fwide-exec-charset. I took a
> look through the mailing list archives and code, and I haven't seen a
> lot of discussion of this except in a general sense. Has anyone taken
> a more serious look at this?

Hi Scott,

It would be great for you to tackle this.  Some people (including me) have thought about this a bit, but no specific work has started. Your assessment of how things work (everything ASCII) is right on target.

I'd suggest starting with this approach:

1. Make the compiler fully UTF-8 clean and happy.  This is moderately easy; the only major concern is that the lexer is highly performance-sensitive.  Your plan makes sense to me.
2. Introduce support for UCNs.
3. Add support for specifying/detecting input charsets (e.g. through BOMs).

Part #3 can be handled in several ways.  The best way to start is to have SourceManager detect that files need to be remapped when opened, and just rewrite the entire input buffer into UTF-8.  This way we only pay a performance hit when dealing with files in strange encodings.  For (common!) single-byte encodings that map 0-127 onto normal ASCII characters, SourceManager can scan the file, and if there are no high characters, no remapping is required.
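
The pure-ASCII check really is trivial; as a sketch (a real version
would probably go a machine word at a time, since it runs over every
input buffer):

#include <cstddef>

// Sketch: true if the buffer is pure 7-bit ASCII, in which case no
// remapping is needed for any ASCII-superset input encoding.
static bool isPlainASCII(const unsigned char *buf, size_t size) {
  for (size_t i = 0; i != size; ++i)
    if (buf[i] & 0x80)  // high bit set: not plain ASCII
      return false;
  return true;
}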

-Chris

Re: Supporting -finput-charset, -fexec-charset and -fwide-exec-charset?

Scott Conger
Thanks for the reply Chris.

I was going to put off universal-character-names for now. It should be
easy to add afterward.

For the BOM and input character sets, the general scheme I have at the
moment is (a rough sketch of the BOM check follows the list):

* Check for a BOM (warning if it contradicts the inputcharset option)
* If the inputcharset option is UTF-8, the locale-specified encoding
is UTF-8, or there is a UTF-8 BOM, just validate the input (there is
a performance hit later on if there can be invalid UTF-8)
* If the user specified a non-UTF-8 inputcharset, use iconv to
convert (ignoring the BOM, which might be a false positive)
* For any other BOM, use iconv to convert
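
For the BOM check itself I'm picturing something like this (a
hypothetical helper, not existing clang code); the one subtlety is
that UTF-32 LE has to be tested before UTF-16 LE, since FF FE 00 00
starts with FF FE:

#include <cstddef>
#include <cstring>

// Sketch: map a leading byte order mark to an iconv-style charset
// name, or return a null pointer if the file has no recognizable BOM.
static const char *detectBOM(const unsigned char *buf, size_t size) {
  if (size >= 3 && std::memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
    return "UTF-8";
  if (size >= 4 && std::memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0)
    return "UTF-32BE";
  if (size >= 4 && std::memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0)
    return "UTF-32LE"; // must precede the UTF-16LE test
  if (size >= 2 && std::memcmp(buf, "\xFE\xFF", 2) == 0)
    return "UTF-16BE";
  if (size >= 2 && std::memcmp(buf, "\xFF\xFE", 2) == 0)
    return "UTF-16LE";
  return 0; // no BOM: fall back to -finput-charset or the locale
}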

The fallback is to check if every byte is < 128, using iconv or the
Windows API to convert from the native encoding if a high bit is set.
This appears to be a valid assumption everywhere except IBM machines
with native EBCDIC, which I'm ignoring since Clang won't compile
there anyway.

The main issue I've run into is compatibility. My experimentation
with gcc shows a lot of edge cases, such as specifying a
wide-exec-charset that is some 8-bit encoding, or putting octal/hex
escapes in a string in a way that violates the alignment.

-Scott


Re: Supporting -finput-charset, -fexec-charset and -fwide-exec-charset?

Eric Christopher

On Jun 15, 2011, at 12:10 AM, Scott Conger wrote:

> This appears to be a valid assumption everywhere except IBM
> machines with native EBCDIC, which I'm ignoring since Clang won't
> compile there anyway.

FWIW, this should work fine with -fexec-charset, since you only
care about the strings that are being output into the object file.

The rest of the plan sounds good to me.

-eric

Re: Supporting -finput-charset, -fexec-charset and -fwide-exec-charset?

Chris Lattner

On Jun 15, 2011, at 12:10 AM, Scott Conger wrote:

> Thanks for the reply Chris.
>
> I was going to put off universal-character-names for now. It should be
> easy to add afterward.

Makes sense.

> For the BOM and input character sets, the general scheme I have at the moment is:
>
> * Check for a BOM (warning if it contradicts the inputcharset option)

Ok, I don't know GCC's policy on this (it's best to follow it for compatibility unless it is completely insane), but it seems reasonable that the -finput-charset option should only specify a charset for files without a BOM.  If a file has a BOM, we should probably follow it.

> * If the inputcharset option is UTF-8, the locale-specified encoding
> is UTF-8, or there is a UTF-8 BOM, just validate the input (there is
> a performance hit later on if there can be invalid UTF-8)

If I understand correctly, the only invalid UTF-8 occurs with high characters.  This can probably be inlined into the lexer at near-zero cost to avoid a prepass.
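
That is, when the lexer sees a lead byte with the high bit set, it can
validate just that one sequence in place. A sketch of the shape I mean
(it punts on rejecting overlong encodings and surrogates, which a real
validator also has to do):

#include <cstddef>

// Sketch: validate one multi-byte UTF-8 sequence starting at p (whose
// high bit is already known to be set); advance p past it on success.
static bool validateUTF8Sequence(const unsigned char *&p,
                                 const unsigned char *end) {
  unsigned len;
  if ((*p & 0xE0) == 0xC0)      len = 2; // 110xxxxx
  else if ((*p & 0xF0) == 0xE0) len = 3; // 1110xxxx
  else if ((*p & 0xF8) == 0xF0) len = 4; // 11110xxx
  else return false;                     // stray continuation or bad lead
  if ((size_t)(end - p) < len) return false;
  for (unsigned i = 1; i != len; ++i)
    if ((p[i] & 0xC0) != 0x80)           // trailing bytes are 10xxxxxx
      return false;
  p += len;
  return true;
}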

> * If the user specified a non-UTF-8 inputcharset, use iconv to
> convert (ignoring the BOM, which might be a false positive)
> * For any other BOM, use iconv to convert

Yep.

> The fallback is to check if every byte is < 128, using iconv or the
> Windows API to convert from the native encoding if a high bit is set.
> This appears to be a valid assumption everywhere except IBM machines
> with native EBCDIC, which I'm ignoring since Clang won't compile
> there anyway.

Yes, we don't care about EBCDIC. If someone comes around with a deep passion for it later, we can deal with it then.

> The main issue I've run into is compatibility. My experimentation
> with gcc shows a lot of edge cases, such as specifying a
> wide-exec-charset that is some 8-bit encoding, or putting octal/hex
> escapes in a string in a way that violates the alignment.

I'm not sure what you mean here.

-Chris

Re: Supporting -finput-charset, -fexec-charset and -fwide-exec-charset?

Scott Conger
For the BOM detection, I meant that we should always go with the
user-specified -finput-charset option, simply issuing a warning if
the BOM contradicts what the user told us. gcc always goes with what
the user specified in -finput-charset.

As for edge cases, consider something like:

char c = '(some Han character that takes 4 bytes)';
char d = '0xFFC';

Here, gcc will trim the character. You get:

test.c:6:12: warning: character constant too long for its type
test.c:6:13: warning: character constant too long for its type

This seems to be tied to the -Wcharacter-truncation option.
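
The same trimming is easy to reproduce with a plain ASCII
multi-character constant; gcc's documented (implementation-defined)
behavior packs the characters into an int, so the conversion to char
keeps the last one:

#include <cstdio>

int main() {
  char c = 'AB';           // gcc warns: multi-character character
                           // constant; its value here is 0x4142
  std::printf("%c\n", c);  // prints 'B': 0x4142 truncated to low byte
  return 0;
}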

Another example is:

wchar_t* str = L"abcdef\x54ghijklmnop";

If sizeof(wchar_t) == 4, we normally expect characters to be aligned
to a 4-byte boundary. The string suggests that a single byte is
inserted in the middle, which leaves the compiler with a question of
interpretation. It seems that gcc zero-pads the value to align it.
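
A quick way to see that interpretation: the escape ends up as one
whole zero-extended wchar_t element, not as a lone byte spliced into
the stream:

#include <cstdio>

int main() {
  const wchar_t *str = L"abcdef\x54ghijklmnop";
  // With sizeof(wchar_t) == 4, the escape fills one whole element:
  // str[6] is 0x00000054 ('T'), zero-padded, rather than a single
  // byte wedged between 'f' and 'g'.
  std::printf("%lx\n", (unsigned long)str[6]); // prints 54
  return 0;
}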

A more complex case is when you take that same string and tell gcc
-fwide-exec-charset=ASCII. This doesn't make a lot of sense, but gcc
happily takes it.

In these cases, we simply have to be sure we correctly match gcc's
behavior and have adequate test coverage.

-Scott
