Musings about UCNs

Musings about UCNs

Sean Hunt
I've been eyeing UCNs for a while, and so I've got a few musings to
share; perhaps they will help whoever gets around to implementing them.

Disclaimer: I'm basing this off the C++ spec. If there are
differences/incompatibilities with the C spec, I haven't noticed.

Thoughts:
  - We should probably use UTF-8 internally because it has a bunch of
nice features, like not breaking any existing code within clang.
  - We could also accept UTF-8 as the default character encoding and
process extended characters directly. The driver should handle other
encodings by converting them to UTF-8.
  - Pursuant to that, does clang currently assume it's being compiled on
an ASCII system?
  - To reduce performance hits, we should only scan a given identifier
once to see if it contains any illegal characters. I'm thinking the
Token should store whether it contains a universal-character as it
stores whether or not it needs cleaning, and IdentifierTable::get() gets
a default parameter added; if it's set and the identifier is not already
in the table, then a check is performed, ideally on a precompiled trie.
  - For literals, UCN processing will occur in the token lexer invoked
by Sema later on, including conversion to the execution character set if
necessary.
  - How extended characters should be stored in names is unclear.
Ancient cxx-abi-dev discussions are undecided on whether simply using
UTF-8 is correct. GCC code seems to suggest this is the intent in the
long run.
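
A minimal sketch of the "is this character allowed in an identifier" check mentioned above. It swaps the precompiled trie for a sorted range table searched with std::upper_bound, which is simpler to show. The names CodePointRange, kAllowed and isAllowedInIdentifier are hypothetical, and the ranges are placeholders for illustration only, not the standard's actual table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// Hypothetical type: an inclusive range of code points.
struct CodePointRange {
  std::uint32_t lo, hi;
};

// Placeholder ranges for illustration only. A real table would be generated
// from the list of allowed characters in the language standard.
static const CodePointRange kAllowed[] = {
    {0x00C0, 0x00D6}, {0x00D8, 0x00F6}, {0x00F8, 0x00FF}};

// Returns true if cp falls inside one of the ranges in kAllowed.
bool isAllowedInIdentifier(std::uint32_t cp) {
  // Binary search requires the table to be sorted by lower bound.
  assert(std::is_sorted(std::begin(kAllowed), std::end(kAllowed),
                        [](const CodePointRange &a, const CodePointRange &b) {
                          return a.lo < b.lo;
                        }));
  // Find the first range that starts after cp; the candidate is the one
  // immediately before it.
  const CodePointRange *it = std::upper_bound(
      std::begin(kAllowed), std::end(kAllowed), cp,
      [](std::uint32_t value, const CodePointRange &r) { return value < r.lo; });
  if (it == std::begin(kAllowed))
    return false;
  --it;
  return cp <= it->hi;
}
```

Since the result depends only on the code points in the identifier, it would only need to be computed once per distinct identifier, when the identifier is first entered into the table.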

Sean
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
Re: [cfe-dev] Musings about UCNs

Chris Lattner

On Jan 25, 2010, at 11:42 PM, Sean Hunt wrote:

> I've been eyeing UCNs for a while, and so I've got a few musings to
> share; perhaps they will help whoever gets around to implementing them.
>
> Disclaimer: I'm basing this off the C++ spec. If there are
> differences/incompatibilities with the C spec, I haven't noticed.
>
> Thoughts:
>  - We should probably use UTF-8 internally because it has a bunch of
> nice features, like not breaking any existing code within clang.

yes.

>  - We could also accept UTF-8 as the default character encoding and
> process extended characters directly. The driver should handle other
> encodings by converting them to UTF-8.

We should have SourceMgr do this; the driver doesn't know about all
the headers, etc.

>  - Pursuant to that, does clang currently assume it's being compiled on
> an ASCII system?

Yes, we don't care about non-ASCII systems.  When we do, SourceMgr can
translate them as well.
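
As a rough illustration of the kind of translation meant here, the following sketch converts an ISO-8859-1 (Latin-1) buffer to UTF-8. The function latin1ToUTF8 is a hypothetical helper, not an existing SourceMgr API, and Latin-1 is just an example source encoding.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a clang API): transcode ISO-8859-1 bytes to UTF-8.
// Each Latin-1 byte maps to the code point with the same value, so bytes
// below 0x80 are copied unchanged and the rest become two-byte sequences.
std::string latin1ToUTF8(const std::string &in) {
  std::string out;
  out.reserve(in.size());
  for (unsigned char c : in) {
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));
    } else {
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));   // lead byte
      out.push_back(static_cast<char>(0x80 | (c & 0x3F))); // continuation
    }
  }
  // Transcoding from Latin-1 never shrinks the buffer.
  assert(out.size() >= in.size());
  return out;
}
```

Doing the conversion when a buffer is loaded means every file, including headers, reaches the lexer in the same encoding.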

>  - To reduce performance hits, we should only scan a given identifier
> once to see if it contains any illegal characters.

Yes, the lexer should just handle this in the identifier lexing code.
The common case is "no UCN", so any UCN characters should cause a
branch out of the fast path into the existing slow case of identifier
lexing.
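
A minimal sketch of that fast-path/slow-path split, assuming a standalone scanner over a std::string rather than clang's actual Lexer. IdentScan, scanSlow and scanIdentifier are hypothetical names. The sketch does not reject a leading digit and does not validate raw UTF-8 or check which code points are permitted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type, not a clang data structure.
struct IdentScan {
  std::size_t length; // bytes consumed from the start position
  bool hasExtended;   // true if a UCN or non-ASCII byte was consumed
};

static bool isAsciiIdentBody(unsigned char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9') || c == '_';
}

static bool isHexDigit(unsigned char c) {
  return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') ||
         (c >= 'A' && c <= 'F');
}

// Slow path: accepts ASCII identifier characters, \uXXXX and \UXXXXXXXX,
// and raw bytes >= 0x80 (assumed to be UTF-8, not validated here).
static std::size_t scanSlow(const std::string &s, std::size_t pos) {
  while (pos < s.size()) {
    unsigned char c = s[pos];
    if (isAsciiIdentBody(c) || c >= 0x80) {
      ++pos;
      continue;
    }
    if (c == '\\' && pos + 1 < s.size() &&
        (s[pos + 1] == 'u' || s[pos + 1] == 'U')) {
      std::size_t digits = s[pos + 1] == 'u' ? 4 : 8;
      if (pos + 2 + digits > s.size())
        break;
      bool ok = true;
      for (std::size_t i = 0; i < digits; ++i)
        ok = ok && isHexDigit(s[pos + 2 + i]);
      if (!ok)
        break;
      pos += 2 + digits;
      continue;
    }
    break;
  }
  return pos;
}

IdentScan scanIdentifier(const std::string &s, std::size_t start = 0) {
  assert(start <= s.size());
  std::size_t pos = start;
  // Fast path: plain ASCII identifier characters only.
  while (pos < s.size() && isAsciiIdentBody(s[pos]))
    ++pos;
  if (pos == s.size())
    return {pos - start, false};
  unsigned char c = s[pos];
  if (c != '\\' && c < 0x80)
    return {pos - start, false}; // ordinary terminator, never left fast path
  // A backslash or high byte: branch out to the slow path.
  std::size_t end = scanSlow(s, pos);
  return {end - start, end != pos};
}
```

Identifiers made only of ASCII characters never reach scanSlow, so the common case pays for one extra comparison at the terminating character.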

> I'm thinking the
> Token should store whether it contains a universal-character as it
> stores whether or not it needs cleaning, and IdentifierTable::get() gets
> a default parameter added; if it's set and the identifier is not already
> in the table, then a check is performed, ideally on a precompiled trie.

I don't think this is necessary.  The IdentifierInfo* should contain
the canonicalized UTF-8 encoding, and the spelling is whatever is in
the code (after SourceMgr switches the character set).
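
A minimal sketch of one reading of "canonicalized" here: a UCN spelling and a raw UTF-8 spelling of the same character map to the same bytes. canonicalizeIdentifier and its helpers are hypothetical names, not clang functions. The sketch assumes the spelling has already been lexed as an identifier, performs no Unicode normalization, and does not reject surrogate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append code point cp to out as UTF-8.
static void appendUTF8(std::string &out, std::uint32_t cp) {
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

static int hexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Replace each \uXXXX or \UXXXXXXXX in the spelling with its UTF-8 bytes;
// everything else (ASCII and raw UTF-8) is copied through unchanged.
std::string canonicalizeIdentifier(const std::string &spelling) {
  std::string out;
  std::size_t i = 0;
  while (i < spelling.size()) {
    char c = spelling[i];
    if (c == '\\' && i + 1 < spelling.size() &&
        (spelling[i + 1] == 'u' || spelling[i + 1] == 'U')) {
      std::size_t digits = spelling[i + 1] == 'u' ? 4 : 8;
      if (i + 2 + digits <= spelling.size()) {
        std::uint32_t cp = 0;
        bool ok = true;
        for (std::size_t k = 0; k < digits; ++k) {
          int v = hexValue(spelling[i + 2 + k]);
          if (v < 0) { ok = false; break; }
          cp = cp * 16 + static_cast<std::uint32_t>(v);
        }
        if (ok && cp <= 0x10FFFF) {
          appendUTF8(out, cp);
          i += 2 + digits;
          continue;
        }
      }
    }
    out += c; // not a well-formed UCN: copy the byte as-is
    ++i;
  }
  return out;
}
```

With this, the token's spelling stays exactly as written in the source, while the identifier table is keyed on the returned string, so both spellings of a name resolve to the same entry.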

>  - For literals, UCN processing will occur in the token lexer invoked
> by Sema later on, including conversion to the execution character set if
> necessary.

Sure.

>  - How extended characters should be stored in names is unclear.
> Ancient cxx-abi-dev discussions are undecided on whether simply using
> UTF-8 is correct. GCC code seems to suggest this is the intent in the
> long run.

Storing canonicalized UTF-8 in the identifiers is the only reasonable
thing to do.

-Chris