This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)


2009/9/28 Corinna Vinschen:
>> Oh, and I thought of one more thing that won't roundtrip correctly
>> from Unix to Windows and back: a high surrogate directly followed by a
>> low surrogate, because they'll combine into a non-BMP codepoint
>> represented by a 4-byte sequence. That's near-impossible to happen by
>> chance though.
>
> There is no chance to do that right. But I'm willing to stick to
> this trade-off since, as you wrote, it's near-impossible that somebody
> created that filename by chance.

Hmm. But what if Java or Oracle or some other CESU-8 degenerate did
that on purpose?
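
To make the failure concrete, here is a minimal Python sketch of what happens if surrogates in UTF-8 were accepted leniently. It is not Cygwin code: the function name lenient_roundtrip is hypothetical, and Python's "surrogatepass" error handler merely stands in for a converter that lets surrogate codepoints through. A CESU-8 style name of six bytes goes in and four different bytes come back:

```python
def lenient_roundtrip(name: bytes) -> bytes:
    """Unix -> Windows -> Unix, assuming surrogates in UTF-8 are accepted."""
    # Unix -> Windows: each 3-byte surrogate sequence becomes one UTF-16 unit.
    utf16 = name.decode("utf-8", "surrogatepass").encode("utf-16-le", "surrogatepass")
    # Windows -> Unix: the adjacent high and low units are now read as one
    # surrogate pair, i.e. a single non-BMP codepoint.
    return utf16.decode("utf-16-le").encode("utf-8")


cesu8 = b"\xed\xa0\x80\xed\xb0\x80"     # U+D800 U+DC00, three bytes each
print(lenient_roundtrip(cesu8).hex())   # f0908080, the 4-byte form of U+10000
```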

Just in case you're not yet completely sick of this, here's how I
think it could be done:
- Keep treating surrogate codepoints in UTF-8 as illegal.
- Go for the F0xx encoding for invalid bytes in filenames. Hence the
three bytes of a CESU-8 surrogate will turn into three F0xx, which
will round-trip correctly.
- Encode lone surrogates on the Windows side as ^X sequences. The only
issue here is that the standard __utf8_mbtowc/wctomb could not be used
to do that.
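
A rough Python sketch of the first two points, just to illustrate the round-trip. The names unix_to_windows and windows_to_unix are made up for this example, and Python's strict UTF-8 decoder (which rejects surrogates) stands in for the real conversion routines. The ^X escaping of lone surrogates from the third point is not modelled, and the sketch ignores the case where a filename legitimately contains a U+F0xx character, which would be ambiguous on the way back.

```python
from typing import List


def unix_to_windows(name: bytes) -> List[int]:
    """Convert filename bytes to UTF-16 code units; invalid bytes become 0xF0xx."""
    units: List[int] = []
    i = 0
    while i < len(name):
        for length in (1, 2, 3, 4):
            try:
                # Strict decode: surrogate codepoints are treated as illegal.
                cp = ord(name[i:i + length].decode("utf-8"))
            except (UnicodeDecodeError, TypeError):
                continue  # not exactly one valid character at this length
            if cp >= 0x10000:
                # Non-BMP codepoint: split into a surrogate pair.
                cp -= 0x10000
                units += [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]
            else:
                units.append(cp)
            i += length
            break
        else:
            # No valid sequence starts here: map the single byte to F0xx.
            units.append(0xF000 | name[i])
            i += 1
    return units


def windows_to_unix(units: List[int]) -> bytes:
    """Convert UTF-16 code units back to filename bytes, undoing the F0xx mapping."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if 0xF000 <= u <= 0xF0FF:
            out.append(u & 0xFF)  # restore the original invalid byte
        elif (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
              and 0xDC00 <= units[i + 1] <= 0xDFFF):
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out += chr(cp).encode("utf-8")
            i += 1
        elif 0xD800 <= u <= 0xDFFF:
            raise ValueError("lone surrogate: would need the ^X escape")
        else:
            out += chr(u).encode("utf-8")
        i += 1
    return bytes(out)


cesu8 = b"\xed\xa0\x80\xed\xb0\x80"   # CESU-8 form of U+10000
units = unix_to_windows(cesu8)
print([hex(u) for u in units])
print(windows_to_unix(units) == cesu8)
```

With this mapping the six CESU-8 bytes become six F0xx units rather than a surrogate pair, so nothing combines on the Windows side and the original bytes come back unchanged.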

Andy

