Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Mon Sep 28 17:05:00 GMT 2009

2009/9/28 Corinna Vinschen:
>> >> Oh, and I thought of one more thing that won't roundtrip correctly
>> >> from Unix to Windows and back: a high surrogate directly followed by a
>> >> low surrogate, because they'll combine into a non-BMP codepoint
>> >> represented by a 4-byte sequence. That's near-impossible to happen by
>> >> chance though.
>> >
>> > There is no chance to do that right.  But I'm willing to stick to
>> > this trade-off since, as you wrote, it's near-impossible that somebody
>> > created that filename by chance.
>>
>> Hmm. But what if Java or Oracle or some other CESU-8 degenerate did
>> that on purpose?
>>
>> Just in case you're not yet completely sick of this, here's how I
>> think it could be done:
>
> Nooooo!  I *am* completely sick of this.  I'm willing to let this slip
> until the first complaint about this very issue comes along.

Sorry.

And I was wrong as well: outlawing lone surrogates in
__utf8_mbtowc/wctomb is not necessary to deal with this. It could be
done in the same way as I'd suggested for F0xx codepoints, i.e., treat
them as illegals in sys_cp_mbstowcs only:

\xED\xB0\x80 -> U+F0ED U+F0B0 U+F080
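
For illustration, here is a minimal Python sketch of that mapping. It is not the Cygwin code: `mbstowcs_sketch` and the `"f0xx"` error-handler name are made up for this example, and it relies on Python's strict UTF-8 decoder, which already rejects encoded surrogates, to stand in for sys_cp_mbstowcs. Every byte of an illegal sequence becomes U+F000 plus the byte value.

```python
import codecs


def _to_private_use(exc):
    """Error handler: map each undecodable byte 0xNN to U+F0NN."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Resume decoding right after the bytes that were reported as bad.
    return "".join(chr(0xF000 + b) for b in bad), exc.end


# "f0xx" is a hypothetical handler name chosen for this sketch.
codecs.register_error("f0xx", _to_private_use)


def mbstowcs_sketch(data: bytes) -> str:
    """Decode UTF-8, treating encoded surrogates like any other illegal bytes."""
    return data.decode("utf-8", errors="f0xx")


result = mbstowcs_sketch(b"\xed\xb0\x80")
print(" ".join(f"U+{ord(c):04X}" for c in result))
```

Well-formed UTF-8 passes through unchanged; only the bytes the decoder rejects are moved into the U+F0xx range.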

^X encoding would still be needed for Windows-side lone surrogates,
but due to the above, __utf8_wctomb could be used to encode them.
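
A rough Python sketch of the Windows-to-Unix direction, under two assumptions that are mine and not confirmed here: that the ^X scheme prefixes the byte 0x18 (ASCII CAN) to the encoded character, and that a UTF-8 encoder which tolerates surrogates produces the usual 3-byte form. `encode_lone_surrogate` is a hypothetical helper, with Python's `surrogatepass` handler standing in for __utf8_wctomb.

```python
CAN = b"\x18"  # ASCII CAN, i.e. ^X (assumed prefix byte)


def encode_lone_surrogate(unit: int) -> bytes:
    """Encode one lone UTF-16 surrogate as ^X plus its 3-byte UTF-8-style form."""
    if not 0xD800 <= unit <= 0xDFFF:
        raise ValueError("not a surrogate code unit")
    # surrogatepass lets the UTF-8 codec emit the 3-byte form for surrogates.
    return CAN + chr(unit).encode("utf-8", "surrogatepass")


print(encode_lone_surrogate(0xDC00).hex(" "))
```

Because of the prefix, the result stays distinguishable from a Unix-side \xED\xB0\x80, which goes to the U+F0xx range instead.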

But yeah, let's leave that for now.

Andy