This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: charset changes


Andy Koppe:
>>> Other systems usually have a 32-bit wchar, though. I can see three
>>> ways to tackle the issue, but none of them entirely satisfactory. When
>>> encountering a 4-byte sequence in __gb18030_mbtowc that maps to a
>>> non-BMP char (and hence a UTF-16 surrogate pair):
>>> 1. Just report an invalid sequence. BMP-only support would probably
>>> still cover most practical needs.
>>> 2. Write the high surrogate and report that one byte less than
>>> actually seen has been consumed. On the next mbtowc call, ignore the
>>> input, write the low surrogate, and report that 1 byte has been
>>> consumed. Unfortunately this scheme falls down if the user feeds in
>>> the bytes one-by-one, as Corinna previously found when handling UTF-8
>>> like this.
>>> 3. Write the high surrogate and report the actual number of bytes
>>> consumed. On the next call, write the low surrogate, and return 0 to
>>> indicate that no bytes have been consumed. Trouble is, a return value
>>> of 0 from mbrtowc is supposed to indicate that a null character has
>>> been found. While uses within Cygwin could be changed to recognise
>>> string end by instead looking at the character actually written, this
>>> would lead to truncated strings in applications.
>
> I just found that approach 3 ends up delaying the low surrogate until
> the first byte of the next character is passed to mbtowc. For keyboard
> input at least, that's bad.
>
>
> Corinna Vinschen:
>> Can't we just carry over the surrogate pair handling from __utf8_mbtowc()
>> in newlib/libc/stdlib/mbtowc_r.c? What's the stumbling block exactly?
>> Do you have an example?
>
> __utf8_mbtowc writes the UTF-16 high surrogate after seeing only three
> bytes of a four-byte sequence. It can do that because the first three
> bytes of a UTF-8 sequence contain all the bits needed for the high
> surrogate. When it's called with the fourth byte, it writes the low
> surrogate and returns 1 to indicate it's consumed 1 byte. (Unless the
> fourth byte is invalid in which case it returns -1).
>
> That approach fits nicely with the mbrtowc spec, but I don't think it
> can be used for GB18030, because there the first three bytes of a four
> byte sequence do not necessarily determine all the bits of the high
> surrogate.
>
> For example (all in hex):
>
> U+207FF  GB18030: 95 33 D1 33  UTF-16: D841 DFFF
> U+20800  GB18030: 95 33 D1 34  UTF-16: D842 DC00
>
> The first three GB18030 bytes are the same, yet the high UTF-16
> surrogate is different.
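
The arithmetic behind that example can be checked with a minimal sketch in C. It assumes the linear mapping that GB18030 uses for the supplementary planes, where 90 30 81 30 corresponds to U+10000. The function names gb18030_supp_to_cp and cp_to_surrogates are made up for illustration and are not part of newlib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a four-byte GB18030 sequence from the supplementary range
   (lead byte 0x90..0xE3) to a code point. Returns -1 if the bytes are
   outside that range. Hypothetical helper, not a newlib function. */
int32_t gb18030_supp_to_cp(const unsigned char b[4])
{
    assert(b != NULL);
    if (b[0] < 0x90 || b[0] > 0xE3 || b[1] < 0x30 || b[1] > 0x39 ||
        b[2] < 0x81 || b[2] > 0xFE || b[3] < 0x30 || b[3] > 0x39)
        return -1;
    /* Bytes 2 and 4 are base-10 digits, byte 3 is a base-126 digit. */
    int32_t linear = (((b[0] - 0x90) * 10 + (b[1] - 0x30)) * 126
                      + (b[2] - 0x81)) * 10 + (b[3] - 0x30);
    int32_t cp = 0x10000 + linear;
    return cp <= 0x10FFFF ? cp : -1;
}

/* Split a non-BMP code point into a UTF-16 surrogate pair. */
void cp_to_surrogates(int32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = (uint32_t)cp - 0x10000;
    *high = (uint16_t)(0xD800 + (v >> 10));   /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF)); /* bottom 10 bits */
}
```

Because the fourth byte is the least significant digit of the linear index, incrementing it can carry into bit 10 of the code point offset, which is where the high surrogate changes.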

I think the conclusion from all this is that approach 2 is the least
broken way to handle GB18030: when encountering a 4-byte sequence that
maps to a non-BMP char (and hence a UTF-16 surrogate pair), write the
high surrogate and report that one byte less than actually seen has
been consumed. On the next mbtowc call, ignore the input, write the
low surrogate, and report that 1 byte has been consumed.

As mentioned, this breaks the mbtowc spec when bytes are fed in
one-by-one, because in that case zero needs to be returned after the
high surrogate, yet zero is meant to signal string end. An application
that's aware of that can work around it by checking whether the wide
character that's written actually is null, but in others it may cause
truncated strings. Fortunately, the mbstowcs implementation isn't
affected by this, because that always passes as many bytes as possible
to mbtowc, i.e. the incorrect zero return can't occur there.
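
The two-call scheme of approach 2 might look roughly like the following sketch. It is deliberately limited: it only handles four-byte sequences that map beyond the BMP, it does not buffer bytes fed in one at a time (so it never produces the problematic zero return), and the names sketch_mbtowc and struct gb_state are hypothetical rather than anything in newlib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: a low surrogate still owed to the
   caller, or 0 if nothing is pending. */
struct gb_state { uint16_t pending_low; };

/* Returns the number of bytes consumed, -2 for an incomplete sequence,
   or -1 for anything this sketch does not handle. */
int sketch_mbtowc(uint16_t *pwc, const unsigned char *s, size_t n,
                  struct gb_state *st)
{
    assert(pwc != NULL && s != NULL && st != NULL);
    if (n == 0)
        return -2;
    if (st->pending_low) {
        /* Second call: ignore the input, write the low surrogate and
           report the last byte of the sequence as consumed. */
        *pwc = st->pending_low;
        st->pending_low = 0;
        return 1;
    }
    if (s[0] < 0x90 || s[0] > 0xE3)
        return -1;              /* not a supplementary-plane lead byte */
    if (n < 4)
        return -2;              /* no byte-by-byte buffering here */
    if (s[1] < 0x30 || s[1] > 0x39 || s[2] < 0x81 || s[2] > 0xFE ||
        s[3] < 0x30 || s[3] > 0x39)
        return -1;
    uint32_t v = (uint32_t)((((s[0] - 0x90) * 10 + (s[1] - 0x30)) * 126
                             + (s[2] - 0x81)) * 10 + (s[3] - 0x30));
    if (v > 0xFFFFF)
        return -1;              /* beyond U+10FFFF */
    /* First call: write the high surrogate, remember the low one, and
       report one byte less than was actually examined. */
    *pwc = (uint16_t)(0xD800 + (v >> 10));
    st->pending_low = (uint16_t)(0xDC00 + (v & 0x3FF));
    return 3;
}
```

A caller that advances its pointer by the return value ends up passing the fourth byte again on the second call, so the total consumed comes to four bytes across the two calls.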

The MultiByteToWideChar() function doesn't have a way to tell
incomplete from invalid sequences, which is needed to decide whether
to return -2 or -1 from mbtowc. "Interestingly", if you give it only
two bytes of a 4-byte GB18030 sequence, e.g. \x95 \x33, it interprets
that as a one-byte invalid sequence followed by the digit '3'.

Therefore I think the best thing to do is to manually parse GB18030
sequences, which is fairly straightforward, and only hand complete
sequences over to MultiByteToWideChar for translation to UTF-16. Shall
I have a go at that?
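
As a rough sketch of what that manual parsing could look like, the function below classifies the start of a buffer as a complete, incomplete, or invalid GB18030 sequence based on the byte ranges alone. The name gb18030_seq_len is hypothetical, and treating 0x80 and 0xFF as invalid is an assumption that may not match what the Windows codepage does with those bytes.

```c
#include <assert.h>
#include <stddef.h>

enum { GB_INVALID = -1, GB_INCOMPLETE = -2 };

/* Return the length (1, 2 or 4) of the GB18030 sequence starting at s,
   GB_INCOMPLETE if the n available bytes are a valid prefix of a longer
   sequence, or GB_INVALID if they cannot begin a valid sequence. */
int gb18030_seq_len(const unsigned char *s, size_t n)
{
    assert(s != NULL);
    if (n == 0)
        return GB_INCOMPLETE;
    if (s[0] <= 0x7F)
        return 1;                          /* single byte (ASCII) */
    if (s[0] == 0x80 || s[0] == 0xFF)
        return GB_INVALID;                 /* assumed not a lead byte */
    if (n < 2)
        return GB_INCOMPLETE;
    if (s[1] >= 0x40 && s[1] <= 0xFE && s[1] != 0x7F)
        return 2;                          /* two-byte sequence */
    if (s[1] < 0x30 || s[1] > 0x39)
        return GB_INVALID;
    if (n < 3)
        return GB_INCOMPLETE;              /* e.g. 95 33 */
    if (s[2] < 0x81 || s[2] > 0xFE)
        return GB_INVALID;
    if (n < 4)
        return GB_INCOMPLETE;
    if (s[3] < 0x30 || s[3] > 0x39)
        return GB_INVALID;
    return 4;                              /* four-byte sequence */
}
```

With something like this in front, the two return values map directly onto -1 and -2 from mbtowc, and only sequences reported as complete would be passed on to MultiByteToWideChar.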

Andy

