This is the mail archive of the cygwin-developers mailing list for the Cygwin project.


Re: charset changes


On Mar 27 06:47, Andy Koppe wrote:
> I think the conclusion from all this is that approach 2 is the least
> broken way to handle GB18030: when encountering a 4-byte sequence that
> maps to a non-BMP char (and hence a UTF-16 surrogate pair), write the
> high surrogate and report that one byte less than actually seen has
> been consumed. On the next mbtowc call, ignore the input, write the
> low surrogate, and report that 1 byte has been consumed.
> 
> As mentioned, this breaks the mbtowc spec when bytes are fed in
> one-by-one, because in that case zero needs to be returned after the
> high surrogate, yet zero is meant to signal string end. An application
> that's aware of that can work around it by checking whether the wide
> character that's written actually is null, but in others it may cause
> truncated strings. Fortunately, the mbstowcs implementation isn't
> affected by this, because that always passes as many bytes as possible
> to mbtowc, i.e. the incorrect zero return can't occur there.
> 
> The MultiByteToWideChar() function doesn't have a way to tell
> incomplete from invalid sequences, which is needed to decide whether
> to return -2 or -1 from mbtowc. "Interestingly", if you give it only
> two bytes of a 4-byte GB18030 sequence, e.g. \x95 \x33, it interprets
> that as a one-byte invalid sequence followed by the digit '3'.
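
For illustration, approach 2 as described in the quoted text might be sketched as follows. This is a minimal sketch under stated assumptions, not the Cygwin implementation: `sketch_mbtowc`, `gb4_to_codepoint` and `pending_low` are hypothetical names, `uint16_t` stands in for Cygwin's 16-bit `wchar_t`, and only ASCII and 4-byte non-BMP sequences are handled. It relies on the linear GB18030 mapping of the supplementary planes, where bytes 90 30 81 30 correspond to U+10000, and it assumes all four bytes are available in one call, as is the case when mbstowcs drives the conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state: low surrogate still to be delivered, or 0 if none. */
static uint16_t pending_low;

/* Supplementary planes map linearly: 90 30 81 30 is U+10000. */
static uint32_t gb4_to_codepoint(const unsigned char *s)
{
  uint32_t idx = (uint32_t)(s[0] - 0x90) * 12600u
               + (uint32_t)(s[1] - 0x30) * 1260u
               + (uint32_t)(s[2] - 0x81) * 10u
               + (uint32_t)(s[3] - 0x30);
  return 0x10000u + idx;
}

static int is_digit_byte(unsigned char c) { return c >= 0x30 && c <= 0x39; }

/* Returns the number of bytes reported as consumed, or -1 on anything
   this sketch does not handle (2-byte and BMP 4-byte sequences included). */
int sketch_mbtowc(uint16_t *out, const unsigned char *s, size_t n)
{
  assert(out != NULL);

  /* Second call of a pair: ignore the input, deliver the low surrogate. */
  if (pending_low)
    {
      *out = pending_low;
      pending_low = 0;
      return 1;
    }
  if (n >= 1 && s[0] <= 0x7F)
    {
      *out = s[0];
      return s[0] != 0;   /* 0 signals the terminating null */
    }
  if (n >= 4 && s[0] >= 0x90 && s[0] <= 0xE3 && is_digit_byte(s[1])
      && s[2] >= 0x81 && s[2] <= 0xFE && is_digit_byte(s[3]))
    {
      uint32_t cp = gb4_to_codepoint(s);
      if (cp > 0x10FFFFu)
        return -1;
      cp -= 0x10000u;
      *out = (uint16_t)(0xD800u | (cp >> 10));          /* high surrogate */
      pending_low = (uint16_t)(0xDC00u | (cp & 0x3FFu)); /* kept for next call */
      return 3;           /* one byte less than actually seen */
    }
  return -1;
}
```

A caller that advances its input pointer by the return value ends up exactly past the 4-byte sequence after the two calls (3 + 1 bytes).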

Huh?  How did you test that?  AFAIK, MultiByteToWideChar doesn't tell
you how many or which bytes it treated as a valid substring.

> Therefore I think the best thing to do is to manually parse GB18030
> sequences, which is fairly straightforward, and only hand complete
> sequences over to MultiByteToWideChar for translation to UTF-16. Shall
> I have a go at that?
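
The manual parsing step could look roughly like the sketch below. It only classifies the byte structure of a GB18030 sequence (1 byte for ASCII, 2 bytes with a trail byte in 0x40-0x7E or 0x80-0xFE, 4 bytes with digit bytes 0x30-0x39 in positions two and four), so that incomplete input can be told apart from invalid input before anything is handed to MultiByteToWideChar. The name `gb18030_seq_len` and the `GB_*` constants are assumptions for this example, and the return values simply mirror the -1/-2 convention mentioned in the quoted text.

```c
#include <assert.h>
#include <stddef.h>

enum { GB_INVALID = -1, GB_INCOMPLETE = -2 };

static int in_range(unsigned char c, unsigned char lo, unsigned char hi)
{
  return c >= lo && c <= hi;
}

/* Returns the length (1, 2 or 4) of the complete sequence at s,
   GB_INCOMPLETE if the n bytes given are a valid prefix of a longer
   sequence, or GB_INVALID if they cannot start any sequence. */
int gb18030_seq_len(const unsigned char *s, size_t n)
{
  assert(s != NULL || n == 0);

  if (n == 0)
    return GB_INCOMPLETE;
  if (s[0] <= 0x7F)
    return 1;                               /* single byte (ASCII) */
  if (!in_range(s[0], 0x81, 0xFE))
    return GB_INVALID;                      /* 0x80 and 0xFF are not lead bytes */
  if (n < 2)
    return GB_INCOMPLETE;
  if (in_range(s[1], 0x40, 0x7E) || in_range(s[1], 0x80, 0xFE))
    return 2;                               /* two-byte sequence */
  if (!in_range(s[1], 0x30, 0x39))
    return GB_INVALID;
  if (n < 3)
    return GB_INCOMPLETE;                   /* e.g. \x95 \x33 alone */
  if (!in_range(s[2], 0x81, 0xFE))
    return GB_INVALID;
  if (n < 4)
    return GB_INCOMPLETE;
  if (!in_range(s[3], 0x30, 0x39))
    return GB_INVALID;
  return 4;                                 /* four-byte sequence */
}
```

With this classification, the two bytes \x95 \x33 from the example above are reported as incomplete rather than as an invalid byte followed by '3'. Whether a structurally valid sequence is actually assigned a character is left to the conversion step.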

I would really be glad.  You'd just create two functions, __gb18030_mbtowc
and __gb18030_wctomb, in strfuncs.cc, and I could easily add them to newlib's
setlocale_r.  Oh, and then there's check_codepage in nlsfuncs.cc, which
needs to test whether codepage 54936 is installed.

However, here's a problem.  These functions are non-trivial code, so
adding them requires a copyright assignment... sigh.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

