This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Non-canonical mode input via tcsetattr(), under mintty console


Thomas Wolff:
> Dave Lee wrote:
>>
>> Hi all,
>>
>> I was testing a program that uses non-canonical mode input via
>> tcsetattr().
>>
>> ...
>> Specifically, I entered the Chinese character "例" (which means "rule"
>> or "example"). It occupies 3 bytes in UTF-8 representation: E4, BE, 8B.
>>
>> On standard console, the read() call returned THREE bytes (n == 3), and
>> (not surprisingly) E4, BE and 8B were returned to buf[].
>>
>> On mintty console, the read() call returned ONE byte (n == 1), and only
>> E4 was returned to buf[]. I could grab the other two bytes if I did
>> additional calls to read().
>>
> This is absolutely in line with the specified interface of read(), whether
> or not you apply some tcsetattr settings, and whether or not there is a
> difference between cygwin console and mintty. It is a traditional
> byte-oriented function and has no knowledge or handling of character
> encoding, and there is no guarantee that a multi-byte character comes in one
> piece.

Exactly.


> (Even if mintty were changed to try to feed them in one piece, there
> would still be no guarantee that you receive them in one piece.)

As it happens, mintty sends multibyte characters in a single write()
already, but the pseudo terminal device driver is indeed entitled to
pick them apart anyway: VMIN=1 and VTIME=0 means give me at least one
byte, as soon as you have it. It's also possible that multiple
characters are delivered at once.
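
For reference, a minimal sketch of the kind of setup being discussed.
The original test program was not posted, so the function name
set_noncanonical() and the choice to clear only ICANON (leaving ECHO
and the other flags untouched) are assumptions for illustration:

```c
#include <assert.h>
#include <termios.h>
#include <unistd.h>

/* Put fd into non-canonical mode with VMIN=1 and VTIME=0, so that read()
   returns as soon as at least one byte is available.
   The previous settings are stored in *saved so the caller can restore
   them later with tcsetattr(fd, TCSANOW, saved).
   Returns 0 on success, -1 on failure (for example if fd is not a tty). */
int set_noncanonical(int fd, struct termios *saved)
{
    struct termios t;

    assert(saved != NULL);
    if (tcgetattr(fd, saved) == -1)
        return -1;

    t = *saved;
    t.c_lflag &= ~ICANON;   /* hypothetical choice: only canonical mode is switched off */
    t.c_cc[VMIN] = 1;       /* wait for at least one byte */
    t.c_cc[VTIME] = 0;      /* no inter-byte timeout */
    return tcsetattr(fd, TCSANOW, &t);
}
```

With these settings a single read() may return part of a character, one
whole character, or several characters, so the caller has to cope with
all three cases.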


> You have four options (two each whether you want UTF-8 or Unicode words in
> your program):
> [...]
> * Read bytes and transform with one of the mbtowc (multi-byte to
> wide-character) functions
> [...]

I'd go with that, because that way you can support not only UTF-8, but
all the charsets supported by the OS.


> (provided you want characters as Unicode words,
> not UTF-8 sequences in your program).

In that case, one can just ignore the widechar output and only use the
length info returned by mb(r)towc.
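
A rough sketch of that approach, assuming the bytes come from read() in
arbitrary chunks. The names use_environment_charset() and count_chars()
are made up for this example. It passes NULL for the widechar output and
relies only on the length returned by mbrtowc(), with the mbstate_t
carrying a partial character from one chunk to the next:

```c
#include <assert.h>
#include <locale.h>
#include <wchar.h>

/* Call once at startup so that mbrtowc() follows the charset selected
   by the user's environment. Returns nonzero on success. */
int use_environment_charset(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Feed one chunk of bytes, as returned by a single read() call.
   *st must be zero-initialised before the first chunk; it remembers an
   incomplete multibyte sequence until the following call.
   Returns the number of characters completed within this chunk, or
   (size_t)-1 on an invalid sequence (the caller should then reset *st). */
size_t count_chars(const char *buf, size_t n, mbstate_t *st)
{
    size_t count = 0;

    assert(buf != NULL && st != NULL);
    while (n > 0) {
        size_t len = mbrtowc(NULL, buf, n, st);  /* widechar output ignored */

        if (len == (size_t)-1)
            return (size_t)-1;  /* invalid byte sequence */
        if (len == (size_t)-2)
            break;              /* incomplete: the rest comes with a later read() */
        if (len == 0)
            len = 1;            /* null character; assumed to occupy one byte */
        buf += len;
        n -= len;
        count++;
    }
    return count;
}
```

In a UTF-8 locale, feeding E4 on its own completes no character, and
feeding BE 8B in the next call completes one. A real program would
copy or process the bytes of each character instead of just counting.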

Andy

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

