This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: The C locale


2009/9/21 Corinna Vinschen:
> Back from vacation I re-read this thread now and I have to say I just
> don't know what is the best course of action here.

I'm afraid I can only reiterate what I said previously.

Let's use the Windows "ANSI" codepage as the character set for the C
locale, for both the conversion functions and filenames. This means
CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
ones, and so on.

This way, the non-ASCII needs of most users are covered out of the
box, and compatibility with Cygwin 1.5 and users' ANSI-encoded files
is ensured. Applications that still assume that a byte and a character
are the same thing work correctly (except that they'll treat East
Asian double-byte characters as two characters, but a different
default charset won't cure that).
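
To make the byte-versus-character point concrete, here is a small Python sketch. It is only an illustration: the sample characters are arbitrary picks, and Python's codec names stand in for the Windows codepages mentioned above.

```python
# How many bytes does one character take in each charset?
# The sample characters are arbitrary; the codec names are Python's
# names for the corresponding Windows codepages.
samples = [
    ("ä", "cp1252"),   # Western ANSI codepage
    ("ж", "cp1251"),   # Cyrillic ANSI codepage
    ("日", "cp932"),    # Japanese ANSI codepage (double-byte for kanji)
    ("ä", "utf-8"),
]

def encoded_length(char, charset):
    """Return the number of bytes `char` occupies in `charset`."""
    return len(char.encode(charset))

for char, charset in samples:
    print(f"{char!r} in {charset}: {encoded_length(char, charset)} byte(s)")
```

In the single-byte codepages, an application that equates bytes with characters gets the right answer. In CP932 and UTF-8 it does not.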

Filenames created on the Cygwin side show up correctly in Explorer.
Windows filenames show up correctly in Cygwin as long as they're
limited to the ANSI codepage. The ^N encoding nevertheless ensures
that UTF-16 characters outside that codepage are uniquely represented.
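
A rough sketch of how such an escape can keep names unique, written in Python. This is not Cygwin's code. It assumes the escape is a Ctrl-N byte (0x0E) followed by the UTF-8 bytes of the character that the codepage cannot represent, and the function name encode_filename is made up for illustration.

```python
SO = b"\x0e"  # Ctrl-N; assumed escape byte for this sketch

def encode_filename(name, codepage):
    """Encode `name` into `codepage`, escaping unrepresentable characters.

    Assumption: an unrepresentable character becomes SO followed by its
    UTF-8 encoding. The real Cygwin scheme may differ in detail.
    """
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(codepage)
        except UnicodeEncodeError:
            out += SO + ch.encode("utf-8")
    return bytes(out)

# Representable in CP1252: passes through as a single byte.
print(encode_filename("ä", "cp1252"))
# Not representable in CP1252: escaped, and distinct names stay distinct.
print(encode_filename("aж", "cp1252"))
print(encode_filename("aд", "cp1252"))
```

Because each escaped character carries its full UTF-8 encoding, two filenames that differ only in characters outside the codepage still map to different byte strings.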

Beyond that, encourage maintainers to make their applications
UTF-8-capable and encourage users to choose a UTF-8 locale. Consider
adding a locale setting to setup.exe that gets written to cygwin.bat.


> The idea to use UTF-8 for filename and console operations by default was
> to get the least problems converting from UTF-16 to multibyte, so that
> readdir() always returns a valid filename.

But the ^N scheme does ensure that for any charset anyway, doesn't it?


> As for the conversion of filenames, you get the same problem on Linux if
> the filename contains non-ASCII bytes and these bytes are not a valid
> multibyte character in the current locale.

Yes, but Cygwin does actually have a big advantage here. Unlike Linux,
where the filename encoding is basically undefined, we *know* that
Windows filenames are always encoded as UTF-16. Therefore, the Cygwin
file functions do have the chance to always translate filenames
correctly into the application's locale.

And with any locale except "C" and "POSIX", this is working very
well, due to your great work implementing all the difficult bits such
as the ^N and 0xDC?? encodings and UTF-16 surrogates (and
notwithstanding the issue with translating 0xDC??s to charsets other
than UTF-8).


>> I see two good solutions:
>> - Use the default Windows codepage for filenames, console, and
>> multibyte functions. This is what happens already if you specify a
>> locale with a language but no charset, e.g. "en". Maximum 1.5
>> compatibility.
>
> Hmm, yes, that might be an option. Allowing the C.UTF-8 locale
> could work around the remaining problems.

Not sure that the C.UTF-8 locale is necessary for that, but it would
be nice to have, and it's easy to implement.


>> - Use UTF-8 throughout. Full Unicode support out of the box.
>
> What does "throughout" mean? Do you want ASCII multibyte conversion to
> use UTF-8 as well?

Yep, that was the idea, but later on I realised that it's not a good
one, because too many applications still assume that a byte and a
character are the same thing. For example, start nano in a UTF-8
locale, enter a few umlauts, and move the cursor around, and you'll
see some weird effects. Similarly, filenames with non-ASCII chars will
corrupt Midnight Commander's display.
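
The kind of breakage meant here can be reproduced in a few lines of Python. This is only a sketch of the general failure mode, not of what nano or Midnight Commander actually do internally; the sample word and the helper is_valid_utf8 are made up for illustration.

```python
text = "für"
raw = text.encode("utf-8")  # the umlaut takes two bytes in UTF-8

def is_valid_utf8(data):
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A program that counts bytes thinks the word is one column wider than it is.
print(len(text), len(raw))

# Moving a "cursor" by one byte can land in the middle of a character,
# leaving a prefix that is no longer valid UTF-8.
print(is_valid_utf8(raw[:2]))
print(is_valid_utf8(raw[:3]))
```

An application that steps through text one byte at a time will miscount columns and can split a character in half, which matches the cursor and display glitches described above.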

Andy


