This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: gcc4[1.7] printf treats differently a string constant and a character array


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Andy Koppe on 12/28/2009 11:17 PM:
>>> I am using LC_ALL=es_VE.ISO-8859-15.

So you told gcc which charset to use for those non-ASCII characters, which
resulted in raw 8-bit bytes.  puts is required to work transparently on
bytes, but printf is specified as a mix between bytes (arguments matching
%s) and characters (the format string itself, and arguments matching %ls).
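
A minimal sketch of what that distinction suggests in practice, assuming
the %s arguments really are passed through as bytes as described above:
keep the format string pure ASCII and hand the 8-bit data over as a %s
argument.  The helper name format_raw_bytes is hypothetical, chosen only
for this illustration.

```c
#include <stdio.h>

/* Copy raw bytes into out through a pure-ASCII format string.
 * Returns the snprintf result: the number of bytes the complete
 * output needs, not counting the terminating NUL. */
int format_raw_bytes(char *out, size_t size, const char *raw)
{
    /* The format itself contains only ASCII; the 8-bit data
     * travels solely as the argument matching %s. */
    return snprintf(out, size, "[%s]", raw);
}
```

Calling format_raw_bytes(buf, sizeof buf, "\xE1") should leave the single
byte 0xE1 between the brackets, without the format parser ever having to
interpret it as a character.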

> 
> Ah, the problem actually is that your program is missing a call to
> setlocale(LC_CTYPE, "") to switch to the locale and character set
> specified in the environment. In fact, since your program contains
> hard-coded ISO-8859-15 strings, you should probably do
> setlocale(LC_CTYPE, "<whatever>.ISO-8859-15").

Well, as long as you are running it on your machine, with
LC_ALL=es_VE.ISO-8859-15 in the environment, then setlocale(LC_ALL,"")
will pick up the same charset as what gcc hard-coded into your app.  But
yes, by using 8-bit bytes in your string, you have married your executable
to a particular locale, and it is no longer portable to machines using a
different charset.  To be more portable, you would want to use some
iconv() conversions (or look into using gettext() for translation catalogs).
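
As a rough sketch of the setlocale step (the function name
adopt_environment_locale is made up for this example, and using
nl_langinfo(CODESET) to report the result is my own addition, not
something from the original program):

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Adopt the locale named by LC_ALL / LC_CTYPE / LANG and return the
 * name of the character set now in effect.  If the environment names
 * a locale the system does not support, setlocale returns NULL and
 * the program stays in the "C" locale it started in. */
const char *adopt_environment_locale(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        fprintf(stderr, "warning: locale from environment not supported; "
                        "staying in the \"C\" locale\n");
    return nl_langinfo(CODESET);
}
```

Call this once near the top of main(), before any printf.  With
LC_ALL=es_VE.ISO-8859-15 in the environment, the returned name would be
expected to identify ISO-8859-15, although the exact spelling of charset
names varies between platforms.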

> 
> Without a setlocale call, programs use the "C" locale, and on Cygwin
> 1.7 that implies the UTF-8 character set. Those single accented
> ISO-8859-15 characters are invalid when interpreted as UTF-8, so
> printf halts there. The accented character pairs like "ÃÂ", meanwhile,
> happen to be valid UTF-8, so they get through.
> 
> I couldn't find specific text about invalid bytes in the POSIX printf
> spec,

http://www.opengroup.org/onlinepubs/9699919799/functions/fprintf.html

"all forms of fprintf() shall fail if:

[EILSEQ]
    [CX] A wide-character code that does not correspond to a valid
character has been detected."

> It's talking about "characters" rather than "bytes" there, which I
> think does leave the behaviour for invalid bytes undefined,

It's actually well-defined - non-characters in the format string MUST make
printf fail.  However, that raises the question of whether the failure must
occur before any output is produced, or only once the first invalid
character is detected, after earlier characters and % directives have
already been acted upon.  I think the standard is silent on that point,
making it a quality-of-implementation (QoI) issue.
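
Either way, a caller can at least detect the failure by checking the
return value and errno.  Here is a small sketch; print_checked is a
hypothetical wrapper, and because it uses an ASCII-only format it shows
how to notice the error rather than how to provoke it:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Print text plus a newline to out.  Returns the number of bytes
 * written, or -1 after reporting the failure on stderr. */
int print_checked(FILE *out, const char *text)
{
    errno = 0;
    int n = fprintf(out, "%s\n", text);
    if (n < 0) {
        int saved = errno;  /* the next fprintf may overwrite errno */
        fprintf(stderr, "fprintf failed: %s\n",
                saved == EILSEQ ? "invalid character (EILSEQ)"
                                : strerror(saved));
        return -1;
    }
    return n;
}
```

Note that a negative return says nothing about how much output, if any,
already reached the stream before the error was detected.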

Remember, POSIX states that any use in a character context of bytes with
the 8th bit set is specifically undefined in the C locale (whether that be
C.ASCII or C.UTF-8).  Using accented characters (which result in bytes
with the 8th bit set, whether you use UTF-8 or ISO-8859-15) falls into
that category, so the bug is in your program for expecting sane results
while not changing the locale away from C.
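
To make the byte-level picture concrete, here is a rough sketch of two
checks.  Both function names are invented for this example, and
looks_like_utf8 is deliberately simplified: it only verifies that each
lead byte is followed by the right number of continuation bytes, and does
not reject every overlong or surrogate encoding the way a real decoder
must.

```c
#include <stdbool.h>

/* True if any byte in s has the 8th bit set (value >= 0x80). */
bool has_high_bit(const char *s)
{
    for (const unsigned char *p = (const unsigned char *)s; *p; p++)
        if (*p & 0x80)
            return true;
    return false;
}

/* Simplified structural UTF-8 check (see the caveats above). */
bool looks_like_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    while (*p) {
        int extra;                       /* continuation bytes expected */
        if (*p < 0x80)
            extra = 0;                   /* plain ASCII */
        else if (*p >= 0xC2 && *p <= 0xDF)
            extra = 1;                   /* 2-byte sequence */
        else if ((*p & 0xF0) == 0xE0)
            extra = 2;                   /* 3-byte sequence */
        else if (*p >= 0xF0 && *p <= 0xF4)
            extra = 3;                   /* 4-byte sequence */
        else
            return false;                /* stray continuation or bad lead */
        p++;
        for (; extra > 0; extra--, p++)
            if ((*p & 0xC0) != 0x80)     /* also catches the final NUL */
                return false;
    }
    return true;
}
```

An ISO-8859-15 a-acute is the single byte 0xE1: it has the 8th bit set,
and followed by an ASCII letter it is not valid UTF-8, because 0xE1
announces a three-byte sequence.  The byte pair 0xC3 0xA1 also has the
8th bit set in both bytes, but happens to form one valid UTF-8 character,
which matches the behaviour described earlier in this thread.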

- --
Don't work too hard, make some time for fun as well!

Eric Blake             ebb9@byu.net
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAks6AkYACgkQ84KuGfSFAYCcZwCfSqNz9qdjxEBXHMwtPJ+8bx9T
6S4AoJlgfarKywPgDH6TY3Zy16/3jc1K
=YRTJ
-----END PGP SIGNATURE-----
