This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: filenames with characters that have the high bit set


David Byron:
> I've read http://cygwin.com/faq/faq-nochunks.html#faq.using.unicode and
> http://cygwin.com/cygwin-ug-net/setup-locale.html but I'm still stumped.
>
> My cygwin.bat now contains:
>
> @echo off
>
> C:
> chdir C:\utils\cygwin\bin
> set LANG=en_US.UTF-8
> bash --login -i
>
> And my ~/.inputrc contains:
>
> set meta-flag on
> set convert-meta off
> set input-meta on
> set output-meta on

Makes plenty of sense. But note that meta-flag is a synonym for
input-meta, so you can remove one of them.

> $ echo $LC_ALL
> en_US

Hang on, where did that come from? LC_ALL overrides all other locale
variables, including LANG. Specifying a locale without a charset means
that Cygwin 1.7.1 looks up your ANSI codepage. Assuming you're on a US
system, this means you're getting CP1252, not UTF-8. (As an aside:
Cygwin 1.7.2 changes to a Linux-compatible scheme for locales without
an explicit charset, where you'd get ISO-8859-1 instead.)
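
To see which setting wins, here is a rough sketch of the usual POSIX
precedence (LC_ALL, then LC_CTYPE, then LANG). The function name
effective_locale is made up for illustration, and the sketch only picks
the winning variable; it does not model how Cygwin chooses a charset
when the locale has none.

```shell
# Hypothetical helper: print whichever locale setting takes effect for
# character handling. Arguments: the values of LC_ALL, LC_CTYPE and LANG.
effective_locale() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"        # LC_ALL overrides everything else
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"        # LC_CTYPE is consulted next
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"        # LANG is only the fallback
    else
        printf 'C\n'              # nothing set: the C locale
    fi
}

# With the settings from this thread, LC_ALL wins over LANG.
winner=$(effective_locale "en_US" "" "en_US.UTF-8")
echo "$winner"
```

In a live shell you would call it as
effective_locale "$LC_ALL" "$LC_CTYPE" "$LANG", and "locale charmap"
reports the charset that is actually in use.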


> $ echo $LANG
> en_US.UTF-8
>
> For the rest of this post, assume <special_filename> is "foo" with U+00E9 (e
> with acute accent) at the end.
>
> $ test -f <special_filename>; echo $?
>
> prints 1 when <special_filename> really does exist....depending on how I try
> to represent U+00E9 on the command line
>
> $ ls foo<tab>
>
> adds the actual accented character to the command line (whether set
> show-all-if-ambiguous on is in ~/.inputrc or not). Then I press return and
> ls prints the filename. Then if I go through command history and change
> "ls" to "test -f" and add the "; echo $?" I get the right answer from test.
> So far so good.
>
> But, if I try to do what
> http://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-unusual
> says, the test command always fails, and ls doesn't print the filename. I'm
> not really sure how to get hex code 0x18 through bash and to
> ls/test/whatever properly.
>
> This is what I tried:
>
> $ ls "foo\x18<tab>"
> $ ls "foo\x18\xc3\xa9<tab>"
> $ ls "foo\x18\xc3\xa9*"
>
> Note that 0xC3A9 is the UTF-8 encoding of U+00E9.

There's a bunch of things wrong here.

Due to the LC_ALL setting above, the U+00E9 is encoded as \xE9, not \xC3\xA9.

The \x18 scheme is only used for codepoints that cannot be
represented in the selected character set, yet U+00E9 can be
represented in CP1252. By definition, any Unicode codepoint can be
represented in UTF-8, so the \x18 scheme is never used when that is
selected.

Bash does not interpret \x specially when it appears in double quotes
(or single quotes or unquoted):

$ echo "\x18"
\x18

To enable C-style backslash interpretation, you need to use $'...' quoting.
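
A small sketch of the difference, assuming bash (the $'...' form is
not available in every POSIX shell). The variable names are only for
illustration.

```shell
# In double quotes bash keeps \x18 as four literal characters.
plain="\x18"

# In $'...' the escapes are decoded into raw bytes:
# "foo" followed by the UTF-8 encoding of U+00E9.
name=$'foo\xc3\xa9'

# Dump the bytes of each string in hex to compare them.
printf '%s' "$plain" | od -An -tx1
printf '%s' "$name" | od -An -tx1
```

The first dump should show the bytes for a backslash, "x", "1" and
"8"; the second should show 66 6f 6f c3 a9.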

Finally, it would appear that bash does not complete partial UTF-8
sequences, which makes sense, as it's probably dealing with wide
characters internally.

> But all get me nothing. Replacing "ls" with "test -f" gives me the same
> nothing. Replacing \x with \X doesn't change anything either.
>
> Perhaps interesting is that if I pipe the ls command built with tab
> completion that actually prints the filename to "od -c" I see
> Then for kicks I tried:
>
> $ touch "\x18"; echo $?
> 0

Have a look in your root directory. There should be a file called x18 there.

> Can someone give me a hand coming up with a command line where I can build
> up filenames that contain characters that have the high bit set (as well as
> any non-ascii character really)?

Just type them in. The 'US International' keyboard layout might be
useful here. See
http://en.wikipedia.org/wiki/Keyboard_layout#US-International.

Otherwise, use $'...', and lose the unnecessary \x18s.
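
For example, something along these lines should work in bash under a
UTF-8 locale. This is only a sketch: the scratch directory and the
variable names are my own choices. Under a CP1252 or ISO-8859-1 locale
the name would be $'foo\xe9' instead.

```shell
# Use a scratch directory so no existing files are touched.
dir=$(mktemp -d)

# "foo" plus U+00E9 as UTF-8 bytes, with no \x18 prefix.
name=$'foo\xc3\xa9'
touch "$dir/$name"

# test -f succeeds when the bytes given match the name on disk.
test -f "$dir/$name"
status=$?
echo "$status"
```

If the file was created, this prints 0.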

Andy

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

