This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Grepping Unicode files?


On 05/14/2015 10:32 AM, Vince Rice wrote:

> locale run from a cmd.exe session says that everything is "C.UTF-8", while locale run from mintty says that everything is en_US.UTF-8. A "which" in both cases shows that the locale being run is cygwin's, so I assume mintty does something slightly differently than the normal console? I don't even know if there's a difference. (Have I mentioned I don't know anything about all of this?)
> 
> From cmd.exe:
> LANG=
> LC_CTYPE="C.UTF-8"
> LC_NUMERIC="C.UTF-8"
> LC_TIME="C.UTF-8"
> LC_COLLATE="C.UTF-8"
> LC_MONETARY="C.UTF-8"
> LC_MESSAGES="C.UTF-8"
> LC_ALL=

That's because all programs default to the C locale unless told
otherwise (cygwin reports that default as C.UTF-8); from cmd, there is
nothing stating otherwise, as each cygwin command is the first process
in its own tree of processes.
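
A minimal sketch of that inheritance, in plain POSIX shell. The helper
name show_lang is made up for illustration; it starts a child shell and
reports the LANG value that child sees.

```shell
# Hypothetical helper: start a child shell and print the LANG it inherited.
show_lang() {
    sh -c 'echo "LANG=${LANG-not-set}"'
}

# Nothing in the environment, as when a cygwin program is started
# directly from cmd.exe: the child sees no LANG at all, so the C
# library falls back to its default locale.
( unset LANG; show_lang )

# The parent exports a locale first, as mintty does: the child inherits it.
( LANG=en_US.UTF-8; export LANG; show_lang )
```

The first call prints LANG=not-set and the second prints
LANG=en_US.UTF-8, whether or not that locale is actually installed,
since the variable is just passed along in the environment.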

> 
> From mintty
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_ALL=

mintty is a cygwin process, AND it sets your locale variables to match
your Windows locale; all other processes are then children of mintty and
get the preferred locale settings by default.  Of course, if you don't
like mintty's defaults, you can set up your shell initialization scripts
to change them to your preference.
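
For instance, a line like the following in a shell startup file such as
~/.bashrc would do it.  The locale name en_GB.UTF-8 is only an example
value; substitute whichever locale you prefer.  In the mintty output
quoted above only LANG is set explicitly (locale prints the derived
LC_* values in double quotes), so overriding LANG should be enough in
that setup.

```shell
# In ~/.bashrc (or another shell startup file):
# override the locale inherited from mintty.
# en_GB.UTF-8 is an example value, not a recommendation.
export LANG=en_GB.UTF-8
```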

> 
> Now, pardon my continued ignorance, but which of those variables needs to be set to UTF16 in order for grep to work? And I assume it (they?) should be set to en_US.UTF-16?

None.  UTF16 is not a valid locale.  It is a valid encoding (wide
character), but locales must operate on multi-byte sequences, not wide
characters.  So you HAVE to convert from wide character to multi-byte
before you can do anything that requires a locale to work correctly.
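
One way to see the difference is to dump the bytes of the same short
string in both encodings.  This is only a sketch, assuming iconv and od
are installed; the string "hi" is an arbitrary sample.

```shell
# Dump the bytes of the two-character string "hi" in each encoding.

# UTF-8 (identical to ASCII for these characters): bytes 68 69
printf 'hi' | od -An -tx1

# UTF-16, little-endian, no byte-order mark: bytes 68 00 69 00
printf 'hi' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1
```

Every ASCII character picks up a zero byte in UTF-16, which is not what
byte-oriented tools working in a multi-byte locale expect to find in
text.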

> 
> Thanks to everyone for your help. I think you've all confirmed this isn't cygwin-specific, but I couldn't find anything even searching generically ("grep unicode" and now "grep utf16"). I did finally find an external reference to iconv, but if grep is supposed to handle this natively, I haven't been able to find much on how to do it.

grep cannot handle UTF16 natively.  iconv exists to do encoding
transformations, so that the rest of the system can live in a multi-byte
world instead of worrying about wide-character encodings.
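
A rough sketch of that approach: convert to UTF-8 with iconv, then pipe
the result to grep.  The helper name grep_utf16 and the file name
sample-utf16.txt are made up for illustration, and the sample file is
generated on the spot so the example is self-contained.

```shell
# Hypothetical helper: search a UTF-16 file by converting it to UTF-8 first.
# Usage: grep_utf16 PATTERN FILE
grep_utf16() {
    iconv -f UTF-16 -t UTF-8 "$2" | grep -- "$1"
}

# Build a small UTF-16 sample file; iconv writes a byte-order mark,
# which the UTF-16 decoder in grep_utf16 uses to pick the byte order.
printf 'hello\nworld\n' | iconv -f UTF-8 -t UTF-16 > sample-utf16.txt

# Prints the matching line: world
grep_utf16 'wor' sample-utf16.txt
```

If the file has no byte-order mark, naming the byte order explicitly
(UTF-16LE or UTF-16BE) in place of UTF-16 is probably the safer choice.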

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


