This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: regex library fails git tests


On Jul 20 15:52, Mark Levedahl wrote:
> Current git fails two sets of tests on Cygwin, apparently due to
> problems in the regex library. One set of tests does language-based
> word matching, and has a common failure during regex compilation.
> The suffix clause ("|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+") is
> common to all of these; removing that clause eliminates the regcomp
> failure.
> 
> A test case extracted from the git sources is below - this works
> correctly on Fedora 18, fails on Cygwin:
> 
> $ gcc test-regex.c
> $ ./a.out
> failed regcomp() for pattern '[^<>= \t]+|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+'
> 
> The failure disappears when the suffix clause is removed from pat_html.
> 
> This is happening on a current installation:
> $ uname -a
> CYGWIN_NT-5.1 virt-winxp 1.7.21(0.267/5/3) 2013-07-15 12:17 i686 Cygwin

Thanks for the testcase.  The problem is this:  Cygwin's regex is taken
from FreeBSD, so it's not identical to the glibc implementation on Linux.
The FreeBSD implementation converts all input chars to wchar_t and then
handles everything, the pattern as well as the input string, as wchar_t
so that it is independent of locale and codeset.

Your application does not call setlocale, so the locale is "C" or "POSIX"
and the codeset is ANSI_X3.4-1968 (aka ASCII).  The conversion to wchar_t
is performed by calling the mbrtowc function.  This function behaves on
Cygwin the same as on Linux:  If the current locale's codeset is ASCII,
and if the input character is >= 0x80, mbrtowc returns -1 with errno set
to EILSEQ.
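
To see what mbrtowc does with a single byte on a given system, a small
probe along these lines may help.  This is only a sketch: probe_byte is
a hypothetical helper name, and it assumes the program never calls
setlocale, so the "C" locale is in effect.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <wchar.h>

/*
 * Convert one byte with mbrtowc() using a fresh conversion state.
 * Returns mbrtowc()'s result; if that is (size_t)-1, *err holds errno,
 * otherwise *err is 0.
 * probe_byte is a hypothetical helper name, not part of any library.
 */
static size_t probe_byte(unsigned char byte, wchar_t *wc, int *err)
{
    mbstate_t st;
    char in = (char)byte;
    size_t r;

    assert(wc != NULL && err != NULL);
    memset(&st, 0, sizeof st);      /* start in the initial shift state */
    errno = 0;
    r = mbrtowc(wc, &in, 1, &st);
    *err = (r == (size_t)-1) ? errno : 0;
    return r;
}
```

With the behaviour described above, probe_byte(0xc0, &wc, &err) returns
(size_t)-1 and sets err to EILSEQ, while an ASCII byte such as 'A'
converts normally.  Other C libraries may accept the high byte instead,
so the result should be treated as implementation-specific.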

This happens on Cygwin.  The regcomp routine converting the input string
to wchar_t calls mbrtowc, and mbrtowc returns -1 (EILSEQ) because the
input character is >= 0x80 in the bracket expression.
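
The effect on regcomp can be checked with something like the following.
Note this is not Mark's test-regex.c, just a minimal sketch: the helper
name compile_status is made up, and both the REG_EXTENDED flag and the
space-plus-tab inside the first bracket expression are assumptions on
my part.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Try to compile an extended regex and return regcomp()'s status
 * (0 means success).  compile_status is a hypothetical helper name.
 */
static int compile_status(const char *pattern)
{
    regex_t re;
    int rc;

    assert(pattern != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED);   /* REG_EXTENDED is assumed */
    if (rc == 0)
        regfree(&re);                           /* only free on success */
    return rc;
}

/* The ASCII-only part of the pattern (space + tab assumed in the bracket). */
static const char pat_ascii[] = "[^<>= \t]+|[^[:space:]]";

/* The same pattern with the suffix clause containing bytes >= 0x80. */
static const char pat_bytes[] = "[^<>= \t]+|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+";
```

Without a setlocale call, compile_status(pat_ascii) should return 0
everywhere, while compile_status(pat_bytes) returns nonzero on a regcomp
that rejects the high bytes and 0 on one that accepts them.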

Even though the mbrtowc functions behave the same in Cygwin and glibc,
the glibc implementation of regcomp apparently does not call mbrtowc
under all circumstances, namely not in the "C"/"POSIX" locale or if the
locale's codeset is ASCII.  Therefore it does not treat the chars >= 0x80
as invalid characters.

So, what I did now was this:  I added a workaround to Cygwin's regcomp.
If the current codeset is ASCII, the characters in the pattern are
converted to wchar_t by simply using their unsigned value verbatim.
This allows the patterns in the git testcases to be compiled (and tested).
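
The idea of the workaround can be sketched roughly like this.  It is
not the actual code in Cygwin's regcomp; widen_verbatim is a
hypothetical name, and the sketch leaves out the check that the current
codeset is ASCII.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Widen a byte string by taking each byte's unsigned value verbatim,
 * bypassing mbrtowc().  Writes at most dstlen wide chars including the
 * terminator and returns the number written, excluding the terminator.
 * widen_verbatim is a hypothetical name, not the function used in
 * Cygwin's regcomp.
 */
static size_t widen_verbatim(const char *src, wchar_t *dst, size_t dstlen)
{
    size_t i = 0;

    assert(src != NULL && dst != NULL && dstlen > 0);
    while (src[i] != '\0' && i + 1 < dstlen) {
        /* Cast through unsigned char so 0xc0 becomes 192, not negative. */
        dst[i] = (wchar_t)(unsigned char)src[i];
        i++;
    }
    dst[i] = L'\0';
    return i;
}
```

The cast through unsigned char matters: where plain char is signed, a
byte like 0xc0 would otherwise widen to a negative value instead of 192.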

However, please note that this behaviour, while provided by glibc and
now by Cygwin, is *not* standards-compliant.  Strictly speaking, the
characters beyond 0x7f are still invalid ASCII chars, and other
functions working with wchar_t strings won't be as forgiving when given
invalid input.


HTH,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat


