This is the mail archive of the cygwin mailing list for the Cygwin project.

Re: sed doesn't like LANG= anymore

From: Andy Koppe <andy dot koppe at gmail dot com>
To: "cygwin at cygwin dot com" <cygwin at cygwin dot com>
Date: Thu, 20 May 2010 20:46:05 +0300
Subject: Re: sed doesn't like LANG= anymore
References: <20100520123926.GA1432@onderneming10.xs4all.nl> <AANLkTilpbuyiJIswTZGQN5jsHsK793ITUP9pcB95Hf1l@mail.gmail.com> <4BF55F87.4060407@towo.net>

On Thursday, May 20, 2010, Thomas Wolff wrote:
> With LANG=anything-unknown, the charmap is set to ASCII, so it works (as there is at least no multibyte character then).

Anything above 0x7F is invalid with charset ASCII though (since
Cygwin 1.7.2). But perhaps sed skips the multibyte conversion functions
when in the C locale.

> Considering the described effect, I doubt that a UTF-8 decoder should swallow an ASCII byte after an incomplete UTF-8 sequence;
> it should rather stop at the last UTF-8 sequence byte, and consider any subsequent initial UTF-8 or ASCII byte as a new character.

0xE5 is a valid initial byte of a UTF-8 sequence, hence mbtowc returns
-2 ("incomplete") after that and -1 ("invalid") on encountering the
following ASCII byte. I think it would be wrong to ignore the encoding
error, and it's up to the application to back up and feed in the same
byte again if it wants to.
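
To illustrate the "back up and feed in the same byte again" approach, here
is a minimal sketch in C. It does not use the real mbtowc or any Cygwin
internals: feed_byte, utf8_state and decode_with_backup are hypothetical
names, and the decoder is deliberately simplified (it does not reject
overlong forms or surrogates). It only borrows the return convention
described above: -2 for "incomplete", -1 for "invalid". Substituting U+FFFD
for the broken sequence is likewise an assumption about what an application
might choose to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state, standing in for mbstate_t. */
typedef struct {
    int      pending;  /* continuation bytes still expected */
    uint32_t cp;       /* code point accumulated so far */
} utf8_state;

/* Feed one byte. Returns 1 when a character is complete (stored in *out),
   -2 when more bytes are needed, -1 on an encoding error. On error the
   state is reset, so the caller may feed the same byte again.
   Simplified: overlong forms and surrogates are not rejected. */
static int feed_byte(utf8_state *st, unsigned char b, uint32_t *out)
{
    assert(st != NULL && out != NULL);
    if (st->pending > 0) {
        if ((b & 0xC0) != 0x80) {      /* not a continuation byte */
            st->pending = 0;
            return -1;
        }
        st->cp = (st->cp << 6) | (b & 0x3F);
        if (--st->pending > 0)
            return -2;
        *out = st->cp;
        return 1;
    }
    if (b < 0x80) { *out = b; return 1; }
    if ((b & 0xE0) == 0xC0) { st->cp = b & 0x1F; st->pending = 1; return -2; }
    if ((b & 0xF0) == 0xE0) { st->cp = b & 0x0F; st->pending = 2; return -2; }
    if ((b & 0xF8) == 0xF0) { st->cp = b & 0x07; st->pending = 3; return -2; }
    return -1;  /* stray continuation byte or invalid lead byte */
}

/* Decode n bytes into out (capacity of at least n entries), writing U+FFFD
   for each malformed sequence. Returns the number of code points written. */
static size_t decode_with_backup(const unsigned char *in, size_t n,
                                 uint32_t *out)
{
    utf8_state st = {0, 0};
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int was_pending = st.pending > 0;
        uint32_t cp;
        int r = feed_byte(&st, in[i], &cp);
        if (r == 1) {
            out[count++] = cp;
        } else if (r == -1) {
            out[count++] = 0xFFFD;
            if (was_pending)
                i--;  /* back up: this byte may start a new character */
        }
    }
    if (st.pending > 0)
        out[count++] = 0xFFFD;  /* input ended in the middle of a sequence */
    return count;
}
```

With the input bytes 0xE5 'a' 'b', feed_byte returns -2 for 0xE5 and -1 for
'a'; the loop then records U+FFFD, steps back, and feeds 'a' again, so the
result is U+FFFD, 'a', 'b' and no ASCII byte is swallowed.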

Andy
