representing charsets

Andy Koppe andy.koppe@gmail.com
Wed Mar 31 05:53:00 GMT 2010


Corinna Vinschen:
> Andy Koppe:
>> 3) Represent charsets as enum constants (or #defines) rather than
>> strings throughout, with the singlebyte charsets ordered in such a way
>> that they correspond to their order in the conversion tables, along
>> these lines:
>>
>> enum {
>>   CS_UTF8 = 0,
>>
>>   /* ISO singlebyte codepages */
>>   CS_ISO8859_1 = 1,
>>   CS_ISO8859_2 = 2,
>>   ...
>>   CS_ISO8859_11 = 11,
>>   /* ISO-8859-12 doesn't exist */
>>   CS_ISO8859_13 = 12,
>>   ...
>>   CS_ISO8859_16 = 15,
>>
>>   /* Windows singlebyte codepages */
>>   CS_CP437 = 100,
>>   CS_CP720 = 101,
>>   CS_CP737 = 102,
>>   ...
>>
>>   /* Multibyte codepages */
>>   CS_SJIS = 200,
>>   CS_GBK = 201,
>>   ...
>> }
>
> But what is that good for?  Which advantage do you have?

- No need to pass around both charset name and the charset table index.
- The __cp_index and __iso8859_index functions can be junked.
__cp_mbtowc/wctomb obtain the index with (cs_id - CS_CP437). Similarly
for ISO.
- Only one list of valid codepages (since the one in __cp_index can go).
- Get rid of the hack where the likes of KOI8-R or PT154 are
internally represented as "CPxxx" names, some of which don't actually
correspond to Windows codepages.
- All those strcpy() calls in setlocale become simple assignments,
e.g. charset_id = CS_EUCJP instead of strcpy(charset, "EUCJP"). That is
not relevant performance-wise, but it saves space (for embedded
targets).
- Similarly, charset comparisons become simple integer comparisons
instead of strcmps.
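
A rough sketch of the index and comparison points above. The table
contents and the helper names cp_table_for and is_utf8 are made up for
illustration and are not existing newlib code; only the enum values
come from the proposal.

```c
#include <stddef.h>

/* Abbreviated copy of the proposed enum; the elided ids are left out. */
enum {
  CS_UTF8 = 0,
  CS_CP437 = 100,
  CS_CP720 = 101,
  CS_CP737 = 102,
  CS_SJIS = 200
};

/* Hypothetical stand-in for the real conversion tables, kept in enum
   order so that (cs_id - CS_CP437) is the table index. */
static const char *const cp_tables[] = {
  "table for CP437",
  "table for CP720",
  "table for CP737"
};

#define CP_TABLE_COUNT ((int) (sizeof cp_tables / sizeof cp_tables[0]))

/* Hypothetical helper: look up the table for a Windows singlebyte
   codepage id, or return NULL for any other id. */
static const char *
cp_table_for (int cs_id)
{
  int idx = cs_id - CS_CP437;
  if (idx < 0 || idx >= CP_TABLE_COUNT)
    return NULL;
  return cp_tables[idx];
}

/* Charset checks become integer comparisons instead of strcmp calls. */
static int
is_utf8 (int cs_id)
{
  return cs_id == CS_UTF8;
}
```

The range check is the only thing a separate index function would
still be needed for, and it doubles as the validity test for the id.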


> If you
> only keep the number, where do you get the charset name from?

A new function, e.g. 'void __get_charset_name(int cs_id, char *buf)',
where a buffer of size ENCODING_LEN+1 needs to be passed in.
nl_langinfo(CODESET) would simply call that instead of doing its own
strcmp-heavy parsing of internal names to turn them back into official
names.
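
A minimal sketch of what such a function could look like, covering
only a few ids. The name get_charset_name_sketch, the value given to
ENCODING_LEN here, and the exact spellings of the returned names are
assumptions for illustration; the ISO numbering follows from the enum
values quoted above.

```c
#include <stdio.h>
#include <string.h>

#define ENCODING_LEN 31   /* assumed value for this sketch */

/* Abbreviated copy of the proposed enum. */
enum {
  CS_UTF8 = 0,
  CS_ISO8859_1 = 1,
  CS_ISO8859_11 = 11,
  CS_ISO8859_13 = 12,
  CS_ISO8859_16 = 15,
  CS_CP437 = 100
};

/* Hypothetical sketch: write the official name for cs_id into buf,
   which must hold at least ENCODING_LEN + 1 bytes.  Ids this sketch
   does not handle yield an empty string. */
static void
get_charset_name_sketch (int cs_id, char *buf)
{
  if (cs_id == CS_UTF8)
    strcpy (buf, "UTF-8");
  else if (cs_id >= CS_ISO8859_1 && cs_id <= CS_ISO8859_16)
    {
      /* Ids after CS_ISO8859_11 are one less than the part number,
         because ISO-8859-12 doesn't exist. */
      int part = cs_id <= CS_ISO8859_11 ? cs_id : cs_id + 1;
      snprintf (buf, ENCODING_LEN + 1, "ISO-8859-%d", part);
    }
  else if (cs_id == CS_CP437)
    strcpy (buf, "CP437");
  else
    buf[0] = '\0';
}
```

A caller such as nl_langinfo(CODESET) would declare
'char buf[ENCODING_LEN + 1]' and pass it in along with the stored id.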


> Btw., while I was writing the above, it occurred to me that we
> don't really need the index into the iso or cp array.  What we
> really need is a pointer to the array member, which can be used
> immediately.

Good idea, although it won't make much of a difference, because array
indexing is cheap, basically just a shift and an add.
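
To illustrate the two variants side by side, here is a small sketch
with a made-up 3x4 table (demo_tables and both function names are
hypothetical, and the values are placeholders rather than real
conversion data):

```c
/* Hypothetical miniature conversion tables: 3 charsets, 4 entries each. */
static const unsigned short demo_tables[3][4] = {
  { 10, 11, 12, 13 },
  { 20, 21, 22, 23 },
  { 30, 31, 32, 33 }
};

/* Variant 1: keep the table index and index the array on every call. */
static unsigned short
lookup_by_index (int table_idx, int pos)
{
  return demo_tables[table_idx][pos];
}

/* Variant 2: keep a pointer to the array member and use it directly. */
static unsigned short
lookup_by_pointer (const unsigned short *table, int pos)
{
  return table[pos];
}
```

Both return the same entry; the pointer variant just moves the row
selection to the point where the charset is set.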

Andy


