keithp

Can anyone explain why glibc's ctype functions return different values than wctype functions in a utf8 locale for Latin-1 Supplement values (0x80 <= c <= 0xff)?

isupper(0xc4) == 0
iswupper(0xc4) == 1

Is this correct? How is this motivated by the C standard?

Is it because utf-8 is a multibyte encoding where Ä is represented by two bytes in strings?
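
For anyone who wants to reproduce the comparison, here is a minimal sketch. The names `upper_pair` and `classify_upper` are made up for illustration; the only library calls are the standard `isupper` and `iswupper`. It reports how both families classify the same value in whatever locale is currently selected.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <wctype.h>

/* Hypothetical helper type: the answers of both function families. */
struct upper_pair {
    int narrow; /* 1 if isupper() says uppercase, else 0 */
    int wide;   /* 1 if iswupper() says uppercase, else 0 */
};

/* Classify one value with both families in the current locale.
 * ctype functions only accept unsigned char values (or EOF), so the
 * argument is restricted to 0..UCHAR_MAX here. */
static struct upper_pair classify_upper(int c)
{
    struct upper_pair r;

    assert(c >= 0 && c <= UCHAR_MAX);
    r.narrow = isupper(c) != 0;
    r.wide = iswupper((wint_t)c) != 0;
    return r;
}
```

To see the mismatch, call `setlocale(LC_ALL, "")` with a UTF-8 locale selected in the environment and then `classify_upper(0xc4)`. The values above (narrow 0, wide 1) are what glibc gave me; other C libraries may answer differently.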

@keithp This is expected behaviour for the C locale… maybe it's related to UTF-8 being multi-byte? If isupper(0xc384) and !isupper(0xc3a4), then it's probably multi-byte.

wchar_t is usually 4 bytes (except on Windows, which doesn't follow the C standard), so you should be able to use that if you know you have Unicode codepoints.

@wizzwizz4 Sure, the default C locale is (effectively) ASCII, so values between 0x80 and 0xff aren't part of the character set.

The ctype functions only want an unsigned char or EOF, so you can't pass them larger values; that's what the wctype functions are for, but those use wchar_t which are always(?) Unicode.
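
A rough sketch of that route: decode the first character of a multibyte string with `mbrtowc`, then hand the resulting `wchar_t` to `iswupper`. The function name `first_char_is_upper` is made up for this example, and the result depends on the locale in effect when it is called.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Returns 1 if the first character of the multibyte string s is
 * uppercase in the current locale, 0 if it is not (including the
 * empty string), and -1 if s does not begin with a complete, valid
 * character in the current locale's encoding. */
static int first_char_is_upper(const char *s)
{
    mbstate_t st;
    wchar_t wc = L'\0';
    size_t n;

    assert(s != NULL);
    memset(&st, 0, sizeof st); /* start in the initial shift state */

    /* Include the terminating NUL so an empty string decodes to L'\0'. */
    n = mbrtowc(&wc, s, strlen(s) + 1, &st);
    if (n == (size_t)-1 || n == (size_t)-2)
        return -1; /* invalid or incomplete sequence */

    return iswupper((wint_t)wc) != 0;
}
```

In a UTF-8 locale, passing the two-byte string "\xc3\x84" (Ä) goes through `iswupper`, whereas calling `isupper` on either byte alone only ever sees half of the character.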

@keithp Out of open-std.org/jtc1/sc22/wg14/ww §7.4.1 and §5.2.2, I think this is the most relevant passage:

> The presence, meaning, and representation of any additional members [beyond the basic character set] is locale-specific.

But it doesn't look like the C specification actually says what should happen in the case where a valid `unsigned char` value does not represent a character. It doesn't say it's UB, but it also doesn't seem to define the behaviour.

@wizzwizz4 Yeah, I'm going to just read this as:

"UTF-8 is a multi-byte encoding, and a single byte with the high-bit set is not a valid utf-8 value"

If I squint, that kinda makes sense.
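
To make that reading concrete, a small locale-independent sketch of the UTF-8 byte layout. Both function names are made up for illustration. It shows that 0xc4 on its own is only the lead byte of a two-byte sequence, and that U+00C4 (Ä) itself encodes as 0xc3 0x84.

```c
#include <assert.h>

/* Length of the UTF-8 sequence introduced by lead byte b, or 0 if b
 * cannot start a sequence (a continuation byte 0x80..0xbf, or one of
 * the never-valid bytes 0xc0, 0xc1, 0xf5..0xff). */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1; /* ASCII: complete on its own */
    if (b >= 0xc2 && b <= 0xdf)
        return 2;
    if (b >= 0xe0 && b <= 0xef)
        return 3;
    if (b >= 0xf0 && b <= 0xf4)
        return 4;
    return 0;
}

/* Encode a code point in the two-byte range U+0080..U+07FF. */
static void utf8_encode_2byte(unsigned cp, unsigned char out[2])
{
    assert(cp >= 0x80 && cp <= 0x7ff);
    out[0] = (unsigned char)(0xc0 | (cp >> 6));   /* 110xxxxx */
    out[1] = (unsigned char)(0x80 | (cp & 0x3f)); /* 10xxxxxx */
}
```

So `utf8_seq_len(0xc4)` is 2, meaning the byte needs a continuation byte after it, and `utf8_encode_2byte(0xc4, out)` yields 0xc3 0x84. Only bytes below 0x80 are complete characters by themselves.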

I pushed a bunch of ctype fixes to picolibc today, including testing for every encoding picolibc supports. There were a number of minor ctype differences with the Unicode tables. ISO-8859-10 was completely broken because the code checked for > 10 instead of >= 10. ctype is now generated from Unicode data.