Fosstodon @fosstodon

Can anyone explain why glibc's ctype functions return different values than the wctype functions in a utf8 locale for Latin-1 Supplement values (0x80 <= c <= 0xff)?

isupper(0xc4) == 0
iswupper(0xc4) == 1

Is this correct? How is this motivated by the C standard?

Is it because utf-8 is a multibyte encoding where Ä is represented by two bytes in strings?
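A minimal sketch for reproducing the comparison above, not taken from the thread. The helper names (`use_utf8_locale`, `narrow_is_upper`, `wide_is_upper`) and the list of locale names are assumptions for illustration; which UTF-8 locales are installed varies by system.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
 * Returns the name that was accepted, or NULL if none is installed. */
const char *use_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "C.utf8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_ALL, names[i]) != NULL)
            return names[i];
    }
    return NULL;
}

/* Narrow classification: the argument is one unsigned char value,
 * i.e. a single byte, not a Unicode code point. */
int narrow_is_upper(int c)
{
    assert(c >= 0 && c <= UCHAR_MAX); /* other values (apart from EOF) are undefined */
    return isupper(c) != 0;
}

/* Wide classification: the argument is a whole character as a wchar_t value. */
int wide_is_upper(wint_t wc)
{
    return iswupper(wc) != 0;
}
```

After a successful `use_utf8_locale()` call on glibc, `narrow_is_upper(0xc4)` and `wide_is_upper(0xc4)` should give the 0 and 1 quoted in the post; other C libraries may differ.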

Jan 23, 2025, 10:20 PM


**wizzwizz4** @wizzwizz4 · Jan 23

@keithp This is expected behaviour for the C locale… maybe it's related to UTF-8 being multi-byte? If isupper(0xc384) and !isupper(0xc3a4), then it's probably multi-byte.

wchar_t is usually 4 bytes (except on Windows, which doesn't follow the C standard), so you should be able to use that if you know you have Unicode code points.

**keithp** @keithp · Jan 23

@wizzwizz4 Sure, the default C locale is (effectively) ASCII, so values between 0x80 and 0xff aren't part of the character set.

The ctype functions only want an unsigned char or EOF, so you can't pass them larger values; that's what the wctype functions are for, but those use wchar_t which are always(?) Unicode.
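A rough sketch of the route described here: decode a multibyte string into a `wchar_t` with `mbrtowc`, then hand the result to a wctype function. The helper names `use_utf8_locale` and `first_char_is_upper` and the locale name list are assumptions for illustration, and it relies on the implementation storing Unicode code points in `wchar_t` (as glibc does).

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
 * Returns the name that was accepted, or NULL if none is installed. */
const char *use_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "C.utf8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_ALL, names[i]) != NULL)
            return names[i];
    }
    return NULL;
}

/* Decode the first multibyte character of s in the current locale and
 * classify it. Returns 1 or 0, or -1 if s is empty or does not start
 * with a complete, valid character. */
int first_char_is_upper(const char *s)
{
    mbstate_t state;
    wchar_t wc;
    size_t n;

    assert(s != NULL);
    memset(&state, 0, sizeof state); /* start in the initial conversion state */
    n = mbrtowc(&wc, s, strlen(s), &state);
    if (n == 0 || n == (size_t)-1 || n == (size_t)-2)
        return -1;
    return iswupper((wint_t)wc) != 0;
}
```

In a UTF-8 locale, `first_char_is_upper("\xc3\x84")` (the two bytes of Ä) should report upper case and `first_char_is_upper("\xc3\xa4")` (ä) should not, which is the distinction the ctype functions cannot make one byte at a time.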

**wizzwizz4** @wizzwizz4 · Jan 23

@keithp Out of https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf §7.4.1 and §5.2.2, I think this is the most relevant passage:

> The presence, meaning, and representation of any additional members [beyond the basic character set] is locale-specific.

But it doesn't look like the C specification actually says what should happen in the case where a valid `unsigned char` value does not represent a character. It doesn't say it's UB, but it also doesn't seem to define the behaviour.

**keithp** @keithp · Jan 24

@wizzwizz4 Yeah, I'm going to just read this as:

"UTF-8 is a multi-byte encoding, and a single byte with the high-bit set is not a valid utf-8 value"

If I squint, that kinda makes sense.
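One way to check that reading, sketched under assumptions: `btowc` returns `WEOF` when a byte is not a complete single-byte character in the current locale, so in a UTF-8 locale it should reject every byte with the high bit set. The helper names `use_utf8_locale` and `byte_is_whole_char` and the locale name list are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
 * Returns the name that was accepted, or NULL if none is installed. */
const char *use_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "C.utf8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_ALL, names[i]) != NULL)
            return names[i];
    }
    return NULL;
}

/* Returns 1 if the single byte c is, on its own, a complete character
 * in the current locale's encoding, and 0 otherwise. */
int byte_is_whole_char(int c)
{
    assert(c >= 0 && c <= UCHAR_MAX);
    return btowc(c) != WEOF;
}
```

In a UTF-8 locale, `byte_is_whole_char(0xc4)` should be 0 because 0xc4 is only the lead byte of a two-byte sequence, while `byte_is_whole_char('A')` is 1. In a single-byte locale such as ISO-8859-1 the same byte would be a character by itself.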

I pushed a bunch of ctype fixes to picolibc today, including testing for every encoding picolibc supports. There were a number of minor ctype differences with the Unicode tables. ISO-8859-10 was completely broken because the code checked for > 10 instead of >= 10. ctype is now generated from Unicode data.
