Does C and C++ guarantee the ASCII of [a-f] and [A-F] characters?

Question

Welcome To Ask or Share your Answers For Others

Does C and C++ guarantee the ASCII of [a-f] and [A-F] characters?

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

Does C and C++ guarantee the ASCII of [a-f] and [A-F] characters?

I'm looking at the following code to test for a hexadecimal digit and convert it to an integer. The code is kind of clever in that it takes advantage of difference between between capital and lower letters is 32, and that's bit 5. So the code performs one extra OR, but saves one JMP and two CMPs.

static const int BIT_FIVE = (1 << 5);
static const char str[] = "0123456789ABCDEFabcdef";

for (unsigned int i = 0; i < COUNTOF(str); i++)
{
    int digit, ch = str[i];

    if (ch >= '0' && ch <= '9')
        digit = ch - '0';
    else if ((ch |= BIT_FIVE) >= 'a' && ch <= 'f')
        digit = ch - 'a' + 10;
    ...
}

Do C and C++ guarantee the ASCII or values of [a-f] and [A-F] characters? Here, guarantee means the upper and lower character sets will always differ by a constant value that can be represented by a bit (for the trick above). If not, what does the standard say about them?

(Sorry for the C and C++ tag. I'm interested in both language's position on the subject).

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T01:11:09+0000

No, it does not.

The C standard guarantees that the decimal digits and uppercase and lowercase letters exist, along with a number of other characters. It also guarantees that the decimal digits are contiguous, for example '0' + 9 == '9', and that all members of the basic execution character set have non-negative values. It specifically does not guarantee that the letters are contiguous. (For all the gory details, see the N1570 draft of the C standard, section 5.2.1; the guarantee that basic characters are non-negative is in 6.2.5p3, in the discussion of type char.)

The assumption that 'a' .. 'f' and 'A' .. 'F' have contiguous codes is almost certainly a reasonable one. In ASCII and all ASCII-based character sets, the 26 lowercase letters are contiguous, as are the 26 uppercase letters. Even in EBCDIC, the only significant rival to ASCII, the alphabet as a whole is not contiguous, but the letters 'a' ..'f' and 'A' .. 'F' are (EBCDIC has gaps between 'i' and 'j', between 'r' and 's', between 'I' and 'J', and between 'R' and 'S').

However, the assumption that setting bit 5 of the representation will convert uppercase letters to lowercase is not valid for EBCDIC. In ASCII, the codes for the lowercase and uppercase letters differ by 32; in EBCDIC they differ by 64.

This kind of bit-twiddling to save an instruction or two might be reasonable in code that's part of the standard library or that's known to be performance-critical. The implicit assumption of an ASCII-based character set should IMHO at least be made explicit by a comment. A 256-element static lookup table would probably be even faster at the expense of a tiny amount of extra storage.

Categories

Does C and C++ guarantee the ASCII of [a-f] and [A-F] characters?

Does C and C++ guarantee the ASCII of [a-f] and [A-F] characters?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags