What does this piece of C++ code, which operates on a char string, do?

bool check(const char *text) {
    char c;
    while (c = *text++) {
        if ((c & 0x80) && ((*text) & 0x80)) {
            return true;
        }
    }
    return false;
}

What's 0x80, and what does the whole mysterious function do?



Solution 1:[1]

Rewriting to be less compact:

bool check(const char *text)
{
    while (true)
    {
        char c = *text;
        text += 1;
        if (c == '\0') // at the end of string?
            return false;

        int temp1 = c & 0x80;          // test MSB of c
        int temp2 = (*text) & 0x80;    // test MSB of next character
        if (temp1 != 0 && temp2 != 0)  // if both are set, then return true
            return true;
    }
}

MSB means Most Significant Bit, i.e. bit 7 of an 8-bit char. It is zero for plain ASCII characters.
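To see the bit test in isolation, here is a minimal sketch. The helper name has_msb_set is hypothetical (it is not in the original code), it assumes an 8-bit char, and it is marked constexpr only so that it can be checked at compile time.

```cpp
// Hypothetical helper, not part of the original code: reports whether
// bit 7 (the most significant bit of an 8-bit char) is set.
constexpr bool has_msb_set(char c) {
    // 0x80 is 1000 0000 in binary, so the AND keeps only bit 7.
    return (c & 0x80) != 0;
}
```

For example, has_msb_set('A') is false because 'A' is 0x41 (0100 0001), while has_msb_set('\xC3') is true because 0xC3 is 1100 0011.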

Solution 2:[2]

Testing the result of an x & 0x80 expression for non-zero (as is done twice in the code you show) checks if the most significant bit (bit 7) of the char operand (x) is set (see note 1 below). In your case, the code loops through the given string looking for two consecutive characters (c, which is a copy of the 'current' character, and *text, the next one) with that bit set.

If such a combination is found, the function returns true; if it is not found and the loop reaches the nul terminator (so that the c = *text++ expression becomes zero), it returns false.
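To make those return values concrete, here is a sketch of the same function with a few inputs. The constexpr qualifier, the initialisation of c, and the explicit comparison against '\0' are my additions so that the results can be checked at compile time; the logic is otherwise the one from the question. The byte values in the examples are assumptions about how the text might be encoded.

```cpp
// Same logic as the function in the question; constexpr is added here
// only so the examples can be verified by the compiler.
constexpr bool check(const char *text) {
    char c = '\0';
    while ((c = *text++) != '\0') {
        // c is the current char, *text is already the next one.
        if ((c & 0x80) && ((*text) & 0x80)) {
            return true;
        }
    }
    return false;
}
```

With this, check("hello") and check("") are false. check("na\xC3\xAFve") is true, because 0xC3 0xAF (the UTF-8 encoding of 'ï') are two adjacent bytes with bit 7 set. check("na\xEFve") is false: the single byte 0xEF (the Latin-1 'ï') has bit 7 set, but its neighbours do not.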

As to why it does such a check – I can only guess but, if that upper bit is set, then the character will not be a standard ASCII value (and may be the first byte of a multi-byte sequence, as used by UTF-8 or some other multi-byte character representation).


Note 1: This bitwise AND test is really the only safe way to check that bit, because the C++ Standard allows the char type to be either signed (where testing for a negative value would be an alternative) or unsigned (where testing for >= 128 would be required); either of those tests would fail if the implementation's char had the 'wrong' type of signedness.
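The signedness point can be sketched as follows. All three function names are hypothetical, and the code assumes an 8-bit char. The cast to unsigned char is an alternative I am adding for comparison; it is not something the original code uses.

```cpp
// Portable: the mask gives the same answer whether char is signed or unsigned.
constexpr bool msb_via_mask(char c) {
    return (c & 0x80) != 0;
}

// Also portable: convert to unsigned char first, then compare.
constexpr bool msb_via_unsigned_cast(char c) {
    return static_cast<unsigned char>(c) >= 0x80;
}

// NOT portable: gives the right answer only where plain char is signed.
// Where char is unsigned, this is always false.
constexpr bool msb_via_sign(char c) {
    return c < 0;
}
```

On a platform where plain char is signed, all three agree. Where it is unsigned, msb_via_sign silently stops detecting anything, which is the failure mode the note describes.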

Solution 3:[3]

I can't be totally sure without more context, but it looks to me like this function checks to see if a string contains any UTF-8 characters outside the classic 7-bit US-ASCII range.

while (c=*text++) will loop until it finds the nul-terminator in a C-style string; assigning each char to c as it goes. c & 0x80 checks if the most-significant-bit of c is set. *text & 0x80 does the same for the char pointed to by text (which will be the one after c, since it was incremented as part of the while condition).
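The effect of the post-increment can be shown on its own. This is only an illustrative sketch: Step and read_and_advance are hypothetical names, and the function assumes text points at a non-empty, nul-terminated string (otherwise reading the "next" char would go past the terminator).

```cpp
// Hypothetical illustration of what `c = *text++` leaves behind.
struct Step {
    char current;  // the value that was assigned to c
    char next;     // what *text refers to after the increment
};

// Precondition (assumed): text is non-empty and nul-terminated.
constexpr Step read_and_advance(const char *text) {
    char c = *text++;        // read the current char, then advance the pointer
    return Step{c, *text};   // *text is now the following char
}
```

For "ab" this yields current == 'a' and next == 'b'; for "a" the next char is the nul terminator, which has bit 7 clear, so the original function never reads past the end of the string.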

Thus this function will return true if any two adjacent chars in the string pointed to by text have their most-significant-bit set. In UTF-8, every code point at U+0080 or above is encoded as two or more bytes, each with its most-significant-bit set; hence my guess that this function is for detecting UTF-8 text.
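To illustrate why every non-ASCII code point trips the check, here is a minimal sketch of the UTF-8 byte layout. Utf8Bytes, encode_utf8 and all_bytes_have_msb are hypothetical names, and the encoder is deliberately incomplete: it assumes a valid code point and does not reject surrogates or values above U+10FFFF.

```cpp
// Hypothetical container for one encoded code point (at most 4 bytes).
struct Utf8Bytes {
    unsigned char bytes[4];
    int length;
};

// Minimal UTF-8 encoder sketch; assumes cp is a valid Unicode scalar value.
constexpr Utf8Bytes encode_utf8(char32_t cp) {
    Utf8Bytes out = {{0, 0, 0, 0}, 0};
    if (cp < 0x80) {                       // 0xxxxxxx (plain ASCII)
        out.bytes[0] = static_cast<unsigned char>(cp);
        out.length = 1;
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out.bytes[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        out.bytes[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        out.length = 2;
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out.bytes[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        out.bytes[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out.bytes[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        out.length = 3;
    } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.bytes[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
        out.bytes[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        out.bytes[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out.bytes[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        out.length = 4;
    }
    return out;
}

// True if every encoded byte has bit 7 set.
constexpr bool all_bytes_have_msb(Utf8Bytes e) {
    for (int i = 0; i < e.length; ++i) {
        if ((e.bytes[i] & 0x80) == 0) return false;
    }
    return true;
}
```

Under these assumptions, U+00E9 encodes as 0xC3 0xA9 and U+20AC as 0xE2 0x82 0xAC, so any such character puts at least two adjacent high-bit bytes into the string. Note that the reverse does not hold: two adjacent high-bit bytes can also occur in other encodings, so the check is a heuristic rather than proof of UTF-8.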

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ted Klein Bergman
Solution 2
Solution 3 Miles Budnek