What does this char-string-related piece of C++ code do?
bool check(const char *text) {
    char c;
    while (c = *text++) {
        if ((c & 0x80) && ((*text) & 0x80)) {
            return true;
        }
    }
    return false;
}
What's 0x80, and what does the whole mysterious function do?
Solution 1:[1]
Rewriting to be less compact:
while (true)
{
    char c = *text;
    text += 1;
    if (c == '\0') // at the end of the string?
        return false;
    int temp1 = c & 0x80; // test MSB of c
    int temp2 = (*text) & 0x80; // test MSB of the next character
    if (temp1 != 0 && temp2 != 0) // if both are set, then return true
        return true;
}
MSB means most significant bit, i.e. bit 7. It is zero for plain ASCII characters.
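To make the role of 0x80 concrete, here is a minimal sketch. The helper name has_high_bit and the demo function are made up for illustration and are not part of the original code; the character values in the comments assume an ASCII-compatible execution character set.

```cpp
#include <cassert>

// Hypothetical helper (not from the question): true when bit 7 of c is set.
constexpr bool has_high_bit(char c) {
    return (c & 0x80) != 0;
}

// 0x80 is hexadecimal for 128, binary 1000 0000: only bit 7 is set.
static_assert(0x80 == 128, "0x80 is 128 in decimal");
static_assert(0x80 == (1 << 7), "0x80 has exactly bit 7 set");

void demo_high_bit() {
    assert(!has_high_bit('A'));    // 'A' is 0x41 in ASCII, bit 7 clear
    assert(!has_high_bit('~'));    // '~' is 0x7E, still below 0x80
    assert(has_high_bit('\xC3'));  // 0xC3 is outside the 7-bit ASCII range
}
```

Every value from 0x00 to 0x7F leaves bit 7 clear, which is why the mask separates plain ASCII from everything else.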
Solution 2:[2]
Testing the result of an x & 0x80 expression for non-zero (as is done twice in the code you show) checks whether the most significant bit (bit 7) of the char operand (x) is set (see footnote 1). In your case, the code loops through the given string looking for two consecutive characters (c, which is a copy of the 'current' character, and *text, the next one) with that bit set.
If such a combination is found, the function returns true; if it is not found and the loop reaches the nul terminator (so that the c = *text++ expression becomes zero), it returns false.
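The behaviour described above can be tried on a few sample strings. This is only a sketch: check is the function from the question with one extra pair of parentheses around the assignment (to silence compiler warnings), and demo_check plus the sample strings are illustrative additions.

```cpp
#include <cassert>

// The function from the question; only the parentheses around the
// assignment in the loop condition were added.
bool check(const char *text) {
    char c;
    while ((c = *text++)) {
        if ((c & 0x80) && ((*text) & 0x80)) {
            return true;
        }
    }
    return false;
}

void demo_check() {
    assert(!check(""));             // empty string: loop body never runs
    assert(!check("hello"));        // plain ASCII: no high bits at all
    // One isolated high byte between ASCII letters. The literal is split
    // so that 'b' is not parsed as part of the \x escape.
    assert(!check("a\xE9" "b"));
    assert(!check("\xE9"));         // high byte followed by the terminator
    assert(check("caf\xC3\xA9"));   // two consecutive high bytes
}
```

Note that a single high byte never triggers a true result; the function needs two of them side by side.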
As to why it does such a check – I can only guess but, if that upper bit is set, then the character will not be a standard ASCII value (and may be the first of a Unicode pair, or some other multi-byte character representation).
1. Note that this bitwise AND test is really the only safe way to check that bit, because the C++ Standard allows the char type to be either signed (where testing for a negative value would be an alternative) or unsigned (where testing for >= 128 would be required); either of those tests would fail if the implementation's char had the 'wrong' type of signedness.
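The signedness point can be illustrated with a small sketch. The names high_bit_set and demo_signedness are hypothetical, and the value -23 assumes the usual two's-complement representation (guaranteed since C++20, implementation-defined but near-universal before that).

```cpp
#include <cassert>

// Hypothetical helper: the mask test applied to an already promoted value.
constexpr bool high_bit_set(int promoted) {
    return (promoted & 0x80) != 0;
}

void demo_signedness() {
    // The same byte, 0xE9, as seen through each flavour of char after
    // integer promotion.
    int as_signed   = static_cast<signed char>(0xE9);    // -23
    int as_unsigned = static_cast<unsigned char>(0xE9);  // 233

    // A sign test only works when char is signed...
    assert(as_signed < 0);
    assert(!(as_unsigned < 0));

    // ...and a >= 128 test only works when char is unsigned.
    assert(as_unsigned >= 128);
    assert(!(as_signed >= 128));

    // The mask test gives the same answer in both cases.
    assert(high_bit_set(as_signed));
    assert(high_bit_set(as_unsigned));
}
```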
Solution 3:[3]
I can't be totally sure without more context, but it looks to me like this function checks to see if a string contains any UTF-8 characters outside the classic 7-bit US-ASCII range.
while (c=*text++) will loop until it finds the nul-terminator in a C-style string, assigning each char to c as it goes. c & 0x80 checks if the most significant bit of c is set. *text & 0x80 does the same for the char pointed to by text (which will be the one after c, since it was incremented as part of the while condition).
Thus this function will return true if any two adjacent chars in the string pointed to by text have their most significant bit set. That's the case for any code point U+0080 and above in UTF-8, since each is encoded as two or more bytes that all have the high bit set; hence my guess that this function is for detecting UTF-8 text.
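As a rough illustration of why UTF-8 text trips the test, the sketch below encodes a code point in the two-byte range by hand and confirms that both resulting bytes have bit 7 set. The helpers encode_two_byte_utf8 and all_high_bits_set are made up for this example and only cover U+0080 to U+07FF; longer sequences follow the same pattern of high-bit bytes.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes a code point in U+0080..U+07FF as the
// two-byte UTF-8 sequence 110xxxxx 10xxxxxx.
std::string encode_two_byte_utf8(unsigned code_point) {
    assert(code_point >= 0x80 && code_point <= 0x7FF);
    std::string out;
    out += static_cast<char>(0xC0 | (code_point >> 6));    // lead byte
    out += static_cast<char>(0x80 | (code_point & 0x3F));  // continuation byte
    return out;
}

// Hypothetical helper: true when every byte has its most significant bit set.
bool all_high_bits_set(const std::string &bytes) {
    for (char c : bytes) {
        if ((c & 0x80) == 0) {
            return false;
        }
    }
    return true;
}
```

For example, encode_two_byte_utf8(0xE9) yields the bytes 0xC3 0xA9 (the letter é), both of which are 0x80 or above. The converse does not hold: two adjacent high bytes in some other 8-bit encoding would also make the function return true, so it is a heuristic rather than a UTF-8 validator.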
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Ted Klein Bergman |
Solution 2 | |
Solution 3 | Miles Budnek |