Comparison of byte literals in Python

The following question arose because I was trying to use byte strings as dictionary keys, and bytes values that I understood to be equal weren't being treated as equal.

Why doesn't the following Python code compare equal - aren't these two equivalent representations of the same binary data (the example is knowingly chosen to avoid endianness)?

b'0b11111111' == b'0xff'

I know the following evaluates true, demonstrating the equivalence:

int(b'0b11111111', 2) == int(b'0xff', 16)

But why does Python force me to know the representation? Is it related to endianness? Is there some easy way to force these to compare equivalent other than converting them all to, e.g., hexadecimal literals? Is there a transparent and clear method to move between all representations in a (somewhat) platform independent way (or am I asking too much)?

Say I want to actually index a dictionary using 8 bits in the form b'0b11111111', then why does Python expand it to ten bytes and how do I prevent that?

This is a small piece of a large tree data structure, and expanding my indexing by a factor of 10 (ten bytes instead of one) seems like a huge waste of memory.



Solution 1:[1]

Bytes can represent any number of things. Python cannot and will not guess at what your bytes might encode.

For example, int(b'0b11111111', 34) is also a valid interpretation, but that interpretation is not equal to hex FF.
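To make the ambiguity concrete, here is a small sketch that parses the same ten bytes under two different bases. The variable names are illustrative only.

```python
raw = b'0b11111111'

# With base 2, int() accepts the optional '0b' prefix and parses eight 1-bits.
as_binary = int(raw, 2)

# With base 34, '0' and 'b' are ordinary digits (b has the value 11),
# so the whole ten-character string is parsed as one base-34 number.
as_base34 = int(raw, 34)

print(as_binary)               # 255
print(as_base34 == as_binary)  # False
```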

The number of interpretations, in fact, is endless. The bytes could represent a series of ASCII codepoints, or image colors, or musical notes.

Until you explicitly apply an interpretation, the bytes object consists of just a sequence of values in the range 0-255, and the textual representation of those bytes uses ASCII characters wherever a byte is representable as printable text:

>>> list(bytes(b'0b11111111'))
[48, 98, 49, 49, 49, 49, 49, 49, 49, 49]
>>> list(bytes(b'0xff'))
[48, 120, 102, 102]

Those byte sequences are not equal.

If you want to interpret these sequences explicitly as integer literals, then use ast.literal_eval() to interpret decoded text values; always normalise first before comparison:

>>> import ast
>>> ast.literal_eval(b'0b11111111'.decode('utf8'))
255
>>> ast.literal_eval(b'0xff'.decode('utf8'))
255
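For the dictionary-key use case in the question, one possible approach is to normalise every key to an int before using it. The sketch below assumes the keys are ASCII integer literals; normalise_key and table are hypothetical names, and int(text, 0) is used here instead of ast.literal_eval() because base 0 infers the base from the literal's prefix.

```python
def normalise_key(raw: bytes) -> int:
    """Interpret ASCII bytes such as b'0xff' or b'0b11111111' as an integer.

    Hypothetical helper: base 0 tells int() to infer the base from the
    literal's prefix (0b, 0o, 0x, or none for decimal).
    """
    return int(raw.decode('ascii'), 0)


table = {}
table[normalise_key(b'0b11111111')] = 'node'

# Both spellings normalise to the same key, 255.
print(table[normalise_key(b'0xff')])   # node
print(table[normalise_key(b'255')])    # node
```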

Solution 2:[2]

b'0b11111111' consists of 10 bytes:

In [44]: list(b'0b11111111')
Out[44]: [48, 98, 49, 49, 49, 49, 49, 49, 49, 49]

whereas b'0xff' consists of 4 bytes:

In [45]: list(b'0xff')
Out[45]: [48, 120, 102, 102]

Clearly, they are not equal: the two sequences differ in both length and content.

Python values explicitness. (Explicit is better than implicit.) It does not assume that b'0b11111111' is necessarily the binary representation of an integer. It's just a string of bytes. How you choose to interpret it must be explicitly stated.

Solution 3:[3]

It seems that what you were trying to do is get a byte string representing the value 0b11111111 (or 255). This is not what b'0b11111111' does – that actually stands for a byte string representing the character (Unicode) string '0b11111111'.

What you want would be written as b'\xff'. You can check that it is actually one byte: len(b'\xff') == 1.
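As a small sketch of why this solves the dictionary problem: once the value is stored as a raw byte, the spelling used in the source code no longer matters. The names from_binary, from_hex and tree are illustrative only.

```python
# bytes([n]) builds a one-byte object from an integer in range(256).
from_binary = bytes([0b11111111])
from_hex = bytes([0xff])

print(from_binary == from_hex == b'\xff')  # True
print(len(from_binary))                    # 1

# Equal bytes objects hash equally, so either spelling finds the same entry.
tree = {from_binary: 'child'}
print(tree[from_hex])                      # child
```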

To convert a Python int to a binary representation, you can use the ctypes library. You need to choose one of the C integer types, e.g.:

>>> import ctypes
>>> bytes(ctypes.c_ubyte(255))
b'\xff'

>>> bytes(ctypes.c_ubyte(0xff))
b'\xff'

>>> bytes(ctypes.c_long(255))
b'\xff\x00\x00\x00\x00\x00\x00\x00'

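Note: c_uint8 is an alias for c_ubyte (an 8-bit unsigned C integer). c_long matches the platform's C long, which is 64 bits on most 64-bit Unix-like systems but 32 bits on Windows, so use c_int64 when you need a 64-bit signed integer everywhere. The bytes are produced in the platform's native byte order (little-endian in the output above).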
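The type sizes can be checked directly with ctypes.sizeof. The following is a minimal sketch; the c_long size and the byte order it prints vary by platform, so no expected output is given for those lines.

```python
import ctypes
import sys

# c_uint8 and c_int64 have fixed widths on every platform.
print(ctypes.sizeof(ctypes.c_uint8))   # 1
print(ctypes.sizeof(ctypes.c_int64))   # 8

# c_long follows the platform's C 'long', so its size can differ.
print(ctypes.sizeof(ctypes.c_long))

# ctypes uses native byte order, so the position of the 0xff byte
# depends on sys.byteorder.
encoded = bytes(ctypes.c_int64(255))
print(sys.byteorder, encoded)
```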

To convert back:

>>> ctypes.c_ubyte.from_buffer_copy(b'\xff').value
255

Be careful about overflow:

>>> ctypes.c_ubyte(256)
c_ubyte(0)
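As an alternative to ctypes (not part of the original answer), the built-in int.to_bytes and int.from_bytes methods state the width and byte order explicitly, which may help with the platform-independence concern in the question. A minimal sketch:

```python
value = 0b11111111

# Width (1 byte) and byte order are stated explicitly.
encoded = value.to_bytes(1, 'big')
print(encoded)                          # b'\xff'
print(int.from_bytes(encoded, 'big'))   # 255

# Unlike ctypes.c_ubyte(256), a value that does not fit raises an error.
try:
    (256).to_bytes(1, 'big')
except OverflowError as exc:
    print('overflow:', exc)
```

With a one-byte width the byte order has no effect on the result, but it matters as soon as the width grows.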

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1
Solution 2
Solution 3    ondra.cifka