Is there an 8-bit code page (CP-####) such that every byte is defined?

I am currently trying to write code for a data compression algorithm. My goal is to read a file, convert it to its binary equivalent, and store the result in an output text file.

For example: Let's say my input is "Hello World",

then what I want in my output is "0100100001100101011011000110110001101111001000000101011101101111011100100110110001100100".

This code does this successfully for most messages. However, I have a file with a necessary byte of information that is "10000001". In CP-1252, this maps to an undefined character.

Is there an 8-bit code page that I can use in Line 9, Character 51 of my code that uses all 256 possible codes to avoid this error?

from datetime import datetime
now = datetime.now()

#makes a file with a unique timestamp in the name
stamp = now.strftime("%H%M%S")
f = open("requests" + str(stamp) + ".txt", "w")

#opens desired input file in read mode
text_file = open("videointxt", mode="r", encoding=None)
                                                  ^
#read whole file to a string
data = text_file.read()

#takes the message, converts it to its binary equivalent, and then writes it to the output file
res = ''.join(format(ord(i), '08b') for i in data)
print(res, file=f)


f.close()


Solution 1:[1]

Your problem isn't with the code page per se. Your problem is that you're trying to interpret binary information as glyphs. In Python 3.x, strings are Unicode. For Python to read data in as a string, it needs to know the mapping of values to glyphs, i.e. the encoding. If you just want to load raw data without interpreting it, you probably want to perform a byte read and skip passing an encoding.

- text_file = open("videointxt", mode="r", encoding=None)
+ text_file = open("videointxt", mode="rb")

The data returned by text_file.read() will now be of type bytes instead of type str, so you'll need to change how you interact with it (for example, iterating over bytes yields integers, so ord() is no longer needed). This does mean that you don't need to deal with encodings until you decide how you want to display the information on screen.
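
A minimal sketch of the byte-oriented read described above. The helper name bytes_to_bits and the temporary file standing in for the real input are assumptions made for illustration, not part of the original code:

```python
import os
import tempfile


def bytes_to_bits(data: bytes) -> str:
    """Return the '0'/'1' text form of raw bytes (8 characters per byte)."""
    # Iterating over bytes yields ints in range(256), so ord() is not needed.
    return ''.join(format(b, '08b') for b in data)


# Stand-in input file (hypothetical): it contains 0x81, the byte that
# CP-1252 leaves undefined.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'Hi\x81')
    path = tmp.name

# 'rb' reads raw bytes; no decoding happens, so no byte can be "undefined".
with open(path, 'rb') as binary_file:
    data = binary_file.read()
os.remove(path)

bits = bytes_to_bits(data)
print(bits)  # 010010000110100110000001
```

Because no decoding step is involved, the same code should handle any byte value, including 10000001.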

Personally I'd suggest displaying output in hex instead of binary.
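
As a rough illustration of the hex suggestion (the sample bytes are made up for the example), the built-in bytes.hex() and bytes.fromhex() methods convert in both directions:

```python
data = b'Hello\x81'

# bytes.hex() emits two hex digits per byte instead of eight binary digits.
hex_text = data.hex()
print(hex_text)  # 48656c6c6f81

# bytes.fromhex() reverses the conversion.
restored = bytes.fromhex(hex_text)
assert restored == data
```

Hex text is a quarter of the length of the equivalent '0'/'1' text, which usually makes it easier to read and compare.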


Please note that text data on disk is already in a binary format. There are no storage gains to be had by storing the binary equivalent to disk. Converting characters to binary and writing them to disk is the inverse of what happened when the file was read in the first place.

As @chepner says, your code produces a string of '0' and '1' characters, which will increase the total file size by a factor of 8.

If you're working at a Linux terminal you can see this for yourself:

# my_script.py
with open('tmp', 'rb') as f:
    data = f.read()
    # end='' keeps print() from appending a newline byte to the output
    print(''.join(format(i, '08b') for i in data), end='')
$ echo -n 'a' > tmp
$ xxd -b tmp
00000000: 01100001                                               a
$ python my_script.py > tmp_bits  # separate file: '> tmp' would truncate the input before it is read
$ xxd -b tmp_bits
00000000: 00110000 00110001 00110001 00110000 00110000 00110000  011000
00000006: 00110000 00110001                                      01

Note how the file size has expanded from 1 byte to 8: instead of storing 'a' as 1 byte of data, you're storing 8 instances of the characters '0' and '1'.
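
The same comparison can be sketched in Python without touching the disk. The helper bits_to_bytes is a hypothetical name introduced here to show that packing the '0'/'1' text back into bytes recovers the original data:

```python
def bits_to_bytes(bits: str) -> bytes:
    """Pack a string of '0'/'1' characters back into raw bytes."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    # Each group of 8 characters is parsed as one base-2 integer.
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


original = b'a'
as_text = ''.join(format(b, '08b') for b in original)

print(len(original))                 # 1
print(len(as_text.encode('ascii')))  # 8
assert bits_to_bytes(as_text) == original
```

The text form carries no extra information; it only spends one stored byte on each bit of the original.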

Solution 2:[2]

Yes, there is at least one single-byte character set where every byte value is mapped: Code page 437.

It's a decent choice for visualizing binary data (e.g. in a hex editor). It has no newline character. But note that although it only lists 00 as a control character, many APIs will still interpret bytes in the range 01-1F, as well as 7F, as control characters anyway, so use it with caution.
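
A small check of this claim using Python's built-in cp437 and cp1252 codecs. The variable names are illustrative. Note that Python's cp437 codec is one of the APIs that treats the low byte values as control characters, so 0A decodes to a newline rather than to a glyph:

```python
all_bytes = bytes(range(256))

# Python's cp437 codec decodes every byte value to a distinct character,
# so the conversion can be reversed without loss.
text = all_bytes.decode('cp437')
assert len(text) == 256 and len(set(text)) == 256
assert text.encode('cp437') == all_bytes

# In cp437, 0x81 is U+00FC (u with diaeresis).
assert b'\x81'.decode('cp437') == '\u00fc'

# By contrast, Python's cp1252 codec rejects 0x81.
try:
    b'\x81'.decode('cp1252')
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False
print(cp1252_ok)  # False

# The low range still decodes to control characters in this codec.
print(repr(b'\n'.decode('cp437')))  # '\n'
```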

I would not use it as a storage format, though; Base64 is much more appropriate for safely storing binary data in a string.
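
A brief sketch of the Base64 route using the standard library's base64 module (the sample bytes are invented for the example):

```python
import base64

raw = b'Hello\x81World'

# b64encode returns ASCII-only bytes; decode them to get a plain str.
encoded = base64.b64encode(raw).decode('ascii')
print(encoded)

# b64decode recovers the exact original bytes, including 0x81.
decoded = base64.b64decode(encoded)
assert decoded == raw

# Base64 uses 4 characters per 3 input bytes (with padding).
print(len(raw), len(encoded))  # 11 16
```

The output uses only printable ASCII characters, and the size overhead is roughly one third rather than the factor of 8 incurred by '0'/'1' text.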

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2