Monday 11 September 2023

Python and bytes

I've been working with binary data in Python recently, which is something I only do occasionally. I thought it would be useful to put some notes here for future reference.

Python provides two built-in types for dealing with binary data: bytes and bytearray. bytes objects are immutable, while bytearray objects are mutable. Both of them are iterable. I've been working mainly with bytes objects, so that's what I'll be focusing on in this post.
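
To make the mutability difference concrete, here is a minimal sketch (the variable names are just illustrative):

```python
# bytes is immutable: item assignment raises TypeError
data = bytes([65, 67])
try:
    data[0] = 66
    assignment_failed = False
except TypeError:
    assignment_failed = True
print(assignment_failed)  # True

# bytearray is mutable: the same assignment works in place
buffer = bytearray(data)
buffer[0] = 66
print(buffer)  # bytearray(b'BC')

# a bytearray can be converted back into an immutable bytes object
frozen = bytes(buffer)
print(frozen)  # b'BC'
```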

To read a binary file, add a "b" to the mode in your normal open call (without the encoding parameter, which makes no sense for binary data). Reading from the returned file object then gives you a bytes object (and notice that you can use the len() function on a bytes object):


file_path = "data.bin"  # hypothetical path: point it at any binary file of at least 8 bytes
with open(file_path, "rb") as fr:
    first_bytes = fr.read(8)
print(type(first_bytes).__name__) # bytes (immutable and iterable)
print(f"len(first_bytes): {len(first_bytes)}") # 8
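
If you don't have a binary file at hand, the following sketch creates a temporary one first, so it can be run as is. The file contents and the names sample_path, header and rest are assumptions made up for the example:

```python
import os
import tempfile

# write 16 known bytes (0..15) to a temporary file so there is something to read
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(16)))
    sample_path = tmp.name

try:
    with open(sample_path, "rb") as fr:
        header = fr.read(8)  # reads at most 8 bytes
        rest = fr.read()     # reads everything that is left
finally:
    os.remove(sample_path)

print(len(header))            # 8
print(len(rest))              # 8
print(header[0], header[-1])  # 0 7
```

Note that read(n) returns at most n bytes, so near the end of a file you can get fewer than you asked for.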


As I've said, a bytes object is iterable and immutable. Iterating over it yields an int for each byte (with a value from 0 to 255, the decimal value of that byte). So if you want to filter or modify some of its values, you can iterate over it, build a list of ints, and then create a new bytes object from that list. Notice also that equality comparison works as expected: two bytes objects holding the same values compare equal:



bytes_list: list[int] = list(first_bytes)
print(len(bytes_list))
first = bytes_list[0]
print(type(first).__name__) # int

# convert back to bytes
bytes2 = bytes(bytes_list)
print(f"equal: {first_bytes == bytes2}") #True
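
As a small sketch of the filtering idea described above (the payload value is invented for the example), a generator expression can be passed straight to the bytes constructor, without building the intermediate list. It also shows a detail that is easy to trip over: indexing a bytes object gives an int, while slicing gives another bytes object.

```python
payload = bytes([0, 65, 66, 0, 67])

# drop the zero bytes: iterating yields ints, and bytes() accepts any iterable of ints
filtered = bytes(b for b in payload if b != 0)
print(filtered)  # b'ABC'

# indexing returns an int, slicing returns a bytes object
print(payload[1])    # 65
print(payload[1:2])  # b'A'
```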


When we want to create a bytes object from our own hardcoded values rather than by reading from a file, we have two options: a bytes literal or the bytes constructor. Let's see.

bytes literal. We can write it with hexadecimal escapes or with ASCII characters, like this (when comparing this sample with the next one, bear in mind that hexadecimal 0x41 and 0x43 correspond to decimal 65 and 67, and to the ASCII characters A and C):


b1 = b"\x41\x43"
b2 = b"AC" # I can only use "ascii characters"
b1 == b2 # True

# and notice this syntax error:
# bytes can only contain ASCII literal characters
#b7 = b"á"
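
Python prints printable ASCII bytes as characters no matter how the literal was written, which can hide the actual values. A minimal sketch using the built-in bytes.hex() and bytes.fromhex() methods to inspect them, and an escape to get a non-ASCII byte into a literal (0xE1 is "á" in Latin-1):

```python
b1 = b"\x41\x43"
print(b1)        # b'AC'
print(b1.hex())  # 4143
print(list(b1))  # [65, 67]

# build the same value from a hex string
same = bytes.fromhex("4143")
print(same == b1)  # True

# a non-ASCII byte has to be written as an escape in a bytes literal
accented = b"\xe1"
print(accented == "á".encode("latin-1"))  # True
```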

Using the bytes constructor with a collection of integers (written in either hexadecimal or decimal form):


b3 = bytes([0x41, 0x43]) # ints can be written as hexadecimal literals
b4 = bytes([65, 67])
b3 == b4 # True

Using the bytes constructor with a string, providing the encoding that we want to use. Depending on that encoding, each character takes one or more bytes:


b5 = bytes("AC", encoding="ISO8859-1") # Latin-1 (often loosely called "ANSI")
b6 = bytes("AC", encoding="UTF-8")
print(f"len(b5): {len(b5)}") # 2
# the UTF-8 representation of a "basic" (ASCII) character is just one byte
print(f"len(b6): {len(b6)}") # 2
# but non-ASCII characters like these take 2 bytes each
b8 = bytes("áé", encoding="UTF-8")
print(f"len as UTF-8: {len(b8)}") # 4
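
The same conversion is available as str.encode(), with bytes.decode() going in the opposite direction. A short sketch comparing both encodings on the same text and showing what happens when the wrong encoding is used to decode:

```python
text = "áé"

as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")
print(len(as_latin1))  # 2
print(len(as_utf8))    # 4

# str.encode() and the bytes constructor produce the same result
print(as_utf8 == bytes(text, encoding="utf-8"))  # True

# decoding with the same encoding gives the original text back
round_trip = as_utf8.decode("utf-8")
print(round_trip == text)  # True

# decoding with the wrong encoding does not fail here, it silently yields different text
wrong = as_utf8.decode("latin-1")
print(len(wrong))     # 4
print(wrong == text)  # False
```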

And now a few extra things to be careful about:

If we use a "string of numbers" in a bytes literal, each byte holds the ASCII code of the corresponding digit character, not the numeric value of that digit:


# this is False: b"5" is the byte for the ASCII character "5" (value 53), not for the number 5
print("bytes([5]) == b'5': " + str(bytes([5]) == b'5'))

print("bytes([5]) == b'\\x05': " + str(bytes([5]) == b'\x05')) # True
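
To see the actual values involved, here is a minimal sketch using ord(), int() and int.to_bytes(), all built in:

```python
digit_char = b"5"

# the byte stored for the character "5" is its ASCII code
print(digit_char[0])  # 53
print(ord("5"))       # 53

# int() accepts bytes made of ASCII digits and parses them as a number
print(int(digit_char))  # 5

# going from the number 5 to a single byte with that value
five = (5).to_bytes(1, "big")
print(five)                # b'\x05'
print(five == bytes([5]))  # True
```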

When using the bytes constructor with a collection of numbers, each number must be a value between 0 and 255; otherwise a ValueError is raised:


try:
    b1 = bytes([400])
except ValueError as ex:
    print(f"exception: {ex}")
    # exception: bytes must be in range(0, 256)
