It's clear that I have a fascination with the interaction between interpreters and JITs: optimizations, deoptimizations and so on. I've written multiple posts about that, with this one being the most recent. I recently came across this interesting thesis, whose introduction mentions:
For both variants, we can quicken certain operations. Quickening entails replacing one operation with another, which usually handles a subset of possible values or a more restrictive set of preconditions but performs the operation quicker. In tree-walk interpreters, this is performed by node replacement, and in bytecode-based interpreters, by changing the instruction stream, replacing one instruction with another. In both cases, the target of the dispatch changes. How this is implemented in Operation DSL is detailed in section 3.7.
The "replacing one instruction with another" suddenly reminded me of something I had read some months ago regarding Python performance improvements but that I had forgotten to dive into and had indeed forgotten. I'm talking about Adaptive Specializing Interpreter, that is something pretty surprising to me. I've talked in my previous posts about interpreters that find hotspots in your code and send those hot methods to a fast JIT compiler to turn them into native code. Then that native code continues to be monitored and if it's hot enough it's sent to a more aggressive JIT compiler that spends more time in producing a more optimized native code. Then we have cases where the code has to be deoptimized, returning the function to its interpreted form, and the cycle starts again. But the idea of monitoring the code to find specific bytecode instructions (opcodes) that can be replaced by an optimized (specialized/quickened) version of that bytecode instruction is something that was pretty new to me
As of today (Python 3.13) the main Python implementation, CPython, does not use a JIT compiler by default (3.13 ships an experimental copy-and-patch JIT, but it's off by default and has to be enabled when building the interpreter). CPython just uses a bytecode interpreter (historical note: the move from the initial tree-walk interpreter to a bytecode interpreter happened between Python 0.9.x and Python 1.0, likely around 1992–1993 during the pre-1.0 development phase). Python 3.11 implemented PEP 659 - Specializing Adaptive Interpreter as part of the Faster CPython project. It introduced bytecode instructions that are generic (for example the one for adding 2 items, BINARY_OP (+)) and that, if a constant execution pattern is found, get replaced (specialized/quickened) by a specialized version (BINARY_OP_ADD_FLOAT, BINARY_OP_ADD_INT, BINARY_OP_ADD_UNICODE). If that pattern changes, the instruction is reverted to the initial, generic bytecode. This discussion has some interesting information.
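Just out of curiosity, the mapping from generic opcodes to their specialized forms can be peeked at through the private opcode._specializations table (present at least in CPython 3.12; it's an undocumented internal, so it may move or change between versions):

import opcode

# undocumented internal table, also used by dis; it maps a generic
# opcode name to the names of its specialized variants
print(opcode._specializations["BINARY_OP"])
# ['BINARY_OP_ADD_FLOAT', 'BINARY_OP_ADD_INT', 'BINARY_OP_ADD_UNICODE', ...]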
You probably know (as I mention in this post) that Python compiles functions to code objects: each function object points to a code object via its __code__ attribute, and a code object has a co_code attribute pointing to a bytes object containing the bytecodes.
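For example:

def add(a, b):
    return a + b

code_obj = add.__code__           # the function's code object
print(type(code_obj))             # <class 'code'>
print(type(code_obj.co_code))     # <class 'bytes'>, hence immutable
print(code_obj.co_code.hex(" "))  # the raw bytecode bytes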
bytes objects are immutable, so the bytecode specialization has to happen in another structure. I could not find much information about this, so ChatGPT came to the rescue. So yes, there's an additional structure that contains a mutable copy of the bytecodes. It's from that structure that the interpreter reads the bytecodes to execute for that given function, applying adaptations/specializations/quickening as it sees fit.
- The co_code itself remains immutable and is not rewritten at runtime. It continues to contain the canonical, "baseline" bytecode sequence as emitted by the compiler.
- When a code object is created, CPython sets up an internal mutable copy of the instructions inside the code object itself (the region the C source calls co_code_adaptive, not exposed as a public attribute), and that copy is what the interpreter actually executes.
- That runtime bytecode buffer is where the interpreter patches in "quickened" instructions (specialized opcodes). For example, a generic BINARY_OP might be replaced at runtime with BINARY_OP_ADD_INT if it sees enough hot integer additions.
That mutable copy of the bytecodes seems to be referenced from the code object via a private _co_code_adaptive attribute (but this is an internal, undocumented detail that can change from version to version). Python allows us to very easily check the bytecodes of a given function by using the standard dis module: dis.dis(my_function). By default dis.dis shows the immutable bytecodes in co_code, but since Python 3.12 we can pass adaptive=True to see the adapted/quickened instructions. This is pretty amazing, because we can so easily see how a function's bytecodes evolve over time!
import dis
def f(a, b):
    return a + b
print("before warming up")
dis.dis(f, adaptive=True) # Only in 3.12+
# BINARY_OP 0 (+)
# Warm it up
for _ in range(10_000):
    f(1, 2)
# Disassemble with quickening shown
print("after warming up with ints")
dis.dis(f, adaptive=True)
# BINARY_OP_ADD_INT 0 (+)
# now let's try to break the quickening by passing strings rather than ints
print("first call with strings")
f("a", "b")
dis.dis(f, adaptive=True)
# it's still quickened
# BINARY_OP_ADD_INT 0 (+)
print("second call with strings")
f("c", "b")
dis.dis(f, adaptive=True)
# it's still quickened
# BINARY_OP_ADD_INT 0 (+)
print("Warm it up again, this time with strings")
for _ in range(10_000):
    f("a", "b")
print("after warming up again")
dis.dis(f, adaptive=True)
# BINARY_OP_ADD_UNICODE 0 (+)
So initially the bytecode for an addition of 2 values uses the BINARY_OP opcode (Python is a dynamic language where a and b could be of any type, so BINARY_OP is a generic (and slower) instruction that can add values of any type). Then we do a good bunch of additions, all of them with int values, so the interpreter decides to specialize the generic addition into the fast BINARY_OP_ADD_INT opcode. After that we do a couple of invocations using strings rather than ints. The specialized opcode checks whether the operands have the expected types (here, two ints); as they don't, it falls back to the generic implementation of the operation (the slow path), but for the moment it still keeps the specialized opcode. It takes note of these divergences so that, if they continue, it will revert the specialization. In my tests I have not managed to pin down the number of failed executions that makes the interpreter revert the specialization; what we can see is that after a good bunch of executions using strings, the int specialization is replaced by a string specialization (BINARY_OP_ADD_UNICODE).
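We can also peek at the two buffers directly, using the private _co_code_adaptive attribute mentioned before. Remember this is an undocumented internal (present in CPython 3.12 and 3.13), so it may change or disappear in other versions:

def g(a, b):
    return a + b

baseline = g.__code__.co_code
for _ in range(10_000):
    g(1, 2)

# co_code still returns the canonical, unspecialized bytecode
print(g.__code__.co_code == baseline)            # True
# while the private adaptive copy has diverged after specialization
print(g.__code__._co_code_adaptive == baseline)  # False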
In a previous post I mentioned some crazy Python projects that manipulate functions by creating a new code object (an instance of types.CodeType) based on the original one, with a modified version of its bytecodes (adding extra instructions, whatever), and assigning it to the function. How does this play with the adaptive version of the code? Well, thanks to ChatGPT we learn that the adaptation process starts again:
- The quickened bytecode (co_code_adaptive) is built lazily, the first time the interpreter executes a CodeType.
- It is not stored in the immutable co_code bytes; it is a separate, per-code-object buffer initialized from the original instructions.
- If you assign a different code object to a function (func.__code__ = new_code), that's a new CodeType with its own co_code_adaptive buffer, starting again from the baseline opcodes.
- Therefore, execution will start again with baseline opcodes and caches, and the specializing interpreter will re-warm and re-specialize.
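An easy way to watch this re-warming without hand-crafting bytecodes: CodeType.replace() with no arguments already returns a fresh copy of the code object (replace() is standard and documented; the reset of the adaptive state is just a consequence of getting a brand new code object):

import dis

def f(a, b):
    return a + b

for _ in range(10_000):
    f(1, 2)
dis.dis(f, adaptive=True)   # shows BINARY_OP_ADD_INT

# assign a fresh (here, otherwise identical) code object to the function;
# its adaptive instructions start again from the baseline opcodes
f.__code__ = f.__code__.replace()
dis.dis(f, adaptive=True)   # back to the generic BINARY_OP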