Tuesday, 28 October 2025

SqlAlchemy Registry

After talking about Persistence Ignorance and mapping styles in SqlAlchemy in my previous post, it's time now to take a look at an interesting technique used by the Declarative mapping. Whatever mapping style you use, SqlAlchemy relies on a registry that stores the information mapping entities to tables, properties to columns, relations, and so on. When using the Imperative mapping you work directly with that registry (sqlalchemy.orm.registry):


import dataclasses

from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.orm import registry


@dataclasses.dataclass
class Post:
    title: str
    content: str


metadata = MetaData()
mapper_registry = registry(metadata=metadata)

# the table the entity gets mapped to (same columns as in the declarative example below)
table_post = Table(
    "Posts",
    metadata,
    Column("PostId", Integer, primary_key=True, autoincrement=True),
    Column("Title", String),
    Column("Content", String),
)

mapper_registry.map_imperatively(
    Post,
    table_post,
    properties={
        # "post_id": table_post.c.PostId,
        "title": table_post.c.Title,
        "content": table_post.c.Content,
    },
)

But when using the declarative mapping you're usually not aware of that registry, as you normally don't interact with it at all. Notice, though, that you still have access to it through the Base class:


from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Post(Base):
    __tablename__ = "Posts"
    post_id: Mapped[int] = mapped_column("PostId", primary_key=True, autoincrement=True)
    title: Mapped[str] = mapped_column("Title")
    content: Mapped[str] = mapped_column("Content")


# we have not directly used the registry at all in the above code, but it's still there, accessible through the Base class:
print(f"{Base.registry=}")
# Base.registry=<sqlalchemy.orm.decl_api.registry object at 0x...>
    

So how does the registry get set? Your entities get registered in that registry by leveraging Python's inheritance and metaclass machinery, which gives a behaviour similar to the Ruby inherited hook. Remember that I already talked in a previous post about simulating another Ruby metaprogramming hook, the method_added hook, by means of metaclasses. We can use a metaclass to execute some action each time a class based on that metaclass is created (by putting that code in the __new__ or __init__ methods of the metaclass). In our case, we want that code to add each model class to the registry. For that, the Base class that we define for our entities must have DeclarativeMeta as its metaclass. We can do this by setting the metaclass and the registry instance ourselves:


from sqlalchemy.orm import DeclarativeMeta, registry

mapper_registry = registry()

class Base(metaclass=DeclarativeMeta):
    __abstract__ = True  # don't try to map Base itself
    registry = mapper_registry
    metadata = mapper_registry.metadata


Or by inheriting from DeclarativeBase, which already brings in the equivalent metaclass machinery (DeclarativeAttributeIntercept, the base of DeclarativeMeta) and also takes care of creating a registry and setting it on our Base class.



from sqlalchemy.orm import DeclarativeBase

class Base(DeclarativeBase):
    pass
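
Before looking at SqlAlchemy's own code, here is a toy sketch (not SqlAlchemy code, just the general pattern) of a metaclass whose __init__ runs every time a class using it is created, mimicking the Ruby inherited hook by adding each new class to a registry:


REGISTRY = []

class RegisteringMeta(type):
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        if bases:  # skip the base class itself
            REGISTRY.append(cls)

class Entity(metaclass=RegisteringMeta):
    pass

class Article(Entity):
    pass

print(REGISTRY)  # [<class '__main__.Article'>]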


We can take a look at the DeclarativeMeta code to see how it works its magic:


class DeclarativeMeta(DeclarativeAttributeIntercept):
    metadata: MetaData
    registry: RegistryType

    def __init__(
        cls, classname: Any, bases: Any, dict_: Any, **kw: Any
    ) -> None:
        # use cls.__dict__, which can be modified by an
        # __init_subclass__() method (#7900)
        dict_ = cls.__dict__

        # early-consume registry from the initial declarative base,
        # assign privately to not conflict with subclass attributes named
        # "registry"
        reg = getattr(cls, "_sa_registry", None)
        if reg is None:
            reg = dict_.get("registry", None)
            if not isinstance(reg, registry):
                raise exc.InvalidRequestError(
                    "Declarative base class has no 'registry' attribute, "
                    "or registry is not a sqlalchemy.orm.registry() object"
                )
            else:
                cls._sa_registry = reg

        if not cls.__dict__.get("__abstract__", False):
            _ORMClassConfigurator._as_declarative(reg, cls, dict_)
        type.__init__(cls, classname, bases, dict_)


I'm involved in some projects where the Database is not a critical element. We don't retrieve data from it, we just use it as an additional storage for our results, while the main store for those results are json/csv files. This means that if the Database is down, the application should run anyway. So it's important for me to be clear about which operations involve database access (and hence fail if the DB is not reachable), and at which point the model mapping will throw an error if the mapping is incorrect. Let's see (there's a small sketch after the list):

  • Adding classes to the registry (either explicitly with the imperative mapping or implicitly with the declarative one) does not perform any check against the database (so if the DB is down or there's something wrong in our mapping, like wrong tables or columns, we won't find out until later).
  • Creating a SqlAlchemy engine does not connect to the DB either.
  • Creating a Session does not connect to the Database by itself either; a connection is acquired lazily when the Session first has work to do, and no model verification is performed at that point.
  • Adding objects to a Session won't check the model until you do a flush or a commit (which indirectly performs a flush).
  • Performing a select query through a Session will obviously generate an error if any of the mappings for the tables involved in the query is wrong.
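
A minimal sketch of this laziness, reusing the declarative Post model from above and a hypothetical PostgreSQL URL; nothing touches the database until the commit:


from sqlalchemy import create_engine
from sqlalchemy.orm import Session

# no connection attempt here, create_engine is lazy
engine = create_engine("postgresql+psycopg2://user:pwd@localhost/blog")  # hypothetical URL

with Session(engine) as session:
    session.add(Post(title="hello", content="..."))  # still no DB round-trip
    session.commit()  # the connection is acquired here; DB or mapping errors surface now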



Friday, 24 October 2025

Persistence Ignorance

I've used SqlAlchemy in some projects (basic use, projects where the Database is just one of multiple datasources), and until recently I've been sticking to the Imperative mapping style. I grew up as a developer with Persistence Ignorance (PI) as a guiding principle (keep your domain model free from infrastructure concerns like database access, so it remains clean, testable, and focused on business logic), so that was the natural thing to me, and I was really surprised to see that SqlAlchemy recommends using the Declarative mapping style, where the entities are totally aware of the specific persistence mechanism.

.Net Entity Framework and NHibernate do a good job of allowing us to have entities that are "almost" persistence ignorant. I say "almost" cause if you check this list of things that go against Persistence Ignorance, you'll recognize some Entity Framework requirements, like the parameterless constructor and using virtual properties for lazy-loaded relations. You can have all the additional constructors that make sense for your entities, EF just needs the parameterless one, as it initializes your entities by calling it and then setting properties as needed. As for the virtual properties, EF implements lazy loading by means of proxy classes. If you have a Country entity with a lazy-loaded navigation property cities, EF will create a proxy class that inherits from Country and overrides the cities property, implementing the lazy-loading logic there.

Using the imperative mapping in SqlAlchemy gives you even more freedom. Your entities can have any constructor, as SqlAlchemy leverages Python's __new__ and __init__ separation so that it does not invoke __init__ when loading entities, but sets their attributes one by one. Then, the dynamic nature of the language means that you don't have to mark in any special way the properties corresponding to lazy-loaded relationships, and SqlAlchemy does not need to resort to proxy classes to implement lazy loading, as it leverages Python's dynamism and attribute lookup logic. I think that for each lazy relation in an entity a descriptor is added to the class. When you first access the corresponding attribute the lookup reaches the descriptor, which performs the corresponding query and sets the result in an attribute of the instance, so that the next time you access the relation the values are retrieved from the instance. I guess this is more or less related to what I discuss here.
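
Here is a toy sketch of that descriptor idea (it is NOT SqlAlchemy's real implementation, whose attribute instrumentation is far more involved): a non-data descriptor that runs a loader on first access and caches the result on the instance, so later lookups never reach the descriptor:


class LazyRelation:
    def __init__(self, name, loader):
        self.name = name      # attribute name to cache under on the instance
        self.loader = loader  # callable standing in for the real query

    def __get__(self, instance, owner):
        if instance is None:
            return self
        value = self.loader(instance)         # "run the query" on first access
        instance.__dict__[self.name] = value  # cache on the instance
        return value


class Country:
    cities = LazyRelation("cities", lambda self: ["Paris", "Lyon"])  # fake query result


country = Country()
print(country.cities)  # first access goes through the descriptor
print(country.cities)  # now served directly from the instance __dict__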

All this said, we should also note that (as explained here) there's still some "persistence leakage" into your entities when using the Imperative mapping. While you define your entity classes fully unaware of the persistence, SqlAlchemy (when adding them to the registry) makes them aware of the persistence mechanism by adding different attributes at the class level and at the instance level, for example attributes like _sa_instance_state or _sa_lazy_loader (these are part of SQLAlchemy’s internal machinery to track state and identity, manage lazy loading and relationship resolution, and hook into attribute access dynamically). So your entities get bloated with extra attributes that you don't use yourself, and if you serialize them to json or whatever, they'll show up.

In the end I've ended up having separate Model entities (that use the declarative mapping) and Domain entities (that know nothing about the database), plus mapper classes/functions that map Model entities to Domain entities and vice versa. This gives you almost full PI. I say almost cause you still end up with table IDs leaking into your Domain entities, but this is a more than acceptable compromise. Anyway, you could still get rid of it by declaring your Domain entities without the ID (a Country class) and declaring additional child entities (a CountryIdAware class) that incorporate the ID. Your Model to Domain mappers will indeed create CountryIdAware instances that will be passed to your Domain, but the Domain will be aware of them just as Country instances, it won't see the ID attribute.
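
A minimal sketch of that idea, assuming a hypothetical CountryModel declarative class with name and country_id columns:


import dataclasses


@dataclasses.dataclass
class Country:
    """Domain entity: knows nothing about the database."""
    name: str


@dataclasses.dataclass
class CountryIdAware(Country):
    """Child entity carrying the table ID, used only by the mappers."""
    country_id: int = 0


def model_to_domain(model) -> Country:
    # the Domain code will just see a Country
    return CountryIdAware(name=model.name, country_id=model.country_id)


def domain_to_model(country: Country, model_cls):
    model = model_cls(name=country.name)
    if isinstance(country, CountryIdAware):
        model.country_id = country.country_id
    return model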

Sunday, 12 October 2025

Truffle Bytecode DSL

I have a fascination with Graal and the Truffle interpreters framework, though it's all from a theoretical standpoint, as I've never built an interpreter myself. The thing is that recently I've found out about a new addition to Truffle, the Bytecode DSL. This means that Truffle now supports the two main types of interpreters: AST (tree-walking) interpreters and bytecode interpreters.

I found this a bit odd at first, as it was not clear to me how to reconcile it with what I understood as the main super-powers of Truffle. The "traditional" approach in Truffle is writing AST (tree-walking) interpreters. Summarizing what I explain in some of my previous posts, the nodes in this tree correspond to Java methods that the interpreter invokes. These nodes can get specialized to more specific nodes thanks to profiling, and then, when a guest-language method is hot, the Java bytecodes for the nodes making up that method are sent to Graal to be compiled to native code (this is the Partial Evaluation part). The equivalent to specializing the AST nodes also exists in the bytecode case: those bytecodes can be specialized/quickened in a way very similar to what the Python Adaptive Specializing Interpreter does. But for the compilation part, if with the Bytecode DSL we no longer have a tree made up of nodes, are we missing the Partial Evaluation magic?

No, we are not missing anything. For each guest-language bytecode of the program we are executing we'll have a Java method that executes it. When a guest-language method is hot, the Java methods for the bytecodes making up that method are sent to Graal for compilation, which is the same thing we do with AST nodes.

Confirming this with a GPT has given me a better understanding of Partial Evaluation and its optimizations. When a method is hot and its nodes (or guest-language bytecodes) are sent for compilation, Truffle can decide to send only part of the method, not the whole method (path-specific compilation). When Truffle does partial evaluation, it traces the actual execution path that was taken during profiling. This means that if we have an if-else and the profiling shows that the condition is always true, it will only send the "if" branch for compilation. Of course it adds guards, so that if the assumption becomes false it can deoptimize the code (transfer back to the interpreter).

There's an additional element in how Truffle achieves such excellent performance: inlining (both for AST and bytecode interpreters). When Truffle sends the Java methods for the nodes or bytecodes of a method (or of part of a method, based on the optimizations above) for compilation, it will also send the methods called from that method and inline them in the generated native code.

A common taxonomy of JIT compilers is Method-based (Graal, HotSpot, .Net...) vs Tracing JITs (LuaJIT, older TraceMonkey).

Method-Based JITs (like Graal/Truffle, HotSpot, .NET)

  • Compilation unit: entire method
  • When a method becomes hot, compile it
  • Can still do path specialization within that method
  • Inlining: pulls called methods into the caller during compilation

Tracing JITs (like LuaJIT, older TraceMonkey)

  • Compilation unit: hot trace across multiple methods
  • Traces execution through method calls, loops, returns
  • The "trace" might start in methodA(), call into methodB(), and return - all one compiled unit
  • More aggressive cross-method optimization

The interesting thing is that while Graal is primarily method-based, with very aggressive inlining it can achieve trace-like behavior.

Pure tracing JITs can cross method boundaries more naturally, but modern method-based JITs like Graal blur this distinction through aggressive inlining. The end result can be quite similar, just with different conceptual models!

Example:

Tracing JIT: a hot loop is detected spanning multiple methods

  • Record exact execution path
  • Compile: loop_header → methodA → methodB → loop_back
  • One flat piece of native code

Graal: methodA is hot

  • Compile methodA
  • Inline methodB call
  • Inline methodC call
  • Result looks similar but structured around methodA

Saturday, 4 October 2025

Python venvs

In the past I used to install Python modules globally, but for quite a while now I've been careful to use a separate virtual environment (venv) for each of my projects. I guess anyone doing any non-basic Python work will be familiar with venvs, so I'm not going to explain here what a venv is, but rather share some information that, though pretty simple, feels useful and interesting to me.

We create a new virtual environment with: python -m venv .venv. Notice that .venv is just the name of the folder where the venv will be created; we can use any name, but .venv (or env, or venv) is a sort of convention. Then we activate the venv with: source .venv/bin/activate (or .venv\Scripts\activate on Windows).

The python version that we use when creating the virtual environment is the one the venv will use. That means that if we have several python versions (the system one, let's say 3.12, and several altinstalls, let's say 3.11 and 3.13) and we create the venv using 3.13 (python3.13 -m venv .venv), when we activate the venv it will use the python3.13 altinstall, regardless of whether we type python, python3 or python3.13.

That's so because inside the venv (.venv/bin) we have these symlinks:
python -> python3.13
python3 -> python3.13
python3.13 -> /usr/local/bin/python3.13

If we want to launch a python script in a certain venv (I mean in one go, not the typical thing of opening a terminal, activating the venv in that terminal and then launching the python script), we can just put this in a launcher.sh script:
source /path/to/.venv/bin/activate && python /path/to/script.py
This will activate the venv in the bash process that runs the script, and hence the python invocation will use the python pointed to by the venv.

There's a more direct approach that I was not aware of until recently. We don't need to activate the venv, we can just type this:
/path/to/.venv/bin/python /path/to/script.py

All this works because the venv mechanism is implemented by python itself, it's not a third-party addition. When we activate a venv with source .venv/bin/activate, what mainly happens is that the path to .venv/bin gets prepended to our PATH variable, that's all. That way we'll reach those symlinks we've seen, which point to the python installation used during the venv creation. So if in the end we're just running that global python installation, how does it find the packages locally installed in .venv/lib/python3.13/site-packages?

Well, that's because when it starts, python checks whether a pyvenv.cfg file exists in a path relative to the path used for launching python (so in this case the path to that symlink). I guess it gets the path used for launching it by checking argv[0]. If that file (.venv/pyvenv.cfg) exists, python will use it as follows (there's a small snippet after this list to see it in action):

  • It adjusts sys.path to point to the venv’s lib/python3.13/site-packages
  • It sets sys.prefix and sys.exec_prefix to the venv directory
  • It avoids loading global site-packages (unless configured to do so)
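
A quick way to see this in action is to run the venv's interpreter directly and inspect sys (the paths in the comments are just illustrative):


# run with /path/to/.venv/bin/python
import sys

print(sys.prefix)       # the venv directory, e.g. /path/to/.venv
print(sys.base_prefix)  # the installation used to create the venv, e.g. /usr/local
print(sys.executable)   # /path/to/.venv/bin/python
print([p for p in sys.path if "site-packages" in p])
# ['/path/to/.venv/lib/python3.13/site-packages']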

With regards to installing packages with pip in a venv, we have to notice that pip is both a small bootstrap python script and a python module. When we create a venv, 3 pip scripts are created in .venv/bin:


pip
pip3
pip3.13

Each of them is a python script with a shebang pointing to the python version used during the venv creation. They look like this:


$ cd .venv/bin 
$ more pip
#!/myProjects/my_app/.venv/bin/python3.13
# -*- coding: utf-8 -*-
import re
import sys
from pip._internal.cli.main import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

And a pip module is installed inside the site-packages of that venv (e.g. .venv/lib/python3.13/site-packages/pip). So when we run any of those pip scripts in the venv, they load the python version that was used when creating the venv, and that python version will see the pyvenv.cfg file, prepend the venv's site-packages to sys.path, and that way load the pip module from the .venv site-packages.
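
You can confirm which pip module ends up being used with a couple of lines like these (the path in the comment is just illustrative):


# run with /path/to/.venv/bin/python
import pip
print(pip.__file__)  # /path/to/.venv/lib/python3.13/site-packages/pip/__init__.py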