Saturday 4 August 2012

CLR vs JVM: Loading, unloading...

I'll try to sum up here some things I've been reading and thinking about lately, and to put them in some sort of order.

- Loading Unit

The CLR manages assemblies (.exe's and .dll's) as loading units, while the JVM manages classes (.class files). Pondering a bit about this, it makes a huge difference. An assembly can contain thousands of classes, so loading one can be costly, whereas loading a single JVM class has to be a rather fast process. When an assembly is loaded the CLR has to map the file and read its metadata, and it then creates in-memory structures (EEClass, MethodTable...) for the classes contained there as they are first used, so the total cost can be significant. I assume the JVM uses quite similar memory structures.

- Interfering in the loading process.

In the JVM we can alter how classes are loaded by creating our own ClassLoaders (this article is a pretty good reference). For example, when running Groovy code from the Groovy interpreter, the GroovyClassLoader loads the .groovy file and compiles it on the fly to JVM bytecode.

In the CLR, we don't have the notion of custom assembly loaders, and there is no AssemblyLoader class that you can subclass. You can dynamically load an assembly via Reflection with Assembly.Load but, as with the automatic loading done by the runtime, it does not seem simple to hook your own code into the process. Some Google results seem to indicate that aspect weaving at assembly load time is possible, so there must be some sort of hack to get into the process, but it's far from obvious.

Java class loaders are responsible for both finding the code to be loaded (you could get that code on the fly from the network, for example) and loading it (from the Oracle documentation):

    class NetworkClassLoader extends ClassLoader {
        String host;
        int port;

        public Class findClass(String name) {
            byte[] b = loadClassData(name);
            return defineClass(name, b, 0, b.length);
        }

        private byte[] loadClassData(String name) {
            // load the class data from the connection
            . . .
        }
    }

Indeed, this is not so different in the .Net world, because you could also fetch the bytes representing your assembly from wherever you wanted: you would just create a byte[] and pass it to Assembly.Load. With this in mind, it occurs to me that we could do load-time assembly modification by getting the assembly into memory (I'm not using the verb "load" here, because I'm not talking about a normal Assembly.Load), modifying it with Cecil, saving it to a byte[], and then loading the assembly from that in-memory byte[] using Assembly.Load(byte[]).
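On the JVM side, the "get the bytes from anywhere, then hand them to the runtime" idea can be sketched in a runnable form. This is only a sketch under my own assumptions: InMemoryClassLoader and the class name demo.Generated are hypothetical, and instead of a network connection the bytes are built in memory as a minimal class file (an empty public class extending java.lang.Object). The point where a load-time weaver would rewrite the bytes is marked with a comment.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical loader: the "fetch" step is replaced by bytes built in memory.
class InMemoryClassLoader extends ClassLoader {

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] b = loadClassData(name);
        // A load-time weaver would modify b at this point, before defineClass.
        return defineClass(name, b, 0, b.length);
    }

    // Stand-in for "read the class data from the connection": emits a minimal
    // class file for an empty public class called `name`.
    // Note: this toy loader fabricates ANY class its parent cannot find.
    private byte[] loadClassData(String name) throws ClassNotFoundException {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(0xCAFEBABE);   // magic number
            out.writeShort(0);          // minor version
            out.writeShort(52);         // major version (Java 8 class format)
            out.writeShort(5);          // constant pool count = entries + 1
            out.writeByte(7); out.writeShort(2);                    // #1 Class -> #2
            out.writeByte(1); out.writeUTF(name.replace('.', '/')); // #2 Utf8: this class
            out.writeByte(7); out.writeShort(4);                    // #3 Class -> #4
            out.writeByte(1); out.writeUTF("java/lang/Object");     // #4 Utf8: superclass
            out.writeShort(0x0021);     // access flags: ACC_PUBLIC | ACC_SUPER
            out.writeShort(1);          // this_class  -> #1
            out.writeShort(3);          // super_class -> #3
            out.writeShort(0);          // no interfaces
            out.writeShort(0);          // no fields
            out.writeShort(0);          // no methods
            out.writeShort(0);          // no attributes
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }

    public static void main(String[] args) throws Exception {
        ClassLoader loader = new InMemoryClassLoader();
        // The parent loader cannot find this class, so findClass above is used.
        Class<?> c = loader.loadClass("demo.Generated");
        System.out.println(c.getName());                  // demo.Generated
        System.out.println(c.getClassLoader() == loader); // true
    }
}
```

The generated class has no constructor, so it can be loaded and inspected but not instantiated; a real loader would of course return the bytes of an actual compiled class.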

- When is an Assembly or a Class loaded?

On the Java side I'll talk here only about Oracle's HotSpot VM, as I have no idea how other JVMs work (I'm quite interested, though, in how Oracle ends up integrating HotSpot and JRockit). This said, it's quite important to understand the huge differences between the CLR and HotSpot. On the CLR there is no interpretation step: all code is JITted before its first execution (apart from code that has already been NGen'ed). On the JVM, code is interpreted first, and many methods will stay that way, always interpreted. Nevertheless, those sections of code that are run frequently enough (hot spots) end up being JITted to native code. HotSpot's adaptive optimizations make heavy use of inlining, and it goes one step further by replacing already compiled code with a new optimized version (it can, for example, revert inlined code if conditions change). This is an interesting, quick read.

What I've described above clearly influences when the CLR or the JVM load a "code unit" (assembly or class).

In .Net, an assembly is loaded the first time that a method referencing classes in that assembly is JITted. JITting happens before the method runs, so the runtime does not know whether the instructions in that method that need the assembly will ever really be executed (they could be inside an if that never happens to be true...). When developing code with critical performance requirements we should keep this in mind, because a minor modification to our code can save us an unnecessary assembly load.

With the HotSpot JVM, due to the initial interpretation, class loading does not happen at the method boundary but at the instruction boundary. I mean, a class is not loaded until the first instruction that needs that class is interpreted.
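A minimal JVM-side sketch of this behaviour, with hypothetical names (LazyLoadingDemo, Heavy). One caveat I should state loudly: a static initializer marks class initialization, not loading as such, so I'm using it here only as an easily observable stand-in; running with -verbose:class would show the actual load events.

```java
import java.util.ArrayList;
import java.util.List;

class LazyLoadingDemo {
    // Records what happened, in order.
    static final List<String> events = new ArrayList<>();

    // Hypothetical class whose first use we want to observe.
    static class Heavy {
        static {
            events.add("Heavy initialized");
        }

        static void work() {
            events.add("Heavy.work");
        }
    }

    static void maybeUse(boolean use) {
        events.add("enter maybeUse(" + use + ")");
        if (use) {
            // Heavy is only touched when this instruction actually executes,
            // not when maybeUse is entered.
            Heavy.work();
        }
    }

    public static void main(String[] args) {
        maybeUse(false); // the method references Heavy, but the branch is not taken
        maybeUse(true);  // now the branch runs and Heavy is initialized
        events.forEach(System.out::println);
    }
}
```

The recorded order is: enter maybeUse(false), enter maybeUse(true), Heavy initialized, Heavy.work. So the first call, which enters a method that references Heavy without executing the branch, does not trigger the initializer.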

I intend to show some samples of this in a separate post.

- Assembly unloading/reloading:

This is an interesting topic, particularly for long-lived applications, but at first sight neither the CLR nor the JVM is very cooperative in this regard. There isn't any Assembly.Unload or Class.Unload method... so what? Well, we should all know by now that in the .Net world assemblies are loaded into Application Domains, which act almost as ".Net processes" inside OS processes. We can unload an Application Domain, and all the assemblies loaded there will be unloaded with it. This is what the w3wp process does: when several Asp.Net applications run within the same IIS App Pool, each one is loaded into a different Application Domain. This way, we can reload an Asp.Net application by unloading its App Domain and loading the app again into a new App Domain. It seems that in the Java world this unloading feature is achieved through custom ClassLoaders, as explained here. In both cases, achieving this is not trivial, but things like MEF or OSGi seem to help.
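To make the Java side a bit more concrete, here is a rough sketch of the ClassLoader-per-version idea. Everything in it is hypothetical (ReloadDemo, PluginLoader, demo.Plugin), and the "plugin" is just an empty class generated in memory so that the example is self-contained. It shows that the same class name loaded through two loader instances gives two distinct Class objects, which is what makes reloading possible. Whether the old version is actually unloaded is up to the garbage collector, so the last line only reports what happened and promises nothing.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.lang.ref.WeakReference;

class ReloadDemo {

    // Hypothetical throwaway loader: one instance per "version" of the plugin.
    static class PluginLoader extends ClassLoader {
        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            byte[] b = emptyClassBytes(name);
            return defineClass(name, b, 0, b.length);
        }
    }

    // Builds a minimal class file: an empty public class extending Object.
    static byte[] emptyClassBytes(String name) throws ClassNotFoundException {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(0xCAFEBABE);   // magic number
            out.writeShort(0);          // minor version
            out.writeShort(52);         // major version (Java 8 class format)
            out.writeShort(5);          // constant pool count = entries + 1
            out.writeByte(7); out.writeShort(2);                    // #1 Class -> #2
            out.writeByte(1); out.writeUTF(name.replace('.', '/')); // #2 Utf8: this class
            out.writeByte(7); out.writeShort(4);                    // #3 Class -> #4
            out.writeByte(1); out.writeUTF("java/lang/Object");     // #4 Utf8: superclass
            out.writeShort(0x0021);     // access flags: ACC_PUBLIC | ACC_SUPER
            out.writeShort(1);          // this_class  -> #1
            out.writeShort(3);          // super_class -> #3
            for (int i = 0; i < 4; i++) {
                out.writeShort(0);      // no interfaces, fields, methods, attributes
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }

    // Every call uses a brand-new loader, so every call yields a new Class.
    static Class<?> loadFresh(String name) throws ClassNotFoundException {
        return new PluginLoader().loadClass(name);
    }

    public static void main(String[] args) throws Exception {
        Class<?> v1 = loadFresh("demo.Plugin");
        Class<?> v2 = loadFresh("demo.Plugin");
        System.out.println(v1.getName().equals(v2.getName())); // true
        System.out.println(v1 == v2);                          // false

        // Drop every strong reference to the old version and its loader.
        WeakReference<ClassLoader> oldLoader = new WeakReference<>(v1.getClassLoader());
        v1 = null;
        System.gc(); // only a hint: collection is not guaranteed
        System.out.println("old loader collected: " + (oldLoader.get() == null));
    }
}
```

The hard part in a real application is the "drop every reference" step: a single live instance, thread or static field pointing into the old loader keeps all of its classes in memory.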

I should clarify that unloading an assembly would mainly mean removing from memory all the metadata structures created for it (EEClasses, MethodTables...) and removing the JITted code.

Many people have asked this question before: why is there no Assembly.Unload method? It was answered in detail many years ago. Long story short: first, you would need to make sure that at that point there are no references to instances of classes defined in that assembly, and second, apart from removing the metadata structures from memory, you would need to remove the JITted code... and both things are quite complex.

From the article above, and from this one, I've found out about the .Net Loader Heap. I was aware of the 3 generational heaps and the Large Object Heap (collectively known as the Managed Heap), but didn't know that statics, metadata and JITted code were put in this separate Loader Heap (there's one per Application Domain). You can read more here:

Another one that you can't entirely ignore in managed code is the heap that stores static variables. It is associated with the AppDomain, static variables live as long as the AppDomain lives. Commonly named "loader heap" in .NET literature. It actually consists of 3 heaps (high frequency, low frequency and stub heap), jitted code and type data is stored there too but that's getting to the nitty gritty.
