Monday 12 July 2021

Out of Memory Killer

In a previous post I expressed my surprise at the fact that .Net does not allow you to limit the heap size of an application (contrary to Java and Node). Bearing in mind that .Net was born as a Windows-only environment (fortunately things have moved a lot in these last years), this omission can be particularly harmful given the way Windows deals (or rather, doesn't deal) with those awful times when the RAM of your system (and the swap space) is completely full.
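
For reference, this is the kind of hard limit I mean: both Java and Node let you cap the heap from the command line (a quick sketch; the application names are just placeholders):

# Java: cap the maximum heap at 4 GB
java -Xmx4g -jar MyApp.jar

# Node: cap the old space (the main heap area) at 4096 MB
node --max-old-space-size=4096 app.js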

Linux manages this situation in a rather sensible way. The Linux kernel provides the OOM Killer (Out Of Memory Killer) mechanism (it can be disabled, but in 99% of cases you shouldn't). Basically, when an application (or the OS itself) requests more physical memory and the OS has no space left (neither in RAM nor in swap), the OOM Killer will choose an application and kill it. This choice is made based on a series of heuristics (though I guess the program using the most RAM tends to be the best candidate). So one (or a few) programs are sacrificed in order to get the system back on track. The process gets a SIGKILL signal and dies immediately.
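
If you're curious about how a given process ranks for that choice, the kernel exposes the score under /proc; here is a small sketch (the PID is just an example):

# badness score the OOM Killer currently assigns to the process (higher = more likely to be killed)
cat /proc/12345/oom_score

# bias the choice: values range from -1000 (never kill this process) to +1000 (kill it first)
echo -500 | sudo tee /proc/12345/oom_score_adj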

At work we have a Java application that consumes a good bunch of RAM and occasionally stops running. We're running it with an -Xmx setting of 4 GB (max heap size) and we had seen it several times using more than 5 GB of physical RAM (that is: stack + Metaspace + code cache + heap...), so it seemed normal that at some point it would reach that 4 GB limit. In that case we should get a "java.lang.OutOfMemoryError: Java heap space" error that would be caught by our code and written to our application log before the application shuts down. As we were not finding anything like this in our log, we were a bit confused until we found out about this nice thing called the OOM Killer. In our case, the whole system had run out of RAM (before the Java heap of our application had reached its limit), so the OOM Killer was doing its job and killing our process (which, being the most RAM-consuming process on that machine, was the natural candidate).
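
That difference is easy to check: a real heap exhaustion leaves a trace in the application's own log, while a kill by the OOM Killer leaves nothing there. Something like this would have found it (just a sketch; the log path is a placeholder):

grep 'java.lang.OutOfMemoryError' /var/log/myapp/application.log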

When the OOM Killer kills a process it logs it to /var/log/messages (in RedHat), which is only accessible by root. We have no root access to that machine, but fortunately this information is also stored in the kernel ring buffer, and any user can read it like this:

dmesg -T | egrep -i 'killed process'

For our OOM we could read this:
[2298074.477791] Out of memory: Kill process 18020 (MyProcess) score 298 or sacrifice child
[2298074.478406] Killed process 18020 (MyProcess), UID 16941, total-vm:8588392kB, anon-rss:5474896kB, file-rss:0kB, shmem-rss:48k
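
As a sanity check, the anon-rss value in that second line matches the physical memory usage I mentioned above (a quick conversion):

# 5474896 kB of anonymous resident memory is roughly 5 GB
echo $((5474896 / 1024 / 1024))   # prints 5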

Windows' behaviour is pretty different (and not very smart, I would say...). There's nothing similar to the OOM Killer, so when the RAM and swap are full, Windows will gift the first process requesting more memory with an Out of Memory error and that process will die. This can be any process, not necessarily the process taking up the most memory (which maybe is not actively requesting new memory at that moment). So you can have a situation where processes keep dying in a system that becomes unusable, until finally the main memory consumer dies or the user restarts the machine... I'm not making this up, you can read it from a more reliable source.
