
Taming the OOM killer


February 4, 2009

This article was contributed by Goldwyn Rodrigues

Under desperately low memory conditions, the out-of-memory (OOM) killer kicks in and picks a process to kill using a set of heuristics which has evolved over time. This can be quite annoying for users who would have preferred that a different process be killed, and the chosen victim may also be important from the system's perspective. To avoid the untimely demise of the wrong processes, many developers feel that a greater degree of control over the OOM killer's activities is required.

Why the OOM-killer?

Major distribution kernels set the default value of /proc/sys/vm/overcommit_memory to zero, which means that processes can request more memory than is currently free in the system. This is done on the assumption that allocated memory is not used immediately, and that processes, over their lifetime, do not use all of the memory they allocate. Without overcommit, a system will not fully utilize its memory, thus wasting some of it. Overcommitting memory allows the system to use memory more efficiently, but at the risk of OOM situations. Memory-hogging programs can deplete the system's memory, bringing the whole system to a grinding halt. This can lead to a situation where memory is so low that not even a single page can be allocated to a user process, to allow the administrator to kill an appropriate task, or to the kernel to carry out important operations such as freeing memory. In such a situation, the OOM killer kicks in and identifies a process to be the sacrificial lamb for the benefit of the rest of the system.

Users and system administrators have often asked for ways to control the behavior of the OOM killer. To facilitate control, the /proc/<pid>/oom_adj knob was introduced to save important processes in the system from being killed, and to define an order in which processes are killed. The possible values of oom_adj range from -17 to +15; the higher the value, the more likely the associated process is to be killed by the OOM killer. If oom_adj is set to -17, the process is not considered for OOM killing at all.
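
The knob is just a file; as a minimal sketch (assuming the /proc/<pid>/oom_adj interface described above, and sufficient privileges to lower the value), a process could exempt itself like this:

    /* Minimal sketch: a process exempts itself from the OOM killer by
     * writing -17 to its own oom_adj file.  Lowering the value normally
     * requires elevated privileges. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/self/oom_adj", "w");

        if (!f) {
            perror("open /proc/self/oom_adj");
            return 1;
        }
        fprintf(f, "%d\n", -17);    /* -17: never consider this process */
        fclose(f);
        return 0;
    }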

Who's Bad?

The process to be killed in an out-of-memory situation is selected based on its badness score, which is reflected in /proc/<pid>/oom_score. The score is designed so that the system loses the minimum amount of work done, recovers a large amount of memory, doesn't kill any process innocent of eating tons of memory, and kills the minimum number of processes (ideally just one). The badness score is computed from the original memory size of the process, its CPU time (utime + stime), its run time (uptime - start time), and its oom_adj value. The more memory the process uses, the higher the score; the longer a process has been alive in the system, the smaller the score.

Any process unlucky enough to be in the swapoff() system call (which removes a swap file from the system) will be selected to be killed first. For the rest, the initial memory size becomes the original badness score of the process. Half of each child's memory size is added to the parent's score if they do not share the same memory. Thus forking servers are the prime candidates to be killed. Having only one "hungry" child will make the parent less preferable than the child. Finally, the following heuristics are applied to save important processes:

  • if the task has a nice value above zero, its score is doubled

  • superuser tasks or tasks with direct hardware access (CAP_SYS_ADMIN, CAP_SYS_RESOURCE or CAP_SYS_RAWIO) have their score divided by 4. This is cumulative, i.e., a superuser task with hardware access would have its score divided by 16.

  • if the OOM condition happened in one cpuset and the task being examined does not belong to that set, its score is divided by 8.

  • the resulting score is multiplied by two to the power of oom_adj (i.e. points <<= oom_adj when it is positive and points >>= -(oom_adj) otherwise).

The task with the highest badness score is then selected; the OOM killer first tries to kill one of its children and only kills the selected process itself when it does not have children.
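
Put together, the selection logic amounts to something like the following simplified sketch. It is an illustration of the rules just described, not the kernel's actual badness() code; the structure and field names are invented for the example, and the CPU-time and run-time discounts are elided:

    /* Simplified illustration of the heuristics described above; not the
     * kernel's actual badness() implementation. */
    struct task_info {
        unsigned long total_vm;       /* memory size, in pages */
        int nice;                     /* nice value */
        int has_sys_admin;            /* CAP_SYS_ADMIN or CAP_SYS_RESOURCE */
        int has_rawio;                /* CAP_SYS_RAWIO */
        int other_cpuset;             /* OOM happened in a different cpuset */
        int oom_adj;                  /* -17..+15 from /proc/<pid>/oom_adj */
    };

    unsigned long badness(const struct task_info *t,
                          unsigned long children_vm /* summed over children */)
    {
        if (t->oom_adj == -17)
            return 0;                 /* never consider this task */

        unsigned long points = t->total_vm + children_vm / 2;

        /* (the real code also discounts points for CPU time and run time) */

        if (t->nice > 0)
            points *= 2;              /* niced tasks score double */
        if (t->has_sys_admin)
            points /= 4;              /* privileged tasks are protected... */
        if (t->has_rawio)
            points /= 4;              /* ...cumulatively: divided by 16 */
        if (t->other_cpuset)
            points /= 8;              /* task in an unaffected cpuset */

        if (t->oom_adj > 0)
            points <<= t->oom_adj;    /* administrator's adjustment */
        else
            points >>= -t->oom_adj;

        return points;                /* the highest score is killed */
    }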

Shifting OOM-killing policy to user-space

/proc/<pid>/oom_score is a dynamic value which changes over time, so it is a poor fit for the varied and changing policies an administrator may require. It is difficult to predict which process will be killed in an OOM situation, so the administrator would have to adjust scores every time a process is created or exits - quite a task on a system with quickly-spawning processes. In an attempt to make OOM-killer policies easier to implement, a name-based solution was proposed by Evgeniy Polyakov. With his patch, the process to die first is the one running the program whose name is found in /proc/sys/vm/oom_victim. A name-based solution has its limitations, though:

  • the task name is not a reliable identifier of the program being run; it is truncated in the kernel's process-name field, and a binary executed through a symlink with a different name will not match.

  • only one name can be specified at a time, ruling out the possibility of a hierarchy of victims.

  • there could be multiple processes with the same name but running different binaries.

  • if there is no process with the name given in /proc/sys/vm/oom_victim, the behavior falls back to the current default implementation; this adds an extra scan before the victim process is found.

Alan Cox disliked this solution, suggesting that containers are the most appropriate way to control the problem. In response to this suggestion, the oom_killer controller, contributed by Nikanth Karthikesan, provides control of the sequence of processes to be killed when the system runs out of memory. The patch introduces an OOM control group (cgroup) with an oom.priority field. The process to be killed is selected from the processes having the highest oom.priority value.

To take control of the OOM-killer, mount the cgroup OOM pseudo-filesystem introduced by the patch:

    # mount -t cgroup -o oom oom /mnt/oom-killer

The OOM-killer directory contains the list of all processes in the file tasks, and their OOM priority in oom.priority. By default, oom.priority is set to one.

If you want to create a special control group containing the list of processes which should be the first to receive the OOM killer's attention, create a directory under /mnt/oom-killer to represent it:

    # mkdir lambs

Set oom.priority to a sufficiently high value:

    # echo 256 > /mnt/oom-killer/lambs/oom.priority

oom.priority is an unsigned 64-bit integer, so it can take any value up to the maximum an unsigned 64-bit number can hold. While scanning for the process to be killed, the OOM killer selects a victim from the list of tasks with the highest oom.priority value.

Add the PID of each process that should be in this group to the tasks file:

    # echo <pid> > /mnt/oom-killer/lambs/tasks

To create a list of processes which will not be killed by the OOM killer, make a directory to contain them:

    # mkdir invincibles

Setting oom.priority to zero excludes all processes in this cgroup from the list of candidates to be killed:

    # echo 0 > /mnt/oom-killer/invincibles/oom.priority

To add more processes to this group, add the PID of each task to the list of tasks in the invincibles group:

    # echo <pid> > /mnt/oom-killer/invincibles/tasks

Important processes, such as database processes and their controllers, can be added to this group so that they are ignored when the OOM killer searches for processes to kill. All children of the processes listed in tasks are automatically added to the same control group and inherit the oom.priority of the parent. When multiple tasks share the highest oom.priority, the OOM killer selects among them based on oom_score and oom_adj.

This approach did not appeal to cpuset users, though. Consider two cpusets, A and B. If a process in cpuset A has a high oom.priority value, it will be killed if cpuset B runs out of memory, even though there is enough memory in cpuset A. This calls for a different design to tame the OOM killer.

An interesting outcome of the discussion has been the idea of handling OOM situations in user space. The kernel sends a notification to user space, and applications respond by dropping their user-space caches. If the user-space processes are unable to free enough memory, or simply ignore the kernel's requests, the kernel resorts to the good old method of killing processes. mem_notify, developed by Kosaki Motohiro, is one such attempt made in the past. However, the mem_notify patch cannot be applied to versions beyond 2.6.28 because the memory-management reclaim sequence has changed, though the design principles and goals can be reused. David Rientjes suggests one of two hybrid solutions:

One is the cgroup OOM notifier that allows you to attach a task to wait on an OOM condition for a collection of tasks. This allows userspace to respond to the condition by dropping caches, adding nodes to a cpuset, elevating memory controller limits, sending a signal, etc. It can also defer to the kernel OOM killer as a last resort.

The other is /dev/mem_notify that allows you to poll() on a device file and be informed of low memory events. This can include the cgroup oom notifier behavior when a collection of tasks is completely out of memory, but can also warn when such a condition may be imminent. I suggested that this be implemented as a client of cgroups so that different handlers can be responsible for different aggregates of tasks.

Most developers prefer making /dev/mem_notify a client of control groups. This can be further extended to merge with the proposed oom-controller.
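
As a rough illustration of the /dev/mem_notify idea, a user-space daemon would simply poll() the device and shed caches when it becomes readable. The device name and semantics below come from the unmerged mem_notify patch, so this is a sketch of a proposed interface, not of anything in mainline:

    /* Hedged sketch of a low-memory watcher built on the proposed
     * /dev/mem_notify interface (from the unmerged mem_notify patch). */
    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>

    int main(void)
    {
        int fd = open("/dev/mem_notify", O_RDONLY);

        if (fd < 0) {
            perror("open /dev/mem_notify");
            return 1;
        }
        for (;;) {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };

            if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
                /* Memory is getting tight: drop application-level caches
                 * before the kernel has to start killing processes. */
                fprintf(stderr, "low memory notification received\n");
                /* drop_my_caches();  -- application-specific */
            }
        }
    }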

Low Memory in Embedded Systems

The Android developers required a greater degree of control over low-memory situations because the OOM killer does not kick in until late in the game, i.e. when all of the cache has been emptied. Android wanted a solution which would start early, while free memory is being depleted, so they introduced the "lowmemory" driver, which has multiple low-memory thresholds. When the first threshold is crossed, background processes are notified of the problem; they do not exit, but instead save their state. This affects latency when switching applications, because the application has to be reloaded on activation. Under further pressure, the lowmemory killer kills the non-critical background processes whose state was saved at the previous threshold and, finally, the foreground applications.

Keeping multiple low-memory triggers gives processes enough time to free memory from their caches; in a full OOM situation, user-space processes may not be able to run at all, and all it takes is a single allocation for the kernel's internal structures, or a page fault, to run the system out of memory. An earlier notification of a low-memory situation can avoid the OOM situation altogether, with a little help from user-space applications which respond to it.

Killing processes based on kernel heuristics is not an optimal solution, and these new initiatives, which offer users better control over which process becomes the sacrificial lamb, are steps toward a more robust design. However, it may take some time to reach consensus on a final control solution.

Index entries for this article
Kernel: Memory management/Out-of-memory handling
Kernel: OOM killer
GuestArticles: Rodrigues, Goldwyn



Taming the OOM killer

Posted Feb 5, 2009 2:29 UTC (Thu) by brouhaha (subscriber, #1698) [Link]

I'm still baffled as to why this is an issue at all. IMNSHO, the ability to overcommit memory should never have been created in the first place. If you need more memory, buy more memory, or create a larger swap partition or file.

What user-space programs are allocating so much more memory than they actually need, anyhow?

I've been doing all of my software development and electrical engineering work (including schematic capture, PCB layout including autorouting, HDL simulation, FPGA synthesis, etc.) for years on a system with no swap space and with memory overcommit disabled, and I haven't run into any problem with it. On rare occasions I've been unable to start a synthesis or simulation run, which is perfectly fine with me, because I'd much rather be unable to start a run than to have a run get killed randomly after hours because the memory it thought it had allocated wasn't actually available, or worse yet, have some other random process killed.

Taming the OOM killer

Posted Feb 5, 2009 6:19 UTC (Thu) by cpeterso (guest, #305) [Link]

I agree. Think of all the time wasted on OOM Killer development, mailing list flame wars, and user confusion for this hacky anti-feature.

Once the OOM Killer starts shooting down processes, I can't imagine that system will remain in a usable state much longer. You've run out of memory and your processes (and work in progress) are gone.

Given that major distros default OOM Killer to off, who is the target market for the OOM Killer?

Taming the OOM killer

Posted Feb 8, 2009 19:05 UTC (Sun) by anton (subscriber, #25547) [Link]

Once the OOM Killer starts shooting down processes, I can't imagine that system will remain in a usable state much longer.
I have actually experienced several times that the system was stable and usable after the OOM killer had killed the right process. This typically involved killing only a pure user program that was not needed for any system job. In several cases these were compiler runs on a machine with 24GB of RAM and 48GB of swap, and buying more memory was not very practical, and probably would not have helped anyway: the memory consumption was probably due to a bug in the compiler.

Concerning memory overcommitment, I think that this is a good idea for most programs (which are not written to survive failing memory allocations). And relying on overcommitment can simplify programming: e.g., allocate a big chunk and put your growing structure there instead of reallocating all the time. And when you do it, do it right (i.e., echo 1 >/proc/sys/vm/overcommit_memory), not the half-hearted Linux default, which gives us the disadvantages of overcommitment (i.e., the OOM killer) combined with the disadvantages of no overcommitment (unpredictable allocation failures).

Concerning critical system programs, those should be written to survive failing memory allocations, should get really-committed memory if the allocation succeeds, and consequently should not be OOM-killed.

I have outlined this idea in more depth, and I think that AIX and/or Solaris implement something similar. Instead of my per-process idea, the MAP_NORESERVE flag allows to switch between commitment and overcommitment on a per-allocation basis (not sure how useful that is, as well as the default of committing memory).

Taming the OOM killer

Posted Feb 5, 2009 7:37 UTC (Thu) by dlang (guest, #313) [Link]

any process that forks and execs allocates more memory than it needs.

the fork technically needs to allocate as much memory as the program is currently using.

this means that if you have a 512MB firefox process that needs to exec a 64k program to handle some mime type, you first allocate 512MB of additional memory, then exec the 64k program (at which time 511.95MB of ram will be freed).

it used to be (in the bad old unix days) that when a fork happened the kernel would take the time to copy all 512MB of ram, only to throw it away a few ms later (the vfork call was invented as a work-around for this: a way to tell the system that it wanted to fork, but not really)

modern *nix systems instead go through and mark those 512MB of memory as Copy-On-Write (COW), which isn't free, but is _FAR_ cheaper than actually touching all the memory (in extreme cases it changes from touching 1GB of ram (512MB of read, 512MB of write) to touching 16K of memory (1 bit per 8k page)). this is a phenomenal speedup (and it gets even better when you consider the amount of cpu cache that you avoid corrupting in the process).

in cases where the system doesn't exec something else you can get drastic savings due to the fact that the memory pages that contain the binary you are executing never change and therefore you never need to copy them. this is similar to the memory savings from doing threading, but without the risk of one thread affecting another thread
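
A minimal sketch of the fork-and-exec pattern being described (the helper program name is just an example, and the comments assume the large-parent scenario above):

    /* Fork-and-exec sketch: at fork() time the large parent's pages are
     * only marked copy-on-write, and almost nothing is actually copied
     * before the child replaces itself with a tiny helper program. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();         /* COW: no wholesale copy happens here */

        if (pid < 0) {
            perror("fork");         /* with overcommit disabled, a large
                                       parent may fail right here */
            exit(1);
        }
        if (pid == 0) {
            execlp("true", "true", (char *)NULL);   /* tiny helper program */
            _exit(127);             /* only reached if exec failed */
        }
        waitpid(pid, NULL, 0);
        return 0;
    }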

Taming the OOM killer

Posted Feb 5, 2009 8:00 UTC (Thu) by brouhaha (subscriber, #1698) [Link]

Sure, but there's no reason not to do both Copy On Write AND count the memory that theoretically might be needed by the fork() as committed. If there isn't enough memory/swap to handle that, the fork() should fail. The user should have enough memory and/or swap that this isn't a problem.

Assuming that when a big process does a fork() that it is going to exec() a small process is completely absurd. Sure, that may happen fairly often, but the case where a small process does a fork() and does an exec() of a large process also happens fairly often.

Usually somewhere in this discussion someone says "but what about embedded systems", which I claim actually supports my position, because it is even MORE important in an embedded system for there to (1) be sufficient memory/swap, and (2) not let the OOM killer nuke some random process if there isn't.

Taming the OOM killer

Posted Feb 5, 2009 8:06 UTC (Thu) by dlang (guest, #313) [Link]

swap isn't free, especially on embedded systems, so providing enough memory+swap to handle your total allocation may not be reasonable.

also, actually _using_ swap can be extremely painful, sometimes more painful than simply crashing the system (at least a crash is something you can have watchdogs for, and failover/reboot)

in theory you are right, the system is unpredictable with overcommit enabled. in practice it is reliable enough for _many_ uses

Taming the OOM killer

Posted Feb 5, 2009 8:22 UTC (Thu) by brouhaha (subscriber, #1698) [Link]

swap isn't free, especially on embedded systems
I have yet to use an embedded Linux system that didn't have substantially more "disk" than RAM, except one case in which there was no disk but plenty of RAM. I somewhat question the wisdom of designing an embedded Linux system for which there is little RAM and even less disk.

In an embedded system, I wouldn't expect there to be a large number of processes sleeping between a fork() and exec(). If there were, that would most likely be a sign of serious problems, so having a fork() fail under such circumstances seems like a good thing.

also, actually _using_ swap can be extremely painful, sometime more painful than simply crashing the system
Sure, but when copy-on-write is used for the fork()/exec() case that seems to be what people are worried about, the swap won't actually be used. It will just be reserved until the exec().

If you're concerned about system performance degrading because of excessive swap usage, there's no reason why you can't have a user-space process to act as a watchdog for that problem, which may occur for reasons other than memory "overcommit".

I've been involved in engineering a number of embedded products that had to have high reliability in the field, and I would not dream of shipping such a product with the kernel configured to allow memory overcommitment. Even though you can _usually_ get away with it, "usually" isn't good enough. There simply needs to be enough memory (and/or swap) to handle the worst-case requirements of the system. Otherwise it _will_ fail in the field, and thus not be as reliable as intended.

Taming the OOM killer

Posted Feb 5, 2009 8:33 UTC (Thu) by dlang (guest, #313) [Link]

one point I was trying to make (and apparently failed) is that even on systems where you have the disk space for swap, having things use that swap can be a big problem

if you could allocate the address space but then tell the kernel "don't really use it" you may be ok, but how is that different from the current overcommit?

you _are_ overcommitting (compared to what is acceptable to the system's performance) and counting on the efficiencies of COW to keep you from actually using the swap space you have committed.

the only difference is that you overcommit up to a given point (at which time your allocations start failing, which may also cause the system to 'fail' as far as the user is concerned)

i fully agree that there are situations where disabling overcommit is the right thing to do. However, I am also seeing other cases where allowing overcommit is the right thing to do.

Taming the OOM killer

Posted Feb 5, 2009 9:07 UTC (Thu) by brouhaha (subscriber, #1698) [Link]

if you could allocate the address space but then tell the kernel "don't really use it"
I'm not telling the kernel "don't use it". If the kernel needs to, it will use it. For the primary case people seem concerned with, the time between fork() and exec(), it will be committed but due to COW, it won't actually get used. It may still get used for other cases, and within reason that's a good thing, but a user-space daemon can take some system-specific corrective action if it gets out of hand. This provides a whole lot more flexibility in error handling than a user-space daemon that would only control the behavior of the OOM killer.
you may be ok, but how is that different from the current overcommit?
It's different because the kernel is NEVER going to kill an unrelated process selected by a heuristic. It is going to fail an allocation or fork, and the software can take some reasonable recovery action.

The system should not be designed or configured such that the kernel can fail to provide memory that has been committed, because there is NO reasonable recovery mechanism for that. It is far easier to handle memory allocation problems gracefully when the error is reported at the time of the attempt to commit the memory, rather than at some random future time.

Taming the OOM killer

Posted Feb 5, 2009 9:16 UTC (Thu) by dlang (guest, #313) [Link]

so you would rather have the system slow to an unusable crawl if it actually tries to use all the memory that has been asked for rather than have _anything_ killed under _any_ conditions.

there are times for that, but there are also times when 99.999% reliability is good enough.

Taming the OOM killer

Posted Feb 5, 2009 10:23 UTC (Thu) by epa (subscriber, #39769) [Link]

The current setup (where memory is overcommitted and there is an OOM killer) is also quite capable of slowing the system to an unusable crawl if you have swap space in use. So I don't think that turning off overcommit and allocating a slightly larger amount of swap would make the situation any worse.

(On a related note, the kernel is free to refuse any request for extra memory, and can do so for its own reasons. So for example if a process needs to fork() then the memory allocation would normally succeed, on the assumption that the extra memory probably won't be used, but provided there is enough swap space to back it up just in case. Whereas an explicit memory allocation 'I want ten gigabytes' could, as a matter of policy, be denied if the system doesn't have that much physical RAM.)

Taming the OOM killer

Posted Feb 5, 2009 21:03 UTC (Thu) by dlang (guest, #313) [Link]

I'm not talking about 'I need 10G of memory' allocations, I'm talking about cases where lots of small programs end up using individually small amounts of memory, but the total is large.

but if you have large programs that may need to fork, it's not necessarily the case that it's 'a slightly larger amount of swap'. I've seen people arguing your point of view toss off that a large system should be willing to dedicate a full 1TB drive just to swap so that it can turn overcommit off. in practice, if you end up using more than a gig or so of swap your system slows to a crawl

Taming the OOM killer

Posted Feb 5, 2009 22:26 UTC (Thu) by epa (subscriber, #39769) [Link]

I think it would be useful to add swap space for 'emergency only' use. So if all physical RAM is in use, the kernel starts refusing user space requests for more memory. However if a process wants to fork() the kernel can let it succeed, knowing that in the worst case there is swap space to back its promises.

It is rather a problem that merely adding swap space as available means it can then be used by applications just as willingly as physical RAM. Perhaps a per-process policy flag would say whether an app can have its memory allocation requests start going to swap (as opposed to getting 0 from malloc() when physical RAM is exhausted). Then sysadmins could switch this flag on for particular processes that need it.

Taming the OOM killer

Posted Feb 6, 2009 0:45 UTC (Fri) by nix (subscriber, #2304) [Link]

The problem is that the system is more dynamic than that. Swap space is
moved to and from physical memory on demand; there is almost never much
free physical memory, because free memory is wasted memory, so the first
sign you get that you're about to run out of memory is when you're out of
*swap* and still allocating more (reducing the various caches and paging
text pages out as you go).

Taming the OOM killer

Posted Feb 5, 2009 12:42 UTC (Thu) by mjthayer (guest, #39183) [Link]

I would rather have my system slow to an unusable crawl if I was confident that it would come out of it again at some point. Even then, I can still press the reset button, which is what I have usually ended up doing in OOM situations anyway. And the same way as you can tune the behaviour of the OOM killer, you could also tune which applications the system tries to keep responsive, so that you can reasonably quickly manually kill (or just stop) the offending processes.

Taming the OOM killer

Posted Feb 5, 2009 15:44 UTC (Thu) by hppnq (guest, #14462) [Link]

I would rather have my system slow to an unusable crawl if I was confident that it would come out of it again at some point. Even then, I can still press the reset button, which is what I have usually ended up doing in OOM situations anyway.

On your home system this makes some sense, but all this goes out the window once you have to take service levels into account.

Taming the OOM killer

Posted Feb 6, 2009 7:46 UTC (Fri) by mjthayer (guest, #39183) [Link]

Granted, but then you don't want random processes dying either. That can also have adverse effects on service levels. In that case you are more likely to want a system that will stop allocating memory in time.

Taming the OOM killer

Posted Feb 6, 2009 8:58 UTC (Fri) by dlang (guest, #313) [Link]

it's actually far easier to deal with processes dying than the entire machine effectively locking up in a swap storm.

you probably already have tools in place to detect processes dying and either restart them (if the memory pressure is temporary) or failover to another box (gracefully for all the other processes on the box)

Taming the OOM killer

Posted Jul 15, 2014 2:27 UTC (Tue) by bbulkow (guest, #87167) [Link]

When the random process is SSHD, few tools continue to function. Yes, I've seen this in production multiple times. I wish that most server distributions did not allow over commit, and/or SSHD was protected. I also wish the OOM killer system messages were clearer.

Taming the OOM killer

Posted Jul 15, 2014 2:52 UTC (Tue) by dlang (guest, #313) [Link]

turning off overcommit would cause more memory allocation failures (because the memory system would say that it couldn't guarantee memory that ends up never being used)

True, it would happen at malloc() time instead of randomly, but given that most programs don't check return codes, this would help less than it should

Taming the OOM killer

Posted Jul 15, 2014 9:41 UTC (Tue) by dgm (subscriber, #49227) [Link]

> but given that most programs don't check return codes

IMHO, this should be treated like a bug.

> the memory system would say that it couldn't guarantee memory that ends up *never being used*

This too.

Taming the OOM killer

Posted Jul 15, 2014 19:11 UTC (Tue) by dlang (guest, #313) [Link]

>> but given that most programs don't check return codes

> IMHO, this should be treated like a bug.

you have a right to your opinion, but in practice, your opinion doesn't matter that much

>> the memory system would say that it couldn't guarantee memory that ends up *never being used*

> This too.

exactly how would you expect the linux kernel to know that the application that just forked is never going to touch some of the memory of the parent and therefore doesn't need it to be duplicated (at least in allocation)?

this is especially important for large programs that are forking so that the child can then exec some other program. In this case you may have a multi-GB allocation that's not needed because the only thing the child does is to close some file descriptors and exec some other program. With the default overcommit and Copy-on-Write, this 'just works', but with overcommit disabled, the kernel needs to allocate the multiple GB of RAM (or at least virtual memory) just in case the application is going to need it. This will cause failures if the system doesn't have a few extra GB around to handle these wasteful allocations.

not to mention that there's overhead in updating the global allocations, so allocating and then deallocating memory like that has a cost.

Taming the OOM killer

Posted Jul 16, 2014 11:23 UTC (Wed) by dgm (subscriber, #49227) [Link]

> exactly how would you expect the linux kernel to know that the application that just forked is never going to touch some of the memory of the parent and therefor doesn't need it to be duplicated (at least in allocation)?

What about telling it that you're just about to call execv, so it doesn't need to? What about auto-detecting this by simply watching what the first syscall after fork is?

Not bad for just 15 seconds of thinking about it, is it?

Taming the OOM killer

Posted Jul 16, 2014 12:05 UTC (Wed) by JGR (subscriber, #93631) [Link]

The very first syscall after fork is not necessarily execv, fds are often closed/set up just beforehand.
Even if execv is called immediately (for some value of immediately), the parent may well have scribbled over the memory which holds the parameters to be passed to execv in the child, before the child has called execv.
If it's really essential that nothing should be duplicated, you can still use vfork.

Taming the OOM killer

Posted Jul 16, 2014 12:18 UTC (Wed) by dgm (subscriber, #49227) [Link]

Not to mention that overcommit is not the same as CoW. You can keep CoW and still disable overcommit (there's even a knob for that).

Taming the OOM killer

Posted Jul 16, 2014 15:07 UTC (Wed) by nybble41 (subscriber, #55106) [Link]

> Not to mention that overcommit is not the same as CoW. You can keep CoW and still disable overcommit....

CoW is still a form of overcommit, even if it's not referred to as such. In the one case you commit to allocating a new page in the future, on the first write, and pre-filling it with a copy of an existing page. In the other case you commit to allocating a new page in the future, probably on the first write, and pre-filling it with zeros. In both cases you're writing an IOU for memory which may not actually exist when it's needed.

You could pre-allocate memory for CoW while deferring the actual copy, but that would only be a performance optimization. You'd still have the problem that fork() may fail in a large process for lack of available memory even though the child isn't going to need most of it.

Taming the OOM killer

Posted Jul 16, 2014 14:06 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

    close(0); close(1); close(2);
    dup2(childendofsocket, 0);
    dup2(childendofsocket, 1);
    dup2(childendofsocket, 2);
    close(parentendofsocket);
    execve(/*args*/);
    _exit(255);

Taming the OOM killer

Posted Jul 16, 2014 18:50 UTC (Wed) by nix (subscriber, #2304) [Link]

Even if you checked allocator return codes perfectly, it still wouldn't help: you can OOM calling a function if there isn't enough memory to expand the stack, even in the absence of overcommit. Nothing you can do about *that* (other than to 'pre-expand' the stack with a bunch of do-nothing function calls early in execution, and hope like hell you expanded it enough).

Taming the OOM killer

Posted Jul 17, 2014 14:22 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

> other than to 'pre-expand' the stack with a bunch of do-nothing function calls early in execution, and hope like hell you expanded it enough

Also that you don't expand it too much and crash in your stack_balloon function.

Taming the OOM killer

Posted Jul 15, 2014 14:44 UTC (Tue) by raven667 (subscriber, #5198) [Link]

sshd should be auto-restarted by systemd which should help save the system if OOM killer is running rampant.

Taming the OOM killer

Posted Feb 5, 2009 21:06 UTC (Thu) by dlang (guest, #313) [Link]

the problem is that a system that goes heavily into swap may not come back out for hours or days.

if you are willing to hit reset in this condition then you should be willing to deal with the OOM killer killing the box under the same conditions.

Taming the OOM killer

Posted Feb 6, 2009 8:00 UTC (Fri) by mjthayer (guest, #39183) [Link]

As I said, perhaps some work could be put into improving this situation, then, rather than improving the OOM killer. Like using the same heuristics they are developing for the killer to determine processes to freeze and move completely into swap, freeing up memory for other processes. This is of course somewhat easier to correct if the heuristics go wrong (unless they go badly wrong of course, and take down the X server or whatever) than if the process is just shot down.

Taming the OOM killer

Posted Feb 12, 2009 19:14 UTC (Thu) by efexis (guest, #26355) [Link]

There's no reason for the OOM killer to kick in if there's swap available; stuff can just be swapped out (swapping may need memory, in which case you set a watermark so that swapping is forced before free memory drops below that point, to ensure that swapping can happen). OOM means exactly what it says - you're out of memory, silicon or magnetic it makes no difference.

Personally I have swap disabled or set very low, as a runaway process will basically mean I lose contact with a server, unable to log in to it or anything, while it chews through all available memory *and* swap (causing IO starvation, IO being the thing I need to log in and kill the offending task), until it hits the limit and gets killed.

Everything important is set to be restarted, either directly from init, or indirectly from daemontools or equivalent, which is restarted by init should it go down (which has never happened).

Taming the OOM killer

Posted Feb 13, 2009 23:33 UTC (Fri) by mjthayer (guest, #39183) [Link]

I have been thinking about this a bit more, since my system was just swapped to death again (and no, the OOM killer did not kick in). Has anyone tried setting a per-process memory limit in percentage of the total physical RAM? That would help limit the damage done by runaway processes without stopping large processes from forking.

Taming the OOM killer

Posted Feb 14, 2009 0:03 UTC (Sat) by dlang (guest, #313) [Link]

if you swapped to death and OOM didn't kick in, you have probably allocated more swap than you are willing to have used.

how much swap did you allocate? any idea how much was used?

enabling overcommit with small amounts of swap will allow large programs to fork without problems, but will limit runaway processes. it's about the textbook case for using overcommit.

Taming the OOM killer

Posted Feb 16, 2009 9:04 UTC (Mon) by mjthayer (guest, #39183) [Link]

> how much swap did you allocate? any idea how much was used?

Definitely too much (1 GB for 2 GB of RAM), as I realised after reading this: http://kerneltrap.org/node/3202. That page was also what prompted my last comment. It seems a bit strange to me that increasing swap size should so badly affect system performance in this situation, and I wondered whether this could be fixed with the right tweak, such as limiting the amount of virtual memory available to processes, say to a default of 80 percent of physical RAM. This would still allow for large processes to fork, but might catch runaway processes a bit earlier. I think that if I find some time, I will try to work out how to do that (assuming you don't answer in the mean time to tell me why that is a really bad idea, or that there already is such a setting).

Taming the OOM killer

Posted Feb 16, 2009 15:38 UTC (Mon) by dlang (guest, #313) [Link]

have you looked into setting the appropriate values in ulimit?

Taming the OOM killer

Posted Feb 17, 2009 8:23 UTC (Tue) by mjthayer (guest, #39183) [Link]

> have you looked into setting the appropriate values in ulimit?

Indeed. I set ulimit -v 1600000 (given that I have 2GB of physical RAM) and launched a known bad process (gnash on a page I know it can't cope with). gnash crashed after a few minutes, without even slowing down my system. I just wonder why this is not done by default. Of course, one could argue that this is a user or distribution problem, but given that knowledgeable people can change the value, why not in the kernel? (Again, to say 80% of physical RAM. I tried with 90% and gnash caused a noticeable performance degradation.) This is not a rhetorical question, I am genuinely curious.
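
For reference, the shell's ulimit -v corresponds to the RLIMIT_AS resource limit, so the same cap can be applied from a small wrapper program; a minimal sketch (the 1.6GB figure simply mirrors the comment above):

    /* Minimal sketch: cap a command's address space roughly the way
     * "ulimit -v 1600000" does, by setting RLIMIT_AS before exec. */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct rlimit rl = {
            .rlim_cur = 1600000UL * 1024,   /* ulimit -v works in KB */
            .rlim_max = 1600000UL * 1024,
        };

        if (argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 1;
        }
        if (setrlimit(RLIMIT_AS, &rl) != 0) {
            perror("setrlimit(RLIMIT_AS)");
            return 1;
        }
        execvp(argv[1], &argv[1]);          /* run the command under the cap */
        perror("execvp");
        return 127;
    }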

Taming the OOM killer

Posted Feb 17, 2009 8:29 UTC (Tue) by dlang (guest, #313) [Link]

simple, the kernel doesn't know what is right for you. how can it know that you really don't want this program that you start to use all available ram (even at the expense of other programs)

the distro is in the same boat. if they configured it to do what you want, they would have other people screaming at them that they would rather see the computer slow down than have programs die (you even see people here arguing that)

Taming the OOM killer

Posted Feb 17, 2009 14:27 UTC (Tue) by mjthayer (guest, #39183) [Link]

> simple, the kernel doesn't know what is right for you. how can it know that you really don't want this program that you start to use all available ram (even at the expense of other programs)

It does take a decision though - to allow all programmes to allocate as much RAM as they wish by default, even if it is not present, is very definitely a policy decision. Interestingly Wine fails to start if I set ulimit -v in this way (I can guess why). I wonder whether disabling overcommit would also prevent it from working?

Taming the OOM killer

Posted Feb 5, 2009 10:17 UTC (Thu) by epa (subscriber, #39769) [Link]

any process that forks and execs allocates more memory than it needs.
Quite. Which is why a single fork-and-exec-child-process system call is needed. With that, there would be much less need to overcommit memory and so a better chance of avoiding hacks like the OOM killer.

The classical Unix design of separate fork() and exec() is elegant at first glance, but in practice it has caused various unpleasant kludges to cope with memory overcommit. (Another one was vfork(), which IIRC was a fork call that used less memory but only worked as long as you promised to call exec() immediately afterwards. Why they didn't make a single fork-plus-exec primitive rather than this crufty interface eludes me.)

Taming the OOM killer

Posted Feb 5, 2009 11:16 UTC (Thu) by iq-0 (subscriber, #36655) [Link]

It is effectively the same as your all-in-one system call, since it would have to block till done to receive the necessary feedback. This way you can even implement a simple 'if exec(a) fails try exec(b) or otherwise flag error with (possibly program specific) meaningful data'...

I don't know if filehandle closing is allowed after vfork, but this would also be a great help to ensure the right file handles are passed (which would be a tedious operation or really complex/verbose in the case of a single fork-and-exec call).

Taming the OOM killer

Posted Feb 5, 2009 14:08 UTC (Thu) by epa (subscriber, #39769) [Link]

This way you can even implement a simple 'if exec(a) fails try exec(b) or otherwise flag error with (possibly program specific) meaningful data'...
Oh sure, sometimes you will want to do something more complex like that. In these cases vfork() (as in classic BSD) doesn't work, because the child process is using memory belonging to the parent. The traditional fork() then exec() is best.

But this is a small minority of the cases when an external command is run. And running an external command accounts for a large proportion of total fork()s. I'm just suggesting to make something more robust (avoiding the need to overcommit memory) for the common case.

Taming the OOM killer

Posted Feb 5, 2009 17:03 UTC (Thu) by nix (subscriber, #2304) [Link]

I benchmarked this a while back. The common case is pipeline invocation,
for which you need at least dup()s and filehandle manipulation between
fork() and exec(): and the nature of such manipulation differs for each
invoker...

Taming the OOM killer

Posted Feb 5, 2009 22:05 UTC (Thu) by epa (subscriber, #39769) [Link]

OK, I guess it's not as straightforward as I thought.

Perhaps another way to avoid the need for memory allocation would be to use a new vfork-like call (heck, it could even be called vfork) that has a fixed memory budget set as a matter of policy system-wide. So when you vfork(), the memory is set up as copy-on-write, but the child process has a budget of at most 1000 pages it can scribble on. That should be enough to set up the necessary file descriptors, but if it tries to dirty more than its allowance it is summarily killed.

That way, there is some upper limit to the amount of memory that needs to be allocated - when vfork()ing the kernel just needs to ensure 1000 free pages - and the kernel doesn't have to make a (possibly untrustworthy) promise that the whole process address space is available for normal use.

Taming the OOM killer

Posted Feb 5, 2009 11:49 UTC (Thu) by alonz (subscriber, #815) [Link]

<sarcasm>
So you would prefer the VMS/Win32 style “CreateProcess” system call, with its 30+ arguments—just in order to accommodate all possible behaviors expected from the parent process?
</sarcasm>

This isn't really simpler (except on block diagrams…)

Taming the OOM killer

Posted Feb 5, 2009 12:52 UTC (Thu) by mjthayer (guest, #39183) [Link]

Are you saying that nothing is possible in-between (v)fork plus exec and CreateProcess?

Taming the OOM killer

Posted Feb 5, 2009 20:00 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

(From SUSv2 / POSIX draft.)
The vfork() function has the same effect as fork(2), except that the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions.
On the other hand, pid_t child = clone(run_child_func, run_child_stack, CLONE_VM, run_child_data) would do the trick. The child would share memory with the parent, making overcommit unnecessary, but would have a different file descriptor table, allowing pipelines to be set up easily.
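
A hedged sketch of that clone()-based approach; the stack size and program name are arbitrary examples, and, as with vfork(), the child must take care not to scribble on the shared address space before it execs:

    /* Sketch of the clone(CLONE_VM) idea: the child shares the parent's
     * memory (so nothing is duplicated or overcommitted), gets its own
     * file descriptor table, and execs a new program. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define CHILD_STACK_SIZE (64 * 1024)

    static int run_child(void *arg)
    {
        char **args = arg;
        /* set up pipes/descriptors here, then replace ourselves */
        execvp(args[0], args);
        _exit(127);                         /* exec failed */
    }

    int main(void)
    {
        static char *args[] = { "true", NULL };     /* example program */
        char *stack = malloc(CHILD_STACK_SIZE);

        if (!stack) {
            perror("malloc");
            return 1;
        }
        /* CLONE_VM shares memory; omitting CLONE_FILES keeps a private fd
         * table; pass the top of the stack, since stacks grow downward */
        pid_t pid = clone(run_child, stack + CHILD_STACK_SIZE,
                          CLONE_VM | SIGCHLD, args);
        if (pid < 0) {
            perror("clone");
            return 1;
        }
        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }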

Taming the OOM killer

Posted Feb 5, 2009 16:55 UTC (Thu) by nix (subscriber, #2304) [Link]

fork()+other things+exec() are very common, to implement e.g. redirection
and piping. The call you want exists in POSIX, with the inelegant name
posix_spawn*(), but it's a nightmare of overdesign and a huge family of
functions precisely because it has to model all the things people usually
want to do between fork() and exec(). Its only real use is in embedded
MMUless systems in which fork() is impractical or impossible to implement.
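
For the curious, a small sketch of what the posix_spawn() family looks like in use, with a file action standing in for the descriptor juggling normally done between fork() and exec(); the command and file name are arbitrary examples:

    /* Sketch of posix_spawn() with file actions: run "ls -l" with its
     * stdout redirected to a file, without an explicit fork()/exec(). */
    #include <fcntl.h>
    #include <spawn.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>

    extern char **environ;

    int main(void)
    {
        posix_spawn_file_actions_t fa;
        char *args[] = { "ls", "-l", NULL };
        pid_t pid;
        int err;

        posix_spawn_file_actions_init(&fa);
        /* in the child: open listing.txt as fd 1 (stdout) */
        posix_spawn_file_actions_addopen(&fa, 1, "listing.txt",
                                         O_WRONLY | O_CREAT | O_TRUNC, 0644);

        err = posix_spawnp(&pid, "ls", &fa, NULL, args, environ);
        posix_spawn_file_actions_destroy(&fa);
        if (err != 0) {
            fprintf(stderr, "posix_spawnp: %s\n", strerror(err));
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }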

Taming the OOM killer

Posted Feb 5, 2009 17:40 UTC (Thu) by martinfick (subscriber, #4455) [Link]

While the fork/exec example is a valid example of why overcommit exists, it is not the only reason. Any attempt to propose a solution for this case only is a waste of time. The reality is that many fork-only (not exec) situations allow processes to share huge amounts of memory through COW also. Eliminating overcommit would make this impossible in many cases. Of course, if you don't like overcommit, turn it off. But without it, many things simply aren't possible, which is probably why a large portion of people seem to like its benefits.

Taming the OOM killer

Posted Feb 5, 2009 22:22 UTC (Thu) by epa (subscriber, #39769) [Link]

There may be some middle ground between eliminating overcommit altogether and continuing with the status quo. For example, fork() calls could continue to pretend that memory is available, on the assumption that the child process will soon exec() something or otherwise reduce its footprint; but if physical memory is tight then it's okay for the kernel to refuse memory allocation requests from a process (which is then passed up through the C library as a null return from malloc()). This might be more reliable than always having malloc() succeed, whether the OOM killer is turned on or not.

For those making embedded or high-availability systems who want to try harder and turn off overcommit altogether, fork-then-exec could be replaced in user space with posix_spawn or vfork-then-exec or similar.

Taming the OOM killer

Posted Feb 5, 2009 23:11 UTC (Thu) by martinfick (subscriber, #4455) [Link]

The OOM killer does not come into play when malloc is called. If malloc is called when there is no memory, there is no need to kill any processes; malloc simply fails and returns the appropriate error code.

The OOM killer kicks in when memory has been overcommitted through COW. Two processes are sharing the same memory region and one of them decides to write to a shared COW page, requiring the page to be copied. There is no memory allocation happening, simply a write to a memory page which is already allocated to a process (two of them, actually).

Again, the fork then exec shortcut is not really the big deal, it is processes that fork and do not exec and then eventually write to a COW page.

Taming the OOM killer

Posted Feb 6, 2009 0:51 UTC (Fri) by nix (subscriber, #2304) [Link]

The OOM killer comes into play if memory is requested and is not
available, and the request is not failable. Several such allocations
spring to mind:

- when a process stack is grown

- when a fork()ed process COWs

- when a page in a private file-backed mapping is written to for the
first time

- when a nonswappable kernel resource needs to be allocated (other than a
cache) which cannot be discarded when memory pressure is high

- if overcommit_memory is set, if a page from the heap or an anonymous
mmap() is requested for the first time

So the OOM killer is *always* needed, even if overcommitting were disabled
as much as possible. (You can overcommit disk space, too: thanks to sparse
files, you can run out of disk space writing to the middle of a file. With
some filesystems, e.g. NTFS, you can run out of disk space by renaming a
file, triggering a tree rebalance and node allocation when there's not
enough disk space left. NTFS maintains an emergency pool for this
situation, but it's only so large...)

Taming the OOM killer

Posted Feb 6, 2009 0:58 UTC (Fri) by martinfick (subscriber, #4455) [Link]

Why when a process' stack is grown? In this case the process should fail (die?) just like when malloc would fail, but there should be no reason to upset other processes in the system!

Taming the OOM killer

Posted Feb 6, 2009 1:26 UTC (Fri) by dlang (guest, #313) [Link]

with malloc you can check the return code to see if it failed or not and handle the error

how would you propose that programmers handle an error when they allocate a variable? (which is one way to grow the stack)

Taming the OOM killer

Posted Feb 6, 2009 1:38 UTC (Fri) by brouhaha (subscriber, #1698) [Link]

The process should get a segfault or equivalent signal. If there is a handler for the signal, but the handler can't be invoked due to lack of stack space, the process should be killed. If the mechanism to signal the process in a potential out-of-stack situation is too complex to be practically implemented in the kernel, then the process should be killed without attempting to signal it.

At no point should the OOM killer become involved, because there is no reason to propagate the error outside the process (other than by another process noticing that the process in question has exited). A principle of reliable systems is confining the consequences of an error to the minimum area necessary, and killing some other randomly-selected (or even heuristically-selected) process violates that principle.

Taming the OOM killer

Posted Feb 6, 2009 5:26 UTC (Fri) by njs (guest, #40338) [Link]

> At no point should the OOM killer become involved, because there is no reason to propagate the error outside the process (other than by another process noticing that the process in question has exited).

This makes sense on the surface, but memory being a shared resource means that everything is horribly coupled no matter what and life isn't that simple.

You have 2 gigs of memory.

Process 1 and process 2 are each using 50 megabytes of RAM.

Then Process 1 allocates another 1948 megabytes.

Then Process 2 attempts to grow its stack by 1 page, but there is no memory.

The reason the OOM exists is that it makes no sense to blame Process 2 for this situation. And if you did blame Process 2, then the system would still be hosed and a few minutes later you'd have to kill off Process 3, Process 4, etc., until you got lucky and hit Process 1.

Taming the OOM killer

Posted Feb 7, 2009 17:52 UTC (Sat) by oak (guest, #2786) [Link]

Stack is actually yet another reason for overcommit. Current Linux
kernels map by default 8MB of stack for each thread (and usually threads
use only something like 4-8KB of that). Without overcommit, process with
16 threads couldn't run in 128MB RAM, unless you change this limit. I
think you can change it only from kernel source and it applies to all
processes/threads in the system?

Taming the OOM killer

Posted Feb 8, 2009 15:26 UTC (Sun) by nix (subscriber, #2304) [Link]

setrlimit (RLIMIT_STACK,...);
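
A hedged sketch of the options here: the soft RLIMIT_STACK value in effect when a program starts is generally what glibc uses as its default thread stack size, and pthread_attr_setstacksize() overrides it per thread; the 256KB figure below is an arbitrary example:

    /* Sketch: request a small per-thread stack explicitly instead of the
     * 8MB default reservation (build with -pthread). */
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        (void)arg;                  /* thread work goes here */
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_t tid;

        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 256 * 1024);   /* 256KB, not 8MB */

        if (pthread_create(&tid, &attr, worker, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }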

Taming the OOM killer

Posted Feb 12, 2009 14:32 UTC (Thu) by epa (subscriber, #39769) [Link]

The OOM killer does not come into play when malloc is called. If malloc is called when there is no memory, there is no need to kill any processes; malloc simply fails and returns the appropriate error code.
Ah, I didn't realize that. From the way people talk it sounded as though malloc() would always succeed and then the process would just blow up trying to use the memory. If the only memory overcommit is COW due to fork() then it's not so bad (though I still think some kind of vfork() would be a more hygienic practice).

Taming the OOM killer

Posted Feb 12, 2009 15:23 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

malloc() isn't implemented by the kernel, you might do better to listen to someone who knows what they're talking about. :/

Taming the OOM killer

Posted Feb 12, 2009 16:06 UTC (Thu) by dlang (guest, #313) [Link]

vfork tends to be strongly discouraged nowadays. it can be used safely, but it's easy to not use safely.

there are a growing number of such functions in C nowadays as people go back and figure out where programmers commonly get it wrong and provide functions that are much harder to misuse (the case that springs to mind are the string manipulation routines)

Taming the OOM killer

Posted Feb 5, 2009 22:49 UTC (Thu) by tbird20d (subscriber, #1901) [Link]

I'm still baffled as to why this is an issue at all. IMNSHO, the ability to overcommit memory should never have been created in the first place. If you need more memory, buy more memory, or create a larger swap partition or file.
This is not an option in embedded devices - particularly those that ship in the millions of units.

What user-space programs are allocating so much more memory than they actually need, anyhow?
The simple answer is: in a modern system, all of them.

Nearly all programs have a virtual memory footprint greater than they will actually use (at one time, or even during their entire lifetime). This is true even ignoring the fork-and-exec issue. For large systems, this discrepancy is easily ignored or worked around, but for low-resource systems, the difference becomes a major issue. The ability to overcommit memory is actually one of the reasons Linux is selected over traditional RTOSes for some embedded projects.

Taming the OOM killer

Posted Feb 7, 2009 21:37 UTC (Sat) by giraffedata (guest, #1954) [Link]

the ability to overcommit memory should never have been created in the first place.

I think you misunderstand the role overcommit plays in the issue. All it does is change which process gets arbitrarily killed when the unthinkable happens and the system exceeds the memory usage you planned for.

With overcommit, the OOM Killer uses a sophisticated algorithm to decide which process to kill. Without overcommit, the rule is simple: whatever next requests virtual memory (basically, malloc) dies. That process may be using very little memory, may be using the same amount it's used 100 times before without incident, and may be very important.

Oh, and with overcommit it's far less likely that anything at all will die.

There are systems (probably interactive, like the one you describe) where killing the next process to allocate memory is the least painful thing. There are plenty where it isn't.

Taming the OOM killer

Posted Feb 12, 2009 20:33 UTC (Thu) by xorbe (guest, #3165) [Link]

Talk about over-engineered. I hate these pseudo heuristic algorithms they keep stuffing into the scheduler or oom handler, because they always miss a bunch of corner cases. Forget "best", just make it simple and solid and UNDERSTANDABLE. Special mount for OOM conditions? Yikes!!

Taming the OOM killer

Posted May 16, 2017 19:26 UTC (Tue) by rrmhearts (guest, #115657) [Link]

When I read this article, I immediately thought of the United flight where that doctor was forcibly removed from the plane. He was seated, and then they announced that someone was being kicked off to make room for higher-priority passengers.

It's bad enough that overcommitting memory locks up your system for a few minutes, it's even worse that "sacrificial lambs" are being killed causing who knows what kind of havoc to your computing ventures.

Android has it right, at least warn processes about the possibility so that they can save their state and be prepared for their doom. Long story short, if you are going to have sacrificial lambs, at least tell them before they board the airplane and think they have the chunk of memory they need to use.

Taming the OOM killer

Posted May 17, 2017 14:43 UTC (Wed) by flussence (subscriber, #85566) [Link]

That's what the cgroups memory pressure notifier is for. It may not have existed at the time this article was written.


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds