
KS2012: memcg/mm: Improving kernel-memory accounting for memory cgroups


By Michael Kerrisk
September 17, 2012
2012 Kernel Summit

The 2012 Kernel Summit memcg/mm minisummit started with a session titled "Accounting other than user pages", but the slot was mainly dedicated to the topic of improved kernel-memory accounting for memory cgroups. Glauber Costa has been working on kernel-memory accounting for some time. He began with a basic overview of the motivations for memory cgroups and how they can be used to limit memory usage of a group of processes within a container by accounting for all the pages used by user space. Glauber would like to see better accounting of certain workloads that have low user-memory usage but high kernel-memory usage (see this LWN article for some background). The problem is that, in the current kernel implementation, a single memory control group can occupy a lot of kernel memory without being held responsible for it. This can produce global memory pressure that pushes other cgroups to swap or even triggers the OOM killer. The goal is thus to better track usage of kernel memory, in order to prevent memory cgroups from causing these kinds of problems.

Glauber mentioned earlier work he has done to track kernel memory used for TCP data. He is now turning his focus to other kinds of kernel memory that can possibly "explode". His overall goal is to allow the use of OpenVZ with upstream kernels, rather than patched kernels. A use case of particular interest is hosting providers that use OpenVZ-based containers and heavily overcommit resources. In such scenarios, the kernel can't trust containers to cooperate with the rest of the system in order to keep the system alive. Thus, accurate accounting of resource usage within containers is very important.

Glauber would like to account for general kernel memory usage, including usage of slab objects and kernel stack pages. To begin with, he is trying to merge support for tracking per-task kernel stack usage. (One reason for tracking kernel stack usage is that it could provide a mechanism to mitigate fork bombs and cap their effect inside a cgroup.) The rationale for starting with kernel stack usage is that it would be a reasonably non-invasive change (unlike tracking slab usage), but would still exercise all the memory-cgroup changes that would be required for later, larger memory-accounting changes; this smaller piece of work would also serve to show what the user interface (i.e., the configuration files in the cgroup directories) would look like.
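As a rough illustration of the idea, here is a minimal kernel-style sketch; memcg_charge_kmem() and memcg_uncharge_kmem() are invented helper names standing in for whatever the real patches provide, and the surrounding function mirrors the stack allocator in kernel/fork.c:

    /*
     * Illustrative sketch only: charge a new task's kernel stack to
     * its memory cgroup, failing the allocation if the cgroup is over
     * its kernel-memory limit. The memcg_*_kmem() helpers are assumed
     * names, not actual kernel API.
     */
    static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
                                                      int node)
    {
        struct page *page;

        /* Charge THREAD_SIZE bytes against the cgroup's kmem counter. */
        if (memcg_charge_kmem(tsk, THREAD_SIZE))
            return NULL;            /* cgroup is over its limit */

        page = alloc_pages_node(node, THREADINFO_GFP, THREAD_SIZE_ORDER);
        if (!page) {
            memcg_uncharge_kmem(tsk, THREAD_SIZE);
            return NULL;
        }
        return page_address(page);
    }

Under such a scheme, a fork bomb inside a cgroup starts failing at fork() once kernel stacks exhaust the cgroup's kernel-memory limit, rather than eating into global memory.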

Glauber went on to discuss the additional changes necessary to track usage of slab allocations. His description was quite detailed but, very broadly speaking, slab pages allocated for a cgroup would be for the exclusive use of that cgroup. In essence, every cgroup would get its own slab cache for a given object type, so that the pages allocated to a cache are always used exclusively by one memory cgroup. Slab metadata would be copied over to a cgroup-specific slab. Thus, for example, if a dentry was touched, this would lazily trigger the creation of a per-cgroup dentry cache for that cgroup. From that point on, all dentries allocated by that cgroup would come not from the global dentry cache, but rather from the cgroup-specific cache. A side effect is that each slab page is filled with objects from a single cgroup. The intention of the proposed changes is to prevent unrelated cgroups from pinning each other's memory—that is, to prevent the situation where one cgroup, B, forces an unrelated cgroup, A, to keep memory resident that A is being charged for. For example, if cgroups A and B have both allocated objects in a slab page they share, and A is forced to shrink because it has hit its limits, then it might free all of its slab objects but still fail to free the slab page it is being charged for.
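In rough pseudo-kernel-C, the allocation path might look like the sketch below; get_current_memcg(), memcg_cache_lookup(), and memcg_create_cache() are invented names standing in for whatever helpers the real patches define:

    /*
     * Illustrative sketch of per-cgroup slab caches: on allocation,
     * substitute the global cache with a per-cgroup copy that is
     * created lazily on first use. All memcg helpers here are
     * hypothetical names.
     */
    void *memcg_slab_alloc(struct kmem_cache *global_cache, gfp_t flags)
    {
        struct mem_cgroup *memcg = get_current_memcg();
        struct kmem_cache *cachep;

        if (!memcg)     /* root cgroup: use the global cache as before */
            return kmem_cache_alloc(global_cache, flags);

        /* Find this cgroup's private copy, creating it on first use. */
        cachep = memcg_cache_lookup(memcg, global_cache);
        if (!cachep)
            cachep = memcg_create_cache(memcg, global_cache);

        /*
         * Every page backing "cachep" holds objects from this cgroup
         * only, so once the cgroup frees its objects, the pages it is
         * being charged for can actually be released.
         */
        return kmem_cache_alloc(cachep ?: global_cache, flags);
    }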

Someone asked specifically about page-table pages, noting that it's easy to create an adverse workload that allocates many page tables while using very little user memory. Glauber felt that it would be easy to track these pages using a memory-cgroup-specific GFP (Get Free Page) flag.
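A sketch of what such an annotation might look like, using an invented __GFP_KMEMCG flag (both the name and the bit value here are illustrative):

    /*
     * Hypothetical GFP flag marking allocations to be charged to the
     * caller's memory cgroup; the name and bit value are illustrative.
     */
    #define __GFP_KMEMCG    ((__force gfp_t)0x400000u)

    /* Page-table pages would then simply be allocated with the flag: */
    #define PGALLOC_GFP     (GFP_KERNEL | __GFP_ZERO | __GFP_KMEMCG)

    pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
    {
        struct page *pte = alloc_pages(PGALLOC_GFP, 0);

        if (pte)
            pgtable_page_ctor(pte);
        return pte;
    }

The page allocator would check the flag and charge (or fail) the allocation against the current cgroup's kernel-memory counter, so the page-table code itself would need no change beyond passing the flag.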

Glauber's current proposal does selective accounting of kernel-memory allocations: only explicitly annotated allocations are tracked. Peter Zijlstra asked: why not account for every kernel memory allocation by default? Glauber pointed out that doing so would have a performance cost, and he wants to avoid a situation where the cost of cgroups is very high. In some respects, charging cgroups by default would be easier to deal with, but it would be too slow to be useful.

Some participants in the room asked that the group re-discuss kernel-plus-user memory accounting. This was discussed at the April 2012 LSF/MM meeting, but it was apparent that there was still some uncertainty as to whether it was the correct approach.

Roughly, the details are as follows. There is a page counter that tracks how many pages have been allocated for the exclusive use of the kernel (the kernel page counter, memory.kmem.usage_in_bytes) and a counter that counts both kernel and user page allocations (the kernel+user page counter, memory.usage_in_bytes). In conjunction with two corresponding limits that can be set by the system administrator, this implementation enables three different use cases (a configuration sketch follows the list):

  • memory.kmem.limit_in_bytes == unlimited: This configuration says that the user is not interested in kernel-memory accounting at all, so only user memory is limited and accounted. This is the default, and is what the memory controller has provided until now.

  • memory.kmem.limit_in_bytes < memory.limit_in_bytes: This configuration enables fine-grained control over how much kernel memory can be allocated. Depending on how much user memory is in use, the kernel-memory limit may be hit first, causing allocations to fail. Users of this setting should be aware that kernel-memory usage can differ between kernel releases.

  • memory.kmem.limit_in_bytes >= memory.limit_in_bytes: This configuration allows capping both the kernel and the user memory without any details about how much memory is used for each of them. The user is just interested in the sum of both.
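To make the three cases concrete, the sketch below writes limits into the memcg control files. It assumes the memory controller is mounted at /sys/fs/cgroup/memory, that a cgroup directory named "box" already exists, and that the kernel provides the kmem interface described above:

    /*
     * Configure the kmem/user limit combinations described above.
     * Assumes /sys/fs/cgroup/memory is the memory-controller mount
     * and that the cgroup "box" already exists.
     * Build with: gcc -o kmemconf kmemconf.c
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void set_limit(const char *file, const char *bytes)
    {
        char path[256];
        int fd;

        snprintf(path, sizeof(path), "/sys/fs/cgroup/memory/box/%s", file);
        fd = open(path, O_WRONLY);
        if (fd == -1 || write(fd, bytes, strlen(bytes)) == -1)
            perror(path);
        if (fd != -1)
            close(fd);
    }

    int main(void)
    {
        /* Overall (kernel+user) limit: 512 MB. */
        set_limit("memory.limit_in_bytes", "536870912");

        /*
         * Case 2: a kmem limit below the overall limit gives
         * fine-grained control over kernel memory (here 64 MB).
         * Leaving memory.kmem.limit_in_bytes untouched gives case 1
         * (kmem unlimited); setting it to 512 MB or more gives
         * case 3 (only the sum of kernel and user memory is capped).
         */
        set_limit("memory.kmem.limit_in_bytes", "67108864");

        return 0;
    }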

In Glauber's opinion, having the two integrated counters as described above is simpler than having separate exclusive user and kernel counters. However, Michal Hocko pointed out that this makes OOM decisions difficult (i.e., the kernel may make wrong decisions, such that a program that is not using a lot of memory can be killed because of heavy kernel memory usage). Glauber defended his approach on the basis that the current OOM-killing decisions have the same problem, and if his approach makes these problems easier to trigger, then they are more likely to be found and fixed.

There were no fundamental objections to Glauber's patch series to add support for accounting for pages allocated for the stack. However, three things were requested:

  1. Documentation of the use cases and how the counters are to be interpreted by the system administrator. It will be up to other interested users to object if their use cases are not addressed.

  2. Incorporation of feedback on the patches, including issues with naming and minor inconsistencies in the API. The patch series still needs "a bit of love".

  3. A demonstration of the behavior in a NUMA environment, in order to determine whether the implementation suffers from CPU cache-line bouncing. If a single counter is shared by an entire cgroup and processes in that cgroup run on multiple nodes, then the cache line holding the counter has to "bounce" between nodes on each write to keep the data coherent. Potentially, every kernel allocation updates a counter shared between the nodes, which would scale very badly; Glauber was therefore asked to test in advance how kernel performance is affected when processes run on multiple nodes. NUMA itself is not the key factor here, but a NUMA environment makes the effects of cache-line bouncing much more visible, so it is a good test; the sketch below shows the effect in miniature. (Glauber has subsequently posted a round of benchmarks.)
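Cache-line bouncing is easy to demonstrate even on an ordinary multicore machine. The standalone program below (an illustration, not from the discussion) pins one thread per CPU and has each thread atomically increment either a single shared counter or its own padded counter; timing the two modes typically shows the shared version running several times slower, because the counter's cache line must move between cores on every update. On a NUMA machine the gap widens further, since the line has to travel across the node interconnect:

    /*
     * Cache-line bouncing demo. Assumes at least NTHREADS CPUs.
     * Build: gcc -O2 -pthread bounce.c -o bounce
     * Run:   time ./bounce shared    vs.    time ./bounce private
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <string.h>

    #define NTHREADS   4
    #define ITERATIONS 50000000L

    static long shared_counter;         /* one cache line, all threads */

    /* Per-thread counters, padded out to separate cache lines so that
     * the "private" mode is free of false sharing. */
    static struct { long v; char pad[56]; } private_counter[NTHREADS];

    static int use_shared;

    static void *worker(void *arg)
    {
        long id = (long)arg, i;
        cpu_set_t set;

        /* Pin each thread to its own CPU so updates really cross cores. */
        CPU_ZERO(&set);
        CPU_SET(id, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        for (i = 0; i < ITERATIONS; i++) {
            if (use_shared)
                __sync_fetch_and_add(&shared_counter, 1);
            else
                __sync_fetch_and_add(&private_counter[id].v, 1);
        }
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        pthread_t tid[NTHREADS];
        long i;

        use_shared = argc > 1 && strcmp(argv[1], "shared") == 0;
        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }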

Next: Kernel-memory shrinking

Index entries for this article
Kernel/Control groups
Kernel/Memory management/Control groups



KS2012: memcg/mm: Improving kernel-memory accounting for memory cgroups

Posted Sep 18, 2012 19:43 UTC (Tue) by vomlehn (guest, #45588)

Silly kernel people (including myself)! We keep looking at things from the kernel standpoint, and saying things like:

"memory.kmem.limit_in_bytes < memory.limit_in_bytes: ... Users of this setting should be aware that size of the kernel memory usage could differ between kernel releases."

Users don't manage kernel memory usage, they manage user space application usage. The user-visible knob has to be over something the user can control, so it's really worth considering whether this should be memory.umem.limit_in_bytes, with the kernel able to use the rest of memory. If we do that, the user doesn't have to change anything when upgrading the kernel.


Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds