Namespaces in operation, part 5: User namespaces

By Michael Kerrisk
February 27, 2013

Continuing our ongoing series on namespaces, this article looks more closely at user namespaces, a feature whose implementation was (largely) completed in Linux 3.8. (The remaining work consists of changes for XFS and a number of other filesystems; the latter has already been merged for 3.9.) User namespaces allow per-namespace mappings of user and group IDs. This means that a process's user and group IDs inside a user namespace can be different from its IDs outside of the namespace. Most notably, a process can have a nonzero user ID outside a namespace while at the same time having a user ID of zero inside the namespace; in other words, the process is unprivileged for operations outside the user namespace but has root privileges inside the namespace.

Creating user namespaces

User namespaces are created by specifying the CLONE_NEWUSER flag when calling clone() or unshare(). Starting with Linux 3.8 (and unlike the flags used for creating other types of namespaces), no privilege is required to create a user namespace. In our examples below, all of the user namespaces are created using the unprivileged user ID 1000.

To begin investigating user namespaces, we'll make use of a small program, demo_userns.c, that creates a child in a new user namespace. The child simply displays its effective user and group IDs as well as its capabilities. Running this program as an unprivileged user produces the following result:

    $ id -u          # Display effective user ID of shell process
    1000
    $ id -g          # Effective group ID of shell
    1000
    $ ./demo_userns 
    eUID = 65534;  eGID = 65534;  capabilities: =ep

The output from this program shows some interesting details. One of these is the capabilities that were assigned to the child process. The string "=ep" (produced by the library function cap_to_text(), which converts capability sets to a textual representation) indicates that the child has a full set of permitted and effective capabilities, even though the program was run from an unprivileged account. When a user namespace is created, the first process in the namespace is granted a full set of capabilities in the namespace. This allows that process to perform any initializations that are necessary in the namespace before other processes are created in the namespace.

The second point of interest is the user and group IDs of the child process. As noted above, a process's user and group IDs inside and outside a user namespace can be different. However, there needs to be a mapping from the user IDs inside a user namespace to a corresponding set of user IDs outside the namespace; the same is true of group IDs. This allows the system to perform the appropriate permission checks when a process in a user namespace performs operations that affect the wider system (e.g., sending a signal to a process outside the namespace or accessing a file).

System calls that return process user and group IDs—for example, getuid() and getgid()—always return credentials as they appear inside the user namespace in which the calling process resides. If a user ID has no mapping inside the namespace, then system calls that return user IDs return the value defined in the file /proc/sys/kernel/overflowuid, which on a standard system defaults to the value 65534. Initially, a user namespace has no user ID mapping, so all user IDs inside the namespace map to this value. Likewise, a new user namespace has no mappings for group IDs, and all unmapped group IDs map to /proc/sys/kernel/overflowgid (which has the same default as overflowuid).
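
The lookup of the overflow value can be sketched in C. This helper is only an illustration and is not part of demo_userns.c: the name read_overflow_id() and the decision to fall back to 65534 when the file cannot be read or parsed are assumptions made for this sketch.

```c
#include <stdio.h>

/* Default cited in the article for both overflowuid and overflowgid. */
#define DEFAULT_OVERFLOW_ID 65534L

/*
 * Read the overflow ID from a file such as /proc/sys/kernel/overflowuid.
 * Hypothetical helper: falls back to DEFAULT_OVERFLOW_ID if the file
 * cannot be opened or does not contain a number.
 */
long read_overflow_id(const char *path)
{
    long value = DEFAULT_OVERFLOW_ID;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return DEFAULT_OVERFLOW_ID;

    if (fscanf(fp, "%ld", &value) != 1)
        value = DEFAULT_OVERFLOW_ID;

    fclose(fp);
    return value;
}
```

A caller would pass "/proc/sys/kernel/overflowuid" or "/proc/sys/kernel/overflowgid" as the path.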

There is one other important point worth noting that can't be gleaned from the output above. Although the new process has a full set of capabilities in the new user namespace, it has no capabilities in the parent namespace. This is true regardless of the credentials and capabilities of the process that calls clone(). In particular, even if root employs clone(CLONE_NEWUSER), the resulting child process will have no capabilities in the parent namespace.

One final point to be made about the creation of user namespaces is that namespaces can be nested; that is, each user namespace (other than the initial user namespace) has a parent user namespace, and can have zero or more child user namespaces. The parent of a user namespace is the user namespace of the process that creates the user namespace via a call to clone() or unshare() with the CLONE_NEWUSER flag. The significance of the parent-child relationship between user namespaces will become clearer in the remainder of this article.

Mapping user and group IDs

Normally, one of the first steps after creating a new user namespace is to define the mappings used for the user and group IDs of the processes that will be created in that namespace. This is done by writing mapping information to the /proc/PID/uid_map and /proc/PID/gid_map files corresponding to one of the processes in the user namespace. (Initially, these two files are empty.) This information consists of one or more lines, each of which contains three values separated by white space:

    ID-inside-ns   ID-outside-ns   length

Together, the ID-inside-ns and length values define a range of IDs inside the namespace that are to be mapped to an ID range of the same length outside the namespace. The ID-outside-ns value specifies the starting point of the outside range. How ID-outside-ns is interpreted depends on whether the process opening the file /proc/PID/uid_map (or /proc/PID/gid_map) is in the same user namespace as the process PID:

  • If the two processes are in the same namespace, then ID-outside-ns is interpreted as a user ID (group ID) in the parent user namespace of the process PID. The common case here is that a process is writing to its own mapping file (/proc/self/uid_map or /proc/self/gid_map).
  • If the two processes are in different namespaces, then ID-outside-ns is interpreted as a user ID (group ID) in the user namespace of the process opening /proc/PID/uid_map (/proc/PID/gid_map). The writing process is then defining the mapping relative to its own user namespace.
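
The range arithmetic implied by the three-value records can be sketched as follows. This is a simplified user-space model, not kernel code: the struct and function names are hypothetical, and the value 65534 stands in for the default overflow ID. The sketch reports an inside ID with no mapping simply as "not found", since what the kernel does in that direction is not covered here.

```c
#include <stdbool.h>
#include <stddef.h>

/* One record of a uid_map or gid_map file (hypothetical struct name). */
struct id_map_entry {
    unsigned long inside;   /* first ID of the range inside the namespace */
    unsigned long outside;  /* first ID of the range outside the namespace */
    unsigned long length;   /* number of consecutive IDs in the range */
};

/* Assumed overflow value; 65534 is the default described in the article. */
#define OVERFLOW_ID 65534UL

/* Translate an outside ID to the ID seen inside the namespace.
   IDs that fall in no range are reported as the overflow ID. */
unsigned long outside_to_inside(const struct id_map_entry *map, size_t n,
                                unsigned long outside_id)
{
    for (size_t i = 0; i < n; i++) {
        if (outside_id >= map[i].outside &&
            outside_id - map[i].outside < map[i].length)
            return map[i].inside + (outside_id - map[i].outside);
    }
    return OVERFLOW_ID;
}

/* Translate an inside ID to the corresponding outside ID.
   Returns false (leaving *outside_id untouched) if no range covers it. */
bool inside_to_outside(const struct id_map_entry *map, size_t n,
                       unsigned long inside_id, unsigned long *outside_id)
{
    for (size_t i = 0; i < n; i++) {
        if (inside_id >= map[i].inside &&
            inside_id - map[i].inside < map[i].length) {
            *outside_id = map[i].outside + (inside_id - map[i].inside);
            return true;
        }
    }
    return false;
}
```

With the single record "0 1000 1", outside ID 1000 translates to inside ID 0, and any other outside ID translates to 65534.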

Suppose that we once more invoke our demo_userns program, but this time with a single command-line argument (any string). This causes the program to loop, continuously displaying credentials and capabilities every few seconds:

    $ ./demo_userns x
    eUID = 65534;  eGID = 65534;  capabilities: =ep
    eUID = 65534;  eGID = 65534;  capabilities: =ep

Now we switch to another terminal window—to a shell process running in another namespace (namely, the parent user namespace of the process running demo_userns) and create a user ID mapping for the child process in the new user namespace created by demo_userns:

    $ ps -C demo_userns -o 'pid uid comm'      # Determine PID of clone child
      PID   UID COMMAND 
     4712  1000 demo_userns                    # This is the parent
     4713  1000 demo_userns                    # Child in a new user namespace
    $ echo '0 1000 1' > /proc/4713/uid_map

If we return to the window running demo_userns, we now see:

    eUID = 0;  eGID = 65534;  capabilities: =ep

In other words, the user ID 1000 in the parent user namespace (which was formerly mapped to 65534) has been mapped to user ID 0 in the user namespace created by demo_userns. From this point, all operations within the new user namespace that deal with this user ID will see the number 0, while corresponding operations in the parent user namespace will see the same process as having user ID 1000.

We can likewise create a mapping for group IDs in the new user namespace. Switching to another terminal window, we create a mapping for the single group ID 1000 in the parent user namespace to the group ID 0 in the new user namespace:

    $ echo '0 1000 1' > /proc/4713/gid_map

Switching back to the window running demo_userns, we see that change reflected in the display of the effective group ID:

    eUID = 0;  eGID = 0;  capabilities: =ep

Rules for writing to mapping files

There are a number of rules governing writing to uid_map files; analogous rules apply for writing to gid_map files. The most important of these rules are as follows.

Defining a mapping is a one-time operation per namespace: we can perform only a single write (that may contain multiple newline-delimited records) to a uid_map file of exactly one of the processes in the user namespace. Furthermore, the number of lines that may be written to the file is currently limited to five (an arbitrary limit that may be increased in the future).
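
Because the mapping must arrive in a single write, a program that builds its map from several records needs to assemble them into one buffer first. The following sketch shows one way to do that; the function names are hypothetical, and the five-line check simply mirrors the limit described above, which may differ on other kernel versions.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Line limit described in the article for Linux 3.8 (may change). */
#define MAX_MAP_LINES 5

/* Count newline-delimited records; a final unterminated line also counts. */
static size_t count_records(const char *records)
{
    size_t len = strlen(records);
    size_t lines = 0;

    for (size_t i = 0; i < len; i++)
        if (records[i] == '\n')
            lines++;
    if (len > 0 && records[len - 1] != '\n')
        lines++;
    return lines;
}

/*
 * Write all records to the map file with exactly one write() call.
 * Returns 0 on success, -1 on any error (including too many lines).
 */
int write_map_once(const char *path, const char *records)
{
    size_t len = strlen(records);
    size_t lines = count_records(records);

    if (lines == 0 || lines > MAX_MAP_LINES)
        return -1;

    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    ssize_t written = write(fd, records, len);
    int result = (written == (ssize_t) len) ? 0 : -1;

    if (close(fd) == -1)
        result = -1;
    return result;
}
```

A call such as write_map_once("/proc/4713/uid_map", "0 1000 1\n") would correspond to the echo command used earlier, with 4713 being the example PID from that session.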

The /proc/PID/uid_map file is owned by the user ID that created the namespace, and is writeable only by that user (or a privileged user). In addition, all of the following requirements must be met:

  • The writing process must have the CAP_SETUID (CAP_SETGID for gid_map) capability in the user namespace of the process PID.
  • Regardless of capabilities, the writing process must be in either the user namespace of the process PID or inside the (immediate) parent user namespace of the process PID.
  • One of the following must be true:
    • The data written to uid_map (gid_map) consists of a single line that maps (only) the writing process's effective user ID (group ID) in the parent user namespace to a user ID (group ID) in the user namespace. This rule allows the initial process in a user namespace (i.e., the child created by clone()) to write a mapping for its own user ID (group ID).
    • The process has the CAP_SETUID (CAP_SETGID for gid_map) capability in the parent user namespace. Such a process can define mappings to arbitrary user IDs (group IDs) in the parent user namespace. As we noted earlier, the initial process in a new user namespace has no capabilities in the parent namespace. Thus, only a process in the parent namespace can write a mapping that maps arbitrary IDs in the parent user namespace.
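
The requirements in the list above can be restated as a small predicate. This is only a paraphrase of the rules as given in this article, written as a user-space model with hypothetical names; it is not how the kernel structures its checks, and it ignores the separate file-ownership requirement.

```c
#include <stdbool.h>

/* Facts about a process attempting to write a uid_map (or gid_map). */
struct map_write_request {
    bool has_cap_in_target_ns;  /* CAP_SETUID (CAP_SETGID) in PID's namespace */
    bool in_target_ns;          /* writer is in the user namespace of PID */
    bool in_parent_ns;          /* writer is in the immediate parent namespace */
    bool has_cap_in_parent_ns;  /* CAP_SETUID (CAP_SETGID) in the parent namespace */
    unsigned line_count;        /* number of lines being written */
    bool maps_only_own_id;      /* data maps only the writer's own effective ID */
};

/* Return true if the write satisfies the rules listed in the article. */
bool map_write_allowed(const struct map_write_request *r)
{
    /* Rule 1: capability in the namespace whose map is being written. */
    if (!r->has_cap_in_target_ns)
        return false;

    /* Rule 2: writer must be in that namespace or its immediate parent. */
    if (!r->in_target_ns && !r->in_parent_ns)
        return false;

    /* Rule 3a: a single line mapping only the writer's own effective ID. */
    if (r->line_count == 1 && r->maps_only_own_id)
        return true;

    /* Rule 3b: otherwise the capability is needed in the parent namespace. */
    return r->has_cap_in_parent_ns;
}
```

For example, the initial process of a new user namespace passes the check when it writes one line mapping its own user ID, but fails it when it tries to map some other ID, because it has no capabilities in the parent namespace.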

Capabilities, execve(), and user ID 0

In an earlier article in this series, we developed the ns_child_exec program. This program uses clone() to create a child process in new namespaces specified by command-line options and then executes a shell command in the child process.

Suppose that we use this program to execute a shell in a new user namespace and then within that shell we try to define the user ID mapping for the new user namespace. In doing so, we run into a problem:

    $ ./ns_child_exec -U  bash
    $ echo '0 1000 1' > /proc/$$/uid_map       # $$ is the PID of the shell
    bash: echo: write error: Operation not permitted

This error occurs because the shell has no capabilities inside the new user namespace, as can be seen from the following commands:

    $ id -u         # Verify that user ID and group ID are not mapped
    65534
    $ id -g
    65534
    $ cat /proc/$$/status | egrep 'Cap(Inh|Prm|Eff)'
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000

The problem occurred at the execve() call that executed the bash shell: when a process with non-zero user IDs performs an execve(), the process's capability sets are cleared. (The capabilities(7) manual page details the treatment of capabilities during an execve().)

To avoid this problem, it is necessary to create a user ID mapping inside the user namespace before performing the execve(). This is not possible with the ns_child_exec program; we need a slightly enhanced version of the program that does allow this.

The userns_child_exec.c program performs the same task as the ns_child_exec program, and has the same command-line interface, except that it allows two additional command-line options, -M and -G. These options accept string arguments that are used to define user and group ID maps for the new user namespace. For example, the following command maps both user ID 1000 and group ID 1000 to 0 in the new user namespace:

    $ ./userns_child_exec -U -M '0 1000 1' -G '0 1000 1' bash

This time, updating the mapping files succeeds, and we see that the shell has the expected user ID, group ID, and capabilities:

    $ id -u
    0
    $ id -g
    0
    $ cat /proc/$$/status | egrep 'Cap(Inh|Prm|Eff)'
    CapInh: 0000000000000000
    CapPrm: 0000001fffffffff
    CapEff: 0000001fffffffff
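
The hexadecimal masks in /proc/PID/status can be decoded with a few lines of C. In this sketch the function names are hypothetical, and the capability numbers 6 (CAP_SETGID) and 7 (CAP_SETUID) are copied from <linux/capability.h> rather than included from it, so that the example does not depend on kernel headers.

```c
#include <stdint.h>
#include <stdlib.h>

/* Capability numbers as defined in <linux/capability.h>. */
enum { CAP_NR_SETGID = 6, CAP_NR_SETUID = 7 };

/* Parse a mask such as "0000001fffffffff" from a Cap* line. */
uint64_t parse_cap_mask(const char *hex)
{
    return (uint64_t) strtoull(hex, NULL, 16);
}

/* Return 1 if capability number cap is set in the mask, otherwise 0. */
int has_capability(uint64_t mask, int cap)
{
    if (cap < 0 || cap > 63)
        return 0;
    return (int) ((mask >> cap) & 1);
}

/* Count how many capabilities are set in the mask. */
int count_capabilities(uint64_t mask)
{
    int count = 0;

    while (mask != 0) {
        mask &= mask - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

Applied to the mask 0000001fffffffff shown above, count_capabilities() yields 37, and both has_capability(mask, CAP_NR_SETUID) and has_capability(mask, CAP_NR_SETGID) return 1.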

There are some subtleties to the implementation of the userns_child_exec program. First, either the parent process (i.e., the caller of clone()) or the new child process could update the user ID and group ID maps of the new user namespace. However, following the rules above, the only kind of mapping that the child process could define would be one that maps just its own effective user ID. If we want to define arbitrary user and group ID mappings in the child, then that must be done by the parent process. Furthermore, the parent process must have suitable capabilities, namely CAP_SETUID, CAP_SETGID, and (to ensure that the parent has the permissions needed to open the mapping files) CAP_DAC_OVERRIDE.

Second, the parent must ensure that it updates the mapping files before the child calls execve() (otherwise we have exactly the problem described above, where the child will lose capabilities during the execve()). To do this, the two processes employ a pipe to ensure the required synchronization; comments in the program source code give full details.
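
The pipe technique can be sketched in isolation. This is not the code of userns_child_exec.c: for simplicity it uses fork() instead of clone() and creates no namespace, and the function name and the parent_setup callback are hypothetical. The point is only the ordering: the child blocks in read() until the parent closes the write end of the pipe.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that waits until the parent has finished its setup.
 * Returns the child's exit status (0 if the child saw end-of-file on
 * the pipe, as expected), or -1 on error.
 */
int run_with_pipe_sync(void (*parent_setup)(pid_t child))
{
    int pipe_fd[2];

    if (pipe(pipe_fd) == -1)
        return -1;

    pid_t child = fork();
    if (child == -1) {
        close(pipe_fd[0]);
        close(pipe_fd[1]);
        return -1;
    }

    if (child == 0) {
        char ch;

        /* Close our copy of the write end, so that end-of-file is
           seen once the parent closes its copy. */
        close(pipe_fd[1]);

        /* Block here until the parent closes the write end. */
        ssize_t n = read(pipe_fd[0], &ch, 1);

        /* A real program would now call execve(). */
        _exit(n == 0 ? 0 : 1);
    }

    close(pipe_fd[0]);              /* parent never reads */

    if (parent_setup != NULL)
        parent_setup(child);        /* e.g., write uid_map and gid_map */

    close(pipe_fd[1]);              /* releases the child */

    int status;
    if (waitpid(child, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Closing a file descriptor is a convenient signal here because it needs no data to be exchanged: the child's read() returns 0 as soon as no process holds the write end open.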

Viewing user and group ID mappings

The examples so far showed the use of /proc/PID/uid_map and /proc/PID/gid_map files for defining a mapping. These files can also be used to view the mappings governing a process. As when writing to these files, the second (ID-outside-ns) value is interpreted according to which process is opening the file. If the process opening the file is in the same user namespace as the process PID, then ID-outside-ns is defined with respect to the parent user namespace. If the process opening the file is in a different user namespace, then ID-outside-ns is defined with respect to the user namespace of the process opening the file.

We can illustrate this by creating a couple of user namespaces running shells, and examining the uid_map files of the processes in the namespaces. We begin by creating a new user namespace with a process running a shell:

    $ id -u            # Display effective user ID
    1000
    $ ./userns_child_exec -U -M '0 1000 1' -G '0 1000 1' bash
    $ echo $$          # Show shell's PID for later reference
    2465
    $ cat /proc/2465/uid_map
             0       1000          1
    $ id -u            # Mapping gives this process an effective user ID of 0
    0

Now suppose we switch to another terminal window and create a sibling user namespace that employs different user and group ID mappings:

    $ ./userns_child_exec -U -M '200 1000 1' -G '200 1000 1' bash
    $ cat /proc/self/uid_map
           200       1000          1
    $ id -u            # Mapping gives this process an effective user ID of 200
    200
    $ echo $$          # Show shell's PID for later reference
    2535

Continuing in the second terminal window, which is running in the second user namespace, we view the user ID mapping of the process in the other user namespace:

    $ cat /proc/2465/uid_map
             0        200          1

The output of this command shows that user ID 0 in the other user namespace maps to user ID 200 in this namespace. Note that the same command produced different output when executed in the other user namespace, because the kernel generates the ID-outside-ns value according to the user namespace of the process that is reading from the file.

If we switch back to the first terminal window, and display the user ID mapping file for the process in the second user namespace, we see the converse mapping:

    $ cat /proc/2535/uid_map
           200          0          1

Again, the output here is different from the same command when executed in the second user namespace, because the ID-outside-ns value is generated according to the user namespace of the process that is reading from the file. Of course, in the initial namespace, user ID 0 in the first namespace and user ID 200 in the second namespace both map to user ID 1000. We can verify this by executing the following commands in a third shell window inside the initial user namespace:

    $ cat /proc/2465/uid_map
             0       1000          1
    $ cat /proc/2535/uid_map
           200       1000          1
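
The differing views can be modeled by translating the parent-namespace ID of one mapping through the mapping of the reader's namespace. The sketch below is a simplified model under stated assumptions: both namespaces are siblings with the same parent, only the first ID of a record is translated, and the struct and function names are hypothetical. It makes no claim about what the kernel displays when the reader's namespace has no mapping for the ID.

```c
#include <stdbool.h>
#include <stddef.h>

/* One record of a uid_map file, expressed relative to the parent
   user namespace (hypothetical struct name). */
struct id_map_entry {
    unsigned long inside;
    unsigned long outside;
    unsigned long length;
};

/* Translate an ID in the parent namespace to an ID inside a child
   namespace described by map; returns false if it is unmapped. */
static bool parent_to_inside(const struct id_map_entry *map, size_t n,
                             unsigned long parent_id,
                             unsigned long *inside_id)
{
    for (size_t i = 0; i < n; i++) {
        if (parent_id >= map[i].outside &&
            parent_id - map[i].outside < map[i].length) {
            *inside_id = map[i].inside + (parent_id - map[i].outside);
            return true;
        }
    }
    return false;
}

/*
 * Compute the ID-outside-ns value that a reader in a sibling namespace
 * would see for one record of the target namespace's map.
 */
bool displayed_outside_id(const struct id_map_entry *target,
                          const struct id_map_entry *reader_map,
                          size_t reader_n, unsigned long *shown)
{
    return parent_to_inside(reader_map, reader_n, target->outside, shown);
}
```

With the two mappings used in this section ("0 1000 1" and "200 1000 1"), the model gives 200 when the second namespace reads the first namespace's record, and 0 in the opposite direction, which matches the output shown above.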

Concluding remarks

In this article, we've looked at the basics of user namespaces: creating a user namespace, using user and group ID map files, and the interaction of user namespaces and capabilities.

As we noted in an earlier article, one of the motivations for implementing user namespaces is to give non-root applications access to functionality that was formerly limited to the root user. In traditional UNIX systems, various pieces of functionality have been limited to the root user in order to prevent unprivileged users from manipulating the runtime environment of privileged programs, which could affect the operation of those programs in unexpected or undesirable ways.

A user namespace allows a process (that is unprivileged outside the namespace) to have root privileges while at the same time limiting the scope of that privilege to the namespace, with the result that the process cannot manipulate the runtime environment of privileged programs in the wider system. In order to use these root privileges meaningfully, we need to combine user namespaces with other types of namespaces—that topic will form the subject of the next article in this series.

Namespaces in operation, part 5: User namespaces

Posted Feb 27, 2013 19:50 UTC (Wed) by nix (subscriber, #2304)

Simply excellent documentation. Would that all docs were like this.

Namespaces in operation, part 5: User namespaces

Posted Feb 27, 2013 19:59 UTC (Wed) by einstein (guest, #2052)

It looks like we're oh so slowly and painfully discovering and re-inventing openvz a little bit at a time. Hopefully we'll get there before too many more years.

Namespaces in operation, part 5: User namespaces

Posted Feb 27, 2013 20:52 UTC (Wed) by SEJeff (guest, #51588)

@einstein: Parallels (virtuozzo/openvz authors) have been some of the primary contributors to the upstream namespace support in the kernel. While I cringe at seeing the 1Mb+ patch that openvz is, I've got to give them props for going about things the right (and very long) way of getting small bits upstream at a time.

Namespaces in operation, part 5: User namespaces

Posted Feb 27, 2013 22:34 UTC (Wed) by mabshoff (guest, #86444)

Well, I am not quite sure where the 1 MB patch figure comes from, but all the RHEL 6.x based patches weigh in at 27 MB unpacked. Note that this is 2.6.32 vanilla -> RHEL 6.x+ovz, so I do assume that the vast majority of that diff is the RHEL 6.x changes. Either way, as you mentioned a massive amount of code from the people working for Parallels has been merged, so I would be curious what the RHEL 7.0 diff will look like. I guess we will know in a couple months.

Cheers,

Michael

Namespaces in operation, part 5: User namespaces

Posted Feb 28, 2013 4:25 UTC (Thu) by SEJeff (guest, #51588)

From a quick google, I found this:
http://openvz.org/Kernel_build#Rebuilding_kernel_from_sou...

    [jeff@omniscience tmp]$ wget -q http://download.openvz.org/kernel/branches/2.6.18/028stab...
    [jeff@omniscience tmp]$ du -hs patch-ovz028stab056.1-combined.gz
    1.2M patch-ovz028stab056.1-combined.gz
    [jeff@omniscience tmp]$ gzip -d patch-ovz028stab056.1-combined.gz
    [jeff@omniscience tmp]$ du -hs patch-ovz028stab056.1-combined
    4.6M patch-ovz028stab056.1-combined

I did the same thing about a year ago and the results were the same. So I still stand by my previous comment. Around a megabyte :)

Namespaces in operation, part 5: User namespaces

Posted Feb 28, 2013 14:33 UTC (Thu) by mabshoff (guest, #86444)

> From a quick google, I found this: [SNIP]

Yeah, that was the first hit I got, too, but I discarded it for the reason listed below.

> So I still stand by my previous comment. Around a megabyte :)

Well, that specific patch is for a RHEL 5 based kernel, i.e. on top of their version of 2.6.18. The RHEL 6 based 2.6.32 kernel patch currently weighs in at 1.3 MB (see [1]). And that patch dates from March 4th 2011, so I would hardly call it current :p.

Anyway, with ploop and some of their other bits being out of mainline for now their patch is a little like the RT patch set: growing some time and shrinking some other time, but as patches move into mainline from it new patches for new functionality get added on top. At least after many years of living mostly out of mainline their efforts like CRIU have shown that you can merge it into mainline assuming all interested parties collaborate, and that is a really positive development imho.

Cheers,

Michael

[1] http://download.openvz.org/kernel/branches/2.6.32/2.6.32-...

Namespaces in operation, part 5: User namespaces

Posted Feb 27, 2013 22:11 UTC (Wed) by ebiederm (subscriber, #35028)

Oh I would say that the user namespaces at least are much closer to the original vserver approach (which uses a fixed number of the high bits as the container id) and a fair bit better than either approach, as all of the weird corner cases of mixing userspace uids and gids and the kernel uids and gids are handled.

That is what the remaining XFS work is about: ensuring that XFS doesn't mix user space uids with in-kernel uids without adding the appropriate translations, and making it hard to confuse those two kinds of uids in the future. XFS has a very unique architecture for its in-kernel filesystem data structures and many more user-facing ioctls than most filesystems, which means it can't be treated like just another filesystem.

What was not mentioned is that when a process in a user namespace interacts with files, the interaction is the same as interacting with processes. When a file is created, the uid of the process is mapped into the initial user namespace and those mapped uids are stored on disk. Meanwhile, when the process in a user namespace stats those files, the uids are mapped back into its namespace, so it sees the uids it wrote with instead of the uids that are stored on disk.

This allows quotas and other filesystem features to work with user namespaces without any changes to the on-disk format.

Complexity?

Posted Feb 28, 2013 10:31 UTC (Thu) by renox (guest, #23785)

Unix user/group management has always looked very complex to me, I wonder if this is because
1) I've not invested enough effort to understand Unix management
2) the problem is itself very complex
3) this is an historical baggage/legacy and other approaches (Plan9? Windows?) could provide the same type of services but in a simpler way..

Thoughts?

Complexity?

Posted Feb 28, 2013 17:26 UTC (Thu) by hummassa (guest, #307)

IMHO, (2).

Windows-like ACLs (again IMHO) are simpler to apply but cause more esoteric and difficult-to-debug problems.

Complexity?

Posted Mar 5, 2013 10:07 UTC (Tue) by malor (guest, #2973)

The old Unix permissions system actually isn't very complex, which is its central problem. The permissions are very coarse, and it's very hard to describe complex security arrangements using those very dull tools. It's primarily based on user/group/other, read/write/execute, and the various permutations of those three permissions, granted to those three broad categories. And then you've got system-wide capabilities, which either grant or deny access to users to do things that can be dangerous to the system as a whole. As they presently stand, Unix permissions are very coarsely defined, and can be very far-reaching. Granting a given permission to a program can have nasty security implications that are difficult to understand.

On the Windows side, NT-derivative systems have used ACLs for a long time, and they're much more capable. The permissions themselves are fairly fine-grained, and then you can specify to a gnat's eyebrow exactly who should and should not get them. As long as you realize that the permissions system is looking for any possible excuse to deny a permission, and only if it A) can't find any reason to reject someone, and B) finds an explicit authorization, will it finally grant a permission. Just think of the NT permissions system as a big asshole, and the whole system ends up being easily understandable, and very powerful.

But, ACLs have a very fundamental problem: permissions are granted, normally, to users, not programs, so they do almost nothing to protect programs from each other. If they're being run by the same user (say, "malor"), then they can mess each other up. If I'm running Internet Explorer, then it has any permission that I do, and if it's hijacked, it can erase or corrupt anything that I could erase or corrupt.

Namespaces are kind of an ugly hack that seem to have three basic goals:

  1. Preserve compatibility with the old Unix blunt instruments;
  2. Allow finer-grained permission controls;
  3. Assign permissions based on programs, rather than users

Once this stuff has been really integrated into the system software, running Firefox as "malor" should grant a very limited exposure to my other files, should it be hijacked. The browser process might be restricted to creating new files in a download directory only, with no other write access anywhere in the filesystem. A separate, user-facing program might have the authorization to rewrite user configuration files, like bookmarks or the settings in about:config. By separating them in this way, it will be enormously harder for a remote exploit, even in a full-featured language like Java, to escape the virtual sandbox it's in. It probably won't be impossible, but it should be much more difficult, perhaps requiring a specific exploit be written to attack your particular combination of OS and Firefox, making it non-feasible for mass exploit attempts.

This is kind of the same thing that Microsoft and Apple are trying to do with their DRM-based software stores, and highly restricted environments, but in this case, YOU hold the keys, not Microsoft or Apple.

The overall solution ends up being kind of ugly, because of the simultaneous need to maintain compatibility with a 40-year-old permissions system, and also to implement a bunch of new permission types that have never existed in Unix before, but I'll tell you this: I'll take an ugly system I can control myself over an imposed system by a corporation any day. If I want the best security, where programs are isolated from one another, but I also want to own my own hardware, Linux namespaces seem to be the way forward.

Complexity?

Posted Mar 5, 2013 10:29 UTC (Tue) by etienne (guest, #25256)

> permissions are granted, normally, to users, not programs

Maybe that is not complex enough, and permissions should be granted to what the program is doing:
- if the program is updating itself (when no package manager) it should have rights to overwrite its own binaries
- if the program is configuring itself (when user changes something) it should have rights to change its configuration files
- if the program is being only "used", it shall do none of the above.

Ever seen a security system blocking half of the upgrade of a package?
I did not say I would like to manage such a system...

Complexity?

Posted Apr 14, 2015 4:51 UTC (Tue) by bandrami (guest, #94229)

> NT-derivative systems have used ACLs for a long time, and they're much more capable.

And this is not *necessarily* a good thing. Taking Unix permissions and then adding capabilities and ACLs triples (more than triples, really) the logic required to statically verify a configuration. I guess I sort of appreciate the idea that a library that isn't there can't be misconfigured -- I don't run ACLs or CAP_*s or namespaces on my production Linux servers for that reason even though that takes rebuilding the kernel. It's the same argument I have with mandatory access control systems: knobs I can twist are knobs that I can twist the wrong way. I want my security system to be so brain-dead that I can verify it at 3am in a loud server room with a client calling me every 30 seconds.

Namespaces in operation, part 5: User namespaces

Posted Mar 1, 2013 23:51 UTC (Fri) by darwish07 (guest, #49520)

Thanks for providing such an interesting, and quite informative, article!

Hurd?

Posted Mar 2, 2013 19:15 UTC (Sat) by cesarb (subscriber, #6266)

> As we noted in an earlier article, one of the motivations for implementing user namespaces is to give non-root applications access to functionality that was formerly limited to the root user.

Wasn't that one of the motivations for the microkernel design of GNU Hurd?

Namespaces in operation, part 5: User namespaces

Posted Mar 7, 2013 2:40 UTC (Thu) by kevinm (guest, #69913)

So, a UID in the parent namespace that isn't mapped in the child namespace is mapped to a default UID; but what about a UID in the child namespace that isn't mapped? What UID will that have in the parent namespace (for example, when a process in the child namespace with UID=0 uses seteuid(9999), where child namespace UID 9999 isn't included in any mapping)?

Example fails on today's Ubuntu 13.04 daily

Posted Mar 7, 2013 13:42 UTC (Thu) by BernardB (subscriber, #47903)

No luck trying this out on today's Ubuntu 13.04 daily build:

    $ id
    uid=1000(bernard) gid=1000(bernard)
    $ uname -a
    Linux dev32 3.8.0-11-generic #20-Ubuntu SMP Tue Mar 5 20:33:22 UTC 2013 i686 athlon i686 GNU/Linux
    $ gcc -o demo_userns demo_userns.c  -lcap
    $ ./demo_userns
    clone: Invalid argument
    $ sudo ./demo_userns # It was worth a shot!
    clone: Invalid argument
    $ strace -e clone ./demo_userns 
    clone(child_stack=0x814a064, flags=0x10000000|SIGCHLD) = -1 EINVAL (Invalid argument)
    clone: Invalid argument
    $ apt-cache policy linux-image-`uname -r`
    linux-image-3.8.0-11-generic:
      Installed: 3.8.0-11.20
      Candidate: 3.8.0-11.20
      Version table:
     *** 3.8.0-11.20 0
            500 http://gb.archive.ubuntu.com/ubuntu/ raring/main i386 Packages
            100 /var/lib/dpkg/status

I've yet to delve into the kernel source to find where EINVAL is coming from, but can anyone see if I am missing something obvious? Or maybe it's because Ubuntu's done something magic to their kernel? (The Makefile in their Linux sources purports to be 3.8.2).

Example fails on today's Ubuntu 13.04 daily

Posted Mar 7, 2013 14:05 UTC (Thu) by BernardB (subscriber, #47903)

Okay, having dug deeper, it turns out that the examples require CONFIG_USER_NS. As the article points out, 3.8 was still missing the changes for XFS and other filesystems. Unsurprisingly, Ubuntu 13.04 chose XFS and NFS support over CONFIG_USER_NS. Bummer :P

"Soon after 13.04 they will be fully supported." -- http://permalink.gmane.org/gmane.linux.kernel.containers....

Namespaces in operation, part 5: User namespaces

Posted Mar 5, 2015 8:32 UTC (Thu) by mkerrisk (subscriber, #1978)

Note that because of the Linux 3.19 changes that fixed a user namespace security loophole related to the setgroups() system call, the userns_child_exec.c program needs modifications in order to be able to use GID maps on Linux 3.19 and later (and also on earlier stable kernel series that backported the changes). A revised (and backward compatible) version of this program with the necessary changes can be found in the revised user_namespaces(7) man page that will appear in a few days time. (Look for the definition and use of the proc_setgroup_write() function in the example program.)

Namespaces in operation, part 5: User namespaces

Posted Oct 29, 2017 8:49 UTC (Sun) by mkerrisk (subscriber, #1978) [Link]

Slides from my October 2017 presentation on User Namespaces at Open Source Summit Europe can be found here.

Namespaces in operation, part 5: User namespaces

Posted Nov 25, 2018 3:36 UTC (Sun) by fusillator (guest, #128821) [Link]

This is impressive but a bit hard to follow for a noob like me.

I think these are the main points of the article:
When a user namespace is created by clone(), the first process in the new namespace is granted a full set of capabilities in that namespace.
Invoking one of the exec* functions changes the calling process's capabilities according to the rules for the transformation of capabilities during exec*:
1) pI' = pI
2) pP' = (X & fP) | (pI & fI)
3) pE' = fE & pP'
Hence ns_child_exec gets the permission error: the child shell in the new user namespace is executed as an unprivileged user, and the permitted, effective, and inheritable flags on the shell binary weren't set appropriately.
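
The three rules above can be written out as bit operations. This is a minimal sketch, assuming that p denotes the process's sets, f the executable file's sets, and X the capability bounding set; it treats fE as a mask exactly as rule 3 is written, and it ignores the special handling the kernel applies when the process's user ID is 0. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t capset_t;      /* one bit per capability */

struct proc_caps { capset_t inheritable, permitted, effective; };
struct file_caps { capset_t inheritable, permitted, effective; };

/* Apply rules 1-3 to compute the process's sets after exec. */
struct proc_caps caps_after_exec(struct proc_caps p, struct file_caps f,
                                 capset_t bounding)
{
    struct proc_caps n;

    n.inheritable = p.inheritable;                      /* 1) pI' = pI */
    n.permitted = (bounding & f.permitted) |
                  (p.inheritable & f.inheritable);      /* 2) pP' */
    n.effective = f.effective & n.permitted;            /* 3) pE' */
    return n;
}

/* Example: a process with full permitted and effective sets (but an
   empty inheritable set) executes a binary that has no file
   capabilities, as is typical for a shell. */
void example_shell_exec(void)
{
    struct proc_caps before = { 0, ~(capset_t) 0, ~(capset_t) 0 };
    struct file_caps plain = { 0, 0, 0 };
    struct proc_caps after = caps_after_exec(before, plain, ~(capset_t) 0);

    assert(after.permitted == 0 && after.effective == 0);
}
```

Under these rules alone, the shell ends up with empty permitted and effective sets, which is consistent with the permission error described above.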

The userns_child_exec.c program takes care of writing the root mapping for the new user namespace from the parent.
First it creates the new user namespace by means of clone(). Then it uses a pipe to ensure that the parent process has written the map files before the cloned child launches the shell. The pipe's file descriptors are duplicated when the process clones, so each process (parent and child) has its own copies of the two descriptors (pipe_fd[0] for reading, pipe_fd[1] for writing), all referring to the same pipe. The parent closes its write end of the pipe once it has finished writing the user mappings. The cloned child closes its own write end and then reads from the read end, expecting to see end-of-file. This synchronizes the child with the parent, ensuring that the map was written by the calling process, in the context of the parent namespace, before the shell is executed.
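
The synchronization described above can be sketched in isolation. This is a minimal sketch, not the article's program: it uses fork() as a stand-in for clone() with CLONE_NEWUSER so that it runs anywhere, and the function names and the parent_setup callback are hypothetical placeholders for the step that writes uid_map and gid_map.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the child's exit status: 0 if the child saw end-of-file on
   the pipe, nonzero otherwise. Returns -1 on a setup failure. */
int pipe_sync_demo(void (*parent_setup)(pid_t child))
{
    int pipe_fd[2];
    int status;
    pid_t child;

    if (pipe(pipe_fd) == -1)
        return -1;

    child = fork();             /* stand-in for clone(CLONE_NEWUSER) */
    if (child == -1) {
        close(pipe_fd[0]);
        close(pipe_fd[1]);
        return -1;
    }

    if (child == 0) {
        char ch;
        ssize_t n;

        close(pipe_fd[1]);      /* otherwise read() never sees EOF */
        n = read(pipe_fd[0], &ch, 1);   /* blocks until parent closes */
        /* The real program would exec the shell at this point. */
        _exit(n == 0 ? 0 : 1);
    }

    close(pipe_fd[0]);          /* parent does not read */
    if (parent_setup != NULL)
        parent_setup(child);    /* e.g. write uid_map and gid_map */
    close(pipe_fd[1]);          /* EOF releases the child */

    if (waitpid(child, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Usage example with a setup step that does nothing. */
static void no_setup(pid_t child) { (void) child; }

void example_sync(void)
{
    assert(pipe_sync_demo(no_setup) == 0);
}
```

The child cannot proceed past read() until every write end of the pipe is closed, so whatever parent_setup does is guaranteed to have finished first.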

These are my considerations:
In order to write to the mapping file as an unprivileged user (in the context of the parent namespace), the capabilities CAP_SETUID and CAP_SETGID need to be granted to the calling process (see the first rule in the section Rules for writing to mapping files). The author doesn't show how to grant these privileges; one way is to enable the effective and permitted flags on the userns_child_exec binary.

Moreover, I don't get why the mapping isn't written in childFunc, which has a full set of capabilities in the context of the new namespace before executing the shell (according to the first point of the third rule in the section Rules for writing to mapping files, this should be feasible). This would avoid the need for a sync mechanism.

The author states that, to avoid losing the capabilities, the parent needs to write the user mapping before the shell is executed (since the shell's capability flags aren't set appropriately for the exec transformation).
However, I think that without this synchronization the shell might run with the unprivileged overflow UID (the value in /proc/sys/kernel/overflowuid) until the parent writes the mapping. The synchronization is needed to avoid a race condition during shell startup and to ensure that the shell runs as a privileged process in the new user namespace from the very beginning.

Namespaces in operation, part 5: User namespaces

Posted Nov 25, 2018 17:39 UTC (Sun) by fusillator (guest, #128821) [Link]

> Moreover, I don't get why the mapping isn't written in childFunc, which has a full set of capabilities in the context of the new namespace before executing the shell (according to the first point of the third rule in the section Rules for writing to mapping files, this should be feasible). This would avoid the need for a sync mechanism.

Since the userns_child_exec command takes the user mappings as arguments, mappings to arbitrary user IDs (group IDs) in the parent user namespace must be allowed (see the second point of the third rule in the section Rules for writing to mapping files).
Conversely, if the mapping is written by the cloned child in the new namespace, it is only possible to map the user ID of the parent process in the parent namespace to some user ID in the new namespace (root included).
When a process clones itself, the child inherits its user and group IDs in the parent namespace.
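
The restriction on a child that writes its own mapping can be expressed as a small check. This is a minimal sketch that models only the single rule paraphrased above (one line, length 1, outside ID equal to the writer's ID in the parent namespace); the kernel applies further checks that are not modeled here, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if 'map' is a mapping that a process lacking privilege in
   the parent namespace could write for itself, 0 otherwise.
   'writer_id' is the writer's effective ID in the parent namespace. */
int unprivileged_map_ok(const char *map, unsigned long writer_id)
{
    unsigned long inside, outside, length;
    char extra;

    assert(map != NULL);

    /* Require exactly three numbers; a fourth conversion succeeding
       means there is trailing content, such as a second line. */
    if (sscanf(map, "%lu %lu %lu %c",
               &inside, &outside, &length, &extra) != 3)
        return 0;

    return length == 1 && outside == writer_id;
}
```

For example, with writer ID 1000 the mapping "0 1000 1" passes the check, while "0 1000 2" or a two-line mapping does not.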

> In order to write to the mapping file as an unprivileged user (in the context of the parent namespace), the capabilities CAP_SETUID and CAP_SETGID need to be granted to the calling process (see the first rule in the section Rules for writing to mapping files). The author doesn't show how to grant these privileges; one way is to enable the effective and permitted flags on the userns_child_exec binary.

Other rules govern how capabilities propagate between namespaces that have a parent-child relationship. From the next article in the series, https://lwn.net/Articles/540087/:
"When a user namespace is created, the kernel records the effective user ID of the creating process as being the "owner" of the namespace. A process whose effective user ID matches that of the owner of a user namespace and which is a member of the parent namespace has all capabilities in the namespace."
So the required capabilities, CAP_SETUID and CAP_SETGID, are granted by default to the unprivileged parent process with respect to the new namespace.

Namespaces in operation, part 5: User namespaces

Posted Aug 14, 2022 9:52 UTC (Sun) by marcozov (guest, #160103) [Link]

Thanks for the article, it is really nice to read and follow.

I'm having an issue when trying to reproduce part of the mentioned steps (which should all be doable as non-root, as far as I understood).
In particular, when running `./demo_userns x`, I *can* run `echo '0 1000 1' > /proc/$DEMO_PID/uid_map` successfully (and the output of the demo_userns program is updated with the new user ID, 0), but I *cannot* run `echo '0 1000 1' > /proc/$DEMO_PID/gid_map`.
If I try to run those echo commands as root, they both work, but this sounds a bit against the purpose of this article (which is about being able to do root actions in a restricted environment -- the new user namespace).

If I proceed with the demo, I have a similar problem when I run `./userns_child_exec -U -M '0 1000 1' -G '0 1000 1' bash`:
```
write /proc/10568/gid_map: Operation not permitted
bash: initialize_job_control: no job control in background: Bad file descriptor
```
Removing the -G part makes the error go away here as well.

Any clue on how I can debug this? As far as I understood, if a process is in the parent user namespace, it should automatically have the necessary capabilities (CAP_SETUID, CAP_SETGID) to write to the uid_map / gid_map files of the process in the new user namespace.
Is there anything that I can check to validate this? If I run `cat /proc/$$/status | grep Cap`, I get:
```
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
```
which I'm not sure how to interpret.
In particular, I'm referring to the three rules defined under `Rules for writing to mapping files`: the first one should always hold (based on my understanding, CAP_SETUID/CAP_SETGID are always valid for processes in the parent user namespace), the second one seems to hold as well (the spawned terminal is indeed in the user namespace of the shell that ran `./demo_userns x`), the third one seems a bit more ambiguous: the first statement holds (that's basically defined via `0 1000 1`), the second one not really (according to the `cat /proc/$$/status | grep Cap` output) --> but to me it looks like only one of the two has to hold. Furthermore, if this is the problem, I would expect that writing to the uid_map would lead to the same error.
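
On interpreting the `Cap*` lines: each value is a hexadecimal bit mask with one bit per capability, and it describes the process's capabilities in its own user namespace. The sketch below decodes such a mask. It is a minimal illustration, assuming the capability numbers from linux/capability.h (CAP_SETGID is 6, CAP_SETUID is 7); the function and enum names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Capability numbers as defined in <linux/capability.h>. */
enum { CAP_SETGID_BIT = 6, CAP_SETUID_BIT = 7 };

/* Returns 1 if capability number 'cap' is set in the hexadecimal
   mask string taken from /proc/PID/status, 0 otherwise. */
int cap_mask_has(const char *hex_mask, int cap)
{
    uint64_t mask;

    assert(hex_mask != NULL && cap >= 0 && cap < 64);
    mask = strtoull(hex_mask, NULL, 16);
    return (int) ((mask >> cap) & 1);
}
```

Applied to the output above, `CapEff: 0000000000000000` means the shell holds no effective capabilities in its own namespace, while `CapBnd: 0000003fffffffff` has bits 0 through 37 set. Any capabilities the shell has with respect to a child namespace it owns would not appear in these fields.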

Namespaces in operation, part 5: User namespaces

Posted Aug 14, 2022 10:20 UTC (Sun) by izbyshev (subscriber, #107996) [Link]

Looks like the issue with setgroups() that another comment talks about: https://lwn.net/Articles/635559/.

Namespaces in operation, part 5: User namespaces

Posted Aug 14, 2022 13:05 UTC (Sun) by marcozov (guest, #160103) [Link]

Thanks, I also found the related change in the code!


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds