If execve failed, we would die in safe_close(), because master was already
closed by fdset_close_others() on line 3123. IIUC, we don't need to keep the
fd open after sending it, so let's just close it immediately.
Reproducer:
sudo build/systemd-nspawn -M rawhide fooooooo
Fixup for 3acc84ebd9.
prefix_root() is equivalent to path_join() in almost all ways, hence
let's remove it.
There are subtle differences though: prefix_root() will try shorten
multiple "/" before and after the prefix. path_join() doesn't do that.
This means prefix_root() might return a string shorter than both its
inputs combined, while path_join() never does that. I like the
path_join() semantics better, hence I think dropping prefix_root() is
totally OK. In the end the strings generated by both functon should
always be identical in terms of path_equal() if not streq().
This leaves prefix_roota() in place. Ideally we'd have path_joina(), but
I don't think we can reasonably implement that as a macro. or maybe we
can? (if so, sounds like something for a later PR)
Also add in a few missing OOM checks
The OCI changes in #9762 broke a use case in which we use nspawn from
inside a container that has dropped capabilities from the bounding set
that nspawn expected to retain. In an attempt to keep OCI compliance
and support our use case, I made hard failing on setting capabilities
not in the bounding set optional (hard fail if using OCI and log only
if using nspawn cmdline).
Fixes#12539
The console tty is now allocated from within the container so it's not
necessary anymore to allocate it from the host and bind mount the pty slave
into the container. The pty master is sent to the host.
/dev/console is now a symlink pointing to the pty slave.
This might also be less confusing for applications running inside the container
and the overall result looks cleaner (we don't need to apply manually the
passed selinux context, if any, to the allocated pty for instance).
The CPU_SET_S api is pretty bad. In particular, it has a parameter for the size
of the array, but operations which take two (CPU_EQUAL_S) or even three arrays
(CPU_{AND,OR,XOR}_S) still take just one size. This means that all arrays must
be of the same size, or buffer overruns will occur. This is exactly what our
code would do, if it received an array of unexpected size over the network.
("Unexpected" here means anything different from what cpu_set_malloc() detects
as the "right" size.)
Let's rework this, and store the size in bytes of the allocated storage area.
The code will now parse any number up to 8191, independently of what the current
kernel supports. This matches the kernel maximum setting for any architecture,
to make things more portable.
Fixes#12605.
Since nspawn-settings.h includes seccomp.h, any file that includes
nspawn-settings.h should depend on libseccomp so the correct header path where
seccomp.h lives is added to the header search paths.
It's especially important for distros such as openSUSE where seccomp.h is not
shipped in /usr/include but /usr/include/libseccomp.
This patch is similar to 8238423095.
There's no point in returning the "key" within each loop iteration as
JsonVariant object. Let's simplify things and return it as string. That
simplifies usage (since the caller doesn't have to convert the object to
the string anymore) and is safe since we already validate that keys are
strings when an object JsonVariant is allocated.
We noticed in our tests that occasionally SystemCallFilter= would
fail to set and the service would run with no syscall filtering.
Most of the time the same tests would apply the filter and fail
the service as expected. While it's not totally clear why this happens,
we noticed seccomp_load() in the systemd code base would fail open for
all errors except EPERM and EACCES.
ENOMEM, EINVAL, and EFAULT seem like reasonable values to add to the
error set based on what I gather from libseccomp code and man pages:
-ENOMEM: out of memory, failed to allocate space for a libseccomp structure, or would exceed a defined constant
-EINVAL: kernel isn't configured to support the operations, args are invalid (to seccomp_load(), seccomp(), or prctl())
-EFAULT: addresses passed as args are invalid
/tmp might not be mounted at all yet (given that we support
SYSTEMD_NSPAWN_TMPFS_TMP=0 to turn this off), and /tmp is a dir systemd
usually tries to unmount during shutdown (unlike /run), and we shouldn't
keep it busy. Hence let's just move these deleted files to /run so that
we don't keep /tmp needlessly busy.
Some chattrs only work sensible if you set them right after opening a
file for create (think: FS_NOCOW_FL). Others only work when they are
applied when the file is fully written (think: FS_IMMUTABLE_FL). Let's
take that into account when copying files and applying a chattr to them.
The host mounts it like that, nspawn hence should do too.
Moreover, mount the file system after doing CLONEW_NEWIPC so that it
actually reflects the right mqueues. Finally, mount it wthout
considering it fatal, since POSIX mqueue support is little used and it
should be fine not to support it in the kernel.
The function is otherwise generic enough to toggle other bind mount
flags beyond MS_RDONLY (for example: MS_NOSUID or MS_NODEV), hence let's
beef it up slightly to support that too.
Previously both run() and run_container() would free 'fds'. Let's fix
that, and let run() free it but make run_container() already remove all
fds from it, because that's what we actually want to do.
Fixes: #12073
This is a pretty large patch, and adds support for OCI runtime bundles
to nspawn. A new switch --oci-bundle= is added that takes a path to an
OCI bundle. The JSON file included therein is read similar to a .nspawn
settings files, however with a different feature set.
Implementation-wise this mostly extends the pre-existing Settings object
to carry additional properties for OCI. However, OCI supports some
concepts .nspawn files did not support yet, which this patch also adds:
1. Support for "masking" files and directories. This functionatly is now
also available via the new --inaccesible= cmdline command, and
Inaccessible= in .nspawn files.
2. Support for mounting arbitrary file systems. (not exposed through
nspawn cmdline nor .nspawn files, because probably not a good idea)
3. Ability to configure the console settings for a container. This
functionality is now also available on the nspawn cmdline in the new
--console= switch (not added to .nspawn for now, as it is something
specific to the invocation really, not a property of the container)
4. Console width/height configuration. Not exposed through
.nspawn/cmdline, but this may be controlled through $COLUMNS and
$LINES like in most other UNIX tools.
5. UID/GID configuration by raw numbers. (not exposed in .nspawn and on
the cmdline, since containers likely have different user tables, and
the existing --user= switch appears to be the better option)
6. OCI hook commands (no exposed in .nspawn/cmdline, as very specific to
OCI)
7. Creation of additional devices nodes in /dev. Most likely not a good
idea, hence not exposed in .nspawn/cmdline. There's already --bind=
to achieve the same, which is the better alternative.
8. Explicit syscall filters. This is not a good idea, due to the skewed
arch support, hence not exposed through .nspawn/cmdline.
9. Configuration of some sysctls on a whitelist. Questionnable, not
supported in .nspawn/cmdline for now.
10. Configuration of all 5 types of capabilities. Not a useful concept,
since the kernel will reduce the caps on execve() anyway. Not
exposed through .nspawn/cmdline as this is not very useful hence.
Note that this only implements the OCI runtime logic itself. It does not
provide a runc-compatible command line tool. This is left for a later
PR. Only with that in place tools such as "buildah" can use the OCI
support in nspawn as drop-in replacement.
Currently still missing is OCI hook support, but it's already parsed and
everything, and should be easy to add. Other than that it's OCI is
implemented pretty comprehensively.
There's a list of incompatibilities in the nspawn-oci.c file. In a later
PR I'd like to convert this into proper markdown and add it to the
documentation directory.
Let's separate out the raw uid_t/gid_t handling from the username
handling. This is useful later on.
Also, let's use the right gid_t type for group types wherever
appropriate.
if we sync the legacy and unified trees before moving to the right
subcgroup then ultimately the cgroup paths in the hierarchies will be
out-of-sync... Hence, let's move the payload first, and sync then.
Addresses: https://github.com/systemd/systemd/pull/9762#issuecomment-441187979
Previously, when we'd copy an individual file we'd synthesize a
user.crtime_usec xattr with the source's creation time if we can
determine it. As the creation/birth time was until recently not
queriable form userspace this effectively just propagated the same xattr
on the source to the same xattr on the destination. However, current
kernels now allow to query the birthtime using statx() and we do make
use of that now. Which means that suddenly we started synthesizing these
xattrs much more regularly.
Doing this actually does make sense, but only in very few cases:
not for the typical regular files we copy, but certainly when dealing
with disk images. Hence, let's keep this kind of propagation, but let's
make it a flag and default to off. Then turn it on whenever we deal with
disk images, and leave it off otherwise.
This is particularly relevant as overlayfs combining a real fs, and a
tmpfs on top will result in EOPNOTSUPP when it is attempted to open a
file with xattrs for writing, as tmpfs does not support xattrs, and
hence the copy-up cannot work. Hence, let's avoid synthesizing this
needlessly, to increase compat with overlayfs.
Previously, we'd refuse the combination, and claimed we'd imply it, but
actually didn't. Let's allow the combination and imply read-only from
--volatile=, because that's what's documented, what we claim we do, and
what makes sense.
fdopen doesn't accept "e", it's ignored. Let's not mislead people into
believing that it actually sets O_CLOEXEC.
From `man 3 fdopen`:
> e (since glibc 2.7):
> Open the file with the O_CLOEXEC flag. See open(2) for more information. This flag is ignored for fdopen()
As mentioned by @jlebon in #11131.
cg_ns_supported() caches, so the condition was really checked just once, but
it looks weird to assign the return value to arg_use_cgns (if the variable is not present),
because then the other checks are effectively equivalent to
if (cg_ns_supported() && cg_ns_supported()) { ...
and later
if (!cg_ns_supported() || !cg_ns_supported()) { ...
That's what the function is for after all, and only if it's done there
we can verify the effect of .nspawn files correctly too: after all we
should not just validate that everything configured on the command line
makes sense, but the stuff configured in the .nspawn files, too.
This then let's us to ensure it's called after we parsed the cmdline,
and after we loaded the settings file, so that it these env var settings
override everything loaded from there.
This splits out a bunch of functions from fileio.c that have to do with
temporary files. Simply to make the header files a bit shorter, and to
group things more nicely.
No code changes, just some rearranging of source files.
Whenever we invoke external, foreign code from code that has
RLIMIT_NOFILE's soft limit bumped to high values, revert it to 1024
first. This is a safety precaution for compatibility with programs using
select() which cannot operate with fds > 1024.
This commit adds the call to rlimit_nofile_safe() to all invocations of
exec{v,ve,l}() and friends that either are in code that we know runs
with RLIMIT_NOFILE bumped up (which is PID 1 and all journal code for
starters) or that is part of shared code that might end up there.
The calls are placed as early as we can in processes invoking a flavour
of execve(), but after the last time we do fd manipulations, so that we
can still take benefit of the high fd limits for that.
This changes cg_enable_everywhere() to return which controllers are
enabled for the specified cgroup. This information is then used to
correctly track the enablement mask currently in effect for a unit.
Moreover, when we try to turn off a controller, and this works, then
this is indicates that the parent unit might succesfully turn it off
now, too as our unit might have kept it busy.
So far, when realizing cgroups, i.e. when syncing up the kernel
representation of relevant cgroups with our own idea we would strictly
work from the root to the leaves. This is generally a good approach, as
when controllers are enabled this has to happen in root-to-leaves order.
However, when controllers are disabled this has to happen in the
opposite order: in leaves-to-root order (this is because controllers can
only be enabled in a child if it is already enabled in the parent, and
if it shall be disabled in the parent then it has to be disabled in the
child first, otherwise it is considered busy when it is attempted to
remove it in the parent).
To make things complicated when invalidating a unit's cgroup membershup
systemd can actually turn off some controllers previously turned on at
the very same time as it turns on other controllers previously turned
off. In such a case we have to work up leaves-to-root *and*
root-to-leaves right after each other. With this patch this is
implemented: we still generally operate root-to-leaves, but as soon as
we noticed we successfully turned off a controller previously turned on
for a cgroup we'll re-enqueue the cgroup realization for all parents of
a unit, thus implementing leaves-to-root where necessary.
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.
I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
Pretty much everything uses just the first argument, and this doesn't make this
common pattern more complicated, but makes it simpler to pass multiple options.
From #10526:
$ sudo systemd-nspawn -i image
Spawning container image on /home/zbyszek/src/mkosi/image.
Press ^] three times within 1s to kill container.
Short read while reading cgroup mode.
We have the machine name anyway, let's use TerminateMachine() on
machined's Manager object directly with it. That way it's a single
method call only, instead of two, to terminate the machine.
When running a read-only file system, we might not be able to create
/var/log/journal. Do not fail on this, unless actually requested by the
--link-journal options.
$ systemd-nspawn --image=image.squashfs ...
All over the place we define local variables for the various sockopts
that take a bool-like "int" value. Sometimes they are const, sometimes
static, sometimes both, sometimes neither.
Let's clean this up, introduce a common const variable "const_int_one"
(as well as one matching "const_int_zero") and use it everywhere, all
acorss the codebase.
With this change almost all log messages that are suppressed through
--quiet are not actually suppressed anymore, but simply downgraded to
LOG_DEBUG. Previously we did it this way for some log messages and fully
suppressed them for others. With this it's pretty much systematic.
Inspired by #10122.
In seccomp code, the code is changed to propagate errors which are about
anything other than unknown/unimplemented syscalls. I *think* such errors
should not happen in normal usage, but so far we would summarilly ignore all
errors, so that part is uncertain. If it turns out that other errors occur and
should be ignored, this should be added later.
In nspawn, we would count the number of added filters, but didn't use this for
anything. Drop that part.
The comments suggested that seccomp_add_syscall_filter_item() returned negative
if the syscall is unknown, but this wasn't true: it returns 0.
The error at this point can only be if the syscall was known but couldn't be
added. If the error comes from our internal whitelist in nspawn, treat this as
error, because it means that our internal table is wrong. If the error comes
from user arguments, warn and ignore. (If some syscall is not known at current
architecture, it is still silently ignored.)
Our logs are full of:
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldstat() / -10037, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call get_thread_area() / -10076, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call set_thread_area() / -10079, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldfstat() / -10034, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldolduname() / -10036, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldlstat() / -10035, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call waitpid() / -10073, ignoring: Numerical argument out of domain
...
This is pointless and makes debug logs hard to read. Let's keep the logs
in test code, but disable it in nspawn and pid1. This is done through a function
parameter because those functions operate recursively and it's not possible to
make the caller to log meaningfully.
There should be no functional change, except the skipped debug logs.
When a network namespace is needed, /sys is mounted as tmpfs (see commit
d8fc6a000f for details).
But in this case mode 755 was used as initial permissions for /sys whereas the
default mode for sysfs is 555.
In practice using 755 doesn't have any impact because /sys is mounted read-only
too but for consistency, let's use the correct mode.
Fixes: #10050
This is a bit like the info link in most of GNU's --help texts, but we
don't do info but man pages, and we make them properly clickable on
terminal supporting that, because awesome.
I think it's generally advisable to link up our (brief) --help texts and
our (more comprehensive) man pages a bit, so this should be an easy and
straight-forward way to do it.
Currently, mount_sysfs() only creates /sys/fs/cgroup if cg_ns_supported().
The comment explains that we need to "Create mountpoint for
cgroups. Otherwise we are not allowed since we remount /sys read-only.";
that is: that we need to do it now, rather than later. However, the
comment doesn't do anything to explain why we only need to do this if
cg_ns_supported(); shouldn't we _always_ need to do it?
The answer is that if !use_cgns, then this was already done by the outer
child, so mount_sysfs() only needs to do it if use_cgns. Now,
mount_sysfs() doesn't know whether use_cgns, but !cg_ns_supported() implies
!use_cgns, so we can optimize" the case where we _know_ !use_cgns, and deal
with a no-op mkdir_p() in the false-positive where cgns_supported() but
!use_cgns.
But is it really much of an optimization? We're potentially spending an
access(2) (cg_ns_supported() could be cached from a previous call) to
potentially save an lstat(2) and mkdir(2); and all of them are on virtual
fileystems, so they should all be pretty cheap.
So, simplify and drop the conditional. It's a dubious optimization that
requires more text to explain than it's worth.