If execve failed, we would die in safe_close(), because master was already
closed by fdset_close_others() on line 3123. IIUC, we don't need to keep the
fd open after sending it, so let's just close it immediately.
Reproducer:
sudo build/systemd-nspawn -M rawhide fooooooo
Fixup for 3acc84ebd9.
prefix_root() is equivalent to path_join() in almost all ways, hence
let's remove it.
There are subtle differences though: prefix_root() will try shorten
multiple "/" before and after the prefix. path_join() doesn't do that.
This means prefix_root() might return a string shorter than both its
inputs combined, while path_join() never does that. I like the
path_join() semantics better, hence I think dropping prefix_root() is
totally OK. In the end the strings generated by both functon should
always be identical in terms of path_equal() if not streq().
This leaves prefix_roota() in place. Ideally we'd have path_joina(), but
I don't think we can reasonably implement that as a macro. or maybe we
can? (if so, sounds like something for a later PR)
Also add in a few missing OOM checks
The OCI changes in #9762 broke a use case in which we use nspawn from
inside a container that has dropped capabilities from the bounding set
that nspawn expected to retain. In an attempt to keep OCI compliance
and support our use case, I made hard failing on setting capabilities
not in the bounding set optional (hard fail if using OCI and log only
if using nspawn cmdline).
Fixes#12539
The console tty is now allocated from within the container so it's not
necessary anymore to allocate it from the host and bind mount the pty slave
into the container. The pty master is sent to the host.
/dev/console is now a symlink pointing to the pty slave.
This might also be less confusing for applications running inside the container
and the overall result looks cleaner (we don't need to apply manually the
passed selinux context, if any, to the allocated pty for instance).
The CPU_SET_S api is pretty bad. In particular, it has a parameter for the size
of the array, but operations which take two (CPU_EQUAL_S) or even three arrays
(CPU_{AND,OR,XOR}_S) still take just one size. This means that all arrays must
be of the same size, or buffer overruns will occur. This is exactly what our
code would do, if it received an array of unexpected size over the network.
("Unexpected" here means anything different from what cpu_set_malloc() detects
as the "right" size.)
Let's rework this, and store the size in bytes of the allocated storage area.
The code will now parse any number up to 8191, independently of what the current
kernel supports. This matches the kernel maximum setting for any architecture,
to make things more portable.
Fixes#12605.
Since nspawn-settings.h includes seccomp.h, any file that includes
nspawn-settings.h should depend on libseccomp so the correct header path where
seccomp.h lives is added to the header search paths.
It's especially important for distros such as openSUSE where seccomp.h is not
shipped in /usr/include but /usr/include/libseccomp.
This patch is similar to 8238423095.
There's no point in returning the "key" within each loop iteration as
JsonVariant object. Let's simplify things and return it as string. That
simplifies usage (since the caller doesn't have to convert the object to
the string anymore) and is safe since we already validate that keys are
strings when an object JsonVariant is allocated.
We noticed in our tests that occasionally SystemCallFilter= would
fail to set and the service would run with no syscall filtering.
Most of the time the same tests would apply the filter and fail
the service as expected. While it's not totally clear why this happens,
we noticed seccomp_load() in the systemd code base would fail open for
all errors except EPERM and EACCES.
ENOMEM, EINVAL, and EFAULT seem like reasonable values to add to the
error set based on what I gather from libseccomp code and man pages:
-ENOMEM: out of memory, failed to allocate space for a libseccomp structure, or would exceed a defined constant
-EINVAL: kernel isn't configured to support the operations, args are invalid (to seccomp_load(), seccomp(), or prctl())
-EFAULT: addresses passed as args are invalid
/tmp might not be mounted at all yet (given that we support
SYSTEMD_NSPAWN_TMPFS_TMP=0 to turn this off), and /tmp is a dir systemd
usually tries to unmount during shutdown (unlike /run), and we shouldn't
keep it busy. Hence let's just move these deleted files to /run so that
we don't keep /tmp needlessly busy.
Some chattrs only work sensible if you set them right after opening a
file for create (think: FS_NOCOW_FL). Others only work when they are
applied when the file is fully written (think: FS_IMMUTABLE_FL). Let's
take that into account when copying files and applying a chattr to them.
The host mounts it like that, nspawn hence should do too.
Moreover, mount the file system after doing CLONEW_NEWIPC so that it
actually reflects the right mqueues. Finally, mount it wthout
considering it fatal, since POSIX mqueue support is little used and it
should be fine not to support it in the kernel.
The function is otherwise generic enough to toggle other bind mount
flags beyond MS_RDONLY (for example: MS_NOSUID or MS_NODEV), hence let's
beef it up slightly to support that too.
Previously both run() and run_container() would free 'fds'. Let's fix
that, and let run() free it but make run_container() already remove all
fds from it, because that's what we actually want to do.
Fixes: #12073
This is a pretty large patch, and adds support for OCI runtime bundles
to nspawn. A new switch --oci-bundle= is added that takes a path to an
OCI bundle. The JSON file included therein is read similar to a .nspawn
settings files, however with a different feature set.
Implementation-wise this mostly extends the pre-existing Settings object
to carry additional properties for OCI. However, OCI supports some
concepts .nspawn files did not support yet, which this patch also adds:
1. Support for "masking" files and directories. This functionatly is now
also available via the new --inaccesible= cmdline command, and
Inaccessible= in .nspawn files.
2. Support for mounting arbitrary file systems. (not exposed through
nspawn cmdline nor .nspawn files, because probably not a good idea)
3. Ability to configure the console settings for a container. This
functionality is now also available on the nspawn cmdline in the new
--console= switch (not added to .nspawn for now, as it is something
specific to the invocation really, not a property of the container)
4. Console width/height configuration. Not exposed through
.nspawn/cmdline, but this may be controlled through $COLUMNS and
$LINES like in most other UNIX tools.
5. UID/GID configuration by raw numbers. (not exposed in .nspawn and on
the cmdline, since containers likely have different user tables, and
the existing --user= switch appears to be the better option)
6. OCI hook commands (no exposed in .nspawn/cmdline, as very specific to
OCI)
7. Creation of additional devices nodes in /dev. Most likely not a good
idea, hence not exposed in .nspawn/cmdline. There's already --bind=
to achieve the same, which is the better alternative.
8. Explicit syscall filters. This is not a good idea, due to the skewed
arch support, hence not exposed through .nspawn/cmdline.
9. Configuration of some sysctls on a whitelist. Questionnable, not
supported in .nspawn/cmdline for now.
10. Configuration of all 5 types of capabilities. Not a useful concept,
since the kernel will reduce the caps on execve() anyway. Not
exposed through .nspawn/cmdline as this is not very useful hence.
Note that this only implements the OCI runtime logic itself. It does not
provide a runc-compatible command line tool. This is left for a later
PR. Only with that in place tools such as "buildah" can use the OCI
support in nspawn as drop-in replacement.
Currently still missing is OCI hook support, but it's already parsed and
everything, and should be easy to add. Other than that it's OCI is
implemented pretty comprehensively.
There's a list of incompatibilities in the nspawn-oci.c file. In a later
PR I'd like to convert this into proper markdown and add it to the
documentation directory.
Let's separate out the raw uid_t/gid_t handling from the username
handling. This is useful later on.
Also, let's use the right gid_t type for group types wherever
appropriate.
if we sync the legacy and unified trees before moving to the right
subcgroup then ultimately the cgroup paths in the hierarchies will be
out-of-sync... Hence, let's move the payload first, and sync then.
Addresses: https://github.com/systemd/systemd/pull/9762#issuecomment-441187979
Previously, when we'd copy an individual file we'd synthesize a
user.crtime_usec xattr with the source's creation time if we can
determine it. As the creation/birth time was until recently not
queriable form userspace this effectively just propagated the same xattr
on the source to the same xattr on the destination. However, current
kernels now allow to query the birthtime using statx() and we do make
use of that now. Which means that suddenly we started synthesizing these
xattrs much more regularly.
Doing this actually does make sense, but only in very few cases:
not for the typical regular files we copy, but certainly when dealing
with disk images. Hence, let's keep this kind of propagation, but let's
make it a flag and default to off. Then turn it on whenever we deal with
disk images, and leave it off otherwise.
This is particularly relevant as overlayfs combining a real fs, and a
tmpfs on top will result in EOPNOTSUPP when it is attempted to open a
file with xattrs for writing, as tmpfs does not support xattrs, and
hence the copy-up cannot work. Hence, let's avoid synthesizing this
needlessly, to increase compat with overlayfs.
Previously, we'd refuse the combination, and claimed we'd imply it, but
actually didn't. Let's allow the combination and imply read-only from
--volatile=, because that's what's documented, what we claim we do, and
what makes sense.
fdopen doesn't accept "e", it's ignored. Let's not mislead people into
believing that it actually sets O_CLOEXEC.
From `man 3 fdopen`:
> e (since glibc 2.7):
> Open the file with the O_CLOEXEC flag. See open(2) for more information. This flag is ignored for fdopen()
As mentioned by @jlebon in #11131.
cg_ns_supported() caches, so the condition was really checked just once, but
it looks weird to assign the return value to arg_use_cgns (if the variable is not present),
because then the other checks are effectively equivalent to
if (cg_ns_supported() && cg_ns_supported()) { ...
and later
if (!cg_ns_supported() || !cg_ns_supported()) { ...
That's what the function is for after all, and only if it's done there
we can verify the effect of .nspawn files correctly too: after all we
should not just validate that everything configured on the command line
makes sense, but the stuff configured in the .nspawn files, too.
This then let's us to ensure it's called after we parsed the cmdline,
and after we loaded the settings file, so that it these env var settings
override everything loaded from there.
This splits out a bunch of functions from fileio.c that have to do with
temporary files. Simply to make the header files a bit shorter, and to
group things more nicely.
No code changes, just some rearranging of source files.
Whenever we invoke external, foreign code from code that has
RLIMIT_NOFILE's soft limit bumped to high values, revert it to 1024
first. This is a safety precaution for compatibility with programs using
select() which cannot operate with fds > 1024.
This commit adds the call to rlimit_nofile_safe() to all invocations of
exec{v,ve,l}() and friends that either are in code that we know runs
with RLIMIT_NOFILE bumped up (which is PID 1 and all journal code for
starters) or that is part of shared code that might end up there.
The calls are placed as early as we can in processes invoking a flavour
of execve(), but after the last time we do fd manipulations, so that we
can still take benefit of the high fd limits for that.
This changes cg_enable_everywhere() to return which controllers are
enabled for the specified cgroup. This information is then used to
correctly track the enablement mask currently in effect for a unit.
Moreover, when we try to turn off a controller, and this works, then
this is indicates that the parent unit might succesfully turn it off
now, too as our unit might have kept it busy.
So far, when realizing cgroups, i.e. when syncing up the kernel
representation of relevant cgroups with our own idea we would strictly
work from the root to the leaves. This is generally a good approach, as
when controllers are enabled this has to happen in root-to-leaves order.
However, when controllers are disabled this has to happen in the
opposite order: in leaves-to-root order (this is because controllers can
only be enabled in a child if it is already enabled in the parent, and
if it shall be disabled in the parent then it has to be disabled in the
child first, otherwise it is considered busy when it is attempted to
remove it in the parent).
To make things complicated when invalidating a unit's cgroup membershup
systemd can actually turn off some controllers previously turned on at
the very same time as it turns on other controllers previously turned
off. In such a case we have to work up leaves-to-root *and*
root-to-leaves right after each other. With this patch this is
implemented: we still generally operate root-to-leaves, but as soon as
we noticed we successfully turned off a controller previously turned on
for a cgroup we'll re-enqueue the cgroup realization for all parents of
a unit, thus implementing leaves-to-root where necessary.
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.
I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
Pretty much everything uses just the first argument, and this doesn't make this
common pattern more complicated, but makes it simpler to pass multiple options.
From #10526:
$ sudo systemd-nspawn -i image
Spawning container image on /home/zbyszek/src/mkosi/image.
Press ^] three times within 1s to kill container.
Short read while reading cgroup mode.
We have the machine name anyway, let's use TerminateMachine() on
machined's Manager object directly with it. That way it's a single
method call only, instead of two, to terminate the machine.
When running a read-only file system, we might not be able to create
/var/log/journal. Do not fail on this, unless actually requested by the
--link-journal options.
$ systemd-nspawn --image=image.squashfs ...
All over the place we define local variables for the various sockopts
that take a bool-like "int" value. Sometimes they are const, sometimes
static, sometimes both, sometimes neither.
Let's clean this up, introduce a common const variable "const_int_one"
(as well as one matching "const_int_zero") and use it everywhere, all
acorss the codebase.
With this change almost all log messages that are suppressed through
--quiet are not actually suppressed anymore, but simply downgraded to
LOG_DEBUG. Previously we did it this way for some log messages and fully
suppressed them for others. With this it's pretty much systematic.
Inspired by #10122.
In seccomp code, the code is changed to propagate errors which are about
anything other than unknown/unimplemented syscalls. I *think* such errors
should not happen in normal usage, but so far we would summarilly ignore all
errors, so that part is uncertain. If it turns out that other errors occur and
should be ignored, this should be added later.
In nspawn, we would count the number of added filters, but didn't use this for
anything. Drop that part.
The comments suggested that seccomp_add_syscall_filter_item() returned negative
if the syscall is unknown, but this wasn't true: it returns 0.
The error at this point can only be if the syscall was known but couldn't be
added. If the error comes from our internal whitelist in nspawn, treat this as
error, because it means that our internal table is wrong. If the error comes
from user arguments, warn and ignore. (If some syscall is not known at current
architecture, it is still silently ignored.)
Our logs are full of:
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldstat() / -10037, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call get_thread_area() / -10076, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call set_thread_area() / -10079, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldfstat() / -10034, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldolduname() / -10036, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call oldlstat() / -10035, ignoring: Numerical argument out of domain
Sep 19 09:22:10 autopkgtest systemd[690]: Failed to add rule for system call waitpid() / -10073, ignoring: Numerical argument out of domain
...
This is pointless and makes debug logs hard to read. Let's keep the logs
in test code, but disable it in nspawn and pid1. This is done through a function
parameter because those functions operate recursively and it's not possible to
make the caller to log meaningfully.
There should be no functional change, except the skipped debug logs.
When a network namespace is needed, /sys is mounted as tmpfs (see commit
d8fc6a000f for details).
But in this case mode 755 was used as initial permissions for /sys whereas the
default mode for sysfs is 555.
In practice using 755 doesn't have any impact because /sys is mounted read-only
too but for consistency, let's use the correct mode.
Fixes: #10050
This is a bit like the info link in most of GNU's --help texts, but we
don't do info but man pages, and we make them properly clickable on
terminal supporting that, because awesome.
I think it's generally advisable to link up our (brief) --help texts and
our (more comprehensive) man pages a bit, so this should be an easy and
straight-forward way to do it.
Currently, mount_sysfs() only creates /sys/fs/cgroup if cg_ns_supported().
The comment explains that we need to "Create mountpoint for
cgroups. Otherwise we are not allowed since we remount /sys read-only.";
that is: that we need to do it now, rather than later. However, the
comment doesn't do anything to explain why we only need to do this if
cg_ns_supported(); shouldn't we _always_ need to do it?
The answer is that if !use_cgns, then this was already done by the outer
child, so mount_sysfs() only needs to do it if use_cgns. Now,
mount_sysfs() doesn't know whether use_cgns, but !cg_ns_supported() implies
!use_cgns, so we can optimize" the case where we _know_ !use_cgns, and deal
with a no-op mkdir_p() in the false-positive where cgns_supported() but
!use_cgns.
But is it really much of an optimization? We're potentially spending an
access(2) (cg_ns_supported() could be cached from a previous call) to
potentially save an lstat(2) and mkdir(2); and all of them are on virtual
fileystems, so they should all be pretty cheap.
So, simplify and drop the conditional. It's a dubious optimization that
requires more text to explain than it's worth.
One of the things that tmpfs_patch_options does is take an (optional) UID,
and insert "uid=${UID},gid=${UID}" into the options string. So we need a
uid_t argument, and a way of telling if we should use it. Fortunately,
that is built in to the uid_t type by having UID_INVALID as a possible
value.
So this is really a feature that requires one argument. Yet, it is somehow
taking 4! That is absurd. Simplify it to only take one argument, and have
that trickle all the way up to mount_all()'s usage.
Now, in may of the uses, the argument becomes
uid_shift == 0 ? UID_INVALID : uid_shift
because it used to treat uid_shift=0 as invalid unless the patch_ids flag
was also set. This keeps the behavior the same. Note that in all cases
where it is invoked, if !use_userns (sometimes called !userns), then
uid_shift is 0; we don't have to add any checks for that.
That said, I'm pretty sure that "uid=0" and not setting "uid=" are the
same, but Christian Brauner seemed to not think so when implementing the
cgns support. https://github.com/systemd/systemd/pull/3589
One of the things that mkdir_userns{,_p}() does is take an (optional) UID,
and chown the directory to that. So we need a uid_t argument, and a way of
telling if we should use that uid_t argument. Fortunately, that is built
in to the uid_t type by having UID_INVALID as a possible value.
However, currently mkdir_userns() also takes a MountSettingsMask and checks
a couple of bits in it to decide if it should perform the chown.
Drop the mask argument, and instead have the caller pass UID_INVALID if it
shouldn't chown.
nspawn as it is now is a generally useful tool, hence let's drop the
comments about it being useful for debug and so on only.
The new wording just makes the first sentence of the main page also the
summary.
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
Let's always write "1 << 0", "1 << 1" and so on, except where we need
more than 31 flag bits, where we write "UINT64(1) << 0", and so on to force
64bit values.
This is mostly fall-out from d1a1f0aaf0,
however some cases are older bugs.
There might be more issues lurking, this was a simple grep for "%m"
across the tree, with all lines removed that mention "errno" at all.
This makes it easier to link the nspawn implementation to the tests.
Right now this just means that nspawn-patch-uid.c is not compiled
twice, which is nice, but results in test-patch-uid being slightly bigger,
which is not nice. But in general, we should use convenience libs to
compile everything just once, as far as possible. Otherwise, once we
start compiling a few files here twice, and a few file there thrice, we
soon end up in a state where we are doing hundreds of extra compilations.
So let's do the "right" thing, even if is might not be more efficient.
Let's rework error handling a bit in image_find() and friends: when we
can't find an image, return -ENOENT rather than 0. That's better as
before we violated the usual rule in our codebase that return parameters
are initialized when the return value is >= 0 and otherwise not touched.
This also makes enumeration and validation a bit more strict: we'll only
accept ".raw" as suffix for regular files, and filter out this suffix
handling on directories/subvolumes, where it makes no sense.
This distuingishes two different classes of images, one for the purpose
of npsawn-like containers, i.e. "machines", and one for portable
services.
This distinction is mostly about search paths. We look for machine
images in /var/lib/machines and for portable images in
/var/lib/portables.
We already do this kind of validation in nspawn when we operate on a
plain directory, let's also do this on raw images under the same
condition: that we are about too boot the image. Also, do this when we
are about to read OS metadata from it.
If we log to the pty that is configured as stdin/stdout/stderr of the
container too early we risk filling it up in full before we start
processing the pty from the parent process, resulting in deadlocks.
Let's hence keep a copy of the original tty we were started on before
setting up stdin/stdout/stderr, so that we can log to it, and keep using
it as long as we can.
Since the kernel's pty internal buffer is pretty small this actually
triggered deadlocks when we debug logged at lot from nspawn's child
processes, see: https://github.com/systemd/systemd/pull/9024#issuecomment-390403674
With this change we won't use the pty at all, only the actual payload we
start will, and hence we won't deadlock on it, ever.
Let's make use of log_set_open_when_needed() in nspawn too, i.e. at the
point where we close logging because we are about to rearrange fds,
let's automatically reopen the logging fds when we need them, the same
way as we do that in the service manager. This makes things simpler and
more robust.
This makes most header files easier to look at. Also Emacs gets really
slow when browsing through large sections of overly long prototypes,
which is much improved by this macro.
We should probably not do something similar with too many other cases,
as macros like this might help readability for some, but make it worse
for others. But I think given the complexity of this specific prototype
and how often we use it, it's worth doing.
The use of UINT64_C() in the SettingsMask enum definition is misleading:
it does not mean that individual fields have this width. E.g., with
enum {
FOO = UINT64_C(1)
}
sizeof(FOO) gives 4. It only means that the shift is done properly. So
1 << 35 is undefined, but UINT64_C(1) << 35 is the expected 64 bit
constant. Thus, the use UINT64_C() is useful, because we know that the shifts
are done properly, no matter what the value of _RLIMIT_MAX is, but when those
fields are used in expressions, we don't know what size they will be
(probably 4). Let's add a define which "hides" the enum definition behind a
define which gives the same value but is actually 64 bit. I think this is a
nicer solution than requiring all users to cast SETTING_RLIMIT_FIRST before
use.
Fixes#9035.
Let's separate the loading of the settings object and the merging into
our arg_xyz fields into two.
This will become particularly useful when we eventually are able to load
settings from OCI runtime files in addition to .nspawn files.
Similar as the other options added before, this is primarily useful to
provide comprehensive OCI runtime compatbility, but might be useful
otherwise, too.
This simply controls the PR_SET_NO_NEW_PRIVS flag for the container.
This too is primarily relevant to provide OCI runtime compaitiblity, but
might have other uses too, in particular as it nicely complements the
existing --capability= and --drop-capability= flags.