If /sys is read only filesystem, e.g., nspawn is running in container,
then usually udev is not running. In such a case, let's assume that
the interface is already initialized. Also, this makes nspawn refuse
to use the network interface which is under renaming.
Fixes#14223.
This will call json_variant_sensitive() internally while parsing for
each allocated sub-variant. This is better than calling it a posteriori
at the end, because partially parsed variants will always be properly
erased from memory this way.
The commit:
a3fc6b55ac nspawn: mask out CAP_NET_ADMIN again if settings file turns off private networking
turned off the CAP_NET_ADMIN capability whenever no private networking
feature was enabled. This broke configurations where the CAP_NET_ADMIN
capability was explicitly requested in the configuration.
Changing the order of evalution here to allow the Capability= setting
to overrule this implicit setting:
Order of evaluation:
1. if no private network setting is enabled, CAP_NET_ADMIN is removed
2. if a private network setting is enabled, CAP_NET_ADMIN is added
3. the settings of Capability= are added
4. the settings of DropCapability= are removed
This allows the fix for #11755 to be retained and to still allow the
admin to specify CAP_NET_ADMIN as additional capability.
Fixes: a3fc6b55acFixes: #13995
Initially I thought this is a good idea, but when reviewing a different PR
(https://github.com/systemd/systemd/pull/13862#discussion_r340604313) I changed
my mind about this. At some point we probably should start warning about the
old option name, and yet later remove it. But it'll make it easier for people
to transition to the new option name if there's a period of support for both
names without any fuss. There's nothing particularly wrong about the old name,
and there is no support cost.
Fixes#13919 (by avoiding the issue completely).
It's user-facing, parsed from the command line and we typically mangle
in these cases, let's do so here too. (In particular as the identical
switch for systemd-run already does it.)
We already shut the machine down ourselves (and pid1 will also do
cleanup for us after we exit if anything was left behind). No need for
systemd-machined to try to stop the unit too.
(This calls the new machined method. If we are running against an older
machined, we will not deregister the machine. If we are simply exiting,
machined should notice that the unit is gone on its own. If we are restarting,
we will fail to register the machine after restart and fail. But this case
was already broken, because machined would create a stop job, breaking the
restart. So not doing anything with old machined should not make anything
more broken than it already is.)
Fixes#13766.
See https://bugzilla.redhat.com/show_bug.cgi?id=1763488: when we say that
'foo@*.service' is not a valid unit name, this is not clear enough. Let's
include the name of the operation that does not support globbing in the
error message:
$ build/systemctl enable 'foo@*.service'
Glob pattern passed to enable, but globs are not supported for this.
Invalid unit name "foo@*.service" escaped as "foo@\x2a.service".
...
chase_symlinks() would return negative on error, and either a non-negative status
or a non-negative fd when CHASE_OPEN was given. This made the interface quite
complicated, because dependning on the flags used, we would get two different
"types" of return object. Coverity was always confused by this, and flagged
every use of chase_symlinks() without CHASE_OPEN as a resource leak (because it
would this that an fd is returned). This patch uses a saparate output parameter,
so there is no confusion.
(I think it is OK to have functions which return either an error or an fd. It's
only returning *either* an fd or a non-fd that is confusing.)
We should never have used an unprefixed environment variable name.
All other systemd-nspawn variables have the "SYSTEMD_NSPAWN_" prefix,
and all other systemd variables have the "SYSTEMD_" prefix.
The new variable name takes precedence, but we fall back to checking the
old one. If only the old one is found, a warning is emitted.
In addition, SYSTEMD_NSPAWN_UNIFIED_HIERARCHY="" is accepted as an override
to avoid looking for the old variable name.
We have a variable with the same name ($UNIFIED_CGROUP_HIERARCHY) in tests,
which governs both systemd-nspawn and qemu behaviour. It is not renamed.
We would parse the environment twice (to re-apply settings after reading
config from disk), but we would not check the return code first time.
This means that for some settings we would ignore invalid values, while
for others, we'd fail at some point.
Let's just consistently fail. Those environment variables define important
aspects of behaviour, and it is better for the user if we ignore invalid
values. (Unknown settings are still ignored, so forward compatibility is
maintained.)
This way less stuff needs to be in basic. Initially, I wanted to move all the
parts of cgroup-utils.[ch] that depend on efivars.[ch] to shared, because
efivars.[ch] is in shared/. Later on, I decide to split efivars.[ch], so the
move done in this patch is not necessary anymore. Nevertheless, it is still
valid on its own. If at some point we want to expose libbasic, it is better to
to not have stuff that belong in libshared there.
This avoid the use of the global variable.
Also rename cgroup_unified_update() to cgroup_unified_cached() and
cgroup_unified_flush() to cgroup_unified() to better reflect their new roles.
This reverts commit 72d967df3e.
Revert this because it broke the `norbind` option of the bind flags
because it does bind-mounts unconditionally recursive.
Let's bring the old logic back.
Fixes: #13170
If execve failed, we would die in safe_close(), because master was already
closed by fdset_close_others() on line 3123. IIUC, we don't need to keep the
fd open after sending it, so let's just close it immediately.
Reproducer:
sudo build/systemd-nspawn -M rawhide fooooooo
Fixup for 3acc84ebd9.
prefix_root() is equivalent to path_join() in almost all ways, hence
let's remove it.
There are subtle differences though: prefix_root() will try shorten
multiple "/" before and after the prefix. path_join() doesn't do that.
This means prefix_root() might return a string shorter than both its
inputs combined, while path_join() never does that. I like the
path_join() semantics better, hence I think dropping prefix_root() is
totally OK. In the end the strings generated by both functon should
always be identical in terms of path_equal() if not streq().
This leaves prefix_roota() in place. Ideally we'd have path_joina(), but
I don't think we can reasonably implement that as a macro. or maybe we
can? (if so, sounds like something for a later PR)
Also add in a few missing OOM checks
The OCI changes in #9762 broke a use case in which we use nspawn from
inside a container that has dropped capabilities from the bounding set
that nspawn expected to retain. In an attempt to keep OCI compliance
and support our use case, I made hard failing on setting capabilities
not in the bounding set optional (hard fail if using OCI and log only
if using nspawn cmdline).
Fixes#12539
The console tty is now allocated from within the container so it's not
necessary anymore to allocate it from the host and bind mount the pty slave
into the container. The pty master is sent to the host.
/dev/console is now a symlink pointing to the pty slave.
This might also be less confusing for applications running inside the container
and the overall result looks cleaner (we don't need to apply manually the
passed selinux context, if any, to the allocated pty for instance).
The CPU_SET_S api is pretty bad. In particular, it has a parameter for the size
of the array, but operations which take two (CPU_EQUAL_S) or even three arrays
(CPU_{AND,OR,XOR}_S) still take just one size. This means that all arrays must
be of the same size, or buffer overruns will occur. This is exactly what our
code would do, if it received an array of unexpected size over the network.
("Unexpected" here means anything different from what cpu_set_malloc() detects
as the "right" size.)
Let's rework this, and store the size in bytes of the allocated storage area.
The code will now parse any number up to 8191, independently of what the current
kernel supports. This matches the kernel maximum setting for any architecture,
to make things more portable.
Fixes#12605.
Since nspawn-settings.h includes seccomp.h, any file that includes
nspawn-settings.h should depend on libseccomp so the correct header path where
seccomp.h lives is added to the header search paths.
It's especially important for distros such as openSUSE where seccomp.h is not
shipped in /usr/include but /usr/include/libseccomp.
This patch is similar to 8238423095.
There's no point in returning the "key" within each loop iteration as
JsonVariant object. Let's simplify things and return it as string. That
simplifies usage (since the caller doesn't have to convert the object to
the string anymore) and is safe since we already validate that keys are
strings when an object JsonVariant is allocated.
We noticed in our tests that occasionally SystemCallFilter= would
fail to set and the service would run with no syscall filtering.
Most of the time the same tests would apply the filter and fail
the service as expected. While it's not totally clear why this happens,
we noticed seccomp_load() in the systemd code base would fail open for
all errors except EPERM and EACCES.
ENOMEM, EINVAL, and EFAULT seem like reasonable values to add to the
error set based on what I gather from libseccomp code and man pages:
-ENOMEM: out of memory, failed to allocate space for a libseccomp structure, or would exceed a defined constant
-EINVAL: kernel isn't configured to support the operations, args are invalid (to seccomp_load(), seccomp(), or prctl())
-EFAULT: addresses passed as args are invalid
/tmp might not be mounted at all yet (given that we support
SYSTEMD_NSPAWN_TMPFS_TMP=0 to turn this off), and /tmp is a dir systemd
usually tries to unmount during shutdown (unlike /run), and we shouldn't
keep it busy. Hence let's just move these deleted files to /run so that
we don't keep /tmp needlessly busy.
Some chattrs only work sensible if you set them right after opening a
file for create (think: FS_NOCOW_FL). Others only work when they are
applied when the file is fully written (think: FS_IMMUTABLE_FL). Let's
take that into account when copying files and applying a chattr to them.
The host mounts it like that, nspawn hence should do too.
Moreover, mount the file system after doing CLONEW_NEWIPC so that it
actually reflects the right mqueues. Finally, mount it wthout
considering it fatal, since POSIX mqueue support is little used and it
should be fine not to support it in the kernel.
The function is otherwise generic enough to toggle other bind mount
flags beyond MS_RDONLY (for example: MS_NOSUID or MS_NODEV), hence let's
beef it up slightly to support that too.
Previously both run() and run_container() would free 'fds'. Let's fix
that, and let run() free it but make run_container() already remove all
fds from it, because that's what we actually want to do.
Fixes: #12073