This macro will read a pointer of any type, return it, and set the
pointer to NULL. This is useful as an explicit concept of passing
ownership of a memory area between pointers.
This takes inspiration from Rust:
https://doc.rust-lang.org/std/option/enum.Option.html#method.take
and was suggested by Alan Jenkins (@sourcejedi).
It drops ~160 lines of code from our codebase, which makes me like it.
Also, I think it clarifies passing of ownership, and thus helps
readability a bit (at least for the initiated who know the new macro).
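A minimal sketch of such a macro, for illustration only. It is called
TAKE_PTR() here, and make_greeting() is a hypothetical helper. The sketch
relies on the GNU C extensions for statement expressions and typeof, which
GCC and Clang both accept; the definition used in the tree may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Read the pointer, reset the variable to NULL, and yield the old value. */
#define TAKE_PTR(ptr)                                   \
        ({                                              \
                __typeof__(ptr) _ptr_ = (ptr);          \
                (ptr) = NULL;                           \
                _ptr_;                                  \
        })

/* Hypothetical helper: builds a string and hands ownership to the caller. */
static char *make_greeting(void) {
        char *s = malloc(6);
        if (!s)
                return NULL;
        memcpy(s, "hello", 6);

        /* Further fallible setup would go here; on error we would free(s). */

        return TAKE_PTR(s); /* s is NULL from here on, the caller owns the memory */
}
```

A caller that stores the result and later does `other = TAKE_PTR(mine);`
makes the hand-over explicit: `mine` is NULL afterwards, so a cleanup
handler attached to it has nothing left to free.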
Let's remove a number of synchronization points from our service
startups: let's drop synchronous match installation, and let's opt for
asynchronous instead.
Also, let's use sd_bus_match_signal() instead of sd_bus_add_match()
where we can.
This renames wait_for_terminate_and_warn() to
wait_for_terminate_and_check(), and adds a flags parameter that
controls how much to log: one flag means we log about abnormal
termination, and another one controls whether we log about non-zero
exit codes. Finally, there's a shortcut flag value for logging in both
cases, as that's what we usually use.
All callers are updated accordingly. On three occasions duplicate logging
is removed, i.e. where the old function was called but the caller logged,
too.
Add a new option `--network-namespace-path` to systemd-nspawn to allow
users to specify an arbitrary network namespace, e.g. `/run/netns/foo`.
Then systemd-nspawn will open the netns file, pass the fd to
outer_child, and enter the namespace represented by the fd before
running inner_child.
```
$ sudo ip netns add foo
$ mount | grep /run/netns/foo
nsfs on /run/netns/foo type nsfs (rw)
...
$ sudo systemd-nspawn -D /srv/fc27 --network-namespace-path=/run/netns/foo \
/bin/readlink -f /proc/self/ns/net
/proc/1/ns/net:[4026532009]
```
Note that the option `--network-namespace-path=` cannot be used together
with other network-related options such as `--private-network` so that
the options do not conflict with each other.
Fixes https://github.com/systemd/systemd/issues/7361
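The open-then-enter sequence described above might look roughly like
this. The helper names open_netns() and enter_netns() are made up for the
sketch; setns() with CLONE_NEWNET is the standard Linux call for joining a
network namespace and requires CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Open a network namespace file such as /run/netns/foo.
 * Returns a file descriptor, or a negative errno value. */
static int open_netns(const char *path) {
        int fd = open(path, O_RDONLY | O_NOCTTY | O_CLOEXEC);
        return fd < 0 ? -errno : fd;
}

/* Join the namespace the fd refers to. Passing CLONE_NEWNET makes the
 * kernel reject fds that do not refer to a network namespace. */
static int enter_netns(int fd) {
        if (setns(fd, CLONE_NEWNET) < 0)
                return -errno;
        return 0;
}
```

Opening the file early and passing only the fd along means the path is
resolved once, before any mount namespace changes take place.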
If user namespacing is used, let's make sure that the root user in the
container gets access to both /sys/fs/cgroup/systemd and
/sys/fs/cgroup/unified.
This matches similar logic in cg_set_access().
In -U mode we might need to re-chown() all files and directories to
match the UID shift we want for the image. That's problematic on FAT
partitions, such as the ESP (which mkosi's --bootable switch
generates), because FAT of course knows no UID/GID file
ownership natively.
With this change we take advantage of the uid= and gid= mount options FAT
knows: instead of chown()ing all files and directories we can just
specify the right UID/GID to use at mount time.
This beefs up the image dissection logic in two ways:
1. First of all, support for mounting relevant file systems with
uid=/gid= is added: when a UID is specified during mount, it is used for
all applicable file systems.
2. Secondly, two new mount flags are added:
DISSECT_IMAGE_MOUNT_ROOT_ONLY and DISSECT_IMAGE_MOUNT_NON_ROOT_ONLY.
If one is specified the mount routine will either only mount the root
partition of an image, or all partitions except the root partition.
This is used by nspawn: first the root partition is mounted, so that
we can determine the UID shift in use so far, based on ownership of
the image's root directory. Then, we mount the remaining partitions
in a second go, this time with the right UID/GID information.
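As a rough sketch of the uid=/gid= part: the option string only gets
generated for file systems that lack native ownership. The helper names
and the short file system list below are assumptions for illustration,
not the list the dissection code actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Illustrative subset of file systems that accept uid=/gid= mount options. */
static bool fstype_takes_uid_gid(const char *fstype) {
        return strcmp(fstype, "vfat") == 0 || strcmp(fstype, "msdos") == 0;
}

/* Writes the mount options for the given file system into buf.
 * uid == (uid_t) -1 means "no UID shift requested".
 * Returns 0 on success, -1 if the buffer is too small. */
static int make_mount_options(const char *fstype, uid_t uid, char *buf, size_t size) {
        int n;

        if (size == 0)
                return -1;
        buf[0] = 0;

        if (uid == (uid_t) -1 || !fstype_takes_uid_gid(fstype))
                return 0; /* nothing to add, ownership is handled elsewhere */

        n = snprintf(buf, size, "uid=%lu,gid=%lu", (unsigned long) uid, (unsigned long) uid);
        return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

For a vfat partition and a shift of 65536 this yields
"uid=65536,gid=65536", while ext4 gets an empty option string.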
If we operate on a disk image (i.e. --image=) then it's pointless to
look into the mount directory before it is actually mounted to see which
systemd version is running inside...
Unfortunately we only mount the disk image in the child process, but the
parent needs to know the cgroup mode, hence add some IPC for this
purpose and communicate the cgroup mode determined from the image back
to the parent.
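The kind of IPC meant here can be sketched with a socket pair: the child
determines the mode and sends it to the parent. The CGroupMode enum and
detect_mode_in_child() are placeholders invented for this example; the
real detection would inspect the mounted image.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustrative stand-in for the cgroup modes. */
typedef enum CGroupMode { CGROUP_MODE_LEGACY, CGROUP_MODE_HYBRID, CGROUP_MODE_UNIFIED } CGroupMode;

/* Placeholder: real code would mount the image and look inside it. */
static CGroupMode detect_mode_in_child(void) {
        return CGROUP_MODE_UNIFIED;
}

/* Forks a child, lets it determine the mode, and reads the result back.
 * Returns 0 on success, or a negative errno value. */
static int query_mode_via_child(CGroupMode *ret) {
        int fds[2], saved;
        CGroupMode m;
        ssize_t n;
        pid_t pid;

        if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, fds) < 0)
                return -errno;

        pid = fork();
        if (pid < 0) {
                saved = errno;
                close(fds[0]);
                close(fds[1]);
                return -saved;
        }

        if (pid == 0) { /* child: report the mode and exit */
                close(fds[0]);
                m = detect_mode_in_child();
                n = send(fds[1], &m, sizeof m, MSG_NOSIGNAL);
                _exit(n == (ssize_t) sizeof m ? EXIT_SUCCESS : EXIT_FAILURE);
        }

        close(fds[1]); /* parent: wait for the child's message */
        n = recv(fds[0], &m, sizeof m, 0);
        saved = errno;
        close(fds[0]);
        (void) waitpid(pid, NULL, 0);

        if (n < 0)
                return -saved;
        if (n != (ssize_t) sizeof m)
                return -EIO; /* child died before reporting */

        *ret = m;
        return 0;
}
```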
When "-U" is used we look for a UID range we can use for our container.
So far, we started with the UID the tree is already assigned to, and if
that didn't work we picked random ranges. With this change we'll first
try to hash a suitable range from the container name, and use that if it
works, in order to make UID assignments more likely to be stable.
This follows logic similar to what PID 1 does when using DynamicUser=1.
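To illustrate the idea of deriving a stable range from the name, here is
a small sketch. The range bounds, the 64K granularity and the use of
FNV-1a are all assumptions picked for the example; the hash function and
limits used by nspawn may well differ.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bounds and granularity of container UID ranges (illustrative). */
#define UID_BASE_MIN UINT32_C(0x00080000)
#define UID_BASE_MAX UINT32_C(0x6FFF0000)
#define UID_RANGE    UINT32_C(0x10000)

/* FNV-1a, used here only as a simple stand-in hash. */
static uint32_t fnv1a(const char *s) {
        uint32_t h = UINT32_C(2166136261);

        for (; *s; s++) {
                h ^= (unsigned char) *s;
                h *= UINT32_C(16777619);
        }
        return h;
}

/* Maps a container name to a candidate UID base, aligned to UID_RANGE. */
static uint32_t uid_base_from_name(const char *name) {
        uint32_t slots = (UID_BASE_MAX - UID_BASE_MIN) / UID_RANGE + 1;

        return UID_BASE_MIN + (fnv1a(name) % slots) * UID_RANGE;
}
```

The same name always maps to the same candidate. The caller would still
have to verify that the range is actually free before using it.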
Since time began, scope units had a concept of "Controllers", a bus peer
that would be notified when somebody requested a unit to stop. None of
our code used that facility so far, let's change that.
This way, nspawn can print a nice message when somebody invokes
"systemctl stop" on the container's scope unit, and then react with the
right action to shut it down.
We currently use the ownership of the top-level directory as a hint
whether we need to descend into the whole tree to chown() it recursively
or not. This is problematic with the previous chown()ing algorithm, as
when descending into the tree we'd first chown() and then descend
further down, which meant that the top-level directory would be chowned
first, and an aborted recursive chowning would appear on the next
invocation as successful, even though it was not. Let's reshuffle things
a bit, to make the re-chown()ing safe regarding interruptions:
a) We chown() the dir we are looking at last, and descend into all its
children first. That way we know that if the top-level dir is
properly owned everything inside of it is properly owned too.
b) Before starting a chown()ing operation, we mark the top-level
directory as owned by a special "busy" UID range, which we can use to
recognize whether a tree was fully chowned: if it is marked as busy,
it's definitely not fully chowned, as the busy ownership will only be
fixed as final step of the chowning.
Fixes: #6292
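The ordering from a) and b) can be modelled in memory to see why it is
safe against interruptions. This is a toy model, not the actual
implementation: Node, UID_BUSY and the "budget" that simulates an abort
after a number of ownership changes are all invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define UID_BUSY 0xFFFE0000u /* hypothetical marker for "chown in progress" */

typedef struct Node {
        unsigned owner;
        struct Node *children[4];
        size_t n_children;
} Node;

/* Each ownership change consumes one unit of budget; running out of
 * budget simulates the operation being interrupted. */
static bool set_owner(Node *n, unsigned uid, int *budget) {
        if (*budget <= 0)
                return false;
        (*budget)--;
        n->owner = uid;
        return true;
}

/* Step a): descend into all children first, chown the node itself last. */
static bool chown_children_first(Node *n, unsigned uid, int *budget) {
        for (size_t i = 0; i < n->n_children; i++)
                if (!chown_children_first(n->children[i], uid, budget))
                        return false;
        return set_owner(n, uid, budget);
}

/* Step b): mark the top-level node busy, then chown bottom-up. The busy
 * mark is only replaced by the very last ownership change. */
static bool chown_tree(Node *root, unsigned uid, int *budget) {
        if (!set_owner(root, UID_BUSY, budget))
                return false;
        return chown_children_first(root, uid, budget);
}

/* True if the node and everything below it is owned by uid. */
static bool tree_owned_by(const Node *n, unsigned uid) {
        if (n->owner != uid)
                return false;
        for (size_t i = 0; i < n->n_children; i++)
                if (!tree_owned_by(n->children[i], uid))
                        return false;
        return true;
}
```

Whenever the run is cut short after it has started, the root is left
owned by UID_BUSY, so the next invocation cannot mistake the tree for a
fully chowned one.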
This adds some simple detection logic for cases where dissection is
invoked on an externally created loop device, and partitions have been
detected on it, but partition scanning so far was off. If this is
detected we now print a brief message indicating what the issue is,
instead of failing with a useless EINVAL message the kernel passed to
us.
This adds some basic discovery of block device images for nspawn and
friends. Note that this doesn't add searching for block devices using
udev, but instead expects users to symlink relevant block devices into
/var/lib/machines. Discovery is hence done exactly like for
dir/subvol/raw file images, except that what is found may be a (symlink
to) a block device.
For now, we do not support cloning these images, but removal, renaming
and read-only flags are supported to the point where that makes sense.
Fixes: #6990
The advantage is that if the name is misspelled, cpp will warn us.
$ git grep -Ee "conf.set\('(HAVE|ENABLE)_" -l|xargs sed -r -i "s/conf.set\('(HAVE|ENABLE)_/conf.set10('\1_/"
$ git grep -Ee '#ifn?def (HAVE|ENABLE)' -l|xargs sed -r -i 's/#ifdef (HAVE|ENABLE)/#if \1/; s/#ifndef (HAVE|ENABLE)/#if ! \1/;'
$ git grep -Ee 'if.*defined\(HAVE' -l|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_]*)\)/\1/g'
$ git grep -Ee 'if.*defined\(ENABLE' -l|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_]*)\)/\1/g'
+ manual changes to meson.build
squash! build-sys: use #if Y instead of #ifdef Y everywhere
v2:
- fix incorrect setting of HAVE_LIBIDN2
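A small illustration of the resulting style. The two macro names are
hypothetical; in a real build they would come from the generated
config.h, where conf.set10() always defines the symbol as either 0 or 1.
The misspelling warning depends on the compiler being run with -Wundef.

```c
#include <assert.h>

/* Stand-ins for what the generated config.h would provide. */
#define HAVE_EXAMPLE_LIB 1
#define ENABLE_EXAMPLE_FEATURE 0

static int example_lib_available(void) {
#if HAVE_EXAMPLE_LIB
        return 1;
#else
        return 0;
#endif
}

static int example_feature_enabled(void) {
        /* With #ifdef, a typo such as ENABLE_EXAMPEL_FEATURE would silently
         * evaluate to "not defined". With #if and -Wundef it gets flagged. */
#if ENABLE_EXAMPLE_FEATURE
        return 1;
#else
        return 0;
#endif
}
```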
Keyring material should not leak into the container. So far we relied on
seccomp to deny access to the keyring, but given that we now made the
seccomp filter configurable, and access to keyctl() and friends may
optionally be permitted to containers now, let's make sure we disconnect
the caller's keyring from the keyring of PID 1 in the container.
Now that we have ported nspawn's seccomp code to the generic code in
seccomp-util, let's extend it to support whitelisting and blacklisting
of specific additional syscalls.
This uses syntax similar to PID1's support for system call filtering,
but in contrast to that always implements a blacklist (and not a
whitelist), as we prepopulate the filter with a blacklist, and the
unit's system call filter logic does not come with anything
prepopulated.
(Later on we might actually want to invert the logic here, and
whitelist rather than blacklist things, but at this point let's not do
that. In case we switch this over later, the syscall add/remove logic of
this commit should be compatible conceptually.)
Fixes: #5163
Replaces: #5944
glibc appears to propagate different errors in different ways, let's fix
this up, so that our own code doesn't get confused by this.
See #6752 + #6737 for details.
Fixes: #6755
Given that we set NOTIFY_SOCKET unconditionally it's not surprising that
processes way down the process tree think it's smart to send us a
notification message.
It's still useful to keep this message, for debugging things, but it
shouldn't be generated by default.
Previously, only when --register=yes was set (the default) the invoked
container would get its own scope, created by machined on behalf of
nspawn. With this change if --register=no is set nspawn will still get
its own scope (which is a good thing, so that --slice= and --property=
take effect), but this is not done through machined but by registering a
scope unit directly in PID 1.
Summary:
--register=yes → allocate a new scope through machined (the default)
--register=yes --keep-unit → use the unit we are already running in and register with machined
--register=no → allocate a new scope directly, but no machined
--register=no --keep-unit → do not allocate nor register anything
Fixes: #5823
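The summary table above maps directly onto a two-boolean decision. The
enum and function names below are invented for illustration only:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum ScopeAction {
        SCOPE_ALLOCATE_VIA_MACHINED, /* --register=yes: new scope, created by machined */
        SCOPE_REGISTER_CURRENT_UNIT, /* --register=yes --keep-unit: reuse unit, register with machined */
        SCOPE_ALLOCATE_DIRECTLY,     /* --register=no: new scope straight from PID 1, no machined */
        SCOPE_NOTHING,               /* --register=no --keep-unit: neither allocate nor register */
} ScopeAction;

/* "register" is a C keyword, hence the do_register parameter name. */
static ScopeAction scope_action(bool do_register, bool keep_unit) {
        if (do_register)
                return keep_unit ? SCOPE_REGISTER_CURRENT_UNIT : SCOPE_ALLOCATE_VIA_MACHINED;

        return keep_unit ? SCOPE_NOTHING : SCOPE_ALLOCATE_DIRECTLY;
}
```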
When using pkg-config to determine the include flags for blkid the
flags are returned as:
$ pkg-config blkid --cflags
-I/usr/include/blkid -I/usr/include/uuid
We use the <blkid/blkid.h> include which would be correct when using
the default compiler /usr/include header search path. However, when
cross-compiling, blkid.h will not be installed in /usr/include but
most likely in a temporary system root. The problem is further compounded
if the cross-compile packages are split up and the blkid package is not
available in the same sysroot as the compiler.
Regardless of the compilation setup, the correct include path should be
<blkid.h> if using the pkg-config returned CFLAGS.
We use our cgroup APIs in various contexts, including from our libraries
sd-login, sd-bus. As we don't control those environments we can't rely
on the unified cgroup setup logic succeeding, and hence really shouldn't
assert on it.
This more or less reverts 415fc41cea.
Currently the hybrid mode mounts cgroup v2 on /sys/fs/cgroup instead of the v1
name=systemd hierarchy. While this works fine for systemd itself, it breaks
tools which expect cgroup v1 hierarchy on /sys/fs/cgroup/systemd.
This patch updates the hybrid mode so that it mounts v2 hierarchy on
/sys/fs/cgroup/unified and keeps v1 "name=systemd" hierarchy on
/sys/fs/cgroup/systemd for compatibility. systemd itself doesn't depend on the
"name=systemd" hierarchy at all. All operations take place on the v2 hierarchy
as before but the v1 hierarchy is kept in sync so that any tools which expect
it to be there can keep doing so. This allows systemd to take advantage of
cgroup v2 process management without requiring other tools to be aware of the
hybrid mode.
The hybrid mode is implemented by mapping the special systemd controller to
/sys/fs/cgroup/unified and making the basic cgroup utility operations -
cg_attach(), cg_create(), cg_rmdir() and cg_trim() - also operate on the
/sys/fs/cgroup/systemd hierarchy whenever the cgroup2 hierarchy is updated.
While a bit messy, this will allow dropping complications from using cgroup v1
for process management a lot sooner than otherwise possible which should make
it a net gain in terms of maintainability.
v2: Fixed !cgns breakage reported by @evverx and renamed the unified mount
point to /sys/fs/cgroup/unified as suggested by @brauner.
v3: chown the compat hierarchy too on delegation. Suggested by @evverx.
v4: [zj]
- drop the change to default, full "legacy" is still the default.
cg_[all_]unified() test whether a specific controller or all controllers are on
the unified hierarchy. While what's being asked is a simple binary question,
the callers must assume that the functions may fail at any time, which
unnecessarily complicates their usage.
Internally, the test result is cached anyway and there are only a few places
where the test actually needs to be performed.
This patch simplifies cg_[all_]unified().
* cg_[all_]unified() are updated to return bool. If the result can't be
decided, an assertion failure is triggered. Error handling in their
callers is dropped.
* cg_unified_flush() is updated to calculate the new result synchronously and
return whether it succeeded or not. Places which need to flush the test
result are updated to test for failure. This ensures that all the following
cg_[all_]unified() tests succeed.
* Places which expected possible cg_[all_]unified() failures are updated to
call and test cg_unified_flush() before calling cg_[all_]unified(). This
includes functions used while setting up mounts during boot and
manager_setup_cgroup().
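The split between a fallible flush and an infallible query can be
sketched like this. It is a simplified model: it only distinguishes
"fully unified" from "not unified" via statfs(), takes the path as a
parameter so it can be exercised outside of /sys/fs/cgroup, and uses
made-up names throughout.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/vfs.h>

#ifndef CGROUP2_SUPER_MAGIC
#define CGROUP2_SUPER_MAGIC 0x63677270
#endif

typedef enum UnifiedCache {
        UNIFIED_UNKNOWN = -1,
        UNIFIED_NONE    = 0,
        UNIFIED_ALL     = 1,
} UnifiedCache;

static UnifiedCache unified_cache = UNIFIED_UNKNOWN;

/* Recomputes the cached result synchronously. This is the only place
 * that can fail. Returns 0 on success, or a negative errno value. */
static int unified_flush_at(const char *path) {
        struct statfs fs;

        unified_cache = UNIFIED_UNKNOWN;

        if (statfs(path, &fs) < 0)
                return -errno;

        unified_cache = (unsigned long) fs.f_type == (unsigned long) CGROUP2_SUPER_MAGIC
                ? UNIFIED_ALL : UNIFIED_NONE;
        return 0;
}

/* A plain binary question. Callers must have flushed successfully
 * beforehand, otherwise the assertion triggers. */
static bool all_unified(void) {
        assert(unified_cache != UNIFIED_UNKNOWN);
        return unified_cache == UNIFIED_ALL;
}
```

Code that runs early, before the cgroup file systems are known to be
mounted, would call the flush function and check its result once; all
later queries can then be used in plain conditions.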
cgroup mode detection is broken in two different ways.
* detect_unified_cgroup_hierarchy() is called too deep in the call chain, in outer_child().
sync_cgroup() which is used by run() also needs to know the requested cgroup
mode but it's currently always getting CGROUP_UNIFIED_UNKNOWN. This makes it
skip syncing the inner cgroup hierarchy on some config combinations.
$ cat /proc/self/cgroup | grep systemd
1:name=systemd:/user.slice/user-0.slice/session-c1.scope
$ UNIFIED_CGROUP_HIERARCHY=0 SYSTEMD_NSPAWN_USE_CGNS=0 systemd-nspawn -M container
...
[root@container ~]# cat /proc/self/cgroup | grep systemd
1:name=systemd:/machine.slice/machine-container.x86_64.scope
$ exit
$ UNIFIED_CGROUP_HIERARCHY=1 SYSTEMD_NSPAWN_USE_CGNS=0 systemd-nspawn -M container
[root@container ~]# cat /proc/self/cgroup | grep 0::
0::/
$ exit
Note how the unified hierarchy case's path is not synchronized with the host.
This for example can cause issues when there are multiple such containers.
Fixed by moving detect_unified_cgroup_hierarchy() invocation to main().
* inner_child() was invoking cg_unified_flush(). inner_child() executes fully
scoped and can't determine which cgroup mode the host was in. It doesn't
make sense to keep flushing the detected mode when the host mode can't
change.
Fixed by replacing cg_unified_flush() invocations in outer_child() and
inner_child() with one in main().