Since cryptsetup 2.3.0 a new API to verify dm-verity volumes by a
pkcs7 signature, with the public key in the kernel keyring,
is available. Use it if libcryptsetup supports it.
https://tools.ietf.org/html/draft-knodel-terminology-02https://lwn.net/Articles/823224/
This gets rid of most but not occasions of these loaded terms:
1. scsi_id and friends are something that is supposed to be removed from
our tree (see #7594)
2. The test suite defines an API used by the ubuntu CI. We can remove
this too later, but this needs to be done in sync with the ubuntu CI.
3. In some cases the terms are part of APIs we call or where we expose
concepts the kernel names the way it names them. (In particular all
remaining uses of the word "slave" in our codebase are like this,
it's used by the POSIX PTY layer, by the network subsystem, the mount
API and the block device subsystem). Getting rid of the term in these
contexts would mean doing some major fixes of the kernel ABI first.
Regarding the replacements: when whitelist/blacklist is used as noun we
replace with with allow list/deny list, and when used as verb with
allow-list/deny-list.
Let's make sure $XDG_RUNTIME_DIR for the user instance and /run for the
system instance is always organized the same way: the "inaccessible"
device nodes should be placed in a subdir of either called "systemd" and
a subdir of that called "inaccessible".
This way we can emphasize the common behaviour, and only differ where
really necessary.
Follow-up for #13823
dm-verity support in dissect-image at the moment is restricted to GPT
volumes.
If the image a single-filesystem type without a partition table (eg: squashfs)
and a roothash/verity file are passed, set the verity flag and mark as
read-only.
We always need to make them unions with a "struct cmsghdr" in them, so
that things properly aligned. Otherwise we might end up at an unaligned
address and the counting goes all wrong, possibly making the kernel
refuse our buffers.
Also, let's make sure we initialize the control buffers to zero when
sending, but leave them uninitialized when reading.
Both the alignment and the initialization thing is mentioned in the
cmsg(3) man page.
Consider such configuration:
$ systemd-nspawn --read-only --timezone=copy --resolv-conf=copy-host \
--overlay="+/etc::/etc" <...>
Assuming one wants `/` to be read-only, DNS and `/etc/localtime` to
work. One way to do it is to create an overlay filesystem in `/etc/`.
However, systemd-nspawn tries to create `/etc/resolv.conf` and
`/etc/localtime` before mounting the custom paths, while `/` (and, by
extension, `/etc`) is read-only. Thus it fails to create those files.
Mounting custom paths before modifying anything in `/etc/` makes this
possible.
Full example:
```
$ debootstrap buster /var/lib/machines/t1 http://deb.debian.org/debian
$ systemd-nspawn --private-users=false --timezone=copy --resolv-conf=copy-host --read-only --tmpfs=/var --tmpfs=/run --overlay="+/etc::/etc" -D /var/lib/machines/t1 ping -c 1 example.com
Spawning container t1 on /var/lib/machines/t1.
Press ^] three times within 1s to kill container.
ping: example.com: Temporary failure in name resolution
Container t1 failed with error code 130.
```
With the patch:
```
$ sudo ./build/systemd-nspawn --private-users=false --timezone=copy --resolv-conf=copy-host --read-only --tmpfs=/var --tmpfs=/run --overlay="+/etc::/etc" -D /var/lib/machines/t1 ping -qc 1 example.com
Spawning container t1 on /var/lib/machines/t1.
Press ^] three times within 1s to kill container.
PING example.com (93.184.216.34) 56(84) bytes of data.
--- example.org ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 110.912/110.912/110.912/0.000 ms
Container t1 exited successfully.
```
Let's be extra careful whenever we return from recvmsg() and see
MSG_CTRUNC set. This generally means we ran into a programming error, as
we didn't size the control buffer large enough. It's an error condition
we should at least log about, or propagate up. Hence do that.
This is particularly important when receiving fds, since for those the
control data can be of any size. In particular on stream sockets that's
nasty, because if we miss an fd because of control data truncation we
cannot recover, we might not even realize that we are one off.
(Also, when failing early, if there's any chance the socket might be
AF_UNIX let's close all received fds, all the time. We got this right
most of the time, but there were a few cases missing. God, UNIX is hard
to use)
Let's add flavours for copying stub/uplink resolv.conf versions.
Let's add a more brutal "replace" mode, where we'll replace any existing
destination file.
Let's also change what "auto" means: instead of copying the static file,
let's use the stub file, so that DNS search info is copied over.
Fixes: #15340
This has been requested many times before. Let's add it finally.
GPT auto-discovery for /var is a bit more complex than for other
partition types: the other partitions can to some degree be shared
between multiple OS installations on the same disk (think: swap, /home,
/srv). However, /var is inherently something bound to an installation,
i.e. specific to its identity, or actually *is* its identity, and hence
something that cannot be shared.
To deal with this this new code is particularly careful when it comes to
/var: it will not mount things blindly, but insist that the UUID of the
partition matches a hashed version of the machine-id of the
installation, so that each installation has a very specific /var
associated with it, and would never use any other. (We actually use
HMAC-SHA256 on the GPT partition type for /var, keyed by the machine-id,
since machine-id is something we want to keep somewhat private).
Setting the right UUID for installations takes extra care. To make
things a bit simpler to set up, we avoid this safety check for nspawn
and RootImage= in unit files, under the assumption that such container
and service images unlikely will have multiple installations on them.
The check is hence only required when booting full machines, i.e. in
in systemd-gpt-auto-generator.
To help with putting together images for full machines, PR #14368
introduces a repartition tool that can automatically fill in correctly
calculated UUIDs on first boot if images have the var partition UUID
initialized to all zeroes. With that in place systems can be put
together in a way that on first boot the machine ID is determined and
the partition table automatically adjusted to have the /var partition
with the right UUID.
To support ProtectHome=y in a user namespace (which mounts the inaccessible
nodes), the nodes need to be accessible by the user. Create these paths and
devices in the user runtime directory so they can be used later if needed.
The commit:
a3fc6b55ac nspawn: mask out CAP_NET_ADMIN again if settings file turns off private networking
turned off the CAP_NET_ADMIN capability whenever no private networking
feature was enabled. This broke configurations where the CAP_NET_ADMIN
capability was explicitly requested in the configuration.
Changing the order of evalution here to allow the Capability= setting
to overrule this implicit setting:
Order of evaluation:
1. if no private network setting is enabled, CAP_NET_ADMIN is removed
2. if a private network setting is enabled, CAP_NET_ADMIN is added
3. the settings of Capability= are added
4. the settings of DropCapability= are removed
This allows the fix for #11755 to be retained and to still allow the
admin to specify CAP_NET_ADMIN as additional capability.
Fixes: a3fc6b55acFixes: #13995
Initially I thought this is a good idea, but when reviewing a different PR
(https://github.com/systemd/systemd/pull/13862#discussion_r340604313) I changed
my mind about this. At some point we probably should start warning about the
old option name, and yet later remove it. But it'll make it easier for people
to transition to the new option name if there's a period of support for both
names without any fuss. There's nothing particularly wrong about the old name,
and there is no support cost.
Fixes#13919 (by avoiding the issue completely).
It's user-facing, parsed from the command line and we typically mangle
in these cases, let's do so here too. (In particular as the identical
switch for systemd-run already does it.)
We already shut the machine down ourselves (and pid1 will also do
cleanup for us after we exit if anything was left behind). No need for
systemd-machined to try to stop the unit too.
(This calls the new machined method. If we are running against an older
machined, we will not deregister the machine. If we are simply exiting,
machined should notice that the unit is gone on its own. If we are restarting,
we will fail to register the machine after restart and fail. But this case
was already broken, because machined would create a stop job, breaking the
restart. So not doing anything with old machined should not make anything
more broken than it already is.)
Fixes#13766.
chase_symlinks() would return negative on error, and either a non-negative status
or a non-negative fd when CHASE_OPEN was given. This made the interface quite
complicated, because dependning on the flags used, we would get two different
"types" of return object. Coverity was always confused by this, and flagged
every use of chase_symlinks() without CHASE_OPEN as a resource leak (because it
would this that an fd is returned). This patch uses a saparate output parameter,
so there is no confusion.
(I think it is OK to have functions which return either an error or an fd. It's
only returning *either* an fd or a non-fd that is confusing.)