Quoting Andy Lutomirski:
> The upcoming Linux SGX driver has a device node /dev/sgx. User code opens
> it, does various setup things, mmaps it, and needs to be able to create
> PROT_EXEC mappings. This gets quite awkward if /dev is mounted noexec.
We already didn't use noexec in spawn, and this extends this behaviour to other
systems.
Afaik, the kernel would refuse execve() on a character or block device
anyway. Thus noexec on /dev matters only for actual binaries copied to /dev,
which requires root privileges in the first place.
We don't do noexec on either /tmp or /dev/shm (because that causes immediate
problems with stuff like Java and cffi). And if you have those two at your
disposal anyway, having noexec on /dev doesn't seem important. So the 'noexec'
attribute on /dev doesn't really mean much, since there are multiple other
similar directories which don't require root privileges to write to.
C.f. 33c10ef43b.
Avoid warning messages when booting systemd-nspawn containers and using
hybrid or legacy cgroups. systemd-nspawn mounts the cgroups v1 controller
tree as read-only so these errors are expected and not problematic.
Partially fixes#17862.
Test plan:
- Before: `mkosi --default .mkosi/mkosi.fedora boot`
```
‣ Processing default...
Spawning container image on /home/daan/projects/systemd/image.raw.
Press ^] three times within 1s to kill container.
systemd 247 running in system mode. (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization systemd-nspawn.
Detected architecture x86-64.
Welcome to Fedora 33 (Thirty Three)!
Queued start job for default target Graphical Interface.
-.slice: Failed to migrate controller cgroups from , ignoring: Read-only file system
system.slice: Failed to delete controller cgroups /system.slice, ignoring: Read-only file system
[ OK ] Created slice system-getty.slice.
[ OK ] Created slice system-modprobe.slice.
user.slice: Failed to delete controller cgroups /user.slice, ignoring: Read-only file system
[ OK ] Created slice User and Session Slice.
[ OK ] Started Dispatch Password Requests to Console Directory Watch.
[ OK ] Started Forward Password Requests to Wall Directory Watch.
[ OK ] Reached target Local Encrypted Volumes.
[ OK ] Reached target Paths.
[ OK ] Reached target Remote File Systems.
[ OK ] Reached target Slices.
[ OK ] Reached target Swap.
[ OK ] Listening on Process Core Dump Socket.
[ OK ] Listening on initctl Compatibility Named Pipe.
[ OK ] Listening on Journal Socket (/dev/log).
[ OK ] Listening on Journal Socket.
[ OK ] Listening on User Database Manager Socket.
dev-hugepages.mount: Failed to delete controller cgroups /dev-hugepages.mount, ignoring: Read-only file system
Mounting Huge Pages File System...
sys-fs-fuse-connections.mount: Failed to delete controller cgroups /sys-fs-fuse-connections.mount, ignoring: Read-only file system
Mounting FUSE Control File System...
Starting Journal Service...
Starting Remount Root and Kernel File Systems...
system.slice: Failed to delete controller cgroups /system.slice, ignoring: Read-only file system
```
After: `mkosi --default .mkosi/mkosi.fedora boot`
```
‣ Processing default...
Spawning container image on /home/daan/projects/systemd/mkosi.output/image.raw.
Press ^] three times within 1s to kill container.
systemd 247 running in system mode. (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization systemd-nspawn.
Detected architecture x86-64.
Welcome to Fedora 33 (Thirty Three)!
Queued start job for default target Graphical Interface.
[ OK ] Created slice system-getty.slice.
[ OK ] Created slice system-modprobe.slice.
[ OK ] Created slice User and Session Slice.
[ OK ] Started Dispatch Password Requests to Console Directory Watch.
[ OK ] Started Forward Password Requests to Wall Directory Watch.
[ OK ] Reached target Local Encrypted Volumes.
[ OK ] Reached target Paths.
[ OK ] Reached target Remote File Systems.
[ OK ] Reached target Slices.
[ OK ] Reached target Swap.
[ OK ] Listening on Process Core Dump Socket.
[ OK ] Listening on initctl Compatibility Named Pipe.
[ OK ] Listening on Journal Socket (/dev/log).
[ OK ] Listening on Journal Socket.
[ OK ] Listening on User Database Manager Socket.
Mounting Huge Pages File System...
Mounting FUSE Control File System...
Starting Journal Service...
Starting Remount Root and Kernel File Systems...
[ OK ] Mounted Huge Pages File System.
[ OK ] Mounted FUSE Control File System.
[ OK ] Finished Remount Root and Kernel File Systems.
Starting Create Static Device Nodes in /dev...
[ OK ] Finished Create Static Device Nodes in /dev.
[ OK ] Reached target Local File Systems (Pre).
[ OK ] Reached target Local File Systems.
Starting Restore /run/initramfs on shutdown...
[ OK ] Finished Restore /run/initramfs on shutdown.
[ OK ] Started Journal Service.
Starting Flush Journal to Persistent Storage...
[ OK ] Finished Flush Journal to Persistent Storage.
Starting Create Volatile Files and Directories...
[ OK ] Finished Create Volatile Files and Directories.
Starting Network Name Resolution...
Starting Update UTMP about System Boot/Shutdown...
[ OK ] Finished Update UTMP about System Boot/Shutdown.
[ OK ] Reached target System Initialization.
[ OK ] Started Daily Cleanup of Temporary Directories.
[ OK ] Reached target Timers.
[ OK ] Listening on D-Bus System Message Bus Socket.
[ OK ] Reached target Sockets.
[ OK ] Reached target Basic System.
Starting Home Area Manager...
Starting User Login Management...
Starting Permit User Sessions...
[ OK ] Finished Permit User Sessions.
[ OK ] Started Console Getty.
[ OK ] Reached target Login Prompts.
Starting D-Bus System Message Bus...
[ OK ] Started D-Bus System Message Bus.
[ OK ] Started Home Area Manager.
[ OK ] Started User Login Management.
[ OK ] Reached target Multi-User System.
[ OK ] Reached target Graphical Interface.
Starting Update UTMP about System Runlevel Changes...
[ OK ] Finished Update UTMP about System Runlevel Changes.
[ OK ] Started Network Name Resolution.
[ OK ] Reached target Host and Network Name Lookups.
Fedora 33 (Thirty Three) (built from systemd tree)
Kernel 5.9.11-arch2-1 on an x86_64 (console)
```
When manager_{set|override}_watchdog is called, set the watchdog timeout
regardless of whether the hardware watchdog was successfully initialized. If
the watchdog was requested but could not be initialized, then instead of
pinging it, attempt to initialize it again. This ensures that the hardware
watchdog is initialized even if the kernel module for it isn't loaded when
systemd starts (which is quite likely, unless it is compiled in).
This builds on work by @danc86 in https://github.com/systemd/systemd/pull/17460,
but fixes the issue of not updating the watchdog timeout with the actual value
from the hardware.
Fixes https://github.com/systemd/systemd/issues/17838
Co-authored-by: Dan Callaghan <djc@djc.id.au>
Co-authored-by: Michael Marley <michael@michaelmarley.com>
Commit [1] added a workaround when unified cgroups are used but missed
legacy cgroups where there is the same issue.
[1] <2dbc45aea7>
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
The BLKID and ELFUTILS strings were present twice. Let's reaarange things so that
each times requires definition in exactly one place.
Also let's sort things a bit:
the "heavy hitters" like PAM/MAC first,
then crypto libs,
then other libs, alphabetically,
compressors,
and external compat integrations.
I think it's useful for users to group similar concepts together to some extent.
For example, when checking what compression is available, it helps a lot to have
them listed together.
FDISK is renamed to LIBFDISK to make it clear that this is about he library and
the executable.
Other similar variables use the binary name underscorified and upppercased
(with "_BINARY" appended in some cases to avoid ambiguity). Add "S" to follow
the same pattern for systemd-cgroups-agent.
Based on the discussion in #16715.
Commit 428a9f6f1d freed u->pids which is
problematic since the references to this unit in m->watch_pids were no more
removed when the unit was freed.
This patch makes sure to clean all this refs up before freeing u->pids by
calling unit_unwatch_all_pids().
In many cases the tables are largely the same, hence define a common set
of macros to generate the common parts.
This adds in a couple of missing specifiers here and there, so is more
thant just refactoring: it actually fixes accidental omissions.
Note that some entries that look like they could be unified under these
macros can't really be unified, since they are slightly different. For
example in the DNSSD service logic we want to use the DNSSD hostname for
%H rather than the unmodified kernel one.
Otherwise if a daemon-reload happens somewhere between the enqueue of the job
start for the scope unit and scope_start() then u->pids might be lost and none
of the processes specified by "PIDs=" will be moved into the scope cgroup.
This is useful for development where overwriting files out side
the configured prefix will affect the host as well as stateless
systems such as NixOS that don't let packages install to /etc but handle
configuration on their own.
Alternative to https://github.com/systemd/systemd/pull/17501
tested with:
$ mkdir inst build && cd build
$ meson \
-Dcreate-log-dirs=false \
-Dsysvrcnd-path=$(realpath ../inst)/etc/rc.d \
-Dsysvinit-path=$(realpath ../inst)/etc/init.d \
-Drootprefix=$(realpath ../inst) \
-Dinstall-sysconfdir=false \
--prefix=$(realpath ../inst) ..
$ ninja install
With the grandparent change to move most units to app.slice,
those units would be ordered After=app.slice which doesn't make any sense.
Actually they appear earlier, before the manager is even started, and
conceputally it doesn't seem useful to put them under any slice.
... when called with a valid environment variable name. This means that
any time we call it with a fixed string, it is guaranteed to return 0.
(Also when the variable is not present in the environment block.)
FixedRandomDelay=yes will use
`siphash24(sd_id128_get_machine() || MANAGER_IS_SYSTEM(m) || getuid() || u->id)`,
where || is concatenation, instead of a random number to choose a value between
0 and RandomizedDelaySec= as the timer delay.
This essentially sets up a fixed, but seemingly random, offset for each timer
iteration rather than having a random offset recalculated each time it fires.
Closes#10355
Co-author: Anita Zhang <the.anitazha@gmail.com>
This beefs up the READ_FULL_FILE_CONNECT_SOCKET logic of
read_full_file_full() a bit: when used a sender socket name may be
specified. If specified as NULL behaviour is as before: the client
socket name is picked by the kernel. But if specified as non-NULL the
client can pick a socket name to use when connecting. This is useful to
communicate a minimal amount of metainformation from client to server,
outside of the transport payload.
Specifically, these beefs up the service credential logic to pass an
abstract AF_UNIX socket name as client socket name when connecting via
READ_FULL_FILE_CONNECT_SOCKET, that includes the requesting unit name
and the eventual credential name. This allows servers implementing the
trivial credential socket logic to distinguish clients: via a simple
getpeername() it can be determined which unit is requesting a
credential, and which credential specifically.
Example: with this patch in place, in a unit file "waldo.service" a
configuration line like the following:
LoadCredential=foo:/run/quux/creds.sock
will result in a connection to the AF_UNIX socket /run/quux/creds.sock,
originating from an abstract namespace AF_UNIX socket:
@$RANDOM/unit/waldo.service/foo
(The $RANDOM is replaced by some randomized string. This is included in
the socket name order to avoid namespace squatting issues: the abstract
socket namespace is open to unprivileged users after all, and care needs
to be taken not to use guessable names)
The services listening on the /run/quux/creds.sock socket may thus
easily retrieve the name of the unit the credential is requested for
plus the credential name, via a simpler getpeername(), discarding the
random preifx and the /unit/ string.
This logic uses "/" as separator between the fields, since both unit
names and credential names appear in the file system, and thus are
designed to use "/" as outer separators. Given that it's a good safe
choice to use as separators here, too avoid any conflicts.
This is a minimal patch only: the new logic is used only for the unit
file credential logic. For other places where we use
READ_FULL_FILE_CONNECT_SOCKET it is probably a good idea to use this
scheme too, but this should be done carefully in later patches, since
the socket names become API that way, and we should determine the right
amount of info to pass over.
This adds a way to control SO_TIMESTAMP/SO_TIMESTAMPNS socket options
for sockets PID 1 binds to.
This is useful in journald so that we get proper timestamps even for
ingress log messages that are submitted before journald is running.
We recently turned on packet info metadata from PID 1 for these sockets,
but the timestamping info was still missing. Let's correct that.
If processes remain in the unit's cgroup after the final SIGKILL is
sent and the unit has exceeded stop timeout, don't release the unit's
cgroup information. Pid1 will have failed to `rmdir` the cgroup path due
to processes remaining in the cgroup and releasing would leave the cgroup
path on the file system with no tracking for pid1 to clean it up.
Instead, keep the information around until the last process exits and pid1
sends the cgroup empty notification. The service/scope can then prune
the cgroup if the unit is inactive/failed.
This changes the default from putting all units into the root slice to
placing them into the app slice in the user manager. The advantage is
that we get the right behaviour in most cases, and we'll need special
case handling in all other cases anyway.
Note that we have currently defined that applications *should* start
their unit names with app-, so we could also move only these by creating
a drop-in for app-.scope and app-.service.
However, that would not answer the question on how we should manage
session.slice. And we would end up placing anything that does not fit
the system (e.g. anything started by dbus-broker currently) into the
root slice.
The test was failing because it couldn't start the service:
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
path-modified.path: state = waiting; result = success
path-modified.service: state = failed; result = exit-code
Failed to connect to system bus: No such file or directory
-.slice: Failed to enable/disable controllers on cgroup /system.slice/kojid.service, ignoring: Permission denied
path-modified.service: Failed to create cgroup /system.slice/kojid.service/path-modified.service: Permission denied
path-modified.service: Failed to attach to cgroup /system.slice/kojid.service/path-modified.service: No such file or directory
path-modified.service: Failed at step CGROUP spawning /bin/true: No such file or directory
path-modified.service: Main process exited, code=exited, status=219/CGROUP
path-modified.service: Failed with result 'exit-code'.
Test timeout when testing path-modified.path
In fact any of the services that we try to start may fail, especially
considering that we're doing some rogue cgroup operations. See
https://github.com/systemd/systemd/pull/16603#issuecomment-679133641.