This commit moves the first-boot system preset-settings evaluation out
of main and into the manager startup logic itself. Notably, it reverses
the order between generators and presets evaluation, so that any changes
performed by first-boot generators are taken into account by the presets
logic.
After this change, units created by a generator can be enabled as part
of a preset.
/sys is not guaranteed to exist when a new mount namespace is created.
It is only mounted under conditions specified by
`namespace_info_mount_apivfs`.
Checking if the three available MAC LSMs are enabled requires a sysfs
mounted at /sys, so the checks are moved to before a new mount ns is
created.
When we create a log stream connection to journald, we pass along the
unit ID. With this change we do this only when we run as the system
instance, not as a user instance, to remove the ambiguity about whether a
user or system unit is specified. The effect of this change is minor:
journald ignores the field anyway from clients with UID != 0. This patch
hence only fixes the unit attribution for the --user instance of the
root user.
When a service unit uses "ProtectKernelTunables=yes", it currently
remounts /sys/fs/selinux read-only. This makes libselinux report SELinux
state as "disabled", because most SELinux features are not usable. For
example it is not possible to validate security contexts (with
security_check_context_raw() or /sys/fs/selinux/context). This behavior
of libselinux has been described in
http://danwalsh.livejournal.com/73099.html and confirmed in a recent
email, https://marc.info/?l=selinux&m=149220233032594&w=2 .
Since commit 0c28d51ac8 ("units: further lock down our long-running
services"), systemd-localed unit uses ProtectKernelTunables=yes.
Nevertheless this service needs to use libselinux API in order to create
/etc/vconsole.conf, /etc/locale.conf... with the right SELinux contexts.
This is broken when /sys/fs/selinux is mounted read-only in the mount
namespace of the service.
Make SELinux-aware systemd services work again when they are using
ProtectKernelTunables=yes by keeping selinuxfs mounted read-write.
mount_load does not require fragment files to be present in order to
load mount units which are perpetual, or come from /proc/self/mountinfo.
mount_verify should do the same, otherwise a synthesized '-.mount' would
be marked as failed with "No such file or directory", as it is perpetual
but not marked to come from /proc/self/mountinfo at this point.
This happens for the user instance, and I suspect it was the cause of #5375
for the system instance, without gpt-generator.
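The rule described above can be sketched as a small predicate. This is a
simplified illustration only: struct mount_sketch and both functions are
hypothetical stand-ins, not the real mount_load()/mount_verify() code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified view of a mount unit for this sketch. */
struct mount_sketch {
    bool has_fragment;    /* a unit file was found on disk */
    bool perpetual;       /* e.g. the synthesized -.mount */
    bool from_mountinfo;  /* seen in /proc/self/mountinfo */
};

/* A fragment is only mandatory if nothing else vouches for the unit. */
bool mount_fragment_required(const struct mount_sketch *m) {
    assert(m);
    return !m->perpetual && !m->from_mountinfo;
}

/* Returns 0 if the unit passes, -ENOENT ("No such file or directory")
 * if a required fragment is missing. */
int mount_verify_sketch(const struct mount_sketch *m) {
    assert(m);
    if (!m->has_fragment && mount_fragment_required(m))
        return -ENOENT;
    return 0;
}
```

With this rule a perpetual '-.mount' without a fragment verifies fine,
while an ordinary mount unit without one still fails.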
Currently we just abort startup, without printing any error. Make sure we
always print something, and when we cannot deserialize some unit, just ignore
it and continue.
Fixup for 4bc5d27b94. Without this, we would hang
in daemon-reexec after upgrade.
Some kdbus_flag and memfd related parts are left behind, because they
are entangled with the "legacy" dbus support.
test-bus-benchmark is switched to "manual". It was already broken before
(in the non-kdbus mode) but apparently nobody noticed. Hopefully it can
be fixed later.
Since busname units are only useful with kdbus, they weren't actively
used. This was dead code, only compile-tested. If busname units are
ever added back, it'll be cleaner to start from scratch (possibly reverting
parts of this patch).
Example log:
Jul 22 15:55:21 fedora systemd[1]: a1.service: Found ordering cycle on a2.service/start
Jul 22 15:55:21 fedora systemd[1]: a1.service: Found dependency on a3.service/start
Jul 22 15:55:21 fedora systemd[1]: a1.service: Found dependency on a1.service/start
Jul 22 15:55:21 fedora systemd[1]: a1.service: Job a2.service/start deleted to break ordering cycle starting with a1.service/start
Jul 22 15:55:21 fedora systemd[1]: Starting a1.service...
Jul 22 15:55:21 fedora systemd[1]: Started a1.service.
Example log entry:
Sat 2017-07-22 15:55:21.372389 EDT [s=0004bb6302d94ac3aa69987fb6157338;i=9ae;b=a96eb6153d4f4f3686c7b4
_BOOT_ID=a96eb6153d4f4f3686c7b4db8a432908
_MACHINE_ID=ad18f69b80264b52bb3b766240742383
_HOSTNAME=fedora
PRIORITY=3
SYSLOG_FACILITY=3
SYSLOG_IDENTIFIER=systemd
_UID=0
_GID=0
_PID=1
_TRANSPORT=journal
_CAP_EFFECTIVE=3fffffffff
_COMM=systemd
_EXE=/usr/lib/systemd/systemd
_SYSTEMD_CGROUP=/init.scope
_SYSTEMD_UNIT=init.scope
_SYSTEMD_SLICE=-.slice
_SELINUX_CONTEXT=system_u:system_r:kernel_t:s0
CODE_FILE=../src/core/transaction.c
CODE_FUNC=transaction_verify_order_one
UNIT=a3.service
UNIT=a1.service
UNIT=a2.service
CODE_LINE=430
MESSAGE=a1.service: Job a2.service/start deleted to break ordering cycle starting with a1.service
_CMDLINE=/usr/lib/systemd/systemd --system --deserialize 28
_SOURCE_REALTIME_TIMESTAMP=1500753321372389
This should make it easier to see when any of the units are involved in an
ordering cycle.
Fixes #6336.
v2:
- also update the "Unable to break cycle" message.
This reverts commit 2d058a87ff.
When we add another name to a unit (by following an alias), we need to
reload all drop-ins. This is necessary to load any additional dropins
found in the dirs created from the alias name.
Fixes #6334.
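To illustrate why a new name means new directories to scan, here is a
minimal sketch that derives one "<name>.d" drop-in directory per unit name.
The helper names and the fixed 256-byte buffers are assumptions made for
this example; the real lookup covers more search paths than a single base
directory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Writes "<base>/<name>.d" into buf. Returns 0 on success, -1 if it
 * does not fit. */
int dropin_dir_for_name(const char *base, const char *name, char *buf, size_t size) {
    int n;

    assert(base && name && buf);
    n = snprintf(buf, size, "%s/%s.d", base, name);
    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}

/* Rebuilds the whole list of drop-in directories from every name the
 * unit currently has, so a newly added alias is always covered. */
size_t collect_dropin_dirs(const char *base, const char *const *names,
                           size_t n_names, char dirs[][256]) {
    size_t n = 0;

    for (size_t i = 0; i < n_names; i++)
        if (dropin_dir_for_name(base, names[i], dirs[n], sizeof dirs[n]) == 0)
            n++;
    return n;
}
```

Calling collect_dropin_dirs() again after each added name mirrors the
"reload all drop-ins" approach described above.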
As a follow-up for db3f45e2d2 let's do the
same for all other cases where we create a FILE* with local scope and
hence know that no other thread can have access to it.
For most cases this shouldn't change much really, but this should speed
up dbus introspection and calendar time formatting a bit.
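A minimal sketch of the idea, assuming the referenced commit disables
stdio's internal locking through glibc's __fsetlocking(). The function
format_pair_unlocked() is a hypothetical example, not code from the tree.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdio_ext.h>
#include <stdlib.h>
#include <string.h>

/* Formats "key=value\n" into a newly allocated string using a memory
 * stream that never leaves this function. The caller frees the result. */
char *format_pair_unlocked(const char *key, const char *value) {
    char *buf = NULL;
    size_t size = 0;
    FILE *f;

    assert(key && value);

    f = open_memstream(&buf, &size);
    if (!f)
        return NULL;

    /* glibc extension: the stream is local, so no other thread can
     * touch it and per-call locking is pure overhead. */
    (void) __fsetlocking(f, FSETLOCKING_BYCALLER);

    fputs(key, f);
    fputc('=', f);
    fputs(value, f);
    fputc('\n', f);

    if (fclose(f) != 0) {
        free(buf);
        return NULL;
    }

    return buf;
}
```

The gain per call is small, which matches the expectation above that only
code writing many small pieces to a stream benefits noticeably.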
This moves pretty much all uses of getpid() over to getpid_cached(). I
didn't specifically check whether the optimization is worth it for each
replacement, but in order to keep things simple and systematic I
switched over everything at once.
This introduces {State,Cache,Log,Configuration}Directory=, which are
similar to RuntimeDirectory=. They create the directories under
/var/lib, /var/cache/, /var/log, or /etc, respectively, with the mode
specified in {State,Cache,Log,Configuration}DirectoryMode=.
This also fixes #6391.
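The mapping from setting to location can be sketched as a lookup table.
The enum and function names below are hypothetical and exist only for this
illustration. The /run entry for RuntimeDirectory= is not stated above; it
is added for comparison and assumes the system instance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical enum for this sketch. */
enum dir_kind {
    DIR_RUNTIME,
    DIR_STATE,
    DIR_CACHE,
    DIR_LOGS,
    DIR_CONFIGURATION,
    DIR_KIND_MAX
};

/* Returns the parent directory used for a given kind of directory. */
const char *dir_kind_prefix(enum dir_kind k) {
    static const char *const table[DIR_KIND_MAX] = {
        [DIR_RUNTIME]       = "/run",        /* RuntimeDirectory= (assumed) */
        [DIR_STATE]         = "/var/lib",    /* StateDirectory= */
        [DIR_CACHE]         = "/var/cache",  /* CacheDirectory= */
        [DIR_LOGS]          = "/var/log",    /* LogDirectory= */
        [DIR_CONFIGURATION] = "/etc",        /* ConfigurationDirectory= */
    };

    assert((unsigned) k < (unsigned) DIR_KIND_MAX);
    return table[k];
}

/* Writes "<prefix>/<name>" into buf. Returns 0 on success, -1 if it
 * does not fit. */
int dir_kind_path(enum dir_kind k, const char *name, char *buf, size_t size) {
    int n;

    assert(name && buf);
    n = snprintf(buf, size, "%s/%s", dir_kind_prefix(k), name);
    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

For example, a unit with StateDirectory=myapp would get /var/lib/myapp.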
2d79a0bbb9 did that for TimeoutSec=,
89beff89ed did that for JobTimeoutSec=,
and 0004f698df did that for
x-systemd.device-timeout=. But after parsing x-systemd.device-timeout=xxx
we write it out as JobRunningTimeoutSec=xxx. Two options:
- write out JobRunningTimeoutSec=<a very big number>,
- change JobRunningTimeoutSec= to behave like the other options.
I think it would be confusing for JobRunningTimeoutSec= to have different
syntax than TimeoutSec= and JobTimeoutSec=, so this patch implements the
second option.
Fixes #6264, https://bugzilla.redhat.com/show_bug.cgi?id=1462378.
The cgroup namespace wasn't useful for delegation because it allowed resource
control interface files (e.g. memory.high) to be written from inside the
namespace. This allowed the namespace parent's resource distribution to be
disturbed by its namespace-scoped children.
A new mount option, "nsdelegate", was added to cgroup v2 to address this issue.
The flag is meaningful only when mounting cgroup v2 in the init namespace and
makes a cgroup namespace a delegation boundary. The kernel feature is pending
for v4.13.
This should have been the default behavior on cgroup namespaces and this commit
makes systemd try "nsdelegate" first when trying to mount cgroup v2 and fall
back if the option is not supported.
Note that this risks breaking usages which depend on modifying the
parent's resource settings from the namespace root, which isn't a valid thing
to do, but such usages may still exist.
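The try-then-fall-back logic can be sketched as follows. To keep the
example runnable without privileges, the mount itself is done through a
callback; mount_fn, mount_cgroup2_with_fallback() and old_kernel_mount()
are hypothetical names. Treating -EINVAL as "option not supported" is an
assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Performs one mount attempt; returns 0 or a negative errno. A real
 * caller would wrap mount(2) here. */
typedef int (*mount_fn)(const char *fstype, const char *where, const char *options);

/* Tries cgroup v2 with "nsdelegate" first, then without it. */
int mount_cgroup2_with_fallback(mount_fn do_mount, const char *where) {
    int r;

    assert(do_mount && where);

    r = do_mount("cgroup2", where, "nsdelegate");
    if (r != -EINVAL)
        return r; /* success, or a failure unrelated to the option */

    /* Assumption: an unknown mount option is rejected with EINVAL. */
    return do_mount("cgroup2", where, NULL);
}

/* Stand-in for a kernel without "nsdelegate", used for demonstration. */
int old_kernel_mount_calls = 0;

int old_kernel_mount(const char *fstype, const char *where, const char *options) {
    (void) fstype;
    (void) where;
    old_kernel_mount_calls++;
    return options && strcmp(options, "nsdelegate") == 0 ? -EINVAL : 0;
}
```

On a kernel that knows the option the first attempt succeeds and the
second is never made.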
If an error is encountered in any of the Exec* lines, WorkingDirectory,
SELinuxContext, ApparmorProfile, SmackProcessLabel, Service (in .socket
units), User, or Group, refuse to load the unit. If the config stanza
has support, ignore the failure if '-' is present.
For those configuration directives, even if we started the unit, it's
pretty likely that it'll do something unexpected (like write files
in a wrong place, or with a wrong context, or run with wrong permissions,
etc). It seems better to refuse to start the unit and have the admin
clean up the configuration without giving the service a chance to mess
up stuff.
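The intended behaviour can be sketched like this. All names are
hypothetical, the validator is a toy, and -ENOEXEC is used here only to
represent "refuse to load the unit".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Checks one setting value; returns 0 if valid, negative errno if not. */
typedef int (*validate_fn)(const char *value);

/* Returns 1 if the setting was applied, 0 if it was invalid but ignored
 * because of a leading '-', and -ENOEXEC if the unit must be refused. */
int load_critical_setting(const char *rvalue, validate_fn validate, bool dash_supported) {
    bool ignore = false;

    assert(rvalue && validate);

    if (dash_supported && rvalue[0] == '-') {
        ignore = true;
        rvalue++;
    }

    if (validate(rvalue) >= 0)
        return 1;

    return ignore ? 0 : -ENOEXEC;
}

/* Toy validator for the demonstration: accepts absolute paths only. */
int validate_absolute_path(const char *value) {
    assert(value);
    return value[0] == '/' ? 0 : -EINVAL;
}
```

A directive that has no '-' support passes dash_supported=false, so a
leading '-' is then just part of an invalid value.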
Note that all "security" options that restrict what the unit can do
(Capabilities, AmbientCapabilities, Restrict*, SystemCallFilter, Limit*,
PrivateDevices, Protect*, etc) are _not_ treated like this. Such options are
only supplementary, and are not always available depending on the architecture
and compilation options, so unit authors have to make sure that the service
runs correctly without them anyway.
Fixes #6237, #6277.
When running systemd-analyze verify I would get a random subset of warnings
(sometimes none, sometimes one or two):
dev-mapper-luks\x2d8db85dcf\x2d6230\x2d4e88\x2d940d\x2dba176d062b31.swap: Unit is bound to inactive unit dev-mapper-luks\x2d8db85dcf\x2d6230\x2d4e88\x2d940d\x2dba176d062b31.device. Stopping, too.
home.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device. Stopping, too.
boot.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-56c56bfd\x2d93f0\x2d48fb\x2dbc4b\x2d90aa67144ea5.device. Stopping, too.
When running with debug on, it's pretty obvious what is happening:
home.mount: Changed dead -> mounted
home.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device. Stopping, too.
home.mount: Trying to enqueue job home.mount/stop/fail
home.mount: Installed new job home.mount/stop as 27
home.mount: Enqueued job home.mount/stop as 27
...
dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device: Installed new job dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device/start as 47
dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device: Changed dead -> plugged
dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device: Job dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device/start finished, result=done
Fixes #2206, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=808151.
Commit 74dd6b515f (core: run each system
service with a fresh session keyring) broke adding keys to user keyring.
Added keys could not be accessed with error message:
keyctl_read_alloc: Permission denied
So link the user keyring to our session keyring.
When umounting an NFS filesystem, it is not safe to lstat(2) the mountpoint at
all as that can block indefinitely if the NFS server is down.
umount() will not block, but lstat() will.
This patch therefore removes the call to lstat(2) and defers the handling of
any error to the child process which will issue the umount call.
This extends 2d79a0bbb9 to the kernel
command line parsing.
The parsing is changed a bit to only understand "0" as infinity. If units are
specified, parse normally, e.g. "0s" is just 0. This makes it possible to
provide a zero timeout if necessary.
A simple test is added.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1462378.
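A rough sketch of the parsing rule described above. It is not the real
parser: the function name and the small set of accepted units (none, "s",
"ms", "min") are assumptions made to keep the example short.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_USEC_PER_SEC 1000000ULL
#define SKETCH_USEC_INFINITY UINT64_MAX

/* Parses a timeout into microseconds. Returns 0 on success, a negative
 * errno on failure. */
int parse_timeout_sketch(const char *s, uint64_t *ret) {
    unsigned long long v, mult;
    char *end;

    assert(s && ret);

    /* Only the literal string "0" means "no timeout". */
    if (strcmp(s, "0") == 0) {
        *ret = SKETCH_USEC_INFINITY;
        return 0;
    }

    if (s[0] < '0' || s[0] > '9')
        return -EINVAL;

    errno = 0;
    v = strtoull(s, &end, 10);
    if (errno != 0)
        return -ERANGE;

    /* Anything with a unit is parsed normally, so "0s" is really 0. */
    if (*end == '\0' || strcmp(end, "s") == 0)
        mult = SKETCH_USEC_PER_SEC;
    else if (strcmp(end, "ms") == 0)
        mult = 1000ULL;
    else if (strcmp(end, "min") == 0)
        mult = 60ULL * SKETCH_USEC_PER_SEC;
    else
        return -EINVAL;

    if (v > UINT64_MAX / mult)
        return -ERANGE;

    *ret = v * mult;
    return 0;
}
```

The special case is checked on the raw string before any number parsing,
which is what keeps "0" and "0s" distinct.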
Under nspawn, systemd would print:
Got address error code: Operation not permitted
Got address error code: Operation not permitted
Got start error code: Operation not permitted
which is quite unclear out of context. Change that to:
Failed to add address 127.0.0.1 to loopback interface: Operation not permitted
Failed to add address ::1 to loopback interface: Operation not permitted
Failed to bring loopback interface up: Operation not permitted
This is just a cosmetic issue.
Garbage collection of jobs (especially the ones that we create automatically)
is something of an internal implementation detail and should not be made
visible to the users. But it's probably still useful to log this in the
journal, so the code is rearranged to skip one of the messages if we log to the
console and the journal separately, and to keep the message if we log
everything to the console.
Fixes #6254.
Fun fact 1 suggests that a "close()" is needed, but that close() has long since been
removed. So the comment is now meaningless and possibly confusing.
Fun fact 2 refers to a bug that was fixed in Linux prior to v4.12 by
commit 9fa4eb8e490a ("autofs: sanity check status reported with AUTOFS_DEV_IOCTL_FAIL"),
so revise the comment so that no one goes pointlessly looking for the bug.
The CapabilityBoundingSet option only makes sense if we are running as
PID1.
The system.conf.d(5) manpage already states that the CapabilityBoundingSet
option:
Controls which capabilities to include in the capability bounding set
for PID 1 and its children.
https://github.com/systemd/systemd/issues/6080