If the whole call is simple and we don't need to look at the return value
apart from the conditional, let's use a form without assignment of the return
value. When the function call is more complicated, it still makes sense to
use a temporary variable.
Lennart wanted to do this back in
01c33c1eff.
For better or worse, this wasn't done because I thought that turning on MountAPIVFS
is a compat break for RootDirectory and people might be negatively surprised by it.
Without this, search for binaries doesn't work (access_fd() requires /proc).
Let's turn it on, but still allow overriding to "no".
When RootDirectory=/, MountAPIVFS=1 doesn't work. This might be a buglet on its
own, but this patch doesn't change the situation.
(Well, at least the ones where that makes sense. Where it does't make
sense are the ones that re invoked on the root path, which cannot
possibly be a symlink.)
Let's make umount_verbose() more like mount_verbose_xyz(), i.e. take log
level and flags param. In particular the latter matters, since we
typically don't actually want to follow symlinks when unmounting.
This has two advantages:
- we save a bit of IO in early boot because we don't look for executables
which we might never call
- if the executable is in a different place and it was specified as a
non-absolute path, it is OK if it moves to a different place. This should
solve the case paths are different in the initramfs.
Since the executable path is only available quite late, the call to
mac_selinux_get_child_mls_label() which uses the path needs to be moved down
too.
Fixes#16076.
Similar to free_and_replace. I think this should be uppercase to make it
clear that this is a macro. free_and_replace should probably be uppercased
too.
Other tools silently ignore non-executable names found in path. By checking
F_OK, we would could pick non-executable path even though there is an executable
one later.
We had a special test case that the second semicolon would be interpreted
as an executable name. We would then try to find the executable and rely
on ";" not being found to cause ENOEXEC to be returned. I think that's just
crazy. Let's treat the second semicolon as a separator and ignore the
whole thing as we would whitespace.
Just some refactoring: let's place the various verity related parameters
in a common structure, and pass that around instead of the individual
parameters.
Also, let's load the PKCS#7 signature data when finding metadata
right-away, instead of delaying this until we need it. In all cases we
call this there's not much time difference between the metdata finding
and the loading, hence this simplifies things and makes sure root hash
data and its signature is now always acquired together.
With new directive SystemCallLog= it's possible to list system calls to be
logged. This can be used for auditing or temporarily when constructing system
call filters.
---
v5: drop intermediary, update HASHMAP_FOREACH_KEY() use
v4: skip useless debug messages, actually parse directive
v3: don't declare unused variables with old libseccomp
v2: fix build without seccomp or old libseccomp
Define explicit action "kill" for SystemCallErrorNumber=.
In addition to errno code, allow specifying "kill" as action for
SystemCallFilter=.
---
v7: seccomp_parse_errno_or_action() returns -EINVAL if !HAVE_SECCOMP
v6: use streq_ptr(), let errno_to_name() handle bad values, kill processes,
init syscall_errno
v5: actually use seccomp_errno_or_action_to_string(), don't fail bus unit
parsing without seccomp
v4: fix build without seccomp
v3: drop log action
v2: action -> number
Like it's customary in our codebase bus_error_message() internally takes
abs() of the passed error anyway, hence no need to explicitly negate it.
We mostly got this right, but in too many cases we didn't. Fix that.
We already do this for socket and automount units, do it for path units
too: if the triggered service keeps hitting the start limit, then fail
the triggering unit too, so that we don#t busy loop forever.
(Note that this leaves only timer units out in the cold for this kind of
protection, but it shouldn't matter there, as they are naturally
protected against busy loops: they are scheduled by time anyway).
Fixes: #16669
In 4c2ef32767 we enabled propagating
triggered unit state to the triggering unit for service units in more
load states, so that we don't accidentally stop tracking state
correctly.
Do the same for our other triggering unit states: automounts, paths, and
timers.
Also, make this an assertion rather than a simple test. After all it
should never happen that we get called for half-loaded units or units of
the wrong type. The load routines should already have made this
impossible.
We want to be able to use BusName= in services that run during early boot
already, and thus don't synthesize deps on dbus there. Instead add them
when Type=dbus is set, because in that case we actually really need
D-Bus support.
Fixes: #17037
In containers we might lack the privs to up the socket buffers. Let's
not complain so loudly about that. Let's hence downgrade this to debug
logging if it's a permission problem.
(This wasn't an issue before b92f350789
because back then the failures wouldn't be detected at all.)
We generally don't support prefix being != /usr, and this is hardcoded
all over the place. In the systemd.pc file it wasn't so far. Let's
adjust this to match the rest of the codebase.
A variety of sockopts exist both for IPv4 and IPv6 but require a
different pair of sockopt level/option number. Let's add helpers for
these that internally determine the right sockopt to call.
This should shorten code that generically wants to support both ipv4 +
ipv6 and for the first time adds correct support for some cases where we
only called the ipv4 versions, and not the ipv6 options.
The securebit keep-caps retains the capabilities in the permitted set
over an UID change (ambient capabilities are cleared though).
Setting the keep-caps securebit after the uid change and before execve
doesn't make sense as it is cleared during execve and there is no
additional user ID change after this point.
Altough the documentation (man 7 capabilities) is ambigious, keep-caps
is reset during execve although keep-caps-locked is set. After execve
only keep-caps-locked is set and keep-caps is cleared.
Low-level cgroup freezer state manipulation is invoked directly from the
job engine when we are about to execute the job in order to make sure
the unit is not frozen and job execution is not blocked because of
that.
Currently with cgroup v1 we would needlessly do a bunch of work in the
function and even falsely update the freezer state. Don't do any of this
and skip the function silently when v2 freezer is not available.
Following bug is fixed by this commit,
$ systemd-run --unit foo.service /bin/sleep infinity
$ systemctl restart foo.service
$ systemctl show -p FreezerState foo.service
Before (cgroup v1, i.e. full "legacy" mode):
FreezerState=thawing
After:
FreezerState=running
In various cases, we would say 'return log_warning()' or 'return log_error()'. Those
functions return 0 if no error is passed in. For log_warning or log_error this doesn't
make sense, and we generally want to propagate the error. In the few cases where
the error should be ignored, I think it's better to split it in two, and call 'return 0'
on a separate line.
The new methods work as the unflavoured ones, but takes flags as a
single uint64_t DBUS parameters instead of different booleans, so
that it can be extended without breaking backward compatibility.
Add new flag to allow adding/removing symlinks in
[/etc|/run]/systemd/system.attached so that portable services
configuration files can be self-contained in those directories, without
affecting the system services directories.
Use the new methods and flags from portablectl --enable.
Useful in case /etc is read-only, with only the portable services
directories being mounted read-write.
socket_instantiate_service() was doing unit_ref_set(), and the caller was
immediately doing unit_ref_unset(). After we get rid of this, it doesn't seem
worth it to have two functions.
This means that the connection was aborted before we even got to figure out
what the service name will be. Let's treat this as a non-event and close the
connection fd without any further messages.
Code last changed in 934ef6a5.
Reported-by: Thiago Macieira <thiago.macieira@intel.com>
With the patch:
systemd[1]: foobar.socket: Incoming traffic
systemd[1]: foobar.socket: Got ENOTCONN on incoming socket, assuming aborted connection attempt, ignoring.
...
Also, when we get ENOMEM, don't give the hint about missing unit.
Let's ensure that a device once tagged can become active/inactive simply
by toggling the current tag.
Note that this means that a device once tagged with "systemd" will
always have a matching .device unit. However, the active/inactive state
of the unit reflects whether it is currently tagged that way (and
doesn't have SYSTEMD_READY=0 set).
Fixes: #7587
Desired functionality:
Set securebits for services started as non-root user.
Failure:
The starting of the service fails if no ambient capability shall be
raised.
... systemd[217941]: ...: Failed to set process secure bits: Operation
not permitted
... systemd[217941]: ...: Failed at step SECUREBITS spawning
/usr/bin/abc.service: Operation not permitted
... systemd[1]: abc.service: Failed with result 'exit-code'.
Reason:
For setting securebits the capability CAP_SETPCAP is required. However
the securebits (if no ambient capability shall be raised) are set after
setresuid.
When setresuid is invoked all capabilities are dropped from the
permitted, effective and ambient capability set. If the securebit
SECBIT_KEEP_CAPS is set the permitted capability set is retained, but
the effective and the ambient set are cleared.
If ambient capabilities shall be set, the securebit SECBIT_KEEP_CAPS is
added to the securebits configured in the service file and set together
with the securebits from the service file before setresuid is executed
(in enforce_user).
Before setresuid is executed the capabilities are the same as for pid1.
This means that all capabilities in the effective, permitted and
bounding set are set. Thus the capability CAP_SETPCAP is in the
effective set and the prctl(PR_SET_SECUREBITS, ...) succeeds.
However, if the secure bits aren't set before setresuid is invoked they
shall be set shortly after the uid change in enforce_user.
This fails as SECBIT_KEEP_CAPS wasn't set before setresuid and in
consequence the effective and permitted set was cleared, hence
CAP_SETPCAP is not set in the effective set (and cannot be raised any
longer) and prctl(PR_SET_SECUREBITS, ...) failes with EPERM.
Proposed solution:
The proposed solution consists of three parts
1. Check in enforce_user, if securebits are configured in the service
file. If securebits are configured, set SECBIT_KEEP_CAPS
before invoking setresuid.
2. Don't set any other securebits than SECBIT_KEEP_CAPS in enforce_user,
but set all requested ones after enforce_user.
This has the advantage that securebits are set at the same place for
root and non-root services.
3. Raise CAP_SETPCAP to the effective set (if not already set) before
setting the securebits to avoid EPERM during the prctl syscall.
For gaining CAP_SETPCAP the function capability_bounding_set_drop is
splitted into two functions:
- The first one raises CAP_SETPCAP (required for dropping bounding
capabilities)
- The second drops the bounding capabilities
Why are ambient capabilities not affected by this change?
Ambient capabilities get cleared during setresuid, no matter if
SECBIT_KEEP_CAPS is set or not.
For raising ambient capabilities for a user different to root, the
requested capability has to be raised in the inheritable set first. Then
the SECBIT_KEEP_CAPS securebit needs to be set before setresuid is
invoked. Afterwards the ambient capability can be raised, because it is
in the inheritable and permitted set.
Security considerations:
Although the manpage is ambiguous SECBIT_KEEP_CAPS is cleared during
execve no matter if SECBIT_KEEP_CAPS_LOCKED is set or not. If both are
set only SECBIT_KEEP_CAPS_LOCKED is set after execve.
Setting SECBIT_KEEP_CAPS in enforce_user for being able to set
securebits is no security risk, as the effective and permitted set are
set to the value of the ambient set during execve (if the executed file
has no file capabilities. For details check man 7 capabilities).
Remark:
In capability-util.c is a comment complaining about the missing
capability CAP_SETPCAP in the effective set, after the kernel executed
/sbin/init. Thus it is checked there if this capability has to be raised
in the effective set before dropping capabilities from the bounding set.
If this were true all the time, ambient capabilities couldn't be set
without dropping at least one capability from the bounding set, as the
capability CAP_SETPCAP would miss and setting SECBIT_KEEP_CAPS would
fail with EPERM.
Instead of assuming that more-recently modified directories have higher mtime,
just look for any mtime changes, up or down. Since we don't want to remember
individual mtimes, hash them to obtain a single value.
This should help us behave properly in the case when the time jumps backwards
during boot: various files might have mtimes that in the future, but we won't
care. This fixes the following scenario:
We have /etc/systemd/system with T1. T1 is initially far in the past.
We have /run/systemd/generator with time T2.
The time is adjusted backwards, so T2 will be always in the future for a while.
Now the user writes new files to /etc/systemd/system, and T1 is updated to T1'.
Nevertheless, T1 < T1' << T2.
We would consider our cache to be up-to-date, falsely.
This check was added in d904afc730. It would only
apply in the case where the cache hasn't been loaded yet. I think we pretty
much always have the cache loaded when we reach this point, but even if we
didn't, it seems better to try to reload the unit. So let's drop this check.
We really only care if the cache has been reloaded between the time when we
last attempted to load this unit and now. So instead of recording the actual
time we try to load the unit, just store the timestamp of the cache. This has
the advantage that we'll notice if the cache mtime jumps forward or backward.
Also rename fragment_loadtime to fragment_not_found_time. It only gets set when
we failed to load the unit and the old name was suggesting it is always set.
In https://bugzilla.redhat.com/show_bug.cgi?id=1871327
(and most likely https://bugzilla.redhat.com/show_bug.cgi?id=1867930
and most likely https://bugzilla.redhat.com/show_bug.cgi?id=1872068) we try
to load a non-existent unit over and over from transaction_add_job_and_dependencies().
My understanding is that the clock was in the future during inital boot,
so cache_mtime is always in the future (since we don't touch the fs after initial boot),
so no matter how many times we try to load the unit and set
fragment_loadtime / fragment_not_found_time, it is always higher than cache_mtime,
so manager_unit_cache_should_retry_load() always returns true.
The name is misleading, since we aren't really loading the unit from cache — if
this function returns true, we'll try to load the unit from disk, updating the
cache in the process.
Any uevent other then the initial and the last uevent we see for a
device (which is "add" and "remove") should result in a reload being
triggered, including "bind" and "unbind". Hence, let's fix up the check.
("move" is kinda a combined "remove" + "add", hence cover that too)
The parent process may not perform any label operation, so the
database might not get updated on a SELinux policy change on its own.
Reload the label database once on a policy change, instead of n times
in every started child.
Switch from security_getenforce() and netlink notifications to the
SELinux status page.
This usage saves system calls and will also be the default in
libselinux > 3.1 [1].
[1]: 05bdc03130
I find this version much more readable.
Add replacement defines so that when acl/libacl.h is not available, the
ACL_{READ,WRITE,EXECUTE} constants are also defined. Those constants were
declared in the kernel headers already in 1da177e4c3f41524e886b7f1b8a0c1f,
so they should be the same pretty much everywhere.
Ideally we would like to hide all other service's credentials for all
services. That would imply for us to enable mount namespacing for all
services, which is something we cannot do, both due to compatibility
with the status quo ante, and because a number of services legitimately
should be able to install mounts in the host hierarchy.
Hence we do the second best thing, we hide the credentials automatically
for all services that opt into mount namespacing otherwise. This is
quite different from other mount sandboxing options: usually you have to
explicitly opt into each. However, given that the credentials logic is a
brand new concept we invented right here and now, and particularly
security sensitive it's OK to reverse this, and by default hide
credentials whenever we can (i.e. whenever mount namespacing is
otherwise opt-ed in to).
Long story short: if you want to hide other service's credentials, the
most basic options is to just turn on PrivateMounts= and there you go,
they should all be gone.
Kernel 5.8 gained a hidepid= implementation that is truly per procfs,
which allows us to mount a distinct once into every unit, with
individual hidepid= settings. Let's expose this via two new settings:
ProtectProc= (wrapping hidpid=) and ProcSubset= (wrapping subset=).
Replaces: #11670
it's not entirely clear what shall be passed via parameter and what via
struct, but these two definitely fit well with the other protect_xyz
fields, hence let's move them over.
We probably should move a lot more more fields into the structure
actuall (most? all even?).
Similarly to "setup" vs. "set up", "fallback" is a noun, and "fall back"
is the verb. (This is pretty clear when we construct a sentence in the
present continous: "we are falling back" not "we are fallbacking").
Follow the same model established for RootImage and RootImageOptions,
and allow to either append a single list of options or tuples of
partition_number:options.
The concept is flawed, and mostly useless. Let's finally remove it.
It has been deprecated since 90a2ec10f2 (6
years ago) and we started to warn since
55dadc5c57 (1.5 years ago).
Let's get rid of it altogether.
Previously, we'd create them from user-runtime-dir@.service. That has
one benefit: since this service runs privileged, we can create the full
set of device nodes. It has one major drawback though: it security-wise
problematic to create files/directories in directories as privileged
user in directories owned by unprivileged users, since they can use
symlinks to redirect what we want to do. As a general rule we hence
avoid this logic: only unpriv code should populate unpriv directories.
Hence, let's move this code to an appropriate place in the service
manager. This means we lose the inaccessible block device node, but
since there's already a fallback in place, this shouldn't be too bad.
Let's make /run/host the sole place we pass stuff from host to container
in and place the "inaccessible" nodes in /run/host too.
In contrast to the previous two commits this is a minor compat break, but
not a relevant one I think. Previously the container manager would place
these nodes in /run/systemd/inaccessible/ and that's where PID 1 in the
container would try to add them too when missing. Container manager and
PID 1 in the container would thus manage the same dir together.
With this change the container manager now passes an immutable directory
to the container and leaves /run/systemd entirely untouched, and managed
exclusively by PID 1 inside the container, which is nice to have clear
separation on who manages what.
In order to make sure systemd then usses the /run/host/inaccesible/
nodes this commit changes PID 1 to look for that dir and if it exists
will symlink it to /run/systemd/inaccessible.
Now, this will work fine if new nspawn and new pid 1 in the container
work together. as then the symlink is created and the difference between
the two dirs won't matter.
For the case where an old nspawn invokes a new PID 1: in this case
things work as they always worked: the dir is managed together.
For the case where different container manager invokes a new PID 1: in
this case the nodes aren't typically passed in, and PID 1 in the
container will try to create them and will likely fail partially (though
gracefully) when trying to create char/block device nodes. THis is fine
though as there are fallbacks in place for that case.
For the case where a new nspawn invokes an old PID1: this is were the
(minor) incompatibily happens: in this case new nspawn will place the
nodes in the /run/host/inaccessible/ subdir, but the PID 1 in the
container won't look for them there. Since the nodes are also not
pre-created in /run/systed/inaccessible/ PID 1 will try to create them
there as if a different container manager sets them up. This is of
course not sexy, but is not a total loss, since as mentioned fallbacks
are in place anyway. Hence I think it's OK to accept this minor
incompatibility.
arg_system == true and getpid() == 1 hold under the very same condition
this early in the main() function (this only changes later when we start
parsing command lines, where arg_system = true is set if users invoke us
in test mode even when getpid() != 1.
Hence, let's simplify things, and merge a couple of if branches and not
pretend they were orthogonal.
Some masks shouldn't be needed externally, so keep their functions in
the module (others would fit there too but they're used in tests) to
think twice if something would depend on them.
Drop unused function cg_attach_many_everywhere.
Use cgroup_realized instead of cgroup_path when we actually ask for
realized.
This should not cause any functional changes.
The usage in unit_get_own_mask is redundant, we only need apply
disable_mask at the end befor application, i.e. calculating enable or
target mask.
(IOW, we allow all configurations, but disabling affects effective
controls.)
Modify tests accordingly and add testing of enable mask.
This is intended as cleanup, with no effect but changing unit_dump
output.
The unit_add_siblings_to_cgroup_realize_queue does more than mere
siblings queueing, hence define a family of a unit as (immediate)
children of the unit and immediate children of all ancestors.
Working with this abstraction simplifies the queuing calls and it
shouldn't change the functionality.
Merge members mask invalidation into
unit_add_siblings_to_cgroup_realize_queue, this way unit_realize_cgroup
needn't be called with members mask invalidation.
We have to retain the members mask invalidation in unit_load -- although
active units would have cgroups (re)realized (unit_load queues for
realization), the realization would happen with potentially stale mask.
unit_free(u) realizes direct parent and invalidates members mask of all
ancestors. This isn't sufficient in v1 controller hierarchies since
siblings of the freed unit may have existed only because of the removed
unit.
We cannot be lazy about the siblings because if parent(u) is also
removed, it'd migrate and rmdir cgroups for siblings(u). However,
realized masks of siblings(u) won't reflect this change.
This was a non-issue earlier, because we weren't removing cgroup
directories properly (effectively matching the stale realized mask),
removal failed because of tasks left by missing migration (see previous
commit).
Therefore, ensure realization of all units necessary to clean up after
the free'd unit.
Fixes: #14149
When we are about to derealize a controller on v1 cgroup, we first
attempt to delete the controller cgroup and migrate afterwards. This
doesn't work in practice because populated cgroup cannot be deleted.
Furthermore, we leave out slices from migration completely, so
(un)setting a control value on them won't realize their controller
cgroup.
Rework actual realization, unit_create_cgroup() becomes
unit_update_cgroup() and make sure that controller hierarchies are
reduced when given controller cgroup ceased to be needed.
Note that with this we introduce slight deviation between v1 and v2 code
-- when a descendant unit turns off a delegated controller, we attempt
to disable it in ancestor slices. On v2 this may fail (kernel enforced,
because of child cgroups using the controller), on v1 we'll migrate
whole subtree and trim the subhierachy. (Previously, we wouldn't take
away delegated controller, however, derealization was broken anyway.)
Fixes: #14149
When available, enable memory_recursiveprot. Realistically it always
makes sense to delegate MemoryLow= and MemoryMin= to all children of a
slice/unit.
The kernel option is not enabled by default as it might cause
regressions in some setups. However, it is the better default in
general, and it results in a more flexible and obvious behaviour.
The alternative to using this option would be for user's to also set
DefaultMemoryLow= on slices when assigning MemoryLow=. However, this
makes the effect of MemoryLow= on some children less obvious, as it
could result in a lower protection rather than increasing it.
From the kernel documentation:
memory_recursiveprot
Recursively apply memory.min and memory.low protection to
entire subtrees, without requiring explicit downward
propagation into leaf cgroups. This allows protecting entire
subtrees from one another, while retaining free competition
within those subtrees. This should have been the default
behavior but is a mount-option to avoid regressing setups
relying on the original semantics (e.g. specifying bogusly
high 'bypass' protection values at higher tree levels).
This was added in kernel commit 8a931f801340c (mm: memcontrol:
recursive memory.low protection), which became available in 5.7 and was
subsequently fixed in kernel 5.7.7 (mm: memcontrol: handle div0 crash
race condition in memory.low).
Follows the same pattern and features as RootImage, but allows an
arbitrary mount point under / to be specified by the user, and
multiple values - like BindPaths.
Original implementation by @topimiettinen at:
https://github.com/systemd/systemd/pull/14451
Reworked to use dissect's logic instead of bare libmount() calls
and other review comments.
Thanks Topi for the initial work to come up with and implement
this useful feature.
This code was changed in this pull request:
https://github.com/systemd/systemd/pull/16571
After some discussion and more investigation, we better understand
what's going on. So, update the comment, so things are more clear
to future readers.
From a report in https://bugzilla.redhat.com/show_bug.cgi?id=1861463:
usb-gadget.target: Failed to load configuration: No such file or directory
usb-gadget.target: Failed to load configuration: No such file or directory
usb-gadget.target: Trying to enqueue job usb-gadget.target/start/fail
usb-gadget.target: Failed to load configuration: No such file or directory
Assertion '!bus_error_is_dirty(e)' failed at src/libsystemd/sd-bus/bus-error.c:239, function bus_error_setfv(). Ignoring.
sys-devices-platform-soc-2100000.bus-2184000.usb-ci_hdrc.0-udc-ci_hdrc.0.device: Failed to enqueue SYSTEMD_WANTS= job, ignoring: Unit usb-gadget.target not found.
I *think* this is the place where the reuse occurs: we call
bus_unit_validate_load_state(unit, e) twice in a row.
The explicit limit is dropped, which means that we return to the kernel default
of 50% of RAM. See 362a55fc14 for a discussion why that is not as much as it
seems. It turns out various applications need more space in /dev/shm and we
would break them by imposing a low limit.
While at it, rename the define and use a single macro for various tmpfs mounts.
We don't really care what the purpose of the given tmpfs is, so it seems
reasonable to use a single macro.
This effectively reverts part of 7d85383edb. Fixes#16617.
Allows to specify mount options for RootImage.
In case of multi-partition images, the partition number can be prefixed
followed by colon. Eg:
RootImageOptions=1:ro,dev 2:nosuid nodev
In absence of a partition number, 0 is assumed.
When we're disabling controller on a direct child of root cgroup, we
forgot to add root slice into cgroup realization queue, which prevented
proper disabling of the controller (on unified hierarchy).
The mechanism relying on "bounce from bottom and propagate up" in
unit_create_cgroup doesn't work on unified hierarchy (leaves needn't be
enabled). Drop it as we rely on the ancestors to be queued -- that's now
intentional but was artifact of combining the two patches:
cb5e3bc37d ("cgroup: Don't explicitly check for member in UNIT_BEFORE") v240~78
65f6b6bdcb ("core: fix re-realization of cgroup siblings") v245-rc1~153^2
Fixes: #14917
The current implementation is LIFO, which is a) confusing b) prevents
some ordered operations on the cgroup tree (e.g. removing children
before parents).
Fix it quickly. Current list implementation turns this from O(1) to O(n)
operation. Rework the lists later.
Previously, we assumed that success meant we definitely got a valid
pointer. There is at least one edge case where this is not true (i.e.,
we can get both a 0 return value, and *also* a NULL pointer):
4246bb550d/libselinux/src/procattr.c (L175)
When this case occurrs, if we don't check the pointer we SIGSEGV in
early initialization.
Fixes#16401.
c80a9a33d0 introduced the .can_fail field,
but didn't set it on .targets. Targets can fail through dependencies.
This leaves .slice and .device units as the types that cannot fail.
$ systemctl cat bad.service bad.target bad-fallback.service
[Service]
Type=oneshot
ExecStart=false
[Unit]
OnFailure=bad-fallback.service
[Service]
Type=oneshot
ExecStart=echo Fixing everythign!
$ sudo systemctl start bad.target
systemd[1]: Starting bad.service...
systemd[1]: bad.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: bad.service: Failed with result 'exit-code'.
systemd[1]: Failed to start bad.service.
systemd[1]: Dependency failed for bad.target.
systemd[1]: bad.target: Job bad.target/start failed with result 'dependency'.
systemd[1]: bad.target: Triggering OnFailure= dependencies.
systemd[1]: Starting bad-fallback.service...
echo[46901]: Fixing everythign!
systemd[1]: bad-fallback.service: Succeeded.
systemd[1]: Finished bad-fallback.service.
When the RTC time at boot is off in the future by a few days, OnCalendar=
timers will be scheduled based on the time at boot. But if the time has been
adjusted since boot, the timers will end up scheduled way in the future, which
may cause them not to fire as shortly or often as expected.
Update the logic so that the time will be adjusted based on monotonic time.
We do that by calculating the adjusted manager startup realtime from the
monotonic time stored at that time, by comparing that time with the realtime
and monotonic time of the current time.
Added a test case to validate this works as expected. The test case creates a
QEMU virtual machine with the clock 3 days in the future. Then we adjust the
clock back 3 days, and test creating a timer with an OnCalendar= for every 15
minutes. We also check the manager startup timestamp from both `systemd-analyze
dump` and from D-Bus.
Test output without the corresponding code changes that fix the issue:
Timer elapse outside of the expected 20 minute window.
next_elapsed=1594686119
now=1594426921
time_delta=259198
With the code changes in, the test passes as expected.
Read-only /var/tmp is more likely, because it's backed by a real device. /tmp
is (by default) backed by tmpfs, but it doesn't have to be. In both cases the
same consideration applies.
If we boot with read-only /var/tmp, any unit with PrivateTmp=yes would fail
because we cannot create the subdir under /var/tmp to mount the private directory.
But many services actually don't require /var/tmp (either because they only use
it occasionally, or because they only use /tmp, or even because they don't use the
temporary directories at all, and PrivateTmp=yes is used to isolate them from
the rest of the system).
To handle both cases let's create a read-only directory under /run/systemd and
mount it as the private /tmp or /var/tmp. (Read-only to not fool the service into
dumping too much data in /run.)
$ sudo systemd-run -t -p PrivateTmp=yes bash
Running as unit: run-u14.service
Press ^] three times within 1s to disconnect TTY.
[root@workstation /]# ls -l /tmp/
total 0
[root@workstation /]# ls -l /var/tmp/
total 0
[root@workstation /]# touch /tmp/f
[root@workstation /]# touch /var/tmp/f
touch: cannot touch '/var/tmp/f': Read-only file system
This commit has more changes than I like to put in one commit, but it's touching all
the same paths so it's hard to split.
exec_runtime_make() was using the wrong cleanup function, so the directory would be
left behind on error.