For example in a container we'd log:
Oct 17 17:01:10 rawhide systemd[1]: Started Power-Off.
Oct 17 17:01:10 rawhide systemd[1]: Forcibly powering off: unit succeeded
Oct 17 17:01:10 rawhide systemd[1]: Reached target Power-Off.
Oct 17 17:01:10 rawhide systemd[1]: Shutting down.
and on the console we'd write (in red)
[ !! ] Forcibly powering off: unit succeeded
This is not useful in any way, and the fact that we're calling an "emergency action"
is an internal implementation detail. Let's log about c-a-d and the watchdog actions
only.
The setting is now only looked at when considering an action for a job timeout
or unit start limit. It is ignored for ctrl-alt-del, SuccessAction, SuccessFailure.
v2: turn the parameter into a flag field
v3: rename Options to Flags
All over the place we define local variables for the various sockopts
that take a bool-like "int" value. Sometimes they are const, sometimes
static, sometimes both, sometimes neither.
Let's clean this up, introduce a common const variable "const_int_one"
(as well as one matching "const_int_zero") and use it everywhere, all
acorss the codebase.
If a memory error occurred, we would still go through the path which sets the
error on ferror(). It is unlikely that ferror() returns true, but it's seems
cleaner to just propagate the error we already have.
The handling of fgets() returning NULL is also simplified: according to the man
page, it returns NULL only on EOF or error. So if feof() returns true, I don't
think we should call ferror() again.
While at it, let's set errno to 0 and check that it is set before returning it
as an error. The man pages for fgets() and feof() do not say anything about
setting errno.
Here the behaviour is nominally changed, because we will decrease the
counter on error. But the only caller quits the program if error occurs,
so this makes no practical difference.
Currently they aren't covered and it probably isn't worth adding another
kind of timestamp just for this, hence simply include it in the regular
generator timestamps.
Let's do so already when we are about to complete startup/reload, so
that manager_catchup() is run in a context where MANAGER_IS_RUNNING()
returns true, as the intention is.
Fixes: #9518
Both functions do partly the same, let's make sure they do it in the
same order, and that we don't miss some calls.
This makes a number of changes:
1. Moves exec_runtime_vacuum() two calls down in manager_startup(). This
should not have any effect but makes manager_startup() more like
manager_reload().
2. Calls manager_recheck_journal(), manager_recheck_dbus(),
manager_enqueue_sync_bus_names() in manager_startup() too. This is a
good idea since during reeexec we pass through manager_startup() and
hence can't assume dbus and journald weren't up yet, hence let's
check if they are ready to be connected to.
3. Include manager_enumerate_perpetual() in manager_reload(), too. This
is not strictly necessary, since these units are included in the
serialization anyway, but it's still a nice thing, in particular as
theoretically the deserialization could fail.
let's clean up error handling and logging in manager_reload() a bit.
Specifically: make sure we log about every error we might encounter at
least and at most once.
When we encounter an error before the "point of no return" then log at
LOG_ERR about it and propagate it. Otherwise, eat it up, but warn about
it and proceed, it's the best we can do.
If manager_serialize() fails in the middle (which it hopefully doesn't)
make sure to fix up m->n_reloading correctly again so that we don't
leave it > 0 when it really shouldn't be.
Let's make them typesafe, and let's add a nice macro helper for checking
if we are in a test run, which should make testing for this much easier
to read for most cases.
Instead of blacklisting when not to trim the cgroup tree, let's instead
whitelist when to do it, as an excercise of being careful when being
destructive.
This should not change behaviour with exception that during switch roots
we now won't attempt to trim the cgroup tree anymore. Which is more
correct behaviour after all we serialize/deserialize during the
transition and should be needlessly destructive.
"ExitCode" is a bit of a misnomer in two ways: it suggests this was
about the "exit code" concept that exit()/waitid() deal with, but really
isn't. Moreover, it's not event just about exiting either, but more
often about reloading/reexecing or rebooting. Let's hence pick a new
name for this that is a bit more correct.
I initially thought about naming this the "state", but that'd be a
misnomer too, as the value really encodes a "goal" more than a current
state. Also we already have the externally visible ManagerState.
No actual changes in behaviour, just the rename.
Previously, we'd act immediately on StopWhenUnneeded= when a unit state
changes. With this rework we'll maintain a queue instead: whenever
there's the chance that StopWhenUneeded= might have an effect we enqueue
the unit, and process it later when we have nothing better to do.
This should make the implementation a bit more reliable, as the unit notify event
cannot immediately enqueue tons of side-effect jobs that might
contradict each other, but we do so only in a strictly ordered fashion,
from the main event loop.
This slightly changes the check when to consider a unit "unneeded".
Previously, we'd assume that a unit in "deactivating" state could also
be cleaned up. With this new logic we'll only consider units unneeded
that are fully up and have no job queued. This means that whenever
there's something pending for a unit we won't clean it up.
Looking at a recent Bad Day, my log contains over 100 lines of
systemd[23895]: Failed to connect to API bus: Connection refused
It is due to "systemd --user" retrying to connect to an API bus.[*] I
would prefer to avoid spamming the logs. I don't think it is good for us
to retry so much like this.
systemd was mislead by something setting DBUS_SESSION_BUS_ADDRESS. My best
guess is an unfortunate series of events caused gdm to set this. gdm has
code to start a session dbus if there is not a bus available already (and
in this case it exports the environment variable). I believe it does not
normally do this when running under systemd, because "systemd --user" and
hence "dbus.service" would already have been started by pam_systemd.
I see two possibilities
1. Rip out the check for DBUS_SESSION_BUS_ADDRESS entirely.
2. Only check for DBUS_SESSION_BUS_ADDRESS on startup. Not in the
"recheck" logic.
The justification for 2), is that the recheck is called from unit_notify(),
this is used to check whether the service just started (or stopped) was
"dbus.service". This reason for rechecking does not apply if we think
the session bus was started outside our logic.
But I think we can justify 1). dbus-daemon ships a statically-enabled
/usr/lib/systemd/user/dbus.service, which would conflict with an attempt to
use an external dbus. Also "systemd --user" is started from user@.service;
if you try to start it manually so that it inherits an environment
variable, it will conflict if user@.service was started by pam_systemd
(or loginctl enable-linger).
We add n_installed_jobs and n_failed_jobs to our inner state after
deserialization. This is fine during daemon-reexec when we start with clear
Manager (and some jobs possibly queued before deserialization), however,
daemon-reload works with the same manager and adding the values would
effectively double the counters. Reset the counters before we deserialize and
add their values again.