Otherwise we keep collecting stuff from env generators, and we really
shouldn't.
This was working properly on reexec but not on reload, as for reexec we
would always start fresh, but for reload would reuse the Manager object
and hence its default environment set.
Fixes: #10671
Ideally we'd even propagate this all the way to the client, by having a
separate JobType enum value for this. But it's hard to add this without
breaking compat, hence for now let's at least internally propagate this
case differently from the case "already on it".
This is then used to call job_finish_and_invalidate() slightly
differently, with the already= parameter false, as in the failed
condition case no message was likely produced so far.
This splits the "environment" field of Manager into two:
transient_environment and client_environment. The former is generated
from configuration file, kernel cmdline, environment generators. The
latter is the one the user can control with "systemctl set-environment"
and similar.
Both sets are merged transparently whenever needed. Separating the two
sets has the benefit that we can safely flush out the former while
keeping the latter during daemon reload cycles, so that env var settings
from env generators or configuration files do not accumulate, but
dynamic API changes are kept around.
Note that this change is not entirely transparent to users: if the user
first uses "set-environment" to override a transient variable, and then
uses "unset-environment" to unset it again things will revert to the
original transient variable now, while previously the variable was fully
removed. This change in behaviour should not matter too much though I
figure.
Fixes: #9972
If unit_deserialize() fails (because one read line is overly long), it returns
an error and we would have assumed that the next read would point to the next
unit to deserialize.
But instead unit_deserialize() can leave the file offset in the middle of a
line.
Therefore we need to ignore and skip the current unit in this case too.
While at it, move unit deserialization in a dedicated functions. That should
make the code easier to read.
Let's be more careful with what we serialize: let's ensure we never
serialize strings that are longer than LONG_LINE_MAX, so that we know we
can read them back with read_line(…, LONG_LINE_MAX, …) safely.
In order to implement this all serialization functions are move to
serialize.[ch], and internally will do line size checks. We'd rather
skip a serialization line (with a loud warning) than write an overly
long line out. Of course, this is just a second level protection, after
all the data we serialize shouldn't be this long in the first place.
While we are at it also clean up logging: while serializing make sure to
always log about errors immediately. Also, (void)ify all calls we don't
expect errors in (or catch errors as part of the general
fflush_and_check() at the end.
The predicate function manager_timestamp_shall_serialize() simply says
whether to serialize or not serialize a timestamp, and should make
things a bit easier to read.
This should be much better than fgets(), as we can read substantially
longer lines and overly long lines result in proper errors.
Fixes a vulnerability discovered by Jann Horn at Google.
CVE-2018-15686
LP: #1796402https://bugzilla.redhat.com/show_bug.cgi?id=1639071
For example in a container we'd log:
Oct 17 17:01:10 rawhide systemd[1]: Started Power-Off.
Oct 17 17:01:10 rawhide systemd[1]: Forcibly powering off: unit succeeded
Oct 17 17:01:10 rawhide systemd[1]: Reached target Power-Off.
Oct 17 17:01:10 rawhide systemd[1]: Shutting down.
and on the console we'd write (in red)
[ !! ] Forcibly powering off: unit succeeded
This is not useful in any way, and the fact that we're calling an "emergency action"
is an internal implementation detail. Let's log about c-a-d and the watchdog actions
only.
The setting is now only looked at when considering an action for a job timeout
or unit start limit. It is ignored for ctrl-alt-del, SuccessAction, SuccessFailure.
v2: turn the parameter into a flag field
v3: rename Options to Flags
All over the place we define local variables for the various sockopts
that take a bool-like "int" value. Sometimes they are const, sometimes
static, sometimes both, sometimes neither.
Let's clean this up, introduce a common const variable "const_int_one"
(as well as one matching "const_int_zero") and use it everywhere, all
acorss the codebase.
If a memory error occurred, we would still go through the path which sets the
error on ferror(). It is unlikely that ferror() returns true, but it's seems
cleaner to just propagate the error we already have.
The handling of fgets() returning NULL is also simplified: according to the man
page, it returns NULL only on EOF or error. So if feof() returns true, I don't
think we should call ferror() again.
While at it, let's set errno to 0 and check that it is set before returning it
as an error. The man pages for fgets() and feof() do not say anything about
setting errno.
Here the behaviour is nominally changed, because we will decrease the
counter on error. But the only caller quits the program if error occurs,
so this makes no practical difference.
Currently they aren't covered and it probably isn't worth adding another
kind of timestamp just for this, hence simply include it in the regular
generator timestamps.
Let's do so already when we are about to complete startup/reload, so
that manager_catchup() is run in a context where MANAGER_IS_RUNNING()
returns true, as the intention is.
Fixes: #9518
Both functions do partly the same, let's make sure they do it in the
same order, and that we don't miss some calls.
This makes a number of changes:
1. Moves exec_runtime_vacuum() two calls down in manager_startup(). This
should not have any effect but makes manager_startup() more like
manager_reload().
2. Calls manager_recheck_journal(), manager_recheck_dbus(),
manager_enqueue_sync_bus_names() in manager_startup() too. This is a
good idea since during reeexec we pass through manager_startup() and
hence can't assume dbus and journald weren't up yet, hence let's
check if they are ready to be connected to.
3. Include manager_enumerate_perpetual() in manager_reload(), too. This
is not strictly necessary, since these units are included in the
serialization anyway, but it's still a nice thing, in particular as
theoretically the deserialization could fail.
let's clean up error handling and logging in manager_reload() a bit.
Specifically: make sure we log about every error we might encounter at
least and at most once.
When we encounter an error before the "point of no return" then log at
LOG_ERR about it and propagate it. Otherwise, eat it up, but warn about
it and proceed, it's the best we can do.
If manager_serialize() fails in the middle (which it hopefully doesn't)
make sure to fix up m->n_reloading correctly again so that we don't
leave it > 0 when it really shouldn't be.
Let's make them typesafe, and let's add a nice macro helper for checking
if we are in a test run, which should make testing for this much easier
to read for most cases.
Instead of blacklisting when not to trim the cgroup tree, let's instead
whitelist when to do it, as an excercise of being careful when being
destructive.
This should not change behaviour with exception that during switch roots
we now won't attempt to trim the cgroup tree anymore. Which is more
correct behaviour after all we serialize/deserialize during the
transition and should be needlessly destructive.
"ExitCode" is a bit of a misnomer in two ways: it suggests this was
about the "exit code" concept that exit()/waitid() deal with, but really
isn't. Moreover, it's not event just about exiting either, but more
often about reloading/reexecing or rebooting. Let's hence pick a new
name for this that is a bit more correct.
I initially thought about naming this the "state", but that'd be a
misnomer too, as the value really encodes a "goal" more than a current
state. Also we already have the externally visible ManagerState.
No actual changes in behaviour, just the rename.
Previously, we'd act immediately on StopWhenUnneeded= when a unit state
changes. With this rework we'll maintain a queue instead: whenever
there's the chance that StopWhenUneeded= might have an effect we enqueue
the unit, and process it later when we have nothing better to do.
This should make the implementation a bit more reliable, as the unit notify event
cannot immediately enqueue tons of side-effect jobs that might
contradict each other, but we do so only in a strictly ordered fashion,
from the main event loop.
This slightly changes the check when to consider a unit "unneeded".
Previously, we'd assume that a unit in "deactivating" state could also
be cleaned up. With this new logic we'll only consider units unneeded
that are fully up and have no job queued. This means that whenever
there's something pending for a unit we won't clean it up.
Looking at a recent Bad Day, my log contains over 100 lines of
systemd[23895]: Failed to connect to API bus: Connection refused
It is due to "systemd --user" retrying to connect to an API bus.[*] I
would prefer to avoid spamming the logs. I don't think it is good for us
to retry so much like this.
systemd was mislead by something setting DBUS_SESSION_BUS_ADDRESS. My best
guess is an unfortunate series of events caused gdm to set this. gdm has
code to start a session dbus if there is not a bus available already (and
in this case it exports the environment variable). I believe it does not
normally do this when running under systemd, because "systemd --user" and
hence "dbus.service" would already have been started by pam_systemd.
I see two possibilities
1. Rip out the check for DBUS_SESSION_BUS_ADDRESS entirely.
2. Only check for DBUS_SESSION_BUS_ADDRESS on startup. Not in the
"recheck" logic.
The justification for 2), is that the recheck is called from unit_notify(),
this is used to check whether the service just started (or stopped) was
"dbus.service". This reason for rechecking does not apply if we think
the session bus was started outside our logic.
But I think we can justify 1). dbus-daemon ships a statically-enabled
/usr/lib/systemd/user/dbus.service, which would conflict with an attempt to
use an external dbus. Also "systemd --user" is started from user@.service;
if you try to start it manually so that it inherits an environment
variable, it will conflict if user@.service was started by pam_systemd
(or loginctl enable-linger).
We add n_installed_jobs and n_failed_jobs to our inner state after
deserialization. This is fine during daemon-reexec when we start with clear
Manager (and some jobs possibly queued before deserialization), however,
daemon-reload works with the same manager and adding the values would
effectively double the counters. Reset the counters before we deserialize and
add their values again.
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
To make debugging easier, this patches allows one to change the log target and
do reload/reexec without modifying configuration permanently, which makes
debugging easier.
Indeed if one changed the log target at runtime (via the bus or via signals),
the change was lost on the next reload/reexecution.
In order to restore back the default value (set via system.conf, environment
variables or any other means ), the empty string in the "LogTarget" property is
now supported as well as sending SIGTRMIN+26 signal.
To make debugging easier, this patches allows one to change the log level and
do reload/reexec without modifying configuration permanently, which makes
debugging easier.
Indeed if one changed the log max level at runtime (via the bus or via
signals), the change was lost on the next daemon reload/reexecution.
In order to restore the original value back (set via system.conf, environment
variables or any other means), the empty string in the "LogLevel" property is
now supported as well as sending SIGRTMIN+23 signal.
Previously the enumerate() callback defined for each unit type would do
two things:
1. It would create perpetual units (i.e. -.slice, system.slice, -.mount and
init.scope)
2. It would enumerate units from /proc/self/mountinfo, /proc/swaps and
the udev database
With this change these two parts are split into two seperate methods:
enumerate() now only does #2, while enumerate_perpetual() is responsible
for #1. Why make this change? Well, perpetual units should have a
slightly different effect that those found through enumeration: as
perpetual units should be up unconditionally, perpetually and thus never
change state, they should also not pull in deps by their state changing,
not even when the state is first set to active. Thus, their state is
generally initialized through the per-device coldplug() method in
similar fashion to the deserialized state from a previous run would be
put into place. OTOH units found through regular enumeration should
result in state changes (and thus pull in deps due to state changes),
hence their state should be put in effect in the catchup() method
instead. Hence, given this difference, let's also separate the
functions, so that the rule is:
1. What is created in enumerate_perpetual() should be started in
coldplug()
2. What is created in enumerate() should be started in catchup().
This is very similar to the existing unit method coldplug() but is
called a bit later. The idea is that that coldplug() restores the unit
state from before any prior reload/restart, i.e. puts the deserialized
state in effect. The catchup() call is then called a bit later, to
catch up with the system state for which we missed notifications while
we were reloading. This is only really useful for mount, swap and device
mount points were we should be careful to generate all missing unit
state change events (i.e. call unit_notify() appropriately) for
everything that happened while we were reloading.
It's a bit prettier that day as the function won't silently overwrite
any possibly pre-initialized field, and destroy it right before we
allocate a new event source.
The function is similar to path_kill_slashes() but also removes
initial './', trailing '/.', and '/./' in the path.
When the second argument of path_simplify() is false, then it
behaves as the same as path_kill_slashes(). Hence, this also
replaces path_kill_slashes() with path_simplify().
When "systemctl daemon-reload" is run at the same time as "systemctl
start foo", the latter might hang. That's because commands like start
wait for JobRemoved signal to know when the job is finished. But if the
job is finished during reloading, the signal is never sent.
The hang can be easily reproduced by running
# for ((N=1; N>0; N++)) ; do echo $N ; systemctl daemon-reload ; done
# for ((N=1; N>0; N++)) ; do echo $N ; systemctl start systemd-coredump.socket ; done
in two different terminals. The start command will hang after 1-2
iterations.
This keeps track of jobs that were started before reload and finished
during it and sends JobRemoved after the reload has finished.
We'd write a sequence that was invalid unicode and this caused the d-bus
connection to be terminated:
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/dbus_2esocket org.freedesktop.systemd1.Unit SubState
s "running"
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/dbus_e2socket org.freedesktop.systemd1.Unit SubState
Remote peer disconnected
$ busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/dbus_e2socket org.freedesktop.systemd1.Unit SubState
(hangs)
Fixes#8978.
When I see "test", I have to think three times what the return value
means. With "below" this is immediately clear. ratelimit_below(&limit)
sounds almost like English and is imho immediately obvious.
(I also considered ratelimit_ok, but this strongly implies that being under the
limit is somehow better. Most of the times this is true, but then we use the
ratelimit to detect triple-c-a-d, and "ok" doesn't fit so well there.)
C.f. a1bcaa07.
Previously we were a bit sloppy with the index and size types of arrays,
we'd regularly use unsigned. While I don't think this ever resulted in
real issues I think we should be more careful there and follow a
stricter regime: unless there's a strong reason not to use size_t for
array sizes and indexes, size_t it should be. Any allocations we do
ultimately will use size_t anyway, and converting forth and back between
unsigned and size_t will always be a source of problems.
Note that on 32bit machines "unsigned" and "size_t" are equivalent, and
on 64bit machines our arrays shouldn't grow that large anyway, and if
they do we have a problem, however that kind of overly large allocation
we have protections for usually, but for overflows we do not have that
so much, hence let's add it.
So yeah, it's a story of the current code being already "good enough",
but I think some extra type hygiene is better.
This patch tries to be comprehensive, but it probably isn't and I missed
a few cases. But I guess we can cover that later as we notice it. Among
smaller fixes, this changes:
1. strv_length()' return type becomes size_t
2. the unit file changes array size becomes size_t
3. DNS answer and query array sizes become size_t
Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=76745
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
Currently we add target dependencies while we are loading units. This
can create ordering loops even if configuration doesn't contain any
loop. Take for example following configuration,
$ systemctl get-default
multi-user.target
$ cat /etc/systemd/system/test.service
[Unit]
After=default.target
[Service]
ExecStart=/bin/true
[Install]
WantedBy=multi-user.target
If we encounter such unit file early during manager start-up (e.g. load
queue is dispatched while enumerating devices due to SYSTEMD_WANTS in
udev rules) we would add stub unit default.target and we order it Before
test.service. At the same time we add implicit Before to
multi-user.target. Later we merge two units and we create ordering cycle
in the process.
To fix the issue we will now never add any target dependencies until we
loaded all the unit files and resolved all the aliases.
This is similar to TAKE_PTR() but operates on file descriptors, and thus
assigns -1 to the fd parameter after returning it.
Removes 60 lines from our codebase. Pretty good too I think.
Let's better check this inside of the call than before it, so that we
never issue this while reloading, even should these calls be called due
to other reasons than just the unit notify.
This makes sure the reload state is unset a bit earlier in
manager_reload() so that we can safely call this function from there and
they do the right thing.
Follow-up for e63ebf71ed.
When running tests like test-unit-name, there is not point in setting
up the cgroup and signals and interacting with the environment. Similarly
when running fuzz testing of the parser.
Add new MANAGER_TEST_RUN_BASIC which takes the role of MANAGER_TEST_RUN_MINIMAL,
and redefine MANAGER_TEST_RUN_MINIMAL to just create the basic data structures.
Reproducer:
$ meson build && cd build
$ ninja
$ sudo useradd test
$ sudo su test
$ ./systemd --system --test
...
Failed to create /user.slice/user-1000.slice/session-6.scope/init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
Above error message is caused by the fact that user test didn't have its
own session and we tried to set up init.scope already running as user
test in the directory owned by different user.
Let's skip setting up init.scope altogether since we won't be launching
processes anyway.
We maintain a queue of units and jobs that we are supposed to generate
change/new notifications for because they were either just created or
some of their property has changed. Let's throttle processing of this
queue a bit: as soon as > 1K of bus messages are queued for writing
let's skip processing the queue, and then recheck on the next
iteration again.
Moreover, never process more than 100 units in one go, return to the
event loop after that. Both limits together should put effective limits
on both space and time usage of the function, delaying further
operations until a later moment, when the queue is empty or the the
event loop is sufficiently idle again.
This should keep the number of generated messages much lower than
before on busy systems or where some client is hanging.
Note that this also means a bad client can slow down message dispatching
substantially for up to 90s if it likes to, for all clients. But that
should be acceptable as we only allow trusted bus clients, anyway.
Fixes: #8166
config_parse_join_controllers would free the destination argument on failure,
which is contrary to our normal style, where failed parsing has no effect.
Moving it to shared also allows a test to be added.
A .socket will reference a .service unit, by registering a UnitRef with the
.service unit. If this .service unit has the .socket unit listed in Wants or
Sockets or such, a cycle will be created. We would not free this cycle
properly, because we treated any unit with non-empty refs as uncollectable. To
solve this issue, treats refs with UnitRef in u->refs_by_target similarly to
the refs in u->dependencies, and check if the "other" unit is known to be
needed. If it is not needed, do not treat the reference from it as preventing
the unit we are looking at from being freed.
"check" is unclear: what is true, what is false? Let's rename to "can_gc" and
revert the return value ("positive" values are easier to grok).
v2:
- rename from unit_can_gc to unit_may_gc
I think if we log the error as being _ignored_, we should also consider
the event as handled and clear it. This was the behaviour prior to
575b300b (PR #7968).
I don't think we particularly wanted to change behaviour and keep retrying.
Sometimes that's useful, other times you cause more problems by filling the
logs.
Plus a nearby typo fix.
Previously, we'd synchronize bus names immediately when we succeeded
connecting to the bus, potentially even before coldplugging the units.
This was problematic, as synchronizing bus names meant invoking the
per-unit name change handler function which might change the unit's
state — which will result in consistency when done before we coldplug
things.
With this change we instead enqueue a job for the event loop to resync
the names in a later loop iteration, i.e. at a point where we know
coldplugging has finished.
Let's also use the journal if it is currently reloading. In that state
it should also be able to process our requests. Moreover, we might
otherwise end up disconnecting/reconnecting from the journal without
really any need to hence, relax the check accordingly.
This removes the current bus_init() call, as it had multiple problems:
it munged handling of the three bus connections we care about (private,
"api" and system) into one, even though the conditions when which was
ready are very different. It also added redundant logging, as the
individual calls it called all logged on their own anyway.
The three calls bus_init_api(), bus_init_private() and bus_init_system()
are now made public. A new call manager_dbus_is_running() is added that
works much like manager_journal_is_running() and is a lot more careful
when checking whether dbus is around. Optionally it checks the unit's
deserialized_state rather than state, in order to accomodate for cases
where we cant to connect to the bus before deserializing the
"subscribed" list, before coldplugging the units.
manager_recheck_dbus() is added, that works a lot like
manager_recheck_journal() and is invoked in unit_notify(), i.e. when
units change state.
All in all this should make handling a bit more alike to journal
handling, and it also fixes one major bug: when running in user mode
we'll now connect to the system bus early on, without conditionalizing
this in anyway.
Let's simplify things a bit: we so far called both functions every
single time, let's just merge one into the other, so that we have fewer
functions to call.
After discussions with @htejun it appears it's OK now to enable memory
accounting by default for all units without affecting system performance
too badly. facebook has made good experiences with deploying memory
accounting across their infrastructure.
This hence turns MemoryAccounting= from opt-in to opt-out, similar to
how TasksAccounting= is already handled. The other accounting options
remain off, their performance impact is too big still.
Before this, each ExecRuntime object is owned by a unit. However,
it may be shared with other units which enable JoinsNamespaceOf=.
Thus, by the serialization/deserialization process, its sharing
information, more specifically, reference counter is lost, and
causes issue #7790.
This makes ExecRuntime objects be managed by manager, and changes
the serialization/deserialization process.
Fixes#7790.
log_open_console() did not switch from stderr to /dev/console, when
"always_reopen_console" was set. It was necessary to call
log_close_console() first.
By contrast, log_open() did switch between e.g. journald and kmsg according
to the value of "prohibit_ipc".
Let's fix log_open() to respect the values of all the log options, and we
can make log_close_*() private.
Also log_close_console() is changed. There was some precaution, avoiding
closing the console fd if we are not PID 1. I think commit 48a601fe made
a little mistake in leaving this in, and it only served to confuse
readers :).
Also I changed systemd-shutdown. Now we have log_set_prohibit_ipc(), let's
use it to clarify that systemd-shutdown is not expected to try and log via
journald (which it is about to kill). We avoided ever asking it to, but
it's more convenient for the reader if they don't have to think about that.
In that sense, it's similar to using assert() to validate a function's
arguments.
If we have to force the logging to close the journal fd, then we can open
any fallback log target. E.g. kmsg, if the target was the default
JOURNAL_OR_KMSG.
This is the behaviour I would expect from the documentation. I couldn't
find any justification in the code, for why we would want to start dropping
log messages instead of sending them to the fallback target.
This means we will match the behaviour of processes which we fork and which
set `open_when_needed`, and with generators - which use
log_set_prohibit_ipc(true) - which we fork+exec during a reload.
IMO this illustrates that the log_open/log_close interface is too clunky.
So with the behaviour settled, I will refactor the interface in the next
commit :).
Let's add a per-unit boolean that tells us whether our unit is currently
counted or not. This way it's unlikely we get out of sync again and
things are generally more robust.
This also allows us to remove the counting logic specific to service
units (which was in fact mostly a copy from the generic implementation),
in favour of fully generic code.
Replaces: #7824
Since the the whole function ultimately is just a fancy getter for the
show_status field, let's actually return it as last step literally
without an extra needless "if".
Previously, we'd maintain two hashmaps keyed by PIDs, pointing to Unit
interested in SIGCHLD events for them. This scheme allowed a specific
PID to be watched by exactly 0, 1 or 2 units.
With this rework this is replaced by a single hashmap which is primarily
keyed by the PID and points to a Unit interested in it. However, it
optionally also keyed by the negated PID, in which case it points to a
NULL terminated array of additional Unit objects also interested. This
scheme means arbitrary numbers of Units may now watch the same PID.
Runtime and memory behaviour should not be impact by this change, as for
the common case (i.e. each PID only watched by a single unit) behaviour
stays the same, but for the uncommon case (a PID watched by more than
one unit) we only pay with a single additional memory allocation for the
array.
Why this all? Primarily, because allowing exactly two units to watch a
specific PID is not sufficient for some niche cases, as processes can
belong to more than one unit these days:
1. sd_notify() with MAINPID= can be used to attach a process from a
different cgroup to multiple units.
2. Similar, the PIDFile= setting in unit files can be used for similar
setups,
3. By creating a scope unit a main process of a service may join a
different unit, too.
4. On cgroupsv1 we frequently end up watching all processes remaining in
a scope, and if a process opens lots of scopes one after the other it
might thus end up being watch by many of them.
This patch hence removes the 2-unit-per-PID limit. It also makes a
couple of other changes, some of them quite relevant:
- manager_get_unit_by_pid() (and the bus call wrapping it) when there's
ambiguity will prefer returning the Unit the process belongs to based on
cgroup membership, and only check the watch-pids hashmap if that
fails. This change in logic is probably more in line with what people
expect and makes things more stable as each process can belong to
exactly one cgroup only.
- Every SIGCHLD event is now dispatched to all units interested in its
PID. Previously, there was some magic conditionalization: the SIGCHLD
would only be dispatched to the unit if it was only interested in a
single PID only, or the PID belonged to the control or main PID or we
didn't dispatch a signle SIGCHLD to the unit in the current event loop
iteration yet. These rules were quite arbitrary and also redundant as
the the per-unit handlers would filter the PIDs anyway a second time.
With this change we'll hence relax the rules: all we do now is
dispatch every SIGCHLD event exactly once to each unit interested in
it, and it's up to the unit to then use or ignore this. We use a
generation counter in the unit to ensure that we only invoke the unit
handler once for each event, protecting us from confusion if a unit is
both associated with a specific PID through cgroup membership and
through the "watch_pids" logic. It also protects us from being
confused if the "watch_pids" hashmap is altered while we are
dispatching to it (which is a very likely case).
- sd_notify() message dispatching has been reworked to be very similar
to SIGCHLD handling now. A generation counter is used for dispatching
as well.
This also adds a new test that validates that "watch_pid" registration
and unregstration works correctly.
This fundamentally makes one change: we never process more than one
signal or more than one waitid() event per event loop. We'll never tight
loop around waitid() or around read() on our signalfd instead, but
always return to the main event loop after processing one event.
By doing this we put the event priorization handling into full power
again, as we'll always check for higher priority events before looking
at the next signal or waitid() again.
This introduces a new "defer" event source "sigchld_event". It's enabled
as soon as we see SIGCHLD, and disabled as soon as waitid() reported no
further children pending. It's running at a relatively high priority,
one step higher than signal handling itself, but lower than
/proc/self/mountinfo event handling, so that the latter always takes
precedence.
Since we want to process sd_notify() events at an even higher priority
than SIGCHLD (as before) it is moved one priority step up, too.
Fixes: #7932
Possibly fixes: #7966
Let's shorten manager_check_finished() a bit by splitting out checking
of basic.target and the two things we do when we reach it.
This should not change behaviour, except for one thing: we now check
basic.target's actual state for figuring out whether it is up, instead
of generically checking whether it has any job queued. This is arguably
more correct, and is what other code does too for similar purposes, for
example manager_state()
This happens to be almost the same moment as when we send READY=1 in the user
instance, but the logic is slightly different, since we log taint when
basic.target is reached in the system manager, but we send the notification
only in the user manager. So add a separate flag for this and propagate it
across reloads.
Fixes#7683.
Let's be more restrictive when validating PID files and MAINPID=
messages: don't accept PIDs that make no sense, and if the configuration
source is not trusted, don't accept out-of-cgroup PIDs. A configuratin
source is considered trusted when the PID file is owned by root, or the
message was received from root.
This should lock things down a bit, in case service authors write out
PID files from unprivileged code or use NotifyAccess=all with
unprivileged code. Note that doing so was always problematic, just now
it's a bit less problematic.
When we open the PID file we'll now use the CHASE_SAFE chase_symlinks()
logic, to ensure that we won't follow an unpriviled-owned symlink to a
privileged-owned file thinking this was a valid privileged PID file,
even though it really isn't.
Fixes: #6632
If we have to chose between truncated escape sequences and strings
exploded to 4 times the desried length by fully escaping, prefer the
latter.
It's for debug only, hence doesn't really matter much.
Let's rename it manager_sanitize_environment() which is a more precise
name. Moreover, sort the environment implicitly inside it, as all our
callers do that anyway afterwards and we can save some code this way.
Also, update the list of env vars to drop, i.e. the env vars we manage
ourselves and don't want user code to interfear with. Also sort this
list to make it easier to update later on.
This is useful so that callers know whether anything at all and how much
was flushed.
This patches through users of this functions to ensure that the return
values > 0 which may be returned now are not propagated in public APIs.
Also, users that ignore the return value are changed to do so explicitly
now.
This makes things a bit easier to read I think, and also makes sure we
always use the _unlikely_ wrapper around it, which so far we used
sometimes and other times we didn't. Let's clean that up.