Use a trivial header file to share mnt_free_tablep and mnt_free_iterp.
It would be nicer put this in mount-util.h, but libmount.h is not in the
default include path, and the build system would have to be adjusted to pass
pkg-config include path in various places, and it's just not worth the trouble.
A separate header file works nicely.
Some PIDs can remain in the watched list even though their processes have
exited since a long time. It can easily happen if the main process of a forking
service manages to spawn a child before the control process exits for example.
However when a pid is about to be mapped to a unit by calling unit_watch_pid(),
the caller usually knows if the pid should belong to this unit exclusively: if
we just forked() off a child, then we can be sure that its PID is otherwise
unused. In this case we take this opportunity to remove any stalled PIDs from
the watched process list.
If we learnt about a PID in any other form (for example via PID file, via
searching, MAINPID= and so on), then we can't assume anything.
Just some renaming, no change in behaviour.
Background: I'd like to add more functions unit_test_xyz() that test
various things, hence let's streamline the naming a bit.
As device units will be reloaded by systemd whenever the corresponding device generates a "changed" event, if the mount unit / cryptsetup service is wanted by its device unit, the former can be restarted by systemd unexpectedly after the user stopped them explicitly. It is not sensible at all and can be considered dangerous. Neither is the behaviour conventional (as `auto` in fstab should only affect behaviour on boot and `mount -a`) or ever documented at all (not even in systemd, see systemd.mount(5) and crypttab(5)).
The problem was introduced in a37422045fbb68ad68f734e5dc00e0a5b1759773:
we have a unit which has a fragment, and when we'd update it based on
/proc/self/mountinfo, we'd say that e.g. What=/dev/loop8 has origin-fragment.
This commit changes two things:
- origin-fragment is changed to origin-mountinfo-implicit
- when we stop a unit, mountinfo information is flushed and all deps based
on it are dropped.
The second step is important, because when we restart the unit, we want to
notice that we have "fresh" mountinfo information. We could keep the old info
around and solve this in a different way, but keeping stale information seems
inelegant.
Fixes#11342.
This reverts commit 89f9752ea0.
This patch causes various problems during boot, where a "mount storm" occurs
naturally. Current approach is flakey, and it seems very risky to push a
feature like this which impacts boot right before a release. So let's revert
for now, and consider a more robust solution after later.
Fixes#11209.
> https://github.com/systemd/systemd/pull/11196#issuecomment-448523186:
"Reverting 89f9752ea0 and fcfb1f775e fixes this test."
The starting of mount units requires that changes to
/proc/self/mountinfo be processed before the SIGCHILD from the
completion of /sbin/mount is processed, as described by the comment
/* Note that due to the io event priority logic, we can be sure the new mountinfo is loaded
* before we process the SIGCHLD for the mount command. */
The recently-added mount-storm protection can defeat this as it
will sometimes deliberately delay processing of /proc/self/mountinfo.
So we need to disable mount-storm protection when a mount unit is starting.
We do this by keeping a counter of the number of pending
mounts, and disabling the protection when this is non-zero.
Thanks to @asavah for finding and reporting this problem.
If we create 2000 mounts (on a 1-CPU qemu VM) with
mkdir -p /MNT/{1..2000}
time for i in {1..2000}; do mount --bind /etc /MNT/$i ; done
it takes around 20 seconds to complete. Much of this time is taken up
by systemd repeatedly processing /proc/self/mountinfo.
If I disable the processing, the time drops to about 4 seconds.
I have reports that on a larger system with multiple active user sessions, each
with it's own systemd, the impact can be higher.
One particular use-case where a large number of mounts can be expected in quick
succession is when the "clearcase" SCM starts up.
This patch modifies the handling up events from /proc/self/mountinfo so
that systemd backs off when a storm is detected. Specifically the time to process
mountinfo is measured, and the process will not be repeated until 10 times
that duration has passed. This ensures systemd won't use more than 10% of
real time processing mountinfo.
With this patch, my test above takes about 5 seconds.
For services (and other units) we generally follow the rule that at the
beginning of each cycle, i.e. when the INACTIVE/FAILED state is left for
ACTIVATING/ACTIVE we flush out various state variables. Mount units
handled this differently so far when the unit state change was effected
outside of systemd: in that case these variables would be flushed out
when going back to INACTIVE/FAILED already.
Let's fix that, and flush out this state always during the activating
transition, not during the deactivating transition.
We never know what the changes triggered by mount_set_state() do to the
unit. Let's be safe and copy the device path into our set, so that we
are safe against that.
It doesn't matter what kind of precise failure we had earlier with
loading the unit, let's report that it loaded successfully now, after
all the kernel is an OK source for that, like any other.
Whenever we notice a change on an existing /proc/self/mountinfo line,
let's update the deps generated from it. For that, let's flush out the
old deps generated this way, and add in the new ones.
This takes benefit of the fact that today (unlike a comment this patch
removes says) we can remove deps in a somewhat reasonable way.
Let's set 'from_proc_self_mountinfo' right away, since we know its from
there. This is important so that when the load queue is dispatched (and
thus mount_load() called) this
fact is already known.
In a previous commit we added logic that mount_add_extras() (or more
precisely mount_add_default_dependencies()) adds in dependencies on
remote-fs.target and local-fs.target, hence we can drop this from
mount_setup_new_unit() and let the usual load queue dispatching take
care of this.
Let's initialize two fields with free_and_strdup() rather than directly
with strdup(). The fields should not be initialized so far, but it's
still nicer to be prepared for futzre code changes and always free
what's stored before replacing it.
If we can't process a specific line in /proc/self/mountinfo we should
log about it (which we do), but this should not affect other lines, nor
further processing of mount units. Let's keep these failures local.
Fixes: #10874
We need to make sure that the slice property is initialized whenever
mount_load() is invoked, even if we fail to load things properly off
disk. This is important since we generally don't allow changing the
slice after a unit has been started. But given that we must track the
state of external objects with mount units we must hence initialize the
property no matter what.
This deps are very similar to the -pre deps, hence establish them at the
same place, in particular as they should only be generated if default
deps are on.
This allows us to later on remove similar code that adds in these deps
whenever /proc/self/mountinfo changes.
This allows clients to follow our internal state changes safely.
Previously, quick state changes (for example, when we restart a unit due
to Restart= after it quickly transitioned through DEAD/FAILED states)
would be coalesced into one bus signal event, with this change there's
the guarantee that all state changes after the unit was announced ones
are reflected on th bus.
Note we only do this kind of guaranteed flushing only for unit state
changes, not for other unit property changes, where clients still have
to expect coalescing. This is because the unit state is a very
important, high-level concept.
Fixes: #10185
This splits the "environment" field of Manager into two:
transient_environment and client_environment. The former is generated
from configuration file, kernel cmdline, environment generators. The
latter is the one the user can control with "systemctl set-environment"
and similar.
Both sets are merged transparently whenever needed. Separating the two
sets has the benefit that we can safely flush out the former while
keeping the latter during daemon reload cycles, so that env var settings
from env generators or configuration files do not accumulate, but
dynamic API changes are kept around.
Note that this change is not entirely transparent to users: if the user
first uses "set-environment" to override a transient variable, and then
uses "unset-environment" to unset it again things will revert to the
original transient variable now, while previously the variable was fully
removed. This change in behaviour should not matter too much though I
figure.
Fixes: #9972
Let's be more careful with what we serialize: let's ensure we never
serialize strings that are longer than LONG_LINE_MAX, so that we know we
can read them back with read_line(…, LONG_LINE_MAX, …) safely.
In order to implement this all serialization functions are move to
serialize.[ch], and internally will do line size checks. We'd rather
skip a serialization line (with a loud warning) than write an overly
long line out. Of course, this is just a second level protection, after
all the data we serialize shouldn't be this long in the first place.
While we are at it also clean up logging: while serializing make sure to
always log about errors immediately. Also, (void)ify all calls we don't
expect errors in (or catch errors as part of the general
fflush_and_check() at the end.