It's a bit prettier that way as the function won't silently overwrite
any possibly pre-initialized field, and destroy it right before we
allocate a new event source.
This tests a couple of corner cases of the sd-event API including
changing priorities of existing event sources, as well as overflow
conditions of the inotify queue.
This adds a new call sd_event_add_inotify() which allows watching for
inotify events on specified paths.
sd-event will try to minimize the number of inotify fds allocated, and
will try to add file watches to the same inotify fd objects as far as
that's possible. Sharing inotify objects like this should optimize
behaviour in programs that watch a limited set of mostly independent
files as in most cases a single inotify object will suffice for watching
all files.
Traditionally, this kind of coalescing logic (i.e. that multiple event
sources are implemented on top of a single inotify object) was very hard
to do, as the inotify API had serious limitations: it only allowed
adding watches by path, and would implicitly merge watches installed on
the same node via different paths, without letting the caller know
whether such merging took place or not.
With the advent of O_PATH this issue can be dealt with to some point:
instead of adding a path to watch to an inotify object with
inotify_add_watch() right away, we can open the path with O_PATH first,
call fstat() on the fd, and check the .st_dev/.st_ino fields of that
against a list of watches we already have in place. If we find one we
know that the inotify_add_watch() will update the watch mask of the
existing watch, otherwise it will create a new watch. To make this
race-free we use inotify_add_watch() on the /proc/self/fd/ path of the
O_PATH fd, instead of the original path, so that we do the checking and
watch updating on what is guaranteed to be the same inode.
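
A minimal sketch of the O_PATH sequence described above. This is not the
sd-event code: watch_inode() is a hypothetical helper, and the lookup of
(st_dev, st_ino) in a table of existing watches is left to the caller.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: pins the inode behind 'path' with O_PATH, reports its
 * identity, then adds the watch through /proc/self/fd/ so that the identity
 * check and the watch refer to the same inode. Returns the watch descriptor
 * or a negative errno-style value. */
static int watch_inode(int inotify_fd, const char *path, uint32_t mask,
                       dev_t *ret_dev, ino_t *ret_ino) {
        char procfs_path[sizeof("/proc/self/fd/") + 3 * sizeof(int)];
        struct stat st;
        int fd, wd, saved_errno;

        assert(inotify_fd >= 0);
        assert(path);

        fd = open(path, O_PATH | O_CLOEXEC);
        if (fd < 0)
                return -errno;

        if (fstat(fd, &st) < 0) {
                saved_errno = errno;
                close(fd);
                return -saved_errno;
        }

        /* A caller would compare these against its list of existing watches to
         * learn whether inotify_add_watch() is going to update one of them. */
        if (ret_dev)
                *ret_dev = st.st_dev;
        if (ret_ino)
                *ret_ino = st.st_ino;

        snprintf(procfs_path, sizeof(procfs_path), "/proc/self/fd/%i", fd);
        wd = inotify_add_watch(inotify_fd, procfs_path, mask);
        saved_errno = errno;
        close(fd); /* the watch does not need the fd to stay open */

        return wd < 0 ? -saved_errno : wd;
}
```

Calling this twice for the same inode through different paths (for example
"/" and "/.") reports the same st_dev/st_ino pair and returns the same watch
descriptor, which is the merging behaviour the check is meant to detect.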
This approach lets us deal safely with inodes that may appear under
various different paths (due to symlinks, hardlinks, bind mounts, fs
namespaces). However it's not a perfect solution: currently the kernel
has no API for changing the watch mask of an existing watch -- unless
you have a path or fd to the original inode. This means we can "merge"
the watches of the same inode of multiple event sources correctly, but
we cannot "unmerge" it again correctly in many cases, as access to the
original inode might have been lost, due to renames, mount/unmount, or
deletions. We could in theory always keep open an O_PATH fd of the inode
to watch so that we can change the mask anytime we want, but this is
highly problematic, as it would consume too many fds (and in fact the
scarcity of fds is the reason why watch descriptors are a separate
concept from fds) and would keep the backing mounts busy (wds do not
keep mounts busy, fds do). The currently implemented approach to all this:
filter in userspace and accept that the watch mask on some inode might
be higher than necessary due to earlier installed event sources that
might have ceased to exist. This approach, while ugly, shouldn't be too
bad for most cases, as the same inodes are probably watched for the same
masks in most implementations.
In order to implement priorities correctly a separate inotify object is
allocated for each priority that is used. This way we get separate
per-priority event queues, of which we never dequeue more than a few
events at a time.
Fixes: #3982
Scope units don't have a main or control process we can watch, hence
let's explicitly watch the PIDs contained in them early on, just to make
things more robust and have at least something to watch.
This reworks how systemd tracks processes on cgroupv1 systems where
cgroup notification is not reliable. Previously, whenever we had reason
to believe that new processes showed up or got removed we'd scan the
cgroup of the scope or service unit for new processes, and would tidy up
the list of PIDs previously watched. This scanning is relatively slow,
and does not scale well. With this change the behaviour is different: instead
of scanning for new/removed processes right away we do this work in a
per-unit deferred event loop job. This event source is scheduled at a
very low priority, so that it is executed when we have time but does not
starve other event sources. This has two benefits: this expensive work is
coalesced, if events happen in quick succession, and we won't delay
SIGCHLD handling for too long.
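
The coalescing idea can be sketched roughly as follows. Unit, Queue,
enqueue_rewatch() and dispatch_rewatch() are hypothetical stand-ins for
illustration, not the actual manager code, and the scan itself is reduced to
a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a unit and the low-priority job queue. */
typedef struct Unit {
        bool rewatch_queued;   /* true while a rescan is already pending */
        unsigned n_scans;      /* how often the expensive scan actually ran */
        struct Unit *next;
} Unit;

typedef struct Queue {
        Unit *head;
} Queue;

/* Cheap: may be called for every event that hints at a process change. */
static void enqueue_rewatch(Queue *q, Unit *u) {
        assert(q);
        assert(u);

        if (u->rewatch_queued)
                return;        /* coalesce with the request already pending */

        u->rewatch_queued = true;
        u->next = q->head;
        q->head = u;
}

/* Runs when the event loop has nothing more urgent to do. Returns the number
 * of units that were rescanned. */
static unsigned dispatch_rewatch(Queue *q) {
        unsigned n = 0;

        assert(q);

        while (q->head) {
                Unit *u = q->head;

                q->head = u->next;
                u->next = NULL;
                u->rewatch_queued = false;
                u->n_scans++;  /* placeholder for the cgroup scan and PID tidy-up */
                n++;
        }

        return n;
}
```

However many times enqueue_rewatch() is called for a unit between two
dispatches, the placeholder scan runs once for that unit.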
This patch basically replaces all direct invocations of
unit_watch_all_pids() in scope.c and service.c with invocations of the
new unit_enqueue_rewatch_pids() call which just enqueues a request of
watching/tidying up the PID sets (with one exception: in
scope_enter_signal() and service_enter_signal() we'll still do
unit_watch_all_pids() synchronously first, since we really want to know
all processes we are about to kill so that we can track them properly).
Moreover, all direct invocations of unit_tidy_watch_pids() and
unit_synthesize_cgroup_empty_event() are removed too, when the
unit_enqueue_rewatch_pids() call is invoked, as the queued job will run
those operations too.
All of this is done on cgroupsv1 systems only, and is disabled on
cgroupsv2 systems as cgroup-empty notifications are reliable there, and
we do not need SIGCHLD events to track processes there.
Fixes: #9138
This way we don't need to repeat the argument twice.
I didn't replace all instances. I think it's better to leave out:
- asserts
- comparisons like x & y == x, which are mathematically equivalent, but
here we aren't checking if flags are set, but if the argument fits in the
flags.
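
A small illustration of the distinction, assuming the macro in question is
FLAGS_SET() with the definition shown here (the message above does not name
it). The MODE_* constants and both helper functions are made up for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed definition: true if every bit of 'flags' is also set in 'v'.
 * 'flags' is written only once at the call site. */
#define FLAGS_SET(v, flags) ((~(v) & (flags)) == 0)

/* Hypothetical flag values, for illustration only. */
enum {
        MODE_READ  = 1u << 0,
        MODE_WRITE = 1u << 1,
        MODE_EXEC  = 1u << 2,
        MODE_KNOWN = MODE_READ | MODE_WRITE | MODE_EXEC,
};

/* "Are these flags set?": a natural fit for the macro, instead of
 * (mode & (MODE_READ|MODE_WRITE)) == (MODE_READ|MODE_WRITE). */
static bool wants_read_write(uint32_t mode) {
        return FLAGS_SET(mode, MODE_READ | MODE_WRITE);
}

/* "Does the argument fit in the known flags?": the same arithmetic, but
 * spelled out, because the question being asked is a different one. */
static bool fits_in_known(uint32_t arg) {
        return (arg & MODE_KNOWN) == arg;
}
```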
Previously, we'd not care about failures that were seen earlier and
would remain in "exited" state. This could be triggered if the main process of
a service failed while ExecStartPost= was still running, as in that case
we'd not immediately act on the main process failure because we needed
to wait for ExecStartPost= to finish, before acting on it.
Fixes: #8929
Let's show a message at the time of logout i.e. entering the "closing"
state, not just e.g. once the user closes `tmux` and the session can be
removed completely. (At least when KillUserProcesses=no applies. My
thinking was we can spare the log noise if we're killing the processes
anyway).
These are two independent events. I think the logout event is quite
significant in the session lifecycle. It will be easier for a user who
does not know logind details to understand why "Removed session" doesn't
appear at logout time, if we have a specific message we can show at this
time :).
Tested using tmux and KillUserProcesses=no. I can also confirm the extra
message doesn't show when using KillUserProcesses=yes. Maybe it looks a
bit mysterious when you use KillOnlyUsers= / KillExcludeUsers=, but
hopefully not alarmingly so.
I was looking at systemd-logind messages on my system, because I can
reproduce two separate problems with Gnome on Fedora 28 where
sessions are unexpectedly in state "closing". (One where a GUI session
limps along in a degraded state[1], and another where spice-vdagent is left
alive after logout, keeping the session around[2]). It logged when
sessions were created and removed, but it didn't log when the session
entered the "closing" state.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1583240#c1
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1583261
Closes #9096
The function is similar to path_kill_slashes() but also removes
initial './', trailing '/.', and '/./' in the path.
When the second argument of path_simplify() is false, then it
behaves the same as path_kill_slashes(). Hence, this also
replaces path_kill_slashes() with path_simplify().
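
To make the described behaviour concrete, here is a rough re-implementation
under a different name. simplify_path() is hypothetical and may differ from
the real path_simplify() in edge cases, for instance in what it returns for
a relative path that consists only of dots.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical sketch: collapses runs of slashes, strips trailing slashes
 * and, if kill_dots is true, drops every "." component. Works in place and
 * returns 'path'. */
static char *simplify_path(char *path, bool kill_dots) {
        bool absolute, first = true;
        char *in, *out;

        assert(path);

        absolute = path[0] == '/';
        in = out = path;

        while (*in) {
                size_t n;

                in += strspn(in, "/");     /* skip any run of slashes */
                n = strcspn(in, "/");      /* length of the next component */
                if (n == 0)
                        break;

                if (kill_dots && n == 1 && in[0] == '.') {
                        in += n;           /* drop a "." component */
                        continue;
                }

                if (absolute || !first)
                        *out++ = '/';
                memmove(out, in, n);       /* 'out' never runs ahead of 'in' */
                out += n;
                in += n;
                first = false;
        }

        if (out == path) {
                if (absolute)
                        *out++ = '/';      /* the root directory keeps its slash */
                else if (*path)
                        *out++ = '.';      /* choice of this sketch: "./" becomes "." */
        }

        *out = '\0';
        return path;
}
```

With kill_dots set to false only the slash handling remains, which is the
part the message above attributes to path_kill_slashes().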
We don't need a temporary variable when parsing just one number, because
our parsing functions do not touch the output variable on error.
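
A sketch of the convention this relies on. parse_unsigned(), Config and
config_set_workers() are hypothetical names used for illustration; the point
is only that the output is written on success and nowhere else.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical parser: *ret is assigned only if the whole string is a valid
 * decimal number that fits into an unsigned. */
static int parse_unsigned(const char *s, unsigned *ret) {
        char *end = NULL;
        unsigned long v;

        assert(s);
        assert(ret);

        if (*s < '0' || *s > '9')       /* no sign, no leading whitespace */
                return -EINVAL;

        errno = 0;
        v = strtoul(s, &end, 10);
        if (errno > 0)
                return -errno;
        if (*end != '\0')
                return -EINVAL;
        if (v > UINT_MAX)
                return -ERANGE;

        *ret = (unsigned) v;
        return 0;
}

typedef struct Config {
        unsigned n_workers;
} Config;

static int config_set_workers(Config *c, const char *value) {
        assert(c);

        /* No temporary needed: on failure c->n_workers keeps its old value. */
        return parse_unsigned(value, &c->n_workers);
}
```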
TAKE_PTR is more expressive than 'n = NULL'.
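
For illustration, a sketch of how such a macro reads at a call site. The
definition below is an assumption (it relies on GNU C statement expressions),
and freep() and make_greeting() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Assumed definition: yields the pointer's value and resets the variable to
 * NULL in one expression. Requires GNU C (gcc or clang). */
#define TAKE_PTR(ptr)                           \
        ({                                      \
                __typeof__(ptr) _ptr_ = (ptr);  \
                (ptr) = NULL;                   \
                _ptr_;                          \
        })

static void freep(char **p) {
        free(*p);
}

/* Hypothetical example: stores a newly allocated copy of 's' with '!'
 * appended in *ret. */
static int make_greeting(const char *s, char **ret) {
        __attribute__((cleanup(freep))) char *n = NULL;
        size_t l;

        assert(s);
        assert(ret);

        l = strlen(s);
        n = malloc(l + 2);
        if (!n)
                return -ENOMEM;

        memcpy(n, s, l);
        n[l] = '!';
        n[l + 1] = '\0';

        /* Instead of "*ret = n; n = NULL;": ownership moves to the caller
         * and the cleanup handler only ever sees NULL. */
        *ret = TAKE_PTR(n);
        return 0;
}
```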
First, ellipsize() and ellipsize_mem() should not read past the input
buffer. Those functions take an explicit length for the input data, so they
should not assume that the buffer is terminated by a nul.
Second, ellipsization was off in various cases where wide or multi-byte
characters were used.
We had some basic tests for ellipsize(), but apparently they weren't enough to
catch more serious cases.
Should fix https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=8686.
* it fails with gcc8 when -O1 or -Os is used (and -ftree-vrp which is added by -O2 and higher isn't used)
../git/src/basic/time-util.c: In function 'format_timespan':
../git/src/basic/time-util.c:508:46: error: '%0*llu' directive output between 1 and 2147483647 bytes may cause result to exceed 'INT_MAX' [-Werror=format-truncation=]
"%s"USEC_FMT".%0*"PRI_USEC"%s",
^~~~
../git/src/basic/time-util.c:508:60: note: format string is defined here
"%s"USEC_FMT".%0*"PRI_USEC"%s",
../git/src/basic/time-util.c:508:46: note: directive argument in the range [0, 18446744073709551614]
"%s"USEC_FMT".%0*"PRI_USEC"%s",
^~~~
../git/src/basic/time-util.c:507:37: note: 'snprintf' output 4 or more bytes (assuming 2147483651) into a destination of size 4294967295
k = snprintf(p, l,
^~~~~~~~~~~~~~
"%s"USEC_FMT".%0*"PRI_USEC"%s",
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
p > buf ? " " : "",
~~~~~~~~~~~~~~~~~~~
a,
~~
j,
~~
b,
~~
table[i].suffix);
~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
[zj: change 'char' to 'signed char']
sparc sets the carry bit when a syscall fails. Use this information to
set errno and return -1 as appropriate.
The added test case calls raw_clone() with flags known to be invalid
according to the clone(2) manpage.
We already do that in get_process_cmdline(), which is very similar in
behaviour otherwise. Hence, let's be safe and also filter them in
get_process_comm(). Let's try to retain as much information as we can
though and escape rather than suppress unprintable characters. Let's not
increase comm names beyond the kernel limit on such names however.
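
A simplified sketch of escaping instead of suppressing. escape_comm() and
COMM_MAX are hypothetical, the \xNN notation is a choice of this sketch, and
it simply stops when the next character no longer fits rather than adding an
ellipsis.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: comm names are capped at 15 bytes plus NUL. */
#define COMM_MAX 15

/* Hypothetical helper: copies 'comm' into 'buf', writing unprintable bytes
 * (and the backslash itself) as \xNN. Never emits a partial escape and never
 * produces more than COMM_MAX characters. Returns the length written. */
static size_t escape_comm(char *buf, size_t size, const char *comm) {
        static const char hex[] = "0123456789abcdef";
        size_t limit, n = 0;

        assert(buf);
        assert(size > 0);
        assert(comm);

        limit = size - 1 < COMM_MAX ? size - 1 : COMM_MAX;

        for (const unsigned char *p = (const unsigned char *) comm; *p; p++) {
                if (*p >= 0x20 && *p < 0x7f && *p != '\\') {
                        if (n + 1 > limit)
                                break;
                        buf[n++] = (char) *p;
                } else {
                        if (n + 4 > limit)
                                break;
                        buf[n++] = '\\';
                        buf[n++] = 'x';
                        buf[n++] = hex[*p >> 4];
                        buf[n++] = hex[*p & 15];
                }
        }

        buf[n] = '\0';
        return n;
}
```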
Also see discussion about this here:
https://marc.info/?l=linux-api&m=152649570404881&w=2
For short buffer sizes cellescape() was a bit wasteful, as it might
suffice to drop a single character to make enough room for the full
four byte ellipsis, if that one character was a four character escape.
With this rework we'll guarantee to drop the minimum number of
characters from the end to fit in the ellipsis.
If the buffers we write to are large this doesn't matter much. However,
if they are short (as they are when talking about the process comm
field) then it starts to matter that we put as much information as we
can in the space we get.