We already refuse sd_event_add_signal() if the specified signal is not
blocked, let's do this also for sd_event_add_child(), since we might
need signalfd() to implement this, and this means the signal needs to be
blocked.
This adds support for watching for process exits via Linux new pidfd
concept. This makes watching processes and killing them race-free if
properly used, fixing a long-standing UNIX misdesign.
This patch adds implicit and explicit pidfd support to sd-event: if a
process shall be watched and is specified by PID we will now internally
create a pidfd for it and use that, if available. Alternatively a new
constructor for child process event sources is added that takes pidfds
as input.
Besides mere watching of child processes via pidfd two additional
features are added:
→ sd_event_source_send_child_signal() allows sending a signal to the
process being watched in the safest way possible (wrapping
the new pidfd_send_signal() syscall).
→ sd_event_source_set_child_process_own() allows marking a process
watched for destruction as soon as the event source is freed. This is
currently implemented in userspace, but hopefully will become a kernel
feature eventually.
Altogether this means an sd_event_source object is now a safe and stable
concept for referencing processes in race-free way, with automatic
fallback to pre-pidfd kernels.
Note that this patch adds support for this only to sd-event, not to PID
1. That's because PID 1 needs to use waitid(P_ALL) for reaping any
process that might get reparented to it. This currently semantically
conflicts with pidfd use for watching processes since we P_ALL is
undirected and thus might reap process earlier than the pidfd notifies
process end, which is hard to handle. The kernel will likely gain a
concept for excluding specific pidfds from P_ALL watching, as soon as
that is around we can start making use of this in PID 1 too.
From v4.12 the kernel appends some attributes to netlink acks
containing a textual description of the error and other fields.
This makes sd-netlink parse the attributes.
Virtual filesystems such as sysfs or procfs use kernfs, and kernfs can work
with two sorts of virtual files.
One sort uses "seq_file", and the results of the first read are buffered for
the second read. The other sort uses "raw" reads which always go direct to the
device.
In the later case, the content of the virtual file must be retrieved with a
single read otherwise subsequent read might get the new value instead of
finding EOF immediately. That's the reason why the usage of fread(3) is
prohibited in this case as it always performs a second call to read(2) looking
for EOF which is subject to the race described previously.
Fixes: #13585.
chase_symlinks() would return negative on error, and either a non-negative status
or a non-negative fd when CHASE_OPEN was given. This made the interface quite
complicated, because dependning on the flags used, we would get two different
"types" of return object. Coverity was always confused by this, and flagged
every use of chase_symlinks() without CHASE_OPEN as a resource leak (because it
would this that an fd is returned). This patch uses a saparate output parameter,
so there is no confusion.
(I think it is OK to have functions which return either an error or an fd. It's
only returning *either* an fd or a non-fd that is confusing.)
sd-netlink is not public yet, so we can change the interface.
I did not touch interfaces of functions like sd_netlink_wait() and
sd_rtnl_message_new_link() which do not modify the object that is passed in,
because in the future we might want to change the code to e.g. take a
reference to the parent object or otherwise require a non-const reference.
It is natural that n_attiributes is less than type. But in that case,
the message does not contain any message about the type. So, we should
not abort execution with assertion, but just return -ENODATA.
This avoid the use of the global variable.
Also rename cgroup_unified_update() to cgroup_unified_cached() and
cgroup_unified_flush() to cgroup_unified() to better reflect their new roles.
This code is not part of the public API of sd-netlink, nor used by it
internally and hence should not be in the sd-netlink directory.
Also, move the test case for it to src/test/.