The latest consolidation cleanup of write_string_file() revealed some users
of that helper which should have used write_string_file_no_create() in the
past but didn't. Basically, all existing users that write to files in /sys
and /proc should not expect to write to a file which is not yet existant.
Merge write_string_file(), write_string_file_no_create() and
write_string_file_atomic() into write_string_file() and provide a flags mask
that allows combinations of atomic writing, newline appending and automatic
file creation. Change all users accordingly.
Due to our _cleanup_ usage for the udev manager, it will be destroyed
after the "exit:" label has finished. Therefore, it is the last
destruction done in main(). This has two side-effects:
- mac_selinux is destroyed before the udev manager is, possible causing
use-after-free if the manager-cleanup accesses selinux data
- log_close() is called *before* the manager is destroyed, possibly
re-opening the log if you use --debug (and thus not re-applying the
--debug option)
Avoid this by moving the manager-handling into a new function called
run(). This function will be left before we enter the "exit:" label in
main(), hence, the manager object will be destroyed early.
Make sure we never close fds before we drop their related event-source.
This will cause horrible disruptions if the fd-num is re-used by someone
else. Under normal conditions, this should not cause any problems as the
close() will drop the fd from the epoll-set automatically. However, this
changes if you have any child processes with a copy of that fd.
This fixes issue #163.
Background:
If you create an epoll-set via epoll_create() (lets call it 'EFD')
you can add file-descriptors to it to watch for events. Whenever
you call EPOLL_CTL_ADD on a file-descriptor you want to watch, the
kernel looks up the attached "struct file" pointer, that this FD
refers to. This combination of the FD-number and the "struct file"
pointer is used as key to link it into the epoll-set (EFD).
This means, if you duplicate your file-descriptor, you can watch
this file-descriptor, too (because the duplicate will have a
different FD-number, hence, the combination of FD-number and
"struct file" is different as before).
If you want to stop watching an FD, you use EPOLL_CTL_DEL and pass
the FD to the kernel. The kernel again looks up your
file-descriptor in your FD-table to find the linked "struct file".
This FD-number and "struct file" combination is then dropped from
the epoll-set (EFD).
Last, but not least: If you close a file-descriptor that is linked
to an epoll-set, the kernel does *NOTHING* regarding the
epoll-set. This is a vital observation! Because this means, your
epoll_wait() calls will still return the metadata you used to
watch/subscribe your file-descriptor to events.
There is one exception to this rule: If the file-descriptor that
you just close()ed was the last FD that referred to the underlying
"struct file", then _all_ epoll-set watches/subscriptions are
destroyed. Hence, if you never dup()ed your FD, then a simple
close() will also unsubscribe it from any epoll-set.
With this in mind, lets look at fork():
Assume you have an epoll-set (EFD) and a bunch of FDs
subscribed to events on that EFD. If you now call fork(),
the new process gets a copy of your file-descriptor table.
This means, the whole table is copied and the "struct
file" reference of each FD is increased by 1. It is
important to notice that the FD-numbers in the child are
exactly the same as in the parent (eg., FD #5 in the child
refers to the same "struct file" as FD #5 in the parent).
This means, if the child calls EPOLL_CTL_DEL on an FD, the
kernel will look up the linked "struct file" and drop the
FD-number and "struct file" combination from the epoll-set
(EFD). However, this will effectively drop the
subscription that was installed by the parent.
To sum up: even though the child gets a duplicate of the
EFD and all FDs, the subscriptions in the EFD are *NOT*
duplicated!
Now, with this in mind, lets look at what udevd does:
Udevd has a bunch of file-descriptors that it watches in its
sd-event main-loop. Whenever a uevent is received, the event is
dispatched on its workers. If no suitable worker is present, a new
worker is fork()ed to handle the event. Inside of this worker, we
try to free all resources we inherited. However, the fork() call
is done from a call-stack that is never rewinded. Therefore, this
call stack might own references that it drops once it is left.
Those references we cannot deduce from the fork()'ed process;
effectively causing us to leak objects in the worker (eg., the
call to sd_event_dispatch() that dispatched our uevent owns a
reference to the sd_event object it used; and drops it again once
the function is left).
(Another example is udev_monitor_ref() for each 'worker' that is
also inherited by all children; thus keeping the udev-monitor and
the uevent-fd alive in all children (which is the real cause for
bug #163))
(The extreme variant is sd_event_source_unref(), which explicitly
keeps event-sources alive, if they're currently dispatched,
knowing that the dispatcher will free the event once done. But
if the dispatcher is in the parent, the child will never ever
free that object, thus leaking it)
This is usually not an issue. However, if such an object has a
file-descriptor embedded, this FD is left open and never closed in
the child.
In manager_exit(), if we now destroy an object (i.e., close its embedded
file-descriptor) before we destroy its related sd_event_source, then
sd-event will not be able to drop the FD from the epoll-set (EFD). This
is, because the FD is no longer valid at the time we call EPOLL_CTL_DEL.
Hence, the kernel cannot figure out the linked "struct file" and thus
cannot remove the FD-number plus "struct file" combination; effectively
leaving the subscription in the epoll-set.
Since we leak the uevent-fd in the children, they retain a copy of the FD
pointing to the same "struct file". Thus, the EFD-subscription are not
automatically removed by close() (as described above). Therefore, the main
daemon will still get its metadata back on epoll_watch() whenever an event
occurs (even though it already freed the metadata). This then causes the
free-after-use bug described in #163.
This patch fixes the order in which we destruct objects and related
sd-event-sources. Some open questions remain:
* Why does source_io_unregister() not warn on EPOLL_CTL_DEL failures?
This really needs to be turned into an assert_return().
* udevd really should not leak file-descriptors into its children. Fixing
this would *not* have prevented this bug, though (since the child-setup
is still async).
It's non-trivial to fix this, though. The stack-context of the caller
cannot be rewinded, so we cannot figure out temporary refs. Maybe it's
time to exec() the udev-workers?
* Why does the kernel not copy FD-subscriptions across fork()?
Or at least drop subscriptions if you close() your FD (it uses the
FD-number as key, so it better subscribe to it)?
Or it better used
FD+"struct file_table*"+"struct file*"
as key to not allow the childen to share the subscription table..
*sigh*
Seems like we have to live with that API forever.
This ports a lot of manual code over to sigprocmask_many() and friends.
Also, we now consistly check for sigprocmask() failures with
assert_se(), since the call cannot realistically fail unless there's a
programming error.
Also encloses a few sd_event_add_signal() calls with (void) when we
ignore the return values for it knowingly.
Now that listen_fds() have been split out, we can safely move the allocation
of the manager object after doing the forking (the fork is done to notify legcay
init-systems that the fds are ready).
Subsequently, we can merge manager_listen() back into managre_new().
This entails a minor behaviour change: the application of permissions to
static device nodes now happens after the fork (but still before notifying
systemd about being ready).
This will simply silently fail on non-systemd systems, so there is no reason
to make it conditional.
Also make it clear that we notify systemd about being ready as the last step
before starting the event loop, whereas the forking might need to happen
earlier.
This should have no behavioural change, but it is odd to tie the cgroup cleaning to
whether or not we are passed sockets.
The point really is if we are guaranteed to be in a dedicated cgroup, so instead
check for our parent being PID1 (we already implicitly only do this on systemd
systems).
We used to block all signals, and restore the original signal mask before exec'ing
external processes.
Now we just block the signals we care about and unconditionally unblock all signals
before exec'ing.
The communication channels must all be opened before forknig in daemon mode,
or we cannot guarantee that udevadm will work correctly as soon as udevd is
started.
Only log about starting in daemon mode, rely on PID1 to log this in notify mode. Also
explicitly set the STATUS variable, as is done in notify mode as is done for other
serivecs.
This allows us to drop the special sigterm handling in spawn_wait()
as this will now be passed directly to the worker event loop.
We now log failing spawend processes at 'warning' level, and timeouts
are in terms of CLOCK_BOOTTIME when available, otherwise the behavior
is unchanged.
Rather than trying to schedule new events on every main-loop iteration, do it explicitly when
processing an event finishes, a worker is killed, a new uevent is received, or the event queue
is explicitly restarted.
The behavior is mostly unchanged, but rather than only ever calling these functions at
fixed points in the event loop, they are called directly whenever they are invoked.
This partly reverts:
commit 6d1b1e0bc6
Author: Tom Gundersen <teg@jklm.no>
Date: Sun May 24 15:10:04 2015 +0200
udevd: worker - fully clean up unnecessary fds
The inotify-fd _is_ used in the workers, so don't close it! Have a look at
udev-watch.c, which keeps track of the inotify-fd as a global variable
(ugh!).
We would enforce that events could only be added to the queue from the
main process, but that brake in daemonized mode. Relax the restriction
to only allow one process to add events to the queue.
Reported by Mantas Mikulėnas.
We were returning rather than continuing in some cases. The intention
was always to fully process all pending events before returning
from the SIGCHLD handler. Restore this behaviour.
This way it is more obvious that the queue flag file is always
up-to-date. Moreover, we only have to touch/unlink it when the
first/last event is allocated/freed.
This avoids updating the flag files twice for every loop, and also removes another dependency
in the main-loop, so we are freer to reshufle it as we want.
Rather than skippling ctrl handling whenever we have handlede inotify events
(and hence may have synthesized a 'change' event), just call the uevent
handling explicitly from on_inotify() so that the event queue is up-to-date.
Make the worker context have the same life-span as the worker process. It is created on fork()
and free'd on SIGCHLD.
The change means that we can get worker_returned() for a worker context that is no longer around,
this is not a problem and we can just drop the message. The only use for worker_returned() is to
know to reschedule events to workers that are still around, so if the worker has already exited
it is not important to keep track of. We still print a debug statement in this case to be on the
safe side.
If the main daemon is not notified about a worker finishing an event
the refcounting of the worker struct will be wrong, and we will lose
track of the number of children we have to wait for.
This should not happen, but if it does we better complain loudly about
it. Worst case udev will wait for 30 seconsd at shutdown waiting for
nonexistent workers.
Remove some redundant logging, and reduce the log-level in most cases. The only
case that is really critical is if a worker failed while hanlding an event, so
keep that at error level.
udev uses inotify to implement a scheme where when the user closes
a writable device node, a change uevent is forcefully generated.
In the case of block devices, it actually requests a partition rescan.
This currently can't be synchronized with "udevadm settle", i.e. this
is not reliable in a script:
sfdisk --change-id /dev/sda 1 81
udevadm settle
mount /dev/sda1 /foo
The settle call doesn't synchronize there, so at the same time we try
to mount the device, udevd is busy removing the partition device nodes and
readding them again. The mount call often happens in that moment where the
partition node has been removed but not readded yet.
This exact issue was fixed long ago:
http://git.kernel.org/cgit/linux/hotplug/udev.git/commit/?id=bb38678e3ccc02bcd970ccde3d8166a40edf92d3
but that fix is no longer valid now that sequence numbers are no longer
used.
Fix this by forcing another mainloop iteration after handling inotify events
before unblocking settle. If the inotify event caused us to generate a
"change" event, we'll pick that up in the following loop iteration, before
we reach the end of the loop where we respond to settle's control message,
unblocking it.
The information in the db is stale, so it does not make sense to
expose it any longer. Also, don't drop the kernel event, but simply
pass it on to userspace without ammending it.
This essentially replaces
open("/run/udev/queue", O_WRONLY|O_CREAT|O_CLOEXEC|O_TRUNC|O_NOFOLLOW, 0444)
with
open("/run/udev/queue", O_WRONLY|O_CREAT|O_CLOEXEC|O_NOCTTY, 0644),
which is ok for our purposes.
The udev-settle guarantees that udevd is no longer processing any of the
events casued by udev-trigger. The way this works is that it sends a
synchronous PING to udevd after udev-trigger has ran, and when that returns
it knows that udevd has started processing the events from udev-trigger.
udev-settle will then wait for the event queue to empty before returning.
However, there was a race here, as we would only update the /run state at
the beginning of the event loop, before reading out new events and before
processing the ping.
That means that if the first uevent arrived in the same event-loop iteration
as the PING, we would return the ping before updating the queue state in /run
(which would happen on the next iteration).
The race window here is tiny (as the /run state would probably get updated
before udev-settle got a chance to read /run), but still a possibility.
Fix the problem by updating the /run state as the last step before returning
the PING.
We must still update it at the beginning of the loop as well, otherwise we
risk being stuck in poll() with a stale state in /run.
Reported-by: Daniel Drake <drake@endlessm.com>
This patch removes includes that are not used. The removals were found with
include-what-you-use which checks if any of the symbols from a header is
in use.
include-what-you-use automatically does this and it makes finding
unnecessary harder to spot. The only content of poll.h is a include
of sys/poll.h so should be harmless.
If the format string contains %m, clearly errno must have a meaningful
value, so we might as well use log_*_errno to have ERRNO= logged.
Using:
find . -name '*.[ch]' | xargs sed -r -i -e \
's/log_(debug|info|notice|warning|error|emergency)\((".*%m.*")/log_\1_errno(errno, \2/'
Plus some whitespace, linewrap, and indent adjustments.