This ports a lot of manual code over to sigprocmask_many() and friends.
Also, we now consistly check for sigprocmask() failures with
assert_se(), since the call cannot realistically fail unless there's a
programming error.
Also encloses a few sd_event_add_signal() calls with (void) when we
ignore the return values for it knowingly.
Now that listen_fds() have been split out, we can safely move the allocation
of the manager object after doing the forking (the fork is done to notify legcay
init-systems that the fds are ready).
Subsequently, we can merge manager_listen() back into managre_new().
This entails a minor behaviour change: the application of permissions to
static device nodes now happens after the fork (but still before notifying
systemd about being ready).
This will simply silently fail on non-systemd systems, so there is no reason
to make it conditional.
Also make it clear that we notify systemd about being ready as the last step
before starting the event loop, whereas the forking might need to happen
earlier.
This should have no behavioural change, but it is odd to tie the cgroup cleaning to
whether or not we are passed sockets.
The point really is if we are guaranteed to be in a dedicated cgroup, so instead
check for our parent being PID1 (we already implicitly only do this on systemd
systems).
We used to block all signals, and restore the original signal mask before exec'ing
external processes.
Now we just block the signals we care about and unconditionally unblock all signals
before exec'ing.
The communication channels must all be opened before forknig in daemon mode,
or we cannot guarantee that udevadm will work correctly as soon as udevd is
started.
Only log about starting in daemon mode, rely on PID1 to log this in notify mode. Also
explicitly set the STATUS variable, as is done in notify mode as is done for other
serivecs.
This allows us to drop the special sigterm handling in spawn_wait()
as this will now be passed directly to the worker event loop.
We now log failing spawend processes at 'warning' level, and timeouts
are in terms of CLOCK_BOOTTIME when available, otherwise the behavior
is unchanged.
Rather than trying to schedule new events on every main-loop iteration, do it explicitly when
processing an event finishes, a worker is killed, a new uevent is received, or the event queue
is explicitly restarted.
The behavior is mostly unchanged, but rather than only ever calling these functions at
fixed points in the event loop, they are called directly whenever they are invoked.
This partly reverts:
commit 6d1b1e0bc6
Author: Tom Gundersen <teg@jklm.no>
Date: Sun May 24 15:10:04 2015 +0200
udevd: worker - fully clean up unnecessary fds
The inotify-fd _is_ used in the workers, so don't close it! Have a look at
udev-watch.c, which keeps track of the inotify-fd as a global variable
(ugh!).
We would enforce that events could only be added to the queue from the
main process, but that brake in daemonized mode. Relax the restriction
to only allow one process to add events to the queue.
Reported by Mantas Mikulėnas.
We were returning rather than continuing in some cases. The intention
was always to fully process all pending events before returning
from the SIGCHLD handler. Restore this behaviour.
This way it is more obvious that the queue flag file is always
up-to-date. Moreover, we only have to touch/unlink it when the
first/last event is allocated/freed.
This avoids updating the flag files twice for every loop, and also removes another dependency
in the main-loop, so we are freer to reshufle it as we want.
Rather than skippling ctrl handling whenever we have handlede inotify events
(and hence may have synthesized a 'change' event), just call the uevent
handling explicitly from on_inotify() so that the event queue is up-to-date.
Make the worker context have the same life-span as the worker process. It is created on fork()
and free'd on SIGCHLD.
The change means that we can get worker_returned() for a worker context that is no longer around,
this is not a problem and we can just drop the message. The only use for worker_returned() is to
know to reschedule events to workers that are still around, so if the worker has already exited
it is not important to keep track of. We still print a debug statement in this case to be on the
safe side.
If the main daemon is not notified about a worker finishing an event
the refcounting of the worker struct will be wrong, and we will lose
track of the number of children we have to wait for.
This should not happen, but if it does we better complain loudly about
it. Worst case udev will wait for 30 seconsd at shutdown waiting for
nonexistent workers.
Remove some redundant logging, and reduce the log-level in most cases. The only
case that is really critical is if a worker failed while hanlding an event, so
keep that at error level.
udev uses inotify to implement a scheme where when the user closes
a writable device node, a change uevent is forcefully generated.
In the case of block devices, it actually requests a partition rescan.
This currently can't be synchronized with "udevadm settle", i.e. this
is not reliable in a script:
sfdisk --change-id /dev/sda 1 81
udevadm settle
mount /dev/sda1 /foo
The settle call doesn't synchronize there, so at the same time we try
to mount the device, udevd is busy removing the partition device nodes and
readding them again. The mount call often happens in that moment where the
partition node has been removed but not readded yet.
This exact issue was fixed long ago:
http://git.kernel.org/cgit/linux/hotplug/udev.git/commit/?id=bb38678e3ccc02bcd970ccde3d8166a40edf92d3
but that fix is no longer valid now that sequence numbers are no longer
used.
Fix this by forcing another mainloop iteration after handling inotify events
before unblocking settle. If the inotify event caused us to generate a
"change" event, we'll pick that up in the following loop iteration, before
we reach the end of the loop where we respond to settle's control message,
unblocking it.
The information in the db is stale, so it does not make sense to
expose it any longer. Also, don't drop the kernel event, but simply
pass it on to userspace without ammending it.
This essentially replaces
open("/run/udev/queue", O_WRONLY|O_CREAT|O_CLOEXEC|O_TRUNC|O_NOFOLLOW, 0444)
with
open("/run/udev/queue", O_WRONLY|O_CREAT|O_CLOEXEC|O_NOCTTY, 0644),
which is ok for our purposes.
The udev-settle guarantees that udevd is no longer processing any of the
events casued by udev-trigger. The way this works is that it sends a
synchronous PING to udevd after udev-trigger has ran, and when that returns
it knows that udevd has started processing the events from udev-trigger.
udev-settle will then wait for the event queue to empty before returning.
However, there was a race here, as we would only update the /run state at
the beginning of the event loop, before reading out new events and before
processing the ping.
That means that if the first uevent arrived in the same event-loop iteration
as the PING, we would return the ping before updating the queue state in /run
(which would happen on the next iteration).
The race window here is tiny (as the /run state would probably get updated
before udev-settle got a chance to read /run), but still a possibility.
Fix the problem by updating the /run state as the last step before returning
the PING.
We must still update it at the beginning of the loop as well, otherwise we
risk being stuck in poll() with a stale state in /run.
Reported-by: Daniel Drake <drake@endlessm.com>
This patch removes includes that are not used. The removals were found with
include-what-you-use which checks if any of the symbols from a header is
in use.
include-what-you-use automatically does this and it makes finding
unnecessary harder to spot. The only content of poll.h is a include
of sys/poll.h so should be harmless.
If the format string contains %m, clearly errno must have a meaningful
value, so we might as well use log_*_errno to have ERRNO= logged.
Using:
find . -name '*.[ch]' | xargs sed -r -i -e \
's/log_(debug|info|notice|warning|error|emergency)\((".*%m.*")/log_\1_errno(errno, \2/'
Plus some whitespace, linewrap, and indent adjustments.
As a followup to 086891e5c1 "log: add an "error" parameter to all
low-level logging calls and intrdouce log_error_errno() as log calls
that take error numbers", use sed to convert the simple cases to use
the new macros:
find . -name '*.[ch]' | xargs sed -r -i -e \
's/log_(debug|info|notice|warning|error|emergency)\("(.*)%s"(.*), strerror\(-([a-zA-Z_]+)\)\);/log_\1_errno(-\4, "\2%m"\3);/'
Multi-line log_*() invocations are not covered.
And we also should add log_unit_*_errno().
Processes expecting static nodes to have the right permissions may order themselves after systemd-udevd.service,
make sure that actually guarantees what is expected.
Once upon a time logging during early boot was unreliable, so extra logging messages were
sent by udev to stderr. That is no longer a concern, so drop all fprintf() calls from udved.
Some kernel modules still take more than one minute to insmod, we no longer rely on the timeout
killing insmod within a given period of time, so just bump this to a much higher value. Its only
purpose is to make sure that nothing stays aronud forever.
Creating the rtnl context is cheap, but freeing it may not be, due to
synchronous close().
Also drop some excessive logging. We now log about the changing ifname
exactly once.
String which ended in an unfinished quote were accepted, potentially
with bad memory accesses.
Reject anything which ends in a unfished quote, or contains
non-whitespace characters right after the closing quote.
_FOREACH_WORD now returns the invalid character in *state. But this return
value is not checked anywhere yet.
Also, make 'word' and 'state' variables const pointers, and rename 'w'
to 'word' in various places. Things are easier to read if the same name
is used consistently.
mbiebl_> am I correct that something like this doesn't work
mbiebl_> ExecStart=/usr/bin/encfs --extpass='/bin/systemd-ask-passwd "Unlock EncFS"'
mbiebl_> systemd seems to strip of the quotes
mbiebl_> systemctl status shows
mbiebl_> ExecStart=/usr/bin/encfs --extpass='/bin/systemd-ask-password Unlock EncFS $RootDir $MountPoint
mbiebl_> which is pretty weird
Some events take longer than the default 30 seconds. Killing those
events will leave the machine halfway configured.
Add a commandline option '--event-timeout' to handle these cases.
MD instantiates devices at open(). This is incomptible with the
locking logic, as the "change" event emitted when stopping a
device will bring it back.
This should make sure that fdisk-like programs will automatically
cause an update of all partitions, just like mkfs-like programs cause
an update of the partition.
The way the kernel namespaces have been implemented breaks assumptions
udev made regarding uevent sequence numbers. Creating devices in a
namespace "steals" uevents and its sequence numbers from the host. It
confuses the "udevadmin settle" logic, which might block until util a
timeout is reached, even when no uevent is pending.
Remove any assumptions about sequence numbers and deprecate libudev's
API exposing these numbers; none of that can reliably be used anymore
when namespaces are involved.
In trying to track down a stupid linker bug, I noticed a bunch of
memset() calls that should be using memzero() to make it more "obvious"
that the options are correct (i.e. 0 is not the length, but the data to
set). So fix up all current calls to memset(foo, 0, length) to
memzero(foo, length).
Instead of individually checking for containers in each user do this
once in a new call proc_cmdline() that read the file only if we are not
in a container.
This reverts commit 47e737dc13 - it
introduced a use-after-free. The only way the code would get simpler
is with a cleanup function, but eh, not worth it for just this one
bit.
Reviewed by kay on IRC.
A regression introduced when we moved to systemd's logging is that the only
way to adjust the log-level of the udev daemon is via the env var, kernel
commandline or the commandline.
This reintroduces support for specifying this in the configuration file.
Based on a patch by Kay Sievers.
A tag is exported at boot as a symlinks to the device node in the folder
/run/udev/static_node-tags/<tagname>/, if the device node exists.
These tags are cleaned up by udevadm info --cleanup-db, but are otherwise
never removed.
As of kmod v14, it is possible to export the static node information from
/lib/modules/`uname -r`/modules.devname in tmpfiles.d(5) format.
Use this functionality to let systemd-tmpfilesd create the static device nodes
at boot, and drop the functionality from systemd-udevd.
As an effect of this we can move from systemd-udevd to systemd-tmpfiles-setup-dev:
* the conditional CAP_MKNOD (replaced by checking if /sys is mounted rw)
* ordering before local-fs-pre.target (see 89d09e1b5c)
Containers will now carry a label (normally derived from the root
directory name, but configurable by the user), and the container's root
cgroup is /machine/<label>. This label is called "machine name", and can
cover both containers and VMs (as soon as libvirt also makes use of
/machine/).
libsystemd-login can be used to query the machine name from a process.
This patch also includes numerous clean-ups for the cgroup code.
The removal of the TIMEOUT= handling in udevd put firmware requests into the
devpath parent/child dependency tracking. Drivers which block in module_init()
asking userspace for firmware ran into a 30 sec device timeout.
The whole firmware loading willl hopefully move into the kernel and
the fragile-since-day-one fake async driver-core device dance involving
udev can be retired:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=abb139e75c2cdbb955e840d6331cb5863e409d0e
<falconindy> kay: just curious -- it looks like nodes created by udev from
modules.devname all have 000 perms, and there's nothing in udev that attempts
to change this. is it intended?
<falconindy> c--------- 1 root root 10, 223 Jul 1 23:10 uinput
<kay> falconindy: we might miss the default of 0600
<falconindy> seems like it
<kay> falconindy: stuff that has a rule works i guess
<kay> falconindy: i'll add the 0600 now
The filename parameter passed to mkdir can't contain anything but a
garbage value at this point. This was meant to be the full pathname to
the new udev DB, as the mkdir_parents() call before it won't create the
trailing child directory.
[replace mkdir_parents() + mkdir() with mkdir_p() -- kay]
Udev was the limiting factor for us on low-RAM systems.
Given an average RSS of 180kb, 128 workers would require ~23mb of RAM.
Now, please consider what happens when there is only, say, 15mb free.
Udev protects itself from OOM, and the kernel can do nothing but panic.
28 workers * 0.18mb = ~5mb. This change should not affect more powerful
systems much, given that they still get the addition from the amount of RAM.
This reverts commit 9b5af248f0.
Udev now explicitely labels only files/directories in /dev. The selinux
array API is not released and will not work on other distros at this moment.
systemd-udev is currently incorrectly labeling /run/udev/* content because it is
using selinux prefix labeling of /dev. This patch will allow systemd-udev to
use prefix labeling of /dev and /run.