udev uses inotify to implement a scheme where when the user closes
a writable device node, a change uevent is forcefully generated.
In the case of block devices, it actually requests a partition rescan.
This currently can't be synchronized with "udevadm settle", i.e. this
is not reliable in a script:
sfdisk --change-id /dev/sda 1 81
udevadm settle
mount /dev/sda1 /foo
The settle call doesn't synchronize there, so at the same time we try
to mount the device, udevd is busy removing the partition device nodes and
readding them again. The mount call often happens in that moment where the
partition node has been removed but not readded yet.
This exact issue was fixed long ago:
http://git.kernel.org/cgit/linux/hotplug/udev.git/commit/?id=bb38678e3ccc02bcd970ccde3d8166a40edf92d3
but that fix is no longer valid now that sequence numbers are no longer
used.
Fix this by forcing another mainloop iteration after handling inotify events
before unblocking settle. If the inotify event caused us to generate a
"change" event, we'll pick that up in the following loop iteration, before
we reach the end of the loop where we respond to settle's control message,
unblocking it.
The information in the db is stale, so it does not make sense to
expose it any longer. Also, don't drop the kernel event, but simply
pass it on to userspace without ammending it.
This essentially replaces
open("/run/udev/queue", O_WRONLY|O_CREAT|O_CLOEXEC|O_TRUNC|O_NOFOLLOW, 0444)
with
open("/run/udev/queue", O_WRONLY|O_CREAT|O_CLOEXEC|O_NOCTTY, 0644),
which is ok for our purposes.
The udev-settle guarantees that udevd is no longer processing any of the
events casued by udev-trigger. The way this works is that it sends a
synchronous PING to udevd after udev-trigger has ran, and when that returns
it knows that udevd has started processing the events from udev-trigger.
udev-settle will then wait for the event queue to empty before returning.
However, there was a race here, as we would only update the /run state at
the beginning of the event loop, before reading out new events and before
processing the ping.
That means that if the first uevent arrived in the same event-loop iteration
as the PING, we would return the ping before updating the queue state in /run
(which would happen on the next iteration).
The race window here is tiny (as the /run state would probably get updated
before udev-settle got a chance to read /run), but still a possibility.
Fix the problem by updating the /run state as the last step before returning
the PING.
We must still update it at the beginning of the loop as well, otherwise we
risk being stuck in poll() with a stale state in /run.
Reported-by: Daniel Drake <drake@endlessm.com>
This patch removes includes that are not used. The removals were found with
include-what-you-use which checks if any of the symbols from a header is
in use.
include-what-you-use automatically does this and it makes finding
unnecessary harder to spot. The only content of poll.h is a include
of sys/poll.h so should be harmless.
If the format string contains %m, clearly errno must have a meaningful
value, so we might as well use log_*_errno to have ERRNO= logged.
Using:
find . -name '*.[ch]' | xargs sed -r -i -e \
's/log_(debug|info|notice|warning|error|emergency)\((".*%m.*")/log_\1_errno(errno, \2/'
Plus some whitespace, linewrap, and indent adjustments.
As a followup to 086891e5c1 "log: add an "error" parameter to all
low-level logging calls and intrdouce log_error_errno() as log calls
that take error numbers", use sed to convert the simple cases to use
the new macros:
find . -name '*.[ch]' | xargs sed -r -i -e \
's/log_(debug|info|notice|warning|error|emergency)\("(.*)%s"(.*), strerror\(-([a-zA-Z_]+)\)\);/log_\1_errno(-\4, "\2%m"\3);/'
Multi-line log_*() invocations are not covered.
And we also should add log_unit_*_errno().
Processes expecting static nodes to have the right permissions may order themselves after systemd-udevd.service,
make sure that actually guarantees what is expected.
Once upon a time logging during early boot was unreliable, so extra logging messages were
sent by udev to stderr. That is no longer a concern, so drop all fprintf() calls from udved.
Some kernel modules still take more than one minute to insmod, we no longer rely on the timeout
killing insmod within a given period of time, so just bump this to a much higher value. Its only
purpose is to make sure that nothing stays aronud forever.
Creating the rtnl context is cheap, but freeing it may not be, due to
synchronous close().
Also drop some excessive logging. We now log about the changing ifname
exactly once.
String which ended in an unfinished quote were accepted, potentially
with bad memory accesses.
Reject anything which ends in a unfished quote, or contains
non-whitespace characters right after the closing quote.
_FOREACH_WORD now returns the invalid character in *state. But this return
value is not checked anywhere yet.
Also, make 'word' and 'state' variables const pointers, and rename 'w'
to 'word' in various places. Things are easier to read if the same name
is used consistently.
mbiebl_> am I correct that something like this doesn't work
mbiebl_> ExecStart=/usr/bin/encfs --extpass='/bin/systemd-ask-passwd "Unlock EncFS"'
mbiebl_> systemd seems to strip of the quotes
mbiebl_> systemctl status shows
mbiebl_> ExecStart=/usr/bin/encfs --extpass='/bin/systemd-ask-password Unlock EncFS $RootDir $MountPoint
mbiebl_> which is pretty weird
Some events take longer than the default 30 seconds. Killing those
events will leave the machine halfway configured.
Add a commandline option '--event-timeout' to handle these cases.
MD instantiates devices at open(). This is incomptible with the
locking logic, as the "change" event emitted when stopping a
device will bring it back.
This should make sure that fdisk-like programs will automatically
cause an update of all partitions, just like mkfs-like programs cause
an update of the partition.
The way the kernel namespaces have been implemented breaks assumptions
udev made regarding uevent sequence numbers. Creating devices in a
namespace "steals" uevents and its sequence numbers from the host. It
confuses the "udevadmin settle" logic, which might block until util a
timeout is reached, even when no uevent is pending.
Remove any assumptions about sequence numbers and deprecate libudev's
API exposing these numbers; none of that can reliably be used anymore
when namespaces are involved.
In trying to track down a stupid linker bug, I noticed a bunch of
memset() calls that should be using memzero() to make it more "obvious"
that the options are correct (i.e. 0 is not the length, but the data to
set). So fix up all current calls to memset(foo, 0, length) to
memzero(foo, length).
Instead of individually checking for containers in each user do this
once in a new call proc_cmdline() that read the file only if we are not
in a container.
This reverts commit 47e737dc13 - it
introduced a use-after-free. The only way the code would get simpler
is with a cleanup function, but eh, not worth it for just this one
bit.
Reviewed by kay on IRC.
A regression introduced when we moved to systemd's logging is that the only
way to adjust the log-level of the udev daemon is via the env var, kernel
commandline or the commandline.
This reintroduces support for specifying this in the configuration file.
Based on a patch by Kay Sievers.
A tag is exported at boot as a symlinks to the device node in the folder
/run/udev/static_node-tags/<tagname>/, if the device node exists.
These tags are cleaned up by udevadm info --cleanup-db, but are otherwise
never removed.
As of kmod v14, it is possible to export the static node information from
/lib/modules/`uname -r`/modules.devname in tmpfiles.d(5) format.
Use this functionality to let systemd-tmpfilesd create the static device nodes
at boot, and drop the functionality from systemd-udevd.
As an effect of this we can move from systemd-udevd to systemd-tmpfiles-setup-dev:
* the conditional CAP_MKNOD (replaced by checking if /sys is mounted rw)
* ordering before local-fs-pre.target (see 89d09e1b5c)
Containers will now carry a label (normally derived from the root
directory name, but configurable by the user), and the container's root
cgroup is /machine/<label>. This label is called "machine name", and can
cover both containers and VMs (as soon as libvirt also makes use of
/machine/).
libsystemd-login can be used to query the machine name from a process.
This patch also includes numerous clean-ups for the cgroup code.
The removal of the TIMEOUT= handling in udevd put firmware requests into the
devpath parent/child dependency tracking. Drivers which block in module_init()
asking userspace for firmware ran into a 30 sec device timeout.
The whole firmware loading willl hopefully move into the kernel and
the fragile-since-day-one fake async driver-core device dance involving
udev can be retired:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=abb139e75c2cdbb955e840d6331cb5863e409d0e
<falconindy> kay: just curious -- it looks like nodes created by udev from
modules.devname all have 000 perms, and there's nothing in udev that attempts
to change this. is it intended?
<falconindy> c--------- 1 root root 10, 223 Jul 1 23:10 uinput
<kay> falconindy: we might miss the default of 0600
<falconindy> seems like it
<kay> falconindy: stuff that has a rule works i guess
<kay> falconindy: i'll add the 0600 now
The filename parameter passed to mkdir can't contain anything but a
garbage value at this point. This was meant to be the full pathname to
the new udev DB, as the mkdir_parents() call before it won't create the
trailing child directory.
[replace mkdir_parents() + mkdir() with mkdir_p() -- kay]
Udev was the limiting factor for us on low-RAM systems.
Given an average RSS of 180kb, 128 workers would require ~23mb of RAM.
Now, please consider what happens when there is only, say, 15mb free.
Udev protects itself from OOM, and the kernel can do nothing but panic.
28 workers * 0.18mb = ~5mb. This change should not affect more powerful
systems much, given that they still get the addition from the amount of RAM.
This reverts commit 9b5af248f0.
Udev now explicitely labels only files/directories in /dev. The selinux
array API is not released and will not work on other distros at this moment.
systemd-udev is currently incorrectly labeling /run/udev/* content because it is
using selinux prefix labeling of /dev. This patch will allow systemd-udev to
use prefix labeling of /dev and /run.