Similar, refuse triggering deps on units that cannot trigger.
And rework how we ignore After= dependencies on device units, to work
the same way.
See: #14142
Build option "link-networkd-shared" to build a statically linked
systemd-networkd by using
-Dlink-udev-shared=false -Dlink-networkd-shared=false
on systems with full systemd stack except systemd-networkd, such
as RHEL/CentOS 8.
To make this easier to understand, let's always log (at debug level)
when we accept or reject each device:
/swapfile: detection of swap file offset on Btrfs is not supported
/swapfile: is a candidate device.
/dev/zram0: ignoring zram swap
/dev/vdb: ignoring device with lower priority
/dev/vdc: ignoring device with lower usable space
...
If we know that hibernation will fail, refuse. This includes cases where
/sys/power/resume is set and doesn't match any device, or
/sys/power/resume_offset is set and we're not on btrfs and it doesn't match.
If /sys/power/resume is not set at all, we still accept the device with the
highest priority (see 6d176522f5 and
88bc86fcf8)
Tested cases:
1. no swap active → refuse
2. just zram swap active → refuse
3. swapfile on btrfs with /sys/power/resume{,_offset} set → OK
4. swapfile on btrfs with /sys/power/resume set, offset not set → refuse
5. swapfile on btrfs with /sys/power/resume set to nonexistent device, offset set → refuse
6. /sys/power/resume not set, offset set, candidate exists → OK (*)
7. /sys/power/resume not set, offset not set, candidate exists → OK
(*) I think this should fail, but I'm leaving that for the next commit.
Change the behavior of string arrays in a bus property map. Previously,
passing the same strv pointer to more than one map entry would result in
the old strv being freed and overwritten. With this change, an existing
strv pointer is appended to.
This is important if we want to create one strv comprised of multiple
dependencies. This makes it so callers don't have to create one strv per
dependency and subsequently merge them into one strv.
Let's add the actual unicode names of the glyphs we use. Let's also add
in comments what the width expectations of these glyphs are on the
console.
Also, remove the "mdash" definition. First of all it wasn't used, but
what's worse the glyph encoded was actually an "ndash"...
Fixes: #14075
Previously, when first connecting to the bus after connecting to it we'd
issue a ListNames() bus call to the driver to figure out which bus names
are currently active. This information was then used to initialize the
initial state for services that use BusName=.
This change removes the whole code for this and replaces it with
something vastly simpler.
First of all, the ListNames() call was issues synchronosuly, which meant
if dbus was for some reason synchronously calling into PID1 for some
reason we'd deadlock. As it turns out there's now a good chance it does:
the nss-systemd userdb hookup means that any user dbus-daemon resolves
might result in a varlink call into PID 1, and dbus resolves quite a lot
of users while parsing its policy. My original goal was to fix this
deadlock.
But as it turns out we don't need the ListNames() call at all anymore,
since #12957 has been merged. That PR was supposed to fix a race where
asynchronous installation of bus matches would cause us missing the
initial owner of a bus name when a service is first started. It fixed it
(correctly) by enquiring with GetOwnerName() who currently owns the
name, right after installing the match. But this means whenever we start watching a bus name we anyway
issue a GetOwnerName() for it, and that means also when first connecting
to the bus we don't need to issue ListNames() anymore since that just
tells us the same info: which names are currently owned.
hence, let's drop ListNames() and instead make better use of the
GetOwnerName() result: if it failed the name is not owned.
Also, while we are at it, let's simplify the unit's owner_name_changed()
callback(): let's drop the "old_owner" argument. We never used that
besides logging, and it's hard to synthesize from just the return of a
GetOwnerName(), hence don't bother.
When a service unit watches a bus name (i.e. because of BusName= being
set), then we do two things: we install a match slot to watch how its
ownership changes, and we inquire about the current owner. Make sure we
always do both together or neither.
This in particular fixes a corner-case memleak when destroying bus
connections, since we never freed the GetNameOwner() bus slots when
destroying a bus when they were still ongoing.
It's a *return* parameter, not an input parameter. Yes, this is a bit
confusing for method call replies, but we try to use the same message
handler for all incoming messages, hence the parameter. We are supposed
to write any error into it we encounter, if we want, and our caller will
log it, but that's it.
When calculation of swap file offset is unsupported, rely on the
/sys/power/resume & /sys/power/resume_offset values if configured
rather than requiring a matching swap entry to be identified.
Refactor to use dev_t for comparison of resume= device instead of string.
In the steps given in #13850, the resulting graph looks like:
C (Anchor) -> B -> A
Since B is inactive, it will be flagged as redundant and removed from
the transaction, causing A to get garbage collected. The proposed fix is
to not mark nodes as redundant if doing so causes a relevant node to be
garbage collected.
Fixes#13850
When processing list of units (either provided manually or as a
wildcard), let's skip units for which we don't have an on-disk
counterpart, but note the -ENOENT error code and propagate it back to
the user.
Fixes: #14082
This has been requested many times before. Let's add it finally.
GPT auto-discovery for /var is a bit more complex than for other
partition types: the other partitions can to some degree be shared
between multiple OS installations on the same disk (think: swap, /home,
/srv). However, /var is inherently something bound to an installation,
i.e. specific to its identity, or actually *is* its identity, and hence
something that cannot be shared.
To deal with this this new code is particularly careful when it comes to
/var: it will not mount things blindly, but insist that the UUID of the
partition matches a hashed version of the machine-id of the
installation, so that each installation has a very specific /var
associated with it, and would never use any other. (We actually use
HMAC-SHA256 on the GPT partition type for /var, keyed by the machine-id,
since machine-id is something we want to keep somewhat private).
Setting the right UUID for installations takes extra care. To make
things a bit simpler to set up, we avoid this safety check for nspawn
and RootImage= in unit files, under the assumption that such container
and service images unlikely will have multiple installations on them.
The check is hence only required when booting full machines, i.e. in
in systemd-gpt-auto-generator.
To help with putting together images for full machines, PR #14368
introduces a repartition tool that can automatically fill in correctly
calculated UUIDs on first boot if images have the var partition UUID
initialized to all zeroes. With that in place systems can be put
together in a way that on first boot the machine ID is determined and
the partition table automatically adjusted to have the /var partition
with the right UUID.
It turns out that this is not necessary. When we try to resolve alias@inst, we
first check alias@inst, and if that is not found, fall back to alias@. Since we
already created a symlink for alias@, we will find that and the result will be
the same.
get_block_device() is just the nicer way to do it (since it also odes
btrfs). Also, let's already collect the dev_t of the loopback device
when we enumerate things, that allows us to do the checks simpler
without constantly stat()ing things over and over again.
We fucked up errno vs. r two times, let's correct that.
While we are at it, let's handle the error first, like we usually do,
and the clean case without indentation.
This reverts commit 07125d24ee.
In contrast to what is claimed in #13396 dbus-broker apparently does
care for the service file to be around, and otherwise will claim
"Service Not Activatable" in the time between systemd starting up the
broker and connecting to it, which the stub service file is supposed to
make go away.
Reverting this makes the integration test suite pass again on host with
dbus-broker (i.e. current Fedora desktop).
Tested with dbus-broker-21-6.fc31.x86_64.
Write a user unit's invocation ID to /run/user/<uid>/systemd/units/ similar
to how a system unit's invocation ID is written to /run/systemd/units/.
This lets the journal read and add a user unit's invocation ID to the
_SYSTEMD_INVOCATION_ID field of logs instead of the user manager's
invocation ID.
Fixes#12474
To support ProtectHome=y in a user namespace (which mounts the inaccessible
nodes), the nodes need to be accessible by the user. Create these paths and
devices in the user runtime directory so they can be used later if needed.
Let per-user service managers have user namespaces too.
For unprivileged users, user namespaces are set up much earlier
(before the mount, network, and UTS namespaces vs after) in
order to obtain capbilities in the new user namespace and enable use of
the other listed namespaces. However for privileged users (root), the
set up for the user namspace is still done at the end to avoid any
restrictions with combining namespaces inside a user namespace (see
inline comments).
Closes#10576
PrefixRoute= was added by e63be0847c,
but unfortunately, the meaning of PrefixRoute= is inverted; when true
IFA_F_NOPREFIXROUTE flag is added. This introduces AddPrefixRoute=
setting.
This only matters for the case where new networkctl is running against older
networkd. We should still handle the old error to avoid unnecessary warning
about speedmeeter being disabled.
This partially reverts commit e813de549b.
Don't try to show top level drop-in for non-existent units or when trying to
instantiate non-instantiated units:
$ systemctl cat nonexistent@.service
Assertion 'name' failed at src/shared/dropin.c:143, function unit_file_find_dirs(). Aborting.
$ systemctl cat systemd-journald@.service
Assertion 'name' failed at src/shared/dropin.c:143, function unit_file_find_dirs(). Aborting.
The code existed in machinectl to use stdin/stdout if the path for
import/export tar/raw was empty or dash (-) but a check to
`fd_verify_regular` in importd prevented it from working.
Update the check instead to explicitly check for regular file or
pipe/fifo.
Fixes#14346
The variable is always initialized, but the compiler might not notice
that. With gcc-9.2.1-1.fc31:
$ CFLAGS='-Werror=maybe-uninitialized -Og' meson build
$ ninja -C build
[...]
../src/basic/tmpfile-util.c: In function ‘mkostemp_safe’:
../src/basic/tmpfile-util.c:76:12: error: ‘fd’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
76 | if (fd < 0)
| ^
Ideally, we would want to report this over back over dbus. But that is pretty hard,
because the unitfile parsing logic doesn't provide any feedback.
systemd-analyze verify also doesn't notice the issue, because it doesn't look
at the [Install] section at all. Let's print a message in the logs at least.
The combination of sd_netlink_message_enter_container() and
sd_netlink_message_read_string() only reads the last element if the attribute is
duplicated, such a situation easily happens for IFLA_ALT_IFNAME.
The function introduced here reads all matched attributes.
On my machine stat returns size 22, but only 20 bytes are read:
openat(AT_FDCWD, "/sys/firmware/efi/efivars/LoaderTimeInitUSec-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=22, ...}) = 0
read(3, "\6\0\0\0", 4) = 4
read(3, "7\0001\0001\0003\0005\0002\0007\0\0\0", 18) = 16
Failed to read LoaderTimeInitUSec: Input/output error
Let's just accept that the kernel is returning inconsistent results.
It seems to happen two only two variables on my machine:
/sys/firmware/efi/efivars/LoaderTimeInitUSec-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f
/sys/firmware/efi/efivars/LoaderTimeMenuUSec-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f
so it might be related to the way we write them.
If we call LOOP_CLR_FD and LOOP_CTL_REMOVE too rapidly, the kernel cannot deal
with that (5.3.13-300.fc31.x86_64 running on dual core
Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz).
$ sudo strace -eioctl build/test-dissect-image /tmp/foobar3.img
ioctl(3, TCGETS, 0x7ffcee47de20) = -1 ENOTTY (Inappropriate ioctl for device)
ioctl(4, LOOP_CTL_GET_FREE) = 9
ioctl(5, LOOP_SET_FD, 3) = 0
ioctl(5, LOOP_SET_STATUS64, {lo_offset=0, lo_number=0, lo_flags=LO_FLAGS_READ_ONLY|LO_FLAGS_AUTOCLEAR|LO_FLAGS_PARTSCAN, lo_file_name="", ...}) = 0
ioctl(5, BLKGETSIZE64, [299999744]) = 0
ioctl(5, CDROM_GET_CAPABILITY, 0) = -1 EINVAL (Invalid argument)
ioctl(5, BLKSSZGET, [512]) = 0
Waiting for device (parent + 0 partitions) to appear...
Found root partition, writable of type btrfs at #-1 (/dev/block/7:9)
ioctl(5, LOOP_CLR_FD) = 0
ioctl(3, LOOP_CTL_REMOVE, 9) = -1 EBUSY (Device or resource busy)
Failed to remove loop device: Device or resource busy
This seems to be clear race condition, and attaching strace is generally enough
to "win" the race. But even with strace attached, we will fail occasionally.
Let's wait a bit and retry. With the wait, on my machine, the second attempt
always succeeds:
...
Found root partition, writable of type btrfs at #-1 (/dev/block/7:9)
ioctl(5, LOOP_CLR_FD) = 0
ioctl(3, LOOP_CTL_REMOVE, 9) = -1 EBUSY (Device or resource busy)
ioctl(3, LOOP_CTL_REMOVE, 9) = 9
+++ exited with 0 +++
Without the wait, all 64 attempts will occasionally fail.
The function no longer returns the fd. This complicated semantics, because it
wasn't clear what holds the ownership: the return value or the output
parameter. There were no users of the fd in the return value, so let's
simplify things conceptually and only return the fd once.
Reduce the scope of variables.
LOOP_CLR_FD was called on the wrong fd. Let's use a cleanup function to make
this automatic and reduce chances of a mixup in the future.
CID 1408498.
This makes --overlay=+/foobar::/foobar work again, i.e. where the middle
parameter is left out. According to the documentation this is supposed
to generate a temporary writable work place in the midle. But it
apparently never did. Weird.
Don't reset the conflict counter when trying a new pseudo random
address, so that after trying 10 addresses the londer timeout is used in
accordance with the RFC
Fixes#14299.
This adds a new crypttab option for volumes "pkcs11-uri=" which takes a
PKCS#11 URI. When used the key stored in the line's key file is
decrypted with the private key the PKCS#11 URI indiciates.
This means any smartcard that can store private RSA keys is usable for
unlocking LUKS devices.
Let's improve memory allocation for call such as strv_extend() that just
one item to an strv: these are often called in a loop, where they used
to be very ineffecient, since we'd allocate byte-exact space. With this
change let's improve on that, by allocating exponentially by rounding up
to the next exponent of 2. This way we get GREEDY_REALLOC()-like
behaviour without passing around state.
In fact this should be good enough so that we could replace existing
loops around GREEDY_REALLOC() for strv build-up with plain strv_extend()
and get similar behaviour.
If a daemon is not started as root, most likely it also can't create its
directory and let's not try to resolve the user in that case either.
Create /run/systemd/netif/lldp with tmpfiles.d like other netif directories.
This is also very helpful for preparing a RootImage for the daemons as NSS crud
is not needed.
Intermediate Functional Block
The Intermediate Functional Block (ifb) pseudo network interface acts as a QoS concentrator for multiple different sources of traffic. Packets from or to other interfaces have to be redirected to it using the mirred action in order to be handled, regularly routed traffic will be dropped. This way, a single stack of qdiscs, classes and filters can be shared between multiple interfaces.
Here's a simple example to feed incoming traffic from multiple interfaces through a Stochastic Fairness Queue (sfq):
(1) # modprobe ifb
(2) # ip link set ifb0 up
(3) # tc qdisc add dev ifb0 root sfq
CID 1409488.
This code was added in 903659e7b2. The change
that is done here is a simple fix to avoid use of a
unitialized/wrongly-initialized variable, but the bigger issue is that nothing
looks at the returned result to distinguish between 0 and a positive return
value.
If the interface is in initialized state, no network file is assigned to
the interface. If an interface is not managed by networkd, previously,
the foreign configs of the interface was dropped.
Fixes#14250.
The kernel resets the ipv6 mtu after NETDEV_UP or NETDEV_CHANGEMTU event,
so we must reset the ipv6 mtu to our configured value after we detect
IFF_UP flag set or after we set the device mtu.
Fixes: #13914.
This option is an indication for PID1 that the entry in crypttab is handled by
initrd only and therefore it shouldn't interfer during the usual start-up and
shutdown process.
It should be primarily used with the encrypted device containing the root FS as
we want to keep it (and thus its encrypted device) until the very end of the
shutdown process, i.e. when initrd takes over.
This option is the counterpart of "x-initrd.mount" used in fstab.
Note that the slice containing the cryptsetup services also needs to drop the
usual shutdown dependencies as it's required by the cryptsetup services.
Fixes: #14224
Like with shmat already the actual results of the test
test_memory_deny_write_execute_mmap depend on kernel/libseccomp/glibc
of the platform it is running on.
There are known-good platforms, but on the others do not assert success
(which implies test has actually failed as no seccomp blocking was achieved),
but instead make the check dependent to the success of the mmap call
on that platforms.
Finally the assert of the munmap on that valid pointer should return ==0,
so that is what the check should be for in case of p != MAP_FAILED.
Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com>
At the beginning of seccomp_memory_deny_write_execute architectures
can set individual filter_syscall, block_syscall, shmat_syscall values.
The former two are then used in the call to add_seccomp_syscall_filter
but shmat_syscall is not.
Right now all shmat_syscall values are the same, so the change is a
no-op, but if ever an architecture is added/modified this would be a
subtle source for a mistake so fix it by using shmat_syscall later.
Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com>
If seccomp_memory_deny_write_execute was fatally failing to load rules it
already returned a bad retval.
But if any adding filters failed it skipped the subsequent seccomp_load and
always returned an rc of 0 even if no rule was loaded at all.
Lets fix this requiring to (non fatally-failing) load at least one rule set.
Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com>
This switches detect_container() to path_is_read_only_rw("/sys"), as if
systemd-udevd.service is conditionalized with that way.
This also updates the log message.
If /sys is read only filesystem, e.g., nspawn is running in container,
then usually udev is not running. In such a case, let's assume that
the interface is already initialized. Also, this makes nspawn refuse
to use the network interface which is under renaming.
Fixes#14223.