Let's allow more memory to be locked on beefy machines than on small
ones. The previous limit of 64M is the lower bound still. This
effectively means on a 4GB machine we can lock 512M, which should be
more than enough, but still not lock up the machine entirely under
pressure.
Fixes: #15053
Let's be extra careful whenever we return from recvmsg() and see
MSG_CTRUNC set. This generally means we ran into a programming error, as
we didn't size the control buffer large enough. It's an error condition
we should at least log about, or propagate up. Hence do that.
This is particularly important when receiving fds, since for those the
control data can be of any size. In particular on stream sockets that's
nasty, because if we miss an fd because of control data truncation we
cannot recover, we might not even realize that we are one off.
(Also, when failing early, if there's any chance the socket might be
AF_UNIX let's close all received fds, all the time. We got this right
most of the time, but there were a few cases missing. God, UNIX is hard
to use)
ubsan complains that we add an offset to a NULL ptr here in some cases.
Which isn't really a bug though, since we only use it as the end
condition for a for loop, but we can still fix it...
Fixes: #15522
When a job is skipped due its dependencies not being ready, log
a debug message saying what is holding it back.
This was very useful with transient units timing out to figure
out where the problem was.
Just as log_full already does, check if the log level would result in
logging immediately in the macro in order to avoid doing
unnecessary work that adds up in hot spots.
It's not that I think that "hostname" is vastly superior to "host name". Quite
the opposite — the difference is small, and in some context the two-word version
does fit better. But in the tree, there are ~200 occurrences of the first, and
>1600 of the other, and consistent spelling is more important than any particular
spelling choice.
We use udev to wait for /dev/loopX devices to be fully proped hence we
need an implicit ordering dependency on it, for RootImage= to work
reliably in early boot, too.
Fixes: #14972
The watchdog ping is performed for every iteration of manager event
loop. This results in a lot of ioctls on watchdog device driver
especially during boot or if services are aggressively using sd_notify.
Depending on the watchdog device driver this may have performance
impact on embedded systems.
The patch skips sending the watchdog to device driver if the ping is
requested before half of the watchdog timeout.
Regardless of whether a mount is setup in initrd or int the main system,
the network default dependencies _netdev should still be honored.
IOW if a mount unit use the following options "x-initrd.mount,_netdev", it
should be ordered against initrd-fs.target, network.target,
network-online.target.
/dev/vdb1 /mnt ext4 x-initrd.mount,_netdev defaults 0 0
Before this patch:
Before=umount.target initrd-fs.target
After=system.slice sysroot.mount dev-vdb1.device -.mount systemd-journald.socket blockdev@dev-vdb1.target
After this patch:
Before=initrd-fs.target umount.target
After=network-online.target -.mount blockdev@dev-vdb1.target dev-vdb1.device sysroot.mount system.slice network.target systemd-journald.socket
First After=local-fs-pre.target wasn't described in the man page although it's
part of the default dependencies automatically set by pid1.
Secondly, Before=local-fs.target was only set if the automount unit was
generated from the fstab-generator because the dep was explicitly
generated. It was also not documented as a default dependency.
Fix it by managing the dep from pid1 instead.
fstab-generator was also handling the default ordering dependencies for mount
units setup in initrd. To do that it was turning the defaults dependencies off
completely and ordered the mount unit against either local-fs.target or
initrd-fs.target or initrd-root-fs.target itself.
But it had the bad side effect to also remove all other default dependencies as
well. Thus if an initrd mount was using _netdev, the network dependencies were
missing.
In general fstab-generator shouldn't use DefaultDependecies=no because it can
handle only a small set of the default dependencies the rest are dealt by pid1.
So this patch makes pid1 handle all default dependencies.
This patch changes the way user managers set the default umask for the units it
manages.
Indeed one can expect that if user manager's umask is redefined through PAM
(via /etc/login.defs or pam_umask), all its children including the units it
spawns have their umask set to the new value.
Hence make user units inherit their umask value from their parent instead of
the hard coded value 0022 but allow them to override this value via their unit
file.
Note that reexecuting managers with 'systemctl daemon-reexec' after changing
UMask= has no effect. To take effect managers need to be restarted with
'systemct restart' instead. This behavior was already present before this
patch.
Fixes#6077.
The commit b3ac5f8cb9 has changed the system mount propagation to
shared by default, and according to the following patch:
https://github.com/opencontainers/runc/pull/208
When starting the container, the pouch daemon will call runc to execute
make-private.
However, if the systemctl daemon-reexec is executed after the container
has been started, the system mount propagation will be changed to share
again by default, and the make-private operation above will have no chance
to execute.
This reworks the user validation infrastructure. There are now two
modes. In regular mode we are strict and test against a strict set of
valid chars. And in "relaxed" mode we just filter out some really
obvious, dangerous stuff. i.e. strict is whitelisting what is OK, but
"relaxed" is blacklisting what is really not OK.
The idea is that we use strict mode whenver we allocate a new user
(i.e. in sysusers.d or homed), while "relaxed" mode is when we process
users registered elsewhere, (i.e. userdb, logind, …)
The requirements on user name validity vary wildly. SSSD thinks its fine
to embedd "@" for example, while the suggested NAME_REGEX field on
Debian does not even allow uppercase chars…
This effectively liberaralizes a lot what we expect from usernames.
The code that warns about questionnable user names is now optional and
only used at places such as unit file parsing, so that it doesn't show
up on every userdb query, but only when processing configuration files
that know better.
Fixes: #15149#15090
And similar for other settings that require a writable /var/.
Rationale: if these options are used for early-boot services (such as
systemd-pstore.service) we need /var/ writable. And if /var/ is on the
root fs, then systemd-remount-fs.service is the service that ensures
that /var/ is writable.
This allows us to remove explicit deps in services such as
systemd-pstore.service.
Both fstab-generator and pid1 are duplicating the handling of
"Before=local-fs.target" dependency for mount units.
fstab-generator is correctly skipping this dep if the mount unit is activated
by an automount unit.
However the condition used by pid1 was incorrect and missed the case when a mount
unit uses "x-systemd.automount" since in this case the mount unit should be
(only) ordered against its automount unit counterpart instead.
Scheduling devices after a given unit can be useful to start device *jobs* at a
specific time in the transaction, see commit 4195077ab4.
This (hidden) change was introduced by commit eef85c4a3f.
Functions called from device_setup_unit() already make sure that unit is
enqueued in case it is a new unit or properties exported on the bus have
changed.
This should prevent unnecessary DBus wakeups and associated DBus traffic
when device_setup_unit() was called while reparsing /proc/self/mountinfo
due to the mountinfo notifications. Note that we parse
/proc/self/mountinfo quite often on the busy systems (e.g. k8s container
hosts) but majority of the time mounts didn't change, only some mount
got added. Thus we don't need to generate PropertiesChanged for devices
associated with the mounts that didn't change.
Thanks to Renaud Métrich <rmetrich@redhat.com> for debugging the
problem and providing draft version of the patch.
I think the two names were both pretty bad. They did not give a proper hint
what the difference between the two functions is, and sd_path_home sounds like
it is somehow related to /home or home directories or whatever, when in fact
both functions return the same set of paths as either a colon-delimited string
or a strv. "_strv" suffix is used by various functions in sd-bus, so let's
reuse that.
Those functions are not public yet, so let's rename.
The names with multiple lowercase words run together are hard to read. We
started that way with very short names like rootprefix, but then same pattern
was applied to longer and longer names. Looking at the body of .pc files
available on my machine, many packages use underscores; let's do the same. Old
names are kept for compatiblity, so this is backwards compatible.
When there are hundreds of mounts on the server, it will take a long
time to analyze the failure of a certain mount unit. So it is useful
to print the reason why unit_add_name() failed.
A common pattern in the codebase is reading a cgroup memory value
and converting it to a uint64_t. Let's make it a helper and refactor a
few places to use it so it's more concise.
Calling `mac_selinux_enforcing()`, which calls `security_getenforce()`, on a SELinux disabled system causes the following error message to be printed:
Failed to get SELinux enforced status: No such file or directory
Fixes: 257188f80c ("selinux: cache enforced status and treat retrieve failure as enforced mode")
Supersedes: #15145
In practice we are very unlikely to fail at this point, but for
consistency, we should always warn when allocation fails, and
we have free_and_strdup_warn() for this.
Without changing the SELinux label for private /dev of a service, it will take
a generic file system label:
system_u:object_r:tmpfs_t:s0
After this change it is the same as without `PrivateDevices=yes`:
system_u:object_r:device_t:s0
This helps writing SELinux policies, as the same rules for `/dev` will apply
despite any `PrivateDevices=yes` setting.
/dev used to be mounted with "exec" flag due to /dev/MAKEDEV script but that's
history and it's now located in /sbin. mmap() with file descriptor to
"/dev/zero" (instead of modern mmap(,,,MAP_ANON...)) will still work.
When using the cgroups IO controller, the device that is controlled
should always be the toplevel block device. This did not get resolved
correctly for an LVM volume inside a LUKS device, because the code would
only resolve one level of indirection.
Fix this by recursively looking up the originating block device for DM
devices.
Resolves: #15008
UnitStatusMessageFormats.finished_job, if present,
will be called with the same arguments as
job_get_done_status_message_format() to provide a format string
appropriate for the context
This commit replaces "Started" with "Finished" for started oneshot
units, as mentioned in the referenced issue
Closes#2458.
Reload the internal selabel cache automatically on SELinux policy reloads so non pid-1 daemons are participating.
Run the reload function `mac_selinux_reload()` not manually on daemon-reload, but rather pass it as callback to libselinux.
Trigger the callback prior usage of the systemd internal selabel cache by depleting the selinux netlink socket via `avc_netlink_check_nb()`.
Improves: a9dfac21ec ("core: reload SELinux label cache on daemon-reload")
Improves: #13363
It fully initializes the address structure, so no need for pre-initialization,
and also returns the length of the address, so no need to recalculate using
SOCKADDR_UN_LEN().
socklen_t is unsigned, so let's not use an int for it. (It doesn't matter, but
seems cleaner and more portable to not assume anything about the type.)
systemd.show-status=error is useful for the case where people care about errors
only.
If people want to have a quiet boot, they most likely don't want to see all
status output even if there is a delay in boot, so make "quiet" imply
systemd.show-status=error instead of systemd.show-status=auto.
Fixes#14976.
We would flip to status=temporary mode on the first error, and then switch back
to status=auto after the initial transaction was done. This isn't very useful,
because usually all the messages about successfully started units and not
related to the original failure. In fact, all those messages most likely cause
the information about the prime error to scroll off screen. And if the user
requested quiet boot, there's no reason to think that they care about those
success messages.
Also, when logging about dependency cycles, treat this similarly to a unit
error and show the message even if the status is "soft disabled" (before we
wouldn't show it in that case).
When we are booting with show-status=on, normally new status updates happen a
few times per second. Thus, it is reasonable to start showing the cylon eye
after 5 s, because that means a significant delay has happened. When we are
running with show-status=off or show-status=auto (and no error had occured),
the user is expecting maybe 15 to 90 seconds with no output (because that's
usually how long the whole boot takes). So we shouldn't bother the user with
information about a few seconds of delay. Let's make the timeout 25s if we are
not showing any messages.
Conversly, when we are outputting status messages, we can show the cylon eye
with a shorter delay, now that we removed the connection to enablement status.
Let's make this 2s, so users get feedback about delays more quickly.
We know if we created the file before, no need to repeat the operation. The
state in /run should always match our internal state. Since we call
manager_set_show_status() quite often internally, this saves quite a few
pointless syscalls.
We would say "Enabling" also for SHOW_STATUS_AUTO, which is actually
"soft off". So just print the exact state to make things easier to understand.
Also add a helper function to avoid repeating the enum value list.
For #14814.
In a user namespace container:
Feb 28 12:45:53 0b2420135953 systemd[1]: Starting Home Manager...
Feb 28 12:45:53 0b2420135953 systemd[21]: systemd-homed.service: Failed to set up network namespacing: Operation not permitted
Feb 28 12:45:53 0b2420135953 systemd[21]: systemd-homed.service: Failed at step NETWORK spawning /usr/lib/systemd/systemd-homed: Operation not permitted
Feb 28 12:45:53 0b2420135953 systemd[1]: systemd-homed.service: Main process exited, code=exited, status=225/NETWORK
Feb 28 12:45:53 0b2420135953 systemd[1]: systemd-homed.service: Failed with result 'exit-code'.
Feb 28 12:45:53 0b2420135953 systemd[1]: Failed to start Home Manager.
We should treat this similarly to the case where network namespace are not
supported at all.
https://bugzilla.redhat.com/show_bug.cgi?id=1807465
The man pages state that the '+' prefix in Exec* directives should
ignore filesystem namespacing options such as PrivateTmp. Now it does.
This is very similar to #8842, just with PrivateTmp instead of
PrivateDevices.
Without changing the SELinux label for private /dev of a service, it will take
a generic file system label:
system_u:object_r:tmpfs_t:s0
After this change it is the same as without `PrivateDevices=yes`:
system_u:object_r:device_t:s0
This helps writing SELinux policies, as the same rules for `/dev` will apply
despite any `PrivateDevices=yes` setting.
Currently, if deactivation of the primary swap unit fails:
# LANG=C systemctl --no-pager stop dev-mapper-fedora\\x2dswap.swap
Job for dev-mapper-fedora\x2dswap.swap failed.
See "systemctl status "dev-mapper-fedora\\x2dswap.swap"" and "journalctl -xe" for details.
then there are still the running stop jobs for all the secondary swap units
that follow the primary one:
# systemctl list-jobs
JOB UNIT TYPE STATE
3233 dev-disk-by\x2duuid-2dc8b9b1\x2da0a5\x2d44d8\x2d89c4\x2d6cdd26cd5ce0.swap stop running
3232 dev-dm\x2d1.swap stop running
3231 dev-disk-by\x2did-dm\x2duuid\x2dLVM\x2dyuXWpCCIurGzz2nkGCVnUFSi7GH6E3ZcQjkKLnF0Fil0RJmhoLN8fcOnDybWCMTj.swap stop running
3230 dev-disk-by\x2did-dm\x2dname\x2dfedora\x2dswap.swap stop running
3234 dev-fedora-swap.swap stop running
5 jobs listed.
This remains endlessly because their JobTimeoutUSec is infinity:
# LANG=C systemctl show -p JobTimeoutUSec dev-fedora-swap.swap
JobTimeoutUSec=infinity
If this issue happens during system shutdown, the system shutdown appears to
get hang and the system will be forcibly shutdown or rebooted 30 minutes later
by the following configuration:
# grep -E "^JobTimeout" /usr/lib/systemd/system/reboot.target
JobTimeoutSec=30min
JobTimeoutAction=reboot-force
The scenario in the real world seems that there is some service unit with
KillMode=none, processes whose memory is being swapped out are not killed
during stop operation in the service unit and then swapoff command fails.
On the other hand, it works well in successful case of swapoff command because
the secondary jobs monitor /proc/swaps file and can detect deletion of the
corresponding swap file.
This commit fixes the issue by finishing the secondary swap units' jobs if
deactivation of the primary swap unit fails.
Fixes: #11577
9e48626571 added some new syscalls to the
filter lists. However, on systems that do not yet support the new calls,
running systemd-run with the filter set results in error:
```
$ sudo systemd-run -t -r -p "SystemCallFilter=~@mount" /bin/true
Failed to start transient service unit: Invalid argument
```
Having the same properties in a unit file will start the service
without issue. This is because the load-fragment code will parse the
syscall filters in permissive mode:
https://github.com/systemd/systemd/blob/master/src/core/load-fragment.c#L2909
whereas the dbus-execute equivalent of the code does not.
Since the permissive mode appears to be the right setting to support
older kernels/libseccomp, this will update the dbus-execute parsing
to also be permissive.
Instead of setting the bus error structure and then freeing it, let's only set
it if used. If we will ignore the selinux denial, say ", ignore" to make this
clear. Also, use _cleanup_ to avoid gotos.
../src/core/selinux-access.c: In function ‘mac_selinux_generic_access_check’:
../src/basic/log.h:223:27: error: ‘%s’ directive argument is null [-Werror=format-overflow=]
../src/core/selinux-access.c:235:85: note: format string is defined here
235 | log_warning_errno(errno, "SELinux getcon_raw failed (tclass=%s perm=%s): %m", tclass, permission);
| ^~
I wonder why nobody ever noticed this.
Fixes#14691 (other issues listed in that ticket have already been fixed).
Let systemd create the dummy file where a device node will be mounted on with the default label for the parent directory (e.g. /tmp/namespace-dev-yTMwAe/dev/).
Fixes: #13762
Relevant when testing in permissive mode, where the function does not return a failure to the client.
This helps to configure a system in permissive mode, without getting surprising failures when switching to enforced mode.
When unit is reloaded, and the reloaded unit has bad-setting, then
unit_patch_contexts() is not called and exec_context::user and group
may not be configured.
A minimum reproducer for the case is:
- step 1.
$ sudo systemctl edit --full hoge.service
[Service]
oneshot
ExecStart=sleep 1h
- step 2.
$ sudo systemctl start hoge.service
- step 3.
$ sudo systemctl edit --full hoge.service
[Service]
Type=oneshot
ExecStart=@bindir@/sleep 1h
DynamicUser=yes
Then pid1 crashed.
Fixes#14733.
If we have exit on idle, then operations such as "journalctl
--namespace=foo --rotate" should work even if the journal daemon is
currently not running.
(Note that we don't do activation by varlink for the main instance of
journald, I am not sure the deadlocks it might introduce are worth it)
This way we shuld be able to order mounts properly against their backing
services in case complex storage is used (i.e. LUKS), even if the device
path used for mounting the devices is different from the expected device
node of the backing service.
Specifically, if we have a LUKS device /dev/mapper/foo that is mounted
by this name all is trivial as the relationship can be established a
priori easily. But if it is mounted via a /dev/disk/by-uuid/ symlink or
similar we only can relate the device node generated to the one mounted
at the moment the device is actually established. That's because the
UUID of the fs is stored inside the encrypted volume and thus not
knowable until the volume is set up. This patch tries to improve on this
situation: a implicit After=blockdev@.target dependency is generated for
all mounts, based on the data from /proc/self/mountinfo, which should be
the actual device node, with all symlinks resolved. This means that as
soon as the mount is established the ordering via blockdev@.target will
work, and that means during shutdown it is honoured, which is what we
are looking for.
Note that specifying /etc/fstab entries via UUID= for LUKS devices still
sucks and shouldn't be done, because it means we cannot know which LUKS
device to activate to make an fs appear, and that means unless the
volume is set up at boot anyway we can't really handle things
automatically when putting together transactions that need the mount.
This catches up with 9d06297e26 and adapts
the change made to swap units. We generally don't want to react
a-posteriori to swap devices disappearing, bad things will happen
anyway.
This extends on d253a45e1c, and instead of
merging just a single flag from previous mount entries of
/proc/self/mountinfo for the same path we merge all three.
This shouldn't change behaviour, but I think make things more readable.
Previously we'd set MOUNT_PROC_IS_MOUNTED unconditionally, we still do.
Previously we'd inherit MOUNT_PROC_JUST_MOUNTED from a previous entry on
the same line, we still do.
MOUNT_PROC_JUST_CHANGED should generally stay set too. Why that? If we
have two mount entries on the same mount point we'd first process one
and then the other, and the almost certainly different mount parameters
of the two would mean we'd set MOUNT_PROC_JUST_CHANGED for the second.
And with this we'll definitely do that still.
This also adds a comment explaining the situation a bit, and why we get
into this situation.
When starting a mount unit, systemd invokes mount command and moves the
unit's internal state to "mounting". Then it watches for updates of
/proc/self/mountinfo. When the expected mount entry newly appears in
mountinfo, the unit internal state is changed to "mounting-done".
Finally, when systemd finds the mount command has finished, it checks
whether the unit internal state is "mounting-done" and changes the state
to "mounted".
If the state was not "mounting-done" in the last step though mount command
was successfully finished, the unit is marked as "failed" with following
log messages:
Mount process finished, but there is no mount.
Failed with result 'protocol'.
If daemon-reload is done in parallel with starting mount unit, it is
possible that things happen in following order and result in above failure.
1. the mount unit state changes to "mounting"
2. daemon-reload saves the unit state
3. kernel completes the mount and /proc/self/mountinfo is updated
4. daemon-reload restores the saved unit state, that is "mounting"
5. systemd notices the mount command has finished but the unit state
is still "mounting" though it should be "mounting-done"
mount_setup_existing_unit() should take into account that MOUNT_MOUNTING
is transitional state and set MOUNT_PROC_JUST_MOUNTED flag if the unit
comes from /proc/self/mountinfo so that mount_process_proc_self_mountinfo()
later can make state transition from "mounting" to "mounting-done".
Fixes: #10872
In subsequent commits, calls to if_nametoindex() will be replaced by a wrapper
that falls back to alternative name resolution over netlink. netlink support
requires libsystemd (for sd-netlink), and we don't want to add any functions
that require netlink in basic/. So stuff that calls if_nametoindex() for user
supplied interface names, and everything that depends on that, needs to be
moved.
If we have a template unit template@.service, it should be allowed to specify a
dependency on a unit without an instance, bar@.service. When the unit is created,
the instance will be propagated into the target, so template@inst.service will
depend on bar@inst.service.
This commit changes unit_dependency_name_compatible(), which makes the manager
accept links like that, and unit_file_verify_alias(), so that the installation
function will agree to create a symlink like that, and finally the tests are
adjusted to pass.
We should allow the ones that the [Unit] section of regular unit files
may accet, but no other, in particular not the internal deps we
synthesize as reverse of explicitly configured ones, such was WantedBy=.
Fixes: #14251
When in a userns environment we cannot take away per-mount point flags
set on a mount point that was passed to us. Hence we need to be careful
to always check the actual mount flags in place and manipulate only
those flags of them that we actually want to change and not reset more
as side-effect.
We mostly got this right already in
bind_remount_recursive_with_mountinfo(), but didn't in the simpler
bind_remount_one_with_mountinfo(). Catch up.
(The old code assumed that the MountEntry.flags field contained the
right flag settings, but it actually doesn't for new mounts we just
established as for those mount() establishes the initial flags for us,
and we have to read them back to figure out which ones the kernel
picked.)
Fixes: #13622
It makes sense to filter state changes for some load states that
shouldn't happen, but the common cases should be accepted, because they
might happen during runtime when "systemctl daemon-reload" is issued and
unit files changed state in between. Otherwise we lose events.
Fixes: #4708
So far we set up a loopback file read-only iff ProtectSystem= and
ProtectHome= both where set to values that mark these dirs read-only.
Let's extend that and also be happy if /home and the root dir are marked
read-only by some other means.
Fixes: #14442
Similar, refuse triggering deps on units that cannot trigger.
And rework how we ignore After= dependencies on device units, to work
the same way.
See: #14142
Previously, when first connecting to the bus after connecting to it we'd
issue a ListNames() bus call to the driver to figure out which bus names
are currently active. This information was then used to initialize the
initial state for services that use BusName=.
This change removes the whole code for this and replaces it with
something vastly simpler.
First of all, the ListNames() call was issues synchronosuly, which meant
if dbus was for some reason synchronously calling into PID1 for some
reason we'd deadlock. As it turns out there's now a good chance it does:
the nss-systemd userdb hookup means that any user dbus-daemon resolves
might result in a varlink call into PID 1, and dbus resolves quite a lot
of users while parsing its policy. My original goal was to fix this
deadlock.
But as it turns out we don't need the ListNames() call at all anymore,
since #12957 has been merged. That PR was supposed to fix a race where
asynchronous installation of bus matches would cause us missing the
initial owner of a bus name when a service is first started. It fixed it
(correctly) by enquiring with GetOwnerName() who currently owns the
name, right after installing the match. But this means whenever we start watching a bus name we anyway
issue a GetOwnerName() for it, and that means also when first connecting
to the bus we don't need to issue ListNames() anymore since that just
tells us the same info: which names are currently owned.
hence, let's drop ListNames() and instead make better use of the
GetOwnerName() result: if it failed the name is not owned.
Also, while we are at it, let's simplify the unit's owner_name_changed()
callback(): let's drop the "old_owner" argument. We never used that
besides logging, and it's hard to synthesize from just the return of a
GetOwnerName(), hence don't bother.
When a service unit watches a bus name (i.e. because of BusName= being
set), then we do two things: we install a match slot to watch how its
ownership changes, and we inquire about the current owner. Make sure we
always do both together or neither.
This in particular fixes a corner-case memleak when destroying bus
connections, since we never freed the GetNameOwner() bus slots when
destroying a bus when they were still ongoing.
It's a *return* parameter, not an input parameter. Yes, this is a bit
confusing for method call replies, but we try to use the same message
handler for all incoming messages, hence the parameter. We are supposed
to write any error into it we encounter, if we want, and our caller will
log it, but that's it.
In the steps given in #13850, the resulting graph looks like:
C (Anchor) -> B -> A
Since B is inactive, it will be flagged as redundant and removed from
the transaction, causing A to get garbage collected. The proposed fix is
to not mark nodes as redundant if doing so causes a relevant node to be
garbage collected.
Fixes#13850
This has been requested many times before. Let's add it finally.
GPT auto-discovery for /var is a bit more complex than for other
partition types: the other partitions can to some degree be shared
between multiple OS installations on the same disk (think: swap, /home,
/srv). However, /var is inherently something bound to an installation,
i.e. specific to its identity, or actually *is* its identity, and hence
something that cannot be shared.
To deal with this this new code is particularly careful when it comes to
/var: it will not mount things blindly, but insist that the UUID of the
partition matches a hashed version of the machine-id of the
installation, so that each installation has a very specific /var
associated with it, and would never use any other. (We actually use
HMAC-SHA256 on the GPT partition type for /var, keyed by the machine-id,
since machine-id is something we want to keep somewhat private).
Setting the right UUID for installations takes extra care. To make
things a bit simpler to set up, we avoid this safety check for nspawn
and RootImage= in unit files, under the assumption that such container
and service images unlikely will have multiple installations on them.
The check is hence only required when booting full machines, i.e. in
in systemd-gpt-auto-generator.
To help with putting together images for full machines, PR #14368
introduces a repartition tool that can automatically fill in correctly
calculated UUIDs on first boot if images have the var partition UUID
initialized to all zeroes. With that in place systems can be put
together in a way that on first boot the machine ID is determined and
the partition table automatically adjusted to have the /var partition
with the right UUID.
This reverts commit 07125d24ee.
In contrast to what is claimed in #13396 dbus-broker apparently does
care for the service file to be around, and otherwise will claim
"Service Not Activatable" in the time between systemd starting up the
broker and connecting to it, which the stub service file is supposed to
make go away.
Reverting this makes the integration test suite pass again on host with
dbus-broker (i.e. current Fedora desktop).
Tested with dbus-broker-21-6.fc31.x86_64.
Write a user unit's invocation ID to /run/user/<uid>/systemd/units/ similar
to how a system unit's invocation ID is written to /run/systemd/units/.
This lets the journal read and add a user unit's invocation ID to the
_SYSTEMD_INVOCATION_ID field of logs instead of the user manager's
invocation ID.
Fixes#12474
To support ProtectHome=y in a user namespace (which mounts the inaccessible
nodes), the nodes need to be accessible by the user. Create these paths and
devices in the user runtime directory so they can be used later if needed.
Let per-user service managers have user namespaces too.
For unprivileged users, user namespaces are set up much earlier
(before the mount, network, and UTS namespaces vs after) in
order to obtain capbilities in the new user namespace and enable use of
the other listed namespaces. However for privileged users (root), the
set up for the user namspace is still done at the end to avoid any
restrictions with combining namespaces inside a user namespace (see
inline comments).
Closes#10576