When a service unit watches a bus name (i.e. because of BusName= being
set), then we do two things: we install a match slot to watch how its
ownership changes, and we inquire about the current owner. Make sure we
always do both together or neither.
This in particular fixes a corner-case memleak when destroying bus
connections, since we never freed the GetNameOwner() bus slots when
destroying a bus when they were still ongoing.
It's a *return* parameter, not an input parameter. Yes, this is a bit
confusing for method call replies, but we try to use the same message
handler for all incoming messages, hence the parameter. We are supposed
to write any error into it we encounter, if we want, and our caller will
log it, but that's it.
In the steps given in #13850, the resulting graph looks like:
C (Anchor) -> B -> A
Since B is inactive, it will be flagged as redundant and removed from
the transaction, causing A to get garbage collected. The proposed fix is
to not mark nodes as redundant if doing so causes a relevant node to be
garbage collected.
Fixes#13850
This has been requested many times before. Let's add it finally.
GPT auto-discovery for /var is a bit more complex than for other
partition types: the other partitions can to some degree be shared
between multiple OS installations on the same disk (think: swap, /home,
/srv). However, /var is inherently something bound to an installation,
i.e. specific to its identity, or actually *is* its identity, and hence
something that cannot be shared.
To deal with this this new code is particularly careful when it comes to
/var: it will not mount things blindly, but insist that the UUID of the
partition matches a hashed version of the machine-id of the
installation, so that each installation has a very specific /var
associated with it, and would never use any other. (We actually use
HMAC-SHA256 on the GPT partition type for /var, keyed by the machine-id,
since machine-id is something we want to keep somewhat private).
Setting the right UUID for installations takes extra care. To make
things a bit simpler to set up, we avoid this safety check for nspawn
and RootImage= in unit files, under the assumption that such container
and service images unlikely will have multiple installations on them.
The check is hence only required when booting full machines, i.e. in
in systemd-gpt-auto-generator.
To help with putting together images for full machines, PR #14368
introduces a repartition tool that can automatically fill in correctly
calculated UUIDs on first boot if images have the var partition UUID
initialized to all zeroes. With that in place systems can be put
together in a way that on first boot the machine ID is determined and
the partition table automatically adjusted to have the /var partition
with the right UUID.
This reverts commit 07125d24ee.
In contrast to what is claimed in #13396 dbus-broker apparently does
care for the service file to be around, and otherwise will claim
"Service Not Activatable" in the time between systemd starting up the
broker and connecting to it, which the stub service file is supposed to
make go away.
Reverting this makes the integration test suite pass again on host with
dbus-broker (i.e. current Fedora desktop).
Tested with dbus-broker-21-6.fc31.x86_64.
Write a user unit's invocation ID to /run/user/<uid>/systemd/units/ similar
to how a system unit's invocation ID is written to /run/systemd/units/.
This lets the journal read and add a user unit's invocation ID to the
_SYSTEMD_INVOCATION_ID field of logs instead of the user manager's
invocation ID.
Fixes#12474
To support ProtectHome=y in a user namespace (which mounts the inaccessible
nodes), the nodes need to be accessible by the user. Create these paths and
devices in the user runtime directory so they can be used later if needed.
Let per-user service managers have user namespaces too.
For unprivileged users, user namespaces are set up much earlier
(before the mount, network, and UTS namespaces vs after) in
order to obtain capbilities in the new user namespace and enable use of
the other listed namespaces. However for privileged users (root), the
set up for the user namspace is still done at the end to avoid any
restrictions with combining namespaces inside a user namespace (see
inline comments).
Closes#10576
Reloading the SELinux label cache here enables a light-wight follow-up of a SELinux policy change, e.g. adding a label for a RuntimeDirectory.
Closes: #13363
This makes the code do what the documentation says. The code had no inkling
about initrd.target, so I think this change is fairly risky. As a fallback,
default.target will be loaded, so initramfses which relied on current behaviour
will still work, as along as they don't have a different initrd.target.
In an initramfs created with recent dracut:
$ ls -l usr/lib/systemd/system/{default.target,initrd.target}
lrwxrwxrwx. usr/lib/systemd/system/default.target -> initrd.target
-rw-r--r--. usr/lib/systemd/system/initrd.target
So at least for dracut, there should be no difference.
Also avoid a pointless allocation.
The function capability_ambient_set_apply() now drops capabilities not
in the capability_ambient_set(), so it is necessary to call it when
the ambient set is empty.
Fixes#13163
This partially reverts a07a7324ad.
We have two pieces of information: the value and a boolean.
config_parse_timeout_abort() added in the reverted commit would write
the boolean to the usec_t value, making a mess.
The code is reworked to have just one implementation and two wrappers
which pass two pointers.
The conf-parser machinery already removed whitespace before and after "=", no
need to repeat this step.
The test is adjusted to pass. It was testing an code path that doesn't happen
normally, no point in doing that.
The original PR was submitted with CPUSetCpus and CPUSetMems, which was later
changed to AllowedCPUs and AllowedMemmoryNodes everywhere (including the parser
used by systemd-run), but not in the parser for unit files.
Since we already released -rc1, let's keep support for the old names. I think
we can remove it in a release or two if anyone remembers to do that.
Fixes#14126. Follow-up for 047f5d63d7.
I see we log this during every boot, even though it is a routine expected event:
Nov 12 14:50:01 krowka systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
(and for other services too). Let's downgrade this to debug level.
https://bugzilla.redhat.com/show_bug.cgi?id=1614871
Let's mark cgroups that are delegation boundaries to us. This can then
be used by tools such as "systemd-cgls" to show where the next manager
takes over.
Previously we'd only skip ProtectHostname= if kernel support for
namespaces was lacking. With this change we also accept if unshare()
fails because it is blocked.
In some containers unshare() is made unavailable entirely. Let's deal
with this that more gracefully and disable our sandboxing of services
then, so that we work in a container, under the assumption the container
manager is then responsible for sandboxing if we can't do it ourselves.
Previously, we'd insist on sandboxing as soon as any form of BindPath=
is used. With this change we only insist on it if we have a setting like
that where source and destination differ, i.e. there's a mapping
established that actually rearranges things, and thus would result in
systematically different behaviour if skipped (as opposed to mappings
that just make stuff read-only/writable that otherwise arent').
(Let's also update a test that intended to test for this behaviour with
a more specific configuration that still triggers the behaviour with
this change in place)
Fixes: #13955
(For testing purposes unshare() can easily be blocked with
systemd-nspawn --system-call-filter=~unshare.)
Our handling of the condition was inconsistent. Normally, we'd only fire when
the file was created (or removed and subsequently created again). But on restarts,
we'd do a "recheck" from path_coldplug(), and if the file existed, we'd
always trigger. Daemon restarts and reloads should not be observeable, in
the sense that they should not trigger units which were already triggered and
would not be started again under normal circumstances.
Note that the mechanism for checks is racy: we get a notification from inotify,
and by the time we check, the file could have been created and removed again,
or removed and created again. It would be better if we inotify would give as
an unambiguous signal that the file was created, but it doesn't: IN_DELETE_SELF
triggers on inode removal, not directory entry, so we need to include IN_ATTRIB,
which obviously triggers on other conditions.
Fixes#12801.
We allow expressing configuration as a fraction with granularity of 0.001, but
when writing out the unit file, we'd round that up to 0.01.
Longer term, I think it'd be nicer to simply use floats and do away with
arbitrary restrictions on precision.
TasksMax= and DefaultTasksMax= can be specified as percentages. We don't
actually document of what the percentage is relative to, but the implementation
uses the smallest of /proc/sys/kernel/pid_max, /proc/sys/kernel/threads-max,
and /sys/fs/cgroup/pids.max (when present). When the value is a percentage,
we immediately convert it to an absolute value. If the limit later changes
(which can happen e.g. when systemd-sysctl runs), the absolute value becomes
outdated.
So let's store either the percentage or absolute value, whatever was specified,
and only convert to an absolute value when the value is used. For example, when
starting a unit, the absolute value will be calculated when the cgroup for
the unit is created.
Fixes#13419.
It turns out that the kernel verifier would reject a program we would build
if there was a whitelist, but no entries in the whitelist matched.
The program would approximately like this:
0: (61) r2 = *(u32 *)(r1 +0)
1: (54) w2 &= 65535
2: (61) r3 = *(u32 *)(r1 +0)
3: (74) w3 >>= 16
4: (61) r4 = *(u32 *)(r1 +4)
5: (61) r5 = *(u32 *)(r1 +8)
48: (b7) r0 = 0
49: (05) goto pc+1
50: (b7) r0 = 1
51: (95) exit
and insn 50 is unreachable, which is illegal. We would then either keep a
previous version of the program or allow everything. Make sure we build a
valid program that simply rejects everything.
Most of the time, we specify the allowed access mode as "rwm", so the check
always trivially passes. In that case, skip the check.
The repeating part changes from:
5: (55) if r2 != 0x2 goto pc+6
6: (bc) w1 = w3
7: (54) w1 &= 7
8: (5d) if r1 != r3 goto pc+3
9: (55) if r4 != 0x1 goto pc+2
10: (55) if r5 != 0x3 goto pc+1
11: (05) goto pc+8
to
6: (55) if r2 != 0x2 goto pc+3
7: (55) if r4 != 0x1 goto pc+2
8: (55) if r5 != 0x3 goto pc+1
9: (05) goto pc+40
This makes the code a bit longer, but easier to read I think, because
the cgroup v1 and v2 code paths are more similar. And whent he type is
a char, any backtrace is easier to interpret.
The naming of the functions was a complete mess: the most specific functions
which don't know anything about cgroups had "cgroup_" prefix, while more
general functions which took a node path and a cgroup for reporting had no
prefix. Let's use "bpf_devices_" for the latter group, and "bpf_prog_*" for the
rest.
The main goal of this move is to split the implementation from the calling code
and add unit tests in a later patch.
Discussed in #13743, the -.service semantic conflicts with the
existing root mount and slice names, making this feature not
uniformly extensible to all types. Change the name to be
<type>.d instead.
Updating to this format also extends the top-level dropin to
unit types.
Currently, systemctl reload command breaks ordering dependencies if it's
executed when its target service unit is in activating state.
For example, prepare A.service, B.service and C.target as follows:
# systemctl cat A.service B.service C.target
# /etc/systemd/system/A.service
[Unit]
Description=A
[Service]
Type=oneshot
ExecStart=/usr/bin/echo A1
ExecStart=/usr/bin/sleep 60
ExecStart=/usr/bin/echo A2
ExecReload=/usr/bin/echo A reloaded
RemainAfterExit=yes
# /etc/systemd/system/B.service
[Unit]
Description=B
After=A.service
[Service]
Type=oneshot
ExecStart=/usr/bin/echo B
RemainAfterExit=yes
# /etc/systemd/system/C.target
[Unit]
Description=C
Wants=A.service B.service
Start them.
# systemctl daemon-reload
# systemctl start C.target
Then, we have:
# LANG=C journalctl --no-pager -u A.service -u B.service -u C.target -b
-- Logs begin at Mon 2019-09-09 00:25:06 EDT, end at Thu 2019-10-24 22:28:47 EDT. --
Oct 24 22:27:47 localhost.localdomain systemd[1]: Starting A...
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Child 967 belongs to A.service.
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Main process exited, code=exited, status=0/SUCCESS
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Running next main command for state start.
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Passing 0 fds to service
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: About to execute: /usr/bin/sleep 60
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Forked /usr/bin/sleep as 968
Oct 24 22:27:47 localhost.localdomain systemd[968]: A.service: Executing: /usr/bin/sleep 60
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Trying to enqueue job A.service/reload/replace
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Merged into running job, re-running: A.service/reload as 1288
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Enqueued job A.service/reload as 1288
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Unit cannot be reloaded because it is inactive.
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Job 1288 A.service/reload finished, result=invalid
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Passing 0 fds to service
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: About to execute: /usr/bin/echo B
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Forked /usr/bin/echo as 970
Oct 24 22:27:52 localhost.localdomain systemd[970]: B.service: Executing: /usr/bin/echo B
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Failed to send unit change signal for B.service: Connection reset by peer
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Changed dead -> start
Oct 24 22:27:52 localhost.localdomain systemd[1]: Starting B...
Oct 24 22:27:52 localhost.localdomain echo[970]: B
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Child 970 belongs to B.service.
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Main process exited, code=exited, status=0/SUCCESS
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Changed start -> exited
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Job 1371 B.service/start finished, result=done
Oct 24 22:27:52 localhost.localdomain systemd[1]: Started B.
Oct 24 22:27:52 localhost.localdomain systemd[1]: C.target: Job 1287 C.target/start finished, result=done
Oct 24 22:27:52 localhost.localdomain systemd[1]: Reached target C.
Oct 24 22:27:52 localhost.localdomain systemd[1]: C.target: Failed to send unit change signal for C.target: Connection reset by peer
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Child 968 belongs to A.service.
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Main process exited, code=exited, status=0/SUCCESS
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Running next main command for state start.
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Passing 0 fds to service
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: About to execute: /usr/bin/echo A2
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Forked /usr/bin/echo as 972
Oct 24 22:28:47 localhost.localdomain systemd[972]: A.service: Executing: /usr/bin/echo A2
Oct 24 22:28:47 localhost.localdomain echo[972]: A2
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Child 972 belongs to A.service.
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Main process exited, code=exited, status=0/SUCCESS
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Changed start -> exited
The issue occurs not only in reload command, i.e.:
- reload
- try-restart
- reload-or-restart
- reload-or-try-restart commands
The cause of this issue is that job_type_collapse() doesn't take care of the
activating state.
Fixes: #10464
This reworks the logic introduced in
a5cede8c24 (#13693).
First of all, let's move this out of util.c, since only PID 1 really
needs this, and there's no real need to have it in util.c.
Then, fix freeing of the variable. It previously relied on
STATIC_DESTRUCTOR_REGISTER() which however relies on static_destruct()
to be called explicitly. Currently only the main-func.h macros do that,
and PID 1 does not. (It might be worth investigating whether to do that,
but it's not trivial.) Hence the freeing wasn't applied.
Finally, an OOM check was missing, add it in.
cpu_set_to_range_string() can fail due to OOM. Handle that.
unit_write_settingf() exists, use it instead of formatting a string
beforehand.
cpu_set_add_all() can fail due to OOM. Let's avoid it if we don't have
to use it, just copy over the cpuset directly.
It was done for mount units already (see commit 142b8142d7). For the
same reasons and for consistency we should also stop activating automagically
swaps when their device is hot-plugged.
From the bug:
> According to the documentation of systemd.automount if the automoint point is
> automagically created if it doesn't exist yet. This ofcourse means the
> filesystem underneath has to be writable, which for / means not only does
> -.mount need to be started but also systemd-remount-fs.service has to be run,
> which isn't guaranteed by the default automount dependencies.
>
> For .mount units there is an automatic default After= dependency on
> local-fs-pre.target, would probably make sense to do the same for automount
> units to avoid it failing on the corner-case where it has to create directory.
Fixes#13306.
Just as `RuntimeMaxSec=` is supported for service units, add support for
it to scope units. This will gracefully kill a scope after the timeout
expires from the moment the scope enters the running state.
This could be used for time-limited login sessions, for example.
Signed-off-by: Philip Withnall <withnall@endlessm.com>
Fixes: #12035
chase_symlinks() would return negative on error, and either a non-negative status
or a non-negative fd when CHASE_OPEN was given. This made the interface quite
complicated, because dependning on the flags used, we would get two different
"types" of return object. Coverity was always confused by this, and flagged
every use of chase_symlinks() without CHASE_OPEN as a resource leak (because it
would this that an fd is returned). This patch uses a saparate output parameter,
so there is no confusion.
(I think it is OK to have functions which return either an error or an fd. It's
only returning *either* an fd or a non-fd that is confusing.)
This makes it easy to tell that the function only uses the Unit* for
reporting, and only makes changes to the other argument (which most likely
also points at the same Unit structure) for modifications.
No functional change, just adjusting code to follow the same pattern
everywhere. In particular, never call _verify() on an already loaded unit,
but return early from the caller instead. This makes the code a bit easier
to follow.
Now all unit types define .load. But even if it wasn't defined, we'd need
to call unit_load_fragment_and_dropin() anyway, so this code would not have
worked correctly.
Also, unit_load_fragment_and_dropin() either returns -ENOENT or changes
UNIT_STUB to UNIT_LOADED, so we don't need to repeat this here.
There is a slight functional change when load_state == UNIT_MERGED. Before,
we would not call unit_load_dropin(), but now we do. I'm not sure if this
causes an actual difference in behaviour, but since all other unit types do
this, I think it's better to do the same thing here too.
unit_load_fragment_and_dropin() and unit_load_fragment_and_dropin_optional()
are really the same, with one minor difference in behaviour. Let's drop
the second function.
"_optional" in the name suggests that it's the "dropin" part that is optional.
(Which it is, but in this case, we mean the fragment to be optional.)
I think the new version with a flag is easier to understand.
The default message for ENOSPC is very misleading: it says that the disk is
filled, but in fact the inotify watch limit is the problem.
So let's introduce and use a wrapper that simply calls inotify_add_watch(2) and
which fixes the error message up in case ENOSPC is returned.
PID1 may modified the environment passed by the kernel when it starts
running. Commit 9d48671c62 unset $HOME for
example.
In case PID1 is going to switch to a new root and execute a new system manager
which is not systemd, we should restore the original environment as the new
manager might expect some variables to be set by default (more specifically
$HOME).
This is the most basic consumer of the new systemd-vs-kernel checker,
both acting as a reasonable standalone exerciser of the code, and also
as a way for easy inspection of deviations from systemd internal state.
We currently don't have any mitigations against another privileged user
on the system messing with the cgroup hierarchy, bringing the system out
of line with what we've set in systemd. We also don't have any real way
to surface this to the user (we do have logs, but you have to know to
look in the first place).
There are a few possible solutions:
1. Maintaining our own cgroup tree with the new fsopen API and having a
read-only copy for everyone else. However, there are some
complications on this front, and this may be infeasible in some
environments. I'd rate this as a longer term effort that's tangential
to this patch.
2. Actively checking for changes with {fa,i}notify and changing them
back afterwards to match our configuration again. This is also
possible, but it's also good to have a way to do passive monitoring
of the situation without taking hard action. Also, currently daemons
like senpai do actually need to modify the tree behind systemd's
back (although hopefully this should be more integrated soon).
This patch implements another option, where one can, on demand, monitor
deviations in cgroup memory configuration from systemd's internal state.
Currently the only consumer is `systemd-analyze dump`, but the interface
is generic enough that it can also be exposed elsewhere later (for
example, over D-Bus).
Currently only memory limit style properties are supported, but later I
also plan to expand this out to other properties that systemd should
have ultimate control over.
mkdir -p is called both when setting up the autofs mount, as well
as after being notified that the real mount unit should be called.
However the first mkdir -p is hardcoded with 0555, while the second
uses the value specified to DirectoryMode in the automount unit; the
second mkdir -p is only needed when called from coldplug, so under
normal operation the dirs are incorrectly created with mode 0555.
This replaces the hardcoded 0555 mode with the value of DirectoryMode.
Closes#13683.
Setting the log level based on the signal made sense when signals that
were used were fixed. Since we allow signals to be configured, it doesn't
make sense to log at notice level about e.g. a restart or stop operation
just because the signal used is different.
This avoids messages like:
six.service: Killing process 210356 (sleep) with signal SIGINT.
v2:
- if RestartKillSignal= is not specified, fall back to KillSignal=. This is necessary
to preserve backwards compatibility (and keep KillSignal= generally useful).
A later version of the DefaultMemory{Low,Min} patch changed these to
require explicitly setting memory_foo_set, but we only set that in
load-fragment, not dbus-cgroup.
Without these, we may fall back to either DefaultMemoryFoo or
CGROUP_LIMIT_MIN when we really shouldn't.
This is an oversight from https://github.com/systemd/systemd/pull/12332.
Sadly the tests didn't catch it since it requires a real cgroup
hierarchy to see, and it wasn't seen in prod since we're only currently
using DefaultMemoryLow, not DefaultMemoryMin. :-(
As documented in the man-page, readdir() may return a directory entry with
d_type == DT_UNKNOWN. This must be handled for regular filesystems.
dirent_ensure_type() is available to set d_type if necessary. Use it in
some more places.
Without this systemd will fail to boot correctly with nfsroot and some
other filesystems.
Closes#13609
-ELOOP can happen also when enabling an alias name (which is admittedly useless
since the unit it belongs to was already enabled) so let's mention this
possibility when reporting the corresponding error.
Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller. This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.
The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written. However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well. In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
We have the problem that many early boot or late shutdown issues are harder
to solve than they could be because we have no logs. When journald is not
running, messages are redirected to /dev/kmsg. It is also the time when many
things happen in a rapid succession, so we tend to hit the kernel printk
ratelimit fairly reliably. The end result is that we get no logs from the time
where they would be most useful. Thus let's disable the kernels ratelimit.
Once the system is up and running, the ratelimit is not a problem. But during
normal runtime, things also log to journald, and not to /dev/kmsg, so the
ratelimit is not useful. Hence, there doesn't seem to be much point in trying
to restore the ratelimit after boot is finished and journald is up and running.
See kernel's commit 750afe7babd117daabebf4855da18e4418ea845e for the
description of the kenrel interface. Our setting has lower precedence than
explicit configuration on the kenrel command line.
"ratelimit" is a real word, so we don't need to use the other form anywhere.
We had both forms in various places, let's standarize on the shorter and more
correct one.