As device units will be reloaded by systemd whenever the corresponding device generates a "changed" event, if the mount unit / cryptsetup service is wanted by its device unit, the former can be restarted by systemd unexpectedly after the user stopped them explicitly. It is not sensible at all and can be considered dangerous. Neither is the behaviour conventional (as `auto` in fstab should only affect behaviour on boot and `mount -a`) or ever documented at all (not even in systemd, see systemd.mount(5) and crypttab(5)).
Creating a pipe with O_NONBLOCK causes both the read and the write end to
be marked as non-blocking.
The "write" end is passed to the kernel autofs module, and it does not
expect a non-blocking pipe. If it gets -EAGAIN when trying to write
(which is unlikely, but not completely impossible), it will close the
write end of the pipe, which leads to unexpected errors.
So change the code to only set O_NONBLOCK on the "read" end of the
pipe. This is the only end that systemd interacts with, so the only end
it should be configuring.
KillMode=mixed and control group are used to indicate that all
process should be killed off. SendSIGKILL is used for services
that require a clean shutdown. These are typically database
service where a SigKilled process would result in a lengthy
recovery and who's shutdown or startup time is quite variable
(so Timeout settings aren't of use).
Here we take these two factors and refuse to start a service if
there are existing processes within a control group. Databases,
while generally having some protection against multiple instances
running, lets not stress the rigor of these. Also ExecStartPre
parts of the service aren't as rigoriously written to protect
against against multiple use.
closes#8630
The problem was introduced in a37422045fbb68ad68f734e5dc00e0a5b1759773:
we have a unit which has a fragment, and when we'd update it based on
/proc/self/mountinfo, we'd say that e.g. What=/dev/loop8 has origin-fragment.
This commit changes two things:
- origin-fragment is changed to origin-mountinfo-implicit
- when we stop a unit, mountinfo information is flushed and all deps based
on it are dropped.
The second step is important, because when we restart the unit, we want to
notice that we have "fresh" mountinfo information. We could keep the old info
around and solve this in a different way, but keeping stale information seems
inelegant.
Fixes#11342.
procfs_memory_get_current is renamed to procfs_memory_get_used, because
"current" can mean anything, including total memory, used memory, and free
memory, as long as the value is up to date.
No functional change.
Fixes: #11499
Let's return -EAGAIN so that on state change, unit_process_job tries to
add our job to run_queue again so that all the reloads that coalesced
into the installed reload (which itself merged into a running one)
inititally atleast runs *once*. This should ensure service picks up all
config changes reliably.
See the issue being fixed for a detailed explanation.
When we synthesize a "struct rlimit" structure to pass on for
RLIMIT_NOFILE to our children, let's explicitly make sure that the soft
limit is not above FD_SETSIZE, for compat reason with select().
Note this only applies when we derive the "struct rlimit" from what we
inherited. If the user configures something explicitly it always takes
precedence.
Fixes: #11305Fixes: #3260
Related: #11456
So, here's what happens in the described scenario in #11305. A unit goes
down, and that triggeres stop jobs for the other two units as they were
bound to it. Now, the timer for manager triggered restarts kicks in and
schedules a restart job with the JOB_FAIL job mode. This means there is
a stop job installed on those units, and now due to them being bound to
us they also get a restart job enqueued. This however is a conflicts, as
neither stop can merge into restart, nor restart into stop. However,
restart should be able to replace stop in any case. If the stop
procedure is ongoing, it can cancel the stop job, install itself, and
then after reaching dead finish and convert itself to a start job.
However, if we increase the timer, then it can always take those units
from inactive -> auto-restart.
We change the job mode to JOB_REPLACE so the restart job cancels the
stop job and installs itself.
Also, the original bug could be worked around by bumping RestartSec= to
avoid the conflicting.
This doesn't seem to be something that is going to break uses. That is
because for those who already had it working, there must have never been
conflicting jobs, as that would result in a desctructive transaction by
virtue of the job mode used.
After this change, the test case is able to work nicely without issues.
@bl33pbl0p, please fix your editor
(Apparently you never configured the source tree? If you did, then the
git pre-commit hook would have been enabled which doesn't allow
commiting non-whitespace clean stuff...)
This doesn't really change much, but feels more correct to do, as it
ensures that all messages currently queued in the bus connections are
definitely unreffed and thus destryoing of the connection object will
follow immediately.
Strictly speaking this change is entirely unnecessary, since nothing
else could have acquired a ref to the connection and queued a message
in, however, now that we have the new sd_bus_close_unref() helper it
makes a lot of sense to use it here, to ensure that whatever happens
nothing that might have been queued fucks with us.
It's similar to sd_bus_flush_close_unref() but doesn't do the flushing.
This is useful since this will still discnnect the connection properly
but not synchronously wait for the peer to take our messages.
Primary usecase is within _cleanup_() expressions where synchronously
waiting on the peer is not OK.
The error string for operations that are not supported (e.g. "shutdown" for
user-defined units) should take two arguments, where the first one is the type
of action being defined (i.e. "FailureAction" vs. "SuccessAction") and the
second is the string that was invalid.
Currently, the code prints this:
$ systemd-run --user --wait -p SuccessAction=poweroff true
Failed to start transient service unit: EmergencyAction setting invalid for manager type: SuccessAction
Change the format string to instead print:
$ systemd-run --user --wait -p SuccessAction=poweroff true
Failed to start transient service unit: SuccessAction setting invalid for manager type: poweroff
This function returns 0 on success and a negative value on failure. On success,
it writes the parsed action to the address passed in its third argument.
`bus_set_transient_emergency_action` does this:
r = parse_emergency_action(s, system, &v);
if (v < 0)
// handle failure
However, `v` is not updated if the function fails, and this should be checking
`r` instead of `v`.
The result of this is that if an invalid failure (or success) action is
specified, systemd ends up creating the unit anyway and then misbehaves if it
tries to run the failure action because the action value comes from
uninitialized stack data. In my case, this resulted in a failed assertion:
Program received signal SIGABRT, Aborted.
0x00007fe52cca0d7f in raise () from /snap/usr/lib/libc.so.6
(gdb) bt
#0 0x00007fe52cca0d7f in raise () from /snap/usr/lib/libc.so.6
#1 0x00007fe52cc8b672 in abort () from /snap/usr/lib/libc.so.6
#2 0x00007fe52d66f169 in log_assert_failed_realm (realm=LOG_REALM_SYSTEMD, text=0x56177ab8e000 "action < _EMERGENCY_ACTION_MAX", file=0x56177ab8dfb8 "../src/core/emergency-action.c", line=33, func=0x56177ab8e2b0 <__PRETTY_FUNCTION__.14207> "emergency_action") at ../src/basic/log.c:795
#3 0x000056177aa98cf4 in emergency_action (m=0x56177c992cb0, action=2059118610, options=(unknown: 0), reboot_arg=0x0, exit_status=1, reason=0x7ffdd2df4290 "unit run-u0.service failed") at ../src/core/emergency-action.c:33
#4 0x000056177ab2b739 in unit_notify (u=0x56177c9eb340, os=UNIT_ACTIVE, ns=UNIT_FAILED, flags=(unknown: 0)) at ../src/core/unit.c:2504
#5 0x000056177aaf62ed in service_set_state (s=0x56177c9eb340, state=SERVICE_FAILED) at ../src/core/service.c:1104
#6 0x000056177aaf8a29 in service_enter_dead (s=0x56177c9eb340, f=SERVICE_SUCCESS, allow_restart=true) at ../src/core/service.c:1712
#7 0x000056177aaf9233 in service_enter_signal (s=0x56177c9eb340, state=SERVICE_FINAL_SIGKILL, f=SERVICE_SUCCESS) at ../src/core/service.c:1854
#8 0x000056177aaf921b in service_enter_signal (s=0x56177c9eb340, state=SERVICE_FINAL_SIGTERM, f=SERVICE_SUCCESS) at ../src/core/service.c:1852
#9 0x000056177aaf8eb3 in service_enter_stop_post (s=0x56177c9eb340, f=SERVICE_SUCCESS) at ../src/core/service.c:1788
#10 0x000056177aaf91eb in service_enter_signal (s=0x56177c9eb340, state=SERVICE_STOP_SIGKILL, f=SERVICE_SUCCESS) at ../src/core/service.c:1850
#11 0x000056177aaf91bc in service_enter_signal (s=0x56177c9eb340, state=SERVICE_STOP_SIGTERM, f=SERVICE_FAILURE_EXIT_CODE) at ../src/core/service.c:1848
#12 0x000056177aaf9759 in service_enter_running (s=0x56177c9eb340, f=SERVICE_FAILURE_EXIT_CODE) at ../src/core/service.c:1941
#13 0x000056177ab005b7 in service_sigchld_event (u=0x56177c9eb340, pid=112, code=1, status=1) at ../src/core/service.c:3296
#14 0x000056177aad84b5 in manager_invoke_sigchld_event (m=0x56177c992cb0, u=0x56177c9eb340, si=0x7ffdd2df48f0) at ../src/core/manager.c:2444
#15 0x000056177aad88df in manager_dispatch_sigchld (source=0x56177c994710, userdata=0x56177c992cb0) at ../src/core/manager.c:2508
#16 0x00007fe52d72f807 in source_dispatch (s=0x56177c994710) at ../src/libsystemd/sd-event/sd-event.c:2846
#17 0x00007fe52d730f7d in sd_event_dispatch (e=0x56177c993530) at ../src/libsystemd/sd-event/sd-event.c:3229
#18 0x00007fe52d73142e in sd_event_run (e=0x56177c993530, timeout=18446744073709551615) at ../src/libsystemd/sd-event/sd-event.c:3286
#19 0x000056177aad9f71 in manager_loop (m=0x56177c992cb0) at ../src/core/manager.c:2906
#20 0x000056177aa7c876 in invoke_main_loop (m=0x56177c992cb0, ret_reexecute=0x7ffdd2df4bff, ret_retval=0x7ffdd2df4c04, ret_shutdown_verb=0x7ffdd2df4c58, ret_fds=0x7ffdd2df4c70, ret_switch_root_dir=0x7ffdd2df4c48, ret_switch_root_init=0x7ffdd2df4c50, ret_error_message=0x7ffdd2df4c60) at ../src/core/main.c:1792
#21 0x000056177aa7f251 in main (argc=2, argv=0x7ffdd2df4e78) at ../src/core/main.c:2573
Fix this by checking the correct variable.
Makes it easier to understand what was merged (and easier to realize why).
Example is a start job running, and another unit triggering a verify-active job. It is not clear what job was it that from baz.service that merged into the installed job for bar.service in the debug logs. This makes it useful when debugging issues.
Jan 15 11:45:58 jupiter systemd[1218]: baz.service: Trying to enqueue job baz.service/start/replace
Jan 15 11:45:58 jupiter systemd[1218]: baz.service: Installed new job baz.service/start as 498
Jan 15 11:45:58 jupiter systemd[1218]: bar.service: Merged into installed job bar.service/start as 497
Jan 15 11:45:58 jupiter systemd[1218]: baz.service: Enqueued job baz.service/start as 498
It becomes:
Jan 15 11:45:58 jupiter systemd[1218]: bar.service: Merged bar.service/verify-active into installed job bar.service/start as 497
Found by inspecting results of running this small program:
int main(int argc, const char **argv) {
for (int i = 1; i < argc; i++) {
FILE *f;
char line[1024], prev[1024], *r;
int lineno;
prev[0] = '\0';
lineno = 1;
f = fopen(argv[i], "r");
if (!f)
exit(1);
do {
r = fgets(line, sizeof(line), f);
if (!r)
break;
if (strcmp(line, prev) == 0)
printf("%s:%d: error: dup %s", argv[i], lineno, line);
lineno++;
strcpy(prev, line);
} while (!feof(f));
fclose(f);
}
}
https://hamberg.no/erlend/posts/2013-02-18-static-array-indices.html
This only works with clang, unfortunately gcc doesn't seem to implement the check
(tested with gcc-8.2.1-5.fc29.x86_64).
Simulated error:
[2/3] Compiling C object 'systemd-nspawn@exe/src_nspawn_nspawn.c.o'.
../src/nspawn/nspawn.c:3179:45: warning: array argument is too small; contains 15 elements, callee requires at least 16 [-Warray-bounds]
candidate = (uid_t) siphash24(arg_machine, strlen(arg_machine), hash_key);
^ ~~~~~~~~
../src/basic/siphash24.h:24:64: note: callee declares array parameter as static here
uint64_t siphash24(const void *in, size_t inlen, const uint8_t k[static 16]);
^~~~~~~~~~~~
MIPS/O32's st_rdev member of struct stat is unsigned long, which
is 32bit, while dev_t is defined as 64bit, which make some problems
in device_path_parse_major_minor.
Don't pass st.st_rdev, st_mode to device_path_parse_major_minor,
while pass 2 seperate variables. The result of stat is alos copied
out into these 2 variables. Fixes: #11247
Nitpicky, but we've used a lot of random spacings and names in the past,
but we're trying to be completely consistent on "cgroup vN" now.
Generated by `fd -0 | xargs -0 -n1 sed -ri --follow-symlinks 's/cgroups? ?v?([0-9])/cgroup v\1/gI'`.
I manually ignored places where it's not appropriate to replace (eg.
"cgroup2" fstype and in src/shared/linux).
Commit 250e9fadbc introduced
support for %j/%J specifier in unit files. The function
unit_name_printf is used in unit dependency resolution,
such as Wants / After directives, but was missing support
for the %j. Add to allow directives such as:
[Unit]
Wants=bar-%j.target
Fixes: systemd/systemd#11217
Signed-off-by: Patrick Williams <patrick@stwcx.xyz>
The idea was that those vars could be configured to 'no' to not install the .pc
files, or they could be set to '', and then they would be built but not
installed. This was inherited from the autoconf build system. This couldn't
work because '' is replaced by the default value. Also, having this level of
control doesn't seem necessary, since creating those files is very
quick. Skipping with 'no' was implemented only for systemd.pc and not the other
.pc files. Let's simplify things and skip installation if the target dir
is configured as 'no' for all .pc files.
$ build/systemctl --version
systemd 239-3555-g6178cbb5b5
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN +PCRE2 default-hierarchy=hybrid
$ git tag v240 -m 'v240'
$ ninja -C build
ninja: Entering directory `build'
[76/76] Linking target fuzz-unit-file.
$ build/systemctl --version
systemd 240
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN +PCRE2 default-hierarchy=hybrid
This is very useful during development, because a precise version string is
embedded in the build product and displayed during boot, so we don't have to
guess answers for questions like "did I just boot the latest version or the one
from before?".
This change creates an overhead for "noop" builds. On my laptop, 'ninja -C
build' that does nothing goes from 0.1 to 0.5 s. It would be nice to avoid
this, but I think that <1 s is still acceptable.
Fixes#7183.
PACKAGE_VERSION is renamed to GIT_VERSION, to make it obvious that this is the
more dynamically changing version string.
Why save to a file? It would be easy to generate the version tag using
run_command(), but we want to go through a file so that stuff gets rebuilt when
this file changes. If we just defined an variable in meson, ninja wouldn't know
it needs to rebuild things.
Let's not use atoi() if we can simply provide the project version as a number.
In C code, this is the numerical project version. In substitutions in other
files, this is just the bare substitution.
The "PACKAGE_" prefix is from autotools, and is strange. We call systemd a
"project", and "package" is something that distros build. Let's rename.
PACKAGE_URL is renamed to PROJECT_URL for the same reasons and for consistency.
(This leave PACKAGE_VERSION as the stringified define for C code.)
This reverts commit 89f9752ea0.
This patch causes various problems during boot, where a "mount storm" occurs
naturally. Current approach is flakey, and it seems very risky to push a
feature like this which impacts boot right before a release. So let's revert
for now, and consider a more robust solution after later.
Fixes#11209.
> https://github.com/systemd/systemd/pull/11196#issuecomment-448523186:
"Reverting 89f9752ea0 and fcfb1f775e fixes this test."
PACKAGE_VERSION is more explicit, and also, we don't pretend that changing the
project name in meson.build has any real effect. "systemd" is embedded in a
thousand different places, so let's just use the hardcoded string consistently.
This is mostly in preparation for future changes.
The starting of mount units requires that changes to
/proc/self/mountinfo be processed before the SIGCHILD from the
completion of /sbin/mount is processed, as described by the comment
/* Note that due to the io event priority logic, we can be sure the new mountinfo is loaded
* before we process the SIGCHLD for the mount command. */
The recently-added mount-storm protection can defeat this as it
will sometimes deliberately delay processing of /proc/self/mountinfo.
So we need to disable mount-storm protection when a mount unit is starting.
We do this by keeping a counter of the number of pending
mounts, and disabling the protection when this is non-zero.
Thanks to @asavah for finding and reporting this problem.
KeyringMode option is useful for user services. Also, documentation for the
option suggests that the option applies to user services. However, setting the
option to any of its allowed values has no effect.
This commit fixes that and removes EXEC_NEW_KEYRING flag. The flag is no longer
necessary: instead of checking if the flag is set we can check if keyring_mode
is not equal to EXEC_KEYRING_INHERIT.
ensure that mfree() on filename is called after the logging function
which uses the string pointed by filename
Signed-off-by: Khem Raj <raj.khem@gmail.com>
If we create 2000 mounts (on a 1-CPU qemu VM) with
mkdir -p /MNT/{1..2000}
time for i in {1..2000}; do mount --bind /etc /MNT/$i ; done
it takes around 20 seconds to complete. Much of this time is taken up
by systemd repeatedly processing /proc/self/mountinfo.
If I disable the processing, the time drops to about 4 seconds.
I have reports that on a larger system with multiple active user sessions, each
with it's own systemd, the impact can be higher.
One particular use-case where a large number of mounts can be expected in quick
succession is when the "clearcase" SCM starts up.
This patch modifies the handling up events from /proc/self/mountinfo so
that systemd backs off when a storm is detected. Specifically the time to process
mountinfo is measured, and the process will not be repeated until 10 times
that duration has passed. This ensures systemd won't use more than 10% of
real time processing mountinfo.
With this patch, my test above takes about 5 seconds.
The parent slice is always filtered ahead of time from UNIT_BEFORE, so
checking if the current member is the same as the parent unit will never
pass.
I may also write a SLICE_FOREACH_CHILD macro to remove some more of the
parent slice checks, but this requires a bit of a rework and general
refactoring and may not be worth it, so let's just do this for now.
fdopen doesn't accept "e", it's ignored. Let's not mislead people into
believing that it actually sets O_CLOEXEC.
From `man 3 fdopen`:
> e (since glibc 2.7):
> Open the file with the O_CLOEXEC flag. See open(2) for more information. This flag is ignored for fdopen()
As mentioned by @jlebon in #11131.
It would be very wrong if any of the specfier printf calls modified
any of the objects or data being printed. Let's mark all arguments as const
(primarily to make it easier for the reader to see where modifications cannot
occur).
Previously, we'd immediately propagate unit state changes into any jobs
pending for them, always. With this we only do this if the manager is
out of the "reload" state. This fixes the problem #8803 tried to
address, by simply not completing jobs until after the reload (and thus
reestablishment of the dbus connection) is complete.
Note that there's no need to later on explicitly catch up with the
missed job state changes (i.e. there's no need to call
unit_process_job() later one explicitly). That's because for jobs in
JOB_WAITING state on deserialization all jobs are requeued into the run
queue anyway, and thus checked again if they can complete now. And for
JOB_RUNNING jobs unit_catchup() phase is going to trigger missed out
state changes *after* the reload complete anyway (after all that's what
distinguishes from unit_coldplug()).
Replaces: #8803
Let's add a helper call unit_deserialize_job() for this purpose, and
let's move registration in the global jobs hash table into
job_install_deserialized() so that it it is done after all superficial
checks are done, and before transitioning into installed states, so that
rollback code is not necessary anymore.
Let's validate that the ID is actually allocated to us before remove a
job.
This is relevant as various bits of code will call job_free() on
partially set up Job objects, and we really shouldn't remove another job
object accidentally from the hash table, when the set up didn't
complete.
Memory management is borked for this, and moreover this is unnecessary
since f0831ed2a0, i.e. since coldplug() and catchup() are two different
concepts: the former restoring the state from before a reload, the
latter than adjusting it again to the actual status in effect after the
reload.
Fixes: #10716
Mostly reverts: #8803
We'd only set the description after the device appeared in sysfs, so
we'd always print
"A start job is running for dev-disk-by\x2duuid-aaaa ... aaaa.device (42s / 1min 30s)"
Let's make this
"A start job is running for /dev/disk/by-duuid/aaaa ... aaaa (42s / 1min 30s)"
https://bugzilla.redhat.com/show_bug.cgi?id=1655860
PATH_MAX is supposed to include the terminating NUL byte. But we already
check that there is no NUL byte in the specified path. Hence the maximum
length we can expect is PATH_MAX - 1.
This doesn't change much, but makes this use of PATH_MAX consistent with the
rest of the codebase.
We previously returned -EPERM but it can be returned for various other reasons
too.
Let's use -ENOLINK instead as this value shouldn't be used currently. This
allows users of CHASE_SAFE to detect without any ambiguities when unsafe
transitions are encountered by chase_symlinks().
All current users of CHASE_SAFE that explicitly reacted on -EPERM have been
converted to react on -ENOLINK.
Much like for the mount units we need fields such as the slice
initialized by the time we activate the swap, hence when the kernel
let's us know about a new swap that appeared we need to initialize the
slice in any Swap object we allocated for that right-away, even if we
can't read the real unit file for the swap device.
For services (and other units) we generally follow the rule that at the
beginning of each cycle, i.e. when the INACTIVE/FAILED state is left for
ACTIVATING/ACTIVE we flush out various state variables. Mount units
handled this differently so far when the unit state change was effected
outside of systemd: in that case these variables would be flushed out
when going back to INACTIVE/FAILED already.
Let's fix that, and flush out this state always during the activating
transition, not during the deactivating transition.
We never know what the changes triggered by mount_set_state() do to the
unit. Let's be safe and copy the device path into our set, so that we
are safe against that.
It doesn't matter what kind of precise failure we had earlier with
loading the unit, let's report that it loaded successfully now, after
all the kernel is an OK source for that, like any other.
Whenever we notice a change on an existing /proc/self/mountinfo line,
let's update the deps generated from it. For that, let's flush out the
old deps generated this way, and add in the new ones.
This takes benefit of the fact that today (unlike a comment this patch
removes says) we can remove deps in a somewhat reasonable way.
Let's set 'from_proc_self_mountinfo' right away, since we know its from
there. This is important so that when the load queue is dispatched (and
thus mount_load() called) this
fact is already known.
In a previous commit we added logic that mount_add_extras() (or more
precisely mount_add_default_dependencies()) adds in dependencies on
remote-fs.target and local-fs.target, hence we can drop this from
mount_setup_new_unit() and let the usual load queue dispatching take
care of this.
Let's initialize two fields with free_and_strdup() rather than directly
with strdup(). The fields should not be initialized so far, but it's
still nicer to be prepared for futzre code changes and always free
what's stored before replacing it.
If we can't process a specific line in /proc/self/mountinfo we should
log about it (which we do), but this should not affect other lines, nor
further processing of mount units. Let's keep these failures local.
Fixes: #10874
We need to make sure that the slice property is initialized whenever
mount_load() is invoked, even if we fail to load things properly off
disk. This is important since we generally don't allow changing the
slice after a unit has been started. But given that we must track the
state of external objects with mount units we must hence initialize the
property no matter what.
This deps are very similar to the -pre deps, hence establish them at the
same place, in particular as they should only be generated if default
deps are on.
This allows us to later on remove similar code that adds in these deps
whenever /proc/self/mountinfo changes.
Some controllers (like the CPU controller) have a performance cost that
is non-trivial on certain workloads. While this can be mitigated and
improved to an extent, there will for some controllers always be some
overheads associated with the benefits gained from the controller.
Inside Facebook, the fix applied has been to disable the CPU controller
forcibly with `cgroup_disable=cpu` on the kernel command line.
This presents a problem: to disable or reenable the controller, a reboot
is required, but this is quite cumbersome and slow to do for many
thousands of machines, especially machines where disabling/enabling a
stateful service on a machine is a matter of several minutes.
Currently systemd provides some configuration knobs for these in the
form of `[Default]CPUAccounting`, `[Default]MemoryAccounting`, and the
like. The limitation of these is that Default*Accounting is overrideable
by individual services, of which any one could decide to reenable a
controller within the hierarchy at any point just by using a controller
feature implicitly (eg. `CPUWeight`), even if the use of that CPU
feature could just be opportunistic. Since many services are provided by
the distribution, or by upstream teams at a particular organisation,
it's not a sustainable solution to simply try to find and remove
offending directives from these units.
This commit presents a more direct solution -- a DisableControllers=
directive that forcibly disallows a controller from being enabled within
a subtree.
This adds a depth-first version of unit_realize_cgroup_now which can
only do depth-first disabling of controllers, in preparation for the
DisableController= directive.
systemd currently doesn't really expend much effort in disabling
controllers. unit_realize_cgroup_now *may* be able to disable a
controller in the basic case when using cgroup v2, but generally won't
manage as downstream dependents may still use it.
This code doesn't add any logic to fix that, but it starts the process
of moving to have a breadth-first version of unit_realize_cgroup_now for
enabling, and a depth-first version of unit_realize_cgroup_now for
disabling.
This splits out a bunch of functions from fileio.c that have to do with
temporary files. Simply to make the header files a bit shorter, and to
group things more nicely.
No code changes, just some rearranging of source files.
In the kernel sources attempts to write to either are refused with
EINVAL. Not sure why these attributes are exported anyway on cgroupsv1,
but this means we really should ignore them altogether.
This simplifies our code as this means cgroupsv1 is more alike cgroupsv2
in this regard.
Fixes: #10969
Previously, we'd enqueue a unit to the dbus queue whenever the state
changed, after we processed the state change fully. This commit to the
beginning of the state change. This has the benefit that when the state
change causes a job to complete the unit is already in the dbus queue,
and thus we get the guarantee that any unit change can be sent out to
clients before the job change.
Whenever we enqueue a job, we should announce this on the bus, hence add
both the job and the unit to the dbus queues. (Why both? The former
should be obvious, the latter because we send out Job properties).
In most cases adding these to the queue is not necessary, as
other properties tend to change at the same time and result in a change
being sent out. However, let's clean this up and make it explicit.
Let's inform the clients about assert/condition property changes as they
happen, it's basically for free because assert/condition property
changes generally coincide with other unit state changes (after all
these checks are done on unit_start())
When a client requests a new job, let's make sure we for out the JobNew
signals for it, before we return successfully from the method call.
After all we shouldn't return a path that is not announced yet, as
announcement of jobs should be considered part of the job setup.
We always want the state of the unit to be reflected first to the
client before we claim the job has changed state, after all the job is
the request to change unit state, and thus job changes are kinda the
confirmation that the state changed as requested.
This allows clients to follow our internal state changes safely.
Previously, quick state changes (for example, when we restart a unit due
to Restart= after it quickly transitioned through DEAD/FAILED states)
would be coalesced into one bus signal event, with this change there's
the guarantee that all state changes after the unit was announced ones
are reflected on th bus.
Note we only do this kind of guaranteed flushing only for unit state
changes, not for other unit property changes, where clients still have
to expect coalescing. This is because the unit state is a very
important, high-level concept.
Fixes: #10185
Whenever we invoke external, foreign code from code that has
RLIMIT_NOFILE's soft limit bumped to high values, revert it to 1024
first. This is a safety precaution for compatibility with programs using
select() which cannot operate with fds > 1024.
This commit adds the call to rlimit_nofile_safe() to all invocations of
exec{v,ve,l}() and friends that either are in code that we know runs
with RLIMIT_NOFILE bumped up (which is PID 1 and all journal code for
starters) or that is part of shared code that might end up there.
The calls are placed as early as we can in processes invoking a flavour
of execve(), but after the last time we do fd manipulations, so that we
can still take benefit of the high fd limits for that.
Let's generalize this, so that we can use this in nspawn later on, which
is pretty useful as we need to be able to mask files from the inner
child of nspawn too, where the host's /run/systemd/inaccessible
directory is not visible anymore. Moreover, if nspawn can create these
nodes on its own before the payload this means the payload can run with
fewer privileges.
Not only when we populate the "devices" cgroup controller we need
major/minor numbers, but for the io/blkio one it's the same, hence let's
use the same logic for both.
device_path_make_{major_minor|canonical) generate device node paths
given a mode_t and a dev_t. We have similar code all over the place,
let's unify this in one place. The former will generate a "/dev/char/"
or "/dev/block" path, and never go to disk. The latter then goes to disk
and resolves that path to the actual path of the device node.
device_path_parse_major_minor() reverses device_path_make_major_minor(),
also withozut going to disk.
We have similar code doing something like this at various places, let's
unify this in a single set of functions. This also allows us to teach
them special tricks, for example handling of the
/run/systemd/inaccessible/{blk|chr} device nodes, which we use for
masking device nodes, and which do not exist in /dev/char/* and
/dev/block/*
Previously we'd allow pattern expressions such as "char-input" to match
all input devices. Internally, this would look up the right major to
test in /proc/devices. With this commit the syntax is slightly extended:
- "char-*" can be used to match any kind of character device, and
similar "block-*. This expression would work previously already, but
instead of actually installing a wildcard match it would install many
individual matches for everything listed in /proc/devices.
- "char-<MAJOR>" with "<MAJOR>" being a numerical parameter works now
too. This allows clients to install whitelist items by specifying the
major directly.
The main reason to add these is to provide limited compat support for
clients that for some reason contain whitelists with major/minor numbers
(such as OCI containers).
This adds some code to hanlde /dev/block/* and /dev/char/* device node
paths specially: instead of actually stat()ing them we'll just parse the
major/minor name from the name. This is useful 'hack' to allow clients
to install whitelists for devices that don't actually have to exist.
Also, let's similarly handle /run/systemd/inaccessible/{blk|chr}. This
allows us to simplify our built-in default whitelist to not require a
"ignore_enoent" mode for these nodes.
In general we should be careful with hardcoding major/minor numbers, but
in this case this should safe.
add new "systemd-run-generator" for running arbitrary commands from the kernel command line as system services using the "systemd.run=" kernel command line switch
This adds SuccessActionExitStatus= and FailureActionExitStatus= that may
be used to configure the exit status to propagate in when
SuccessAction=exit or FailureAction=exit is used.
When not specified let's also propagate the exit status of the main
process we fork off for the unit.
Otherwise we might conflict with the "no-processes-in-inner-cgroup" rule
of cgroupsv2. Consider nspawn starting up and initializing its cgroup
hierarchy with "supervisor/" and "payload/" as subcgroup, with itself
moved into the former and the payload into the latter. Now, if an
ExecStartPre= is run right after it cannot be placed in the main cgroup,
because that is now in inner cgroup with populated children.
Hence, let's run these helpers in another sub-cgroup .control/ below it.
This is somewhat ugly since it weakens the clear separation of
ownership, but given that this is an explicit contract, and double opt-in should be acceptable.
Fixes: #10482
Let's highlight the unit description string in the status updates, to
separate them a bit more the english sentence they are part of, and thus
make the different casing less surprising.
This way we can corectly ensure that when a unit that requires some
controller goes away, we propagate the removal of it all the way up, so
that the controller is turned off in all the parents too.
Previously we tried to be smart: when a new unit appeared and it only
added controllers to the cgroup mask we'd update the cached members mask
in all parents by ORing in the controller flags in their cached values.
Unfortunately this was quite broken, as we missed some conditions when
this cache had to be reset (for example, when a unit got unloaded),
moreover the optimization doesn't work when a controller is removed
anyway (as in that case there's no other way for the parent to iterate
though all children if any other, remaining child unit still needs it).
Hence, let's simplify the logic substantially: instead of updating the
cache on the right events (which we didn't get right), let's simply
invalidate the cache, and generate it lazily when we encounter it later.
This should actually result in better behaviour as we don't have to
calculate the new members mask for a whole subtree whever we have the
suspicion something changed, but can delay it to the point where we
actually need the members mask.
This allows us to simplify things quite a bit, which is good, since
validating this cache for correctness is hard enough.
Fixes: #9512
After creating a cgroup we need to initialize its
"cgroup.subtree_control" file with the controllers its children want to
use. Currently we do so whenever the mkdir() on the cgroup succeeded,
i.e. when we know the cgroup is "fresh". Let's update the condition
slightly that we also do so when internally we assume a cgroup doesn't
exist yet, even if it already does (maybe left-over from a previous
run).
This shouldn't change anything IRL but make things a bit more robust.
Previously this would manipulate the realization mask for invalidating
the realization. This is a bit ugly though as the realization mask's
primary purpose to is to reflect in which hierarchies a cgroup currently
exists, and it's probably a good idea to keep that in sync with
realities.
We nowadays have the an explicit fields for invalidating cgroup
controller information, the "cgroup_invalidated_mask", let's use this
one instead.
The effect is pretty much the same, as the main consumer of these masks
(unit_has_mask_realize()) checks both anyway.
This changes cg_enable_everywhere() to return which controllers are
enabled for the specified cgroup. This information is then used to
correctly track the enablement mask currently in effect for a unit.
Moreover, when we try to turn off a controller, and this works, then
this is indicates that the parent unit might succesfully turn it off
now, too as our unit might have kept it busy.
So far, when realizing cgroups, i.e. when syncing up the kernel
representation of relevant cgroups with our own idea we would strictly
work from the root to the leaves. This is generally a good approach, as
when controllers are enabled this has to happen in root-to-leaves order.
However, when controllers are disabled this has to happen in the
opposite order: in leaves-to-root order (this is because controllers can
only be enabled in a child if it is already enabled in the parent, and
if it shall be disabled in the parent then it has to be disabled in the
child first, otherwise it is considered busy when it is attempted to
remove it in the parent).
To make things complicated when invalidating a unit's cgroup membershup
systemd can actually turn off some controllers previously turned on at
the very same time as it turns on other controllers previously turned
off. In such a case we have to work up leaves-to-root *and*
root-to-leaves right after each other. With this patch this is
implemented: we still generally operate root-to-leaves, but as soon as
we noticed we successfully turned off a controller previously turned on
for a cgroup we'll re-enqueue the cgroup realization for all parents of
a unit, thus implementing leaves-to-root where necessary.