Let's place them in initialize_runtime(), where they appear to fit best.
Effectively this is just a move a little bit down, swapping places with
log_execution_mode(), which should require neither call to be done
first.
Note that changes the conditionalization a bit for these calls, from
(PID == 1) to (arg_system && arg_action == ACTION_RUN). At this point this is pretty much the same
however, as we don't allow PID 1 without ACTION_RUN and without
arg_system set, safety_checks() ensures that.
It's part of finalizing our runtime parameters, hence let's move this
into load_configuration() after we loaded everything else. This is safe,
since we don't use it between the location where it was and where we
place it now yet.
This just sets up some variables the loaded configuration will then
modify. Let's invoke it hence right before loading the configuration.
This moves the initialization just a tiny bit later, but that shouldn't
matter, since we never access it in-between.
Let's make sure that if we are PID 1 we are invoked in ACTION_RUN mode,
and in arg_system mode, as well as the opposite.
Everything else is untested and probably not worth supporting hence
let's bail out early if people try anyway.
Let's shorten main() a bit, and split out everything that loads our
configuration and runtime parameters into a function of its own.
No changes in behaviour.
Dec 14 14:10:54 krowka systemd[1]: System is tainted: overflowgid-not-65534
-- Subject: The system is configured in a way that might cause problems
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- The following "tags" are possible:
-- - "split-usr" — /usr is a separate file system and was not mounted when systemd
-- was booted
-- - "cgroups-missing" — the kernel was compiled without cgroup support or access
-- to expected interface files is resticted
-- - "var-run-bad" — /var/run is not a symlink to /run
-- - "overflowuid-not-65534" — the kernel user ID used for "unknown" users (with
-- NFS or user namespaces) is not 65534
-- - "overflowgid-not-65534" — the kernel group ID used for "unknown" users (with
-- NFS or user namespaces) is not 65534
-- Current system is tagged as overflowgid-not-65534.
Our CODING_STYLE suggests not comparing with NULL, but relying on C's
downgrade-to-bool feature for that. Fix up some code to match these
guidelines. (This is not comprehensive, the coccinelle output for this
is unfortunately kinda borked)
This option allows a device path to be specified for the systemd
watchdog (both runtime and shutdown).
If a system requires a watchdog other than /dev/watchdog (pointing to
/dev/watchdog0) to be used to reboot the system, this setting should be
changed to the relevant watchdog device path (e.g. /dev/watchdog1).
The tainting logic existed for a long time, but was hidden inside the
bus interfaces. Let's give it a small bit more coverage, by logging its
value early at boot during initialization.
This moves the invocation a bit later, but that shoudln't matter. By
moving it we gain two things: first of all, its closer to other code
where it belongs, secondly its naturally conditioned properly, as we no
longer will rewrite the container ID file on every reexecution again,
and not in test mode either.
No real functional changes, just some rearranging to shorten the overly
long main() function a bit.
This gets rid of the arm_reboot_watchdog variable, as it can be directly
derived from shutdown_verb, and we need it only one time. By dropping it
we can reduce the number of arguments we need to pass around.
read_one_line_file() always returns <= 0, so the code was OK, but let's write
the check a bit differently to make it obvious that min_max is always set.
This makes things quite a bit more systematic I think, as we can
systematically operate on all timestamps, for example for the purpose of
serialization/deserialization.
This rework doesn't necessarily make things shorter in the individual
lines, but it does reduce the line count a bit.
(This is useful particularly when we want to add additional timestamps,
for example to solve #7023)
Presets are useful to initialize uninitialized /etc, but that doesn't
apply to the initrd.
Also, let's rename etc_empty → first_boot. After all, the variable
doesn't actually reflect whether /etc is really empty, it just reflects
whether /etc/machine-id existed originally or not. Moreover, we later on
directly initialize manager_set_first_boot() from it, hence let's just
name it the same way all through the codepath, to make this all less
confusing.
See: #7100
This function is really not a method of the Manager object (implemented
in manager.c), but just a helper in main.c. Hence let's not confusingly
name it the way methods are called.
These three settings only make sense within the context of actual unit
files, hence filter this out when applied to the per-manager default,
and generate a log message about it.
We neglected to set error_message for errors which occur _after_ the
`finish` label. These fatal errors only happen in paths where `finish`
was reached successfully, i.e. error_message has not already been set
(and this analysis is simple enough that this need not cause too much
headaches. Also our new assignments to error_message come immediately
after execve() calls, which would have lost the error_message if it had
been set).
Also print a status message when we fail to exec init, otherwise the only
sign the user will see is `# ` :).
This addresses the lack of error messages pointed out in issue #6827.
The advantage is that is the name is mispellt, cpp will warn us.
$ git grep -Ee "conf.set\('(HAVE|ENABLE)_" -l|xargs sed -r -i "s/conf.set\('(HAVE|ENABLE)_/conf.set10('\1_/"
$ git grep -Ee '#ifn?def (HAVE|ENABLE)' -l|xargs sed -r -i 's/#ifdef (HAVE|ENABLE)/#if \1/; s/#ifndef (HAVE|ENABLE)/#if ! \1/;'
$ git grep -Ee 'if.*defined\(HAVE' -l|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_]*)\)/\1/g'
$ git grep -Ee 'if.*defined\(ENABLE' -l|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_]*)\)/\1/g'
+ manual changes to meson.build
squash! build-sys: use #if Y instead of #ifdef Y everywhere
v2:
- fix incorrect setting of HAVE_LIBIDN2
On current kernels BPF_MAP_TYPE_LPM_TRIE bpf maps are charged against
RLIMIT_MEMLOCK even for privileged users that have CAP_IPC_LOCK. Given
that mlock() generally ignores RLIMIT_MEMLOCK if CAP_IPC_LOCK is set
this appears to be an oversight in the kernel. Either way, until that's
fixed, let's just bump RLIMIT_MEMLOCK for the root user considerably, as
the default is quite limiting, and doesn't permit us to create more than
a few TRIE maps.
Now generators are only run in systemd --test mode, where this makes
most sense (how are you going to test what would happen otherwise?).
Fixes#6842.
v2:
- rename test_run to test_run_flags
Most importantly, don't collect open socket activation fds when in
--test mode. This specifically created a problem because we invoke
pager_open() beforehand (which these days makes copies of the original
stdout/stderr in order to be able to restore them when the pager goes
away) and we might mistakenly the fd copies it creates as socket
activation fds.
Fixes: #6383
This commit moves the first-boot system preset-settings evaluation out
of main and into the manager startup logic itself. Notably, it reverses
the order between generators and presets evaluation, so that any changes
performed by first-boot generators are taken into the account by presets
logic.
After this change, units created by a generator can be enabled as part
of a preset.
Some kdbus_flag and memfd related parts are left behind, because they
are entangled with the "legacy" dbus support.
test-bus-benchmark is switched to "manual". It was already broken before
(in the non-kdbus mode) but apparently nobody noticed. Hopefully it can
be fixed later.
This moves pretty much all uses of getpid() over to getpid_raw(). I
didn't specifically check whether the optimization is worth it for each
replacement, but in order to keep things simple and systematic I
switched over everything at once.
The CapabilityBoundingSet option only makes sense if we are running as
PID1.
The system.conf.d(5) manpage, already states that the CapabilityBoundingSet
option:
Controls which capabilities to include in the capability bounding set
for PID 1 and its children.
https://github.com/systemd/systemd/issues/6080
All those uses were correct, but I think it's better to be explicit.
Using implicit errno is too error prone, and with this change we can require
(in the sense of a style guideline) that the code is always specified.
Helpful query: git grep -n -P 'log_[^s][a-z]+\(.*%m'
This has systemd look at /proc/sys/fs/nr_open to find the current maximum of
open files compiled into the kernel and tries to set the RLIMIT_NOFILE max to
it. This has the advantage the value chosen as limit is less arbitrary and also
improves the behavior of systemd in containers that have an rlimit set: When
systemd currently starts in a container that has RLIMIT_NOFILE set to e.g.
100000 systemd will lower it to 65536. With this patch systemd will try to set
the nofile limit to the allowed kernel maximum. If this fails, it will compute
the minimum of the current set value (the limit that is set on the container)
and the maximum value as soft limit and the currently set maximum value as the
maximum value. This way it retains the limit set on the container.
This substantially reworks the seccomp code, to ensure better
compatibility with some architectures, including i386.
So far we relied on libseccomp's internal handling of the multiple
syscall ABIs supported on Linux. This is problematic however, as it does
not define clear semantics if an ABI is not able to support specific
seccomp rules we install.
This rework hence changes a couple of things:
- We no longer use seccomp_rule_add(), but only
seccomp_rule_add_exact(), and fail the installation of a filter if the
architecture doesn't support it.
- We no longer rely on adding multiple syscall architectures to a single filter,
but instead install a separate filter for each syscall architecture
supported. This way, we can install a strict filter for x86-64, while
permitting a less strict filter for i386.
- All high-level filter additions are now moved from execute.c to
seccomp-util.c, so that we can test them independently of the service
execution logic.
- Tests have been added for all types of our seccomp filters.
- SystemCallFilters= and SystemCallArchitectures= are now implemented in
independent filters and installation logic, as they semantically are
very much independent of each other.
Fixes: #4575
This improves kernel command line parsing in a number of ways:
a) An kernel option "foo_bar=xyz" is now considered equivalent to
"foo-bar-xyz", i.e. when comparing kernel command line option names "-" and
"_" are now considered equivalent (this only applies to the option names
though, not the option values!). Most of our kernel options used "-" as word
separator in kernel command line options so far, but some used "_". With
this change, which was a source of confusion for users (well, at least of
one user: myself, I just couldn't remember that it's systemd.debug-shell,
not systemd.debug_shell). Considering both as equivalent is inspired how
modern kernel module loading normalizes all kernel module names to use
underscores now too.
b) All options previously using a dash for separating words in kernel command
line options now use an underscore instead, in all documentation and in
code. Since a) has been implemented this should not create any compatibility
problems, but normalizes our documentation and our code.
c) All kernel command line options which take booleans (or are boolean-like)
have been reworked so that "foobar" (without argument) is now equivalent to
"foobar=1" (but not "foobar=0"), thus normalizing the handling of our
boolean arguments. Specifically this means systemd.debug-shell and
systemd_debug_shell=1 are now entirely equivalent.
d) All kernel command line options which take an argument, and where no
argument is specified will now result in a log message. e.g. passing just
"systemd.unit" will no result in a complain that it needs an argument. This
is implemented in the proc_cmdline_missing_value() function.
e) There's now a call proc_cmdline_get_bool() similar to proc_cmdline_get_key()
that parses booleans (following the logic explained in c).
f) The proc_cmdline_parse() call's boolean argument has been replaced by a new
flags argument that takes a common set of bits with proc_cmdline_get_key().
g) All kernel command line APIs now begin with the same "proc_cmdline_" prefix.
h) There are now tests for much of this. Yay!
It's rather hard to parse the confirmation messages (enabled with
systemd.confirm_spawn=true) amongst the status messages and the kernel
ones (if enabled).
This patch gives the possibility to the user to redirect the confirmation
message to a different virtual console, either by giving its name or its path,
so those messages are separated from the other ones and easier to read.
We don't have plural in the name of any other -util files and this
inconsistency trips me up every time I try to type this file name
from memory. "formats-util" is even hard to pronounce.
This stripping is contolled by a new boolean parameter. When the parameter
is true, it means that the caller does not care about the distinction between
initrd and real root, and wants to act on both rd-dot-prefixed and unprefixed
parameters in the initramfs, and only on the unprefixed parameters in real
root. If the parameter is false, behaviour is the same as before.
Changes by caller:
log.c (systemd.log_*): changed to accept rd-dot-prefix params
pid1: no change, custom logic
cryptsetup-generator: no change, still accepts rd-dot-prefix params
debug-generator: no change, does not accept rd-dot-prefix params
fsck: changed to accept rd-dot-prefix params
fstab-generator: no change, custom logic
gpt-auto-generator: no change, custom logic
hibernate-resume-generator: no change, does not accept rd-dot-prefix params
journald: changed to accept rd-dot-prefix params
modules-load: no change, still accepts rd-dot-prefix params
quote-check: no change, does not accept rd-dot-prefix params
udevd: no change, still accepts rd-dot-prefix params
I added support for "rd." params in the three cases where I think it's
useful: logging, fsck options, journald forwarding options.
Since we ignore the result anyway, downgrade errors to warning.
log_oom() will still emit an error, but that's mostly theoretical, so it
is not worth complicating the code to avoid the small inconsistency
If `--test` command line option was passed, the systemd set skip_setup
to true during bootup. But after this we check again that arg_action is
test or help and opens pager depends on result.
We should skip setup in a case when `--test` is passed, but it is also
safe to set skip_setup in a case of `--help`. So let's remove first
check and move skip_setup = true to the second check.
When we print information about PID 1's crashdump subprocess failing. In this
case we *know* that we do not generate LSB exit codes, as it's basically PID 1
itself that exited there.
Previously we've used free_and_strdup() to fill arg_default_unit with unit
name, If we didn't pass default unit name through a kernel command line or
command line arguments. But we can use just strdup() instead of
free_and_strdup() for this, because we will start fill arg_default_unit
only if it wasn't set before.
systemd fills arg_default_unit during startup with default.target
value. But arg_default_unit may be overwritten in parse_argv() or
parse_proc_cmdline_item().
Let's check value of arg_default_unit after calls of parse_argv()
and parse_proc_cmdline_item() and fill it with default.target if
it wasn't filled before. In this way we will not spend unnecessary
time to for filling arg_default_unit with default.target.
For some certification, it should not be possible to reboot the machine through ctrl-alt-delete. Currently we suggest our customers to mask the ctrl-alt-delete target, but that is obviously not enough.
Patching the keymaps to disable that is really not a way to go for them, because the settings need to be easily checked by some SCAP tools.
The parsing functions for [User]TasksMax were inconsistent. Empty string and
"infinity" were interpreted as no limit for TasksMax but not accepted for
UserTasksMax. Update them so that they're consistent with other knobs.
* Empty string indicates the default value.
* "infinity" indicates no limit.
While at it, replace opencoded (uint64_t) -1 with CGROUP_LIMIT_MAX in TasksMax
handling.
v2: Update empty string to indicate the default value as suggested by Zbigniew
Jędrzejewski-Szmek.
v3: Fixed empty UserTasksMax handling.
This way, invoking nspawn from a shell in the best case inherits the TERM
setting all the way down into the login shell spawned in the container.
Fixes: #3697
IMA wiki says: "If the IMA policy contains LSM labels, then the LSM
policy must be loaded prior to the IMA policy." Right now, in case of
Smack, the IMA policy is loaded before the Smack policy. Move the order
around to allow Smack labels to be used in IMA policy.
the ACTION_DONE was introduced in the 4288f61921 (dbus: automatically
generate and install introspection files ) commit and was used in
systemd --introspect command.
Later 'introspect' command was removed in the ca2871d9b (bus: remove
static introspection file export) commit and have no users anymore.
So we can remove it.
As it turns out 512 is max number of tasks per service is hit by too many
applications, hence let's bump it a bit, and make it relative to the system's
maximum number of PIDs. With this change the new default is 15%. At the
kernel's default pids_max value of 32768 this translates to 4915. At machined's
default TasksMax= setting of 16384 this translates to 2457.
Why 15%? Because it sounds like a round number and is close enough to 4096
which I was going for, i.e. an eight-fold increase over the old 512
Summary:
| on the host | in a container
old default | 512 | 512
new default | 4915 | 2457