Whenever we invoke external, foreign code from code that has
RLIMIT_NOFILE's soft limit bumped to high values, revert it to 1024
first. This is a safety precaution for compatibility with programs using
select() which cannot operate with fds > 1024.
This commit adds the call to rlimit_nofile_safe() to all invocations of
exec{v,ve,l}() and friends that either are in code that we know runs
with RLIMIT_NOFILE bumped up (which is PID 1 and all journal code for
starters) or that is part of shared code that might end up there.
The calls are placed as early as we can in processes invoking a flavour
of execve(), but after the last time we do fd manipulations, so that we
can still take benefit of the high fd limits for that.
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.
I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
Otherwise we keep collecting stuff from env generators, and we really
shouldn't.
This was working properly on reexec but not on reload, as for reexec we
would always start fresh, but for reload would reuse the Manager object
and hence its default environment set.
Fixes: #10671
We now don't enable the CPU controller just for CPU accounting if we are
on 4.15+ and using pure unified hierarchy, as this is provided
externally to the CPU controller. This makes CPUAccounting=yes
essentially free, so enabling it by default when it's cheap seems like a
good idea.
This removes the ability to configure which cgroup controllers to mount
together. Instead, we'll now hardcode that "cpu" and "cpuacct" are
mounted together as well as "net_cls" and "net_prio".
The concept of mounting controllers together has no future as it does
not exist to cgroupsv2. Moreover, the current logic is systematically
broken, as revealed by the discussions in #10507. Also, we surveyed Red
Hat customers and couldn't find a single user of the concept (which
isn't particularly surprising, as it is broken...)
This reduced the (already way too complex) cgroup handling for us, since
we now know whenever we make a change to a cgroup for one controller to
which other controllers it applies.
Pretty much everything uses just the first argument, and this doesn't make this
common pattern more complicated, but makes it simpler to pass multiple options.
sysctl is disabled for /proc mounted from an user namespace thus entries like
/proc/sys/net/unix/max_dgram_qlen do not exist. In this case, skip the error
and do not try to change the default for the AF_UNIX datagram queue length.
This splits the "environment" field of Manager into two:
transient_environment and client_environment. The former is generated
from configuration file, kernel cmdline, environment generators. The
latter is the one the user can control with "systemctl set-environment"
and similar.
Both sets are merged transparently whenever needed. Separating the two
sets has the benefit that we can safely flush out the former while
keeping the latter during daemon reload cycles, so that env var settings
from env generators or configuration files do not accumulate, but
dynamic API changes are kept around.
Note that this change is not entirely transparent to users: if the user
first uses "set-environment" to override a transient variable, and then
uses "unset-environment" to unset it again things will revert to the
original transient variable now, while previously the variable was fully
removed. This change in behaviour should not matter too much though I
figure.
Fixes: #9972
Let's be more careful with what we serialize: let's ensure we never
serialize strings that are longer than LONG_LINE_MAX, so that we know we
can read them back with read_line(…, LONG_LINE_MAX, …) safely.
In order to implement this all serialization functions are move to
serialize.[ch], and internally will do line size checks. We'd rather
skip a serialization line (with a loud warning) than write an overly
long line out. Of course, this is just a second level protection, after
all the data we serialize shouldn't be this long in the first place.
While we are at it also clean up logging: while serializing make sure to
always log about errors immediately. Also, (void)ify all calls we don't
expect errors in (or catch errors as part of the general
fflush_and_check() at the end.
After discussions with kernel folks, a system with memcg really
shouldn't need extra hard limits on file descriptors anymore, as they
are properly accounted for by memcg anyway. Hence, let's bump these
values to their maximums.
This also adds a build time option to turn thiss off, to cover those
users who do not want to use memcg.
Following the discussions with the kernel folks, let's substantially
increase the hard limit (but not the soft limit) of RLIMIT_NOFILE to
256K for all services we start.
Note that PID 1 itself bumps the limit even further, to the max the
kernel allows. We can deal with that after all.
let's clean up error handling and logging in manager_reload() a bit.
Specifically: make sure we log about every error we might encounter at
least and at most once.
When we encounter an error before the "point of no return" then log at
LOG_ERR about it and propagate it. Otherwise, eat it up, but warn about
it and proceed, it's the best we can do.
"ExitCode" is a bit of a misnomer in two ways: it suggests this was
about the "exit code" concept that exit()/waitid() deal with, but really
isn't. Moreover, it's not event just about exiting either, but more
often about reloading/reexecing or rebooting. Let's hence pick a new
name for this that is a bit more correct.
I initially thought about naming this the "state", but that'd be a
misnomer too, as the value really encodes a "goal" more than a current
state. Also we already have the externally visible ManagerState.
No actual changes in behaviour, just the rename.
Bpf programs are charged against memlock ulimit, and the default value
can be too tight on machines with many cgroups and attached bpf programs.
Let's bump it to 64Mb.
Until a core dump handler is installed by systemd-sysctl, the generation of
core dump for services is turned OFF which can make the debugging of the early
boot process harder especially since there's no easy way to restore the core
dump generation.
This patch introduces a new kernel command line option which specifies an
absolute path where the kernel should write the core dump file when an early
process crashes.
This will take effect until systemd-coredump (or any other handlers) takes
over.
This is a bit like the info link in most of GNU's --help texts, but we
don't do info but man pages, and we make them properly clickable on
terminal supporting that, because awesome.
I think it's generally advisable to link up our (brief) --help texts and
our (more comprehensive) man pages a bit, so this should be an easy and
straight-forward way to do it.
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
To make debugging easier, this patches allows one to change the log target and
do reload/reexec without modifying configuration permanently, which makes
debugging easier.
Indeed if one changed the log target at runtime (via the bus or via signals),
the change was lost on the next reload/reexecution.
In order to restore back the default value (set via system.conf, environment
variables or any other means ), the empty string in the "LogTarget" property is
now supported as well as sending SIGTRMIN+26 signal.
To make debugging easier, this patches allows one to change the log level and
do reload/reexec without modifying configuration permanently, which makes
debugging easier.
Indeed if one changed the log max level at runtime (via the bus or via
signals), the change was lost on the next daemon reload/reexecution.
In order to restore the original value back (set via system.conf, environment
variables or any other means), the empty string in the "LogLevel" property is
now supported as well as sending SIGRTMIN+23 signal.
Most our other parsing functions do this, let's do this here too,
internally we accept that anyway. Also, the closely related
load_env_file() and load_env_file_pairs() also do this, so let's be
systematic.
That way we can use it in nspawn.
Also, while we are at it, let's rename the call config_parse_rlimit(),
i.e. insert the "r", to clarify what kind of limit this is about.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
sd_bus_open/sd_bus_open_system/sd_bus_open_user are convenient, but
don't allow the description to be set. After they return, the bus is
is already started, and sd_bus_set_description() fails with -EBUSY.
It would be possible to allow sd_bus_set_description() to update the
description "live", but messages are already emitted from sd_bus_open
functions, so it's better to allow the description to be set in
sd_bus_open/sd_bus_open_system/sd_bus_open_user.
Fixes message like:
Bus n/a: changing state UNSET → OPENING
Even if pager_open() fails, in general, we should continue the operations.
All erroneous cases in pager_open() show log message in the function.
So, it is not necessary to check the returned value.
"noreturn" is reserved and can be used in other header files we include:
[ 16s] In file included from /usr/include/gcrypt.h:30:0,
[ 16s] from ../src/journal/journal-file.h:26,
[ 16s] from ../src/journal/journal-vacuum.c:31:
[ 16s] /usr/include/gpg-error.h:1544:46: error: expected ‘,’ or ‘;’ before ‘)’ token
[ 16s] void gpgrt_log_bug (const char *fmt, ...) GPGRT_ATTR_NR_PRINTF(1,2);
Here we include grcrypt.h (which in turns include gpg-error.h) *after* we
"noreturn" was defined in macro.h.
config_parse_join_controllers would free the destination argument on failure,
which is contrary to our normal style, where failed parsing has no effect.
Moving it to shared also allows a test to be added.
After discussions with @htejun it appears it's OK now to enable memory
accounting by default for all units without affecting system performance
too badly. facebook has made good experiences with deploying memory
accounting across their infrastructure.
This hence turns MemoryAccounting= from opt-in to opt-out, similar to
how TasksAccounting= is already handled. The other accounting options
remain off, their performance impact is too big still.
log_open_console() did not switch from stderr to /dev/console, when
"always_reopen_console" was set. It was necessary to call
log_close_console() first.
By contrast, log_open() did switch between e.g. journald and kmsg according
to the value of "prohibit_ipc".
Let's fix log_open() to respect the values of all the log options, and we
can make log_close_*() private.
Also log_close_console() is changed. There was some precaution, avoiding
closing the console fd if we are not PID 1. I think commit 48a601fe made
a little mistake in leaving this in, and it only served to confuse
readers :).
Also I changed systemd-shutdown. Now we have log_set_prohibit_ipc(), let's
use it to clarify that systemd-shutdown is not expected to try and log via
journald (which it is about to kill). We avoided ever asking it to, but
it's more convenient for the reader if they don't have to think about that.
In that sense, it's similar to using assert() to validate a function's
arguments.
This doesn't matter much, and we don't rely on it, but I think it's much
nicer if we log_set_target() and log_set_upgrade_syslog_to_journal() can
be called in either order and have the same effect.
This happens to be almost the same moment as when we send READY=1 in the user
instance, but the logic is slightly different, since we log taint when
basic.target is reached in the system manager, but we send the notification
only in the user manager. So add a separate flag for this and propagate it
across reloads.
Fixes#7683.
By default systemd-shutdown will wait for 90s after SIGTERM was sent
for all processes to exit. This is way too long and effectively defeats
an emergency watchdog reboot via "reboot-force" actions. Instead now
use DefaultTimeoutStopSec which is configurable.
First, let's rename it to disable_coredumps(), as in the rest of our
codebase we spell it "coredump" rather than "core_dump", so let's stick
to that.
However, also log about failures to turn off core dumpling on LOG_DEBUG,
because debug logging is always a good idea.
We don't use the return value, and we don't have to, as the call already
initializes &ret, which is the one we return as exit code from the
process.
CID#1384230
This code is executed before we parse command line/configuration
parameters, hence let's not use arg_system to figure our how to clean up
things, but instead PID == 1. Let's move that check inside of the
function, to make things a bit more robust abstract from the outside.
Also, let's add a log message about this, that was so far missing.
Let's place them in initialize_runtime(), where they appear to fit best.
Effectively this is just a move a little bit down, swapping places with
log_execution_mode(), which should require neither call to be done
first.
Note that changes the conditionalization a bit for these calls, from
(PID == 1) to (arg_system && arg_action == ACTION_RUN). At this point this is pretty much the same
however, as we don't allow PID 1 without ACTION_RUN and without
arg_system set, safety_checks() ensures that.
It's part of finalizing our runtime parameters, hence let's move this
into load_configuration() after we loaded everything else. This is safe,
since we don't use it between the location where it was and where we
place it now yet.
This just sets up some variables the loaded configuration will then
modify. Let's invoke it hence right before loading the configuration.
This moves the initialization just a tiny bit later, but that shouldn't
matter, since we never access it in-between.
Let's make sure that if we are PID 1 we are invoked in ACTION_RUN mode,
and in arg_system mode, as well as the opposite.
Everything else is untested and probably not worth supporting hence
let's bail out early if people try anyway.
Let's shorten main() a bit, and split out everything that loads our
configuration and runtime parameters into a function of its own.
No changes in behaviour.
Dec 14 14:10:54 krowka systemd[1]: System is tainted: overflowgid-not-65534
-- Subject: The system is configured in a way that might cause problems
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- The following "tags" are possible:
-- - "split-usr" — /usr is a separate file system and was not mounted when systemd
-- was booted
-- - "cgroups-missing" — the kernel was compiled without cgroup support or access
-- to expected interface files is resticted
-- - "var-run-bad" — /var/run is not a symlink to /run
-- - "overflowuid-not-65534" — the kernel user ID used for "unknown" users (with
-- NFS or user namespaces) is not 65534
-- - "overflowgid-not-65534" — the kernel group ID used for "unknown" users (with
-- NFS or user namespaces) is not 65534
-- Current system is tagged as overflowgid-not-65534.
Our CODING_STYLE suggests not comparing with NULL, but relying on C's
downgrade-to-bool feature for that. Fix up some code to match these
guidelines. (This is not comprehensive, the coccinelle output for this
is unfortunately kinda borked)
This option allows a device path to be specified for the systemd
watchdog (both runtime and shutdown).
If a system requires a watchdog other than /dev/watchdog (pointing to
/dev/watchdog0) to be used to reboot the system, this setting should be
changed to the relevant watchdog device path (e.g. /dev/watchdog1).
The tainting logic existed for a long time, but was hidden inside the
bus interfaces. Let's give it a small bit more coverage, by logging its
value early at boot during initialization.
This moves the invocation a bit later, but that shoudln't matter. By
moving it we gain two things: first of all, its closer to other code
where it belongs, secondly its naturally conditioned properly, as we no
longer will rewrite the container ID file on every reexecution again,
and not in test mode either.
No real functional changes, just some rearranging to shorten the overly
long main() function a bit.
This gets rid of the arm_reboot_watchdog variable, as it can be directly
derived from shutdown_verb, and we need it only one time. By dropping it
we can reduce the number of arguments we need to pass around.
read_one_line_file() always returns <= 0, so the code was OK, but let's write
the check a bit differently to make it obvious that min_max is always set.
This makes things quite a bit more systematic I think, as we can
systematically operate on all timestamps, for example for the purpose of
serialization/deserialization.
This rework doesn't necessarily make things shorter in the individual
lines, but it does reduce the line count a bit.
(This is useful particularly when we want to add additional timestamps,
for example to solve #7023)
Presets are useful to initialize uninitialized /etc, but that doesn't
apply to the initrd.
Also, let's rename etc_empty → first_boot. After all, the variable
doesn't actually reflect whether /etc is really empty, it just reflects
whether /etc/machine-id existed originally or not. Moreover, we later on
directly initialize manager_set_first_boot() from it, hence let's just
name it the same way all through the codepath, to make this all less
confusing.
See: #7100
This function is really not a method of the Manager object (implemented
in manager.c), but just a helper in main.c. Hence let's not confusingly
name it the way methods are called.
These three settings only make sense within the context of actual unit
files, hence filter this out when applied to the per-manager default,
and generate a log message about it.
We neglected to set error_message for errors which occur _after_ the
`finish` label. These fatal errors only happen in paths where `finish`
was reached successfully, i.e. error_message has not already been set
(and this analysis is simple enough that this need not cause too much
headaches. Also our new assignments to error_message come immediately
after execve() calls, which would have lost the error_message if it had
been set).
Also print a status message when we fail to exec init, otherwise the only
sign the user will see is `# ` :).
This addresses the lack of error messages pointed out in issue #6827.
The advantage is that is the name is mispellt, cpp will warn us.
$ git grep -Ee "conf.set\('(HAVE|ENABLE)_" -l|xargs sed -r -i "s/conf.set\('(HAVE|ENABLE)_/conf.set10('\1_/"
$ git grep -Ee '#ifn?def (HAVE|ENABLE)' -l|xargs sed -r -i 's/#ifdef (HAVE|ENABLE)/#if \1/; s/#ifndef (HAVE|ENABLE)/#if ! \1/;'
$ git grep -Ee 'if.*defined\(HAVE' -l|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_]*)\)/\1/g'
$ git grep -Ee 'if.*defined\(ENABLE' -l|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_]*)\)/\1/g'
+ manual changes to meson.build
squash! build-sys: use #if Y instead of #ifdef Y everywhere
v2:
- fix incorrect setting of HAVE_LIBIDN2
On current kernels BPF_MAP_TYPE_LPM_TRIE bpf maps are charged against
RLIMIT_MEMLOCK even for privileged users that have CAP_IPC_LOCK. Given
that mlock() generally ignores RLIMIT_MEMLOCK if CAP_IPC_LOCK is set
this appears to be an oversight in the kernel. Either way, until that's
fixed, let's just bump RLIMIT_MEMLOCK for the root user considerably, as
the default is quite limiting, and doesn't permit us to create more than
a few TRIE maps.
Now generators are only run in systemd --test mode, where this makes
most sense (how are you going to test what would happen otherwise?).
Fixes#6842.
v2:
- rename test_run to test_run_flags
Most importantly, don't collect open socket activation fds when in
--test mode. This specifically created a problem because we invoke
pager_open() beforehand (which these days makes copies of the original
stdout/stderr in order to be able to restore them when the pager goes
away) and we might mistakenly the fd copies it creates as socket
activation fds.
Fixes: #6383
This commit moves the first-boot system preset-settings evaluation out
of main and into the manager startup logic itself. Notably, it reverses
the order between generators and presets evaluation, so that any changes
performed by first-boot generators are taken into the account by presets
logic.
After this change, units created by a generator can be enabled as part
of a preset.
Some kdbus_flag and memfd related parts are left behind, because they
are entangled with the "legacy" dbus support.
test-bus-benchmark is switched to "manual". It was already broken before
(in the non-kdbus mode) but apparently nobody noticed. Hopefully it can
be fixed later.
This moves pretty much all uses of getpid() over to getpid_raw(). I
didn't specifically check whether the optimization is worth it for each
replacement, but in order to keep things simple and systematic I
switched over everything at once.
The CapabilityBoundingSet option only makes sense if we are running as
PID1.
The system.conf.d(5) manpage, already states that the CapabilityBoundingSet
option:
Controls which capabilities to include in the capability bounding set
for PID 1 and its children.
https://github.com/systemd/systemd/issues/6080
All those uses were correct, but I think it's better to be explicit.
Using implicit errno is too error prone, and with this change we can require
(in the sense of a style guideline) that the code is always specified.
Helpful query: git grep -n -P 'log_[^s][a-z]+\(.*%m'
This has systemd look at /proc/sys/fs/nr_open to find the current maximum of
open files compiled into the kernel and tries to set the RLIMIT_NOFILE max to
it. This has the advantage the value chosen as limit is less arbitrary and also
improves the behavior of systemd in containers that have an rlimit set: When
systemd currently starts in a container that has RLIMIT_NOFILE set to e.g.
100000 systemd will lower it to 65536. With this patch systemd will try to set
the nofile limit to the allowed kernel maximum. If this fails, it will compute
the minimum of the current set value (the limit that is set on the container)
and the maximum value as soft limit and the currently set maximum value as the
maximum value. This way it retains the limit set on the container.