Commit graph

571 commits

Author SHA1 Message Date
Michal Sekletar b070c7c0e1 core: introduce NUMAPolicy and NUMAMask options
Make possible to set NUMA allocation policy for manager. Manager's
policy is by default inherited to all forked off processes. However, it
is possible to override the policy on per-service basis. Currently we
support, these policies: default, prefer, bind, interleave, local.
See man 2 set_mempolicy for details on each policy.

Overall NUMA policy actually consists of two parts. Policy itself and
bitmask representing NUMA nodes where is policy effective. Node mask can
be specified using related option, NUMAMask. Default mask can be
overwritten on per-service level.
2019-06-24 16:58:54 +02:00
Lennart Poettering cd69e88ba3 doc: make clear that --system and --user only make sense with --test
Fixes: #12843
2019-06-24 14:51:52 +02:00
Yu Watanabe 657ee2d82b tree-wide: replace strjoin() with path_join() 2019-06-21 03:26:16 +09:00
Lennart Poettering 6e2f789484 core: set fs.file-max sysctl to LONG_MAX rather than ULONG_MAX
Since kernel 5.2 the kernel thankfully returns proper errors when we
write a value out of range to the sysctl. Which however breaks writing
ULONG_MAX to request the maximum value. Hence let's write the new
maximum value instead, LONG_MAX.

/cc @brauner

Fixes: #12803
2019-06-17 15:48:11 +02:00
Michal Sekletar 3f09629c22
Merge pull request #12628 from keszybz/dbus-execute
Rework cpu affinity parsing
2019-05-30 12:32:53 +02:00
Zbigniew Jędrzejewski-Szmek fb39af4ce4 pid1: when reloading configuration, forget old settings
If we had a configuration setting from a configuration file, and it was
removed, we'd still remember the old value, because there's was no mechanism to
"reset" everything, just to assign new values.

Note that the effect of this is limited. For settings that have an "ongoing" effect,
like systemd.confirm_spawn, the new value is simply used. But some settings can only
be set at start.

In particular, CPUAffinity= will be updated if set to a new value, but if
CPUAffinity= is fully removed, it will not be reset, simply because we don't
know what to reset it to. We might have inherited a setting, or we might have
set it ourselves. In principle we could remember the "original" value that was
set when we were executed, but propagate this over reloads and reexecs, but
that would be a lot of work for little gain. So this corner case of removal of
CPUAffinity= is not handled fully, and a reboot is needed to execute the
change. As a work-around, a full mask of CPUAffinity=0-8191 can be specified.
2019-05-29 10:29:28 +02:00
Zbigniew Jędrzejewski-Szmek 470a5e6dce pid1: don't reset setting from /proc/cmdline upon restart
We have settings which may be set on the kernel command line, and also
in /proc/cmdline (for pid1). The settings in /proc/cmdline have higher priority
of course. When a reload was done, we'd reload just the configuration file,
losing the overrides.

So read /proc/cmdline again during reload.

Also, when initially reading the configuration file when program starts,
don't treat any errors as fatal. The configuration done in there doesn't
seem important enough to refuse boot.
2019-05-29 10:29:28 +02:00
Zbigniew Jędrzejewski-Szmek 61fbbac1d5 pid1: parse CPUAffinity= in incremental fashion
This makes the handling of this option match what we do in unit files. I think
consistency is important here. (As it happens, it is the only option in
system.conf that is "non-atomic", i.e. where there's a list of things which can
be split over multiple assignments. All other options are single-valued, so
there's no issue of how to handle multiple assignments.)
2019-05-29 10:29:28 +02:00
Zbigniew Jędrzejewski-Szmek 0985c7c4e2 Rework cpu affinity parsing
The CPU_SET_S api is pretty bad. In particular, it has a parameter for the size
of the array, but operations which take two (CPU_EQUAL_S) or even three arrays
(CPU_{AND,OR,XOR}_S) still take just one size. This means that all arrays must
be of the same size, or buffer overruns will occur. This is exactly what our
code would do, if it received an array of unexpected size over the network.
("Unexpected" here means anything different from what cpu_set_malloc() detects
as the "right" size.)

Let's rework this, and store the size in bytes of the allocated storage area.

The code will now parse any number up to 8191, independently of what the current
kernel supports. This matches the kernel maximum setting for any architecture,
to make things more portable.

Fixes #12605.
2019-05-29 10:20:42 +02:00
Zbigniew Jędrzejewski-Szmek 9d48671c62 core: unset HOME=/ that the kernel gives us
Partially fixes #12389.

%h would return "/" in a machine, but "/root" in a container. Let's fix
this by resetting $HOME to the expected value.
2019-05-22 16:28:02 +02:00
Ben Boeckel 5238e95759 codespell: fix spelling errors 2019-04-29 16:47:18 +02:00
Jan Klötzke dc653bf487 service: handle abort stops with dedicated timeout
When shooting down a service with SIGABRT the user might want to have a
much longer stop timeout than on regular stops/shutdowns. Especially in
the face of short stop timeouts the time might not be sufficient to
write huge core dumps before the service is killed.

This commit adds a dedicated (Default)TimeoutAbortSec= timer that is
used when stopping a service via SIGABRT. In all other cases the
existing TimeoutStopSec= is used. The timer value is unset by default
to skip the special handling and use TimeoutStopSec= for state
'stop-watchdog' to keep the old behaviour.

If the service is in state 'stop-watchdog' and the service should be
stopped explicitly we still go to 'stop-sigterm' and re-apply the usual
TimeoutStopSec= timeout.
2019-04-12 17:32:52 +02:00
Lennart Poettering afcfaa695c core: implement OOMPolicy= and watch cgroups for OOM killings
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).

On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".

All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
2019-04-09 11:17:58 +02:00
Zbigniew Jędrzejewski-Szmek 237ebf61e2
Merge pull request #12013 from yuwata/fix-switchroot-11997
core: on switching root do not emit device state change based on enumeration results
2019-04-02 16:06:07 +02:00
Lennart Poettering 50cbaba4fe core: add new API for enqueing a job with returning the transaction data 2019-03-27 12:37:37 +01:00
Lennart Poettering 36fea15565 util: introduce save_argc_argv() helper 2019-03-21 18:08:56 +01:00
Yu Watanabe 49052946c9 core: use TAKE_PTR() at few more places 2019-03-15 19:01:12 +09:00
Lennart Poettering c3b6a348c0 main: use _exit() rather than exit() in code potentially caled from signal handler 2019-03-14 13:25:52 +01:00
Lennart Poettering eefc66aa8f util: split out some stuff into a new file limits-util.[ch] 2019-03-13 12:16:43 +01:00
Zbigniew Jędrzejewski-Szmek 796ac4c12c core: update comment
Initially, the check was that /usr is not a separate fs, and was later relaxed
to allow /usr to be mounted in the initramfs. Documentation was updated in 9e93f6f092,
but this comment wasn't. Let's update it too.
2019-02-18 10:29:33 +01:00
Lennart Poettering 99a2fd3bca main: when generating the resource limit to pass to children, take FD_SETSIZE into consideration
When we synthesize a "struct rlimit" structure to pass on for
RLIMIT_NOFILE to our children, let's explicitly make sure that the soft
limit is not above FD_SETSIZE, for compat reason with select().

Note this only applies when we derive the "struct rlimit" from what we
inherited. If the user configures something explicitly it always takes
precedence.
2019-01-18 17:31:36 +01:00
Lennart Poettering cda7faa9a5 main: don't bump resource limits if they are higher than we need them anyway
This matters in particular in the case of --user, since there we lack
the privs to bump the limits up again later on when invoking children.
2019-01-18 17:31:36 +01:00
Lennart Poettering ddfa8b0b3b main: add commenting, clean up handling of saved resource limits
This doesn't really change behaviour, but adds comments and uses more
symbolic names for everything, to make this more readable.
2019-01-18 17:31:36 +01:00
Lennart Poettering c0d7695908 main: when bumping RLIMIT_MEMLOCK, save the previous value to pass to children
Let's make sure that the bumping of RLIMIT_MEMLOCK does not leak into
our children.
2019-01-18 17:31:36 +01:00
Frantisek Sumsal 4a2c3dc318
Merge pull request #11252 from evverx/use-asan-wrapper-on-travis-ci
travis: run PID1, journald and everything else under ASan+UBsan
2019-01-06 18:48:38 +01:00
Evgeny Vereshchagin 7e11a95e41 tests: reproduce https://github.com/systemd/systemd/issues/11251 2018-12-29 19:14:28 +01:00
Zbigniew Jędrzejewski-Szmek 681bd2c524 meson: generate version tag from git
$ build/systemctl --version
systemd 239-3555-g6178cbb5b5
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN +PCRE2 default-hierarchy=hybrid
$ git tag v240 -m 'v240'
$ ninja -C build
ninja: Entering directory `build'
[76/76] Linking target fuzz-unit-file.
$ build/systemctl --version
systemd 240
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN +PCRE2 default-hierarchy=hybrid

This is very useful during development, because a precise version string is
embedded in the build product and displayed during boot, so we don't have to
guess answers for questions like "did I just boot the latest version or the one
from before?".

This change creates an overhead for "noop" builds. On my laptop, 'ninja -C
build' that does nothing goes from 0.1 to 0.5 s. It would be nice to avoid
this, but I think that <1 s is still acceptable.

Fixes #7183.

PACKAGE_VERSION is renamed to GIT_VERSION, to make it obvious that this is the
more dynamically changing version string.

Why save to a file? It would be easy to generate the version tag using
run_command(), but we want to go through a file so that stuff gets rebuilt when
this file changes. If we just defined an variable in meson, ninja wouldn't know
it needs to rebuild things.
2018-12-21 13:43:20 +01:00
Zbigniew Jędrzejewski-Szmek d2aaf13099 Remove use of PACKAGE_STRING
PACKAGE_VERSION is more explicit, and also, we don't pretend that changing the
project name in meson.build has any real effect. "systemd" is embedded in a
thousand different places, so let's just use the hardcoded string consistently.
This is mostly in preparation for future changes.
2018-12-19 09:29:32 +01:00
Lennart Poettering 595225af7a tree-wide: invoke rlimit_nofile_safe() before various exec{v,ve,l}() invocations
Whenever we invoke external, foreign code from code that has
RLIMIT_NOFILE's soft limit bumped to high values, revert it to 1024
first. This is a safety precaution for compatibility with programs using
select() which cannot operate with fds > 1024.

This commit adds the call to rlimit_nofile_safe() to all invocations of
exec{v,ve,l}() and friends that either are in code that we know runs
with RLIMIT_NOFILE bumped up (which is PID 1 and all journal code for
starters) or that is part of shared code that might end up there.

The calls are placed as early as we can in processes invoking a flavour
of execve(), but after the last time we do fd manipulations, so that we
can still take benefit of the high fd limits for that.
2018-12-01 12:50:45 +01:00
Lennart Poettering a885727a64 show-status: fold two bool flags function arguments into a flags
parameter
2018-11-26 18:24:12 +01:00
Zbigniew Jędrzejewski-Szmek baaa35ad70 coccinelle: make use of SYNTHETIC_ERRNO
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.

I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
2018-11-22 10:54:38 +01:00
Lennart Poettering 818623aca5
Merge pull request #10860 from keszybz/more-cleanup-2
Do more stuff from main macros
2018-11-21 11:07:31 +01:00
Zbigniew Jędrzejewski-Szmek 294bf0c34a Split out pretty-print.c and move pager.c and main-func.h to shared/
This is high-level functionality, and fits better in shared/ (which is for
our executables), than in basic/ (which is also for libraries).
2018-11-20 18:40:02 +01:00
Lennart Poettering bb25977244 main: don't freeze PID 1 in containers, exit with non-zero instead
After all we have a nice way to propagate total failures, hence let's
use it.
2018-11-20 17:04:07 +01:00
Lennart Poettering bb85a58208 main: use EXIT_EXCEPTION instead of EXIT_FAILURE at two more exceptional places 2018-11-20 17:04:07 +01:00
Lennart Poettering 79a224c460 main: when reloading PID 1 let's reset the default environment
Otherwise we keep collecting stuff from env generators, and we really
shouldn't.

This was working properly on reexec but not on reload, as for reexec we
would always start fresh, but for reload would reuse the Manager object
and hence its default environment set.

Fixes: #10671
2018-11-19 13:01:19 +01:00
Chris Down a88c5b8ac4 cgroup v2: DefaultCPUAccounting=yes if CPU controller isn't required
We now don't enable the CPU controller just for CPU accounting if we are
on 4.15+ and using pure unified hierarchy, as this is provided
externally to the CPU controller. This makes CPUAccounting=yes
essentially free, so enabling it by default when it's cheap seems like a
good idea.
2018-11-18 12:21:41 +00:00
Lennart Poettering 143fadf369 core: remove JoinControllers= configuration setting
This removes the ability to configure which cgroup controllers to mount
together. Instead, we'll now hardcode that "cpu" and "cpuacct" are
mounted together as well as "net_cls" and "net_prio".

The concept of mounting controllers together has no future as it does
not exist to cgroupsv2. Moreover, the current logic is systematically
broken, as revealed by the discussions in #10507. Also, we surveyed Red
Hat customers and couldn't find a single user of the concept (which
isn't particularly surprising, as it is broken...)

This reduced the (already way too complex) cgroup handling for us, since
we now know whenever we make a change to a cgroup for one controller to
which other controllers it applies.
2018-11-16 14:54:13 +01:00
Zbigniew Jędrzejewski-Szmek 0221d68a13 basic/pager: convert the pager options to a flags argument
Pretty much everything uses just the first argument, and this doesn't make this
common pattern more complicated, but makes it simpler to pass multiple options.
2018-11-14 16:25:11 +01:00
Zbigniew Jędrzejewski-Szmek e44c5a3ba6
Merge pull request #10594 from poettering/env-reload-fix
change handling of environment block of PID1's manager object
2018-11-07 12:49:13 +01:00
Lennart Poettering ed63705975
Merge pull request #10650 from yuwata/udevadm-trigger-use-write-string-file
udevadm: use write_string_file() helper function
2018-11-06 16:46:25 +03:00
Giuseppe Scrivano 875622c39e core, sysctl: skip ENOENT for /proc/sys/net/unix/max_dgram_qlen
sysctl is disabled for /proc mounted from an user namespace thus entries like
/proc/sys/net/unix/max_dgram_qlen do not exist.  In this case, skip the error
and do not try to change the default for the AF_UNIX datagram queue length.
2018-11-06 16:41:34 +03:00
Yu Watanabe 57512c893e tree-wide: set WRITE_STRING_FILE_DISABLE_BUFFER flag when we write files under /proc or /sys 2018-11-06 21:24:03 +09:00
Lennart Poettering 1ad6e8b302 core: split environment block mantained by PID 1's Manager object in two
This splits the "environment" field of Manager into two:
transient_environment and client_environment. The former is generated
from configuration file, kernel cmdline, environment generators. The
latter is the one the user can control with "systemctl set-environment"
and similar.

Both sets are merged transparently whenever needed. Separating the two
sets has the benefit that we can safely flush out the former while
keeping the latter during daemon reload cycles, so that env var settings
from env generators or configuration files do not accumulate, but
dynamic API changes are kept around.

Note that this change is not entirely transparent to users: if the user
first uses "set-environment" to override a transient variable, and then
uses "unset-environment" to unset it again things will revert to the
original transient variable now, while previously the variable was fully
removed. This change in behaviour should not matter too much though I
figure.

Fixes: #9972
2018-10-31 18:00:53 +01:00
Lennart Poettering d68c645bd3 core: rework serialization
Let's be more careful with what we serialize: let's ensure we never
serialize strings that are longer than LONG_LINE_MAX, so that we know we
can read them back with read_line(…, LONG_LINE_MAX, …) safely.

In order to implement this all serialization functions are move to
serialize.[ch], and internally will do line size checks. We'd rather
skip a serialization line (with a loud warning) than write an overly
long line out. Of course, this is just a second level protection, after
all the data we serialize shouldn't be this long in the first place.

While we are at it also clean up logging: while serializing make sure to
always log about errors immediately. Also, (void)ify all calls we don't
expect errors in (or catch errors as part of the general
fflush_and_check() at the end.
2018-10-26 10:52:41 +02:00
Yu Watanabe 5e1ee764e1 core: include error cause in log message 2018-10-20 01:40:42 +09:00
Lennart Poettering c8884aceef main: introduce a define HIGH_RLIMIT_MEMLOCK similar to HIGH_RLIMIT_NOFILE 2018-10-17 14:40:44 +02:00
Lennart Poettering a8b627aaed main: bump fs.nr_open + fs.max-file to their largest possible values
After discussions with kernel folks, a system with memcg really
shouldn't need extra hard limits on file descriptors anymore, as they
are properly accounted for by memcg anyway. Hence, let's bump these
values to their maximums.

This also adds a build time option to turn thiss off, to cover those
users who do not want to use memcg.
2018-10-17 14:40:39 +02:00
Lennart Poettering a17c17122c core: bump RLIMIT_NOFILE soft+hard limit for systemd itself in all cases
Previously we'd do this for PID 1 only. Let's do this when running in
user mode too, because we know we can handle it.
2018-10-16 16:33:55 +02:00
Lennart Poettering 52d6207578 core: raise the RLIMIT_NOFILE hard limit for all services by default
Following the discussions with the kernel folks, let's substantially
increase the hard limit (but not the soft limit) of RLIMIT_NOFILE to
256K for all services we start.

Note that PID 1 itself bumps the limit even further, to the max the
kernel allows. We can deal with that after all.
2018-10-16 16:33:55 +02:00