Let services use a private UTS namespace. In addition, a seccomp filter is
installed on set{host,domain}name and a ro bind mounts on
/proc/sys/kernel/{host,domain}name.
This adds a new bitfield to `execute_directories()` which allows to
configure whether to ignore non-zero exit statuses of binaries run and
whether to allow parallel execution of commands.
In case errors are not ignored, the exit status of the failed script
will now be returned for error reposrting purposes or other further
future use.
Fixes#10256.
What works:
systemd-analyze cat-config systemd/system-preset
systemd-analyze cat-config systemd/user-preset
systemd-analyze cat-config tmpfiles.d
systemd-analyze cat-config sysusers.d
systemd-analyze cat-config systemd/sleep.conf
systemd-analyze cat-config systemd/user.conf
systemd-analyze cat-config systemd/system.conf
systemd-analyze cat-config udev/udev.conf
(and other .conf files)
systemd-analyze cat-config udev/rules.d
systemd-analyze cat-config environment.d
systemd-analyze cat-config environment
Directories may be specified with the trailing dash or not.
The caveat is that for user configuration, systemd and other tools also look
at ~/.config/. It would be nice to support this, but this patch doesn't.
"cat-config --user" is rejected, and we may allow it in the future and then
extend the search path with directories under ~/.config.
What doesn't work (and probably shouldn't because those files cannot be
meaningfully concatenated):
systemd-analyze cat-config systemd/system (.service, .slice, .socket, ...)
systemd-analyze cat-config systemd/user
systemd-analyze cat-config systemd/network (.network, .link, and .dnssd)
The hardcoding of information about paths in this manner is a bit ugly, but
OTOH, it is not too onerous, and at least we have one place where all the
schemes are "documented" through code. It'll make us think twice before adding
yet another slightly different scheme.
This new setting allows configuration of CFS period on the CPU cgroup, instead
of using a hardcoded default of 100ms.
Tested:
- Legacy cgroup + Unified cgroup
- systemctl set-property
- systemctl show
- Confirmed that the cgroup settings (such as cpu.cfs_period_ns) were set
appropriately, including updating the CPU quota (cpu.cfs_quota_ns) when
CPUQuotaPeriodSec= is updated.
- Checked that clamping works properly when either period or (quota * period)
are below the resolution of 1ms, or if period is above the max of 1s.
This works like parse_sec() but defaults to USEC_INFINITY when passed an
empty string or only whitespace.
Also introduce config_parse_sec_def_infinity, which can be used to parse
config options using this function.
This is useful for time options that use "infinity" for default and that
can be reset by unsetting them.
Introduce a test case to ensure it works as expected.
Most of the accesses *were* aligned. The only one that definetely wasn't was to
drive_path->part_start and drive_path->part_size, because those both expect
8 byte alignment, and are at offsets 4 and 12 in the packed structure.
Because of the way that device_path structure is defined and used, we expect
that device_path.length is always two-byte aligned.
This adds asserts in various places to ensure the proper alignment, and uses
memcpy in other places where the alignment might be off.
When looking for the terminating double-NUL, don't just read the memory
until the terminator is found, but use the information we got about the
buffer size.
The length parameter passed to utf16_to_utf8() would include the terminator, so
the converted string would end up with two terminators (the original one
converted to "utf8", still 0, and then the one that was always added anyway).
Instead let's pass just the length of the actual data to utf16_to_utf8().
Too much code has been removed while replacing startswith with STARTSWITH_SET
so that every ACL specified e.g. in tmpfiles.d was parsed as a default ACL.
gcc-9 complains that the string may be truncated when written into the output
structure. This shouldn't happen, but if it did, in principle we could remove a
different structure (with a matching name prefix). Let's just refuse the
operation if the name doesn't fit.
This doesn't really change much, but feels more correct to do, as it
ensures that all messages currently queued in the bus connections are
definitely unreffed and thus destryoing of the connection object will
follow immediately.
Strictly speaking this change is entirely unnecessary, since nothing
else could have acquired a ref to the connection and queued a message
in, however, now that we have the new sd_bus_close_unref() helper it
makes a lot of sense to use it here, to ensure that whatever happens
nothing that might have been queued fucks with us.
Previously, this system call was included in @system-service since it is
a "getter" only, i.e. only queries information, and doesn't change
anything, and hence was considered not risky.
However, as it turns out, mincore() is actually security sensitive, see
the discussion here:
https://lwn.net/Articles/776034/
Hence, let's adjust the system call filter and drop mincore() from it.
This constitues a compatibility break to some level, however I presume
we can get away with this as the systemcall is pretty exotic. The fact
that it is pretty exotic is also reflected by the fact that the kernel
intends to majorly change behaviour of the system call soon (see the
linked LWN article)
Existing use of E2BIG is replaced with ENOBUFS (entry too long), and E2BIG is
reused for the new error condition (too many fields).
This matches the change done for systemd-journald, hence forming the second
part of the fix for CVE-2018-16865
(https://bugzilla.redhat.com/show_bug.cgi?id=1653861).
We allocate a iovec entry for each field, so with many short entries,
our memory usage and processing time can be large, even with a relatively
small message size. Let's refuse overly long entries.
CVE-2018-16865
https://bugzilla.redhat.com/show_bug.cgi?id=1653861
What from I can see, the problem is not from an alloca, despite what the CVE
description says, but from the attack multiplication that comes from creating
many very small iovecs: (void* + size_t) for each three bytes of input message.
https://hamberg.no/erlend/posts/2013-02-18-static-array-indices.html
This only works with clang, unfortunately gcc doesn't seem to implement the check
(tested with gcc-8.2.1-5.fc29.x86_64).
Simulated error:
[2/3] Compiling C object 'systemd-nspawn@exe/src_nspawn_nspawn.c.o'.
../src/nspawn/nspawn.c:3179:45: warning: array argument is too small; contains 15 elements, callee requires at least 16 [-Warray-bounds]
candidate = (uid_t) siphash24(arg_machine, strlen(arg_machine), hash_key);
^ ~~~~~~~~
../src/basic/siphash24.h:24:64: note: callee declares array parameter as static here
uint64_t siphash24(const void *in, size_t inlen, const uint8_t k[static 16]);
^~~~~~~~~~~~
Instead of enabling it unconditionally and then using ConditionPathExists=/etc/fstab,
and possibly masking this condition if it should be enabled for auto gpt stuff,
just pull it in explicitly when required.
This reverts commit edda44605f.
The kernel explicitly supports resuming with a different kernel than the one
used before hibernation. If this is something that shouldn't be supported, the
place to change this is in the kernel. We shouldn't censor something that this
exclusively in the kernel's domain.
People might be using this to switch kernels without restaring programs, and
we'd break this functionality for them.
Also, even if resuming with a different kernel was a bad idea, we don't really
prevent that with this check, since most users have more than one kernel and
can freely pick a different one from the menu. So this only affected the corner
case where the kernel has been removed, but there is no reason to single it
out.
Generators run in a context where waiting for udev is not an option,
simply because it's not running there yet. Hence, let's not wait for it
in this case.
This is generally OK to do as we are operating on the root disk only
here, which should have been probed already by the time we come this
far.
An alternative fix might be to remove the udev dependency from image
dissection again in the long run (and thus replace reliance on
/dev/block/x:y somehow with something else).
Fixes: #11205
Apparently this originated in PHP, so the json output could be directly
embedded in HTML script tags.
See https://stackoverflow.com/questions/1580647/json-why-are-forward-slashes-escaped.
Since the output of our tools is not intended directly for web page generation,
let's not do this unescaping. If needed, the consumer can always do escaping as
appropriate for the target format.
Fixes#10526.
Even if we waited for the root device to appear, the mount could still fail if
we didn't wait for udev to initalize the device. In particular, the
/dev/block/n:m path used to mount the device is created by udev, and nspawn
would sometimes win the race and the mount would fail with -ENOENT.
The same wait is done for partitions, since if we try to mount them, the same
considerations apply.
Note: I first implemented a version which just does a loop (with a short wait).
In that approach, udev takes on average ~800 µs to initialize the loopback
device. The approach where we set up a monitor and avoid the loop is a bit
nicer. There doesn't seem to be a significant difference in speed.
With 1000 invocations of 'systemd-nspawn -i image.squashfs echo':
loop (previous approach):
real 4m52.625s
user 0m37.094s
sys 2m14.705s
monitor (this patch):
real 4m50.791s
user 0m36.619s
sys 2m14.039s
dissect-image would wait for the root device and paritions to appear. But if we
had an image with no partitions, we'd not wait at all. If the kernel or udev
were slow in creating device nodes or symlinks, subsequent mount attempt might
fail if nspawn won the race.
Calling wait_for_partitions_to_appear() in case of no partitions means that we
verify that the kernel agrees that there are no partitions. We verify that the
kernel sees the same number of partitions as blkid, so let's that also in this
case.
This makes the failure in #10526 much less likely, but doesn't eliminate it
completely. Stay tuned.
The function interface is the same, except that the output pointer may be NULL.
The implementation is slightly simplified by taking advantage of changes in
ancestor commit 'sd-device: attempt to read db again if it wasn't found', by
not creating a new sd_device object before re-checking the is_initialized
status.
v2:
- In v1, the old object was always used and the device received back from the
sd_device_monitor_start callback was ignored. I *think* the result will be
equivalent in both cases, because by the time we the callback gets called,
the db entry in the filesystem will also exist, and any subsequent access to
properties of the object would trigger a read of the database from disk. But
I'm not certain, and anyway, using the device object received in the callback
seems cleaner.
Apparently, people do use nscd, hence play somewhat nice with it, and
let's explicitly flush nscd caches whenever we register a new
user/group.
This patch only adds the actual refresh request invocation. Later
commits then issue this call at appropriate moments.
Note that the nscd protocol is not officially documented though very
simple. This code is written very defensively so that incompatibilities
don't affect us much.
Given that glibc really has a duty to maintain compat between
differently compiled programs and their system nscd they can't break API
and thus it should be safe for us to implement an alternative,
minimalistic client.
Ideally this kind of explicit, global cache flushing would not be necessary.
However nscd currently has no cache coherency protocol, hence we can't
really implement this better. The only concept it knows is a TTL for
positive hosts lookups. Hoewver for negative lookups or any of the other
tables nothing is available.
This has been irritating me for quite a while: let's prefix these enum
values with a common prefix, like we do for almost all other enums.
No change in behaviour, just some renaming.
In #10583, a unit file lives in ~/.config/systemd/user, and
'systemctl --runtime --user mask' is used to create a symlink in /run.
This symlink has lower priority than the config file, so
'systemctl --user' will happily load the unit file, and does't care about
the symlink at all.
But when asked if the unit is enabled, we'd look for all symlinks, find the
symlink in the runtime directory, and report that the unit is runtime-enabled.
In this particular case the fact that the symlink points at /dev/null, creates
additional confusion, but it doesn't really matter: *any* symlink (or regular
file) that is lower in the priority order is "covered" by the unit fragment,
and should be ignored.
Fixes#10583.
The name argument in UnitFileInstallInfo (i->name) should always be a unit
file name, so the conditional always takes the 'else' branch.
The only call chain that links to find_symlinks_fd() is unit_file_lookup_state
→ find_symlinks_in_scope → find_symlinks → find_symlinks_fd. But
unit_file_lookup_state calls unit_name_is_valid(name), and then name is used
to construct the UnitFileInstallInfo object in install_info_discover, which just
uses the name it was given.
There should be no functional difference, except that the error message
is changd from "three or no arguments" to "zero or three arguments". Somehow
the inverted form always seemed strange.
umask() call is also dropped from run-generator. I think it wasn't dropped in
053254e3cb because the run generator was merged
around the same time.
fdopen doesn't accept "e", it's ignored. Let's not mislead people into
believing that it actually sets O_CLOEXEC.
From `man 3 fdopen`:
> e (since glibc 2.7):
> Open the file with the O_CLOEXEC flag. See open(2) for more information. This flag is ignored for fdopen()
As mentioned by @jlebon in #11131.
It would be very wrong if any of the specfier printf calls modified
any of the objects or data being printed. Let's mark all arguments as const
(primarily to make it easier for the reader to see where modifications cannot
occur).
https://tools.ietf.org/html/rfc1035#section-2.3.1 says (approximately)
that only letters, numbers, and non-leading non-trailing dashes are allowed
(for entries with A/AAAA records). We set no restrictions.
hosts(5) says:
> Host names may contain only alphanumeric characters, minus signs ("-"), and
> periods ("."). They must begin with an alphabetic character and end with an
> alphanumeric character.
nss-files follows those rules, and will ignore names in /etc/hosts that do not
follow this rule.
Let's follow the documented rules for /etc/hosts. In particular, this makes us
consitent with nss-files, reducing surprises for the user.
I'm pretty sure we should apply stricter filtering to names received over DNS
and LLMNR and MDNS, but it's a bigger project, because the rules differ
depepending on which level the label appears (rules for top-level names are
stricter), and this patch takes the minimalistic approach and only changes
behaviour for /etc/hosts.
Escape syntax is also disallowed in /etc/hosts, even if the resulting character
would be allowed. Other tools that parse /etc/hosts do not support this, and
there is no need to use it because no allowed characters benefit from escaping.
The table cell reusing code is supposed to be an internal memory
optimization, and not more. This means behaviour should be the same as
if we wouldn't reuse cells.
This adds a per-cell option for uppercasing displayed strings.
Implicitly turn this on for the header row. The fact that we format the
table header in uppercase is a formatting thing after all, hence should
be applied by the formatter, i.e. the table display code.
Moreover, this provides us with the benefit that we can more nicely
reuse the specified table headers as JSON field names, like we already
do: json field names are usually not uppercase.
Since .timespan and .timestamp are unionized on top of each other this
doesn't actually matter, but it is still more correct to address it
under it's correct name.
This splits out a bunch of functions from fileio.c that have to do with
temporary files. Simply to make the header files a bit shorter, and to
group things more nicely.
No code changes, just some rearranging of source files.
The code so far logged about some errors but was silent on others. Let's
stream-line that and make the function fully self-logging on all error
conditions.
Similar to the previous commit: in many cases no further fd processing
needs to be done in forked of children before execve() or any of its
flavours are called. In those case we can use FORK_RLIMIT_NOFILE_SAFE
instead.
Whenever we invoke external, foreign code from code that has
RLIMIT_NOFILE's soft limit bumped to high values, revert it to 1024
first. This is a safety precaution for compatibility with programs using
select() which cannot operate with fds > 1024.
This commit adds the call to rlimit_nofile_safe() to all invocations of
exec{v,ve,l}() and friends that either are in code that we know runs
with RLIMIT_NOFILE bumped up (which is PID 1 and all journal code for
starters) or that is part of shared code that might end up there.
The calls are placed as early as we can in processes invoking a flavour
of execve(), but after the last time we do fd manipulations, so that we
can still take benefit of the high fd limits for that.
This is really basic stuff and in a follow-up commit will use it all
across the codebase, including in process-util.[ch] which is in
src/basic/. Hence let's move it back to src/basic/ itself.
This helper sets RLIMIT_NOFILE's soft limit to 1024 (FD_SETSIZE) for
compatibility with apps using select().
The idea is that we use this helper to reset the limit whenever we
invoke foreign code from our own processes which have bumped
RLIMIT_NOFILE high.
This is in many cases redundant, as a similar check is done by various
callers already, but in other cases (where we read the color from a
static table for example), it's nice to let the color check be done by
the table code itself, and since it doesn't hurt in the other cases just
do it again.
it's already part of @ipc, no need to have it in both. Given that @ipc
is much more popular (as it is part of @system-service for example),
let's not define it a second time.
Let's generalize this, so that we can use this in nspawn later on, which
is pretty useful as we need to be able to mask files from the inner
child of nspawn too, where the host's /run/systemd/inaccessible
directory is not visible anymore. Moreover, if nspawn can create these
nodes on its own before the payload this means the payload can run with
fewer privileges.
add new "systemd-run-generator" for running arbitrary commands from the kernel command line as system services using the "systemd.run=" kernel command line switch
Previously, we'd link the unit file into /etc in this case, but that
should only be done if the unit file is not in the search path anyway,
and this is already done implicitly anyway for all enabled unit files,
hence no reason to duplicate this here.
Fixes: #10253
Quite often when we generate objects some fields should only be
generated in some conditions. Let's add high-level support for that.
Matching the existing JSON_BUILD_PAIR() this adds
JSON_BUILD_PAIR_CONDITIONAL() which is very similar, but takes an
additional parameter: a boolean condition. If "true" this acts like
JSON_BUILD_PAIR(), but if false then the whole pair is suppressed.
This sounds simply, but requires a tiny bit of complexity: when complex
sub-variants are used in fields, then we also need to suppress them.
Having systemctl disable/unmask remove all symlinks in /etc and /run is
unintuitive and breaks existing use cases.
systemctl should behave symmetrically.
A "systemctl --runtime unmask" should undo a "systemctl --runtime mask"
action.
Say you have a service, which was masked by the admin in /etc.
If you temporarily want to mask the execution of the service (say in a
script), you'd create a runtime mask via "systemctl --runtime mask".
It is is now no longer possible to undo this temporary mask without
nuking the admin changes, unless you start rm'ing files manually.
While it is useful to be able to remove all enablement/mask symlinks in
one go, this should be done via a separate command line switch, like
"systemctl --all unmask".
This reverts commit 4910b35078.
Fixes: #9393
This adds SuccessActionExitStatus= and FailureActionExitStatus= that may
be used to configure the exit status to propagate in when
SuccessAction=exit or FailureAction=exit is used.
When not specified let's also propagate the exit status of the main
process we fork off for the unit.
Let's simplify things and drop the logic that /var/lib/machines is setup
as auto-growing btrfs loopback file /var/lib/machines.raw.
THis was done in order to make quota available for machine management,
but quite frankly never really worked properly, as we couldn't grow the
file system in sync with its use properly. Moreover philosophically it's
problematic overriding the admin's choice of file system like this.
Let's hence drop this, and simplify things. Deleting code is a good
feeling.
Now that regular file systems provide project quota we could probably
add per-machine quota support based on that, hence the btrfs quota
argument is not that interesting anymore (though btrfs quota is a bit
more powerful as it allows recursive quota, i.e. that the machine pool
gets an overall quota in addition to per-machine quota).
We invoke this usually on a temporary path before renaming it into
place. This means the log message is quite suprising as it mentions a
weird path with random characters in it. Hence, let's downgrade the
message in order not to confuse the user.
Otherwise a "reboot" or "poweroff" in the initramfs will have to wait
until systemd-fsck-root.service has completed, which might never happen
if the root device never shows up.
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.
I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
This reverts 5fdf2d51c2, except for one improved
log message.
Fixes#10613.
Checking if resume= is configured is a good idea, but it turns out we cannot do
it reliably:
- the code only supported boot options with sd-boot, and it's not very widely
used. This means that for most systemd we could only check the current
commandline, not the next one.
- Various systems resume without, e.g. Debian has
/etc/initramfs-tools/conf.d/resume in the initramfs.
Making those checks better would be possible with enough effort, but there'll
be always new systems that boot in a slightly different way and we would need
to keep adding new cases. Longer term, we want to rely on autodetecting the
resume partition, and then checks like this will not be necessary at all. It is
quite clear from the number of bug reports that the number of poeple impacted
by this is quite high now, so let's just drop this.
The agent is closed after the static destuctors but before the pager.
No users of DEFINE_MAIN_FUNCTION* were using a polkit agent, so this makes no
functional difference.
This doesn't have much effect on the final build, because we link libbasic.a
into libsystemd-shared.so, so in the end, all the object built from basic/
end up in libsystemd-shared. And when the static library is linked into binaries,
any objects that are included in it but are not used are trimmed. Hence, the
size of output artifacts doesn't change:
$ du -sb /var/tmp/inst*
54181861 /var/tmp/inst1 (old)
54207441 /var/tmp/inst1s (old split-usr)
54182477 /var/tmp/inst2 (new)
54208041 /var/tmp/inst2s (new split-usr)
(The negligible change in size is because libsystemd-shared.so is bigger
by a few hundred bytes. I guess it's because symbols are named differently
or something like that.)
The effect is on the build process, in particular partial builds. This change
effectively moves the requirements on some build steps toward the leaves of the
dependency tree. Two effects:
- when building items that do not depend on libsystemd-shared, we
build less stuff for libbasic.a (which wouldn't be used anyway,
so it's a net win).
- when building items that do depend on libshared, we reduce libbasic.a as a
synchronization point, possibly allowing better parallelism.
Method:
1. copy list of .h files from src/basic/meson.build to /tmp/basic
2. $ for i in $(grep '.h$' /tmp/basic); do echo $i; git --no-pager grep "include \"$i\"" src/basic/ 'src/lib*' 'src/nss-*' 'src/journal/sd-journal.c' |grep -v "${i%.h}.c";echo ;done | less
This removes the ability to configure which cgroup controllers to mount
together. Instead, we'll now hardcode that "cpu" and "cpuacct" are
mounted together as well as "net_cls" and "net_prio".
The concept of mounting controllers together has no future as it does
not exist to cgroupsv2. Moreover, the current logic is systematically
broken, as revealed by the discussions in #10507. Also, we surveyed Red
Hat customers and couldn't find a single user of the concept (which
isn't particularly surprising, as it is broken...)
This reduced the (already way too complex) cgroup handling for us, since
we now know whenever we make a change to a cgroup for one controller to
which other controllers it applies.
Now that we don't (mis-)use the env file parser to parse kernel command
lines there's no need anymore to override the used newline character
set. Let's hence drop the argument and just "\n\r" always. This nicely
simplifies our code.
All users of the macro (except for one, in serialize.c), use the macro in
connection with read_line(), so they must include fileio.h. Let's not play
libc games and require multiple header file to be included for the most common
use of a function.
The removal of def.h includes is not exact. I mostly went over the commits that
switch over to use read_line() and add def.h at the same time and reverted the
addition of def.h in those files.
Rebooting to set change the kernel command line to set some udev parameters is
inconvenient. Let's allow setting more stuff in the config file.
Also drop quotes from around "info" in udev.conf. We need to accept them for
compatibility, but there is no reason to use them.
A race condition happens when calling ask_password_auto() multiple times
to unlock several disks on boot and effectively no password caching is
utilized. This patch fixes it by polling the cache when waiting for
the password.