Consider such configuration:
$ systemd-nspawn --read-only --timezone=copy --resolv-conf=copy-host \
--overlay="+/etc::/etc" <...>
Assuming one wants `/` to be read-only, DNS and `/etc/localtime` to
work. One way to do it is to create an overlay filesystem in `/etc/`.
However, systemd-nspawn tries to create `/etc/resolv.conf` and
`/etc/localtime` before mounting the custom paths, while `/` (and, by
extension, `/etc`) is read-only. Thus it fails to create those files.
Mounting custom paths before modifying anything in `/etc/` makes this
possible.
Full example:
```
$ debootstrap buster /var/lib/machines/t1 http://deb.debian.org/debian
$ systemd-nspawn --private-users=false --timezone=copy --resolv-conf=copy-host --read-only --tmpfs=/var --tmpfs=/run --overlay="+/etc::/etc" -D /var/lib/machines/t1 ping -c 1 example.com
Spawning container t1 on /var/lib/machines/t1.
Press ^] three times within 1s to kill container.
ping: example.com: Temporary failure in name resolution
Container t1 failed with error code 130.
```
With the patch:
```
$ sudo ./build/systemd-nspawn --private-users=false --timezone=copy --resolv-conf=copy-host --read-only --tmpfs=/var --tmpfs=/run --overlay="+/etc::/etc" -D /var/lib/machines/t1 ping -qc 1 example.com
Spawning container t1 on /var/lib/machines/t1.
Press ^] three times within 1s to kill container.
PING example.com (93.184.216.34) 56(84) bytes of data.
--- example.org ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 110.912/110.912/110.912/0.000 ms
Container t1 exited successfully.
```
This changes the calendarspec parser to allow expressions such as
"00:05..05", i.e. a range where start and end is the same. It also
allows expressions such as "00:1-2/3", i.e. where the repetition value
does not fit even once in the specified range. With this patch both
cases will now be optimized away, i.e. the range is removed and a fixed
value is used, which is functionally equivalent.
See #15030 for an issue where the inability to parse such expressions
caused confusion.
I think it's probably better to accept these gracefully and optimizing
them away instead of refusing them with a plain EINVAL. With a tool such
as "systemd-analyze" calendar it should be easy to figure out the
normalized form with the redundant bits optimized away.
Let's try to mangle table contents a bit to make them more suitable as
JSON field names. Specifically when we see "foo bar" convert this to
"foo_bar" as field name, as variable/field names are generally assumed
to be without spaces.
HUGE_SIZE was defined inconsistently.
> In file included from ../src/basic/alloc-util.h:9,
> from ../src/journal/test-compress.c:9:
> ../src/journal/test-compress.c: In function ‘main’:
> ../src/journal/test-compress.c:280:33: error: ‘HUGE_SIZE’ undeclared (first use in this function)
> 280 | assert_se(huge = malloc(HUGE_SIZE));
This adds the sd_notify_barrier function, to allow users to synchronize against
the reception of sd_notify(3) status messages. It acts as a synchronization
point, and a successful return gurantees that all previous messages have been
consumed by the manager. This can be used to eliminate race conditions where
the sending process exits too early for systemd to associate its PID to a
cgroup and attribute the status message to a unit correctly.
systemd-notify now uses this function for proper notification delivery and be
useful for NotifyAccess=all units again in user mode, or in cases where it
doesn't have a control process as parent.
Fixes: #2739
A service can specify FDSTORE=1 FDPOLL=0 to request that PID1 does not
poll the fd to remove them on error. If set, fds will only be removed on
FDSTOREREMOVE=1 or when the service is done.
Fixes: #12086
With cgroup v2 the cgroup freezer is implemented as a cgroup
attribute called cgroup.freeze. cgroup can be frozen by writing "1"
to the file and kernel will send us a notification through
"cgroup.events" after the operation is finished and processes in the
cgroup entered quiescent state, i.e. they are not scheduled to
run. Writing "0" to the attribute file does the inverse and process
execution is resumed.
This commit exposes above low-level functionality through systemd's DBus
API. Each unit type must provide specialized implementation for these
methods, otherwise, we return an error. So far only service, scope, and
slice unit types provide the support. It is possible to check if a
given unit has the support using CanFreeze() DBus property.
Note that DBus API has a synchronous behavior and we dispatch the reply
to freeze/thaw requests only after the kernel has notified us that
requested operation was completed.
$ systemctl --no-pager --root /tmp/root2/ cat ctrl-alt-del.target
Failed to resolve symlink /tmp/root2/etc/systemd/system/ctrl-alt-del.target pointing to /usr/lib/systemd/system/reboot.target, ignoring: Channel number out of range
...
EFI variable access is nowadays subject to rate limiting by the kernel.
Thus, let's cache the results of checking them, in order to minimize how
often we access them.
Fixes: #14828
Callers of cg_get_keyed_attribute_full() can now specify via the flag whether the
missing keyes in cgroup attribute file are OK or not. Also the wrappers for both
strict and graceful version are provided.
This has the advantage that mac_selinux_access_check() can be used as a
function in all contexts. For example, parameters passed to it won't be
reported as unused if the "function" call is replaced with 0 on SELinux
disabled builds.
if we parse an xattr line that has no valid assignment, we might end up
with an empty ->xattr list. Don't hit assert on that, just go on.
Fixes: #15610
Let's allow more memory to be locked on beefy machines than on small
ones. The previous limit of 64M is the lower bound still. This
effectively means on a 4GB machine we can lock 512M, which should be
more than enough, but still not lock up the machine entirely under
pressure.
Fixes: #15053
WSL2 will soon (TM) include the "WSL2" string in /proc/sys/kernel/osrelease
so the workaround will no longer be necessary.
We have several different cloud images which do include the "microsoft"
string already, which would break this detection. They are for internal
usage at the moment, but the userspace side can come from all over the
place so it would be quite hard to track and downstream-patch to avoid
breakages.
This reverts commit a2f838d590.
On my laptop (Lenovo X1carbo 4th) I very occasionally see test-boot-timestamps
fail with this tb:
262/494 test-boot-timestamps FAIL 0.7348453998565674 s (killed by signal 6 SIGABRT)
08:12:48 SYSTEMD_LANGUAGE_FALLBACK_MAP='/home/zbyszek/src/systemd/src/locale/language-fallback-map' SYSTEMD_KBD_MODEL_MAP='/home/zbyszek/src/systemd/src/locale/kbd-model-map' PATH='/home/zbyszek/src/systemd/build:/home/zbyszek/.local/bin:/usr/lib64/qt-3.3/bin:/usr/share/Modules/bin:/usr/condabin:/usr/lib64/ccache:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/home/zbyszek/bin:/var/lib/snapd/snap/bin' /home/zbyszek/src/systemd/build/test-boot-timestamps
--- stderr ---
Failed to read $container of PID 1, ignoring: Permission denied
Found container virtualization none.
Failed to get SystemdOptions EFI variable, ignoring: Interrupted system call
Failed to read ACPI FPDT: Permission denied
Failed to read LoaderTimeInitUSec: Interrupted system call
Failed to read EFI loader data: Interrupted system call
Assertion 'q >= 0' failed at src/test/test-boot-timestamps.c:84, function main(). Aborting.
Normally it takes ~0.02s, but here there's a slowdown to 0.73 and things fail with EINTR.
This happens only occasionally, and I haven't been able to capture a strace.
It would be to ignore that case in test-boot-timestamps or always translate
EINTR to -ENODATA. Nevertheless, I think it's better to retry, since this gives
as more resilient behaviour and avoids a transient failure.
See
https://github.com/torvalds/linux/blob/master/fs/efivarfs/file.c#L75
and
bef3efbeb8.
Don't assume that 4MB can be allocated from stack since there could be smaller
DefaultLimitSTACK= in force, so let's use malloc(). NUL terminate the huge
strings by hand, also ensure termination in test_lz4_decompress_partial() and
optimize the memset() for the string.
Some items in /proc and /etc may not be accessible to poor unprivileged users
due to e.g. SELinux, BOFH or both, so check for EACCES and EPERM.
/var/tmp may be a symlink to /tmp and then path_compare() will always fail, so
let's stick to /tmp like elsewhere.
/tmp may be mounted with noexec option and then trying to execute scripts from
there would fail.
Detect and warn if seccomp is already in use, which could make seccomp test
fail if the syscalls are already blocked.
Unset $TMPDIR so it will not break specifier tests where %T is assumed to be
/tmp and %V /var/tmp.
It's not always mounted, e.g. during the build-time tests, it's running inside
a chroot (that's how Debian/Ubuntu build packages, in chroots) so this test
always fails because /sys/fs/cgroup isn't mounted.
When nothing at all is mounted at /sys/fs/cgroup, the fs.f_type is
SYSFS_MAGIC (0x62656572) which results in the confusing debug log:
"Unknown filesystem type 62656572 mounted on /sys/fs/cgroup."
Instead, if the f_type is SYSFS_MAGIC, a more accurate message is:
"No filesystem is currently mounted on /sys/fs/cgroup."
Split out of #15457, let's see if this is the culprit of the CI failure.
(also setting green label here, since @keszybz already greenlit it in that other PR)
We unregister binfmt_misc twice during shutdown with this change:
1. A previous commit added support for doing that in the final shutdown
phase, i.e. when we do the aggressive umount loop. This is the robust
thing to do, in case the earlier ("clean") shutdown phase didn't work
for some reason.
2. This commit adds support for doing that when systemd-binfmt.service
is stopped. This is a good idea so that people can order mounts
before the service if they want to register binaries from such
mounts, as in that case we'll undo the registration on shutdown
again, before unmounting those mounts.
And all that, just because of that weird "F" flag the kernel introduced
that can pin files...
Fixes: #14981
Let's just copy out the bit of the string we need, and let's make sure
we refuse rules called "status" and "register", since those are special
files in binfmt_misc's file system.
Apparently if the new "F" flag is used they might pin files, which
blocks us from unmounting things. Let's hence clear this up explicitly.
Before entering our umount loop.
Fixes: #14981
let's return ENOSYS in that case, to make things a bit less confusng.
Previously we'd just propagate ENOENT, which people might mistake as
applying to the object being modified rather than /proc/ just not being
there.
Let's return ENOSYS instead, i.e. an error clearly indicating that some
kernel API is not available. This hopefully should put people on a
better track.
Note that we only do the procfs check in the error path, which hopefully
means it's the less likely path.
We probably can add similar bits to more suitable codepaths dealing with
/proc/self/fd, but for now, let's pick to the ones noticed in #14745.
Fixes: #14745
Otherwise we'd not read the services input while waiting for the job to
wait, and there's no point in waiting for the job anyway if we wait for
the unit to stop ultimately.
Fixes: #15395
Let's be extra careful whenever we return from recvmsg() and see
MSG_CTRUNC set. This generally means we ran into a programming error, as
we didn't size the control buffer large enough. It's an error condition
we should at least log about, or propagate up. Hence do that.
This is particularly important when receiving fds, since for those the
control data can be of any size. In particular on stream sockets that's
nasty, because if we miss an fd because of control data truncation we
cannot recover, we might not even realize that we are one off.
(Also, when failing early, if there's any chance the socket might be
AF_UNIX let's close all received fds, all the time. We got this right
most of the time, but there were a few cases missing. God, UNIX is hard
to use)
ubsan complains that we add an offset to a NULL ptr here in some cases.
Which isn't really a bug though, since we only use it as the end
condition for a for loop, but we can still fix it...
Fixes: #15522
Let's add flavours for copying stub/uplink resolv.conf versions.
Let's add a more brutal "replace" mode, where we'll replace any existing
destination file.
Let's also change what "auto" means: instead of copying the static file,
let's use the stub file, so that DNS search info is copied over.
Fixes: #15340
When a job is skipped due its dependencies not being ready, log
a debug message saying what is holding it back.
This was very useful with transient units timing out to figure
out where the problem was.
Anyone previously using the UseRoutes=false parameter expected their
dhcp4-provided gateway route to be ignored, as well. However, with
the introduction of the UseGateway= parameter, this is no longer true.
In order to keep backwards compatibility, this sets the UseGateway=
default value to whatever UseRoutes= has been set to.
Just as log_full already does, check if the log level would result in
logging immediately in the macro in order to avoid doing
unnecessary work that adds up in hot spots.
Let's define a new, generic bus interface that any daemon can implement
for querying/setting the log level.
We can turn this into something more powerful later on, but for now,
only expose three properties: the log level, log target and the syslog
identifier (with the former two being writable).
This is supposed to be generic, so that it can be implemented by 3rd
party daemons too, eventually.
It's not that I think that "hostname" is vastly superior to "host name". Quite
the opposite — the difference is small, and in some context the two-word version
does fit better. But in the tree, there are ~200 occurrences of the first, and
>1600 of the other, and consistent spelling is more important than any particular
spelling choice.
We use udev to wait for /dev/loopX devices to be fully proped hence we
need an implicit ordering dependency on it, for RootImage= to work
reliably in early boot, too.
Fixes: #14972
When the stub listener is disabled, stub-resolv.conf is useless. Instead of
warning about this, let's just make stub-resolv.conf point to the private
resolv.conf file. (The original bug report asked for "mirroring", but I think
a symlink is nicer than a copy because it is easier to see that a redirection
was made.)
Fixes#14700.
All callers ignore the return value.
This is almost entirely theoretical, since writing to /run is unlikely to
fail..., but the user is almost certainly better off keeping the old copy
around and having working dns resolution with an out-of-date dns server list
than having having a dangling /etc/resolv.conf symlink.
An error with a full path is immediately clear. OTOH, a user might not be
familiar with concenpt like "private resolv.conf".
I opted to use %s-formatting for the path, because the code is much easier to
read this way. Any difference in t speed of execution is not important.
This is useful to raise the log level for a single transaction or a few,
without affecting other state of the resolved as a restart would.
The log level can only be set, I didn't bother with having the ability
to restore the original as in pid1.
Hiding the first column, which may contain bullet circles, with --no-legend
is undocumented and potentially unexpected. On the other hand, not printing
bullet circles with --plain is documented so hiding the column with that
switch is sensible.
The combination "--full --no-legend --no-pager --plain" is appropriate for
automated processing of systemctl output.
ebf963c551 changed the 'sep' argument to always
be either " " or "\n", which broke the indentation logic for the first line
in base64_append_width(). Since it now always is one character, and never NULL,
let's change the type to char and simplify the logic a bit.
$ COLUMNS=30 build/test-dns-packet test/test-resolve/org~20200417.pkts
============== test/test-resolve/org~20200417.pkts ==============
org IN DNSKEY 256 3 RSASHA1-NSEC3-SHA1
AwEAAcLPVEcg0hFBheXQf
QOqqLiRgckk69o2KTAsq3
lNRY0c9mnEjzZDGsGmXNy
2EQ6yelkIYYus7KLor2Fz
x59hEqcM82zqkdHV6hXvZ
yjxxSHG3nl8xQS6gF8mdI
YouDTWWhTInfjSKoIeDok
Hq3S67EjSngV7/wVCMTbI
amS0NF4H
-- Flags: ZONE_KEY
-- Key tag: 37022
...
$ COLUMNS=120 build/test-dns-packet test/test-resolve/org~20200417.pkts
============== test/test-resolve/org~20200417.pkts ==============
org IN DNSKEY 256 3 RSASHA1-NSEC3-SHA1 AwEAAcLPVEcg0hFBheXQfQOqqLiRgckk69o2KTAsq3lNRY0c9mnEjzZDGsGmXNy2EQ6yelkIYYus7KLor
2Fzx59hEqcM82zqkdHV6hXvZyjxxSHG3nl8xQS6gF8mdIYouDTWWhTInfjSKoIeDokHq3S67EjSngV7/w
VCMTbIamS0NF4H
-- Flags: ZONE_KEY
-- Key tag: 37022
...
Let's make it optional whether auditing is enabled at journald start-up
or not.
Note that this only controls whether audit is enabled/disabled in the
kernel. Either way we'll still collect the audit data if it is
generated, i.e. if some other tool enables it, we'll collect it.
Fixes: #959
There are legitimate reasons to access the file directly, as currently
discussed on fedora-devel. Hence tone things down from "must" to "should
typically not".
Also, let's use fputs() instead of fputs_unlocked() here,
fopen_temporary_label() turns off stdio locking anyway for the whole
FILE*, hence no need to do this manually each time.
Builds with recent glibc would fail with:
../src/network/netdev/fou-tunnel.c: In function ‘config_parse_ip_protocol’:
../src/basic/macro.h:380:9: error: static assertion failed: "IPPROTO_MAX-1 <= UINT8_MAX"
380 | static_assert(expr, #expr)
| ^~~~~~~~~~~~~
../src/network/netdev/fou-tunnel.c:161:9: note: in expansion of macro ‘assert_cc’
161 | assert_cc(IPPROTO_MAX-1 <= UINT8_MAX);
| ^~~~~~~~~
This is because f9ac84f92f151e07586c55e14ed628d493a5929d (present in
glibc-2.31.9000-9.fc33.x86_64) added IPPROTO_MPTCP=262, following
v5.5-rc5-1002-gfaf391c382 in the kernel.
The watchdog ping is performed for every iteration of manager event
loop. This results in a lot of ioctls on watchdog device driver
especially during boot or if services are aggressively using sd_notify.
Depending on the watchdog device driver this may have performance
impact on embedded systems.
The patch skips sending the watchdog to device driver if the ping is
requested before half of the watchdog timeout.
proot provides userspace-powered emulation of chroot and mount --bind,
lending it to be used on environments without unprivileged user
namespaces, or in otherwise restricted environments like Android.
In order to achieve this, proot makes use of the kernel's ptrace()
facility, which we can use in order to detect its presence. Since it
doesn't use any kind of namespacing, including PID namespacing, we don't
need to do any tricks when trying to get the tracer's metadata.
For our purposes, proot is listed as a "container", since we mostly use
this also as the bucket for non-container-but-container-like
technologies like WSL. As such, it seems like a good fit for this
section as well.
It's pretty, and it highlights that the pw prompt is kinda special and
needs user input.
We suppress the emoji entirel if there's no emoji support (i.e. this
means we suppress the ASCII replacement), since it carries no additional
information, it is just decoration to highlight a line.
We provide a way via the '-' symbol to ignore errors when nonexistent
executable files are passed to Exec* parameters & so on. In such a case,
the flag `EXEC_COMMAND_IGNORE_FAILURE` is set and we go on happily with
our life if that happens. However, `systemd-analyze verify` complained
about missing executables even in such a case. In such a case it is not
an error for this to happen so check if the flag is set before checking
if the file is accessible and executable.
Add some small tests to check this condition.
Closes#15218.
This commit introduces new meson build option "kernel-install" to prevent kernel-install from building if the user
sets the added option as "false".
Signed-off-by: Jakov Smolic <jakov.smolic@sartura.hr>
Signed-off-by: Luka Perkov <luka.perkov@sartura.hr>
An stdio FILE* stream usually refers to something with a file
descriptor, but that's just "usually". It doesn't have to, when taking
fmemopen() and similar into account. Most of our calls to fileno()
assumed the call couldn't fail. In most cases this was correct, but in
some cases where we didn't know whether we work on files or memory we'd
use the returned fd as if it was unconditionally valid while it wasn't,
and passed it to a multitude of kernel syscalls. Let's fix that, and do
something reasonably smart when encountering this case.
(Running test-fileio with this patch applied will remove tons of ioctl()
calls on -1).
This should hopefully help us avoid c&p mistakes. And there are plans to
add more settings like this, which should then be rather straightforward.
There is a slight functional change: the code got uplink handling wrong
and run manager_find_uplink() repeatedly. That part is fixed.
Regardless of whether a mount is setup in initrd or int the main system,
the network default dependencies _netdev should still be honored.
IOW if a mount unit use the following options "x-initrd.mount,_netdev", it
should be ordered against initrd-fs.target, network.target,
network-online.target.
/dev/vdb1 /mnt ext4 x-initrd.mount,_netdev defaults 0 0
Before this patch:
Before=umount.target initrd-fs.target
After=system.slice sysroot.mount dev-vdb1.device -.mount systemd-journald.socket blockdev@dev-vdb1.target
After this patch:
Before=initrd-fs.target umount.target
After=network-online.target -.mount blockdev@dev-vdb1.target dev-vdb1.device sysroot.mount system.slice network.target systemd-journald.socket
First After=local-fs-pre.target wasn't described in the man page although it's
part of the default dependencies automatically set by pid1.
Secondly, Before=local-fs.target was only set if the automount unit was
generated from the fstab-generator because the dep was explicitly
generated. It was also not documented as a default dependency.
Fix it by managing the dep from pid1 instead.
fstab-generator was also handling the default ordering dependencies for mount
units setup in initrd. To do that it was turning the defaults dependencies off
completely and ordered the mount unit against either local-fs.target or
initrd-fs.target or initrd-root-fs.target itself.
But it had the bad side effect to also remove all other default dependencies as
well. Thus if an initrd mount was using _netdev, the network dependencies were
missing.
In general fstab-generator shouldn't use DefaultDependecies=no because it can
handle only a small set of the default dependencies the rest are dealt by pid1.
So this patch makes pid1 handle all default dependencies.
Let's not trigger MACs needlessly.
Ideally everybody would turn on userdb, but if people insist in not
doing so, then let's not attempt to open shadow.
It's a bit ugly to implement this, since shadow information is more than
just passwords (but accound validity metadata), and thus userdb's own
"privieleged" scheme is orthogonal to this, but let's still do this for
the client side.
Fixes: #15105
This patch changes the way user managers set the default umask for the units it
manages.
Indeed one can expect that if user manager's umask is redefined through PAM
(via /etc/login.defs or pam_umask), all its children including the units it
spawns have their umask set to the new value.
Hence make user units inherit their umask value from their parent instead of
the hard coded value 0022 but allow them to override this value via their unit
file.
Note that reexecuting managers with 'systemctl daemon-reexec' after changing
UMask= has no effect. To take effect managers need to be restarted with
'systemct restart' instead. This behavior was already present before this
patch.
Fixes#6077.
When managing a home directory as LUKS image we currently place a
directory at the top that contains the actual home directory (so that
the home directory of the user won't be cluttered by lost-found and
suchlike). On btrfs let's make that a subvol though. This is a good idea
so that possibly later on we can make use of this for automatic history
management.
Fixes: #15121