If more than one errno is specified for a syscall in SystemCallFilter=,
use the last one instead of reporting an error. This is especially
useful when used with system call sets:
SystemCallFilter=@privileged:EPERM @reboot
This will block any system call requiring super-user capabilities with
EPERM, except for attempts to reboot the system, which will immediately
terminate the process. (@reboot is included in @privileged.)
This also effectively fixes#9939, since specifying different errnos for
“the same syscall” (same pseudo syscall number) is no longer an error.
Dracut has a support for unlocking encrypted drives with keyfile stored
on the external drive. This support is included in the generated initrd
only if systemd module is not included.
When systemd is used in initrd then attachment of encrypted drives is
handled by systemd-cryptsetup tools. Our generator has support for
keyfile, however, it didn't support keyfile on the external block
device (keydev).
This commit introduces basic keydev support. Keydev can be specified per
luks.uuid on the kernel command line. Keydev is automatically mounted
during boot and we look for keyfile in the keydev
mountpoint (i.e. keyfile path is prefixed with the keydev mount point
path). After crypt device is attached we automatically unmount
where keyfile resides.
Example:
rd.luks.key=70bc876b-f627-4038-9049-3080d79d2165=/key:LABEL=KEYDEV
According to RFC2616[1], HTTP header names are case-insensitive. So
it's totally valid to have a header starting with either `Date:` or
`date:`.
However, when systemd-importd pulls an image from an HTTP server, it
parses HTTP headers by comparing header names as-is, without any
conversion. That causes failures when some HTTP servers return headers
with different combinations of upper-/lower-cases.
An example:
https://alpha.release.flatcar-linux.net/amd64-usr/current/flatcar_developer_container.bin.bz2 returns `Etag: "pe89so9oir60"`,
while https://alpha.release.core-os.net/amd64-usr/current/coreos_developer_container.bin.bz2
returns `ETag: "f03372edea9a1e7232e282c346099857"`.
Since systemd-importd expects to see `ETag`, the etag for the Container Linux image
is correctly interpreted as a part of the hidden file name.
However, it cannot parse etag for Flatcar Linux, so the etag the Flatcar Linux image
is not appended to the hidden file name.
```
$ sudo ls -al /var/lib/machines/
-r--r--r-- 1 root root 3303014400 Aug 21 20:07 '.raw-https:\x2f\x2falpha\x2erelease\x2ecore-os\x2enet\x2famd64-usr\x2fcurrent\x2fcoreos_developer_container\x2ebin\x2ebz2.\x22f03372edea9a1e7232e282c346099857\x22.raw'
-r--r--r-- 1 root root 3303014400 Aug 17 06:15 '.raw-https:\x2f\x2falpha\x2erelease\x2eflatcar-linux\x2enet\x2famd64-usr\x2fcurrent\x2fflatcar_developer_container\x2ebin\x2ebz2.raw'
```
As a result, when the Flatcar image is removed and downloaded again,
systemd-importd is not able to determine if the file has been already
downloaded, so it always download it again. Then it fails to rename it
to an expected name, because there's already a hidden file.
To fix this issue, let's introduce a new helper function
`memory_startswith_no_case()`, which compares memory regions in a
case-insensitive way. Use this function in `curl_header_strdup()`.
See also https://github.com/kinvolk/kube-spawn/issues/304
[1]: https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2
On Dell machines LoadOptions is filled with:
01 00 00 00 <name of BIOS Boot Loader Entry> ... <unknown bytes>
So, in case of meaningfull LoadOptions, better check if the first char
is a printable character.
Fix#9993. When this code was split out to user-runtime-dir, it forgot to
include the call to mac_selinux_init(). So mkdir_label() stopped working.
Fixes: a9f0f5e501 ("logind: split %t directory creation to a helper
unit")
The MountEntry's added for EMPTY_DIR work very similarly to the TMPFS ones.
In both cases, .has_prefix is false. In fact, .has_prefix is false in
*all* the MountEntry's we add except for the access mounts (READONLY etc).
But EMPTY_DIR stuck out by explicitly setting .has_prefix = false.
Let's remove that.
... when no mount options are passed.
Change the code, to avoid the following failure in the newly added tests:
exec-temporaryfilesystem-rw.service: Executing: /usr/bin/sh -x -c
'[ "$(stat -c %a /var)" == 755 ]'
++ stat -c %a /var
+ '[' 1777 == 755 ']'
Received SIGCHLD from PID 30364 (sh).
Child 30364 (sh) died (code=exited, status=1/FAILURE)
(And I spotted an opportunity to use TAKE_PTR() at the end).
We can't remount the underlying superblocks, if we are inside a user
namespace and running Linux <= 4.17. We can only change the per-mount
flags (MS_REMOUNT | MS_BIND).
This type of mount() call can only change the per-mount flags, so we
don't have to worry about passing the right string options now.
Fixes#9914 ("Since 1beab8b was merged, systemd has been failing to start
systemd-resolved inside unprivileged containers" ... "Failed to re-mount
'/run/systemd/unit-root/dev' read-only: Operation not permitted").
> It's basically my fault :-). I pointed out we could remount read-only
> without MS_BIND when reviewing the PR that added TemporaryFilesystem=,
> and poettering suggested to change PrivateDevices= at the same time.
> I think it's safe to change back, and I don't expect anyone will notice
> a difference in behaviour.
>
> It just surprised me to realize that
> `TemporaryFilesystem=/tmp:size=10M,ro,nosuid` would not apply `ro` to the
> superblock (underlying filesystem), like mount -osize=10M,ro,nosuid does.
> Maybe a comment could note the kernel version (v4.18), that lets you
> remount without MS_BIND inside a user namespace.
This makes the code longer and I guess this function is still ugly, sorry.
One obstacle to cleaning it up is the interaction between
`PrivateDevices=yes` and `ReadOnlyPaths=/dev`. I've added a test for the
existing behaviour, which I think is now the correct behaviour.
Only report OOM if that was actually the error of the operation,
explicitly report the possible error that a syscall was already blocked
with a different errno and translate that into a more sensible errno
(EEXIST only makes sense in connection to the hashmap), and pass through
all other potential errors unmodified. Part of #9939.
`systemd-resolved.service` runs as `User=systemd-resolved`, and uses certain
Capabilit{y,ies} magic. By my understanding, this means it is started with a
number of "privileges". Indeed, `capabilities(7)` explains
> Linux divides the privileges traditionally
> associated with superuser into distinct units, known as capabilities,
> which can be independently enabled and disabled."
This situation appears to contradict our current code comment which said
> If we are not running as root we assume all privileges are already dropped.
This appears to be a confusion in the comment only. The rest of the code
tells a much clearer story. (Don't ask me if the story is correct.
`capabilities(7)` scares me). Let's tweak the comment to make it consistent
and avoid worrying readers about this.
If logind is not supported, don't try to schedule a shutdown,
immediately poweroff. This is the behavior indicated by the current
message given to the user, but the command is returning an error. I
believe this was broken on this commit:
7f96539d45
sfeatures is a "struct ethtool_sfeatures". Use sizeof() on the correct
data type.
Since "struct ethtool_gstrings" is larger than "struct ethtool_sfeatures",
this had no serious consequences.
Fixes: 50725d10e3
When computing the next network prefix to assign, compute the next
prefix to allocate based on the intended /64 assignment, not the
given prefix length for the whole prefix, e.g. /48, given to
systemd-networkd.
Fixes#9626.
The option is now called simply "Encapsulation=".
Also, "ignoring" is rather misleading, because we use to to mean that some line
is being ignored. Here the whole tunnel is dropped.
Use "falling back" instead of "fallback".
Also, it's not an application-specific machine ID, but rather an
application-and-machine-specific ID. Let's call it "app-machine-speicific" for
short.
This work add support to generic netlink to sd-netlink.
See https://lwn.net/Articles/208755/
networkd: add support FooOverUDP support to IPIP tunnel netdev
https://lwn.net/Articles/614348/
Example conf:
/lib/systemd/network/1-fou-tunnel.netdev
```
[NetDev]
Name=fou-tun
Kind=fou
[FooOverUDP]
Port=5555
Protocol=4
```
/lib/systemd/network/ipip-tunnel.netdev
```
[NetDev]
Name=ipip-tun
Kind=ipip
[Tunnel]
Independent=true
Local=10.65.208.212
Remote=10.65.208.211
FooOverUDP=true
FOUDestinationPort=5555
```
$ ip -d link show ipip-tun
```
5: ipip-tun@NONE: <POINTOPOINT,NOARP> mtu 1472 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ipip 10.65.208.212 peer 10.65.208.211 promiscuity 0
ipip remote 10.65.208.211 local 10.65.208.212 ttl inherit pmtudisc encap fou encap-sport auto encap-dport 5555 noencap-csum noencap-csum6 noencap-remcsum numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
```
Pretty much all intel cpus have had RDRAND in a long time. While
CPU-internal RNG are widely not trusted, for seeding hash tables it's
perfectly OK to use: we don't high quality entropy in that case, hence
let's use it.
This is only hooked up with 'high_quality_required' is false. If we
require high quality entropy the kernel is the only source we should
use.
This makes two changes to the namespacing code:
1. We'll only gracefully skip service namespacing on access failure if
exclusively sandboxing options where selected, and not mount-related
options that result in a very different view of the world. For example,
ignoring RootDirectory=, RootImage= or Bind= is really probablematic,
but ReadOnlyPaths= is just a weaker sandbox.
2. The namespacing code will now return a clearly recognizable error
code when it cannot enforce its namespacing, so that we cannot
confuse EPERM errors from mount() with those from unshare(). Only the
errors from the first unshare() are now taken as hint to gracefully
disable namespacing.
Fixes: #9844#9835
The fstab entry may contain comment/application-specific options, like
for example x-systemd.automount or x-initrd.mount.
With the recent switch to libmount, the mount options during remount are
now gathered via mnt_fs_get_options(), which returns the merged fstab
options with the effective options in mountinfo.
Unfortunately if one of these application-specific options are set in
fstab, the remount will fail with -EINVAL.
In systemd 238:
Remounting '/test-x-initrd-mount' read-only in with options
'errors=continue,user_xattr,acl'.
In systemd 239:
Remounting '/test-x-initrd-mount' read-only in with options
'errors=continue,user_xattr,acl,x-initrd.mount'.
Failed to remount '/test-x-initrd-mount' read-only: Invalid argument
So instead of using mnt_fs_get_options(), we're now using both
mnt_fs_get_fs_options() and mnt_fs_get_vfs_options() and merging the
results together so we don't get any non-relevant options from fstab.
Signed-off-by: aszlig <aszlig@nix.build>
A follow-up for commit 9d874aec45.
This patch makes "path" parameter mandatory in fd_set_*() helpers removing the
need to use fd_get_path() when NULL was passed. The caller is supposed to pass
the fd anyway so assuming that it also knows the path should be safe.
Actually, the only case where this was useful (or used) was when we were
walking through directory trees (in item_do()). But even in those cases the
paths could be constructed trivially, which is still better than relying on
fd_get_path() (which is an ugly API).
A very succinct test case is also added for 'z/Z' operators so the code dealing
with recursive operators is tested minimally.
Let's fold get_user_creds_clean() into get_user_creds(), and introduce a
flags argument for it to select "clean" behaviour. This flags parameter
also learns to other new flags:
- USER_CREDS_SYNTHESIZE_FALLBACK: in this mode the user records for
root/nobody are only synthesized as fallback. Normally, the synthesized
records take precedence over what is in the user database. With this
flag set this is reversed, and the user database takes precedence, and
the synthesized records are only used if they are missing there. This
flag should be set in cases where doing NSS is deemed safe, and where
there's interest in knowing the correct shell, for example if the
admin changed root's shell to zsh or suchlike.
- USER_CREDS_ALLOW_MISSING: if set, and a UID/GID is specified by
numeric value, and there's no user/group record for it accept it
anyway. This allows us to fix#9767
This then also ports all users to set the most appropriate flags.
Fixes: #9767
[zj: remove one isempty() call]
On the host these symlinks are created by udev, and we consider them API
and make use of them ourselves at various places. Hence when running a
private /dev, also create these symlinks so that lookups by major/minor
work in such an environment, too.
This is some extra protection for sloppy "golden master" systems, where
images are duplicated many times but the random seed is not
deleted (or reset for each copy). That golden master systems have to
reset /etc/machine-id is better known, and easier to notice (as having
the same ID will result in address conflicts and suchlike quite often).
Hence let's write the machine ID into /dev/urandom, in case it has been
initialized and unlikely the stored random seed has been provisioned
differently on each image.
Note that we don't credit the entropy either way, hence in the case
there's a cycle of a) generating the machine-id early at boot and b)
writing it back into /dev/urandom late at boot it shouldn't matter. It's
never going to make things worse, just in a few cases better.
When stdin/stdout/stderr is initialized from an fd, let's read the tty
name of it if we can, and pass that to PAM.
This makes sure that "machinectl shell" sessions have proper TTY fields
initialized that "loginctl" then shows.
This is a bit like the info link in most of GNU's --help texts, but we
don't do info but man pages, and we make them properly clickable on
terminal supporting that, because awesome.
I think it's generally advisable to link up our (brief) --help texts and
our (more comprehensive) man pages a bit, so this should be an easy and
straight-forward way to do it.
This fixes a memory leak:
```
d5070e2f67ededca022f81f2941900606b16f3196b2268e856295f59._openpgpkey.gmail.com: resolve call failed: 'd5070e2f67ededca022f81f2941900606b16f3196b2268e856295f59._openpgpkey.gmail.com' not found
=================================================================
==224==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 65 byte(s) in 1 object(s) allocated from:
#0 0x7f71b0878850 in malloc (/usr/lib64/libasan.so.4+0xde850)
#1 0x7f71afaf69b0 in malloc_multiply ../src/basic/alloc-util.h:63
#2 0x7f71afaf6c95 in hexmem ../src/basic/hexdecoct.c:62
#3 0x7f71afbb574b in string_hashsum ../src/basic/gcrypt-util.c:45
#4 0x56201333e0b9 in string_hashsum_sha256 ../src/basic/gcrypt-util.h:30
#5 0x562013347b63 in resolve_openpgp ../src/resolve/resolvectl.c:908
#6 0x562013348b9f in verb_openpgp ../src/resolve/resolvectl.c:944
#7 0x7f71afbae0b0 in dispatch_verb ../src/basic/verbs.c:119
#8 0x56201335790b in native_main ../src/resolve/resolvectl.c:2947
#9 0x56201335880d in main ../src/resolve/resolvectl.c:3087
#10 0x7f71ad8fcf29 in __libc_start_main (/lib64/libc.so.6+0x20f29)
SUMMARY: AddressSanitizer: 65 byte(s) leaked in 1 allocation(s).
```
The references to the dns_server are now setup after the tls connection is setup.
This ensures that the stream got fully stopped when the initial tls setup failed
instead of having the unref being blocked by the reference to the stream by the server.
Therefore on_stream_io would no longer be called with a half setup encrypted connection.
Fixes the issue reported in #9838.
Previously, we'd act immediately on StopWhenUnneeded= when a unit state
changes. With this rework we'll maintain a queue instead: whenever
there's the chance that StopWhenUneeded= might have an effect we enqueue
the unit, and process it later when we have nothing better to do.
This should make the implementation a bit more reliable, as the unit notify event
cannot immediately enqueue tons of side-effect jobs that might
contradict each other, but we do so only in a strictly ordered fashion,
from the main event loop.
This slightly changes the check when to consider a unit "unneeded".
Previously, we'd assume that a unit in "deactivating" state could also
be cleaned up. With this new logic we'll only consider units unneeded
that are fully up and have no job queued. This means that whenever
there's something pending for a unit we won't clean it up.
The function replaces a couple commas, a semicolon and the final newline with
zero bytes in the string passed to it. The 'const' seems to have been added
by accident during a bulk edit (more specifically 3b3154df7e).
The qgroup logic (types 'q' and 'Q') only has an effect if there's no previous
setup at all, and any explicitly configured subvolumes with their qgroups are
left entirely unmodified.
The idea is that if users want a different logic than the one we set up by
default, then by all means they should do that before hand, and tmpfiles won't
override their logic.
tmpfiles now passes an O_PATH fd to btrfs_subvol_make_fd() under the
assumption it will accept it like mkdirat() does. So far this assumption
was wrong, let's correct that.
Without that tmpfiles' on btrfs file systems failed systematically...
fd_get_path() is an ugly API, as it creates ambiguities related to the
" (deleted)" suffix /proc/$PID/fd/$FD shows. Let's use it a bit less
excessively, and whenever we have a good valid path already, let's
simply pass that along, instead of forgetting it in one stackframe and
reacquiring it in the next.
Even if built without gcrypt, show the relevant options in help message.
Otherwise, the help message diverges from the man page or suggestions
by the shell completion.
The "features" fields is parsed as a tristate value. The values
are thus not of type NetDevFeature enum but int. The NetDevFeature
enum is instead the index for the features array.
Adjust the type. In practice, this had no impact because NetDevFeature
enum commonly has size of int.
Also, don't use memset() 0xFF to initilize the int with -1. While
it works correctly in practice, it feels ugly.
This makes it possible to wait until boot is finished without having to poll
for this command repeatedly, instead using the syntax:
$ systemctl is-system-running --wait
Waiting is implemented by waiting for the StartupFinished signal to be posted
on the bus.
Register the matcher before checking for the property to avoid race conditions.
Tested by artificially delaying startup with a oneshot service and calling this
command, checked that it emitted `running` and exited with a 0 return code as
soon as the delay service completed startup.
Also tested that booting to degraded state unblocks the command.
Inserted a delay between getting the property and waiting for the signal and
confirmed this seems to work free of race conditions.
Updated the --help text (under --wait) and the man page to document the new
feature.
This function doesn't really implement ordering, but CMP() is still fine to use
there. Keep the comment in place, just update it slightly to indicate that.
Looked for definitions of functions using the *_compare_func() suffix.
Tested:
- Unit tests passed (ninja -C build/ test)
- Installed this build and booted with it.
Macro returns -1, 0, 1 depending on whether a < b, a == b or a > b.
It's safe to use on unsigned types.
Add tests to confirm corner cases are properly covered.
Drop __extension__, since we don't use gcc -Wpedantic or -ansi.
Reformat code for spacing. Add spaces after commas almost everywhere.
Reindent code blocks in macro definitions, for consistency.
--machine has been missing for a while in systemd-stdio-bridge
this syntax can be switched to be more standard.
v2: Support the old syntax too.
timedatectl -H server1.myhostingcompany.com:5555/container1
Closes: #8071
Previously, we'd only ever read 512 byte from the random seed file,
under the assumption we won't need more. With this change we'll read the
full file, even if it is larger.
The idea behind htis change is that people can dump additional data into the
random seed file offline if they like, and it can be low quality, and
we'll seed the pool with it anyway. Moreover, if people are paranoid and
want us to save/restore a bigger seed, it's easy to do: just truncate
the file to the right size and we'll save/restore as much in the future.
This also reworks the file a bit, introducing two clear if blocks that
load and that save the random seed, and that each are conditionalized
more carefully.
On target boards without RTC, `t->kernel_time` is 0 or 1 usec.
`systemd-analyze` reads this value over D-Bus from
`org.freedesktop.systemd1.Manager`, property `KernelTimestamp`.
The issue is: if `t->kernel_time` is 0, `systemd-analyze` does not print
the kernel time:
~~~~
$ systemd-analyze
Startup finished in 1.860s (userspace) = 5.957s
~~~~
This commit fixes the misbehaviour:
~~~~
$ systemd-analyze
Startup finished in 3.866s (kernel) + 2.015s (userspace) = 5.881s
~~~~
Fixes#7721.
v2: fixes one more condition (by Yu Watanabe <watanabe.yu+github@gmail.com>)
v3: fixes one more condition (by Kirill Marinushkin <kmarinushkin@de.adit-jv.com>)
RootImage= may require the following settings
```
DeviceAllow=/dev/loop-control rw
DeviceAllow=block-loop rwm
DeviceAllow=block-blkext rwm
```
This adds the following settings implicitly when RootImage= is
specified.
Fixes#9737.