Two functional changes:
- "/" is now refused. The test is adjusted.
- The trailing NUL is *not* included in the returned size for abstract size. The
comments in sockaddr_un_set_path() indicate that this is the right thing to do,
and the code in socket_address_parse() wasn't doing that.
Closes#12624.
The formatting in systemd.socket.xml is updated a bit.
Currently in_addr_port_ifindex_name_to_string() always prints the ifindex
numerically. This is not super useful since the interface numbers are
semi-random. Should we use interface names in preference?
We would set .type to a fake value. All real callers (outside of tests)
immediately overwrite .type with a proper value after calling
socket_address_parse(). So let's not set it and adjust the few places
that relied on it being set to the fake value.
socket_address_parse() is modernized to only set the output argument on
success.
One special syntax is not supported anymore: "iface:port" would be parsed as an
interface name plus numerical port, equivalent to "[::]%iface:port". This was
added in 542563babd, but was undocumented, and we had no tests for it. It seems
that this actually wasn't doing anything useful, because the kernel only uses the
scope identifier for link-local addresses.
We don't try to resolve invalid ifnames as all. A different return
code is used. This difference will be verified later in test_socket_address_parse()
when socket_address_parse() is converted to use
in_addr_port_ifindex_name_from_string_auto().
The tricky part here is that the function is not allowed to fail in this code
path. Initially, I wanted to change the return value to allow it to fail, but
this cascades through all the places where fstab_test_option() and friends are
used; updating all those sites would be a lot of work. And since quoting is not
allowed here, a simple loop with strcspn() is easy to do.
Let's make sure we keep a reference to the event source
(Note that this code is currently not used, which is why this was never
used: in all cases we do not add listener fds after the event is
attached, but before. In that case this code is not called.)
We support read-only ptyfwd options, and on those the input event source
won't be allocated. Deal with that and don't invoke a function on it
that will then instantly fail.
The new methods work as the unflavoured ones, but takes flags as a
single uint64_t DBUS parameters instead of different booleans, so
that it can be extended without breaking backward compatibility.
Add new flag to allow adding/removing symlinks in
[/etc|/run]/systemd/system.attached so that portable services
configuration files can be self-contained in those directories, without
affecting the system services directories.
Use the new methods and flags from portablectl --enable.
Useful in case /etc is read-only, with only the portable services
directories being mounted read-write.
Let's make libcryptsetup a dlopen() style dep for PID 1 (i.e. for
RootImage= and stuff), systemd-growfs and systemd-repart. (But leave to
be a regulra dep in systemd-cryptsetup, systemd-veritysetup and
systemd-homed since for them the libcryptsetup support is not auxiliary
but pretty much at the core of what they do.)
This should be useful for container images that want systemd in the
payload but don't care for the cryptsetup logic since dm-crypt and stuff
isn't available in containers anyway.
Fixes: #8249
"crypt-util.c" is such a generic name, let's avoid that, in particular
as libc's/libcrypt's crypt() function is so generically named too that
one might thing this is about that. Let's hence be more precise, and
make clear that this is about cryptsetup, and nothing else.
We already had cryptsetup-util.[ch] in src/cryptsetup/ doing keyfile
management. To avoid the needless confusion, let's rename that file to
cryptsetup-keyfile.[ch].
strv_extend_strv_utf8_only() uses a temporary buffer to make the implementation
conscise. Otherwise we'd have to rewrite all of strv_extend_strv() which didn't
seem worth the trouble for this one use outside of a hot path.
If the data is not serializable, we just pretend it doesn't exists.
This fixes#16683 and https://bugs.gentoo.org/735072 in a second way.
They both are both short and contain similar parts and various helper will be
shared between both parts of the code so it's easier to use a single file.
JSON strings must be utf-8-clean. We also verify this in json_parse_string()
so we would reject a message with invalid utf-8 anyway.
It would probably be slightly cheaper to detect non-conformaning strings in
serialization, but then we'd have to fail serialization. By doing this early,
we give the caller a chance to handle the error nicely.
The test is adjusted to contain a valid utf-8 string after decoding of the
utf-32 encoding in json ("विवेकख्यातिरविप्लवा हानोपायः।", something about the
cessation of ignorance).
Upon reception of a message which fails in json_parse(), we would proceed to
parse it again from a deferred callback and hang. Once we have realized that
the message is invalid, let's move the pointer in the buffer even if the
message is invalid. We don't want to look at this data again.
(before) $ build-rawhide/userdbctl --output=json user test.user
n/a: varlink: setting state idle-client
/run/systemd/userdb/io.systemd.Multiplexer: Sending message: {"method":"io.systemd.UserDatabase.GetUserRecord","parameters":{"userName":"test.user","service":"io.systemd.Multiplexer"}}
/run/systemd/userdb/io.systemd.Multiplexer: varlink: changing state idle-client → awaiting-reply
/run/systemd/userdb/io.systemd.Multiplexer: New incoming message: {...}
/run/systemd/userdb/io.systemd.Multiplexer: varlink: changing state awaiting-reply → pending-disconnect
/run/systemd/userdb/io.systemd.Multiplexer: New incoming message: {...}
/run/systemd/userdb/io.systemd.Multiplexer: varlink: changing state pending-disconnect → disconnected
^C
(after) $ n/a: varlink: setting state idle-client
/run/systemd/userdb/io.systemd.Multiplexer: Sending message: {"method":"io.systemd.UserDatabase.GetUserRecord","parameters":{"userName":"test.user","service":"io.systemd.Multiplexer"}}
/run/systemd/userdb/io.systemd.Multiplexer: varlink: changing state idle-client → awaiting-reply
/run/systemd/userdb/io.systemd.Multiplexer: New incoming message: {...}
/run/systemd/userdb/io.systemd.Multiplexer: Failed to parse JSON: Invalid argument
/run/systemd/userdb/io.systemd.Multiplexer: varlink: changing state awaiting-reply → pending-disconnect
/run/systemd/userdb/io.systemd.Multiplexer: varlink: changing state pending-disconnect → processing-disconnect
Got lookup error: io.systemd.Disconnected
/run/systemd/userdb/io.systemd.Multiplexer: varlink: changing state processing-disconnect → disconnected
Failed to find user test.user: Input/output error
This should fix#16683 and https://bugs.gentoo.org/735072.
We would reject various passwords that glibc accepts, for example ""
or any descrypted password. Accounts with empty password are definitely
useful, for example for testing or in scenarios where a password is not
needed. Also, using weak encryption methods is probably not a good idea,
it's not the job of our nss helpers to decide that: they should just
faithfully forward whatever data is there.
Also rename the function to make it more obvious that the returned answer
is not in any way certain.
Instead of assuming that more-recently modified directories have higher mtime,
just look for any mtime changes, up or down. Since we don't want to remember
individual mtimes, hash them to obtain a single value.
This should help us behave properly in the case when the time jumps backwards
during boot: various files might have mtimes that in the future, but we won't
care. This fixes the following scenario:
We have /etc/systemd/system with T1. T1 is initially far in the past.
We have /run/systemd/generator with time T2.
The time is adjusted backwards, so T2 will be always in the future for a while.
Now the user writes new files to /etc/systemd/system, and T1 is updated to T1'.
Nevertheless, T1 < T1' << T2.
We would consider our cache to be up-to-date, falsely.
N_DEVICE_NODE_LIST_ATTEMPTS is unconditionally used since version 246 and
ac1f3ad05f
However, this variable is only defined if HAVE_BLKID is set resulting in
the following build failure if cryptsetup is enabled but not libblkid:
../src/shared/dissect-image.c:1336:34: error: 'N_DEVICE_NODE_LIST_ATTEMPTS' undeclared (first use in this function)
1336 | for (unsigned i = 0; i < N_DEVICE_NODE_LIST_ATTEMPTS; i++) {
|
Fixes:
- http://autobuild.buildroot.org/results/67782c225c08387c1bbcbea9eee3ca12bc6577cd
On systems without an RTC, systemd currently sets the clock to a
compile-time epoch value, derived from the NEWS file in the
repository. This is not ideal as the initial clock hence depends
on the last time systemd was built, not when the image was compiled.
Let's provide a different way here and look at `/usr/lib/clock-epoch`.
If that file exists, it's timestamp for the last modification will be
used instead of the compile-time default.
I find this version much more readable.
Add replacement defines so that when acl/libacl.h is not available, the
ACL_{READ,WRITE,EXECUTE} constants are also defined. Those constants were
declared in the kernel headers already in 1da177e4c3f41524e886b7f1b8a0c1f,
so they should be the same pretty much everywhere.
It appears LOOP_CONFIGURE in 5.8 is even more broken than initially
thought: it doesn't properly propgate lo_sizelimit to the block device
layer. :-(
Let's hence check the block device size immediately after issuing
LOOP_CONFIGURE, and if it doesn't match what we just set let's fallback
to the old ioctls.
This means LOOP_CONFIGURE currently works correctly only for the most
simply case: no partition table logic and no size limit. Sad!
(Kernel people should really be told about the concepts of tests and
even CI, one day!)
This matches how we handle things everywhere else, i.e. in .mount units,
and similar: when a mount point dir is missing, we create it, let's do
so too when dealing with disk images.
This makes things a lot simpler, more robust, and systematic.
fat is a bit more limited in volume name length and UUID support. Let's
add some special support for it.
This is particularly useful to generate EFI system partitions.
Kernel 5.8 gained a hidepid= implementation that is truly per procfs,
which allows us to mount a distinct once into every unit, with
individual hidepid= settings. Let's expose this via two new settings:
ProtectProc= (wrapping hidpid=) and ProcSubset= (wrapping subset=).
Replaces: #11670
We return BUS_ERROR_NO_SUCH_UNIT a.k.a. org.freedesktop.systemd1.NoSuchUnit
in various places. In #16813:
Aug 22 06:14:48 core sudo[2769199]: pam_systemd_home(sudo:account): Failed to query user record: Unit dbus-org.freedesktop.home1.service not found.
Aug 22 06:14:48 core dbus-daemon[5311]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.home1.service': Unit dbus-org.freedesktop.home1.service not found.
Aug 22 06:14:48 core dbus-daemon[5311]: [system] Activating via systemd: service name='org.freedesktop.home1' unit='dbus-org.freedesktop.home1.service' requested by ':1.6564' (uid=0 pid=2769199 comm="sudo su ")
This particular error comes from bus_unit_validate_load_state() in pid1:
case UNIT_NOT_FOUND:
return sd_bus_error_setf(error, BUS_ERROR_NO_SUCH_UNIT, "Unit %s not found.", u->id);
It seems possible that we should return a different error, but it doesn't really
matter: if we change pid1 to return a different error, we still need to handle
BUS_ERROR_NO_SUCH_UNIT as in this patch to handle pid1 with current code.
This patch adds seccomp support to the riscv64 architecture. seccomp
support is available in the riscv64 kernel since version 5.5, and it
has just been added to the libseccomp library.
riscv64 uses generic syscalls like aarch64, so I used that architecture
as a reference to find which code has to be modified.
With this patch, the testsuite passes successfully, including the
test-seccomp test. The system boots and works fine with kernel 5.4 (i.e.
without seccomp support) and kernel 5.5 (i.e. with seccomp support). I
have also verified that the "SystemCallFilter=~socket" option prevents a
service to use the ping utility when running on kernel 5.5.
On new kernels (>= 5.8) unprivileged users may create the 0:0 character
device node. Which is great, as we can use that as inaccessible device
nodes if we run unprivileged. Hence, change how we find the right
inaccessible device inodes: when the user asks for a block device node,
but we have none, try the char device node first. If that doesn't exist,
fall back to the socket node as before.
This means that:
1. in the best case we'll return a node if the right device node type
2. otherwise we hopefully at least can return a device node if one asked
for even if the type doesn't match (i.e. we return char instead of
the requested block device node)
3. in the worst case (old kernels…) we'll return a socket node
Similarly to "setup" vs. "set up", "fallback" is a noun, and "fall back"
is the verb. (This is pretty clear when we construct a sentence in the
present continous: "we are falling back" not "we are fallbacking").
Follow the same model established for RootImage and RootImageOptions,
and allow to either append a single list of options or tuples of
partition_number:options.
The concept is flawed, and mostly useless. Let's finally remove it.
It has been deprecated since 90a2ec10f2 (6
years ago) and we started to warn since
55dadc5c57 (1.5 years ago).
Let's get rid of it altogether.
Let's make /run/host the sole place we pass stuff from host to container
in and place the "inaccessible" nodes in /run/host too.
In contrast to the previous two commits this is a minor compat break, but
not a relevant one I think. Previously the container manager would place
these nodes in /run/systemd/inaccessible/ and that's where PID 1 in the
container would try to add them too when missing. Container manager and
PID 1 in the container would thus manage the same dir together.
With this change the container manager now passes an immutable directory
to the container and leaves /run/systemd entirely untouched, and managed
exclusively by PID 1 inside the container, which is nice to have clear
separation on who manages what.
In order to make sure systemd then usses the /run/host/inaccesible/
nodes this commit changes PID 1 to look for that dir and if it exists
will symlink it to /run/systemd/inaccessible.
Now, this will work fine if new nspawn and new pid 1 in the container
work together. as then the symlink is created and the difference between
the two dirs won't matter.
For the case where an old nspawn invokes a new PID 1: in this case
things work as they always worked: the dir is managed together.
For the case where different container manager invokes a new PID 1: in
this case the nodes aren't typically passed in, and PID 1 in the
container will try to create them and will likely fail partially (though
gracefully) when trying to create char/block device nodes. THis is fine
though as there are fallbacks in place for that case.
For the case where a new nspawn invokes an old PID1: this is were the
(minor) incompatibily happens: in this case new nspawn will place the
nodes in the /run/host/inaccessible/ subdir, but the PID 1 in the
container won't look for them there. Since the nodes are also not
pre-created in /run/systed/inaccessible/ PID 1 will try to create them
there as if a different container manager sets them up. This is of
course not sexy, but is not a total loss, since as mentioned fallbacks
are in place anyway. Hence I think it's OK to accept this minor
incompatibility.
Apparently both Fedora and suse default to btrfs now, it should hence be
good enough for us too.
This enables a bunch of really nice things for us, most importanly we
can resize home directories freely (i.e. both grow *and* shrink) while
online. It also allows us to add nice subvolume based home directory
snapshotting later on.
Also, whenever we mention the three supported types, alaways mention
them in alphabetical order, which is also our new order of preference.
Some masks shouldn't be needed externally, so keep their functions in
the module (others would fit there too but they're used in tests) to
think twice if something would depend on them.
Drop unused function cg_attach_many_everywhere.
Use cgroup_realized instead of cgroup_path when we actually ask for
realized.
This should not cause any functional changes.
When we are about to derealize a controller on v1 cgroup, we first
attempt to delete the controller cgroup and migrate afterwards. This
doesn't work in practice because populated cgroup cannot be deleted.
Furthermore, we leave out slices from migration completely, so
(un)setting a control value on them won't realize their controller
cgroup.
Rework actual realization, unit_create_cgroup() becomes
unit_update_cgroup() and make sure that controller hierarchies are
reduced when given controller cgroup ceased to be needed.
Note that with this we introduce slight deviation between v1 and v2 code
-- when a descendant unit turns off a delegated controller, we attempt
to disable it in ancestor slices. On v2 this may fail (kernel enforced,
because of child cgroups using the controller), on v1 we'll migrate
whole subtree and trim the subhierachy. (Previously, we wouldn't take
away delegated controller, however, derealization was broken anyway.)
Fixes: #14149
It is possible that we will be running with an upgraded libseccomp, in which
case libseccomp might know the syscall name, even if the number is not known at
the time when systemd is being compiled. The guard only serves to break such
upgrades, by requiring that we also recompile systemd.
For s390-specific syscalls, use a define to exclude them, so that that we don't
try to filter them on other arches.
(cherry picked from commit 6cf852e79eb0eced2f77653941f9c75c3bd79386)
Also, let's move the glue for this to src/shared/ so that we later can
reuse this in sysemd-firstboot.
Given that libpwquality is a more a leaf dependency, let's make it
runtime optional, so that downstream distros can downgrade their package
deps from Required to Recommended.
If we don't succeed on the first try it's because another process is
opening the same device. Do a microsleep for 2ms to increase the
chances it has completed the next time around the loop.
The symlink /dev/mapper/dm_name is created by udev after a mapper
device is set up. So libdevmapper/libcrypsetup might tell us that
a verity device exists, but the symlink we use as the source for
the mount operation might not be there yet.
Instead of falling back to a new unique device set up, wait for
the udev event matching on the expected devlink for at least 100ms
(after which the benefits of sharing a device in terms of setup
time start to disappear - on my production machines, opening a new
verity device seems to take between 150ms and 300ms)
This effectively makes little difference because we exit soon later
anyway, which will close the fds, too. However, it's still useful since
it means the parent will get EOF events on them in the order we process
things and isn't delayed to process the data from the pipes until the
child dies.
LOOP_CONFIGURE allows us to configure a loopback device in one ioctl
instead of two, which is not just faster but also removes the race that
udev might start probing the device before we adjusted things properly.
Unfortunately LOOP_CONFIGURE is broken in regards to LO_FLAGS_PARTSCAN
as of kernel 5.8.0. This patch contains a work-around for that, to
fallback to old behaviour if partition scanning is requested but does
not work. Sucks a bit.
Proposed upstream fix for that issue:
https://lkml.org/lkml/2020/8/6/97
Let's fix up invalid GECOS fields both when we convert from NSS to JSON
and the other way round.
Kinda sucks we have to do that, but NSS does it when writing data to
/etc/passwd, so let's do the same.
Fixes: #16668
User records have the realname/gecos fields, groups never had that, but
it would really be useful to have it, hence let's add it with similar
semantics.
We enforce the same syntax as for GECOS, since it's better to start with
strict rules and losen them later instead of the opposite.
Follows the same pattern and features as RootImage, but allows an
arbitrary mount point under / to be specified by the user, and
multiple values - like BindPaths.
Original implementation by @topimiettinen at:
https://github.com/systemd/systemd/pull/14451
Reworked to use dissect's logic instead of bare libmount() calls
and other review comments.
Thanks Topi for the initial work to come up with and implement
this useful feature.
Apparently, if IO is still in flight at the moment we invoke LOOP_CLR_FD
it is likely simply dropped (probably because yanking physical storage,
such as a USB stick would drop it too). Let's protect ourselves against
that and always sync explicitly before we invoke it.
The explicit limit is dropped, which means that we return to the kernel default
of 50% of RAM. See 362a55fc14 for a discussion why that is not as much as it
seems. It turns out various applications need more space in /dev/shm and we
would break them by imposing a low limit.
While at it, rename the define and use a single macro for various tmpfs mounts.
We don't really care what the purpose of the given tmpfs is, so it seems
reasonable to use a single macro.
This effectively reverts part of 7d85383edb. Fixes#16617.
Allows to specify mount options for RootImage.
In case of multi-partition images, the partition number can be prefixed
followed by colon. Eg:
RootImageOptions=1:ro,dev 2:nosuid nodev
In absence of a partition number, 0 is assumed.
This should be enough to fix https://bugzilla.redhat.com/show_bug.cgi?id=1856514.
But the limit should be significantly higher than 10% anyway. By setting a
limit on /tmp at 10% we'll break many reasonable use cases, even though the
machine would deal fine with a much larger fraction devoted to /tmp.
(In the first version of this patch I made it 25% with the comment that
"Even 25% might be too low.". The kernel default is 50%, and we have been using
that seemingly without trouble since https://fedoraproject.org/wiki/Features/tmp-on-tmpfs.
So let's just make it 50% again.)
See 7d85383edb.
(Another consideration is that we learned from from the whole initiative with
zram in Fedora that a reasonable size for zram is 0.5-1.5 of RAM, and that pretty
much all systems benefit from having zram or zswap enabled. Thus it is reasonable
to assume that it'll become widely used. Taking the usual compression effectiveness
of 0.2 into account, machines have effective memory available of between
1.0 - 0.2*0.5 + 0.5 = 1.4 (for zram sized to 0.5 of RAM) and
1.0 - 0.2*1.5 + 1.5 = 2.2 (for zram 1.5 sized to 1.5 of RAM) times RAM size.
This means that the 10% was really like 7-4% of effective memory.)
A comment is added to mount-util.h to clarify that tmp.mount is separate.
Opening a verity device is an expensive operation. The kernelspace operations
are mostly sequential with a global lock held regardless of which device
is being opened. In userspace jumps in and out of multiple libraries are
required. When signatures are used, there's the additional cryptographic
checks.
We know when two devices are identical: they have the same root hash.
If libcrypsetup returns EEXIST, double check that the hashes are really
the same, and that either both or none have a signature, and if everything
matches simply remount the already open device. The kernel will do
reference counting for us.
In order to quickly and reliably discover if a device is already open,
change the node naming scheme from '/dev/mapper/major:minor-verity' to
'/dev/mapper/$roothash-verity'.
Unfortunately libdevmapper is not 100% reliable, so in some case it
will say that the device already exists and it is active, but in
reality it is not usable. Fallback to an individually-activated
unique device name in those cases for robustness.
This changes the code to allow looking at multiple files with different
prefixes, but uses "/etc" and "/usr/lib". rpm-ostree uses
/usr/lib/{passwd,group} with nss-altfiles. I see no harm in simply trying both
paths on all systems.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1857530.
A minor memory leak is fixed: hashmap_put() returns -EEXIST is the key is
present *and* and the value is different. It return 0 if the value is the
same. Thus, we would leak the user/group name if it was specified multiple
times with the same uid/gid. I opted to remove the warning message completely:
with multiple files it is reasonable to have the same name defined more than
once. But even with one file the warning is dubious: all tools that read those
files deal correctly with duplicate entries and we are not writing a linter.
The test binary has two modes: in the default argument-less mode, it
just checks that "root" can be resolved. When invoked manually, a root
prefix and user/group names can be specified.
let's separate things out a bit, to make it easier to discern log output
and catalog data.
catalog data is now colored green (which is a color we don't use for log
data currently), and prefixed with a block shade.
sssd people don't like enumeration and for some other cases it's not
nice to support either, in particular when synthesizing records for
container/userns UID/GID ranges.
Hence, let's make enumeration optional.
Currently systemd-user-runtime-dir does not create the files in
/run/user/$UID/systemd/inaccessible with the default SELinux label.
The user and role part of these labels should be based on the user
related to $UID and not based on the process context of
systemd-user-runtime-dir.
Since v246-rc1 (9664be199a) /run/user/$UID/systemd is also created by
systemd-user-runtime-dir and should also be created with the default
SELinux context.
The call would always fail with:
systemd-userwork[780]: Failed to dlopen(libnss_systemd.so.2), ignoring: /usr/lib64libnss_systemd.so.2: cannot open shared object file: No such file or directory
Add support for creating a MACVLAN interface in "source" mode by
specifying Mode=source in the [MACVLAN] section of a .netdev file.
A list of allowed MAC addresses for the corresponding MACVLAN can also
be specified with the SourceMACAddress= option of the [MACVLAN] section.
An example .netdev file:
[NetDev]
Name=macvlan0
Kind=macvlan
MACAddress=02:DE:AD:BE:EF:00
[MACVLAN]
Mode=source
SourceMACAddress=02:AB:AB:AB:AB:01 02:CD:CD:CD:CD:01
SourceMACAddress=02:EF:EF:EF:EF:01
The same keys can also be specified in [MACVTAP] for MACVTAP kinds of
interfaces, with the same semantics.
This partially undoes the parent commit. We follow the symlink and
if it appears to be a symlink to /dev/null, even if /dev/null is not
present, we treat it as such. The addition of creation of /dev/null
in the test is reverted.
Right now systemd-update-utmp.service would fail on read-only /var because
it was not able to write the wtmp record. But it still writes the utmp
record just fine, so runtime information is OK. I don't think we need to
make too much fuss about not being able to save wtmp info.
There's some inconsistency in the what is considered a masked unit:
some places (i.e. load-fragment.c) use `null_or_empty()` while others
check if the file path is symlinked to "/dev/null". Since the latter
doesn't account for things like non-absolute symlinks to "/dev/null",
this commit switches the check for "/dev/null" to use `null_or_empty_path()`
We'd like to use it for FIDO2 tokens too, and the concept is entirely
generic, hence let's just reuse the field, but rename it. Read the old
name for compatibility, and treat the old name and the new name as
identical for most purposes.