For the planned nft backend we have three choices:
- open/close a new nfnetlink socket for every operation
- keep a nfnetlink socket open internally
- expose an opaque fw_ctx and stash all internal data there.
Originally I opted for the 2nd option, but during review it was
suggested to avoid static storage duration because of perceived
problems with threaded applications.
This adds fw_ctx and new/free functions, then converts the existing api
and nspawn and networkd to use it.
In a nutshell:
1. git mv firewall-util.c firewall-util-iptables.c
2. existing external functions gain _iptables_ in their names
3. firewall-util.c provides old function names
4. build system always compiles firewall-util.c,
firewall-util-iptables.c is conditional instead (libiptc).
5. The first call to any of the 'old' API functions performs
a probe that should return the preferred backend.
In a future step, we can add firewall-util-FOOTYPE.c, add its
probe function to firewall-util.c and then have calls to
fw_add_masq/local_dnat handed to the detected backend.
For now, only iptables backend exists, and no special probing
takes place for it, i.e. when systemd was built with iptables,
that will be used. If not, requests to add masquerade/dnat will
fail with same error (-EOPNOTSUPP) as before this change.
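The lazy probe and dispatch described above can be sketched roughly as follows. This is only an illustration under assumptions: the enum, the struct layout, fw_probe() and the simplified fw_add_masquerade() signature are hypothetical and not taken from the actual patch, and HAVE_LIBIPTC stands in for whatever the build system defines.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* All names below are hypothetical; this is not the actual systemd code. */
#ifndef HAVE_LIBIPTC
#define HAVE_LIBIPTC 0 /* assumption: set to 1 by the build system when libiptc is available */
#endif

typedef enum {
        FW_BACKEND_INVALID = -1, /* not probed yet */
        FW_BACKEND_NONE,         /* probed, no usable backend */
        FW_BACKEND_IPTABLES,
} fw_backend_t;

typedef struct fw_ctx {
        fw_backend_t backend;
} fw_ctx_t;

/* No special probing for iptables: it is used whenever it was compiled in. */
static fw_backend_t fw_probe(void) {
        return HAVE_LIBIPTC ? FW_BACKEND_IPTABLES : FW_BACKEND_NONE;
}

int fw_ctx_new(fw_ctx_t **ret) {
        fw_ctx_t *ctx;

        assert(ret);

        ctx = calloc(1, sizeof *ctx);
        if (!ctx)
                return -ENOMEM;

        ctx->backend = FW_BACKEND_INVALID;
        *ret = ctx;
        return 0;
}

fw_ctx_t *fw_ctx_free(fw_ctx_t *ctx) {
        free(ctx);
        return NULL;
}

/* Stand-in for the function living in firewall-util-iptables.c. */
static int fw_iptables_add_masquerade(bool add) {
        (void) add;
        return 0;
}

int fw_add_masquerade(fw_ctx_t *ctx, bool add) {
        assert(ctx);

        /* The first call probes for the preferred backend and remembers the result. */
        if (ctx->backend == FW_BACKEND_INVALID)
                ctx->backend = fw_probe();

        switch (ctx->backend) {
        case FW_BACKEND_IPTABLES:
                return fw_iptables_add_masquerade(add);
        default:
                return -EOPNOTSUPP;
        }
}
```

Keeping the probe result in the context rather than in a static variable is what avoids the static storage duration concern raised during review.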
For reference, the rules added by the libiptc/iptables backend look like this:
for service export (via systemd-nspawn):
[0:0] -A PREROUTING -p tcp -m tcp --dport $exportedport -m addrtype --dst-type LOCAL -j DNAT --to-destination $containerip:$port
[0:0] -A OUTPUT ! -d 127.0.0.0/8 -p tcp -m tcp --dport $exportedport -m addrtype --dst-type LOCAL -j DNAT --to-destination $containerip:$port
for ip masquerade:
[0:0] -A POSTROUTING -s network/prefix -j MASQUERADE
Make sure we don't add masquerading rules without an explicitly
specified network range we should be masquerading for.
The only caller aside from test case is
networkd-address.c which never passes a NULL source.
As it also passes the network prefix, that should always be > 0 as well.
This causes the expected test failures:
Failed to modify firewall: Invalid argument
Failed to modify firewall: Invalid argument
Failed to modify firewall: Invalid argument
Failed to modify firewall: Protocol not available
Failed to modify firewall: Protocol not available
Failed to modify firewall: Protocol not available
Failed to modify firewall: Protocol not available
The failing test cases are amended to expect failure on
NULL source or prefix instead of success.
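As a rough sketch of the check described above: the helper below is hypothetical (it formats a rule string instead of talking to libiptc) and only illustrates that a missing source or a zero prefix length is refused with -EINVAL, which matches the "Invalid argument" failures shown above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: renders the masquerade rule for a given network range,
 * refusing to do so when no explicit range was specified. */
int format_masquerade_rule(const char *source, unsigned source_prefixlen,
                           char *buf, size_t size) {
        int n;

        assert(buf);

        /* No rule without an explicitly specified network range. */
        if (!source || source_prefixlen == 0)
                return -EINVAL;

        n = snprintf(buf, size, "-A POSTROUTING -s %s/%u -j MASQUERADE",
                     source, source_prefixlen);
        if (n < 0 || (size_t) n >= size)
                return -ENOBUFS;

        return 0;
}
```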
Previously, we'd already have explicit logging for the case where
$XDG_RUNTIME_DIR is not set. Let's also add some explicit logging for
the EPERM/EACCES case. Let's also in both cases suggest the
--machine=<user>@.host syntax.
And while we are at it, let's remove side-effects from the macro.
By checking for both the EPERM/EACCES case and the $XDG_RUNTIME_DIR case
we will now catch both the cases where people use "su" to issue a
"systemctl --user" operation, and those where they (more correctly, but
still not good enough) call "su -".
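The decision about when to print the hint could look roughly like this. The function name and the hint strings are made up for illustration and are not the messages systemd prints; only the two conditions ($XDG_RUNTIME_DIR unset, EPERM/EACCES) come from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper; hint texts are illustrative only.
 * 'error' is a negative errno-style value from the failed connection attempt. */
const char *user_bus_connect_hint(int error, bool runtime_dir_set) {
        assert(error < 0);

        /* Typical for a plain "su": the variable is missing entirely. */
        if (!runtime_dir_set)
                return "$XDG_RUNTIME_DIR is not set, consider --machine=<user>@.host";

        /* Typical for "su -": the variable is set, but the bus is not accessible. */
        if (error == -EPERM || error == -EACCES)
                return "Access to the user bus was denied, consider --machine=<user>@.host";

        return NULL; /* no specific hint for other errors */
}
```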
Fixes: #17901
This is unfortunately harder to implement than it sounds. The user's bus
is bound to the user's lifecycle after all (i.e. only exists as long
as the user has at least one PAM session), and the path is dynamically (at
least theoretically, in practice it's going to be the same always)
generated via $XDG_RUNTIME_DIR in /run/.
To fix this properly, we'll thus go through PAM before connecting to a
user bus. Which is hard since we cannot just link against libpam in the
container, since the container might have been compiled entirely
differently. So our way out is to use systemd-run from outside, which
invokes a transient unit that does PAM from outside, doing so via D-Bus.
Inside the transient unit we then invoke systemd-stdio-bridge which
forwards D-Bus from the user bus to us. The systemd-stdio-bridge makes
up the PAM session and thus we can be sure that the bus exists at least as
long as the bus connection is kept.
Or to say this differently: if you use "systemctl -M lennart@foobar"
now, the bus connection works like this:
1. sd-bus on the host forks off:
systemd-run -M foobar -PGq --wait -pUser=lennart -pPAMName=login systemd-stdio-bridge
2. systemd-run gets a connection to the "foobar" container's
system bus, and invokes the "systemd-stdio-bridge" binary as
transient service inside a PAM session for the user "lennart"
3. The systemd-stdio-bridge then proxies our D-Bus traffic to
the user bus.
sd-bus (on host) → systemd-run (on host) → systemd-stdio-bridge (in container)
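To illustrate step 1, here is a small sketch that turns a "user@machine" specification into the command line quoted above. It is an assumption-laden toy: the helper name is hypothetical, it produces a single string rather than an argv array, and it does not handle special forms such as "<user>@.host".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits "user@machine" and renders the systemd-run
 * invocation described in the commit message. */
int format_bridge_command(const char *spec, char *buf, size_t size) {
        const char *at;
        int n;

        assert(spec);
        assert(buf);

        at = strchr(spec, '@');
        if (!at || at == spec || at[1] == '\0')
                return -EINVAL; /* both user and machine are required here */

        n = snprintf(buf, size,
                     "systemd-run -M %s -PGq --wait -pUser=%.*s -pPAMName=login systemd-stdio-bridge",
                     at + 1, (int) (at - spec), spec);
        if (n < 0 || (size_t) n >= size)
                return -ENOBUFS;

        return 0;
}
```

Calling it with "lennart@foobar" yields the command line shown in step 1.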
Complicated? Well, to some point yes, but otoh it's actually nice in
various other ways, primarily as it makes the -H and -M codepaths more
alike. In the -H case (i.e. connect to remote host via SSH) a very
similar three steps are used. The only difference is that instead of
"systemd-run" the "ssh" binary is used to invoke the stdio bridge in a
PAM session of some other system. Thus we get similar implementation and
isolation for similar operations.
Fixes: #14580
The immediately following container_get_leader() call validates the name
anyway, so there is no need to validate it twice in exactly the same way,
immediately one after the other.
When something fails, we need some logs to figure out what happened.
This is primarily relevant for connection errors, but in general we
want to log about all errors, even if they are relatively unlikely.
We want one log on failure, and generally no logs on success.
The general idea is to not log in static functions, and to log in the
non-static functions. Non-static functions which call other functions
may thus log or not log as appropriate to have just one log entry in the
end.
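A minimal sketch of that convention, with hypothetical function names and a log counter that exists only to make the "one log on failure, none on success" property visible:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Counts emitted log lines; exists only for this illustration. */
unsigned n_log_lines = 0;

/* Logs the error once and passes the (negative) error code through. */
static int log_error_errno(int error, const char *msg) {
        assert(error < 0);

        fprintf(stderr, "%s: %s\n", msg, strerror(-error));
        n_log_lines++;
        return error;
}

/* Static helper: reports failure via its return value and never logs. */
static int connect_step(int simulate_failure) {
        return simulate_failure ? -ECONNREFUSED : 0;
}

/* Non-static entry point: the single place where the failure is logged. */
int connect_and_log(int simulate_failure) {
        int r;

        r = connect_step(simulate_failure);
        if (r < 0)
                return log_error_errno(r, "Failed to connect");

        return 0;
}
```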
Normally, the udev rules operate on "change" events. But when
coldplugging, there's an "add" event present. The udev rules have to
recognize this and do some actions in this particular situation, too.
Also, we don't want the nodes to be created prematurely on "add"
events while not coldplugging. The udev rules will check
DM_UDEV_PRIMARY_SOURCE_FLAG to see if the device was activated
correctly before and if not, they ignore the "add" event totally.
This way the udev rules can support udev triggers generating "add"
events (e.g. "udevadm trigger --action=add" or
"echo add > /sys/block/<dm_device>/uevent").
In this case, the udevd service is started after
systemd-cryptsetup@config.service is started, which will cause udevd
service to miss the "change" uevent with DM_UDEV_PRIMARY_SOURCE_FLAG
flag generated by systemd-cryptsetup@config.service. To solve this
issue, we let the cryptsetup service be started after the udevd
service.
When seccomp_restrict_archs is called, architectures that are blocked
are replaced by the SECCOMP_LOCAL_ARCH_BLOCKED marker so that they are
not disabled again and filters are not installed for them.
This can make some services that use SystemCallArchitectures= and
SystemCallFilter= start faster.
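The marker idea can be sketched like this. Only the name SECCOMP_LOCAL_ARCH_BLOCKED comes from the description above; its value, the end marker, the placeholder architecture tokens and both functions are assumptions made for the sake of a self-contained example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECCOMP_LOCAL_ARCH_END     UINT32_MAX        /* assumed value */
#define SECCOMP_LOCAL_ARCH_BLOCKED (UINT32_MAX - 1U) /* assumed value */

/* Placeholder tokens 1..3 stand in for real architecture constants. */
static uint32_t local_archs[] = { 1, 2, 3, SECCOMP_LOCAL_ARCH_END };

/* Replaces every architecture that is not allowed with the BLOCKED marker. */
void restrict_archs(const uint32_t *allowed, size_t n_allowed) {
        assert(allowed || n_allowed == 0);

        for (size_t i = 0; local_archs[i] != SECCOMP_LOCAL_ARCH_END; i++) {
                bool keep = false;

                if (local_archs[i] == SECCOMP_LOCAL_ARCH_BLOCKED)
                        continue; /* already blocked, nothing to do again */

                for (size_t j = 0; j < n_allowed; j++)
                        if (allowed[j] == local_archs[i]) {
                                keep = true;
                                break;
                        }

                if (!keep)
                        local_archs[i] = SECCOMP_LOCAL_ARCH_BLOCKED;
        }
}

/* Later filter installation skips blocked entries, so less work is done. */
size_t count_filter_targets(void) {
        size_t n = 0;

        for (size_t i = 0; local_archs[i] != SECCOMP_LOCAL_ARCH_END; i++)
                if (local_archs[i] != SECCOMP_LOCAL_ARCH_BLOCKED)
                        n++;

        return n;
}
```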
E.g. in nss-resolve it is still useful to print the location of the error:
src/test/test-nss.c:231: dlsym(0x0x1dc6fb0, _nss_resolve_gethostbyname2_r) → 0x0x7fdbfc53f626
(string):1:40: JSON field ifindex is out of bounds for an interface index.
I opted to use a partially duplicated if condition to avoid nesting. It's nice
to have the log calls vertically aligned. The compiler will optimize this nicely.
Let's add a dlopen_qrencode() function that does the actual dlopen()
stuff and caches the result.
This is useful so that we later can automatically test for all dlopen
hookups to work correctly.
Similar to the previous commit. All callers pass NULL. This will
ease initial nftables backend implementation (less features to cover).
Add the function parameters as local variables and let compiler
remove branches. Followup patch can remove the if (NULL) conditionals.
All users pass a NULL/0 for those, things haven't changed since 2015
when this was added originally, so remove the arguments.
The parameters are re-added as local function variables, initialised
to NULL or 0. A followup patch can then manually remove all
if (NULL) rather than leaving dead-branch optimization to compiler.
Reason for not doing it here is to ease patch review.
Not requiring support for this will ease initial nftables backend
implementation.
In case a use-case comes up later this feature can be re-added.
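The intermediate state described here, with the former parameters turned into constant locals, might look roughly like the following. The function and what it counts are invented for illustration and do not mirror the real firewall-util signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: returns how many match clauses a rule would carry. */
unsigned count_rule_matches(const char *source, unsigned source_prefixlen) {
        /* Formerly function parameters; every caller passed NULL/0. */
        const char *destination = NULL;
        unsigned destination_prefixlen = 0;
        unsigned n = 0;

        assert(source);
        assert(source_prefixlen > 0);

        n++; /* the source match is always present */

        /* Dead branch for now: the compiler can drop it, and a follow-up
         * patch can delete it by hand. */
        if (destination && destination_prefixlen > 0)
                n++;

        return n;
}
```

Leaving the branch in place keeps the diff of the signature change small, which is the stated reason for splitting the cleanup into a separate patch.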
Less 568 properly shows urlified strings.
Putative NEWS entry:
* Urlification is now enabled by default even when a pager is used.
Previously it was disabled, because less would not show such markup
properly. This has been fixed in less 568.
Please either upgrade less, or use SYSTEMD_URLIFY=0 to disable the
feature.
This reverts the gist of da1921a5c3 and
0d9fca76bb (for ppc).
Quoting #17559:
> libseccomp 2.5 added socket syscall multiplexing on ppc64(el):
> https://github.com/seccomp/libseccomp/pull/229
>
> Like with i386, s390 and s390x this breaks socket argument filtering, so
> RestrictAddressFamilies doesn't work.
>
> This causes the unit test to fail:
> /* test_restrict_address_families */
> Operating on architecture: ppc
> Failed to install socket family rules for architecture ppc, skipping: Operation canceled
> Operating on architecture: ppc64
> Failed to add socket() rule for architecture ppc64, skipping: Invalid argument
> Operating on architecture: ppc64-le
> Failed to add socket() rule for architecture ppc64-le, skipping: Invalid argument
> Assertion 'fd < 0' failed at src/test/test-seccomp.c:424, function test_restrict_address_families(). Aborting.
>
> The socket filters can't be added so `socket(AF_UNIX, SOCK_DGRAM, 0);` still
> works, triggering the assertion.
Fixes: #17559
In many cases the tables are largely the same, hence define a common set
of macros to generate the common parts.
This adds in a couple of missing specifiers here and there, so is more
than just refactoring: it actually fixes accidental omissions.
Note that some entries that look like they could be unified under these
macros can't really be unified, since they are slightly different. For
example in the DNSSD service logic we want to use the DNSSD hostname for
%H rather than the unmodified kernel one.
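A sketch of how such macros can generate the shared part of a table while still allowing a differing %H entry. The struct, macro names, lookup callbacks and table contents are all hypothetical; %b, %m and %H are used here as boot ID, machine ID and hostname specifiers.

```c
#include <assert.h>
#include <stddef.h>

typedef const char *(*specifier_lookup_t)(void);

typedef struct Specifier {
        char specifier;
        specifier_lookup_t lookup;
} Specifier;

/* Placeholder callbacks returning fixed strings. */
static const char *lookup_boot_id(void)         { return "boot-id"; }
static const char *lookup_machine_id(void)      { return "machine-id"; }
static const char *lookup_kernel_hostname(void) { return "kernel-hostname"; }
static const char *lookup_dnssd_hostname(void)  { return "dnssd-hostname"; }

/* Entries that are identical in every table. */
#define COMMON_ID_SPECIFIERS         \
        { 'b', lookup_boot_id },     \
        { 'm', lookup_machine_id }

/* The usual system set, including the unmodified kernel hostname for %H. */
#define COMMON_SYSTEM_SPECIFIERS     \
        COMMON_ID_SPECIFIERS,        \
        { 'H', lookup_kernel_hostname }

static const Specifier unit_table[] = {
        COMMON_SYSTEM_SPECIFIERS,
        { 0, NULL }
};

/* Looks similar, but %H differs, so only the truly common part is shared. */
static const Specifier dnssd_table[] = {
        COMMON_ID_SPECIFIERS,
        { 'H', lookup_dnssd_hostname },
        { 0, NULL }
};

const char *specifier_resolve(const Specifier *table, char c) {
        assert(table);

        for (; table->specifier != 0; table++)
                if (table->specifier == c)
                        return table->lookup();

        return NULL; /* unknown specifier */
}
```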
These three syscalls are internally used by libc's memory allocation
logic, i.e. ultimately back malloc(). Allocating a bit of memory is so
basic, it should just be in the default set.
This fixes a couple of issues with asan/msan and the seccomp tests: when
asan/msan is used some additional, large memory allocations take place
in the background, and unless mmap/mmap2/brk are allowlisted these will
fail, aborting the test prematurely.