Use symlink_atomic_label() instead of symlink_atomic() as the symlink
may need a different label than the parent directory.
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
```
❯ ssh sus@xx.xx.xx.xx
Last login: Sat Nov 14 17:32:08 2020 from 10.104.45.138
17:36:19 up 0 min, 0 users, load average: 0.00, 0.00, 0.00
> systemd-analyze blame
Bootup is not yet finished (org.freedesktop.systemd1.Manager.FinishTimestampMonotonic=0).
Please try again later.
Hint: Use 'systemctl list-jobs' to see active jobs
> systemd-analyze blame
43.954s systemd-time-wait-sync.service
1.969s systemd-networkd-wait-online.service
1.559s cloud-init-local.service
1.039s cloud-init.service
414ms cloud-final.service
387ms dracut-initqueue.service
382ms initrd-switch-root.service
380ms cloud-config.service
198ms systemd-journal-flush.service
136ms systemd-udev-trigger.service
115ms initrd-parse-etc.service
97ms systemd-timesyncd.service
84ms systemd-journald.service
```
After made it configurable and set to 5s
```
❯ ssh sus@xx.xx.xx.xx
Last login: Sat Nov 14 18:41:42 2020 from 10.104.45.138
18:42:36 up 0 min, 0 users, load average: 0.16, 0.03, 0.01
> systemd-analyze blame
10.450s systemd-time-wait-sync.service
8.303s systemd-networkd-wait-online.service
1.621s cloud-init-local.service
1.068s cloud-init.service
```
This changes the retransmission timeout algorithm for requests
other than RENEW and REBIND. Previously, the retransmission timeout
started at 2 seconds, then doubling each retransmission up to a max
of 64 seconds. This is changed to match what RFC2131 section 4.1 describes,
which skips the initial 2 second timeout and starts with a 4 second timeout
instead. Note that -1 to +1 seconds of random 'fuzz' is added to each
timeout, in previous and current behavior.
This change is therefore slightly slower than the previous behavior in
attempting retransmissions when no server response is received, since the
first transmission times out in 4 seconds instead of 2.
Since TRANSIENT_FAILURE_ATTEMPTS is set to 3, the previous length of time
before a transient failure was reported back to systemd-networkd was
2 + 4 + 8 = 14 seconds, plus, on average, 3 seconds of random 'fuzz' for
a transient failure timeout between 11 and 17 seconds. Now, since the
first timeout starts at 4, the transient failure will be reported at
4 + 8 + 16 = 28 seconds, again plus 3 random seconds for a transient
failure timeout between 25 and 31 seconds.
Additionally, if MaxAttempts= is set, it will take slightly longer to
reach than with previous behavior.
Use the request timeout algorithm specified in RFC2131 section 4.4.5 for
handling timed out RENEW and REBIND requests.
This changes behavior, as previously only 2 RENEW and 2 REBIND requests
were sent, no matter how long the lease lifetime. Now, requests are
send according to the RFC, which results in starting with a timeout
of 1/2 the t1 or t2 period, and halving the timeout for each retry
down to a minimum of 60 seconds.
Fixes: #17909
The parsing of the dhcpv4 lease lifetime, as well as the t1/t2
times, is simplified by this commit.
This differs from previous behavior; previously, the lease lifetime and
t1/t2 values were modified by random 'fuzz' by subtracting 3, then adding
a random number between 0 and (slightly over) 2 seconds. The resulting
values were therefore always between 1-3 seconds shorter than the value
provided by the server (or the default, in case of t1/t2). Now, as
described in RFC2131, the random 'fuzz' is between -1 and +1 seconds,
meaning the actual t1 and t2 value will be up to 1 second earlier or
later than the server-provided (or default) t1/t2 value.
This also differs in handling the lease lifetime, as described above it
previously was adjusted by the random 'fuzz', but the RFC does not state
that the lease expiration time should be adjusted, so now the code uses
exactly the lease lifetime as provided by the server with no adjustment.
RFC2131, providing the details for dhcpv4, has specific retransmission
intervals that it outlines. This adds functions to compute the timeouts
as the RFC describes.
When something fails, we need some logs to figure out what happened.
This is primarily relevant for connection errors, but in general we
want to log about all errors, even if they are relatively unlikely.
We want one log on failure, and generally no logs on success.
The general idea is to not log in static functions, and to log in the
non-static functions. Non-static functions which call other functions
may thus log or not log as appropriate to have just one log entry in the
end.
The commit 6f3ac0d517 drops the prefix and
suffix in TAGS= property. But there exists several rules that have like
`TAGS=="*:tag:*"`. So, the property must be always prefixed and suffixed
with ":".
Fixes#17930.
The ret_size result is a bit of an awkward optimization that in a
sense enables bypassing the mmap-cache API, while encouraging
duplication of logic it already implements.
It's only utilized in one place; journal_file_move_to_object(),
apparently to avoid the overhead of remapping the whole object
again once its header, and thus its actual size, is known.
With mmap-cache's context cache, the overhead of simply
re-getting the object with the now known size should already be
negligible. So it's not clear what benefit this brings, unless
avoiding some function calls that do very little in the hot
context-cache hit case is of such a priority.
There's value in having all object-sized gets pass through
mmap_cache_get(), as it provides a single entrypoint for
instrumentation in profiling/statistics gathering. When
journal_file_move_to_object() bypasses getting the full object
size, you don't capture the full picture on the mmap-cache side
in terms of object sizes explicitly loaded from a journal file.
I'd like to see additional accounting in mmap_cache_get() in a
future commit, taking advantage of this change.
Quoting Andy Lutomirski:
> The upcoming Linux SGX driver has a device node /dev/sgx. User code opens
> it, does various setup things, mmaps it, and needs to be able to create
> PROT_EXEC mappings. This gets quite awkward if /dev is mounted noexec.
We already didn't use noexec in spawn, and this extends this behaviour to other
systems.
Afaik, the kernel would refuse execve() on a character or block device
anyway. Thus noexec on /dev matters only for actual binaries copied to /dev,
which requires root privileges in the first place.
We don't do noexec on either /tmp or /dev/shm (because that causes immediate
problems with stuff like Java and cffi). And if you have those two at your
disposal anyway, having noexec on /dev doesn't seem important. So the 'noexec'
attribute on /dev doesn't really mean much, since there are multiple other
similar directories which don't require root privileges to write to.
C.f. 33c10ef43b.
When an interface gains carrier but udev have not initialized the
interface or link_initialized_handler() has not been called yet,
then link_configure will be called twice. Thus LLDP client will be
configured twice, and triggers assertion.
Fixes#17929.
When the .so module is loaded, it gets a separate copy of stuff in src/basic,
including the log level variables. So any logging settings are unaffected by
the loading program calling log_parse_environment() or such. Let's also parse
the environment here so that we can have nice logging.
Initialization is done from each exported function, and pthread_once_t is used
to avoid duplicate initialization. I didn't merge PROTECT_ERRNO into
NSS_ENTRYPOINT_BEGIN because UNPROTECT_ERRNO is called in a bunch of places
and it would feel strange to have PROTECT_ERRNO hidden, but not UNPROTECT_ERRNO.
The most interesting stuff in this module is the varlink messages, and any
potential errors in json. So let's enable json logging when debug messages are
enabled.
With those changes, figuring out the issue in
https://github.com/systemd/systemd/pull/17823 is trivial:
$ LD_LIBRARY_PATH=build/ SYSTEMD_LOG_COLOR=1 SYSTEMD_LOG_LOCATION=1 SYSTEMD_LOG_LEVEL=debug getent hosts mirrors.fedoraproject.org
src/shared/varlink.c:237: n/a: varlink: setting state idle-client
src/shared/varlink.c:1240: n/a: Sending message: {"method":"io.systemd.Resolve.ResolveHostname","parameters":{"name":"mirrors.fedoraproject.org","family":10}}
src/shared/varlink.c:240: n/a: varlink: changing state idle-client → calling
src/shared/varlink.c:588: n/a: New incoming message: {"parameters":{"addresses":[{"ifindex":0,"family":10,"address":[42,5,208,20,0,16,120,3,247,116,77,124,226,119,164,87]},{"ifindex":0,"family":10,"address":[42,5,208,28,12,106,204,3,38,58,132,9,185,97,126,2]},{"ifindex":0,"family":10,"address":[38,32,0,82,0,3,0,1,222,173,190,239,202,254,254,215]},{"ifindex":0,"family":10,"address":[38,5,188,128,48,16,6,0,222,173,190,239,202,254,254,217]},{"ifindex":0,"family":10,"address":[38,4,21,128,254,0,0,0,222,173,190,239,202,254,254,209]},{"ifindex":0,"family":10,"address":[38,32,0,82,0,3,0,1,222,173,190,239,202,254,254,214]},{"ifindex":0,"family":10,"address":[38,16,0,40,48,144,48,1,222,173,190,239,202,254,254,211]},{"ifindex":0,"family":10,"address":[32,1,65,120,0,2,18,105,0,0,0,0,0,0,254,210]}],"name":"wildcard.fedoraproject.org","flags":1}}
src/shared/varlink.c:240: n/a: varlink: changing state calling → called
src/shared/varlink.c:240: n/a: varlink: changing state called → idle-client
src/nss-resolve/nss-resolve.c:84: (string):1:40: JSON field 'ifindex' is out of bounds for an interface index.
Normally, the udev rules operate on "change" events. But when
coldplugging, there's an "add" event present. The udev rules have to
recognize this and do some actions in this particular situation, too.
Also, we don't want the nodes to be created prematurely on "add"
events while not coldplugging. The udev rules will check
DM_UDEV_PRIMARY_SOURCE_FLAG to see if the device was activated
correctly before and if not, it ignore the "add" event totally.
This way the udev rules can support udev triggers generating "add"
events (e.g. "udevadm trigger --action=add" or
"echo add > /sys/block/<dm_device>/uevent").
In this case, the udevd service is started after
systemd-cryptsetup@config.service, is started, which will cause udevd
service to miss the "change" uevent with DM_UDEV_PRIMARY_SOURCE_FLAG
flag generated by systemd-cryptsetup@config.service. To solve this
issue, we let the cryptsetup service be started after the udevd
service.
Back in 5248e7e1f1 (July 2017) we moved over to
"_gateway", with the old name declared to be temporary measure. Since we're
doing a bunch of changes to resolved now, it seems to be a good moment to make
this simplification and not add support for the compat name in new code.
When seccomp_restrict_archs is called, architectures that are blocked
are replaced by the SECCOMP_LOCAL_ARCH_BLOCKED marker so that they are
not disabled again and filters are not installed for them.
This can make some service that use SystemCallArchitecture= and
SystemCallFilter= start faster.
There are no mmap_cache_get() users that actually deviate prot
from the JournalFile's f->prot.
So there's no point in making this a separate parameter to
mmap_cache_get(), nor is there any need to store it in
JournalFile's f->prot.
Instead just pass it to mmap_cache_add_fd() at MMapFileDescriptor
creation, storing it in there for the mmap() callers, which
already receive MMapFileDescriptor *.
For functions receiving both an MMapFileDescriptor * and prot,
the prot argument has been simply removed and call sites updated.
Formalizing this fd:prot binding at the public API also enables
discarding the prot check in window_matches(), which is a hot
function on long window lists, so a minor CPU efficiency gain
should be had there as seen with the past removal of the fd
check. Unnoticable for uncached journals, but maybe a little
runtime improvement when cached in specific circumstances.
window_matches_fd() has also been simplified to treat the
MMapFileDescrptor * as equivalent to its fd and prot.
E.g. in nss-resolve it is still useful to print the location of the error:
src/test/test-nss.c:231: dlsym(0x0x1dc6fb0, _nss_resolve_gethostbyname2_r) → 0x0x7fdbfc53f626
(string):1:40: JSON field ifindex is out of bounds for an interface index.
I opted to use a partially duplicated if condition to avoid nesting. It's nice
to have the log calls vertically aligned. The compiler will optimize this nicely.
Account and log these statistics separately since their overheads
are potentially quite different when the window lists are large.
There should probably be a histogram of window list traversal
counts too.
On really old kernels (< 4.14+) a bug in /proc/1/sched handling in the
kernel could be used to determine whether we are running in a PID
namespace. This hasn't worked for a long time, and there's little point
in making things work on old kernels we can't make work on current
kernels, hence let's drop that old cruft.
See: #8153
Avoid warning messages when booting systemd-nspawn containers and using
hybrid or legacy cgroups. systemd-nspawn mounts the cgroups v1 controller
tree as read-only so these errors are expected and not problematic.
Partially fixes#17862.
Test plan:
- Before: `mkosi --default .mkosi/mkosi.fedora boot`
```
‣ Processing default...
Spawning container image on /home/daan/projects/systemd/image.raw.
Press ^] three times within 1s to kill container.
systemd 247 running in system mode. (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization systemd-nspawn.
Detected architecture x86-64.
Welcome to Fedora 33 (Thirty Three)!
Queued start job for default target Graphical Interface.
-.slice: Failed to migrate controller cgroups from , ignoring: Read-only file system
system.slice: Failed to delete controller cgroups /system.slice, ignoring: Read-only file system
[ OK ] Created slice system-getty.slice.
[ OK ] Created slice system-modprobe.slice.
user.slice: Failed to delete controller cgroups /user.slice, ignoring: Read-only file system
[ OK ] Created slice User and Session Slice.
[ OK ] Started Dispatch Password Requests to Console Directory Watch.
[ OK ] Started Forward Password Requests to Wall Directory Watch.
[ OK ] Reached target Local Encrypted Volumes.
[ OK ] Reached target Paths.
[ OK ] Reached target Remote File Systems.
[ OK ] Reached target Slices.
[ OK ] Reached target Swap.
[ OK ] Listening on Process Core Dump Socket.
[ OK ] Listening on initctl Compatibility Named Pipe.
[ OK ] Listening on Journal Socket (/dev/log).
[ OK ] Listening on Journal Socket.
[ OK ] Listening on User Database Manager Socket.
dev-hugepages.mount: Failed to delete controller cgroups /dev-hugepages.mount, ignoring: Read-only file system
Mounting Huge Pages File System...
sys-fs-fuse-connections.mount: Failed to delete controller cgroups /sys-fs-fuse-connections.mount, ignoring: Read-only file system
Mounting FUSE Control File System...
Starting Journal Service...
Starting Remount Root and Kernel File Systems...
system.slice: Failed to delete controller cgroups /system.slice, ignoring: Read-only file system
```
After: `mkosi --default .mkosi/mkosi.fedora boot`
```
‣ Processing default...
Spawning container image on /home/daan/projects/systemd/mkosi.output/image.raw.
Press ^] three times within 1s to kill container.
systemd 247 running in system mode. (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization systemd-nspawn.
Detected architecture x86-64.
Welcome to Fedora 33 (Thirty Three)!
Queued start job for default target Graphical Interface.
[ OK ] Created slice system-getty.slice.
[ OK ] Created slice system-modprobe.slice.
[ OK ] Created slice User and Session Slice.
[ OK ] Started Dispatch Password Requests to Console Directory Watch.
[ OK ] Started Forward Password Requests to Wall Directory Watch.
[ OK ] Reached target Local Encrypted Volumes.
[ OK ] Reached target Paths.
[ OK ] Reached target Remote File Systems.
[ OK ] Reached target Slices.
[ OK ] Reached target Swap.
[ OK ] Listening on Process Core Dump Socket.
[ OK ] Listening on initctl Compatibility Named Pipe.
[ OK ] Listening on Journal Socket (/dev/log).
[ OK ] Listening on Journal Socket.
[ OK ] Listening on User Database Manager Socket.
Mounting Huge Pages File System...
Mounting FUSE Control File System...
Starting Journal Service...
Starting Remount Root and Kernel File Systems...
[ OK ] Mounted Huge Pages File System.
[ OK ] Mounted FUSE Control File System.
[ OK ] Finished Remount Root and Kernel File Systems.
Starting Create Static Device Nodes in /dev...
[ OK ] Finished Create Static Device Nodes in /dev.
[ OK ] Reached target Local File Systems (Pre).
[ OK ] Reached target Local File Systems.
Starting Restore /run/initramfs on shutdown...
[ OK ] Finished Restore /run/initramfs on shutdown.
[ OK ] Started Journal Service.
Starting Flush Journal to Persistent Storage...
[ OK ] Finished Flush Journal to Persistent Storage.
Starting Create Volatile Files and Directories...
[ OK ] Finished Create Volatile Files and Directories.
Starting Network Name Resolution...
Starting Update UTMP about System Boot/Shutdown...
[ OK ] Finished Update UTMP about System Boot/Shutdown.
[ OK ] Reached target System Initialization.
[ OK ] Started Daily Cleanup of Temporary Directories.
[ OK ] Reached target Timers.
[ OK ] Listening on D-Bus System Message Bus Socket.
[ OK ] Reached target Sockets.
[ OK ] Reached target Basic System.
Starting Home Area Manager...
Starting User Login Management...
Starting Permit User Sessions...
[ OK ] Finished Permit User Sessions.
[ OK ] Started Console Getty.
[ OK ] Reached target Login Prompts.
Starting D-Bus System Message Bus...
[ OK ] Started D-Bus System Message Bus.
[ OK ] Started Home Area Manager.
[ OK ] Started User Login Management.
[ OK ] Reached target Multi-User System.
[ OK ] Reached target Graphical Interface.
Starting Update UTMP about System Runlevel Changes...
[ OK ] Finished Update UTMP about System Runlevel Changes.
[ OK ] Started Network Name Resolution.
[ OK ] Reached target Host and Network Name Lookups.
Fedora 33 (Thirty Three) (built from systemd tree)
Kernel 5.9.11-arch2-1 on an x86_64 (console)
```
This test should ensure we notice if distros update shared libraries
that broke so name, and we still use the old soname.
(In contrast to what the commit summary says, this currently doesn#t
cover really all such deps, specifically xkbcommon and PCRE are missing,
since they currently aren't loaded from src/shared/. This is stuff to
fix later)
Let's add a dlopen_qrencode() function that does the actual dlopen()
stuff and caches the result.
This is useful so that we later can automatically test for all dlopen
hookups to work correctly.
When manager_{set|override}_watchdog is called, set the watchdog timeout
regardless of whether the hardware watchdog was successfully initialized. If
the watchdog was requested but could not be initialized, then instead of
pinging it, attempt to initialize it again. This ensures that the hardware
watchdog is initialized even if the kernel module for it isn't loaded when
systemd starts (which is quite likely, unless it is compiled in).
This builds on work by @danc86 in https://github.com/systemd/systemd/pull/17460,
but fixes the issue of not updating the watchdog timeout with the actual value
from the hardware.
Fixes https://github.com/systemd/systemd/issues/17838
Co-authored-by: Dan Callaghan <djc@djc.id.au>
Co-authored-by: Michael Marley <michael@michaelmarley.com>
This is a fix of #17751. Specifically:
1. Sort #include headers again
2. Remove tabs, as per coding style
3. Don't install fds in half-initialized objects
4. Use asynchronous_close() everywhere
That all said:
Quit frankly, I am not convinced we should do all this at all. If
close()ing of these input devices is really that slow, then this should
probably be fixed in the kernel, not worked around in userspace like
this.
Commit [1] added a workaround when unified cgroups are used but missed
legacy cgroups where there is the same issue.
[1] <2dbc45aea7>
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
After #17834, unreachable routes generated through DHCP6 are managed by
Manager. But they are referrenced by the DHCP6 uplink. So, the routes
managed by Manager must be freed after all Link objects are freed.
Follow-up for 575f14eef0.
Fixes SIGABRT reproted in #17831.
This partially reverts fe841414ef and
2a236f9fc0.
For IPv4, kernel compares the local address, prefix, and prefixlen.
For IPv6, kernel compares only the local address.
Let's follow the kernel's comparison way.
Fixes#17831.
The old code was only able to pass the value 0 for the inheritable
and ambient capability set when a non-root user was specified.
However, sometimes it is useful to run a program in its own container
with a user specification and some capabilities set. This is needed
when the capabilities cannot be provided by file capabilities (because
the file system is mounted with MS_NOSUID for additional security).
This commit introduces the option --ambient-capability and the config
file option AmbientCapability=. Both are used in a similar way to the
existing Capability= setting. It changes the inheritable and ambient
set (which is 0 by default). The code also checks that the settings
for the bounding set (as defined by Capability= and DropCapability=)
and the setting for the ambient set (as defined by AmbientCapability=)
are compatible. Otherwise, the operation would fail in any way.
Due to the current use of -1 to indicate no support for ambient
capability set the special value "all" cannot be supported.
Also, the setting of ambient capability is restricted to running a
single program in the container payload.
It's highly confusing to reference the command line parameters via
argv[] indexes. Let's clean this up, and introduce properly named local
variables that make this easier to follow.
No actualy code changes, just some renaming of variables.
In preparation for logging more mmap-cache statistics get rid of this
piecemeal stats accessor api and just have a debug log output function
for producing the stats.
Updates the one call site using these accessors, moving what that site
did into the new log function. So the output is unchanged for now,
just a trivial refactor.
The BLKID and ELFUTILS strings were present twice. Let's reaarange things so that
each times requires definition in exactly one place.
Also let's sort things a bit:
the "heavy hitters" like PAM/MAC first,
then crypto libs,
then other libs, alphabetically,
compressors,
and external compat integrations.
I think it's useful for users to group similar concepts together to some extent.
For example, when checking what compression is available, it helps a lot to have
them listed together.
FDISK is renamed to LIBFDISK to make it clear that this is about he library and
the executable.
Similar to the previous commit. All callers pass NULL. This will
ease initial nftables backend implementation (less features to cover).
Add the function parameters as local variables and let compiler
remove branches. Followup patch can remove the if (NULL) conditionals.
All users pass a NULL/0 for those, things haven't changed since 2015
when this was added originally, so remove the arguments.
THe paramters are re-added as local function variables, initalised
to NULL or 0. A followup patch can then manually remove all
if (NULL) rather than leaving dead-branch optimization to compiler.
Reason for not doing it here is to ease patch review.
Not requiring support for this will ease initial nftables backend
implementation.
In case a use-case comues up later this feature can be re-added.
After fe841414ef, broadcast address is
also compared with existing one to determine whether the address is
foregin or not. So, the address object should not contain unnecessary
information.
Fixes#17803.
Either suppress the entry entirely, or not at all. But do not suppress
the "localhost" names we recognize, leaving the ones we do not in place.
On Fedora, where "localhost4.localdomain4" is among those listed in
/etc/hosts for 127.0.0.1 we'd thus otherwise drop the "localhost" but
keep the "localhost4.localdomain4" and then on reverse lookups only
return that, which is highly confusing.
Make them rather fail than go to the network.
Previously we'd filter them on LLMNR (explicitly) and MDNS (implicitly,
because it doesn't have .local suffix), but not on DNS.
In order to make _gateway truly reliable, let's not allow it to go to
DNS either, and keep it local.
This is particular relevant, as clients can now request lookups without
local RR synthesis, where we'd rather have NXDOMAIN returned for
_gateway than have it hit the network.
Apparently 30s is a bit too long for some cases, see #5552. But not
caching SERVFAIL at all also breaks stuff, see explanation in
201d99584e.
Let's try to find some middle ground, by lowering the cache timeout to
10s. This should be ample for the problem
201d99584e attackes, but not as long as
half a miute, as #5552 complains.
Fixes: #5552